Getting Started with Hydrolix
In this tutorial we will explore world events using the GDELT public data set of world news events. The data set contains 344 M rows (with 60 fields each) and is 24 GB when compressed & indexed by Hydrolix on AWS S3. The original data set size is challenging to calculate since it is in several zipped files, but our calculations approximate that the ingested size is about half the size of the GZIP’ed files.
GDELT Data Structure
The GDELT data structure described here, there is also a handy cheat-sheet. We converted the field names to lower snake case (i.e. actor1_name) and moved TIMEADDED to the timestamp field.
GDELT has used machine learning techniques to extract entity (actor1, actor2, action), sentiment/interest (goldstein, num_mentions, num_sources, num_articles, tone), geo location (lat/long, ISO codes) and categorization (event_code, quad_class, geo_code) from the world news.
Working with time
The GDELT data has 5 fields to represent various time intervals: TIMEADDED (YYYYMMDDHHMMSS UTC timezone), DAY (integer), MONTHYEAR (integer), YEAR (integer) and FRACTIONDATE (floating point). This feels excessive to a modern data engineer’s eye. Hydrolix can derive all interval/date patterns from a single timestamp.
Transforms and Views in Hydrolix
Hydrolix allows for defining the data structure on ingest and query.
A Transform is how data is written/ingested. It informs Hydrolix not only what data types are being written but also how the data should be treated.
A primary timestamp (treatment: primary i.e. TIMEADDED) drives time based analytics. Other fields are for analytics (treatment: tag - string/numerical fields) vs computational (treatment: metric - numerical fields).
A View is how data is read/queried. Multiple views can be applied to the same data set. When a view is not specified, all columns are available with their original treatments and data types. This is true of the GDELT data set in Hydrolix. A Transform was specified on ingestion, and that definition is the View on query.
GDELT Columns and Types
Column | Data Type |
---|---|
timestamp | DateTime |
global_event_id | String |
day | String |
month_year | String |
year | String |
fraction_date | String |
actor1_code | String |
actor1_name | String |
actor1_country_code | String |
actor1_known_group_code | String |
actor1_ethnic_code | String |
actor1_religion_code | String |
actor1_religion2_code | String |
actor1_type1_code | String |
actor1_type2_code | String |
actor1_type3_code | String |
actor2_code | String |
actor2_name | String |
actor2_country_code | String |
actor2_known_group_code | String |
actor2_ethnic_code | String |
actor2_religion_code | String |
actor2_religion2_code | String |
actor2_type1_code | String |
actor2_type2_code | String |
actor2_type3_code | String |
is_root_event | UInt64 |
event_code | String |
event_base_code | String |
event_root_code | String |
quad_class | UInt64 |
goldstein_scale | Float64 |
num_mentions | UInt64 |
num_sources | UInt64 |
num_articles | UInt64 |
avg_tone | Float64 |
actor1_geo_type | UInt64 |
actor1_geo_fullname | String |
actor1_geo_country_code | String |
actor1_adm1_code | String |
actor1_adm2_code | String |
actor1_geo_lat | Float64 |
actor1_geo_long | Float64 |
actor1_geo_feature_id | String |
actor2_geo_type | UInt64 |
actor2_geo_fullname | String |
actor2_geo_country_code | String |
actor2_adm1_code | String |
actor2_adm2_code | String |
actor2_geo_lat | Float64 |
actor2_geo_long | Float64 |
actor2_geo_feature_id | String |
action_geo_type | UInt64 |
action_geo_fullname | String |
action_geo_country_code | String |
action_adm1_code | String |
action_adm2_code | String |
action_geo_lat | Float64 |
action_geo_long | Float64 |
action_geo_feature_id | String |
source_url | String |
Deployment
Hydrolix hosts the GDELT data set in a running Hydrolix deployment at https://try.hydrolix.io
. All Try It
examples will call out to this environment.
FROM: Projects and Tables
A project represents a collection of tables. The project sample
represents our collection of sample data sets. The table gdelt
is where we ingested the GDELT data. This is expressed in the SQL FROM clause in dot notation FROM sample.gdelt
.
WHERE: Time Range
All data sets must have a primary timestamp, which Hydrolix uses to store data into time partitioned blocks on ingest. Therefore, all SQL queries must contain a WHERE clause with a time range.
Working with time
The underlying data has a 15 minute granularity. Aggregating the data into higher time intervals is achieved using Time Functions. Here we are using toYear(datetime)
and the count()
function to count the number of events by year.
SELECT toYear(timestamp) AS year, count()
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
This returns about 344 million events with 61 million in 2018 and about 51 million in 2019
2015 59,063,790
2016 97,228,713
2017 75,609,229
2018 61,544,481
2019 50,877,104
We could have partitioned by toMonth(), toWeek(), toHour(), toQuarter()
and hence derived all the extra unnecessary time fields in the GDELT data schema.
This query used absolute times, but relative time and date math is also valid. We could as easily query the last 10 years(subtractYears(date, num)
) from now()
SELECT count()
FROM sample.gdelt
WHERE (timestamp BETWEEN subtractYears(now(), 10) AND now())
Time Functions can be used in SQL SELECT
, WHERE
and GROUP
clauses.
Count distinct values
Figure out how many distinct primary actors there are.
SELECT uniq(actor1_name)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00')
During this time, there were 16,955 unique identities in actor1_name
.
Find the most frequently occuring values
Find the 10 most occuring values of actor1_name
SELECT topK(10)(actor1_name)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00') ;
This query returns this list:
['UNITED STATES','\0','POLICE','UNITED KINGDOM','PRESIDENT','GOVERNMENT','CHINA','SCHOOL','RUSSIA','CANADA']
One of the most occurring values is ‘\0’ where no primary actor exists.
Find the last article classified SCHOOL & avg_tone > 1
We can easily find the last value in a filtered data set. argMax(field, datetime)
is much simpler than grouping and ordering by time desc. It also works with field tuples.
SELECT argMax((timestamp, avg_tone, source_url), timestamp)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2018-01-01 00:00:00' AND '2020-01-01 00:00:00')
AND actor1_name='SCHOOL'
AND avg_tone > 1
The latest article classified as SCHOOL
and with a positive avg_tone
was an Edison State Community College conference
'2019-11-20 21:45:00',
3.6649214659685896,
'https://www.dailycall.com/news/69658/edison-state-holds-guidance-counselor-conference'
Find the average value
Calculate the average tone of events for a particular primary actor by month.
SELECT toYear(timestamp) AS year, round(avg(avg_tone),3) AS tone
FROM sample.gdelt
WHERE (timestamp between '2018-01-01 00:00:00' AND '2020-01-01 00:00:00')
AND actor1_name='SCHOOL'
GROUP BY year
ORDER BY year ASC;
2018 -0.651
2019 -0.443
avg_tone
has values ranging from -100 (very negative) to +100 (very positive). Even though average sentiment has gone up in events where SCHOOL
is the primary actor, these articles very slightly negative in tone. The value is greater than -1, but less than zero.
Count event types
Protests (event_root_code=‘14’) are an interesting event type to monitor over time.
SELECT toStartOfMonth(timestamp) month_year, sum(num_articles) articles
FROM sample.gdelt
WHERE (timestamp BETWEEN '2016-09-01 00:00:00' AND '2017-01-31 00:00:00')
AND event_root_code='14'
GROUP BY month_year
Take for example the 2016 US election time frame. There is a marked jump in the number of protest around election time.
2016-09-01 295,535
2016-10-01 282,903
2016-11-01 369,609
2016-12-01 730,917
2017-01-01 309,717
Find the top source_url domains by year
Each event has a source_url. We can use url functions to easily extract root domains domain(url)
.
SELECT toYear(timestamp) as year, topK(10)(domain(source_url))
FROM sample.gdelt
WHERE (timestamp BETWEEN '2018-01-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
We have extracted the root domain from the full url at query time, allowing us to trend news outlets over time.
2018 ['www.msn.com','www.dailymail.co.uk','www.business-standard.com','allafrica.com','www.xinhuanet.com','www.thenews.com.pk','www.yahoo.com','www.4-traders.com','www.nbcnews.com','www.sfgate.com']
2019 ['www.dailymail.co.uk','www.msn.com','www.business-standard.com','news.yahoo.com','allafrica.com','menafn.com','www.thenews.com.pk','www.nbcnews.com','www.reuters.com','www.xinhuanet.com']