GDELT Data
In this tutorial we explore the GDELT public data set of world news events. The data set contains 344 million rows (with 60 fields each) and occupies 24 GB when compressed and indexed by Hydrolix on AWS S3. The original size is hard to calculate because the data is spread across several zipped files, but we estimate that the ingested size is about half the size of the GZIP'ed files.
GDELT Data Structure
The GDELT data structure is described here; there is also a handy cheat-sheet. We converted the field names to lower snake case (e.g. actor1_name) and moved TIMEADDED to the timestamp field.
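The renaming can be sketched in a few lines of Python. This is a minimal sketch, not the script used to prepare the data set: the helper name `to_snake_case` is hypothetical, and the CamelCase inputs are assumed spellings of the source headers. Source names written entirely in capitals have no word boundaries to detect and would need a hand-written mapping instead.

```python
import re

# Zero-width position between a lowercase letter or digit and an uppercase letter.
_WORD_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(name: str) -> str:
    """Convert a CamelCase field name to lower snake case."""
    return _WORD_BOUNDARY.sub("_", name).lower()


# Assumed source spellings, used only for illustration.
for source_name in ["Actor1Name", "MonthYear", "GoldsteinScale", "IsRootEvent"]:
    print(source_name, "->", to_snake_case(source_name))
```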
GDELT Columns and Types
Column | Data Type |
---|---|
timestamp | DateTime |
global_event_id | String |
day | String |
month_year | String |
year | String |
fraction_date | String |
actor1_code | String |
actor1_name | String |
actor1_country_code | String |
actor1_known_group_code | String |
actor1_ethnic_code | String |
actor1_religion_code | String |
actor1_religion2_code | String |
actor1_type1_code | String |
actor1_type2_code | String |
actor1_type3_code | String |
actor2_code | String |
actor2_name | String |
actor2_country_code | String |
actor2_known_group_code | String |
actor2_ethnic_code | String |
actor2_religion_code | String |
actor2_religion2_code | String |
actor2_type1_code | String |
actor2_type2_code | String |
actor2_type3_code | String |
is_root_event | UInt64 |
event_code | String |
event_base_code | String |
event_root_code | String |
quad_class | UInt64 |
goldstein_scale | Float64 |
num_mentions | UInt64 |
num_sources | UInt64 |
num_articles | UInt64 |
avg_tone | Float64 |
actor1_geo_type | UInt64 |
actor1_geo_fullname | String |
actor1_geo_country_code | String |
actor1_adm1_code | String |
actor1_adm2_code | String |
actor1_geo_lat | Float64 |
actor1_geo_long | Float64 |
actor1_geo_feature_id | String |
actor2_geo_type | UInt64 |
actor2_geo_fullname | String |
actor2_geo_country_code | String |
actor2_adm1_code | String |
actor2_adm2_code | String |
actor2_geo_lat | Float64 |
actor2_geo_long | Float64 |
actor2_geo_feature_id | String |
action_geo_type | UInt64 |
action_geo_fullname | String |
action_geo_country_code | String |
action_adm1_code | String |
action_adm2_code | String |
action_geo_lat | Float64 |
action_geo_long | Float64 |
action_geo_feature_id | String |
source_url | String |
GDELT has used machine learning techniques to extract entity (actor1, actor2, action), sentiment/interest (goldstein, num_mentions, num_sources, num_articles, tone), geo location (lat/long, ISO codes) and categorization (event_code, quad_class, geo_code) from the world news.
Working with time
The GDELT data has 5 fields to represent various time intervals: TIMEADDED (YYYYMMDDHHMMSS UTC timezone), DAY (integer), MONTHYEAR (integer), YEAR (integer) and FRACTIONDATE (floating point). This feels excessive to a modern data engineer's eye. Hydrolix can derive all interval/date patterns from a single timestamp.
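To illustrate why one timestamp is enough, here is a minimal Python sketch (plain Python, not Hydrolix code) that derives the other four values from a TIMEADDED string. The function name `derive_date_fields` is hypothetical. The fraction is computed here as the share of the calendar year that has elapsed, which is an assumption and may not match GDELT's own formula. The sketch also assumes the derived fields are meant to describe the same instant as the timestamp.

```python
import calendar
from datetime import datetime, timezone


def derive_date_fields(time_added: str) -> dict:
    """Derive day, month/year, year and a fractional date from YYYYMMDDHHMMSS."""
    ts = datetime.strptime(time_added, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

    # Assumption: fraction = elapsed days / days in that year (leap-year aware).
    days_in_year = 366 if calendar.isleap(ts.year) else 365
    fraction = ts.year + (ts.timetuple().tm_yday - 1) / days_in_year

    return {
        "timestamp": ts,
        "day": int(ts.strftime("%Y%m%d")),
        "month_year": int(ts.strftime("%Y%m")),
        "year": ts.year,
        "fraction_date": round(fraction, 4),
    }


fields = derive_date_fields("20200115103000")
print(fields["day"], fields["month_year"], fields["year"], fields["fraction_date"])
```

At query time the same idea applies: the date parts are computed from the timestamp column on demand instead of being stored as separate columns.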
Transforms and Views in Hydrolix
Hydrolix lets you define the data structure both at ingest time and at query time.
A Transform describes how data is written (ingested). It tells Hydrolix not only which data types are being written but also how the data should be treated.
GDELT Transform
{
"name": "transform_gdelt",
"type": "csv",
"table": "{{table_uuid}}",
"settings": {
"compression": "gzip",
"is_default": true,
"format_details": {
"delimiter": "\t"
},
"output_columns": [
{
"name": "timestamp",
"datatype": {
"primary": true,
"type": "datetime",
"format": "20060102150405",
"resolution": "seconds",
"source": { "from_input_index": 0 }
}
},
{
"name": "global_event_id",
"datatype": {
"type": "string",
"source": { "from_input_index": 1 }
}
},
{
"name": "day",
"datatype": {
"type": "string",
"source": { "from_input_index": 2 }
}
},
{
"name": "month_year",
"datatype": {
"type": "string",
"source": { "from_input_index": 3 }
}
},
{
"name": "year",
"datatype": {
"type": "string",
"source": { "from_input_index": 4 }
}
},
{
"name": "fraction_date",
"datatype": {
"type": "string",
"source": { "from_input_index": 5 }
}
},
{
"name": "actor1_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 6 }
}
},
{
"name": "actor1_name",
"datatype": {
"type": "string",
"source": { "from_input_index": 7 }
}
},
{
"name": "actor1_country_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 8 }
}
},
{
"name": "actor1_known_group_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 9 }
}
},
{
"name": "actor1_ethnic_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 10 }
}
},
{
"name": "actor1_religion_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 11 }
}
},
{
"name": "actor1_religion2_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 12 }
}
},
{
"name": "actor1_type1_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 13 }
}
},
{
"name": "actor1_type2_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 14 }
}
},
{
"name": "actor1_type3_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 15 }
}
},
{
"name": "actor2_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 16 }
}
},
{
"name": "actor2_name",
"datatype": {
"type": "string",
"source": { "from_input_index": 17 }
}
},
{
"name": "actor2_country_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 18 }
}
},
{
"name": "actor2_known_group_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 19 }
}
},
{
"name": "actor2_ethnic_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 20 }
}
},
{
"name": "actor2_religion_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 21 }
}
},
{
"name": "actor2_religion2_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 22 }
}
},
{
"name": "actor2_type1_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 23 }
}
},
{
"name": "actor2_type2_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 24 }
}
},
{
"name": "actor2_type3_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 25 }
}
},
{
"name": "is_root_event",
"datatype": {
"type": "bool",
"source": { "from_input_index": 26 }
}
},
{
"name": "event_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 27 }
}
},
{
"name": "event_base_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 28 }
}
},
{
"name": "event_root_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 29 }
}
},
{
"name": "quad_class",
"datatype": {
"type": "uint32",
"source": { "from_input_index": 30 }
}
},
{
"name": "goldstein_scale",
"datatype": {
"type": "double",
"source": { "from_input_index": 31 }
}
},
{
"name": "num_mentions",
"datatype": {
"type": "uint32",
"source": { "from_input_index": 32 }
}
},
{
"name": "num_sources",
"datatype": {
"type": "uint32",
"source": { "from_input_index": 33 }
}
},
{
"name": "num_articles",
"datatype": {
"type": "uint32",
"source": { "from_input_index": 34 }
}
},
{
"name": "avg_tone",
"datatype": {
"type": "double",
"source": { "from_input_index": 35 }
}
},
{
"name": "actor1_geo_type",
"datatype": {
"type": "uint32",
"source": { "from_input_index": 36 }
}
},
{
"name": "actor1_geo_fullname",
"datatype": {
"type": "string",
"source": { "from_input_index": 37 }
}
},
{
"name": "actor1_geo_country_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 38 }
}
},
{
"name": "actor1_adm1_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 39 }
}
},
{
"name": "actor1_adm2_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 40 }
}
},
{
"name": "actor1_geo_lat",
"datatype": {
"type": "double",
"source": { "from_input_index": 41 }
}
},
{
"name": "actor1_geo_long",
"datatype": {
"type": "double",
"source": { "from_input_index": 42 }
}
},
{
"name": "actor1_geo_feature_id",
"datatype": {
"type": "string",
"source": { "from_input_index": 43 }
}
},
{
"name": "actor2_geo_type",
"datatype": {
"type": "uint32",
"source": { "from_input_index": 44 }
}
},
{
"name": "actor2_geo_fullname",
"datatype": {
"type": "string",
"source": { "from_input_index": 45 }
}
},
{
"name": "actor2_geo_country_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 46 }
}
},
{
"name": "actor2_adm1_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 47 }
}
},
{
"name": "actor2_adm2_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 48 }
}
},
{
"name": "actor2_geo_lat",
"datatype": {
"type": "double",
"source": { "from_input_index": 49 }
}
},
{
"name": "actor2_geo_long",
"datatype": {
"type": "double",
"source": { "from_input_index": 50 }
}
},
{
"name": "actor2_geo_feature_id",
"datatype": {
"type": "string",
"source": { "from_input_index": 51 }
}
},
{
"name": "action_geo_type",
"datatype": {
"type": "uint32",
"source": { "from_input_index": 52 }
}
},
{
"name": "action_geo_fullname",
"datatype": {
"type": "string",
"source": { "from_input_index": 53 }
}
},
{
"name": "action_geo_country_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 54 }
}
},
{
"name": "action_adm1_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 55 }
}
},
{
"name": "action_adm2_code",
"datatype": {
"type": "string",
"source": { "from_input_index": 56 }
}
},
{
"name": "action_geo_lat",
"datatype": {
"type": "double",
"source": { "from_input_index": 57 }
}
},
{
"name": "action_geo_long",
"datatype": {
"type": "double",
"source": { "from_input_index": 58 }
}
},
{
"name": "action_geo_feature_id",
"datatype": {
"type": "string",
"source": { "from_input_index": 59 }
}
},
{
"name": "source_url",
"datatype": {
"type": "string",
"source": { "from_input_index": 60 }
}
}
]
}
}