GDELT Data

In this tutorial we explore world events using GDELT, a public data set of world news events. The data set contains 344 M rows (with 60 fields each) and is 24 GB when compressed and indexed by Hydrolix on AWS S3. The original data set size is hard to measure because it is spread across several zipped files, but we estimate that the ingested size is about half the size of the GZIP'ed files.

GDELT Data Structure

The GDELT data structure is described here, and there is also a handy cheat-sheet. We converted the field names to lower snake case (e.g. actor1_name) and moved TIMEADDED to the timestamp field.
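The renaming to lower snake case can be sketched in a few lines of Python. This is only an illustration, not the script used for this tutorial: the mixed-case input names (such as Actor1Name) are assumed to match the upstream GDELT headers, and the function name and the OVERRIDES table are hypothetical. All-caps names carry no word boundaries, so they need an explicit mapping.

```python
import re

# Hypothetical explicit mappings for all-caps names, which cannot be split
# automatically. The target names are the columns used in this tutorial.
OVERRIDES = {
    "MONTHYEAR": "month_year",
    "FRACTIONDATE": "fraction_date",
}

def to_snake_case(name: str) -> str:
    """Convert a mixed-case field name such as 'Actor1Name' to 'actor1_name'."""
    if name in OVERRIDES:
        return OVERRIDES[name]
    # Insert an underscore wherever a lowercase letter or digit is
    # immediately followed by an uppercase letter, then lowercase everything.
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return with_breaks.lower()

# Assumed upstream names, used here only as sample input.
for original in ["Actor1Name", "GoldsteinScale", "Actor1Type1Code", "MONTHYEAR"]:
    print(original, "->", to_snake_case(original))
```

A purely mechanical rule will not reproduce every column name in the table below, so treat the output as a starting point to review by hand.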

GDELT Columns and Types


Column                    Data Type
timestamp                 DateTime
global_event_id           String
day                       String
month_year                String
year                      String
fraction_date             String
actor1_code               String
actor1_name               String
actor1_country_code       String
actor1_known_group_code   String
actor1_ethnic_code        String
actor1_religion_code      String
actor1_religion2_code     String
actor1_type1_code         String
actor1_type2_code         String
actor1_type3_code         String
actor2_code               String
actor2_name               String
actor2_country_code       String
actor2_known_group_code   String
actor2_ethnic_code        String
actor2_religion_code      String
actor2_religion2_code     String
actor2_type1_code         String
actor2_type2_code         String
actor2_type3_code         String
is_root_event             UInt64
event_code                String
event_base_code           String
event_root_code           String
quad_class                UInt64
goldstein_scale           Float64
num_mentions              UInt64
num_sources               UInt64
num_articles              UInt64
avg_tone                  Float64
actor1_geo_type           UInt64
actor1_geo_fullname       String
actor1_geo_country_code   String
actor1_adm1_code          String
actor1_adm2_code          String
actor1_geo_lat            Float64
actor1_geo_long           Float64
actor1_geo_feature_id     String
actor2_geo_type           UInt64
actor2_geo_fullname       String
actor2_geo_country_code   String
actor2_adm1_code          String
actor2_adm2_code          String
actor2_geo_lat            Float64
actor2_geo_long           Float64
actor2_geo_feature_id     String
action_geo_type           UInt64
action_geo_fullname       String
action_geo_country_code   String
action_adm1_code          String
action_adm2_code          String
action_geo_lat            Float64
action_geo_long           Float64
action_geo_feature_id     String
source_url                String

GDELT uses machine learning techniques to extract entities (actor1, actor2, action), sentiment/interest (goldstein, num_mentions, num_sources, num_articles, tone), geolocation (lat/long, ISO codes) and categorization (event_code, quad_class, geo_code) from world news.

Working with time

The GDELT data has five fields that represent time at various granularities: TIMEADDED (YYYYMMDDHHMMSS, UTC), DAY (integer), MONTHYEAR (integer), YEAR (integer) and FRACTIONDATE (floating point). This feels excessive to a modern data engineer's eye. Hydrolix can derive all interval/date patterns from a single timestamp.
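To illustrate why one timestamp is enough, here is a minimal Python sketch that derives day, month/year, year and a fractional-year value from a TIMEADDED-style string. It runs outside Hydrolix and is not how Hydrolix computes these values. The function name is hypothetical, and the fractional-year calculation is an assumption (share of the year elapsed); GDELT's own FRACTIONDATE formula may differ.

```python
from datetime import datetime, timezone

def derive_date_fields(time_added: str) -> dict:
    """Derive coarser date fields from a YYYYMMDDHHMMSS string in UTC."""
    ts = datetime.strptime(time_added, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    start_of_year = datetime(ts.year, 1, 1, tzinfo=timezone.utc)
    start_of_next_year = datetime(ts.year + 1, 1, 1, tzinfo=timezone.utc)
    # Assumption: fraction of the year elapsed at this instant.
    # This is NOT guaranteed to match GDELT's FRACTIONDATE definition.
    elapsed = (ts - start_of_year) / (start_of_next_year - start_of_year)
    return {
        "timestamp": ts,
        "day": int(ts.strftime("%Y%m%d")),       # e.g. 20200315
        "month_year": int(ts.strftime("%Y%m")),  # e.g. 202003
        "year": ts.year,
        "fraction_date": round(ts.year + elapsed, 4),
    }

fields = derive_date_fields("20200315120000")
for key, value in fields.items():
    print(f"{key}: {value}")
```

Every derived value here describes the date of the timestamp itself, so storing the separate columns adds nothing that cannot be recomputed at query time.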

Transforms and Views in Hydrolix

Hydrolix lets you define the data structure both at ingest time and at query time.

A Transform describes how data is written (ingested). It tells Hydrolix not only which data types are being written but also how the data should be treated.

GDELT Transform


{
    "name": "transform_gdelt",
    "type": "csv",
    "table": "{{table_uuid}}",
    "settings": {
        "compression": "gzip",
        "is_default": true,
        "format_details": {
            "delimiter": "\t"
        },
        "output_columns": [
            {
                "name": "timestamp",
                "datatype": {
                    "primary": true,
                    "type": "datetime",
                    "format": "20060102150405",
                    "resolution": "seconds",
                    "source": { "from_input_index": 0 }                  
                }
            },
            {
                "name": "global_event_id",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 1 }                  
                }
            },
            {
                "name": "day",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 2 }                  
                }
            },
            {
                "name": "month_year",
                "datatype": {
                    "type": "string",                
                    "source": { "from_input_index": 3 }                  
                }
            },
            {
                "name": "year",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 4 }                  
                }
            },
            {
                "name": "fraction_date",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 5 }                  
                }
            },
            {
                "name": "actor1_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 6 }                  
                }
            },
            {
                "name": "actor1_name",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 7 }                  
                }
            },
            {
                "name": "actor1_country_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 8 }                  
                }
            },
            {
                "name": "actor1_known_group_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 9 }                  
                }
            },
            {
                "name": "actor1_ethnic_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 10 }                  
                }
            },
            {
                "name": "actor1_religion_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 11 }                  
                }
            },
            {
                "name": "actor1_religion2_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 12 }                  
                }
            },
            {
                "name": "actor1_type1_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 13 }                  
                }
            },
            {
                "name": "actor1_type2_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 14 }                  
                }
            },
            {
                "name": "actor1_type3_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 15 }                  
                }
            },
            {
                "name": "actor2_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 16 }                  
                }
            },
            {
                "name": "actor2_name",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 17 }                  
                }
            },
            {
                "name": "actor2_country_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 18 }                  
                }
            },
            {
                "name": "actor2_known_group_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 19 }                  
                }
            },
            {
                "name": "actor2_ethnic_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 20 }                  
                }
            },
            {
                "name": "actor2_religion_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 21 }                  
                }
            },
            {
                "name": "actor2_religion2_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 22 }                  
                }
            },
            {
                "name": "actor2_type1_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 23 }                  
                }
            },
            {
                "name": "actor2_type2_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 24 }                  
                }
            },
            {
                "name": "actor2_type3_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 25 }                  
                }
            },
            {
                "name": "is_root_event",
                "datatype": {
                    "type": "bool",
                    "source": { "from_input_index": 26 }                  
                }
            },
            {
                "name": "event_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 27 }                  
                }
            },
            {
                "name": "event_base_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 28 }                  
                }
            },
            {
                "name": "event_root_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 29 }                  
                }
            },
            {
                "name": "quad_class",
                "datatype": {
                    "type": "uint32",
                    "source": { "from_input_index": 30 }                  
                }
            },
            {
                "name": "goldstein_scale",
                "datatype": {
                    "type": "double",
                    "source": { "from_input_index": 31 }                  
                }
            },
            {
                "name": "num_mentions",
                "datatype": {
                    "type": "uint32",
                    "source": { "from_input_index": 32 }                  
                }
            },
            {
                "name": "num_sources",
                "datatype": {
                    "type": "uint32",
                    "source": { "from_input_index": 33 }                  
                }
            },
            {
                "name": "num_articles",
                "datatype": {
                    "type": "uint32",
                    "source": { "from_input_index": 34 }                  
                }
            },
            {
                "name": "avg_tone",
                "datatype": {
                    "type": "double",
                    "source": { "from_input_index": 35 }                  
                }
            },
            {
                "name": "actor1_geo_type",
                "datatype": {
                    "type": "uint32",
                    "source": { "from_input_index": 36 }                  
                }
            },
            {
                "name": "actor1_geo_fullname",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 37 }                  
                }
            },
            {
                "name": "actor1_geo_country_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 38 }
                }
            },
            {
                "name": "actor1_adm1_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 39 }                  
                }
            },
            {
                "name": "actor1_adm2_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 40 }                  
                }
            },
            {
                "name": "actor1_geo_lat",
                "datatype": {
                    "type": "double",
                    "source": { "from_input_index": 41 }                  
                }
            },
            {
                "name": "actor1_geo_long",
                "datatype": {
                    "type": "double",
                    "source": { "from_input_index": 42 }                  
                }
            },
            {
                "name": "actor1_geo_feature_id",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 43 }                  
                }
            },
            {
                "name": "actor2_geo_type",
                "datatype": {
                    "type": "uint32",
                    "source": { "from_input_index": 44 }                  
                }
            },
            {
                "name": "actor2_geo_fullname",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 45 }                  
                }
            },
            {
                "name": "actor2_geo_country_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 46 }                  
                }
            },
            {
                "name": "actor2_adm1_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 47 }                  
                }
            },
            {
                "name": "actor2_adm2_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 48 }                  
                }
            },
            {
                "name": "actor2_geo_lat",
                "datatype": {
                    "type": "double",
                    "source": { "from_input_index": 49 }                  
                }
            },
            {
                "name": "actor2_geo_long",
                "datatype": {
                    "type": "double",
                    "source": { "from_input_index": 50 }                  
                }
            },
            {
                "name": "actor2_geo_feature_id",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 51 }                  
                }
            },
            {
                "name": "action_geo_type",
                "datatype": {
                    "type": "uint32",
                    "source": { "from_input_index": 52 }                  
                }
            },
            {
                "name": "action_geo_fullname",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 53 }                  
                }
            },
            {
                "name": "action_geo_country_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 54 }                  
                }
            },
            {
                "name": "action_adm1_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 55 }                  
                }
            },
            {
                "name": "action_adm2_code",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 56 }                  
                }
            },
            {
                "name": "action_geo_lat",
                "datatype": {
                    "type": "double",
                    "source": { "from_input_index": 57 }                  
                }
            },
            {
                "name": "action_geo_long",
                "datatype": {
                    "type": "double",
                    "source": { "from_input_index": 58 }                  
                }
            },
            {
                "name": "action_geo_feature_id",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 59 }                  
                }
            },
            {
                "name": "source_url",
                "datatype": {
                    "type": "string",
                    "source": { "from_input_index": 60 }                  
                }
            }
        ]
    }
}
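
A transform this long is easy to get wrong by hand, for example by dropping a comma or repeating an input index. One way to avoid that is to generate the JSON from an ordered list of columns. The Python below is a sketch under assumptions: the helper names are hypothetical, only the first few columns are listed, and the keys simply mirror the transform shown above rather than a verified Hydrolix schema.

```python
import json

# (column name, datatype) in input order. Only the first few GDELT columns
# are listed to keep the sketch short; extend the list for the full table.
COLUMNS = [
    ("timestamp", "datetime"),
    ("global_event_id", "string"),
    ("day", "string"),
    ("month_year", "string"),
]

def build_output_columns(columns):
    """Build output_columns entries whose from_input_index follows list order."""
    output = []
    for index, (name, type_name) in enumerate(columns):
        datatype = {"type": type_name, "source": {"from_input_index": index}}
        if type_name == "datetime":
            # Settings copied from the timestamp column of the transform above.
            datatype.update(
                {"primary": True, "format": "20060102150405", "resolution": "seconds"}
            )
        output.append({"name": name, "datatype": datatype})
    return output

def build_transform(columns, table="{{table_uuid}}"):
    """Assemble a transform document with the same top-level keys as above."""
    return {
        "name": "transform_gdelt",
        "type": "csv",
        "table": table,
        "settings": {
            "compression": "gzip",
            "is_default": True,
            "format_details": {"delimiter": "\t"},
            "output_columns": build_output_columns(columns),
        },
    }

transform = build_transform(COLUMNS)
print(json.dumps(transform, indent=4))
```

Because the index comes from the position in the list, each column maps to exactly one input field, and json.dumps always emits syntactically valid JSON.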