Getting Started with Hydrolix

In this tutorial we will explore world events using GDELT, a public data set of world news events. The data set contains 344 M rows (with 60 fields each) and occupies 24 GB once compressed and indexed by Hydrolix on AWS S3. The original size is hard to calculate because the data is spread across several zipped files, but we estimate that the ingested size is about half the size of the gzip'ed files.
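
To put those figures in perspective, the short Python sketch below works out the average stored size per row and the gzip'ed source size implied by the "about half" estimate. It assumes decimal gigabytes (1 GB = 10^9 bytes) and treats the 0.5 ratio as exact, so the results are rough approximations only.

```python
# Figures quoted in the text above.
ROWS = 344_000_000           # 344 M rows
STORED_BYTES = 24 * 10**9    # 24 GB, assuming decimal gigabytes


def bytes_per_row(stored_bytes: int, rows: int) -> float:
    """Average stored bytes per row after compression and indexing."""
    return stored_bytes / rows


def estimated_gzip_bytes(stored_bytes: int, ratio: float = 0.5) -> float:
    """Estimate the gzip'ed source size, assuming stored = ratio * gzip size."""
    return stored_bytes / ratio


per_row = bytes_per_row(STORED_BYTES, ROWS)
gzip_gb = estimated_gzip_bytes(STORED_BYTES) / 10**9
print(f"{per_row:.1f} bytes per row, ~{gzip_gb:.0f} GB gzip'ed")
# prints: 69.8 bytes per row, ~48 GB gzip'ed
```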

GDELT Data Structure

The GDELT data structure is described here, and there is also a handy cheat-sheet. We converted the field names to lower snake case (e.g. actor1_name) and moved TIMEADDED to the timestamp field.

GDELT Columns and Types

Column                    Data Type
timestamp                 DateTime
global_event_id           String
day                       String
month_year                String
year                      String
fraction_date             String
actor1_code               String
actor1_name               String
actor1_country_code       String
actor1_known_group_code   String
actor1_ethnic_code        String
actor1_religion_code      String
actor1_religion2_code     String
actor1_type1_code         String
actor1_type2_code         String
actor1_type3_code         String
actor2_code               String
actor2_name               String
actor2_country_code       String
actor2_known_group_code   String
actor2_ethnic_code        String
actor2_religion_code      String
actor2_religion2_code     String
actor2_type1_code         String
actor2_type2_code         String
actor2_type3_code         String
is_root_event             UInt64
event_code                String
event_base_code           String
event_root_code           String
quad_class                UInt64
goldstein_scale           Float64
num_mentions              UInt64
num_sources               UInt64
num_articles              UInt64
avg_tone                  Float64
actor1_geo_type           UInt64
actor1_geo_fullname       String
actor1_geo_country_code   String
actor1_adm1_code          String
actor1_adm2_code          String
actor1_geo_lat            Float64
actor1_geo_long           Float64
actor1_geo_feature_id     String
actor2_geo_type           UInt64
actor2_geo_fullname       String
actor2_geo_country_code   String
actor2_adm1_code          String
actor2_adm2_code          String
actor2_geo_lat            Float64
actor2_geo_long           Float64
actor2_geo_feature_id     String
action_geo_type           UInt64
action_geo_fullname       String
action_geo_country_code   String
action_adm1_code          String
action_adm2_code          String
action_geo_lat            Float64
action_geo_long           Float64
action_geo_feature_id     String
source_url                String

GDELT uses machine learning techniques to extract entities (actor1, actor2, action), sentiment/interest measures (goldstein_scale, num_mentions, num_sources, num_articles, avg_tone), geolocation (lat/long, country codes) and categorization (event_code, quad_class, geo_code) from world news.

Working with time

The GDELT data has 5 fields to represent various time intervals: TIMEADDED (YYYYMMDDHHMMSS UTC timezone), DAY (integer), MONTHYEAR (integer), YEAR (integer) and FRACTIONDATE (floating point). This feels excessive to a modern data engineer’s eye. Hydrolix can derive all interval/date patterns from a single timestamp.
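
As an illustration of why the extra fields are redundant, the Python sketch below derives DAY, MONTHYEAR, YEAR and a fractional date from a single TIMEADDED value. The function names are hypothetical, and the fractional-date formula (elapsed days divided by days in the year) is an assumption made for this example; it may not match GDELT's own FRACTIONDATE calculation exactly.

```python
import calendar
from datetime import datetime, timezone


def parse_timeadded(value: str) -> datetime:
    """Parse a TIMEADDED string (YYYYMMDDHHMMSS, UTC) into an aware datetime."""
    return datetime.strptime(value, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def derive_intervals(ts: datetime) -> dict:
    """Derive the other four GDELT time fields from one timestamp."""
    days_in_year = 366 if calendar.isleap(ts.year) else 365
    day_of_year = ts.timetuple().tm_yday
    return {
        "day": ts.year * 10000 + ts.month * 100 + ts.day,   # YYYYMMDD
        "month_year": ts.year * 100 + ts.month,             # YYYYMM
        "year": ts.year,                                    # YYYY
        # Assumed formula: fraction of the year elapsed before this day.
        "fraction_date": round(ts.year + (day_of_year - 1) / days_in_year, 4),
    }


derived = derive_intervals(parse_timeadded("20200315120000"))
print(derived)
```

In a query engine the same derivations would normally be expressed with date functions over the timestamp column rather than stored as separate columns.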

Transforms and Views in Hydrolix

Hydrolix lets you define the data structure both at ingest time and at query time.

A Transform is how data is written/ingested. It informs Hydrolix not only what data types are being written but also how the data should be treated.

GDELT Transform

{
	"name": "transform_gdelt",
	"type": "csv",
	"table": "{{table_uuid}}",
	"settings": {
		"compression": "gzip",
		"is_default": true,
		"format_details": {
			"delimiter": "\t"
		},
		"output_columns": [
			{
				"name": "timestamp",
				"type": "datetime",
				"format": "20060102150405",
				"treatment": "primary",
				"position": 0
			},
			{
				"name": "global_event_id",
				"type": "string",
				"treatment": "tag",
				"position": 1
			},
			{
				"name": "day",
				"type": "string",
				"treatment": "tag",
				"position": 2
			},
			{
				"name": "month_year",
				"type": "string",
				"treatment": "tag",
				"position": 3
			},
			{
				"name": "year",
				"type": "string",
				"treatment": "tag",
				"position": 4
			},
			{
				"name": "fraction_date",
				"type": "string",
				"treatment": "tag",
				"position": 5
			},
			{
				"name": "actor1_code",
				"type": "string",
				"treatment": "tag",
				"position": 6
			},
			{
				"name": "actor1_name",
				"type": "string",
				"treatment": "tag",
				"position": 7
			},
			{
				"name": "actor1_country_code",
				"type": "string",
				"treatment": "tag",
				"position": 8
			},
			{
				"name": "actor1_known_group_code",
				"type": "string",
				"treatment": "tag",
				"position": 9
			},
			{
				"name": "actor1_ethnic_code",
				"type": "string",
				"treatment": "tag",
				"position": 10
			},
			{
				"name": "actor1_religion_code",
				"type": "string",
				"treatment": "tag",
				"position": 11
			},
			{
				"name": "actor1_religion2_code",
				"type": "string",
				"treatment": "tag",
				"position": 12
			},
			{
				"name": "actor1_type1_code",
				"type": "string",
				"treatment": "tag",
				"position": 13
			},
			{
				"name": "actor1_type2_code",
				"type": "string",
				"treatment": "tag",
				"position": 14
			},
			{
				"name": "actor1_type3_code",
				"type": "string",
				"treatment": "tag",
				"position": 15
			},
			{
				"name": "actor2_code",
				"type": "string",
				"treatment": "tag",
				"position": 16
			},
			{
				"name": "actor2_name",
				"type": "string",
				"treatment": "tag",
				"position": 17
			},
			{
				"name": "actor2_country_code",
				"type": "string",
				"treatment": "tag",
				"position": 18
			},
			{
				"name": "actor2_known_group_code",
				"type": "string",
				"treatment": "tag",
				"position": 19
			},
			{
				"name": "actor2_ethnic_code",
				"type": "string",
				"treatment": "tag",
				"position": 20
			},
			{
				"name": "actor2_religion_code",
				"type": "string",
				"treatment": "tag",
				"position": 21
			},
			{
				"name": "actor2_religion2_code",
				"type": "string",
				"treatment": "tag",
				"position": 22
			},
			{
				"name": "actor2_type1_code",
				"type": "string",
				"treatment": "tag",
				"position": 23
			},
			{
				"name": "actor2_type2_code",
				"type": "string",
				"treatment": "tag",
				"position": 24
			},
			{
				"name": "actor2_type3_code",
				"type": "string",
				"treatment": "tag",
				"position": 25
			},
			{
				"name": "is_root_event",
				"type": "uint64",
				"treatment": "tag",
				"position": 26
			},
			{
				"name": "event_code",
				"type": "string",
				"treatment": "tag",
				"position": 27
			},
			{
				"name": "event_base_code",
				"type": "string",
				"treatment": "tag",
				"position": 28
			},
			{
				"name": "event_root_code",
				"type": "string",
				"treatment": "tag",
				"position": 29
			},
			{
				"name": "quad_class",
				"type": "uint64",
				"treatment": "tag",
				"position": 30
			},
			{
				"name": "goldstein_scale",
				"type": "double",
				"treatment": "metric",
				"position": 31
			},
			{
				"name": "num_mentions",
				"type": "uint64",
				"treatment": "metric",
				"position": 32
			},
			{
				"name": "num_sources",
				"type": "uint64",
				"treatment": "metric",
				"position": 33
			},
			{
				"name": "num_articles",
				"type": "uint64",
				"treatment": "metric",
				"position": 34
			},
			{
				"name": "avg_tone",
				"type": "double",
				"treatment": "metric",
				"position": 35
			},
			{
				"name": "actor1_geo_type",
				"type": "uint64",
				"treatment": "tag",
				"position": 36
			},
			{
				"name": "actor1_geo_fullname",
				"type": "string",
				"treatment": "tag",
				"position": 37
			},
			{
				"name": "actor1_geo_country_code",
				"type": "string",
				"treatment": "tag",
				"position": 38
			},
			{
				"name": "actor1_adm1_code",
				"type": "string",
				"treatment": "tag",
				"position": 39
			},
			{
				"name": "actor1_adm2_code",
				"type": "string",
				"treatment": "tag",
				"position": 40
			},
			{
				"name": "actor1_geo_lat",
				"type": "double",
				"treatment": "metric",
				"position": 41
			},
			{
				"name": "actor1_geo_long",
				"type": "double",
				"treatment": "metric",
				"position": 42
			},
			{
				"name": "actor1_geo_feature_id",
				"type": "string",
				"treatment": "tag",
				"position": 43
			},
			{
				"name": "actor2_geo_type",
				"type": "uint64",
				"treatment": "tag",
				"position": 44
			},
			{
				"name": "actor2_geo_fullname",
				"type": "string",
				"treatment": "tag",
				"position": 45
			},
			{
				"name": "actor2_geo_country_code",
				"type": "string",
				"treatment": "tag",
				"position": 46
			},
			{
				"name": "actor2_adm1_code",
				"type": "string",
				"treatment": "tag",
				"position": 47
			},
			{
				"name": "actor2_adm2_code",
				"type": "string",
				"treatment": "tag",
				"position": 48
			},
			{
				"name": "actor2_geo_lat",
				"type": "double",
				"treatment": "metric",
				"position": 49
			},
			{
				"name": "actor2_geo_long",
				"type": "double",
				"treatment": "metric",
				"position": 50
			},
			{
				"name": "actor2_geo_feature_id",
				"type": "string",
				"treatment": "tag",
				"position": 51
			},
			{
				"name": "action_geo_type",
				"type": "uint64",
				"treatment": "tag",
				"position": 52
			},
			{
				"name": "action_geo_fullname",
				"type": "string",
				"treatment": "tag",
				"position": 53
			},
			{
				"name": "action_geo_country_code",
				"type": "string",
				"treatment": "tag",
				"position": 54
			},
			{
				"name": "action_adm1_code",
				"type": "string",
				"treatment": "tag",
				"position": 55
			},
			{
				"name": "action_adm2_code",
				"type": "string",
				"treatment": "tag",
				"position": 56
			},
			{
				"name": "action_geo_lat",
				"type": "double",
				"treatment": "metric",
				"position": 57
			},
			{
				"name": "action_geo_long",
				"type": "double",
				"treatment": "metric",
				"position": 58
			},
			{
				"name": "action_geo_feature_id",
				"type": "string",
				"treatment": "tag",
				"position": 59
			},
			{
				"name": "source_url",
				"type": "string",
				"treatment": "tag",
				"position": 60
			}
		]
	}
}
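
To make the roles of "position", "type" and "format" more concrete, here is a minimal Python sketch that applies a four-column excerpt of the transform above to one tab-delimited row. It is only an illustration of the idea, not how Hydrolix implements ingestion. The helper names and the sample values are hypothetical, and the code assumes that the "format" value is a Go-style reference-time layout.

```python
from datetime import datetime, timezone

# Assumption: "format" is a Go-style reference-time layout.
# Only the single layout used by this transform is mapped here.
GO_LAYOUT_TO_STRPTIME = {"20060102150405": "%Y%m%d%H%M%S"}

# Excerpt of the transform above (the full version defines positions 0 to 60).
transform = {
    "settings": {
        "format_details": {"delimiter": "\t"},
        "output_columns": [
            {"name": "timestamp", "type": "datetime",
             "format": "20060102150405", "position": 0},
            {"name": "global_event_id", "type": "string", "position": 1},
            {"name": "num_mentions", "type": "uint64", "position": 32},
            {"name": "avg_tone", "type": "double", "position": 35},
        ],
    }
}


def convert(value: str, column: dict):
    """Convert one raw text field to the type declared for its column."""
    kind = column["type"]
    if kind == "datetime":
        fmt = GO_LAYOUT_TO_STRPTIME[column["format"]]
        return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    if kind == "string":
        return value
    if value == "":
        return None  # treat empty numeric fields as missing
    if kind == "uint64":
        return int(value)
    if kind == "double":
        return float(value)
    raise ValueError(f"unsupported type: {kind}")


def apply_transform(line: str, transform: dict) -> dict:
    """Split a delimited line and pick out each column by its position."""
    settings = transform["settings"]
    fields = line.rstrip("\n").split(settings["format_details"]["delimiter"])
    return {
        col["name"]: convert(fields[col["position"]], col)
        for col in settings["output_columns"]
    }


# Hypothetical 61-field row with only the excerpted positions filled in.
fields = [""] * 61
fields[0], fields[1], fields[32], fields[35] = "20200315120000", "912345678", "10", "-2.5"
row = apply_transform("\t".join(fields), transform)
print(row)
```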

A View is how data is read/queried. Multiple views can be applied to the same data set. When a view is not specified, all columns are available with their original treatments and data types. This is the case for the GDELT data set in Hydrolix: a Transform was specified on ingestion, and that same definition serves as the View on query.

GDELT View

{
	"name": "transform_gdelt",
	"type": "csv",
	"table": "{{table_uuid}}",
	"settings": {
		"compression": "gzip",
		"is_default": true,
		"format_details": {
			"delimiter": "\t"
		},
		"output_columns": [
			{
				"name": "timestamp",
				"type": "datetime",
				"format": "20060102150405",
				"treatment": "primary",
				"position": 0
			},
			{
				"name": "global_event_id",
				"type": "string",
				"treatment": "tag",
				"position": 1
			},
			{
				"name": "day",
				"type": "string",
				"treatment": "tag",
				"position": 2
			},
			{
				"name": "month_year",
				"type": "string",
				"treatment": "tag",
				"position": 3
			},
			{
				"name": "year",
				"type": "string",
				"treatment": "tag",
				"position": 4
			},
			{
				"name": "fraction_date",
				"type": "string",
				"treatment": "tag",
				"position": 5
			},
			{
				"name": "actor1_code",
				"type": "string",
				"treatment": "tag",
				"position": 6
			},
			{
				"name": "actor1_name",
				"type": "string",
				"treatment": "tag",
				"position": 7
			},
			{
				"name": "actor1_country_code",
				"type": "string",
				"treatment": "tag",
				"position": 8
			},
			{
				"name": "actor1_known_group_code",
				"type": "string",
				"treatment": "tag",
				"position": 9
			},
			{
				"name": "actor1_ethnic_code",
				"type": "string",
				"treatment": "tag",
				"position": 10
			},
			{
				"name": "actor1_religion_code",
				"type": "string",
				"treatment": "tag",
				"position": 11
			},
			{
				"name": "actor1_religion2_code",
				"type": "string",
				"treatment": "tag",
				"position": 12
			},
			{
				"name": "actor1_type1_code",
				"type": "string",
				"treatment": "tag",
				"position": 13
			},
			{
				"name": "actor1_type2_code",
				"type": "string",
				"treatment": "tag",
				"position": 14
			},
			{
				"name": "actor1_type3_code",
				"type": "string",
				"treatment": "tag",
				"position": 15
			},
			{
				"name": "actor2_code",
				"type": "string",
				"treatment": "tag",
				"position": 16
			},
			{
				"name": "actor2_name",
				"type": "string",
				"treatment": "tag",
				"position": 17
			},
			{
				"name": "actor2_country_code",
				"type": "string",
				"treatment": "tag",
				"position": 18
			},
			{
				"name": "actor2_known_group_code",
				"type": "string",
				"treatment": "tag",
				"position": 19
			},
			{
				"name": "actor2_ethnic_code",
				"type": "string",
				"treatment": "tag",
				"position": 20
			},
			{
				"name": "actor2_religion_code",
				"type": "string",
				"treatment": "tag",
				"position": 21
			},
			{
				"name": "actor2_religion2_code",
				"type": "string",
				"treatment": "tag",
				"position": 22
			},
			{
				"name": "actor2_type1_code",
				"type": "string",
				"treatment": "tag",
				"position": 23
			},
			{
				"name": "actor2_type2_code",
				"type": "string",
				"treatment": "tag",
				"position": 24
			},
			{
				"name": "actor2_type3_code",
				"type": "string",
				"treatment": "tag",
				"position": 25
			},
			{
				"name": "is_root_event",
				"type": "uint64",
				"treatment": "tag",
				"position": 26
			},
			{
				"name": "event_code",
				"type": "string",
				"treatment": "tag",
				"position": 27
			},
			{
				"name": "event_base_code",
				"type": "string",
				"treatment": "tag",
				"position": 28
			},
			{
				"name": "event_root_code",
				"type": "string",
				"treatment": "tag",
				"position": 29
			},
			{
				"name": "quad_class",
				"type": "uint64",
				"treatment": "tag",
				"position": 30
			},
			{
				"name": "goldstein_scale",
				"type": "double",
				"treatment": "metric",
				"position": 31
			},
			{
				"name": "num_mentions",
				"type": "uint64",
				"treatment": "metric",
				"position": 32
			},
			{
				"name": "num_sources",
				"type": "uint64",
				"treatment": "metric",
				"position": 33
			},
			{
				"name": "num_articles",
				"type": "uint64",
				"treatment": "metric",
				"position": 34
			},
			{
				"name": "avg_tone",
				"type": "double",
				"treatment": "metric",
				"position": 35
			},
			{
				"name": "actor1_geo_type",
				"type": "uint64",
				"treatment": "tag",
				"position": 36
			},
			{
				"name": "actor1_geo_fullname",
				"type": "string",
				"treatment": "tag",
				"position": 37
			},
			{
				"name": "actor1_geo_country_code",
				"type": "string",
				"treatment": "tag",
				"position": 38
			},
			{
				"name": "actor1_adm1_code",
				"type": "string",
				"treatment": "tag",
				"position": 39
			},
			{
				"name": "actor1_adm2_code",
				"type": "string",
				"treatment": "tag",
				"position": 40
			},
			{
				"name": "actor1_geo_lat",
				"type": "double",
				"treatment": "metric",
				"position": 41
			},
			{
				"name": "actor1_geo_long",
				"type": "double",
				"treatment": "metric",
				"position": 42
			},
			{
				"name": "actor1_geo_feature_id",
				"type": "string",
				"treatment": "tag",
				"position": 43
			},
			{
				"name": "actor2_geo_type",
				"type": "uint64",
				"treatment": "tag",
				"position": 44
			},
			{
				"name": "actor2_geo_fullname",
				"type": "string",
				"treatment": "tag",
				"position": 45
			},
			{
				"name": "actor2_geo_country_code",
				"type": "string",
				"treatment": "tag",
				"position": 46
			},
			{
				"name": "actor2_adm1_code",
				"type": "string",
				"treatment": "tag",
				"position": 47
			},
			{
				"name": "actor2_adm2_code",
				"type": "string",
				"treatment": "tag",
				"position": 48
			},
			{
				"name": "actor2_geo_lat",
				"type": "double",
				"treatment": "metric",
				"position": 49
			},
			{
				"name": "actor2_geo_long",
				"type": "double",
				"treatment": "metric",
				"position": 50
			},
			{
				"name": "actor2_geo_feature_id",
				"type": "string",
				"treatment": "tag",
				"position": 51
			},
			{
				"name": "action_geo_type",
				"type": "uint64",
				"treatment": "tag",
				"position": 52
			},
			{
				"name": "action_geo_fullname",
				"type": "string",
				"treatment": "tag",
				"position": 53
			},
			{
				"name": "action_geo_country_code",
				"type": "string",
				"treatment": "tag",
				"position": 54
			},
			{
				"name": "action_adm1_code",
				"type": "string",
				"treatment": "tag",
				"position": 55
			},
			{
				"name": "action_adm2_code",
				"type": "string",
				"treatment": "tag",
				"position": 56
			},
			{
				"name": "action_geo_lat",
				"type": "double",
				"treatment": "metric",
				"position": 57
			},
			{
				"name": "action_geo_long",
				"type": "double",
				"treatment": "metric",
				"position": 58
			},
			{
				"name": "action_geo_feature_id",
				"type": "string",
				"treatment": "tag",
				"position": 59
			},
			{
				"name": "source_url",
				"type": "string",
				"treatment": "tag",
				"position": 60
			}
		]
	}
}