Skip to content

Basic Queries

Basic Queries⚓︎

search by time⚓︎

The underlying data has a 15 minute granularity. Aggregating the data into higher time intervals is achieved using Time Functions. Here we are using toYear(datetime) and the count() function to count the number of events by year. 

1
2
3
4
5
SELECT toYear(timestamp) AS year, count()
  FROM sample.gdelt 
  WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00') 
  GROUP BY year 
  ORDER BY year ASC
curl cURL curl --data-binary @- https://try3.hydrolix.live/query <<EOF SELECT toYear(timestamp) AS year, count() FROM sample.gdelt WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00') GROUP BY year ORDER BY year ASC EOF
This returns about 344 million events with 61 million in 2018 and about 51 million in 2019
2015 59,063,790 2016 97,228,713 2017 75,609,229 2018 61,544,481 2019 50,877,104
1
2
3
4
5
6
7
8
We could have partitioned by `toMonth(), toWeek(), toHour(), toQuarter()` and hence derived all the extra unnecessary time fields in the GDELT data schema.

This query used absolute times, but relative time and date math is also valid. We could as easily query the last 10 years(`subtractYears(date, num)`) from `now()`

```sql
SELECT count()
  FROM sample.gdelt 
  WHERE (timestamp BETWEEN subtractYears(now(), 10)  AND now())
curl cURL curl --data-binary @- https://try3.hydrolix.live/query <<EOF SELECT count() FROM sample.gdelt WHERE (timestamp BETWEEN subtractYears(now(), 10) AND now()) EOF
Time Functions can be used in SQL `SELECT`, `WHERE` and `GROUP` clauses.

## Count distinct values

Figure out how many distinct primary actors there are.

```sql
SELECT uniq(actor1_name) 
  FROM sample.gdelt  
  WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00')
```curl cURL curl --data-binary @- https://try3.hydrolix.live/query <<EOF SELECT uniq(actor1_name) FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00') EOF
During this time, there were 16,955 unique identities in `actor1_name`.

## Find the most frequently occuring values

Find the 10 most occuring values of `actor1_name`

```sql
SELECT topK(10)(actor1_name) 
  FROM sample.gdelt
  WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00') ;
1
2
3
4
5
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT topK(10)(actor1_name) 
  FROM sample.gdelt
  WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00') ;
EOF

This query returns this list:

['UNITED STATES','\0','POLICE','UNITED KINGDOM','PRESIDENT','GOVERNMENT','CHINA','SCHOOL','RUSSIA','CANADA']

One of the most occurring values is '\0' where no primary actor exists. 

Find the last article classified SCHOOL & avg_tone > 1⚓︎

We can easily find the last value in a filtered data set. argMax(field, datetime) is much simpler than grouping and ordering by time desc. It also works with field tuples.

1
2
3
4
5
SELECT argMax((timestamp, avg_tone, source_url), timestamp)
  FROM sample.gdelt 
  WHERE (timestamp BETWEEN '2018-01-01 00:00:00' AND '2020-01-01 00:00:00') 
    AND actor1_name='SCHOOL' 
    AND avg_tone > 1
curl cURL curl --data-binary @- https://try3.hydrolix.live/query <<EOF SELECT argMax((timestamp, avg_tone, source_url), timestamp) FROM sample.gdelt WHERE (timestamp BETWEEN '2018-01-01 00:00:00' AND '2020-01-01 00:00:00') AND actor1_name='SCHOOL' AND avg_tone > 1 EOF
The latest article classified as `SCHOOL` and with a positive `avg_tone` was an Edison State Community College conference
'2019-11-20 21:45:00', 3.6649214659685896, 'https://www.dailycall.com/news/69658/edison-state-holds-guidance-counselor-conference'
## Find the average value

Calculate the average tone of events for a particular primary actor by month.

```sql
SELECT toYear(timestamp) AS year, round(avg(avg_tone),3) AS tone
  FROM sample.gdelt 
  WHERE (timestamp between '2018-01-01 00:00:00' AND '2020-01-01 00:00:00') 
    AND actor1_name='SCHOOL'
  GROUP BY year 
  ORDER BY year ASC;
curl cURL curl --data-binary @- https://try3.hydrolix.live/query <<EOF SELECT toYear(timestamp) AS year, round(avg(avg_tone),3) AS tone FROM sample.gdelt WHERE (timestamp between '2018-01-01 00:00:00' AND '2020-01-01 00:00:00') AND actor1_name='SCHOOL' GROUP BY year ORDER BY year ASC EOF

2018 -0.651 2019 -0.443
`avg_tone` has values ranging from -100 (very negative) to +100 (very positive). Even though average sentiment has gone up in events where `SCHOOL` is the primary actor, these articles **very** slightly negative in tone. The value is greater than -1, but less than zero. 

## Count event types

Protests (event_root_code='14') are an interesting event type to monitor over time. 

```sql
SELECT toStartOfMonth(timestamp) month_year, sum(num_articles) articles
  FROM sample.gdelt 
  WHERE (timestamp BETWEEN '2016-09-01 00:00:00' AND '2017-01-31 00:00:00') 
    AND event_root_code='14'
  GROUP BY month_year
1
2
3
4
5
6
7
8
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) AS year, round(avg(avg_tone),3) AS tone
  FROM sample.gdelt 
  WHERE (timestamp between '2018-01-01 00:00:00' AND '2020-01-01 00:00:00') 
    AND actor1_name='SCHOOL'
  GROUP BY year 
  ORDER BY year ASC
EOF

Take for example the 2016 US election time frame. There is a marked jump in the number of protest around election time.

1
2
3
4
5
2016-09-01 295,535
2016-10-01 282,903
2016-11-01 369,609
2016-12-01 730,917
2017-01-01 309,717

Find the top source_url domains by year⚓︎

Each event has a source_url. We can use url functions to easily extract root domains domain(url).

1
2
3
4
SELECT toYear(timestamp) as year, topK(10)(domain(source_url)) 
 FROM sample.gdelt 
 WHERE (timestamp BETWEEN '2018-01-01 00:00:00' AND '2020-01-01 00:00:00') 
 GROUP BY year
1
2
3
4
5
6
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) as year, topK(10)(domain(source_url)) 
 FROM sample.gdelt 
 WHERE (timestamp BETWEEN '2018-01-01 00:00:00' AND '2020-01-01 00:00:00') 
 GROUP BY year
EOF

We have extracted the root domain from the full url at query time, allowing us to trend news outlets over time.

1
2
3
2018    ['www.msn.com','www.dailymail.co.uk','www.business-standard.com','allafrica.com','www.xinhuanet.com','www.thenews.com.pk','www.yahoo.com','www.4-traders.com','www.nbcnews.com','www.sfgate.com']

2019    ['www.dailymail.co.uk','www.msn.com','www.business-standard.com','news.yahoo.com','allafrica.com','menafn.com','www.thenews.com.pk','www.nbcnews.com','www.reuters.com','www.xinhuanet.com']