Statistical Aggregations

Statistical aggregations can tell you a lot of information about a data set. For this tutorial, we will use the GDELT data set and focus on the avg_tone column. From the GDELT documentation, avg_tone ‘is the average “tone” of all documents containing one or more mentions of this event during the 15 minute update in which it was first seen. The score ranges from -100 (extremely negative) to +100 (extremely positive). Common values range between -10 and +10, with 0 indicating neutral.’

First, let’s get some general information, by year. Understanding the min, max, and avg is a great place to start:

Calculate min, avg, max, stddev

SELECT toYear(timestamp) as year, round(min(avg_tone),3) AS min, round(avg(avg_tone),3) AS avg, round(max(avg_tone),3) AS max, round(stddevPop(avg_tone),3) 
FROM sample.gdelt 
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00') 
GROUP BY year 
ORDER BY year ASC

Looking at this, average tone of news articles are slightly negative, but over time, the range of values of tone has varied greatly. average tone and the standard deviation of the tone is staying fairly steady, meaning that the news, while staying mostly neutral, has more outliers of extemely positive and extremely negative articles.

Over these years, it would be interesting to see the top source domains:

find top occurring values in a set, topK

SELECT toYear(timestamp) as year, topK(3)(domainWithoutWWW(source_url)) 
FROM sample.gdelt  
WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00') 
GROUP BY year 
ORDER BY year ASC

Also interesting to investigate would be the domains that had the post positive and most negative article per year:

argMax, argMin - find the URLs with the most +/- articles

SELECT toYear(timestamp) as year, argMax(domainWithoutWWW(source_url), avg_tone), argMin(domainWithoutWWW(source_url), avg_tone) 
FROM sample.gdelt 
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00') 
GROUP BY year 
ORDER BY year ASC

The one that stands out is msn.com in 2017. It would be interesting to drill into more of the tone trends for that website.

Statistics for an entity in the set

SELECT toYear(timestamp) as year, round(min(avg_tone),3), round(avg(avg_tone),3), round(max(avg_tone),3), count() 
FROM sample.gdelt 
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00') 
  AND equals(domainWithoutWWW(source_url),'msn.com') 
GROUP BY year 
ORDER BY year ASC

The fact that there are fewer articles in 2019 is most likely because the 2019 data set is not complete.

Get the month_year of the most positive and negative events by year

SELECT toYear(timestamp) as year, argMin(toMonth(timestamp), avg_tone), argMax(toMonth(timestamp), avg_tone) 
FROM sample.gdelt 
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00') 
GROUP BY year 
ORDER BY year ASC

The most negative month in 2016 was March, which was likely due to US primary elections.