Statistical Aggregations

Statistical aggregations can tell you a lot of information about a data set. For this tutorial, we will use the GDELT data set and focus on the avg_tone column. From the GDELT documentation, avg_tone 'is the average “tone” of all documents containing one or more mentions of this event during the 15 minute update in which it was first seen. The score ranges from -100 (extremely negative) to +100 (extremely positive). Common values range between -10 and +10, with 0 indicating neutral.

First, let's get some general information, by year. Understanding the min, max, and avg is a great place to start:

Calculate min, avg, max, stddev⚓︎

SQLCURL

SELECT toYear(timestamp) as year, round(min(avg_tone),3) AS min, round(avg(avg_tone),3) AS avg, round(max(avg_tone),3) AS max, round(stddevPop(avg_tone),3)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC

curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) as year, round(min(avg_tone),3) AS min, round(avg(avg_tone),3) AS avg, round(max(avg_tone),3) AS max, round(stddevPop(avg_tone),3)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
EOF

Looking at this, average tone of news articles are slightly negative, but over time, the range of values of tone has varied greatly. average tone and the standard deviation of the tone is staying fairly steady, meaning that the news, while staying mostly neutral, has more outliers of extemely positive and extremely negative articles.

Over these years, it would be interesting to see the top source domains:

Find top occurring values in a set, topK⚓︎

SQLCURL

SELECT toYear(timestamp) as year, topK(3)(domainWithoutWWW(source_url))
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC

curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) as year, topK(3)(domainWithoutWWW(source_url))
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
EOF

Also interesting to investigate would be the domains that had the post positive and most negative article per year:

argMax, argMin - find the URLs with the most +/- articles⚓︎

SQLCURL

SELECT toYear(timestamp) as year, argMax(domainWithoutWWW(source_url), avg_tone), argMin(domainWithoutWWW(source_url), avg_tone)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC

curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) as year, argMax(domainWithoutWWW(source_url), avg_tone), argMin(domainWithoutWWW(source_url), avg_tone)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
EOF

The one that stands out is msn.com in 2017. It would be interesting to drill into more of the tone trends for that website.

Statistics for an entity in the set⚓︎

SQLCURL

SELECT toYear(timestamp) as year, round(min(avg_tone),3), round(avg(avg_tone),3), round(max(avg_tone),3), count()
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
  AND equals(domainWithoutWWW(source_url),'msn.com')
GROUP BY year
ORDER BY year ASC

curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) as year, round(min(avg_tone),3), round(avg(avg_tone),3), round(max(avg_tone),3), count()
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
  AND equals(domainWithoutWWW(source_url),'msn.com')
GROUP BY year
ORDER BY year ASC
EOF

The fact that there are fewer articles in 2019 is most likely because the 2019 data set isn't complete.

Get the month_year of the most positive and negative events by year⚓︎

SQLCURL

SELECT toYear(timestamp) as year, argMin(toMonth(timestamp), avg_tone), argMax(toMonth(timestamp), avg_tone)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC

curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) as year, argMin(toMonth(timestamp), avg_tone), argMax(toMonth(timestamp), avg_tone)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
EOF

The most negative month in 2016 was March, which was likely due to US primary elections.