Statistical aggregations
Statistical aggregations can tell you a lot of information about a data set. For this tutorial, we will use the GDELT data set and focus on the avg_tone column. From the GDELT documentation, avg_tone 'is the average “tone” of all documents containing one or more
mentions of this event during the 15 minute update in which it was first seen. The score
ranges from -100 (extremely negative) to +100 (extremely positive). Common values range
between -10 and +10, with 0 indicating neutral.'
First, let's get some general information, by year. Understanding the min, max, and avg is a great place to start:
Calculate min, avg, max, stddev
SELECT toYear(timestamp) as year, round(min(avg_tone),3) AS min, round(avg(avg_tone),3) AS avg, round(max(avg_tone),3) AS max, round(stddevPop(avg_tone),3)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) as year, round(min(avg_tone),3) AS min, round(avg(avg_tone),3) AS avg, round(max(avg_tone),3) AS max, round(stddevPop(avg_tone),3)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
EOF
Looking at this, average tone of news articles are slightly negative, but over time, the range of values of tone has varied greatly. average tone and the standard deviation of the tone is staying fairly steady, meaning that the news, while staying mostly neutral, has more outliers of extemely positive and extremely negative articles.
Over these years, it would be interesting to see the top source domains:
find top occurring values in a set, topK
SELECT toYear(timestamp) as year, topK(3)(domainWithoutWWW(source_url))
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) as year, topK(3)(domainWithoutWWW(source_url))
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-01-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
EOF
Also interesting to investigate would be the domains that had the post positive and most negative article per year:
argMax, argMin - find the URLs with the most +/- articles
SELECT toYear(timestamp) as year, argMax(domainWithoutWWW(source_url), avg_tone), argMin(domainWithoutWWW(source_url), avg_tone)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) as year, argMax(domainWithoutWWW(source_url), avg_tone), argMin(domainWithoutWWW(source_url), avg_tone)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
EOF
The one that stands out is msn.com in 2017. It would be interesting to drill into more of the tone trends for that website.
Statistics for an entity in the set
SELECT toYear(timestamp) as year, round(min(avg_tone),3), round(avg(avg_tone),3), round(max(avg_tone),3), count()
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
AND equals(domainWithoutWWW(source_url),'msn.com')
GROUP BY year
ORDER BY year ASC
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) as year, round(min(avg_tone),3), round(avg(avg_tone),3), round(max(avg_tone),3), count()
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
AND equals(domainWithoutWWW(source_url),'msn.com')
GROUP BY year
ORDER BY year ASC
EOF
The fact that there are fewer articles in 2019 is most likely because the 2019 data set is not complete.
Get the month_year of the most positive and negative events by year
SELECT toYear(timestamp) as year, argMin(toMonth(timestamp), avg_tone), argMax(toMonth(timestamp), avg_tone)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toYear(timestamp) as year, argMin(toMonth(timestamp), avg_tone), argMax(toMonth(timestamp), avg_tone)
FROM sample.gdelt
WHERE (timestamp BETWEEN '2015-02-01 00:00:00' AND '2020-01-01 00:00:00')
GROUP BY year
ORDER BY year ASC
EOF
The most negative month in 2016 was March, which was likely due to US primary elections.
Updated about 2 months ago