Statistical Aggregations
Statistical aggregations can reveal a lot about a data set. This tutorial uses the GDELT data set and focuses on the avg_tone column. According to the GDELT documentation, avg_tone 'is the average “tone” of all documents containing one or more mentions of this event during the 15 minute update in which it was first seen. The score ranges from -100 (extremely negative) to +100 (extremely positive). Common values range between -10 and +10, with 0 indicating neutral.'
First, let's get some general statistics by year. The min, max, and avg are a good place to start:
Calculate min, avg, max, stddev
Looking at the results, the average tone of news articles is slightly negative, but the range of tone values has varied greatly over time. The average tone and the standard deviation of the tone stay fairly steady, meaning that the news, while remaining mostly neutral, has more outliers in the form of extremely positive and extremely negative articles.
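As a rough sketch of what these per-year aggregates compute, the following Python groups tone values by year and derives the min, avg, max, and standard deviation for each group. The rows are invented for illustration and are not GDELT data, the function name is hypothetical, and the use of the population standard deviation is an assumption, since a given SQL engine's stddev may use the sample formula instead.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (year, avg_tone) rows, invented for illustration; not real GDELT data.
rows = [
    (2016, -2.0), (2016, 4.0), (2016, -5.0),
    (2017, -1.0), (2017, 1.0),
]

def tone_stats_by_year(rows):
    """Group avg_tone values by year and return min, avg, max, and stddev per year."""
    by_year = defaultdict(list)
    for year, tone in rows:
        by_year[year].append(tone)
    return {
        year: {
            "min": min(tones),
            "avg": mean(tones),
            "max": max(tones),
            "stddev": pstdev(tones),  # population standard deviation (an assumption)
        }
        for year, tones in sorted(by_year.items())
    }

stats = tone_stats_by_year(rows)
for year, year_stats in stats.items():
    print(year, year_stats)
```

A steady avg alongside a widening gap between min and max is the pattern described above: the center of the distribution stays put while the extremes move.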
Over these years, it would be interesting to see the top source domains:
Find top occurring values in a set, topK
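To make the idea concrete, here is a minimal Python sketch of a top-K computation: count how often each value occurs and keep the K most frequent. The domain names are placeholders invented for illustration, and `top_k` is a hypothetical helper. This version counts exactly, whereas a database's built-in topK may return an approximate answer to save memory.

```python
from collections import Counter

# Hypothetical source domains, one entry per article; invented for illustration.
domains = [
    "a.example", "b.example", "a.example",
    "c.example", "a.example", "b.example",
]

def top_k(values, k):
    """Return the k most frequent values, most frequent first (exact counts)."""
    return [value for value, _count in Counter(values).most_common(k)]

top_two = top_k(domains, 2)
print(top_two)
```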
It would also be interesting to investigate which domains had the most positive and most negative articles each year:
argMax, argMin - find the URLs with the most +/- articles
The one that stands out is msn.com in 2017. It would be interesting to drill into more of the tone trends for that website.
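For readers new to argMax and argMin, the sketch below shows the idea in Python: rather than returning the extreme tone itself, it returns the URL of the row where that extreme occurs, grouped by year. The rows, URLs, and the function name are hypothetical and serve only to illustrate the semantics.

```python
from collections import defaultdict

# Hypothetical (year, url, avg_tone) rows, invented for illustration.
articles = [
    (2016, "https://a.example/1", 3.5),
    (2016, "https://b.example/2", -7.0),
    (2016, "https://c.example/3", 0.5),
    (2017, "https://a.example/4", -1.5),
    (2017, "https://b.example/5", 6.0),
]

def extreme_urls_by_year(articles):
    """For each year, return the URL with the highest and the lowest avg_tone."""
    by_year = defaultdict(list)
    for year, url, tone in articles:
        by_year[year].append((url, tone))
    result = {}
    for year, items in sorted(by_year.items()):
        # max/min compare on tone but the URL is what gets returned.
        most_positive = max(items, key=lambda item: item[1])
        most_negative = min(items, key=lambda item: item[1])
        result[year] = {"arg_max": most_positive[0], "arg_min": most_negative[0]}
    return result

extremes = extreme_urls_by_year(articles)
for year, urls in extremes.items():
    print(year, urls)
```

The same pattern applies to any column paired with the tone: swapping the URL for a month value yields the month in which the most positive or most negative event occurred.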
Statistics for an entity in the set
The fact that there are fewer articles in 2019 is most likely because the 2019 data set is not complete.
Get the month_year of the most positive and negative events by year
The most negative month in 2016 was March, which was likely due to US primary elections.