Working with High Cardinality Data

In this tutorial, we will explore a high cardinality data using the BGP public data set of the University of Oregon Route Views Archive Project.

Not reflective of actual performance: This data set is quite large and the compute that it is running on has been sized down for demonstration purposes. If you are interested in exploring BGP data with more compute power, please contact Hydrolix Sales.

BGP, or Border Gateway Protocol RFC4271, is the protocol that makes the Internet work. It does this by enabling data routing on the Internet. The Internet is a network of networks; it’s broken up into hundreds of thousands of smaller networks known as autonomous systems (AS). Each of these networks is essentially a large pool of routers run by a single organization. Autonomous system numbers, or ASNs, are an easy way to identify and configure networks and routing.

  • 16-bit ASN – 0 to 65535
  • 32-bit ASN – 65536 to 4294967285


BGP Autonomous System Path

The BGP AS path is a well-known, mandatory attribute, which means that it is present for all prefixes exchanged between BGP neighbors. When a BGP router sends out an update to a neighbor in a different autonomous system (i.e., an external or eBGP neighbor), it adds its own AS number to the front (left side) of the AS path. So the AS path lists all the ASes that need to be traversed to reach the location where the prefix that the path is attached to is advertised from. As such, a traceroute should encounter those same ASes.

For example - The AS_PATH attribute AS777, AS3232, AS9843247, AS23242


In the following example, the AS Path is 22652 9002 57717 204655 208410 57717 6939 39737 62380 6663 3257 3356 33891 57717 204655 208410

BGP4MP,12/01/19 00:00:52,A,,22652,,22652 9002 57717 204655 208410 57717 6939 39737 62380 6663 3257 3356 33891 57717 204655 208410,IGP,

University of Oregon Route Views Project

The University’s Route Views project was originally conceived as a tool for Internet operators to obtain real-time BGP information about the global routing system from the perspectives of several different backbones and locations around the Internet.

The RouteViews project collected 20 years of quagga bgpd data from The original data available at is available as Quagga BGP RIB files. The RouteViews project makes full routing information base (RIB) dumps of data available every 2 hours (UTC time), and provides updates every 15 minutes. These files are compressed in .bz2 format. Data for each monitor we display can be found on the RouteViews site, in the location shown by the table at the bottom of this page.

Extracting the data files

The BGP data consists of 91,305,937,190 Records - min(timestamp) 2001-12-25 13:52:40 and max(timestamp) 2020-04-01 00:00:46. The original data size was not available, however this data GZIP’ed was 726 GB, and in HDX format it is 250 GB.

To review raw BGP data, The data first needs to be parsed. bgpdump was used for this, which provides human-readable data from the raw BGP information.

Installing and using BGPdump

Find compiled versions of bgpdump for OSX or Ubuntu Linux.

bgpdump -m outputs data in the following column order:

  • BGP Protocol
  • timestamp (in epoch format)
  • W/A/B (withdrawal/announcement/routing table)
  • Peer IP (address of the monitor)
  • Peer ASN (ASN of the monitor)
  • Prefix
  • ASPath
  • Origin Protocol (typically always IGP)
  • IX (host route added by Hydrolix)

The following command was used to extract and ingest the data:

$ bgpdump -m rib.20140801.0000.bz2
2014-08-07 13:49:42 [info] logging to syslog

The output looks like this:

TABLE_DUMP2|1406851200|B||293||293 15169|IGP||0|0||NAG||
TABLE_DUMP2|1406851200|B||3257||3257 15169|IGP||0|10|3257:8012 3257:30016 3257:50001 3257:54900 3257:54901|NAG||

The Ingest Transform Schema and Query View Schema are provided for reference.

count by Peer_IP address for Dec 1st 2019

SELECT Peer_IP, count() 
FROM sample.bgp 
WHERE (timestamp BETWEEN '2019-12-01 00:00:00' AND '2019-12-02 00:00:00') GROUP BY Peer_IP;

count by Peer_ASN for Dec 1st 2019

SELECT Peer_ASN, count() 
FROM sample.bgp 
WHERE (timestamp BETWEEN '2019-12-01 00:00:00' AND '2019-12-02 00:00:00') GROUP BY Peer_ASN;