Fastly transaction logs

Now that we’ve demonstrated just how easy it is to integrate your Fastly CDN logs with Hydrolix, it’s time to focus on some basic data exploration. As a reminder, Hydrolix supports native integration with Fastly’s real-time log streaming service. The complete configuration walkthrough using a Fastly extended log format is available here.

This short tutorial covers a few example queries that are designed to tell us a bit more about the Fastly log data that we are collecting. While Hydrolix has been leveraging Fastly technologies to increase website performance for years, we have only recently moved to using an extended log format. The new extended log takes the number of logged fields from the default of 14 (for a generic Fastly HTTPS streaming endpoint) to 64, all in support of providing us with greater insight into our customer-facing infrastructure so that we can ensure the best possible experience.

The data that we will be querying is stored in the fastly.xlogs table.

Defining the dataset

Let's start by understanding a bit more about the data we are working with. The extended log format adds quite a few more columns, so visual confirmation of the table structure is always useful. We can employ a simple DESCRIBE command to take a closer look at the table definition.

DESCRIBE fastly.xlogs
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
DESCRIBE fastly.xlogs
EOF
cache_status                String
client_ip                   String
content_type                String
geo_city                    String
geo_continent_code          String
geo_country_code            String
geo_datacenter              String
geo_region                  String
h2_stream_id                String
host                        String
is_cacheable                Nullable(UInt8)
is_h2                       Nullable(UInt8)
is_h2_push                  Nullable(UInt8)
is_ipv6                     Nullable(UInt8)
is_tls                      Nullable(UInt8)
origin_host                 String
protocol                    String
req_body_size               Nullable(UInt32)
req_header_size             Nullable(UInt32)
request                     String
request_accept_charset      String
request_accept_content      String
request_accept_encoding     String
request_accept_language     String
request_cache_control       String
request_connection          String
request_dnt                 String
request_forwarded           String
request_referer             String
request_user_agent          String
request_via                 String
request_x_forwarded_for     String
request_x_requested_with    String
resp_body_size              Nullable(UInt32)
resp_header_size            Nullable(UInt32)
response_age                Nullable(UInt32)
response_cache_control      String
response_expires            String
response_last_modified      String
response_tsv                String
service_id                  String
service_version             String
socket_cwnd                 Nullable(UInt32)
socket_nexthop              String
socket_ploss                Nullable(Float64)
socket_tcpi_delta_retrans   Nullable(UInt32)
socket_tcpi_last_data_sent  Nullable(UInt32)
socket_tcpi_rcv_mss         Nullable(UInt32)
socket_tcpi_rcv_rtt         Nullable(UInt32)
socket_tcpi_rcv_space       Nullable(UInt32)
socket_tcpi_rtt             Nullable(UInt32)
socket_tcpi_rttvar          Nullable(UInt32)
socket_tcpi_snd_mss         Nullable(UInt32)
socket_tcpi_total_retrans   Nullable(UInt32)
status                      String
time_elapsed                Nullable(UInt32)
time_end                    Nullable(DateTime)
time_start                  DateTime
tls_client_cipher           String
tls_client_cipher_sha       String
tls_client_protocol         String
tls_client_servername       String
tls_client_tlsexts_sha      String
url                         String

As previously mentioned, we recently moved to an extended log format for all of our Fastly real-time streaming logs. The following query will help to establish exactly when that cutover took place, as well as provide us with the most recent record and an overall row count. The min() and max() aggregate functions against our selected datetime column, time_start, will give us the earliest and latest records in our data.

SELECT min(time_start), max(time_start), count() 
FROM fastly.xlogs 
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT min(time_start), max(time_start), count() 
FROM fastly.xlogs 
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
EOF

Find the top N requested urls per week

The url column represents page requests, which can contain query string parameters and references to static assets.

We can clean up the urls at query time using cutQueryString(url) to strip any query string params, and filter out static assets using match(string, pattern) to exclude unwanted results. Finally, we can group by toStartOfWeek(time_start) and use the topK(N)(column) function to calculate the top values. A sub-query does the filtering, cleaning, and time grouping.

SELECT week, topK(10)(path) 
FROM (
   SELECT toStartOfWeek(time_start) AS week, cutQueryString(url) AS path 
   FROM fastly.xlogs  
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   AND match(url,'^.*\.(css$|js$|json$|jpg$|png$|svg$|xml$)') != 1 
   GROUP BY week, path) 
GROUP BY week
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT week, topK(10)(path) 
FROM (
   SELECT toStartOfWeek(time_start) AS week, cutQueryString(url) AS path 
   FROM fastly.xlogs  
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   AND match(url,'^.*\.(css$|js$|json$|jpg$|png$|svg$|xml$)') != 1 
   GROUP BY week, path) 
GROUP BY week 
EOF

Find the top 3 slowest urls per host

While it’s certainly useful to know what the top requested urls for the week are, data becomes significantly more actionable with the application of context. What would happen if our top requested urls were also our slowest? It goes without saying that they probably wouldn’t be the top N urls for long. As such, we may also want to keep an eye out for potentially problematic urls by monitoring response times in the event that we can proactively identify issues and resolve them before they become a problem.

This query is a bit different in that it looks at the full urls with the highest time_elapsed values (converted to milliseconds in the output) since the start of the week. In this case, it's useful to have full visibility into the entire url since we are looking to identify areas of concern from a latency perspective; hence, instead of cutQueryString(url) we use concat(host, url) to give us the full url path.

You may have noticed that we are using a relatively little-known clause, LIMIT BY, which is unique to ClickHouse. It limits the results returned in the subquery to 3 urls per host and is a great way to gain greater control over your data. The LIMIT n BY expressions clause selects the first n rows for each distinct value of the expressions, and the key for LIMIT BY can contain any number of expressions.
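
To see the clause in isolation, here is a minimal sketch that runs against ClickHouse's built-in numbers() table function rather than our Fastly data; it returns the two largest values for each distinct key:

-- Toy data: numbers 0-11 bucketed into three keys.
-- After ordering, LIMIT 2 BY key keeps the first two rows per key.
SELECT number % 3 AS key, number AS value 
FROM numbers(12) 
ORDER BY key ASC, value DESC 
LIMIT 2 BY key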

SELECT week, concat(host, url) as full_url, (time_elapsed/1000) as time_elapsed_msec 
FROM (
   SELECT toStartOfWeek(time_start) AS week, host, url, time_elapsed 
   FROM fastly.xlogs 
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   GROUP BY week, host, url, time_elapsed 
   ORDER BY time_elapsed DESC 
   LIMIT 3 BY host) 
GROUP BY week, full_url, time_elapsed_msec 
ORDER BY time_elapsed_msec DESC
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT week, concat(host, url) as full_url, (time_elapsed/1000) as time_elapsed_msec 
FROM (
   SELECT toStartOfWeek(time_start) AS week, host, url, time_elapsed 
   FROM fastly.xlogs 
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   GROUP BY week, host, url, time_elapsed 
   ORDER BY time_elapsed DESC 
   LIMIT 3 BY host) 
GROUP BY week, full_url, time_elapsed_msec 
ORDER BY time_elapsed_msec DESC
EOF

Identify the top error-causing urls using HTTP 4xx/5xx response codes

Keeping with the theme of leveraging the data to better understand the health of our website, HTTP response codes offer an option for monitoring client and server-side errors. HTTP 4xx status codes typically represent client-side errors, while HTTP 5xx status codes are reserved for cases where the server has encountered an error and is incapable of performing the request.

In this example, instead of anchoring the timeframe with toStartOfWeek(), we use interval 7 day with now() to provide a trailing time window, and match(status, '^[45]\d\d') to pick up any status string that starts with '4' or '5'.

SELECT time_start, status, host, url, count(), cache_status 
FROM fastly.xlogs 
WHERE (time_start >= now() - interval 7 day) and match(status, '^[45]\d\d') 
GROUP BY time_start, host, url, status, cache_status 
ORDER by count() DESC 
LIMIT 50
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT time_start, status, host, url, count(), cache_status 
FROM fastly.xlogs 
WHERE (time_start >= now() - interval 7 day) and match(status, '^[45]\d\d') 
GROUP BY time_start, host, url, status, cache_status 
ORDER by count() DESC 
LIMIT 50
EOF

Find out what % of requests go to docs.hydrolix.io

Our website hosts both our documentation (docs.hydrolix.io) and the live query API (try.hydrolix.io) that executes these SQL examples. Previously, we found our top N urls, but what percentage of that traffic is directed at our docs site?

The host column represents which domain the traffic is targeting. It's a simple task at query time to compare a foreground data set to the full data set. We will use countIf(condition), a combinator aggregate function built on count(). The condition will be based on startsWith(string, prefix). This allows us to conditionally count a foreground subset of the data. Time will be grouped on toStartOfDay(datetime). As this returns a time component by default, we format the output using formatDateTime(datetime, format) to remove it.

SELECT 
formatDateTime(toStartOfDay(time_start), '%D') as day, 
countIf(startsWith(host,'docs')) / count() * 100 as percent_docs 
FROM fastly.xlogs  
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
GROUP BY day 
ORDER BY day
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT 
formatDateTime(toStartOfDay(time_start), '%D') as day, 
countIf(startsWith(host,'docs')) / count() * 100 as percent_docs 
FROM fastly.xlogs  
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
GROUP BY day 
ORDER BY day
EOF

Top 10 weekly visitor locations

Now that we have a pretty good handle on our top urls and are on the lookout for potential issues that may impact the user experience, an obvious question remains: where are all these visitors coming from?

Everyone likes a good geolocation-based stat, so let's use a simple query to find out. Note that there are some null values present in the dataset, so we've added city != '(null)' to avoid returning results that don't provide us with a city.

SELECT toStartOfWeek(time_start) as week, 
geo_city AS city, geo_country_code AS country, count() as count 
FROM fastly.xlogs 
WHERE ((time_start BETWEEN '2021-01-01 00:00:00' AND now()) AND city != '(null)') 
GROUP BY week, city, country 
ORDER BY count DESC 
LIMIT 10
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toStartOfWeek(time_start) as week, 
geo_city AS city, geo_country_code AS country, count() as count 
FROM fastly.xlogs 
WHERE ((time_start BETWEEN '2021-01-01 00:00:00' AND now()) AND city != '(null)') 
GROUP BY week, city, country 
ORDER BY count DESC 
LIMIT 10 
EOF

While the query itself is simple, this may serve as a starting point for better understanding who your clients are, where they are coming from, and ultimately using the additional insight to ensure you are best serving your audience. Alternatively, if you are more of a security-minded individual, you could combine a geolocation-based query with uniq(client_ip), request_referer, request_user_agent, request_x_forwarded_for and look for non-standard request types as an example.
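
As a rough starting point, here is a hedged sketch of what such a security-oriented query might look like; the empty user agent heuristic and the seven-day window are illustrative assumptions, not a vetted detection rule:

-- Illustrative only: surface locations sending requests with no user agent.
SELECT geo_country_code AS country, geo_city AS city, 
uniq(client_ip) AS unique_ips, count() AS requests 
FROM fastly.xlogs 
WHERE (time_start >= now() - interval 7 day) 
AND request_user_agent = '' 
GROUP BY country, city 
ORDER BY requests DESC 
LIMIT 20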

Find the impact of response time vs cache status

The time_elapsed column provides an indicator of the responsiveness of our website. In theory, it should be impacted by the cache_status.

It's worth mentioning that you could use the quantiles(level1, level2, ...)(expr) function to quickly grab multiple percentile quantiles in a single scan, but we've chosen to use separate quantileExact(level)(expr) functions for higher resolution and better clarity in the results formatting.
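
For reference, a sketch of that single-scan alternative is below; note that quantiles() returns all of the requested levels packed into a single array column, which is exactly the formatting trade-off that pushed us toward the separate quantileExact() calls in the query that follows:

-- Single scan: all three levels come back as one Array(Float64) column.
SELECT 
toDate(time_start) AS day, host, cache_status, 
arrayMap(x -> x/1000, quantiles(0.50, 0.75, 0.99)(time_elapsed)) AS percentiles 
FROM fastly.xlogs 
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
GROUP BY day, host, cache_status 
ORDER BY day, cache_status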

SELECT 
toDate(time_start) AS day, host, cache_status, 
(quantileExact(0.50)(time_elapsed)/1000) AS percentile_50, 
(quantileExact(0.75)(time_elapsed)/1000) AS percentile_75,
(quantileExact(0.99)(time_elapsed)/1000) AS percentile_99 
FROM fastly.xlogs 
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
GROUP BY day, host, cache_status 
ORDER BY day, cache_status
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT 
toDate(time_start) AS day, host, cache_status, 
(quantileExact(0.50)(time_elapsed)/1000) AS percentile_50, 
(quantileExact(0.75)(time_elapsed)/1000) AS percentile_75,
(quantileExact(0.99)(time_elapsed)/1000) AS percentile_99 
FROM fastly.xlogs 
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
GROUP BY day, host, cache_status 
ORDER BY day, cache_status 
EOF

As expected, a cache MISS has a big impact on response times, although the worst response times seem to occur with a cache ERROR. Hey, that sounds like a great opportunity to modify the top N query from above with the addition of a filter on cache_status = 'ERROR', as sketched below.
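
Here is a sketch of that tweak: the earlier top N query with the cache_status filter added to the subquery, so topK(10) only considers requests that produced a cache ERROR:

-- Top 10 error-producing paths per week (filter added to the earlier query).
SELECT week, topK(10)(path) 
FROM (
   SELECT toStartOfWeek(time_start) AS week, cutQueryString(url) AS path 
   FROM fastly.xlogs 
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   AND cache_status = 'ERROR' 
   GROUP BY week, path) 
GROUP BY week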

Monitor the cache error rate

Based on what we just discovered above with cache ERROR, a high-level metric that helps to monitor the overall cache error rate sounds like it would be of value.

Similar to the approach we used in the query to determine what percentage of visitors hit docs.hydrolix.io versus try.hydrolix.io, we leverage countIf(cache_status = 'ERROR') to count only the rows where cache_status = 'ERROR'. That value is then divided by the total number of requests to produce a percentage rate.

SELECT 
toStartOfMonth(day) AS month, 
sum(error) AS error, 
sum(total) AS sumtotal, 
round((error/sumtotal)*100, 2) AS error_rate 
FROM (
   SELECT toDate(time_start) AS day, 
   count(request) AS total, 
   countIf(cache_status = 'ERROR') AS error 
   FROM fastly.xlogs 
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   GROUP BY day 
   ORDER BY day ASC) 
GROUP BY month 
ORDER BY month
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT 
toStartOfMonth(day) AS month, 
sum(error) AS error, 
sum(total) AS sumtotal, 
round((error/sumtotal)*100, 2) AS error_rate 
FROM (
   SELECT toDate(time_start) AS day, 
   count(request) AS total, 
   countIf(cache_status = 'ERROR') AS error 
   FROM fastly.xlogs 
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   GROUP BY day 
   ORDER BY day ASC) 
GROUP BY month 
ORDER BY month
EOF

Monthly HTTP response code metrics

For our final example, we produce an extremely useful suite of monthly metrics based on HTTP response code status. countIf is used in conjunction with match(haystack, pattern) to produce selective and running totals based on response codes in the subquery. We subsequently take those totals, summarize them using sum(), and produce monthly totals and rates.

SELECT 
toStartOfMonth(day) AS month, 
sum(reqtotal) AS total_requests, 
sum(resp_1xx_2xx) AS success_total, 
round((success_total/total_requests)*100, 2) AS success_rate, 
sum(resp_3xx) AS redirect_total, 
round((redirect_total/total_requests)*100, 2) AS redirect_rate, 
sum(resp_4xx_5xx) AS error_total, 
round((error_total/total_requests)*100, 2) AS error_rate 
FROM (
   SELECT toDate(time_start) AS day, 
   count(request) AS reqtotal, 
   countIf(match(status, '^[12]\d\d')) AS resp_1xx_2xx, 
   countIf(match(status, '^3\d\d')) AS resp_3xx, 
   countIf(match(status, '^[45]\d\d')) AS resp_4xx_5xx 
   FROM fastly.xlogs 
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   GROUP BY day 
   ORDER BY day ASC) 
GROUP BY month 
ORDER BY month
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT 
toStartOfMonth(day) AS month, 
sum(reqtotal) AS total_requests, 
sum(resp_1xx_2xx) AS success_total, 
round((success_total/total_requests)*100, 2) AS success_rate, 
sum(resp_3xx) AS redirect_total, 
round((redirect_total/total_requests)*100, 2) AS redirect_rate, 
sum(resp_4xx_5xx) AS error_total, 
round((error_total/total_requests)*100, 2) AS error_rate 
FROM (
   SELECT toDate(time_start) AS day, 
   count(request) AS reqtotal, 
   countIf(match(status, '^[12]\d\d')) AS resp_1xx_2xx, 
   countIf(match(status, '^3\d\d')) AS resp_3xx, 
   countIf(match(status, '^[45]\d\d')) AS resp_4xx_5xx 
   FROM fastly.xlogs 
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   GROUP BY day 
   ORDER BY day ASC) 
GROUP BY month 
ORDER BY month
EOF

Stating the obvious here, but perhaps there's a better way to visualize the results of this query. While the data is undoubtedly useful in the right context, deriving value from metrics-oriented datasets can be difficult without the right visualization strategy. In our next segment, Building dashboards with Grafana, we'll leverage Grafana to help us with that task.