Fastly transaction logs
Now that we’ve demonstrated just how easy it is to integrate your Fastly CDN logs with Hydrolix, it’s time to focus on some basic data exploration. As a reminder, Hydrolix supports native integration with Fastly’s real-time log streaming service. The complete configuration walkthrough using a Fastly extended log format is available here.
This short tutorial covers a few example queries that are designed to tell us a bit more about the Fastly log data that we are collecting. While Hydrolix has been leveraging Fastly technologies to increase website performance for years, we have only recently moved to using an extended log format. The new extended log takes the number of logged fields from the default of 14 (for a generic Fastly HTTPS streaming endpoint) to 64 - all in support of providing us with greater insight into our customer-facing infrastructure so that we can ensure the best possible experience.
The data that we will be querying is stored in the fastly.xlogs
table.
Defining the dataset
Let's start with understanding a bit more about the data that we are working with. The extended log format adds quite a few more columns so visual confirmation of the table structure is always useful. We can employ a simple DESCRIBE
command to take a closer look at the table definition.
DESCRIBE fastly.xlogs
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
DESCRIBE fastly.xlogs
EOF
cache_status String
client_ip String
content_type String
geo_city String
geo_continent_code String
geo_country_code String
geo_datacenter String
geo_region String
h2_stream_id String
host String
is_cacheable Nullable(UInt8)
is_h2 Nullable(UInt8)
is_h2_push Nullable(UInt8)
is_ipv6 Nullable(UInt8)
is_tls Nullable(UInt8)
origin_host String
protocol String
req_body_size Nullable(UInt32)
req_header_size Nullable(UInt32)
request String
request_accept_charset String
request_accept_content String
request_accept_encoding String
request_accept_language String
request_cache_control String
request_connection String
request_dnt String
request_forwarded String
request_referer String
request_user_agent String
request_via String
request_x_forwarded_for String
request_x_requested_with String
resp_body_size Nullable(UInt32)
resp_header_size Nullable(UInt32)
response_age Nullable(UInt32)
response_cache_control String
response_expires String
response_last_modified String
response_tsv String
service_id String
service_version String
socket_cwnd Nullable(UInt32)
socket_nexthop String
socket_ploss Nullable(Float64)
socket_tcpi_delta_retrans Nullable(UInt32)
socket_tcpi_last_data_sent Nullable(UInt32)
socket_tcpi_rcv_mss Nullable(UInt32)
socket_tcpi_rcv_rtt Nullable(UInt32)
socket_tcpi_rcv_space Nullable(UInt32)
socket_tcpi_rtt Nullable(UInt32)
socket_tcpi_rttvar Nullable(UInt32)
socket_tcpi_snd_mss Nullable(UInt32)
socket_tcpi_total_retrans Nullable(UInt32)
status String
time_elapsed Nullable(UInt32)
time_end Nullable(DateTime)
time_start DateTime
tls_client_cipher String
tls_client_cipher_sha String
tls_client_protocol String
tls_client_servername String
tls_client_tlsexts_sha String
url String
As previously mentioned, we recently moved to an extended log format for all of our Fastly real-time streaming logs. The following query will help to establish exactly when that cutover took place as well as providing us with the most recent record and overall row count. min()
and max()
aggregate functions against our selected datetimecolumn, time_start
, will provide us with the earliest and latest records for our data.
SELECT min(time_start), max(time_start), count()
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT min(time_start), max(time_start), count()
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
EOF
Find the top N requested urls per week
The url
column represents page requests which can contain query string parameters
and references to static assets.
We can clean up the urls at query time using cutQueryString(url)
to strip any query string params
and also filter out the static assets using match(string, pattern)
to exclude unwanted results. Finally, we can group byStartOfWeek(time_start)
and topK(N)(column)
function to calculate top values. A sub-query will do the filtering, cleaning, and time grouping.
SELECT week, topK(10)(path)
FROM (
SELECT toStartOfWeek(time_start) AS week, cutQueryString(url) AS path
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
AND match(url,'^.*\.(css$|js$|json$|jpg$|js$|png$|svg$|xml$)') != 1
GROUP BY week, path)
GROUP BY week
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT week, topK(10)(path)
FROM (
SELECT toStartOfWeek(time_start) AS week, cutQueryString(url) AS path
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
AND match(url,'^.*\.(css$|js$|json$|jpg$|js$|png$|svg$|xml$)') != 1
GROUP BY week, path)
GROUP BY week
EOF
Find the top 3 slowest urls per host
While it’s certainly useful to know what the top requested urls for the week are, data becomes significantly more actionable with the application of context. What would happen if our top requested urls were also our slowest? It goes without saying that they probably wouldn’t be the top N urls for long. As such, we may also want to keep an eye out for potentially problematic urls by monitoring response times in the event that we can proactively identify issues and resolve them before they become a problem.
This query is a bit different in that it looks at the full urls that have the highest time_elapsed
values (in milliseconds) since the start of the week. In this case, it’s useful to have full visibility into the entire url since we are looking to identify areas of concern from a latency perspective; hence, we won’t be using cutQueryString(url)
but instead using concat(host, url)
to gives us the full url path.
If you didn’t notice, we are using a relatively unknown clause, LIMIT BY
, which is unique to Clickhouse. It helps us limit the results returned in the subquery to 3 urls per host and is a great way to gain greater control of your data. The LIMIT n BY
expression clause selects the first n
rows for each distinct value of expressions. The key for LIMIT BY
can contain any number of expressions.
SELECT week, concat(host, url) as full_url, (time_elapsed/1000) as time_elapsed_msec
FROM (
SELECT toStartOfWeek(time_start) AS week, host, url, time_elapsed
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
GROUP BY week, host, url, time_elapsed
LIMIT 3 BY host)
GROUP BY week, full_url, time_elapsed_msec
ORDER BY time_elapsed_msec
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT week, concat(host, url) as full_url, (time_elapsed/1000) as time_elapsed_msec
FROM (
SELECT toStartOfWeek(time_start) AS week, host, url, time_elapsed
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
GROUP BY week, host, url, time_elapsed
LIMIT 3 BY host)
GROUP BY week, full_url, time_elapsed_msec
ORDER BY time_elapsed_msec
EOF
Identify the top error causing urls using HTTP 4xx/5xx response codes
Keeping with the theme of leveraging the data to better understand the health of our website, HTTP response codes offer an option for monitoring client and server-side errors. HTTP 4xx status codes typically represent client-side errors and HTTP 5xx status codes are reserved for cases where the server has encountered an error and is incapable of performing the request.
In this example, instead of setting the timeframe toStartOfWeek()
, we use interval 7 day
with now()
to provide a trailing time window and match(status, ‘^[45]\d\d’)
to pick up any status string result that starts with '4' or '5'.
SELECT time_start, status, host, url, count(), cache_status
FROM fastly.xlogs
WHERE (time_start >= now() - interval 7 day) and match(status, '^[45]\d\d')
GROUP BY time_start, host, url, status, cache_status
ORDER by count() DESC
LIMIT 50
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT time_start, status, host, url, count(), cache_status
FROM fastly.xlogs
WHERE (time_start >= now() - interval 7 day) and match(status, '^[45]\d\d')
GROUP BY time_start, host, url, status, cache_status
ORDER by count() DESC
LIMIT 50
EOF
Find out what % of requests go to docs.hydrolix.io
Previously, we found our topN urls but what percentage of that traffic is directed at our docs site?
The host
column represents which domain the traffic is targeting. It's a simple task at query time to compare a foreground data set to the full data set. We will use the countIf(column, condition)
which is a combinator aggregation function
on count()
. The condition will be based on startsWith(string, condition)
. This allows us to conditionally count a foreground subset of the data. The time will be grouped on toStartOfDay(datetime)
. As this returns a time component by default, we format the output using formatDateTime(datetime, format)
to remove the time.
SELECT
formatDateTime(toStartOfDay(time_start), '%D') as day,
countIf(startsWith(host,'docs')) / count() * 100 as %percent_docs
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
GROUP BY day
ORDER BY day
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT
formatDateTime(toStartOfDay(time_start), '%D') as day,
countIf(startsWith(host,'docs')) / count() * 100 as %percent_docs
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
GROUP BY day
ORDER BY day
EOF
Top 10 weekly visitor locations
Now that we have a pretty good handle on our top urls and are on the lookout for potential issues that may impact the user experience, it begs the question where are all these visitors coming from?
Everyone likes a good geolocation-based stat so let’s use a simple query to find out. Note that there are some null
values present in the dataset so we’ve added city != (null)
to avoid returning results that don’t provide us with a city.
SELECT toStartOfWeek(time_start) as week,
geo_city AS city, geo_country_code AS country, count() as count
FROM fastly.xlogs
WHERE ((time_start BETWEEN '2021-01-01 00:00:00' AND now()) AND city != '(null)')
GROUP BY week, city, country
ORDER BY count DESC
LIMIT 10
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT toStartOfWeek(time_start) as week,
geo_city AS city, geo_country_code AS country, count() as count
FROM fastly.xlogs
WHERE ((time_start BETWEEN '2021-01-01 00:00:00' AND now()) AND city != '(null)')
GROUP BY week, city, country
ORDER BY count DESC
LIMIT 10
EOF
While the query itself is simple, this may serve as a starting point for better understanding who your clients are, where they are coming from, and ultimately using the additional insight to ensure you are best serving your audience. Alternatively, if you are more of a security-minded individual, you could combine a geolocation-based query with uniq(client_ip)
, request_referer
, request_user_agent
, request_x_forwarded_for
and look for non-standard request types as an example.
Find the impact of response time vs cache status
The time_elapsed
column provides an indicator to the responsiveness of our website. In theory, this should be impacted by the cache_status
.
It’s worth mentioning that you could use the quantiles(l1, l2 ..)(column)
function to quickly grab multiple percentile quantiles on a single scan but we’ve chosen to use separate quantileExact(level)(expr)
functions for higher resolution and better clarity in the results formatting.
SELECT
toDate(time_start) AS day, host, cache_status,
(quantileExact(0.50)(time_elapsed)/1000) AS 50_percentile,
(quantileExact(0.75)(time_elapsed)/1000) AS 75_percentile,
(quantileExact(0.99)(time_elapsed)/1000) AS 99_percentile
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
GROUP BY day, host, cache_status
ORDER BY day, cache_status
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT
toDate(time_start) AS day, host, cache_status,
(quantileExact(0.50)(time_elapsed)/1000) AS 50_percentile,
(quantileExact(0.75)(time_elapsed)/1000) AS 75_percentile,
(quantileExact(0.99)(time_elapsed)/1000) AS 99_percentile
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
GROUP BY day, host, cache_status
ORDER BY day, cache_status
EOF
As expected, a cache MISS
will have a big impact on response times; although, the worst response times seem to occur with a cache ERROR
. Hey, that sounds like a great opportunity to possibly modify the top N query from above with the addition of a cache_status
filter on cache_status
ERROR
!
Monitor the cache error rate
Based on what we just discovered above with cache ERROR
, a high-level metric that helps to monitor overall cache error rate sounds like it would be of value.
Similar to the approach we used in the query to determine what percent of visitors are visiting docs.hydrolix.io versus try.hydrolix.io, we leverage countIf(cache_status = 'ERROR')
to count and increment only when cache_status
= 'ERROR'
.That value is then divided by the total number of requests to produce a percentage rate.
SELECT
toStartOfMonth(day) AS month,
sum(error) AS error,
sum(total) AS sumtotal,
round((error/sumtotal)*100, 2) AS error_rate
FROM (
SELECT toDate(time_start) AS day,
count(request) AS total,
countIf(cache_status = 'ERROR') AS error
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
GROUP BY day
ORDER BY day ASC)
GROUP BY month
ORDER BY month
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT
toStartOfMonth(day) AS month,
sum(error) AS error,
sum(total) AS sumtotal,
round((error/sumtotal)*100, 2) AS error_rate
FROM (
SELECT toDate(time_start) AS day,
count(request) AS total,
countIf(cache_status = 'ERROR') AS error
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
GROUP BY day
ORDER BY day ASC)
GROUP BY month
ORDER BY month
EOF
Monthly HTTP response code metrics
For our final example, we produce an extremely useful suite of monthly metrics based on HTTP response code status. countIf
is used in conjunction with (match(haystack, pattern)
to produce selective and running totals based on response codes in the subquery. We subsequently take those totals, summarize them using sum()
, and produce monthly totals and rates
SELECT
toStartOfMonth(day) AS month,
sum(reqtotal) AS total_requests,
sum(1XX_2XX) AS success_total,
round((success_total/total_requests)*100, 2) AS success_rate,
sum(3XX) AS redirect_total,
round((redirect_total/total_requests)*100, 2) AS redirect_rate,
sum(4XX_5XX) AS error_total,
round((error_total/total_requests)*100, 2) AS error_rate
FROM (
SELECT toDate(time_start) AS day,
count(request) AS reqtotal,
countIf(match(status, '^[12]\d\d')) AS 1XX_2XX,
countIf(match(status, '^3\d\d')) AS 3XX,
countIf(match(status, '^[45]\d\d')) AS 4XX_5XX
FROM fastly.xlogs
WHERE (time_start BETWEEN day AND now())
GROUP BY day
ORDER BY day ASC)
GROUP BY month
ORDER BY month
curl --data-binary @- https://try3.hydrolix.live/query <<EOF
SELECT
toStartOfMonth(day) AS month,
sum(reqtotal) AS total_requests,
sum(1XX_2XX) AS success_total,
round((success_total/total_requests)*100, 2) AS success_rate,
sum(3XX) AS redirect_total,
round((redirect_total/total_requests)*100, 2) AS redirect_rate,
sum(4XX_5XX) AS error_total,
round((error_total/total_requests)*100, 2) AS error_rate
FROM (
SELECT toDate(time_start) AS day,
count(request) AS reqtotal,
countIf(match(status, '^[12]\d\d')) AS 1XX_2XX,
countIf(match(status, '^3\d\d')) AS 3XX,
countIf(match(status, '^[45]\d\d')) AS 4XX_5XX
FROM fastly.xlogs
WHERE (time_start BETWEEN day AND now())
GROUP BY day
ORDER BY day ASC)
GROUP BY month
ORDER BY month
EOF
Stating the obvious here but perhaps there's a better way to visualize results of this query. While the data is undoubtedly useful in the right context, deriving value from metrics-oriented datasets can be difficult without the right visualization strategy. In our next segment, Building dashboards with Grafana, we'll leverage Grafana to help us with that task.
Updated 8 months ago