Working with Fastly Logging
Hydrolix provides native integration for Fastly log storage. Full integration intructions are available here.
Once integrated, you can query and analyse the logs in real-time using Hydrolix. In fact, we’ve done just that with this web site as it is fronted by Fastly. The data is stored in the
Checking date time ranges & columns
We switched on log storage mid October 2020, but a quick check can clarify date time ranges and that the data is flowing.
SELECT min(timestamp), max(timestamp), count() FROM fastly.logs WHERE (timestamp BETWEEN '2020-01-01 00:00:00' AND now())
2020-10-11 01:03:59 and
now. We can also
Describe the table to check what the column names are.
cache_status String geo_city String geo_country_code String host String is_tls UInt64 request String request_accept_charset String request_accept_language String request_referer String request_user_agent String time_elapsed UInt64 timestamp DateTime url String
Find the top N requested urls per week
url column represents page requests which can contain
query string parameters and also includes references to static assets.
We can clean up the urls at query time, using
cutQueryString(url) to strip any
query string params and also filter out the static assets using
match(string, pattern) to exclude unwanted results. Finally we can group
topK(N)(column) function to calculate top values. A sub-query will do the filtering, cleaning and time grouping.
SELECT week, topK(10)(path) FROM ( SELECT toStartOfWeek(timestamp) AS week, cutQueryString(url) AS path FROM fastly.logs WHERE (timestamp BETWEEN '2020-10-11 00:00:00' AND now()) AND match(url,'^.*\.(css$|json$|js$|svg$|png$|xml|jpg$)') != 1 GROUP BY week, path) GROUP BY week
Find the % doc request daily
Our website hosts both our documentation and live query API that executes these SQL examples. The API routes based on a different sub domain (try.hydrolix.io) to our Hydrolix servers.
Previously we found our topN docs, but how much percentage of the traffic is that?
host column represents which domain the traffic is targetting. It’s a simple task at query time to compare a foreground data set to the full data set. We will use the
countIf(column, condition) which is a
combinator aggregation function on
count(). The condition will be based on
startsWith(string, condition). This allows us to conditionally count a foreground subset of the data. The time will be grouped on
toStartOfDay(datetime). As this returns a time component by default, we format the output using
formatDateTime(datetime, format) to remove the time.
SELECT formatDateTime(toStartOfDay(timestamp), '%D') as day, countIf(startsWith(host,'docs')) / count() * 100 as %percent_docs FROM fastly.logs WHERE (timestamp BETWEEN '2020-10-11 00:00:00' AND now()) GROUP BY day ORDER BY day
Ok. Get clicking on these live SQL examples and influence the results!
Find the impact of repsonse time v cache status
time_elapsed column provides an indicator to the responsiveness of our website. In theory, this should be impacted by the
cache_status. Lets check.
quantiles(l1, l2 ..)(column) function can quickly grab multiple percentile quantiles on a single scan. We will check
quantiles(0.5, 0.75, 0.99)(time_elasped) and group daily by
SELECT formatDateTime(toStartOfDay(timestamp), '%D') AS day, cache_status, quantiles(0.5, 0.75, 0.99)(time_elapsed) FROM fastly.logs WHERE (timestamp BETWEEN '2020-10-11 00:00:00' AND now()) GROUP BY day, cache_status ORDER BY day, cache_status
As expected, cache
MISS have a big impact on response times. Although the worse response times are on
We recommend monitoring
ERROR, possibly modifying the top N query with an addition
cache_status filter for example. After doing so, we discover it was the query API endpoint. Removing the
cutQueryString function revealed it was due to malformed queries string params.
Yay, for data driven analysis!