Useful Data Tools & Techniques

While designing, building, and testing the Hydrolix platform, we've had to work with data in many shapes and formats. Along the way, we've collected tools and techniques for manipulating data and inspecting the systems that handle it.

External Tools

Below is a collection of tools we've found helpful on that journey. They are not developed by us, nor do we have any relationship with the companies or people who developed them; we simply hope you will find them as useful as we have. We will keep adding to this list.

Tool              Use                    Link
jq                JSON Manipulation      https://stedolan.github.io/jq/
eBay tsv-utils    CSV/TSV Manipulation   https://github.com/eBay/tsv-utils
RE2               Regex Testing          https://regoio.herokuapp.com/
Regexper          Regex "Drawing"        https://regexper.com/

Go DateTime Tools

Hydrolix uses Go's time parsing when ingesting datetimes into the system. We've found the following resources useful for checking datetime formats.

Resource         Link
Go time docs     https://golang.org/pkg/time/
Go Playground    https://play.golang.org/

The Go Playground is especially helpful for checking how a datetime format is parsed. If you're having trouble with a format, we recommend using the snippet below to confirm it parses correctly.

In the Go Playground editor, add the following:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Go layouts are written in terms of the reference time:
	// Mon Jan 2 15:04:05 MST 2006.
	layout := "2006-01-02T15:04:05.000+0000"
	data := "2020-11-09T00:00:00.000+0000"
	t, err := time.Parse(layout, data)
	fmt.Println(t, err)
}

Replace layout with the format you've defined for your timestamp, and data with an example datetime you wish to parse. After clicking Run, the output will tell you whether the value parses.
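
When the layout and the input don't agree, time.Parse returns the zero time and an error describing where parsing failed. Here's a minimal sketch of the failure case, with a deliberately mismatched layout:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Deliberate mismatch: the layout uses a space where the data has a "T".
	layout := "2006-01-02 15:04:05.000+0000"
	data := "2020-11-09T00:00:00.000+0000"
	t, err := time.Parse(layout, data)
	// Prints the zero time (0001-01-01 00:00:00 +0000 UTC) and an error
	// explaining which part of the input failed to match the layout.
	fmt.Println(t, err)
}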

Catch-All Transform

Sometimes you won't know the format of your incoming data until you actually see it streaming into your Hydrolix cluster. This makes it very difficult to write a transform for that data.

One way around this problem is to dump incoming data into a table for inspection. Then you can write your transform based on what you find. Assuming the data you're receiving is in JSON format, here's a transform that will do that for you, applying a timestamp to each line:

{
    "name": "debug",
    "settings": {
        "is_default": true,
        "output_columns": [
            {
                "name": "auto_receive_time",
                "datatype": {
                    "type": "datetime",
                    "format": "2006-01-02T15:04:05.999Z",
                    "source": {
                        "from_automatic_value": "receive_time"
                    },
                    "virtual": true,
                    "resolution": "sec",
                    "primary": true
                }
            },
            {
                "name": "catchall",
                "datatype": {
                    "type": "string",
                    "catch_all": true
                }
            }
        ],
        "compression": "none",
        "format_details": {
            "flattening": {
                "active": false
            }
        }
    },
    "type": "json"
}

Another, more complex catch-all transform is available on the Test and Validate Transforms page.
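
Since transforms are just JSON documents, it can be worth sanity-checking one before you upload it. Below is a minimal Go sketch that confirms the document parses and spot-checks a few expected top-level fields; the transform.json filename is just an assumption for illustration:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

func main() {
	// Hypothetical filename; adjust to wherever you keep the transform.
	raw, err := os.ReadFile("transform.json")
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	// Decode into a generic map to confirm the document is valid JSON.
	var doc map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		fmt.Println("invalid JSON:", err)
		return
	}
	// Spot-check a few top-level fields the transform should carry.
	for _, key := range []string{"name", "type", "settings"} {
		if _, ok := doc[key]; !ok {
			fmt.Println("missing field:", key)
		}
	}
	fmt.Println("transform looks well-formed")
}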

Anonymizing Datasets

The following technique can be used to anonymize customer data while preserving its "shape": the length of each value and the cardinality of each column stay roughly the same.

In the example below, we anonymize two fields: name and age. The SHA256 function maps every int and string to a new value deterministically, so identical inputs always produce identical outputs.

SELECT
    primary,
    substring(base64Encode(SHA256(toString(name))), 1, length(toString(name))) AS name,
    substring(base64Encode(SHA256(toString(age))), 1, length(toString(age))) AS age
FROM VALUES('primary DateTime, name String, age UInt64', (now(), 'bev', 36))

Query id: 12b4cede-aca7-4cd1-8e95-f5880c7dc4bf

┌─────────────primary─┬─name─┬─age─┐
│ 2023-11-22 18:50:46 │ IOe  │ dq  │
└─────────────────────┴──────┴─────┘

SHA256 is a good step towards anonymization, but not perfect. Feel free to experiment with other ClickHouse hash functions, but keep an eye on compute and memory overhead.
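
The same hash-encode-truncate idea is easy to reproduce outside the database if you want to anonymize values before they're ingested. Below is a minimal Go sketch of the technique; the anonymize helper is illustrative, not part of any Hydrolix or ClickHouse API:

package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// anonymize maps a value to a deterministic pseudonym of the same length:
// SHA-256 the string, base64-encode the digest, truncate to the original length.
func anonymize(value string) string {
	sum := sha256.Sum256([]byte(value))
	encoded := base64.StdEncoding.EncodeToString(sum[:])
	if len(value) < len(encoded) {
		return encoded[:len(value)]
	}
	return encoded
}

func main() {
	// Identical inputs always yield identical pseudonyms of the same length.
	fmt.Println(anonymize("bev"))
	fmt.Println(anonymize("36"))
}

As in the SQL version, truncating the encoded digest keeps the output the same length as the input, at the cost of a slightly higher collision risk for very short values.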