Transforms (Write Schema)

Before you can ingest data into a Hydrolix table, you must prepare a transform. A transform is a kind of write schema document, in JSON format, that describes how data from a particular source applies to that table.

How tables use transforms

A transform tells the Hydrolix ingestion process about the incoming data's overall format. It maps that data's fields onto the target table's columns, and instructs Hydrolix about how it should index, store, or otherwise handle each piece of incoming data.

Every ingestion you set up in Hydrolix must specify not just a target table, but a transform to apply. You can permanently associate transforms with tables, and refer to them by name upon any ingestion. Alternately, using the streaming-ingest API with self-described messaging lets you include a transform document within your HTTP request.

Attaching multiple transforms to a table allows it to accept data from a wide variety of sources, flexibly translating each into its own common format. If a table expects to receive data from only one source, however, it can function quite well with only one transform attached and set as its default.

A sample transform

Let's create a short transform document that meets a simple time-based data storage need. We'll then examine that transform's parts in more detail.

Imagine that we have a Hydrolix table to that must accept raw data containing four columns: a timestamp, an IP address, a message, a language tag, and a country tag. We expect the data to arrive as uncompressed CSV.

A couple of rows of that incoming data might look like this:

2021-01-16 02:00:00,192.168.0.5,Hello!,en,us
2021-01-16 03:00:00,192.168.0.20,Bonjour!,fr,ca

To prepare our transform, we need to decide how Hydrolix should handle each of these columns.

Since all Hydrolix tables store time-series data, we will of course need the ingestion process to parse and index every incoming row's timestamp field, treating it as primary data. Given the likelihood that queries on this table will have WHERE clauses based on IP address or language, we'll want to index those fields, as well.

For the sake of our example, let's decide that the message column contains data we'll want to see in query results, but which we won't use as query filter. That means we should store it without indexing it.

Finally, our example project has no use for the country tag, so we can simply tell the transform to ignore ("ignore": true) this column in our transform. This will cause Hydrolix disregard that column entirely from the incoming data when applying this transform.

Our planning and preparations complete, we now have enough information to create our transform document:

{
    "name": "example_transform",
    "description": "A short example transform.",
    "type": "csv",
    "settings": {
        "compression": "none",
        "output_columns": [
            {
                "name": "timestamp",
                "position": 0,
                "datatype": {
                    "primary": true,
                    "type": "datetime",
                    "format": "2006-01-02 15:04:05"
                }
            },
            {
                "name": "ip_address",
                "position": 1,
                "datatype": {
                    "type": "string",
                    "index": true
                }
            },
            {
                "name": "message",
                "position": 2,
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "language",
                "position": 3,
                "datatype": {
                    "type": "string",
                    "index": true
                }
            } ,
            {
                "name": "countryCode",
                "position": 4,
                "datatype": {
                    "type": "string",
                    "ignore": true
                }
            } 
        ]
    }
}

Elements of a transform

Let's explore the major sections of that example transform document, discussing what each does in turn.

Name, description, and type keys

The transform's name is the identifier used when referring to this transform programmatically, such as when setting up an ingest. Its optional description is some human-readable text explaining the transform's purpose.

The type key tells Hydrolix what overall format to expect the data in: either "csv" or "json". Hydrolix applies sensible defaults to both formats--for example, using a comma character as a field delimiter for CSV data--and you can override these settings as needed elsewhere in the transform.

The settings key

The top-level settings key defines a handful of configuration options before setting up the all-important column mappings.

Our simple example document makes use of only two such options. Setting is_default to 1 allows Hydrolix to apply this transform to an ingest that doesn't explicitly specify a transform. And since we don't expect the data to arrive with any compression methods applied, we set compression to "none".

A transform can supply lots of other optional configuration here: whether to skip the first row of the incoming data, for example, or detailed instructions about field delimiters or JSON flattening. Our example keeps things simple and relies on defaults, rather than specifying any further configuration. See the transform reference documentation for a complete list of all possible settings.

Output columns

The array found under the output_columns key provides the meat of any Hydrolix transform. This array contains one JSON object for every column in the target table that will receive new values from the incoming data.

Each of these objects defines what Hydrolix should do with the data found in its associated incoming field: transform it, index it, or store it. You can also set up special columns that don't directly copy values from the incoming data, but instead propagate constant values, or even run JavaScript expressions that produce new values.

The primary timestamp

Every transform must declare exactly one of these columns as primary, and that column's datatype must be a timestamp in some format that Hydrolix recognizes.

In our example's case, the timestamp column's definition satisfies this requirement. The object's format key tells Hydrolix how to parse the incoming timestamp, using Go datetime formatting rules: that is, rendering how the data would represent Go's "reference time" of 3:04:05 PM, in the GMT-7 time zone, on Jan 2, 2006.

Indexed columns

Hydrolix will both store and index columns with an "index" setting of true by default and when you manually set it to true. If instead you set a column's "index" to false then Hydrolix will only store that column's data, bypassing the index.

(One caveat: Hydrolix cannot index columns using the double datatype.)

Hydrolix has been built so that you do not need to decide if you want to index a column or not, but we'd still like to give you the option. The old rules of index bloat of a dataset do not apply here, so index as often or as much as you like (hence it being on by default) and only turn indexing off if you really don't want to index something. If you use it as a dimension in your WHERE statements, then it definitely should always be indexed.

Our example transform matches the expectation that the resulting table would frequently handle queries like SELECT message FROM example_table WHERE language = 'en', seeking to retrieve stored messages based on a known language tag--and not the other way around.

Virtual columns

Not present in this example is a virtual column definition. Virtual columns are not directly mapped to incoming data fields. Instead, they derive new values based on simple default values you define in the transform, or from JavaScript expressions that can take other values in the same row as arguments.


Did this page help you?