Overview
Before you can ingest data into a Hydrolix table, you must prepare a transform. A transform is a write schema: a JSON document that describes how data from a particular source maps onto that table.
What are transforms?⚓︎
A transform tells the Hydrolix ingestion process about the incoming data's overall format. It maps that data's fields onto the target table's columns, and instructs Hydrolix about how it should index, store, or otherwise handle each piece of incoming data.
Every ingestion mechanism used in Hydrolix must specify not just a target table, but a transform to apply. You can permanently associate transforms with tables, and refer to them by name during any ingest. Alternatively, using the streaming-ingest API with self-described messaging lets you include a transform document within your HTTP request.
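As a rough sketch, a self-described streaming request body bundles the transform together with the data it describes. The field names below (table, transform, data) are illustrative assumptions; consult the streaming-ingest API reference for the exact request shape:

```json
{
  "table": "example_project.example_table",
  "transform": {
    "name": "inline_transform",
    "type": "csv",
    "settings": {}
  },
  "data": "2023-01-01 00:00:00,192.0.2.1,hello,en,US"
}
```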
Multi-ingest transforms⚓︎
Hydrolix has separated the "schema" from the table. This means that a single table can have multiple transforms, allowing it to accept data from a wide variety of sources, translating each from its own schema into a common format.
Create a sample transform⚓︎
Let's create a short transform document that meets a simple time-based data storage need. We'll then examine that transform's parts in more detail.
Imagine a Hydrolix table built to accept CSV data containing five columns: a timestamp, an IP address, a message, a language tag, and a country tag.
To prepare our transform, we need to decide how Hydrolix should handle each of these columns.
Since all Hydrolix tables store time-series data, we will of course need the ingestion process to parse and index every incoming row's timestamp field, treating it as primary data. Given the likelihood that queries on this table will have WHERE clauses based on IP address or language, we'll want to index those fields, as well.
For the sake of our example, let's decide that the message column contains data we'll want to see in query results, but which we won't use as a query filter. That means we should store it without indexing it.
Finally, our example project has no use for the country tag, so we can simply tell the transform to ignore ("ignore": true) this column. This will cause Hydrolix to disregard that column entirely when applying this transform.
Our planning and preparations complete, we now have enough information to create our transform document:
Example Transform⚓︎
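The following sketch shows what such a transform might look like, built from the decisions above: a primary timestamp, indexed ip and language columns, a stored-but-unindexed message column, and an ignored country column. Exact key names and structure may vary by Hydrolix version; see Transform Structure for the authoritative reference.

```json
{
  "name": "example_csv_transform",
  "description": "Ingest CSV rows: timestamp, IP, message, language, country",
  "type": "csv",
  "settings": {
    "is_default": true,
    "compression": "none",
    "output_columns": [
      {
        "name": "timestamp",
        "datatype": {
          "type": "datetime",
          "primary": true,
          "format": "2006-01-02 15:04:05",
          "source": { "from_input_index": 0 }
        }
      },
      {
        "name": "ip",
        "datatype": {
          "type": "string",
          "index": true,
          "source": { "from_input_index": 1 }
        }
      },
      {
        "name": "message",
        "datatype": {
          "type": "string",
          "index": false,
          "source": { "from_input_index": 2 }
        }
      },
      {
        "name": "language",
        "datatype": {
          "type": "string",
          "index": true,
          "source": { "from_input_index": 3 }
        }
      },
      {
        "name": "country",
        "datatype": {
          "type": "string",
          "ignore": true,
          "source": { "from_input_index": 4 }
        }
      }
    ]
  }
}
```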
Elements of a transform⚓︎
Let's explore the major sections of that example transform document, discussing what each does in turn.
Name, description, and type keys⚓︎
The transform's name is the identifier used when referring to this transform programmatically, such as when setting up an ingest. Its optional description is some human-readable text explaining the transform's purpose.
The type key tells Hydrolix what overall format to expect the data in: either csv or json. Hydrolix applies sensible defaults to both formats. For example, a comma character acts as a field delimiter for CSV data, and you can override these settings as needed elsewhere in the transform.
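For instance, if your CSV source uses tabs rather than commas, you can override the delimiter in the transform's settings. The format_details key and its fields below are a sketch; check Transform Structure for the exact option names your version supports:

```json
{
  "type": "csv",
  "settings": {
    "format_details": {
      "delimiter": "\t",
      "skip_head": 1
    }
  }
}
```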
Settings key⚓︎
The top-level settings key defines a handful of configuration options before setting up the column mappings.
The example transform makes use of two options. Setting is_default to 1 instructs Hydrolix to apply this transform to an ingest that doesn't explicitly specify a transform. And since we don't expect the data to arrive with any compression methods applied, we set compression to none.
For everything else, the example transform relies on default values. See Transform Structure for a complete list of all possible settings.
Output columns⚓︎
The array found under the output_columns key contains one JSON object for every column in the target table that will receive new values from the incoming data.
Data can be transformed, indexed, and stored. You can also set up special columns that don't directly copy values from the incoming data, but instead propagate constant values, calculate automatic values, look up values from external sources such as WURFL, or even run JavaScript expressions that produce new values.
If you don't know the exact shape of your data, a catch_all field can help you with your investigation, showing you field names and values, regardless of their type. See Use the catch_all feature for more information.
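A catch_all column is typically defined as a map that collects any incoming fields not claimed by other column definitions. The sketch below assumes a string-to-string map; see Use the catch_all feature for the exact structure:

```json
{
  "name": "catch_all",
  "datatype": {
    "type": "map",
    "index": true,
    "catch_all": true,
    "elements": [
      { "type": "string" },
      { "type": "string" }
    ]
  }
}
```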
Primary timestamp⚓︎
Every transform must declare exactly one output column as primary, and that column's datatype must be a timestamp in a format that Hydrolix recognizes.
In the example, the timestamp column definition satisfies this requirement. The format key tells Hydrolix how to parse the incoming timestamp, using Go datetime formatting rules.
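Go datetime formatting expresses the layout using Go's fixed reference time, Mon Jan 2 15:04:05 MST 2006. A primary timestamp column for values like 2023-06-30 12:34:56 might look like this sketch:

```json
{
  "name": "timestamp",
  "datatype": {
    "type": "datetime",
    "primary": true,
    "format": "2006-01-02 15:04:05"
  }
}
```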
Indexed columns⚓︎
By default, Hydrolix both stores and indexes each column, just as it does when you explicitly set a column's index to true. If you instead set a column's index to false, Hydrolix only stores that column's data, bypassing the index.
See Basic Data Types, Complex Data Types, and Timestamp Data Types for which column types Hydrolix indexes by default.
Hydrolix is built so that you don't need to decide whether to index a column, but we still give you the option. The old rules about indexes bloating a dataset don't apply here, so index as much as you like (hence indexing being on by default), and turn indexing off only when you really don't want to index something. If you use a column as a dimension in your WHERE clauses, it should always be indexed.
Our example transform matches the expectation that the resulting table will frequently handle queries like SELECT message FROM example_table WHERE language = 'en', retrieving stored messages based on a known language tag, and not the other way around.
Virtual columns⚓︎
Not present in this example is a virtual column definition. Virtual columns aren't directly mapped to incoming data fields. Instead, they derive new values based on simple default values you define in the transform, or from JavaScript expressions that can take other values in the same row as arguments.
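As an illustration, a virtual column that derives a value from another column in the same row might look like the following sketch. The virtual and script keys, and the expression syntax, are assumptions here; consult the transform reference for the supported fields:

```json
{
  "name": "greeting",
  "datatype": {
    "type": "string",
    "virtual": true,
    "script": "'Hello, ' + language"
  }
}
```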
Unknown formats⚓︎
Sometimes you don't know the format of your data before you start defining your transform. In this case, set up a temporary transform that simply copies data into a table for your inspection using the catch_all setting. See our Useful Data Tools page for more details.
Intake stages⚓︎
A transform contains quite a bit of information, describing how to use features like Data Enrichment, Automatic Values, Custom Functions, and Custom Dictionaries.
The transform object describes every configurable step of the intake process prior to data indexing.

Payloads accepted into the system pass through the following intake stages:
| Step | Description |
|---|---|
| Validate Data | Decompress incoming data, then read and perform preliminary validation of payloads according to the specified format |
| Apply Data Types | Apply data types to incoming fields (for example, uint64 or string; see Basic Data Types) |
| Check Limits | Apply Datatype Limiter checks to the data being ingested |
| Autovalues and JavaScript Expressions | Apply Automatic Values and execute JavaScript expression scripts |
| Prepare for Indexing | Collect optional transform SQL and send to indexer for data enrichment |
| Index Data | Process the incoming stream with transform SQL, using dictionaries and functions, and generate a table partition file |
| Write to Storage | Write partition to the object storage platform |
| Update Catalog | Write partition metadata to the catalog. This makes partitions available to the query system |
| Write Rejects | Write any rejected data from processing steps to object storage |