Write Transforms
Before you can ingest data into a Hydrolix table, you must prepare a transform. A transform is a kind of write schema document, in JSON format, that describes how data from a particular source applies to that table.
What are Transforms?
A transform tells the Hydrolix ingestion process about the incoming data's overall format. It maps that data's fields onto the target table's columns, and instructs Hydrolix about how it should index, store, or otherwise handle each piece of incoming data.
Every ingestion mechanism used in Hydrolix must specify not just a target table, but a transform to apply. You can permanently associate transforms with tables and refer to them by name upon any ingestion. Alternatively, using the streaming-ingest API with self-described messaging lets you include a transform document within your HTTP request.
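For instance, a self-described streaming request bundles the transform and the rows to ingest into a single JSON body, roughly as sketched below. This shape is an assumption for illustration only (in particular the top-level transform and data keys, and the from_input_field source used for JSON-formatted rows); consult the streaming ingest documentation for the authoritative request format.

{
  "transform": {
    "name": "inline_example",
    "type": "json",
    "settings": {
      "output_columns": [
        {
          "name": "timestamp",
          "datatype": {
            "primary": true,
            "type": "datetime",
            "format": "2006-01-02 15:04:05",
            "source": { "from_input_field": "timestamp" }
          }
        },
        {
          "name": "message",
          "datatype": {
            "type": "string",
            "source": { "from_input_field": "message" }
          }
        }
      ]
    }
  },
  "data": [
    { "timestamp": "2021-01-16 02:00:00", "message": "Hello!" }
  ]
}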
Multi-Ingest Transforms
Hydrolix has separated the "schema" from the table. This means that a single table can have multiple transforms, allowing it to accept data from a wide variety of sources and translate each from its respective schema into a common format.
Sample Transform
Let's create a short transform document that meets a simple time-based data storage need. We'll then examine that transform's parts in more detail.
Imagine that we have a Hydrolix table that must accept raw data containing five columns: a timestamp, an IP address, a message, a language tag, and a country tag. We expect the data to arrive as uncompressed CSV.
A couple of rows of that incoming data might look like this:
2021-01-16 02:00:00,192.168.0.5,Hello!,en,us
2021-01-16 03:00:00,192.168.0.20,Bonjour!,fr,ca
To prepare our transform, we need to decide how Hydrolix should handle each of these columns.
Since all Hydrolix tables store time-series data, we will of course need the ingestion process to parse and index every incoming row's timestamp field, treating it as primary data. Given the likelihood that queries on this table will have WHERE clauses based on IP address or language, we'll want to index those fields as well.
For the sake of our example, let's decide that the message column contains data we'll want to see in query results, but which we won't use as a query filter. That means we should store it without indexing it.
Finally, our example project has no use for the country tag, so we can simply set "ignore": true for this column in our transform. This causes Hydrolix to disregard that column of the incoming data entirely when applying this transform.
Our planning and preparations complete, we now have enough information to create our transform document:
{
  "name": "example_transform",
  "description": "A short example transform.",
  "type": "csv",
  "settings": {
    "is_default": true,
    "compression": "none",
    "output_columns": [
      {
        "name": "timestamp",
        "datatype": {
          "primary": true,
          "type": "datetime",
          "format": "2006-01-02 15:04:05",
          "source": {
            "from_input_index": 0
          }
        }
      },
      {
        "name": "ip_address",
        "datatype": {
          "type": "string",
          "source": {
            "from_input_index": 1
          }
        }
      },
      {
        "name": "message",
        "datatype": {
          "type": "string",
          "source": {
            "from_input_index": 2
          },
          "index": false
        }
      },
      {
        "name": "language",
        "datatype": {
          "type": "string",
          "source": {
            "from_input_index": 3
          }
        }
      },
      {
        "name": "countryCode",
        "datatype": {
          "type": "string",
          "source": {
            "from_input_index": 4
          },
          "ignore": true
        }
      }
    ]
  }
}
Elements of a Transform
Let's explore the major sections of that example transform document, discussing what each does in turn.
Name, Description, and Type Keys
The transform's name is the identifier used when referring to this transform programmatically, such as when setting up an ingest. Its optional description is human-readable text explaining the transform's purpose.
The type key tells Hydrolix what overall format to expect the data in: either "csv" or "json". Hydrolix applies sensible defaults to both formats--for example, using a comma character as the field delimiter for CSV data--and you can override these settings as needed elsewhere in the transform.
Settings Key
The top-level settings key defines a handful of configuration options before setting up the all-important column mappings.
Our simple example document makes use of only two such options. Setting is_default to true allows Hydrolix to apply this transform to an ingest that doesn't explicitly specify a transform. And since we don't expect the data to arrive with any compression applied, we set compression to "none".
A transform can supply lots of other optional configuration here: whether to skip the first row of the incoming data, for example, or detailed instructions about field delimiters or JSON flattening. Our example keeps things simple and relies on defaults rather than specifying any further configuration. See Transform Structure for a complete list of all possible settings.
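As a rough sketch, a CSV transform that needed to skip a header row and split fields on tabs might carry a settings object along these lines. The format_details key and its delimiter and skip_head fields are assumptions for illustration; the Transform Structure page is the authoritative reference for these names.

{
  "is_default": true,
  "compression": "none",
  "format_details": {
    "delimiter": "\t",
    "skip_head": 1
  }
}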
Output Columns
The array found under the output_columns key provides the meat of any Hydrolix transform. This array contains one JSON object for every column in the target table that will receive new values from the incoming data.
Each of these objects defines what Hydrolix should do with the data found in its associated incoming field: transform it, index it, or store it. You can also set up special columns that don't directly copy values from the incoming data, but instead propagate constant values, calculate automatic values, look up values from external sources such as WURFL, or even run JavaScript expressions that produce new values.
If you don't know the exact shape of your data, a catch_all field can help you with your investigation, showing you field names and values regardless of their type. See using the catch_all feature for more information.
Primary Timestamp
Every transform must declare exactly one of these columns as primary, and that column's datatype must be a timestamp in some format that Hydrolix recognizes.
In our example's case, the timestamp column's definition satisfies this requirement. The object's format key tells Hydrolix how to parse the incoming timestamp, using Go datetime formatting rules: that is, rendering how the data would represent Go's "reference time" of 3:04:05 PM, in the GMT-7 time zone, on Jan 2, 2006.
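To illustrate, if the incoming timestamps were instead RFC 3339 strings such as 2021-01-16T02:00:00Z, the same column definition would simply swap in the matching Go-style layout:

{
  "name": "timestamp",
  "datatype": {
    "primary": true,
    "type": "datetime",
    "format": "2006-01-02T15:04:05Z07:00",
    "source": {
      "from_input_index": 0
    }
  }
}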
Indexed Columns
Hydrolix will both store and index any column whose "index" setting is true, which is the default. If you instead set a column's "index" to false, Hydrolix will only store that column's data, bypassing the index.
(One caveat: Hydrolix cannot index columns using the double datatype.)
Hydrolix is built so that you don't need to agonize over whether to index a column, though you still have the option. The traditional concerns about index bloat do not apply here, so index as much as you like (hence indexing being on by default), and turn indexing off only for columns you are sure you won't filter on. Any column you use as a dimension in your WHERE clauses should always be indexed.
Our example transform matches the expectation that the resulting table will frequently handle queries like SELECT message FROM example_table WHERE language = 'en', retrieving stored messages based on a known language tag--and not the other way around.
Virtual Columns
Not present in this example is a virtual column definition. Virtual columns are not directly mapped to incoming data fields. Instead, they derive new values based on simple default values you define in the transform, or from JavaScript expressions that can take other values in the same row as arguments.
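As a rough sketch, a virtual column that stamps every row with a constant source label might look like the following. The virtual and default keys here are assumptions for illustration; see Transform Structure for the exact syntax Hydrolix expects for virtual columns and JavaScript-derived values.

{
  "name": "data_source",
  "datatype": {
    "type": "string",
    "virtual": true,
    "default": "csv_feed"
  }
}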
Unknown Formats
Sometimes you don't know the format of your data before you start defining your transform. In this case, set up a temporary transform that simply copies data into a table for your inspection using the catch_all setting. See our Useful Data Tools page for more details.
Transform Execution
A transform can contain quite a bit of information and processing: Data Enrichment, Automatic Values, Custom Functions, and Custom Dictionaries can all be used to validate, clean, and enrich the data that is finally stored.
Each transform follows a logical flow in how data is parsed. The following table provides some insight into the order of execution at ingest time.
| Step | Description |
|---|---|
| Basic Data/Transform Validation Check | Basic checks on the compression format, the type of file being ingested, and so on. |
| Apply Data Types | Apply the data types (e.g. UInt64, String) to the data being ingested. See Basic Data Types. |
| Limit Checks | Apply Datatype Limiters checks to the data being ingested. |
| AutoValues / JS Scripting | Apply Automatic Values to the inbound data and run any JavaScript scripting. |
| Execution of SQL_Transform | Execute the SQL_Transform within the transform for data enrichment. |
| Apply Functions | Custom user-generated Functions are made available to the SQL_Transform. |
| Apply Dictionaries | Custom user-generated Dictionaries are made available to the SQL_Transform. |
| Indexing | Compression and indexing of the final data is completed. |
| Write File to Storage | Data partitions are written to the cloud storage platform. |
| Update Catalog | Metadata is written to the Catalog to make partitions available for querying. |
| Write File Rejects to Storage | Any rejects from the above steps are written to storage. |