Transforms: Reference
This is a detailed reference guide to all the parts of a transform document. For a conceptual overview of transforms and how they fit into a Hydrolix-based project, please see the guide to transforms.
Document structure
A transform is a JSON document that configures how Hydrolix will index and store incoming data in a given table. A transform’s content includes a general description of the incoming data’s format, as well as a description of every column that data will produce and populate in the target table.
The following example transform would let a Hydrolix table ingest a GZIP-compressed, CSV-formatted data stream with two columns:
{
"name": "my_special_transform",
"description": "description of my data",
"type": "csv",
"settings": {
"output_columns": [{
"position": 0,
"name": "timestamp",
"type": "datetime",
"treatment": "primary",
"format": "2006-01-02 15:04:05 MST"
}, {
"position": 1,
"name": "the_data",
"type": "uint64",
"treatment": "metric"
}],
"compression": "gzip",
"format_details": {
"skip_head": 2,
"delimiter": "\t"
}
}
}
Basic properties
Transform documents have the following top-level properties:
Property | Purpose |
---|---|
name | A name for this transform; must be unique within its table. |
description | An optional description of this transform. |
type | The file type of the input data this transform handles. One of csv or json . |
settings | A JSON object describing the data to ingest. |
The settings object
"settings": {
"is_default": 1,
"compression": "gzip",
"format_details": {
// Configuration specific to the data's overall
// format goes here
},
"output_columns": [
// A list of column-definition objects goes here
]
}
The settings object contains the details of the content and structure of the data. This is where the data input format, compression, and resulting columns are defined.
Property | Purpose |
---|---|
is_default | If set to 1 , then this transform’s table will apply this transform to a streaming ingest that doesn’t otherwise specify a transform. (Has no effect on batch ingests.) |
output_columns | An array of JSON objects with a description of every column in the target table that the incoming data will populate. See Column definitions, below. |
compression | A string describing the compression applied to the data (or "none" , the default). See Handling compression, below. |
format_details | A JSON object containing configuration information specific to the data’s overall format, such as field-delimiter characters with CSV files. See Format details, below. |
Column definitions
"settings": {
"output_columns": [
{
"position": 0,
"name": "timestamp",
"type": "datetime",
"treatment": "primary",
"format": "2006-01-02 15:04:05 MST"
},
{
"position": 1,
"name": "data",
"type": "uint64",
"treatment": "metric"
}
],
// Other "settings" properties go here
}
A transform must contain one object in its output_columns list for every column in the target table that its ingests will affect. This includes the table’s primary timestamp column, as well as any column that will receive values copied directly from incoming data. The transform enables Hydrolix to copy these values into the target table by specifying a positional or name-based mapping between the incoming data fields and the table’s columns.
The transform can also set columns to derive a new value based on a JavaScript expression, or a simple default.
The properties supported by column definition objects include the following:
Attribute | Purpose | Required |
---|---|---|
name | A name for this column, for use by subsequent SQL queries. (See also the note on column names and JSON data below.) | Yes |
type | The data type of the field. Can be a Hydrolix native data type, or an aliased data type. | Yes |
treatment | How to index the column. See Data treatments, below. | Yes |
position | Column position in the source CSV data, starting with 0 . | Yes, if the data is CSV. |
format | Parsing information for datetime and epoch data types. See Formatting timestamps, below. | No, except for datetime and epoch . |
default | A default value to apply when the source data’s value for this column is empty. | No |
script | A JavaScript expression whose output becomes the stored value of this column. See Scripted column values, below. | No |
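For instance, a column definition combining a positional CSV mapping with the default property might look like the following sketch (the column name status_code and its default value are illustrative, not part of any real schema):

```json
{
  "position": 2,
  "name": "status_code",
  "type": "uint64",
  "treatment": "tag",
  "default": 0
}
```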
Column names and JSON data
When ingesting rows formatted as JSON objects, Hydrolix uses the names of the objects’ top-level keys to establish the mapping between output_columns and the source data. That is, if your source data contains a top-level property named "employees" that you wish to ingest, then you must name the corresponding column definition in your transform "employees" as well.
This also applies to JSON flattening: your output columns must share the full names of any flattened data field whose value you wish to copy into them. So, if your flattened incoming data structure has a relevant property named "employees.departments[0]", and you wish to copy its values into your Hydrolix table, then one of your transform’s output_columns must also have its name property set to the string "employees.departments[0]".
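As a sketch, a transform expecting the flattened "employees.departments[0]" field mentioned above might declare a matching column definition like this (the type and treatment shown are assumptions for illustration):

```json
"output_columns": [
  {
    "name": "employees.departments[0]",
    "type": "string",
    "treatment": "tag"
  }
]
```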
Aliased data types
Column definitions’ type property can use either one of Hydrolix’s native types, such as uint64 or string , or one of the following “mapped types”. Hydrolix converts each to a native data type prior to storage, according to the following table:
Mapped type | Treatment |
---|---|
boolean | Converted to a uint64 prior to storage. The case-insensitive strings "false" or "0" get converted to 0 . Any other non-0 value gets converted to 1 . |
epoch | Treated as datetime , but presented as a single number, as with Unix timestamps. Using this mapping requires additional formatting information; see Formatting timestamps, below. |
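As an illustration, a column using the boolean mapped type might be defined as follows (the column name is_error is hypothetical); the stored column itself becomes a uint64 holding 0 or 1:

```json
{
  "position": 3,
  "name": "is_error",
  "type": "boolean",
  "treatment": "tag"
}
```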
Data treatments
Every column definition requires a treatment, specifying how Hydrolix should index or store the data found in that column upon ingestion.
You must assign the primary treatment to exactly one column definition per transform. The name property of this column must match that of the primary column in any other transform attached to the same table.
You can assign other treatments to the transform’s remaining columns nearly any way you wish, according to the following table.
Treatment | Supported data types | Description |
---|---|---|
primary | datetime or epoch | The table’s main index, always a timestamp. Every transform must define exactly one. |
tag | Any except double | Data or metadata that Hydrolix should both store and index. |
metric | Any | “Payload” data to store without indexing. |
virtual | Any | Derived or constant data, stored and indexed. |
Use the primary , tag , and metric treatments for columns whose new rows will receive values copied directly from the incoming data.
Use the virtual treatment for columns that derive their value via the script property upon the row’s creation, as described in Scripted column values, below. The virtual treatment is also useful for columns that receive a constant value through the default property.
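For example, a virtual column could stamp every ingested row with a constant label via default ; the column name and value here are illustrative:

```json
{
  "name": "data_source",
  "type": "string",
  "treatment": "virtual",
  "default": "stream-v2"
}
```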
Scripted column values
The script property, when present in a column definition, contains a JavaScript expression that executes for every row of data ingested. The expression’s output becomes the stored value for that column.
The expression may, as part of its computation, read the values of other fields in the same row. Hydrolix runs each new row’s JavaScript expressions in the order the transform defines them, and after it has set all of the row’s non-scripted values, including defaults. A script -defined expression, then, has access to any non-scripted value in the same row, as well as the row’s value for any scripted column defined earlier within the transform.
In the following example, the field named "ts_millis" derives its value based on data in the field named "timestamp" :
"settings": {
"output_columns": [
{
"name": "timestamp",
"type": "epoch",
"treatment": "primary",
"format": "s"
},
{
"name": "ts_millis",
"type": "uint64",
"treatment": "virtual",
"script": "new Date(row['timestamp']).getMilliseconds()"
}
]
}
Handling compression
The compression property describes one or more compression algorithms that Hydrolix should expect to find already applied to the data package as a whole, and which it will need to uncompress prior to working with the data.
For example, setting the transform’s compression property to "gzip" means that you expect the source data, in its entirety, to have had the GZIP compression method applied to it prior to its receipt by Hydrolix.
Recognized compression algorithms
Valid values for the compression property include the following:
Value | Meaning |
---|---|
gzip | Content is compressed via gzip (LZ77 with 32-bit CRC). |
zip | Content is ZIP-encoded via zlib (RFC 1950). |
deflate | Content is encoded in zlib structure with the deflate compression algorithm. |
bzip2 | Content is compressed with the bzip2 algorithm. |
none | Content is not compressed. (Equivalent to not specifying compression at all.) |
Note that, in streaming ingestion, the request may declare its own compression via the content-encoding header; the data itself may additionally carry the compression that this property describes.
Handling multiple compression layers
To define multiple layers of compression, specify them in a comma-and-space-separated list:
"compression": "gzip, bzip2, zip"
The order matters: Hydrolix will attempt to apply decompression algorithms in the order specified, right-to-left.
In the above example, Hydrolix would apply zlib decompression to all received data, then further apply bzip2 decompression, and end with applying gzip decompression.
Format details
{
"name": "my_special_transform",
"type": "csv",
"settings": {
"format_details": {
"skip_head": 2,
"delimiter": ","
},
...
}
}
Use the format_details property to configure how Hydrolix will parse the incoming data. The sub-properties you can define differ depending upon the data’s overall type .
Configuring CSV ingests
Element | Type | Default | Description |
---|---|---|---|
delimiter | number, string | , | The delimiter substring. |
escape | number, string | " | The escape character. |
skip_head | number | 0 | The number of rows to skip before ingestion starts. |
quote | number, string | " | The quote character. |
comment | number, string | # | The comment character. Only single characters are supported. |
skip_comments | boolean | false | If true, then the ingester will not process lines beginning with the comment character. |
windows_ending | boolean | false | If true, then Hydrolix will expect incoming data to use Windows-style (CR-LF) line endings. |
Note that Hydrolix recognizes "\t" as a tab character for the purposes of CSV configuration.
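Putting several of these options together, a format_details object for tab-delimited data with a one-line header and #-prefixed comment lines might look like this sketch:

```json
"format_details": {
  "delimiter": "\t",
  "skip_head": 1,
  "comment": "#",
  "skip_comments": true
}
```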
Configuring JSON ingests
JSON configuration has only one option, regarding data-flattening:
Element | Type | Default | Description |
---|---|---|---|
flattening | boolean | false | If true, then Hydrolix will flatten JSON data structures prior to ingesting them. (See JSON flattening, below.) |
JSON Flattening
When accepting JSON-formatted source data, you may optionally flatten each incoming object as a pre-processing step prior to ingesting it. This can transform complex, multi-level JSON structures into simple objects comprising one level of key/value pairs, ready for storage in a single table row.
To do this, define a flattening property within your transform’s format_details . Set its value to an object with the following properties:
Property | Purpose |
---|---|
active | If 1 (or any other true value), Hydrolix will flatten incoming JSON objects before ingesting them as rows. |
map_flattening_strategy | Configuration for flattening any JSON objects within each row’s main object. |
slice_flattening_strategy | Configuration for flattening any JSON arrays within each row’s main object. |
The two “strategy” properties accept an object that defines the rules that Hydrolix should follow to create new key names for the resulting, flattened JSON object.
Property | Purpose |
---|---|
left | The substring to use when concatenating an element’s key with its parent’s key. |
right | The substring to use when concatenating an element’s key with its child’s key. |
Not defining (or defining as null ) either of the “strategy” properties will deactivate flattening for objects or arrays, respectively.
An example of JSON flattening
Consider the following JSON object, which we wish to ingest as a single row:
{
"date": "2020-01-01",
"data": {
"oranges": [ 1, 2, 3 ],
"apples": [
{
"cortland": 6,
"honeycrisp": [ 7, 8, 9 ]
},
[ 10, 11, 12 ]
]
}
}
Imagine that the transform handling it contains the following flattening configuration:
"settings": {
"format_details": {
"flattening": {
"active": true,
"map_flattening_strategy": {
"left": ".",
"right": ""
},
"slice_flattening_strategy": {
"left": "[",
"right": "]"
}
}
},
...
}
After applying these JSON flattening strategies, Hydrolix would end up ingesting the following, single-level JSON object:
{
"date": "2020-01-01",
"data.oranges[0]": 1,
"data.oranges[1]": 2,
"data.oranges[2]": 3,
"data.apples[0].cortland": 6,
"data.apples[0].honeycrisp[0]": 7,
"data.apples[0].honeycrisp[1]": 8,
"data.apples[0].honeycrisp[2]": 9,
"data.apples[1][0]": 10,
"data.apples[1][1]": 11,
"data.apples[1][2]": 12
}
Formatting timestamps
Column definitions using either the datetime or epoch data type require the additional definition of a format property, so that Hydrolix knows how to parse and interpret the incoming time data.
Formatting datetime values
The datetime data type accepts almost any format of time representation in incoming data, provided that you can describe that format as a Go-style time format string. In that column definition’s format property, you must indicate how the data source would represent Go’s “reference time” of 03:04:05 PM on January 2, 2006, in the Mountain Standard Time zone (UTC-07:00).
A handful of examples:
Possible data value | The literal format to specify |
---|---|
2020-12-01 | 2006-01-02 |
12-01-20 13:05:00 | 01-02-06 15:04:05 |
2020-12-01T08:05:00-0500 | 2006-01-02T15:04:05-0700 |
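For example, a primary-column definition expecting timestamps in the last format above, such as 2020-12-01T08:05:00-0500 , could be written as:

```json
{
  "position": 0,
  "name": "timestamp",
  "type": "datetime",
  "treatment": "primary",
  "format": "2006-01-02T15:04:05-0700"
}
```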
Formatting epoch values
Use the epoch data type for Unix-style “epoch time”, or other representations of a timestamp as a number of time units that have passed since 1970-01-01 00:00:00 UTC, a.k.a. the epoch.
With epoch , acceptable values of format include the following:
Format | Meaning |
---|---|
ns | nanoseconds since epoch |
us | microseconds since epoch |
ms | milliseconds since epoch |
cs | centiseconds since epoch |
s | seconds since epoch (a.k.a. Unix time) |
In all cases, the Hydrolix ingester accepts either integers or numerical strings as legal epoch data.
The ingester will round down (that is, truncate) epoch -based timestamps to the nearest second, if necessary, before storing them in your table.
Hydrolix treats all epoch -based timestamps as UTC time.
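For example, a primary column ingesting millisecond-precision epoch timestamps (the column name event_time is illustrative) might be defined as:

```json
{
  "name": "event_time",
  "type": "epoch",
  "treatment": "primary",
  "format": "ms"
}
```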