Transform options

A transform is a JSON document that configures how Hydrolix indexes and stores incoming data in a given fluid table. A transform’s content includes a general description of the incoming data’s format, as well as a description of every column that the data will populate in the target table.

The following example transform would let a Hydrolix table ingest a GZIP-compressed, tab-delimited CSV data stream with two columns:

{
    "name": "my_special_transform",
    "description": "description of my data",
    "type": "csv",
    "settings": {
        "output_columns": [
            {
                "name": "timestamp",
                "position": 0,
                "datatype": {
                    "type": "datetime",
                    "primary": true,
                    "format": "2006-01-02 15:04:05 MST"
                }
            },
            {
                "name": "the_data",
                "position": 1,
                "datatype": {
                    "type": "uint64"
                }
            }
        ],
        "compression": "gzip",
        "format_details": {
            "skip_head": 2,
            "delimiter": "\t"
        }
    }
}
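To make the example concrete, the following Python sketch builds a payload of the shape that transform describes. It rests on two assumptions that this section does not confirm: that "skip_head": 2 means Hydrolix skips two leading lines, and that the format string "2006-01-02 15:04:05 MST" is a Go-style reference layout matching values such as "2021-03-01 12:00:00 UTC". The rows and leading lines are invented for illustration.

```python
import gzip

# Hypothetical sample rows: (timestamp, the_data).
ROWS = [
    ("2021-03-01 12:00:00 UTC", 42),
    ("2021-03-01 12:00:01 UTC", 43),
]


def build_payload(rows):
    """Return gzip-compressed, tab-delimited text with two leading lines."""
    lines = [
        "# exported by a hypothetical producer",  # leading line 1
        "timestamp\tthe_data",                    # leading line 2
    ]
    lines += [f"{ts}\t{value}" for ts, value in rows]
    text = "\n".join(lines) + "\n"
    return gzip.compress(text.encode("utf-8"))


payload = build_payload(ROWS)
decoded = gzip.decompress(payload).decode("utf-8").splitlines()
data_lines = decoded[2:]  # what remains if two leading lines are skipped (assumed)
```

Field 0 of each remaining line would map to the "timestamp" column and field 1 to "the_data", following the position values in the transform.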

Basic properties

Transform documents have the following top-level properties:

Property | Purpose
name | A name for this transform; must be unique within its table.
description | An optional description of this transform.
type | The file type of the input data this transform handles. One of csv, json, or parquet.
settings | A JSON object describing the data to ingest.
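As a rough illustration of these rules, a transform document could be checked locally before it is submitted. The sketch below is not part of Hydrolix; the function name check_top_level and its messages are hypothetical, and it assumes that name must be a non-empty string and that description, when present, is a string. Uniqueness of the name within a table cannot be checked this way.

```python
ALLOWED_TYPES = {"csv", "json", "parquet"}


def check_top_level(transform):
    """Return a list of problems found in a transform's top-level properties."""
    problems = []
    name = transform.get("name")
    if not isinstance(name, str) or not name:
        problems.append("name must be a non-empty string")
    if "description" in transform and not isinstance(transform["description"], str):
        problems.append("description, when present, must be a string")
    if transform.get("type") not in ALLOWED_TYPES:
        problems.append("type must be one of csv, json, or parquet")
    if not isinstance(transform.get("settings"), dict):
        problems.append("settings must be a JSON object")
    return problems
```

An empty list means none of these checks failed; it does not guarantee that Hydrolix will accept the document.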

The settings object

"settings": {
    "is_default": 1,
    "compression": "gzip",
    "format_details": {
        // Configuration specific to the data's overall 
        // format goes here
    },
    "output_columns": [
        // A list of column-definition objects goes here
    ]
}

The settings object contains the details of the content and structure of the data. This is where the data input format, compression, and resulting columns are defined.

Property | Purpose
is_default | If set to 1, then this transform’s table will apply this transform to a streaming ingest that doesn’t otherwise specify a transform. (Has no effect on batch ingests.)
output_columns | An array of JSON objects with a description of every column in the target table that the incoming data will populate. See Column definitions, below.
compression | A string describing the compression applied to the data (or "none", the default). See Handling compression, below.
format_details | A JSON object containing configuration information specific to the data’s overall format, such as field-delimiter characters with CSV files. See Format details, below.
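The effect of is_default on streaming ingests can be modeled with a small lookup. This is only a sketch of the behavior described in the table, not Hydrolix’s implementation; the function choose_transform and the idea of passing the requested transform name as an argument are assumptions made for illustration.

```python
def choose_transform(transforms, requested_name=None):
    """Pick the transform for a streaming ingest, or None if none applies."""
    if requested_name is not None:
        # The ingest names a transform explicitly, so is_default is ignored.
        for transform in transforms:
            if transform["name"] == requested_name:
                return transform
        return None
    # No transform was specified: fall back to the one marked as default.
    for transform in transforms:
        if transform.get("settings", {}).get("is_default") == 1:
            return transform
    return None
```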

Output column definitions

"settings": {
    "output_columns": [
        {
            "name": "timestamp",
            "position": 0,
            "datatype": {
                "type": "datetime",
                "primary": true,
                "format": "2006-01-02 15:04:05 MST"
            }
        },
        {
            "name": "data",
            "position": 1,
            "datatype": {
                "type": "uint64"
            }
        }
    ],
    
    // Other "settings" properties go here

}

A transform must contain one object in its output_columns list for every column in the target table that its ingests will affect. This includes the table’s primary timestamp column, as well as any column that receives a value copied directly from the incoming data. The transform enables Hydrolix to copy these values into the target table by specifying a positional or name-based mapping between the incoming data fields and the table’s columns.

A transform can also have a column derive its value from a JavaScript expression, or take a simple default.

The properties supported by column definition objects include the following:

Attribute | Purpose | Required
name | A name for this column, for use by subsequent SQL queries. (See also the note on column names and JSON data below.) | Yes
position | Column position in the source CSV data, starting with 0. | Yes, if the data is CSV.
datatype | An object containing detailed column definition. | Yes

The datatype property requires an object as its value in order to allow recursive definitions, as described in Working with arrays, below. The properties supported by datatype include the following:

Attribute | Purpose | Required
default | A default value to apply when the source data’s value for this column is empty. | No
elements | A description of the constituent elements of array-type data. See Working with arrays, below. | No, except for array.
format | Parsing information for datetime and epoch data types. See Timestamp formatting. | No, except for datetime and epoch.
index | If true, then Hydrolix will index this column’s data as well as storing it as usual. | No
primary | If true, then this marks the target table’s single, primary field. It must have either datetime or epoch type. | Exactly one per transform
resolution | Set to "ms" (milliseconds) or "s" (seconds) to define the time-granularity that Hydrolix will apply when storing this column’s datetime or epoch-type data. Defaults to "s". | No
script | A JavaScript expression whose output becomes the stored value of this column. See Scripted column values, below. | No
type | The data type of the field. | Yes
virtual | If true, then Hydrolix will not map this column to any incoming data. Useful for constant or derived data when paired with the default or script settings, respectively. | No
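Several of the "Required" rules above can be checked mechanically. The following Python sketch covers only the rules stated in the two tables (a datatype with a type, a format for datetime and epoch, elements for array, and exactly one primary column of a time type). The function name check_columns and its messages are hypothetical, and the checks Hydrolix itself performs may differ.

```python
TIME_TYPES = {"datetime", "epoch"}


def check_columns(output_columns):
    """Return a list of problems found in a list of column definitions."""
    problems = []
    primaries = []
    for column in output_columns:
        name = column.get("name", "<unnamed>")
        datatype = column.get("datatype")
        if not isinstance(datatype, dict) or "type" not in datatype:
            problems.append(f"{name}: datatype with a type is required")
            continue
        kind = datatype["type"]
        if kind in TIME_TYPES and "format" not in datatype:
            problems.append(f"{name}: {kind} columns need a format")
        if kind == "array" and "elements" not in datatype:
            problems.append(f"{name}: array columns need elements")
        if datatype.get("primary") is True:
            primaries.append(name)
            if kind not in TIME_TYPES:
                problems.append(f"{name}: primary column must be datetime or epoch")
    if len(primaries) != 1:
        problems.append(f"expected exactly one primary column, found {len(primaries)}")
    return problems
```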

Working with arrays

In addition to primitive datatypes, Hydrolix supports arrays as column values. If you set a column’s type to "array" in its output_columns definition, then you must also provide an elements property. This property defines the datatype of the elements that make up the array.

This definition takes the form of a single-element JSON array containing an object that supports the same properties as the datatype property described in the previous table.

The following example defines a column named "index_data" whose value is an array of integers, each of which Hydrolix will index upon ingestion.

"settings": {
    "output_columns": [
        {
            "name": "timestamp",
            "datatype": {
                "type": "datetime",
                "primary": true,
                "format": "2006-01-02 15:04:05 MST"
            }
        },
        {
            "name": "index_data",
            "datatype": {
                "type": "array",
                "elements": [
                    {
                        "type": "uint64",
                        "index": true
                    }
                ]
            }
        }
    ]
}
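Because elements holds a datatype-shaped object, a datatype definition can be walked recursively. The Python sketch below renders a definition as a compact string. The helper name describe and the "array<...>" notation are invented for illustration and are not part of Hydrolix.

```python
def describe(datatype):
    """Render a datatype object as a compact string such as 'array<uint64>'."""
    if datatype["type"] != "array":
        return datatype["type"]
    elements = datatype["elements"]
    # The elements property is a single-element list holding one datatype object.
    if len(elements) != 1:
        raise ValueError("elements must contain exactly one datatype object")
    return f"array<{describe(elements[0])}>"
```

For the "index_data" column in the example, describe would return "array<uint64>".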

Scripted column values

The script property, when present in a column definition, contains a JavaScript expression that executes for every row of data ingested. The expression’s output becomes the stored value for that column.

The expression may, as part of its computation, read the values of other fields in the same row. Hydrolix runs each new row’s JavaScript expressions in the order the transform defines them, and after it has set all of the row’s non-scripted values, including defaults. A script-defined expression, then, has access to any non-scripted value in the same row, as well as the row’s value for any scripted column defined earlier within the transform.

In the following example, the field named "ts_millis" derives its value based on data in the field named "timestamp":

"settings": {
    "output_columns": [
        {
            "name": "timestamp",
            "datatype": {
                "type": "epoch",
                "primary": true,
                "format": "s"
            }
        },
        {
            "name": "ts_millis",
            "datatype": {
                "type": "uint64",
                "virtual": true,
                "script": "new Date(row['timestamp']).getMilliseconds()"
            }
        }
    ]
}
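The evaluation order described in this section can be modeled in a few lines of Python. This is a simplified sketch, not Hydrolix’s implementation: Python callables stand in for JavaScript expressions, column definitions are flattened (no nested datatype object), and the column names are invented for illustration.

```python
def build_row(columns, incoming):
    """Fill non-scripted values first, then run scripts in definition order."""
    row = {}
    # Pass 1: copy incoming values, applying defaults where a value is empty.
    for column in columns:
        if "script" in column:
            continue
        value = incoming.get(column["name"])
        if value in (None, ""):
            value = column.get("default")
        row[column["name"]] = value
    # Pass 2: scripted columns, in the order the transform defines them.
    for column in columns:
        if "script" in column:
            row[column["name"]] = column["script"](row)
    return row


columns = [
    {"name": "total", "script": lambda row: row["bytes"] * row["count"]},
    {"name": "bytes"},
    {"name": "count", "default": 1},
    {"name": "total_kb", "script": lambda row: row["total"] / 1024},
]
row = build_row(columns, {"bytes": 2048})
```

Here "total" can read "bytes" and the defaulted "count" even though it is defined before them, and "total_kb" can read "total" because that scripted column is defined earlier.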

Handling compression

The compression property describes one or more compression algorithms that Hydrolix should expect to find already applied to the data package as a whole, and which it will need to decompress before working with the data.

For example, setting the transform’s compression property to "gzip" means that you expect the source data, in its entirety, to have had the GZIP compression method applied to it prior to its receipt by Hydrolix.

Recognized compression algorithms

Valid values for the compression property include the following:

Value | Meaning
gzip | Content is compressed via gzip (LZ77 with 32-bit CRC).
zip | Content is ZIP-encoded via zlib (RFC 1950).
deflate | Content is encoded in the zlib structure, using the deflate compression algorithm.
bzip2 | Content is compressed with the bzip2 algorithm.
none | Content is not compressed. (Equivalent to not specifying compression at all.)

Note that, in streaming ingestion, the request may indicate compression through its content-encoding header, while the data it carries may separately have its own compression applied.

Handling multiple compression layers

To define multiple layers of compression, specify them in a comma-and-space-separated list:

"compression": "gzip, bzip2, zip"

The order matters: Hydrolix will attempt to apply decompression algorithms in the order specified, right-to-left.

In the above example, Hydrolix would apply zlib decompression to all received data, then apply bzip2 decompression to the result, and finish with gzip decompression. Put another way, the list names the layers in the order they were applied to the data, with the first-applied layer on the left.
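The right-to-left rule can be tried out with Python’s standard library. This is a sketch under assumptions, not Hydrolix’s code: the mapping of "zip" to zlib follows the description in the table of recognized algorithms, and the function name unwrap is hypothetical.

```python
import bz2
import gzip
import zlib

# Assumed mapping from compression names to Python decompressors; "zip" maps
# to zlib because the table above describes it as zlib (RFC 1950).
DECOMPRESSORS = {
    "gzip": gzip.decompress,
    "bzip2": bz2.decompress,
    "zip": zlib.decompress,
    "none": lambda data: data,
}


def unwrap(data, compression):
    """Undo each listed compression layer, working right-to-left."""
    layers = [name.strip() for name in compression.split(",")]
    for name in reversed(layers):
        data = DECOMPRESSORS[name](data)
    return data


original = b"timestamp\tthe_data\n"
# Layers applied left-to-right: gzip first, then bzip2, then zlib outermost.
wrapped = zlib.compress(bz2.compress(gzip.compress(original)))
```

Calling unwrap(wrapped, "gzip, bzip2, zip") recovers the original bytes, whereas reversing the list would fail at the first step because the outermost layer is zlib, not gzip.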