Transform Structure

A transform (or write schema) is a JSON document that configures how Hydrolix will index and store incoming data onto a given table. A transform's content includes a general description of the incoming data's format, as well as a description of every column that data will produce and populate in the target table.

The following example transform would let a Hydrolix table ingest a GZIP-compressed, CSV-formatted data stream with two columns:

{
    "name": "my_special_transform",
        "description": "description of my data",
    "type": "csv",
    "settings": {
        "output_columns": [
            {
                "name": "timestamp",
                "position": 0,
                "datatype": {
                    "type": "datetime",
                    "primary": true,
                    "format": "2006-01-02 15:04:05 MST"
                }
            },
            {
                "name": "the_data",
                "position": 1,
                "datatype": {
                    "type": "uint64"
                }
            }
        ],
        "compression": "gzip",
        "format_details": {
            "skip_head": 1,
            "delimiter": "\t"
        }
    }
}

Basic properties

Transform documents have the following top-level properties:

Property

Purpose

name

A name for this transform; must be unique within its table.

description

An optional description of this transform.

type

The file type of the input data this transform handles. One of csv, or json.

settings

A JSON object describing the data to ingest.

Settings

"settings": {
    "is_default": true,
    "compression": "gzip",
    "sql_transform": "select ..... from {STREAM}",
    "null_values" : ["^","-"],
    "format_details": {
        // Configuration specific to the data's overall 
        // format goes here
    },
    "output_columns": [
        // A list of column-definition objects goes here
    ]
}

The settings object contains the details of the content and structure of the data. This is where the data input format, compression, and resulting columns are defined.

Property

Purpose

Required

is_default

If set to true, then this transform's table will apply this transform to a streaming ingest that doesn't otherwise specify a transform. Default is required for auto-ingest

Y

output_columns

An array of JSON objects with a description of every column in the target table that the incoming data will populate. See Column definitions|

Y

compression

A string describing the compression applied to the data (or "none", the default). See Message Encoding & Compression

Y

format_details

A JSON object containing configuration information specific to the data's overall format, just as field-delimiter characters with CSV files. See Message Encoding & Compression

Y

sql_transform

A SQL statement runs at ingest allowing users to aggregate, enrich, modify and/or filter the data.
See Real-Time Enrichment and filtering at Ingest and Summary Tables (Aggregation)

N

null_values

An array of string which will be replaced by null if found in the value.
For example ["^","-"] will replace ^ OR - by null on any column.

N

Settings: Output Columns

The core of a transform is the Output Columns section under settings. This section is where you define what the data looks like and how it should be loaded into the Hydrolix platform. These definitions are used to specify Data-Types, indexing and other characteristics your data should have when it is being written into the HDX format.

"settings": {
   "output_columns": [
        {
            "name": "timestamp",
            "position": 0,
            "datatype": {
                "type": "datetime",
                "primary": true,
                "format": "2006-01-02 15:04:05 MST"
            }
        },
        {
            "name": "data",
            "position": 1,
            "datatype": {
                "type": "uint64"
            }
        },
        {
            ..........
    ],
    
    // Other "settings" properties go here

}

A transform must contain one object in its output_columns list for every column in the target table that its ingests.
This includes the table's primary timestamp column, as well as any column that will copy a new value based directly on incoming data. The transform enables Hydrolix to copy these values into the target table by specifying a positional (CSV) or name-based (JSON) mapping between the incoming data fields and the table's columns.

The transform can also set columns to derive a new value based on a JavaScript expression, or a simple default.

The properties supported by column definition objects include the following:

Attribute

Purpose

Required

name

A name for the Column, for use by subsequent SQL queries.

Yes

position

Column position in the source CSV data, starting with 0.

Yes (CSV encoding)
No (JSON encoding)

datatype

An object containing detailed column definition.

Yes

The datatype property requires an object as its value in order to allow recursive definitions, as described in arrays. The properties supported by datatype include the following:

👍

Primary

Every table has to have a primary field set, this is a datetime/epoch field that is used as a primary index that the system will partition files by.

Attribute

Purpose

Required

default

A default value to apply when the source data's value for this column is empty

No

elements

A description of the constituent elements of array-type data. See arrays.

No, except for array and map datatypes.

format

Parsing information for datetime and epoch data types. See Timestamp formatting.

No, except for datetime and epoch.

index

Enable indexing on the column. Default to true (indexed), except datatype double.

No.

primary

If true, then this marks the target table's single, primary field. It must have either datetime or epoch type.

Exactly one per transform.

resolution

Set to "ms" (milliseconds) or "s" (seconds) to define the time-granularity that Hydrolix will apply when storing this column's datetime or epoch-type data. Defaults to "s" (seconds).

No.

script

A JavaScript expression whose output becomes the stored value of this column. See Scripting & Auto Values.

No.

source

Used with JSON encoded files, if specified this is used as the source column name. When data is then written to the database the column will be called by the "name" value above. It is possible to specify multiple source columns, the first value found for the column is the one that is stored. This is best used when there is an OR requirement between two columns (e.g. If column A has a value OR column B has a value then store it in this column).

No

type

The data type of the field.

Yes

virtual

If true, then Hydrolix will not map this column to any incoming data. Useful for constant or derived data when paired with the default or script settings, respectively.

No

catch_all

Primarily used with JSON encoded datasets. If true, then Hydrolix will put any data that is not matched in JSON into this field as a string.

No

null_values

Array of string, which are replaced by a null.
Example:
["^","-"] will replace ^ OR - by null.

No

ignore

Boolean to specify if the field is ignore or not by indexer

No

fulltext

Boolean to specify if the column is indexed using full text search. Enable only for string type, and requires to set index to false.

No


Did this page help you?