Transform Structure

A transform determines how Hydrolix will process, index, and store incoming data into a given table.

The transform JSON document describes the shape and format of incoming fields, controls ingestion features and behavior, and determines the output data columns for insertion into the target table.

Each Hydrolix table can have multiple transforms to receive differently formatted files or streams. Only one transform may serve as the default for a table. The default transform will be used for any data ingested to the table that doesn't have a transform explicitly specified.

Shape of a transform⚓︎

Each transform document consists of four major, nested parts.

  • Basic properties is the main JSON object
    • Settings describes input handling and controls ingestion properties and some features
      • Output columns describes the details of field-to-column transformation
        • Datatypes provides a detailed specification of each column's data

Basic properties⚓︎

Transform documents have the following top-level properties:

| Property | Type | Purpose | Default | Required |
|----------|------|---------|---------|----------|
| name | string | A name for this transform; must be unique within its table | | Yes |
| description | string | An optional description | | No |
| type | string | Input file type handled by this transform. One of csv or json | | Yes |
| settings | object | See Settings | | Yes |
```json
{
    "name": "my_special_transform",
    "description": "description of my data",
    "type": "csv",
    "settings": {
        // See next section
    }
}
```

See also the complete minimal example.

Settings⚓︎

The settings object specifies transform-level behavior: the input data format, compression, rate limiting, and optional sample data for testing and validating the transform.

| Property | Type | Purpose | Default | Required |
|----------|------|---------|---------|----------|
| is_default | boolean | See Multiple transforms | false | Yes |
| output_columns | array | See Column definitions | | Yes |
| format_details | object | Incoming data format description. See Format Options | | Yes |
| compression | string | Conveys content-level compression. See Compression | none | Yes |
| null_values | array | Of type string. See Null handling and default values | | No |
| sql_transform | string | SQL used during data enrichment. See SQL transforms | | No |
| rate_limit | object | Limits bytes-per-second ingest rate and maximum payload size. See Rate Limits | no limit | No |
| wurfl | string | See WURFL Integration | | No |
| sample_data | string | See Test and Validate Transforms | | No |

Settings block example⚓︎

"settings": {
    "is_default": true,
    "compression": "gzip",
    "sql_transform": "select ..... from {STREAM}",
    "null_values" : ["^","-"],
    "format_details": {
        // Configuration specific to the data's overall 
        // format goes here
    },
    "output_columns": [
        // A list of column-definition objects goes here
    ]
}
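As one illustration, the format_details object for a csv transform could specify a delimiter and a header line to skip. This sketch reuses the skip_head and delimiter properties from the complete minimal example below:

```json
"format_details": {
    "skip_head": 1,
    "delimiter": "\t"
}
```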

See also Data Enrichment and Summary Tables (Aggregation).

Output columns⚓︎

The output_columns section of the document is the core of a transform. It maps incoming data fields to table columns.

Select from the several supported Timestamp Data Types and be sure to mark one field as primary. If the data lack an event timestamp, use receive_time from Auto Values.

primary must be true on exactly one column's datatype definition

Each transform must specify one primary column of type datetime or epoch. This column must be indexed, can't be nullable, and serves as the partitioning key.

Output columns block example⚓︎

"output_columns": [
    {
        "name": "timestamp",
        "datatype": {
            "type": "datetime",
            "primary": true,
            "format": "2006-01-02 15:04:05 MST",
            "source": { "from_input_index": 0 }
        }
    },
    {
        "name": "the_data",
        "datatype": {
            "type": "uint64",
            "source": { "from_input_index": 1 }
        }
    }
}

Datatypes⚓︎

The datatype property in an output_column is itself an object, allowing specification of Complex Data Types and other type-specific characteristics of a column. Each column must include a datatype definition, which supports the following properties.

| Property | Type | Purpose | Default | Required |
|----------|------|---------|---------|----------|
| type | string | A type name, for example string, boolean, ip, uuid, int32, int8, epoch, or datetime | | Yes |
| index | boolean | See Indexability of types below | true | No |
| index_options | object | Configure indexing behavior. See Full-Text Search | | No |
| primary | boolean | Exactly one column must be primary. See Timestamp Data Types | false | Once per transform |
| format | string | Parsed time format. See Format and Resolution | | Only for datetime and epoch |
| resolution | string | Stored time resolution. See Format and Resolution | seconds | Only for datetime and epoch |
| elements | object | See Complex Data Types | | Only for array and map |
| source | object | See Datatype source | | Format-specific |
| limits | object | Constrain acceptable values. See Limiters | | No |
| script | string | Derive column values using JavaScript expressions. See Scripting | | No |
| null_values | array | Of type string. See Null handling and default values | | No |
| default | (matches type) | The type of the default value depends on the specified type | | No |
| denullify | boolean | Convert stored NULLs on this column to empty values in the automatic view | false | No |
| ignore | boolean | Omit the input field from all ingestion stages. See Defensive handling | false | No |
| suppress | boolean | Don't pass this field through to indexing and storage. See Defensive handling | false | No |
| virtual | boolean | Mark that this field accepts constructed values. See Defensive handling | false | No |
| catch_all | boolean | Store all unknown fields in this column. See Catch features | false | No |
| catch_rejects | boolean | Use this column for rejects. See Catch features | false | No |

Indexability of types

In the absence of the index field in a transform specification, Hydrolix will default to true. Most, but not all, data types are indexable. See Basic Data Types, Timestamp Data Types, and Complex Data Types for specifics.
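For instance, a column that's rarely filtered on can opt out of indexing. A minimal sketch using only the properties documented above (the column name is hypothetical):

```json
{
    "name": "raw_payload",
    "datatype": {
        "type": "string",
        "index": false,
        "source": { "from_input_field": "raw_payload" }
    }
}
```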

Index options⚓︎

The index_options object configures specialized indexing behavior for a column. Set index_options.fulltext to true to enable Full-Text Search on string columns for improved pattern matching performance.

```json
{
  "name": "message",
  "datatype": {
    "type": "string",
    "index": true,
    "index_options": {
      "fulltext": true
    }
  }
}
```

Datatype block examples⚓︎

"datatype": {
    "type": "epoch",
    "primary": true,
    "format": "2006-01-02 15:04:05 MST",
    "source": { "from_input_index": 0 }
}
```json
"datatype": {
    "type": "datetime",
    "primary": true,
    "format": "2006-01-02T15:04:05.999999Z",
    "resolution": "ms",
    "source": { "from_input_field": "timestamp" }
}
```

Ingestion sequence⚓︎

To clean and enrich data, use name-based (JSON) mapping, CSV positional indexing, Auto Values, JavaScript Scripting, or more advanced ingestion and data manipulation features.

The next several sections parallel the stages of transform execution, showing the relationship between transform properties and processing order.

Immutable column types⚓︎

A column's type is immutable upon creation. Type mismatch on ingestion is an error and causes Hydrolix to reject values (see Catch features for coping with rejects).

All subsequent transforms on the same table, whether revisions or new, must specify matching values for type, index, and primary for that column. Avoid this kind of conflict by using different column names to support different types.

For example, consider a scenario in which different data streams each have a field named country, but the value may be an ISO-3166 code, a country name string, or a numeric identifier. Rather than forcing these into one column, define multiple output_columns of different types: country_code as string, country_name as string, and country_id as int32. Transform A could accept country into the country_name column while looking up the corresponding country_id and country_code values. Other Hydrolix ingestion control features, such as Custom Dictionaries or Auto Values, can help with such lookups.
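A hedged sketch of that column layout (the two derived columns are left without a source here, on the assumption they're populated by Custom Dictionaries, Scripting, or a SQL transform in a later stage):

```json
"output_columns": [
    {
        "name": "country_name",
        "datatype": { "type": "string", "source": { "from_input_field": "country" } }
    },
    {
        "name": "country_code",
        "datatype": { "type": "string" }
    },
    {
        "name": "country_id",
        "datatype": { "type": "int32" }
    }
]
```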

Null handling and default values⚓︎

The transform object offers control over handling empty values, both at the transform level and at finer granularity for individual columns.

Supply a list of null_values either on the settings object, where it applies to all columns, or on a specific output_column.

The null_values apply before default.

During initial basic data validation checks, incoming fields are converted to nulls using the union of the transform-level null_values (settings object) and each column's null_values, after which nulls may be replaced with the column-specified default.

For example, the following replaces incoming ^ or - values with null on any column:

    "null_values": [
      "-",
      "^"
    ],
    // incoming values of "^" or "-" will be replaced with null for further processing
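At the column level, null_values often pairs with default so that nulls become a concrete value. A minimal sketch, with a hypothetical column name:

```json
{
    "name": "bytes_sent",
    "datatype": {
        "type": "uint64",
        "null_values": ["-"],
        "default": 0,
        "source": { "from_input_index": 2 }
    }
}
```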

To control how Hydrolix retrieves null values, use the denullify property. See Views and denullify.

Limits, scripting, and auto values⚓︎

After all incoming data fields have been converted to the types described in their output_columns, limit checking occurs. Any failure in type conversion or limit checking may result in rejects. See Catch features.

Next, Hydrolix evaluates each output column that carries a script or an Auto Values source instruction.
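As an illustration, one column might constrain acceptable values with limits while another derives its value from a script. The min/max keys and the expression below are assumptions made for this sketch; see Limiters and Scripting for the supported options:

```json
"output_columns": [
    {
        "name": "response_ms",
        "datatype": {
            "type": "uint64",
            "limits": { "min": 0, "max": 60000 }, // hypothetical limit keys; see Limiters
            "source": { "from_input_field": "response_ms" }
        }
    },
    {
        "name": "is_slow",
        "datatype": {
            "type": "boolean",
            "script": "response_ms > 1000" // hypothetical expression; see Scripting
        }
    }
]
```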

SQL transforms⚓︎

The data enrichment stage uses SQL to operate on the incoming data stream and produce additional fields.

In this last phase before index and partition generation, the SELECT clause populates output_columns with values computed from the data stream after the prior stages have completed. By this point in ingestion processing, all column typing, default-setting, scripting, and auto value generation is done.
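For instance, a sql_transform could add an enrichment column derived from values computed in earlier stages. The {STREAM} token appears in the settings example above; the expression itself is a hypothetical sketch, and any column the SELECT produces must still map to a defined output column:

```json
"sql_transform": "SELECT *, upper(country_code) AS country_code_uc FROM {STREAM}"
```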

For more detail on this advanced feature, see Data Enrichment, Custom Functions, and Custom Dictionaries.

General guidance⚓︎

Transforms are the primary tool for converting incoming data fields into values stored into a table's columns.

  • Select a single input field (or construct one using script) to serve as the primary timestamp of the table and set a format. For storage, the resolution setting is optional and defaults to seconds.
  • Define a data type for every field stored to the table. Use suppress to keep a field through the intake process but omit it from the resulting table.
  • To store NULLs, use null_values at the settings level to convert empty values to null across the entire incoming data stream, or on specific output columns for finer control. See also Null handling and default values.
  • Use the catch_all and catch_rejects properties to store unknown or invalid data directly in a table, rather than generating Reject files into the storage bucket.
  • Control data value ranges on specific fields by using the limits property.
  • Remember that a single transform on a table is implicitly the default.

Multiple transforms⚓︎

Only one transform on a table may set is_default to true. If set to true, then this transform applies to any streaming ingest for that table that doesn't otherwise specify a transform. The autoingest feature can only use the default transform. Streams, however, can specify the desired transform using the HTTP header x-hdx-transform.

Defensive handling⚓︎

The less-common properties ignore, suppress, and virtual offer control over visibility of values during transform execution.

Use ignore to explicitly remove the field in the first stage, basic data validation checks. It will be unavailable during any further processing and won't be present in the table. A more common solution than using ignore is to avoid creating an output_column for the field. Since a transform only inserts data into existing output_columns, omitting a field's name from output_columns is sufficient. The ignore option is more explicit.

Use suppress to keep a field's value during the early stages of transform execution, but withhold it from the indexing stage and the subsequent write to a table partition. This prevents Hydrolix from writing the incoming field values to the table. Note that this occurs after the SQL Transform stage and before the indexing stage.

Use virtual to hide any incoming field with the same name as an output column. This behaves similarly to ignore, but supports columns derived from Scripting, SQL Transforms, and default values. Marking a derived column virtual prevents a future addition of an identically named field in the input stream from taking precedence over the default.
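A hedged sketch of both properties, with hypothetical column names: session_token survives long enough for earlier stages (for example, a SQL transform) to use it but is never stored, while ingest_source always carries its default even if an identically named field later appears in the input:

```json
"output_columns": [
    {
        "name": "session_token",
        "datatype": {
            "type": "string",
            "suppress": true,
            "source": { "from_input_field": "session_token" }
        }
    },
    {
        "name": "ingest_source",
        "datatype": {
            "type": "string",
            "virtual": true,
            "default": "stream"
        }
    }
]
```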

Catch features⚓︎

Two features help handle changing formats of incoming data.

Each transform can earmark a specific column with catch_all to receive incoming fields that lack any output column definition. This collects unknown fields directly, rather than dropping the data.

When incoming values trigger Limiters or don't match a column's datatype, the catch_rejects column stores the rejected data in a table column rather than generating a Rejects file to the storage system.

The catch features retain incoming fields and data in the table for applications or analysts.

Examples:
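A hedged sketch dedicating one column to each feature, assuming string is an acceptable type for both (column names hypothetical):

```json
"output_columns": [
    {
        "name": "extra_fields",
        "datatype": {
            "type": "string",
            "catch_all": true
        }
    },
    {
        "name": "rejected_values",
        "datatype": {
            "type": "string",
            "catch_rejects": true
        }
    }
]
```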

Views and denullify⚓︎

Hydrolix doesn't apply the denullify property during ingestion. When processing incoming data, Hydrolix stores a NULL in the raw partition.

The attribute applies to the automatically created view, which combines all transforms on a table. See also Custom Views.

Warning: Last update wins.

The automatically created view respects the value of denullify from the most recently modified transform. When using the denullify property, use caution when working with multiple transforms.

Complete minimal example⚓︎

This example transform enables a Hydrolix table to ingest a GZIP-compressed, CSV-formatted data stream.

This example consumes tab-separated data with a single header line, reading only the first two fields and creating them as columns. The first field (index 0) becomes the primary column, serves as the partitioning key, and holds a datetime that includes a time zone name. The second field is a large unsigned integer, uint64.

```json
{
    "name": "my_special_transform",
    "description": "description of my data",
    "type": "csv",
    "settings": {
        "output_columns": [
            {
                "name": "timestamp",
                "datatype": {
                    "type": "datetime",
                    "primary": true,
                    "format": "2006-01-02 15:04:05 MST",
                    "source": { "from_input_index": 0 }
                }
            },
            {
                "name": "the_data",
                "datatype": {
                    "type": "uint64",
                    "source": { "from_input_index": 1 }
                }
            }
        ],
        "compression": "gzip",
        "format_details": {
            "skip_head": 1,
            "delimiter": "\t"
        }
    }
}
```

Add calculated columns⚓︎

See Calculated Columns to learn how to simplify queries and unify data.