Transform Structure

A transform determines how Hydrolix will process, index, and store incoming data into a given table.

The transform JSON document describes the shape and format of incoming fields, controls ingestion features and behavior, and determines the output data columns for insertion into the target table.

Each Hydrolix table can have multiple transforms to receive differently formatted files or streams. Only one transform may serve as the default for a table. The default transform will be used for any data ingested to the table that doesn't have a transform explicitly specified.

Shape of a transform⚓︎

Each transform document consists of four major, nested parts.

  • Basic properties is the main JSON object
    • Settings describes input handling and controls ingestion properties and some features
      • Output columns describes the details of field-to-column transformation
        • Datatypes provides a detailed specification of each column's data

Basic properties⚓︎

Transform documents have the following top-level properties:

| Property | Type | Purpose | Default | Required |
|----------|------|---------|---------|----------|
| name | string | A name for this transform; must be unique within its table | | Yes |
| description | string | An optional description | | No |
| type | string | Input file type handled by this transform. One of csv or json | | Yes |
| settings | object | See Settings | | Yes |
```json
{
    "name": "my_special_transform",
    "description": "description of my data",
    "type": "csv",
    "settings": {
        // See next section
    }
}
```

See also the complete minimal example.

Settings⚓︎

The settings object specifies transform-level behavior: the input data format, compression, rate limiting, and optional sample data for testing and validating the transform.

| Property | Type | Purpose | Default | Required |
|----------|------|---------|---------|----------|
| is_default | boolean | See Multiple transforms | false | Yes |
| output_columns | array | See Column definitions | | Yes |
| format_details | object | Incoming data format description. See Format Options | | Yes |
| compression | string | Conveys content-level compression. See Compression | none | Yes |
| null_values | array | Of type string. See Null handling and default values | | No |
| sql_transform | string | SQL used during data enrichment. See SQL transforms | | No |
| rate_limit | object | Limits bytes-per-second ingest rate and maximum payload size. See Rate Limits | no limit | No |
| wurfl | string | See WURFL Integration | | No |
| sample_data | string | See Test and Validate Transforms | | No |

Settings block example⚓︎

"settings": {
    "is_default": true,
    "compression": "gzip",
    "sql_transform": "select ..... from {STREAM}",
    "null_values" : ["^","-"],
    "format_details": {
        // Configuration specific to the data's overall 
        // format goes here
    },
    "output_columns": [
        // A list of column-definition objects goes here
    ]
}
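As one illustration, the format_details object for a csv transform could specify a delimiter and a header line to skip. This sketch reuses the skip_head and delimiter properties from the complete minimal example below:

```json
"format_details": {
    "skip_head": 1,
    "delimiter": "\t"
}
```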

See also Data Enrichment and Summary Tables (Aggregation).

Output columns⚓︎

The output_columns section of the document is the core of a transform. It maps incoming data fields to table columns.

Select from the several supported Timestamp Data Types and be sure to mark one field as primary. If the data lack an event timestamp, use receive_time from Auto Values.

primary must be true on exactly one column's datatype definition

Each transform must specify one primary column of type datetime or epoch. This column must be indexed, can't be nullable, and serves as the partitioning key.

Output columns block example⚓︎

"output_columns": [
    {
        "name": "timestamp",
        "datatype": {
            "type": "datetime",
            "primary": true,
            "format": "2006-01-02 15:04:05 MST",
            "source": { "from_input_index": 0 }
        }
    },
    {
        "name": "the_data",
        "datatype": {
            "type": "uint64",
            "source": { "from_input_index": 1 }
        }
    }
}

Datatypes⚓︎

The datatype property in an output_column is itself an object, allowing specification of Complex Data Types and other type-specific characteristics of a column. Each column must include a datatype definition, which supports the following properties.

| Property | Type | Purpose | Default | Required |
|----------|------|---------|---------|----------|
| type | string | A type name, for example string, boolean, ip, uuid, int32, int8, epoch, or datetime | | Yes |
| index | boolean | See Indexability of types below | true | No |
| index_options | object | Configure indexing behavior. See Full-Text Search | | No |
| primary | boolean | Exactly one column must be primary. See Timestamp Data Types | false | Once per transform |
| format | string | Parsed time format. See Format and Resolution | | Only for datetime and epoch |
| resolution | string | Stored time resolution. See Format and Resolution | seconds | Only for datetime and epoch |
| elements | object | See Complex Data Types | | Only for array and map |
| source | object | See Datatype source | | Format-specific |
| limits | object | Constrain acceptable values. See Limiters | | No |
| script | string | Derive column values using JavaScript expressions. See Scripting | | No |
| null_values | array | Of type string. See Null handling and default values | | No |
| default | (matches type) | The type of the default value depends on the specified type | | No |
| denullify | boolean | Convert stored NULLs on this column to empty values in the automatic view | false | No |
| ignore | boolean | Omit the input field from all ingestion stages. See Defensive handling | false | No |
| suppress | boolean | Don't pass this field through to indexing and storage. See Defensive handling | false | No |
| virtual | boolean | Mark that this field accepts constructed values. See Defensive handling | false | No |
| catch_all | boolean | Store all unknown fields in this column. See Catch features | false | No |
| catch_rejects | boolean | Use this column for rejects. See Catch features | false | No |

Indexability of types

In the absence of the index field in a transform specification, Hydrolix will default to true. Most, but not all, data types are indexable. See Basic Data Types, Timestamp Data Types, and Complex Data Types for specifics.
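For instance, a column that's rarely filtered on can opt out of indexing. A minimal sketch using only the properties documented above (the column name is hypothetical):

```json
{
    "name": "raw_payload",
    "datatype": {
        "type": "string",
        "index": false,
        "source": { "from_input_field": "raw_payload" }
    }
}
```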

Index options⚓︎

The index_options object configures specialized indexing behavior for a column. Set index_options.fulltext to true to enable Full-Text Search on string columns for improved pattern matching performance.

```json
{
  "name": "message",
  "datatype": {
    "type": "string",
    "index": true,
    "index_options": {
      "fulltext": true
    }
  }
}
```

Datatype block examples⚓︎

"datatype": {
    "type": "epoch",
    "primary": true,
    "format": "2006-01-02 15:04:05 MST",
    "source": { "from_input_index": 0 }
}
```json
"datatype": {
    "type": "datetime",
    "primary": true,
    "format": "2006-01-02T15:04:05.999999Z",
    "resolution": "ms",
    "source": { "from_input_field": "timestamp" }
}
```

Ingestion sequence⚓︎

To clean and enrich data, use name-based (JSON) mapping, CSV positional indexing, Auto Values, JavaScript Scripting, or more advanced ingestion and data manipulation features.

The next several sections parallel the stages of transform execution, showing the relationship between transform properties and processing order.

Immutable column types⚓︎

A column's type is immutable upon creation. Type mismatch on ingestion is an error and causes Hydrolix to reject values (see Catch features for coping with rejects).

All subsequent transforms on the same table, whether revisions or new, must specify matching values for type, index, and primary for that column. Avoid this kind of conflict by using different column names to support different types.

For example, consider a scenario in which different data streams each have a field named country, but the value may be an ISO-3166 code, a country name string, or a numeric identifier. Rather than forcing these into one column, define multiple output_columns of different types: country_code as string, country_name as string, and country_id as int32. Transform A could accept country into the country_name column while looking up the corresponding country_id and country_code values. Other Hydrolix ingestion control features, such as Custom Dictionaries or Auto Values, can help with such lookups.
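A hedged sketch of that column layout (the two derived columns are left without a source here, on the assumption they're populated by Custom Dictionaries, Scripting, or a SQL transform in a later stage):

```json
"output_columns": [
    {
        "name": "country_name",
        "datatype": { "type": "string", "source": { "from_input_field": "country" } }
    },
    {
        "name": "country_code",
        "datatype": { "type": "string" }
    },
    {
        "name": "country_id",
        "datatype": { "type": "int32" }
    }
]
```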

Null handling and default values⚓︎

The transform object offers control over handling empty values, both at the transform level and at finer granularity for individual columns.

Supply a list of null_values either on the settings object, where it applies to all columns, or on a specific output_column.

The null_values apply before default.

During initial basic data validation checks, incoming fields are converted to nulls using the union of the transform-level null_values (settings object) and each column's null_values, after which nulls may be replaced with the column-specified default.

For example, the following replaces incoming ^ or - values with null on any column:

    "null_values": [
      "-",
      "^"
    ],
    // incoming values of "^" or "-" will be replaced with null for further processing
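At the column level, null_values often pairs with default so that nulls become a concrete value. A minimal sketch, with a hypothetical column name:

```json
{
    "name": "bytes_sent",
    "datatype": {
        "type": "uint64",
        "null_values": ["-"],
        "default": 0,
        "source": { "from_input_index": 2 }
    }
}
```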

To control how Hydrolix retrieves null values, use the denullify property. See Views and denullify.

Limits, scripting, and auto values⚓︎

After all incoming data fields have been converted to the types described in their output_columns, limit checking occurs. Any failure in type conversion or limit checking may result in rejects. See Catch features.

Next, Hydrolix evaluates each output column that carries a script or an Auto Values source instruction.
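As an illustration, one column might constrain acceptable values with limits while another derives its value from a script. The min/max keys and the expression below are assumptions made for this sketch; see Limiters and Scripting for the supported options:

```json
"output_columns": [
    {
        "name": "response_ms",
        "datatype": {
            "type": "uint64",
            "limits": { "min": 0, "max": 60000 }, // hypothetical limit keys; see Limiters
            "source": { "from_input_field": "response_ms" }
        }
    },
    {
        "name": "is_slow",
        "datatype": {
            "type": "boolean",
            "script": "response_ms > 1000" // hypothetical expression; see Scripting
        }
    }
]
```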

SQL transforms⚓︎

The data enrichment stage uses SQL to operate on the incoming data stream and produce additional fields.

In this last phase before index and partition generation, the SELECT clause populates output_columns with values computed from the data stream after the prior stages have completed. By this point in ingestion processing, all column typing, default-setting, scripting, and auto value generation is done.
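For instance, a sql_transform could add an enrichment column derived from values computed in earlier stages. The {STREAM} token appears in the settings example above; the expression itself is a hypothetical sketch, and any column the SELECT produces must still map to a defined output column:

```json
"sql_transform": "SELECT *, upper(country_code) AS country_code_uc FROM {STREAM}"
```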

For more detail on this advanced feature, see Data Enrichment, Custom Functions, and Custom Dictionaries.

General guidance⚓︎

Transforms are the primary tool for converting incoming data fields into values stored into a table's columns.

  • Select a single input field (or construct one using script) to serve as the primary timestamp of the table and set a format. For storage, the resolution setting is optional and defaults to seconds.
  • Define a data type for every field stored to the table. Use suppress to keep a field through the intake process but omit it from the resulting table.
  • To store NULLs, use null_values at the settings level to convert empty values to null across the entire incoming data stream, or on specific output columns for finer control. See also Null handling and default values.
  • Use the catch_all and catch_rejects properties to store unknown or invalid data directly in a table, rather than generating Reject files into the storage bucket.
  • Control data value ranges on specific fields by using the limits property.
  • Remember that a single transform on a table is implicitly the default.

Multiple transforms⚓︎

Only one transform on a table may set is_default to true. If set to true, then this transform applies to any streaming ingest for that table that doesn't otherwise specify a transform. The autoingest feature can only use the default transform. Streams, however, can specify the desired transform using the HTTP header x-hdx-transform.

Defensive handling⚓︎

The less-common properties ignore, suppress, and virtual offer control over visibility of values during transform execution.

Use ignore to explicitly remove the field in the first stage, basic data validation checks. It will be unavailable during any further processing and won't be present in the table. A more common solution than using ignore is to avoid creating an output_column for the field. Since a transform only inserts data into existing output_columns, omitting a field's name from output_columns is sufficient. The ignore option is more explicit.

Use suppress to keep a field's value during the early stages of transform execution, but withhold it from the indexing stage and the subsequent write to a table partition. This prevents Hydrolix from writing the incoming field values to the table. Note that this occurs after the SQL Transform stage and before the indexing stage.

Use virtual to hide any incoming field with the same name as an output column. This behaves similarly to ignore, but supports columns derived from Scripting, SQL Transforms, and default values. Marking a derived column virtual prevents a future addition of an identically named field in the input stream from taking precedence over the default.
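A hedged sketch of both properties, with hypothetical column names: session_token survives long enough for earlier stages (for example, a SQL transform) to use it but is never stored, while ingest_source always carries its default even if an identically named field later appears in the input:

```json
"output_columns": [
    {
        "name": "session_token",
        "datatype": {
            "type": "string",
            "suppress": true,
            "source": { "from_input_field": "session_token" }
        }
    },
    {
        "name": "ingest_source",
        "datatype": {
            "type": "string",
            "virtual": true,
            "default": "stream"
        }
    }
]
```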

Catch features⚓︎

Two features help handle changing formats of incoming data.

Each transform can earmark a specific column with catch_all to receive incoming fields that lack any output column definition. This collects unknown fields directly, rather than dropping the data.

When incoming values trigger Limiters or don't match a column's datatype, the catch_rejects column stores the rejected data in a table column rather than generating a Rejects file to the storage system.

The catch features retain incoming fields and data in the table for applications or analysts.

Examples:
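A hedged sketch dedicating one column to each feature, assuming string is an acceptable type for both (column names hypothetical):

```json
"output_columns": [
    {
        "name": "extra_fields",
        "datatype": {
            "type": "string",
            "catch_all": true
        }
    },
    {
        "name": "rejected_values",
        "datatype": {
            "type": "string",
            "catch_rejects": true
        }
    }
]
```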

Views and denullify⚓︎

Hydrolix doesn't apply the denullify property during ingestion. When processing incoming data, Hydrolix stores a NULL in the raw partition.

The attribute applies to the automatically created view, which combines all transforms on a table. See also Custom Views.

Warning: Last update wins.

The automatically created view respects the value of denullify from the most recently modified transform. When using the denullify property, use caution when working with multiple transforms.

Complete minimal example⚓︎

This example transform enables a Hydrolix table to ingest a GZIP-compressed, CSV-formatted data stream.

This example consumes tab-separated data with a single header line, reading only the first two fields and creating them as columns. The first field (index 0) becomes the primary column, serves as the partitioning key, and holds a datetime that includes a time zone name. The second field is a large unsigned integer, uint64.

```json
{
    "name": "my_special_transform",
    "description": "description of my data",
    "type": "csv",
    "settings": {
        "output_columns": [
            {
                "name": "timestamp",
                "datatype": {
                    "type": "datetime",
                    "primary": true,
                    "format": "2006-01-02 15:04:05 MST",
                    "source": { "from_input_index": 0 }
                }
            },
            {
                "name": "the_data",
                "datatype": {
                    "type": "uint64",
                    "source": { "from_input_index": 1 }
                }
            }
        ],
        "compression": "gzip",
        "format_details": {
            "skip_head": 1,
            "delimiter": "\t"
        }
    }
}
```

Add calculated columns⚓︎

See Calculated Columns to learn how to simplify queries and unify data.