Transform Structure

A transform determines how Hydrolix will process, index and store incoming data into a given table.

The transform JSON document describes the shape and format of incoming fields, controls ingestion features and behavior, and determines output data columns for insertion into the target table.

For each Hydrolix table, you may create multiple transforms to receive differently formatted files or streams. Only one transform may serve as the default for a table. The default transform will be used for any data ingested to the table that doesn't have a transform explicitly specified.

Shape of a transform

Each transform document consists of four major, nested parts.

  • Basic properties is the main JSON object
    • Settings describe input handling and control ingestion properties and some features
      • Output columns describe details of field-to-column transformation
        • Datatypes provide a detailed specification of each column's data

Basic properties

Transform documents have the following top-level properties:

| Property | Type | Purpose | Default | Required |
| --- | --- | --- | --- | --- |
| name | string | A name for this transform; must be unique within its table | | Yes |
| description | string | An optional description | | No |
| type | string | Input file type handled by this transform. One of csv or json | | Yes |
| settings | object | See Settings | | Yes |

{
    "name": "my_special_transform",
    "description": "description of my data",
    "type": "csv",
    "settings": {
        // See next section
    }
}

See also the simple complete transform JSON.

Settings

The settings object specifies transform-level behavior, describing input data format, compression, rate limiting and, optionally, sample data for ease of use in testing and validating the transform's behavior.

| Property | Type | Purpose | Default | Required |
| --- | --- | --- | --- | --- |
| is_default | boolean | See Multiple transforms | false | Yes |
| output_columns | array | See Column definitions | | Yes |
| format_details | object | Incoming data format description. See Format Options | | Yes |
| compression | string | Convey content-level compression. See Compression | none | Yes |
| null_values | array | Of type string. See Null handling and default values | | No |
| sql_transform | string | SQL used during data enrichment. See SQL transforms | | No |
| rate_limit | object | Limits bytes-per-second ingest rate and max payload size. See Rate Limits | no limit | No |
| wurfl | string | See WURFL Integration | | No |
| sample_data | string | See Test and Validate Transforms | | No |

Settings block example

"settings": {
    "is_default": true,
    "compression": "gzip",
    "sql_transform": "select ..... from {STREAM}",
    "null_values" : ["^","-"],
    "format_details": {
        // Configuration specific to the data's overall 
        // format goes here
    },
    "output_columns": [
        // A list of column-definition objects goes here
    ]
}

See Data Enrichment and Summary Tables (Aggregation)

Output columns

The output_columns section of the document is the core of a transform. It maps incoming data fields to table columns.

Select from the several supported Timestamp Data Types and be sure to mark one field as primary. If your data lack an event timestamp, consider using receive_time from Auto Values.

❗️

primary must be true on exactly one column's datatype definition

Each transform must specify one primary column of type timestamp or epoch. This column must be indexed, isn't nullable, and serves as a partitioning key.

Output columns block example

"output_columns": [
    {
        "name": "timestamp",
        "datatype": {
            "type": "datetime",
            "primary": true,
            "format": "2006-01-02 15:04:05 MST",
            "source": { "from_input_index": 0 }
        }
    },
    {
        "name": "the_data",
        "datatype": {
            "type": "uint64",
            "source": { "from_input_index": 1 }
        }
    }
]

Datatypes

The datatype property in an output_column is an object itself, to allow for specification of Complex Data Types and type-specific characteristics of columns. Each column must include a datatype definition, which supports the following properties.

| Property | Type | Purpose | Default | Required |
| --- | --- | --- | --- | --- |
| type | string | A type name, for example string, boolean, int32, int8, epoch, or datetime | | Yes |
| index | boolean | See Indexability of types below | true | No |
| index_options | object | Use {"fulltext": true} to enable Full Text Search | | No |
| primary | boolean | Exactly one column must be primary. See Timestamp Data Types | false | Once per transform |
| format | string | Parsed time format. See Format and Resolution | | Only for datetime and epoch |
| resolution | string | Stored time resolution. See Format and Resolution | seconds | Only for datetime and epoch |
| elements | object | See Complex Data Types | | Only for array and map |
| source | object | See Datatype source | | Format-specific |
| limits | object | Constrain acceptable values. See Limiters | | No |
| script | string | Derive column values using JS expressions. See Scripting | | No |
| null_values | array | Of type string. See Null handling and default values | | No |
| default | (varies) | Default value substituted for nulls; its type depends on the column's specified type | | No |
| denullify | boolean | Convert stored nulls on this column to empty values in the automatic view | false | No |
| ignore | boolean | Omit input field from all ingestion stages. See Defensive handling | false | No |
| suppress | boolean | Don't pass this field through to indexing and storage. See Defensive handling | false | No |
| virtual | boolean | Mark that this field accepts constructed values. See Defensive handling | false | No |
| catch_all | boolean | Store all unknown fields in this column. See Catch features | false | No |
| catch_rejects | boolean | Use this column for rejects. See Catch features | false | No |

ℹ️

Indexability of types

In the absence of the index field in a transform specification, Hydrolix will default to true. Most, but not all, data types are indexable. See Basic Data Types, Timestamp Data Types, and Complex Data Types for specifics.

Datatype block examples

"datatype": {
    "type": "epoch",
    "primary": true,
    "format": "2006-01-02 15:04:05 MST",
    "source": { "from_input_index": 0 }
}
  {
    "type": "datetime",
    "primary": true,
    "format": "2006-01-02T15:04:05.999999Z",
    "resolution": "ms",
    "source": {
      "from_input_field": "timestamp"
    }
  }
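
The optional datatype properties follow the same pattern. As a hedged sketch, a string column that treats "-" as null and falls back to a default value might look like the following; the column and field names are illustrative, not part of the examples above.

"datatype": {
    "type": "string",
    // "-" in this field becomes null, then null is replaced by the default
    "null_values": ["-"],
    "default": "unknown",
    "source": { "from_input_field": "region" }
}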

Ingestion sequence

For cleaning and enriching your data, you can use name-based (JSON) mapping, CSV positional indexing, Auto Values, JavaScript Scripting, or more advanced ingestion and data manipulation features.

The next several sections parallel the stages of transform execution. The diagram may help illustrate the relationship between these properties and the processing order.

Immutable column types

A column's type is immutable upon creation. Type mismatch on ingestion is an error, which will result in rejected values (see Catch features for coping with rejects).

All subsequent transforms on the same table, whether revisions or new, must specify matching values for type, index, and primary for that column. Avoid this kind of conflict by using different column names to support different types.

For example, consider a scenario in which different data streams each have a field country, but the value might be an ISO-3166 code, a country name string, or a numeric identifier. Consider defining multiple output_columns of different types and using other ingestion features: country_code as string, country_name as string, and country_id as int32. Transform A could accept country into the country_name column, looking up the corresponding country_id and country_code values using other Hydrolix ingestion control features, such as Custom dictionaries or Auto values. A sketch of those columns follows.
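
As a sketch under those assumptions (the source field names are illustrative), the corresponding output_columns could be defined as:

"output_columns": [
    {
        "name": "country_code",
        "datatype": { "type": "string", "source": { "from_input_field": "country_code" } }
    },
    {
        "name": "country_name",
        "datatype": { "type": "string", "source": { "from_input_field": "country" } }
    },
    {
        "name": "country_id",
        "datatype": { "type": "int32", "source": { "from_input_field": "country_id" } }
    }
]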

Null handling and default values

The transform object offers control over the handling of empty values both at the transform level and, with finer granularity, for individual columns.

You can supply a list of null_values either on the settings object, where it applies to all columns, or on a specific output_column.

The null_values apply before default.

During the initial basic data validation checks, incoming field values are converted to null using the union of the transform-level null_values (settings object) and each column's own null_values. Nulls may then be replaced with the column-specified default.

For example, a null_values list of ["^", "-"] will replace ^ or - with null in any column.

    "null_values": [
      "-",
      "^"
    ],
    // incoming values of "^" or "-" will be replaced with null for further processing

If a column contains null values, you can influence retrieval using the denullify property. See Views and denullify.

Limits, scripting, and auto values

After all incoming data fields have been converted to the type described in their output_columns, the limits checking occurs. Any failures in type conversion or limit checking may result in rejects. See Catch features.

Next, the script or Auto Values source instruction runs for each output column that specifies one.

SQL transforms

A powerful data enrichment stage allows you to configure SQL to operate on the incoming data stream and produce additional fields.

In this last phase before index and partition generation, you can populate output_columns with data generated from the columns of a SELECT clause operating on the data stream. By this point in ingestion processing, all of the column typing, default-setting, scripting, and auto value generation from the prior stages is complete.

For more detail on this advanced feature, see Data Enrichment, Custom Functions, and Custom Dictionaries.
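
As a sketch only (the derived column name and SQL function are illustrative; see Data Enrichment for the supported syntax), an sql_transform that adds an enriched field to the stream might look like:

"settings": {
    // {STREAM} refers to the incoming data stream after the earlier stages have run
    "sql_transform": "SELECT *, upper(country_code) AS country_code_uc FROM {STREAM}"
}

Any column produced by the SELECT is then written to the matching output_column definition.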

General guidance

Transforms are the primary tool for converting incoming data fields into values stored into a table's columns.

  • Select a single input field (or construct one using script) to serve as the primary timestamp of the table and set a format. For storage, the resolution setting is optional and defaults to seconds.
  • Define a data type for every field stored to the table. Use suppress to keep a field through the intake process but omit it from the resulting table.
  • If you wish to store NULLs, use null_values at the settings level to convert empty values to null across the entire incoming data stream, or on specific output columns. See also Null handling and default values.
  • Use the catch_all and catch_rejects properties to store unknown or invalid data directly in a table, rather than generating Reject files in your storage bucket.
  • Control data value ranges on specific fields by using the limits property.
  • Remember that a single transform on a table is implicitly the default.

Multiple transforms

Only one transform on a table may set is_default to true. If set to true, then this transform applies to any streaming ingest for that table that doesn't otherwise specify a transform. The autoingest feature can only use the default transform. Streams, however, can specify the desired transform using the HTTP header x-hdx-transform.

Defensive handling

The less-common properties ignore, suppress, and virtual offer control over visibility of values during transform execution.

Use ignore to explicitly remove the field in the first stage, basic data validation checks. It will be unavailable during any further processing and won't be present in the table. A more common solution than using ignore is to avoid creating an output_column for the field. Since a transform only inserts data into existing output_columns, omission of a field's name from output_columns suffices. The ignore option is simply more explicit.

Use suppress to keep a field's value during early stages of the transform execution, but suppress the value before the indexing stage and subsequent writing to a table partition. This ensures the incoming field values don't get written to the table. Note, this occurs after the SQL Transform stage and before the indexing stage.

Use virtual to occlude any incoming field named the same as an output column. This behaves similarly to ignore, but supports derived columns from Scripting, SQL Transforms and default values. Marking a column virtual for derived columns can prevent a future addition of a field in the input stream from taking precedence over the default.
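
A sketch of how these flags appear in column definitions (the column names, types, and source fields below are illustrative):

"output_columns": [
    {
        // ignore: dropped during basic validation; never reaches later stages or the table
        "name": "debug_payload",
        "datatype": {
            "type": "string",
            "ignore": true,
            "source": { "from_input_field": "debug_payload" }
        }
    },
    {
        // suppress: available through the SQL transform stage, but not indexed or stored
        "name": "session_token",
        "datatype": {
            "type": "string",
            "suppress": true,
            "source": { "from_input_field": "session_token" }
        }
    },
    {
        // virtual: always takes the constructed value, even if a same-named input field appears later
        "name": "ingest_source",
        "datatype": {
            "type": "string",
            "virtual": true,
            "default": "stream_a"
        }
    }
]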

Catch features

There are two features available at ingestion to help deal with changing formats of incoming data.

Each transform can earmark a specific column with catch_all to receive incoming fields lacking any output column definition. This is a good way to collect unknown fields, rather than drop the data.

When incoming values trigger Limiters or when values don't match a column's datatype, the catch_rejects field allows a column to accumulate the bad data in the table rather than generate a Rejects file to the storage system.

The catch features retain incoming fields and data in the table for applications or analysts.

For example, a pair of dedicated catch columns might be defined as follows; the column names and the choice of type string are illustrative assumptions, not required values.
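
"output_columns": [
    {
        // unknown incoming fields are collected into this column
        "name": "extra_fields",
        "datatype": {
            "type": "string",
            "catch_all": true
        }
    },
    {
        // values rejected by limits or type conversion land here instead of a rejects file
        "name": "rejected_values",
        "datatype": {
            "type": "string",
            "catch_rejects": true
        }
    }
]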

Views and denullify

The denullify property isn't used during ingestion, but applies to the automatically created view. See also Custom Views.

When denullify is true, null values in a column are converted into a type-appropriate blank value at query time. The contents of the table partitions are completely unaffected by this setting; if partitions contain null values in the column, changing this setting will change query results even for partitions created long before.
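
A sketch of the property in a column definition (the column name is illustrative; for a string column, the blank value returned at query time would presumably be an empty string):

"datatype": {
    "type": "string",
    "denullify": true,
    "source": { "from_input_field": "referrer" }
}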

⚠️

Warning: Last update wins.

The automatically created view will respect the value of denullify from the most recently updated or created transform. When using the denullify property, pay attention to tables with multiple transforms.

Complete minimal example

The following example transform would let a Hydrolix table ingest a GZIP-compressed, CSV-formatted data stream.

In English, this describes tab-separated data with a single header line. Only the first two fields are consumed and created as columns. The first field (index 0) is the primary column, serves as the partitioning key, and holds a datetime that includes a time zone name. The second field is a large unsigned integer, uint64.

{
    "name": "my_special_transform",
    "description": "description of my data",
    "type": "csv",
    "settings": {
        "output_columns": [
            {
                "name": "timestamp",
                "datatype": {
                    "type": "datetime",
                    "primary": true,
                    "format": "2006-01-02 15:04:05 MST",
                    "source": { "from_input_index": 0 }
                }
            },
            {
                "name": "the_data",
                "datatype": {
                    "type": "uint64",
                    "source": { "from_input_index": 1 }
                }
            }
        ],
        "compression": "gzip",
        "format_details": {
            "skip_head": 1,
            "delimiter": "\t"
        }
    }
}