Transform Structure
A transform determines how Hydrolix will process, index and store incoming data into a given table.
The transform JSON document describes the shape and format of incoming fields, controls ingestion features and behavior, and determines output data columns for insertion into the target table.
For each Hydrolix table, you may create multiple transforms to receive differently formatted files or streams. Only one transform may serve as the default for a table. The default transform will be used for any data ingested to the table that doesn't have a transform explicitly specified.
Shape of a transform
Each transform document consists of four major, nested parts.
- Basic properties form the main JSON object
- Settings describes input handling and controls ingestion properties and some features
- Output columns describe the details of field-to-column transformation
- Datatypes provide a detailed specification of each column's data
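The nesting of these parts looks like the following skeleton (placeholder values only; a complete minimal example appears at the end of this page):
{
"name": "...",
"description": "...",
"type": "csv",
"settings": {
"format_details": {
// incoming data format description
},
"output_columns": [
{
"name": "...",
"datatype": {
// type and other column details
}
}
]
}
}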
Basic properties
Transform documents have the following top-level properties:
Property | Type | Purpose | Default | Required |
---|---|---|---|---|
name | string | A name for this transform; must be unique within its table | | Yes |
description | string | An optional description | | No |
type | string | Input file type handled by this transform. One of csv or json | | Yes |
settings | object | See Settings | | Yes |
{
"name": "my_special_transform",
"description": "description of my data",
"type": "csv",
"settings": {
// See next section
}
}
See also the simple complete transform JSON.
Settings
The settings object specifies transform-level behavior, describing input data format, compression, rate limiting, and, optionally, sample data for ease of use in testing and validating the transform's behavior.
Property | Type | Purpose | Default | Required |
---|---|---|---|---|
is_default | boolean | See Multiple transforms | false | Yes |
output_columns | array | See Column definitions | | Yes |
format_details | object | Incoming data format description. See Format Options | | Yes |
compression | string | Convey content-level compression. See Compression | none | Yes |
null_values | array | Of type string. See Null handling and default values | | No |
sql_transform | string | SQL used during data enrichment. See SQL transforms | | No |
rate_limit | object | Limits bytes per second ingest rate and max payload size. See Rate Limits | no limit | No |
wurfl | string | See WURFL Integration | | No |
sample_data | string | See Test and Validate Transforms | | No |
Settings block example
"settings": {
"is_default": true,
"compression": "gzip",
"sql_transform": "select ..... from {STREAM}",
"null_values" : ["^","-"],
"format_details": {
// Configuration specific to the data's overall
// format goes here
},
"output_columns": [
// A list of column-definition objects goes here
]
}
See also Data Enrichment and Summary Tables (Aggregation).
Output columns
The output_columns section of the document is the core of a transform. It maps incoming data fields to table columns.
Select from the several supported Timestamp Data Types and be sure to mark one field as primary. If your data lack an event timestamp, consider using receive_time from Auto Values.
primary must be true on exactly one column's datatype definition
Each transform must specify one primary column of type timestamp or epoch. This column must be indexed, isn't nullable, and serves as a partitioning key.
Output columns block example
"output_columns": [
{
"name": "timestamp",
"datatype": {
"type": "datetime",
"primary": true,
"format": "2006-01-02 15:04:05 MST",
"source": { "from_input_index": 0 }
}
},
{
"name": "the_data",
"datatype": {
"type": "uint64",
"source": { "from_input_index": 1 }
}
}
]
Datatypes
The datatype property in an output_column is an object itself, to allow for specification of Complex Data Types and type-specific characteristics of columns. Each column must include a datatype definition, which supports the following properties.
Property | Type | Purpose | Default | Required |
---|---|---|---|---|
type | string | A type name, for example string, boolean, int32, int8, epoch, or datetime | | Yes |
index | boolean | See indexability of types below | true | No |
index_options | object | Use {"fulltext": true} to enable Full Text Search | | No |
primary | boolean | Exactly one column must be primary. See Timestamp Data Types | false | Once per transform |
format | string | Parsed time format. See Format and Resolution | | Only for datetime and epoch |
resolution | string | Stored time resolution. See Format and Resolution | seconds | Only for datetime and epoch |
elements | object | See Complex Data Types | | Only for array and map |
source | object | See Datatype source | | Format-specific |
limits | object | Constrain acceptable values. See Limiters | | No |
script | string | Derive column values using JS expressions. See Scripting | | No |
null_values | array | Of type string. See Null handling and default values | | No |
default | Depends on the specified type | Value used to replace nulls. See Null handling and default values | | No |
denullify | boolean | Convert stored nulls on this column to empty in the automatic view | false | No |
ignore | boolean | Omit input field from all ingestion stages. See Defensive handling | false | No |
suppress | boolean | Don't pass this field through to indexing and storage. See Defensive handling | false | No |
virtual | boolean | Mark that this field accepts constructed values. See Defensive handling | false | No |
catch_all | boolean | Store all unknown fields in this column. See Catch features | false | No |
catch_rejects | boolean | Use this column for rejects. See Catch features | false | No |
Indexability of types
In the absence of the index field in a transform specification, Hydrolix will default to true. Most, but not all, data types are indexable. See Basic Data Types, Timestamp Data Types, and Complex Data Types for specifics.
Datatype block examples
"datatype": {
"type": "epoch",
"primary": true,
"format": "2006-01-02 15:04:05 MST",
"source": { "from_input_index": 0 }
}
{
"type": "datetime",
"primary": true,
"format": "2006-01-02T15:04:05.999999Z",
"resolution": "ms",
"source": {
"from_input_field": "timestamp"
}
}
Ingestion sequence
For cleaning and enriching your data, you can use name-based (JSON) mapping, CSV positional indexing, Auto Values, JavaScript Scripting, or more advanced ingestion and data manipulation features.
The next several sections parallel the stages of transform execution. The diagram may help illustrate the relationship between transform properties and processing order.
Immutable column types
A column's type is immutable upon creation. Type mismatch on ingestion is an error, which will result in rejected values (see Catch features for coping with rejects).
All subsequent transforms on the same table, whether revisions or new, must specify matching values for type, index, and primary for that column. Avoid this kind of conflict by using different column names to support different types.
For example, consider a scenario in which different data streams have the field country, but the value is variously an ISO-3166 code, a string, or a numeric identifier. You could define multiple output_columns of different types and use other ingestion features: country_code as string, country_name as string, and country_id as int32. Transform A could then accept country into the column country_name, looking up the corresponding country_id and country_code column values. Other Hydrolix ingestion control features, such as Custom dictionaries or Auto values, can also help. A sketch of these column definitions appears below.
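These column definitions might look like the following sketch; the source mapping for country_name comes from the scenario above, while the lookup mechanism for the other two columns (Scripting, a Custom dictionary, or an SQL transform) is an assumption and isn't shown:
"output_columns": [
{
"name": "country_name",
"datatype": {
"type": "string",
"source": { "from_input_field": "country" }
}
},
{
"name": "country_code",
"datatype": {
"type": "string"
// value produced by a lookup, for example via Scripting or a Custom dictionary
}
},
{
"name": "country_id",
"datatype": {
"type": "int32"
// value produced by the same kind of lookup
}
}
]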
Null handling and default values
The transform object offers control over handling empty values, both at the transform level and, more fine-grained, for individual columns.
You can supply a list of null_values either on the settings object, where it applies to all columns, or on a specific output_column.
The null_values apply before default.
During initial basic data validation checks, incoming fields will be converted to nulls using the union of the transform-level null_values (settings object) and each column's null_values, after which nulls may be replaced with the column-specified default.
For example, a settings-level null_values of ["^","-"] will replace ^ or - with null on any column.
"null_values": [
"-",
"^"
],
// incoming values of "^" or "-" will be replaced with null for further processing
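The null_values and default properties can also be set on an individual column. A hypothetical column-level sketch (the column name and values are illustrative):
{
"name": "status",
"datatype": {
"type": "string",
"null_values": ["n/a"],
"default": "unknown",
"source": { "from_input_field": "status" }
}
}
// for this column only, incoming "n/a" values become null and are then replaced with "unknown"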
If a column contains null values, you can influence retrieval using the denullify property. See Views and denullify.
Limits, scripting, and auto values
After all incoming data fields have been converted to the type described in their output_columns, the limits checking occurs. Any failures in type conversion or limit checking may result in rejects. See Catch features.
Next, each output column with a non-null script or an Auto Values source instruction will be run.
SQL transforms
A powerful data enrichment stage allows you to configure SQL to operate on the incoming data stream and produce additional fields.
In this last phase before index and partition generation, you can populate output_columns with data generated from the columns of a SELECT clause operating on the data stream after prior stages have completed. At this point in the ingestion processing, all of the column typing, default-setting, scripting, and auto value generation is complete.
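For example, a sql_transform in settings might look like the following sketch. The column names and the upper() function are illustrative assumptions, and {STREAM} stands for the incoming data stream, as in the settings example above:
"sql_transform": "SELECT timestamp, upper(country_name) AS country_name, the_data FROM {STREAM}"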
For more detail on this advanced feature, see Data Enrichment, Custom Functions, and Custom Dictionaries.
General guidance
Transforms are the primary tool for converting incoming data fields into values stored into a table's columns.
- Select a single input field (or construct one using script) to serve as the primary timestamp of the table and set a format. For storage, the resolution setting is optional and defaults to seconds.
- Define a data type for every field stored to the table. Use suppress to keep a field through the intake process but omit it from the resulting table.
- If you wish to store NULLs, use null_values to convert empty values to null for the entire incoming data stream, and use null_values on specific input fields. See also Null handling and default values.
- Use the catch_all and catch_rejects properties to store unknown or invalid data directly in a table, rather than generating Reject files into your storage bucket.
- Control data value ranges on specific fields by using the limits property.
- Remember that a single transform on a table is implicitly the default.
Multiple transforms
Only one transform on a table may set is_default to true. If set to true, then this transform applies to any streaming ingest for that table that doesn't otherwise specify a transform. The autoingest feature can only use the default transform. Streams, however, can specify the desired transform using the HTTP header x-hdx-transform.
Defensive handling
The less-common properties ignore, suppress, and virtual offer control over visibility of values during transform execution.
Use ignore to explicitly remove the field in the first stage, basic data validation checks. It will be unavailable during any further processing and won't be present in the table. A more common solution than using ignore is to avoid creating an output_column for the field. Since a transform only inserts data into existing output_columns, omission of a field's name from output_columns suffices. The ignore option is simply more explicit.
Use suppress to keep a field's value during early stages of the transform execution, but suppress the value before the indexing stage and subsequent writing to a table partition. This ensures the incoming field values don't get written to the table. Note that this occurs after the SQL Transform stage and before the indexing stage.
Use virtual to occlude any incoming field named the same as an output column. This behaves similarly to ignore, but supports derived columns from Scripting, SQL Transforms, and default values. Marking a derived column as virtual can prevent a future addition of a field in the input stream from taking precedence over the default.
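A hypothetical sketch of these three flags on column definitions (the column names are illustrative):
{
"name": "debug_info",
"datatype": {
"type": "string",
"ignore": true
// dropped during basic data validation; unavailable to later stages and never stored
}
},
{
"name": "session_token",
"datatype": {
"type": "string",
"suppress": true
// available to earlier stages, including the SQL transform, but not indexed or stored
}
},
{
"name": "ingest_region",
"datatype": {
"type": "string",
"virtual": true,
"default": "unknown"
// any incoming field named ingest_region is occluded; the value comes from default, Scripting, or an SQL transform
}
}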
Catch features
There are two features available at ingestion to help deal with changing formats of incoming data.
Each transform can earmark a specific column with catch_all to receive incoming fields lacking any output column definition. This is a good way to collect unknown fields, rather than drop the data.
When incoming values trigger Limiters or when values don't match a column's datatype, the catch_rejects field allows a column to accumulate the bad data in the table rather than generate a Rejects file to the storage system.
The catch features retain incoming fields and data in the table for applications or analysts.
Examples:
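A hypothetical sketch of catch columns; the column names and the string type are illustrative assumptions (see Catch features for the datatypes these columns support):
{
"name": "extra_fields",
"datatype": {
"type": "string",
"catch_all": true
// receives incoming fields that have no output column definition
}
},
{
"name": "rejected_values",
"datatype": {
"type": "string",
"catch_rejects": true
// receives values that fail limits checks or type conversion
}
}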
Views and denullify
The denullify property isn't used during ingestion, but applies to the automatically created view. See also Custom Views.
When denullify is true, null values in a column are converted into a type-appropriate blank value at query time. The contents of the table partitions are completely unaffected by this setting. If partitions contain null values in the column, then changing this setting will change query results even for partitions created long before.
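For example, a column defined like this hypothetical sketch (the column name is illustrative) would return a type-appropriate blank value, an empty string for a string column, instead of NULL in the automatic view:
{
"name": "referrer",
"datatype": {
"type": "string",
"denullify": true
}
}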
Warning: Last update wins.
The automatically created view will respect the value of denullify from the latest updated or created transform. When using the denullify property, pay attention to multiple transforms.
Complete minimal example
The following example transform would let a Hydrolix table ingest a GZIP-compressed, CSV-formatted data stream.
In English, this describes tab-separated data with a single header line. Only the first two fields are consumed and created as columns. The first column (0) will be used as the partitioning key (primary) and holds a datetime including a time zone name. The second field is a large unsigned integer, uint64.
{
"name": "my_special_transform",
"description": "description of my data",
"type": "csv",
"settings": {
"output_columns": [
{
"name": "timestamp",
"datatype": {
"type": "datetime",
"primary": true,
"format": "2006-01-02 15:04:05 MST",
"source": { "from_input_index": 0 }
}
},
{
"name": "the_data",
"datatype": {
"type": "uint64",
"source": { "from_input_index": 1 }
}
}
],
"compression": "gzip",
"format_details": {
"skip_head": 1,
"delimiter": "\t"
}
}
}