Transform Structure
A transform (or write schema) is a JSON document that configures how Hydrolix will index and store incoming data onto a given table. A transform's content includes a general description of the incoming data's format, as well as a description of every column that data will produce and populate in the target table.
The following example transform would let a Hydrolix table ingest a GZIP-compressed, CSV-formatted data stream with two columns:
{
"name": "my_special_transform",
"description": "description of my data",
"type": "csv",
"settings": {
"output_columns": [
{
"name": "timestamp",
"datatype": {
"type": "datetime",
"primary": true,
"format": "2006-01-02 15:04:05 MST",
"source": { "from_input_index": 0 }
}
},
{
"name": "the_data",
"datatype": {
"type": "uint64",
"source": { "from_input_index": 1 }
}
}
],
"compression": "gzip",
"format_details": {
"skip_head": 1,
"delimiter": "\t"
}
}
}
Basic properties
Transform documents have the following top-level properties:
Property | Purpose |
---|---|
name | A name for this transform; must be unique within its table. |
description | An optional description of this transform. |
type | The file type of the input data this transform handles. One of csv , or json . |
settings | A JSON object describing the data to ingest. |
Settings
"settings": {
"is_default": true,
"compression": "gzip",
"sql_transform": "select ..... from {STREAM}",
"null_values" : ["^","-"],
"format_details": {
// Configuration specific to the data's overall
// format goes here
},
"output_columns": [
// A list of column-definition objects goes here
]
}
The settings
object contains the details of the content and structure of the data. This is where the data input format, compression, and resulting columns are defined.
Property | Purpose | Required |
---|---|---|
is_default | If set to true, then this transform's table will apply this transform to a streaming ingest that doesn't otherwise specify a transform. Default is required for auto-ingest | Y |
output_columns | An array of JSON objects with a description of every column in the target table that the incoming data will populate. See Column definitions | Y |
compression | A string describing the compression applied to the data (or "none" , the default). See Compression | Y |
format_details | A JSON object containing configuration information specific to the data's overall format, just as field-delimiter characters with CSV files. See Format Options | Y |
sql_transform | A SQL statement runs at ingest allowing users to aggregate, enrich, modify and/or filter the data. See Real-Time Enrichment and filtering at Ingest and Summary Tables (Aggregation) | N |
null_values | An array of string which will be replaced by null if found in the value.For example ["^","-"] will replace ^ OR - by null on any column. | N |
Settings: Output Columns
The core of a transform is the Output Columns section under settings. This section is where you define what the data looks like and how it should be loaded into the Hydrolix platform. These definitions are used to specify Data-Types, indexing and other characteristics your data should have when it is being written into the HDX format.
"settings": {
"output_columns": [
{
"name": "timestamp",
"datatype": {
"type": "datetime",
"primary": true,
"format": "2006-01-02 15:04:05 MST",
"source": { "from_input_index": 0 }
}
},
{
"name": "data",
"datatype": {
"type": "uint64",
"source": { "from_input_index": 1 }
}
},
{
...
],
// Other "settings" properties go here
}
A transform must contain one object in its output_columns
list for every column in the target table that it ingests.
This includes the table's primary timestamp column, as well as any column that will copy a new value based directly on incoming data. The transform enables Hydrolix to copy these values into the target table by specifying a positional (CSV) or name-based (JSON) mapping between the incoming data fields and the table's columns.
The transform can also set columns to derive a new value based on a JavaScript expression, or a simple default.
The properties supported by column definition objects include the following:
Attribute | Purpose | Required |
---|---|---|
name | A name for the column, for use by subsequent SQL queries. | Yes |
datatype | An object containing a detailed column definition. | Yes |
The datatype
property requires an object as its value in order to allow recursive definitions, as described in complex data types. The properties supported by datatype
include the following:
Primary
Every table has to have a primary field set. This is a datetime/epoch field that is a primary index used by Hydrolix for partitioning.
Attribute | Purpose | Required |
---|---|---|
catch_all | Primarily used with JSON encoded datasets. If true , then Hydrolix will put any data that is not matched in JSON into this field as a string. See using the catch_all feature for more information. | No |
catch_rejects | Catches rejects where a column exists however the data type does not match the incoming data. See the Rejects page for more information. | No |
default | A default value to apply when the source data's value for this column is empty. | No |
denullify | Forces a column to have a "non-nullable" entry. For example a String is stored as "" or a UInt8 as 0 . | No |
elements | A description of the constituent elements of array -type data. See arrays. | No, except for array and map datatypes. |
format | Parsing information for datetime and epoch data types. See Timestamp formatting. | No, except for datetime and epoch . |
ignore | Boolean to specify if the field is ignored or not by indexer. An output column marked as ignore:true will not be processed and will not be available to the SQL transform nor appear in the output. Compare this to suppress and virtual in this table. | No |
index | Enable indexing on the column. Defaults to true (indexed), except datatype double . | No |
index_options | Element which contains advanced indexing options such as fulltext boolean.More details on fulltext page. | No |
limits | Reject rows or clamp values based on rules. More information can be found on these here: Limiters | No |
null_values | Array of strings which are replaced by a null .Example: ["^","-"] will replace ^ OR - by null . | No |
primary | If true , then this marks the target table's single, primary field. It must have either datetime or epoch type. | Exactly one per transform. |
resolution | Set to "ms" (milliseconds) or "s" (seconds) to define the time-granularity that Hydrolix will apply when storing this column's datetime or epoch -type data. Defaults to "s" (seconds). | No. |
script | A JavaScript expression whose output becomes the stored value of this column. See Scripting and Auto Values. | No. |
source | Used with CSV and JSON encoded files, if specified this is used as the source column name. When data is then written to the database the column will be called by the "name" value above. It is possible to specify multiple source columns, the first value found for the column is the one that is stored. This is best used when there is an OR requirement between two columns (e.g. If column A has a value OR column B has a value then store it in this column). | No |
suppress | Boolean to state that the field should not be indexed or stored, but is available to an SQL transform. In other words, If a field is marked "suppress": true , the field will be processed normally but will not be stored in the table. If there is pre-existing data in this field, or data from another transform, that data will still be returned by queries. If you wish to completely remove the field from query output, remove the field from the table's default view. Compare this to ignore and virtual in this table. | No |
type | The data type of the field. | Yes |
virtual | If true , then Hydrolix will not map this column to any incoming data. Useful for constant or derived data when paired with the default or script settings, respectively. Compare this to ignore and suppress in this table. | No |
Updated about 1 month ago