Format Options
Hydrolix can process input data structured in many different ways. Hydrolix currently supports data in CSV or JSON formats compressed with several different algorithms.
CSV
The CSV (Character-Separated Values) format encodes data as collections of lines containing columns separated by a specified delimiter character.
To create a transform schema that handles CSV input, set the type property to "csv" and the format_details field to an object that includes the following configuration options:
| Element | Type | Default | Description |
|---|---|---|---|
delimiter | number, string | The delimiter substring, for example , or \t. | |
escape | number, string | " | The escape character. |
skip_head | number | 0 | The number of rows to skip at the beginning of any ingestion. |
quote | single-character string | " | The character that specifies the beginning and end of a string literal. |
comment | single-character string | # | The character that marks the beginning of a comment line. |
skip_comments | boolean | false | Whether or not to ignore lines beginning with the comment character. |
windows_ending | boolean | false | Whether or not to allow Windows-style (CR-LF) line endings. |
CSV Example
{
"name": "my_special_transform",
"type": "csv",
"settings": {
"format_details": {
"skip_head": 2,
"delimiter": ","
},
...
}
}
JSON
To create a transform schema that handles JSON input, set the type property to "json". Because JSON input allows for complex nested structures, Hydrolix provides options to flatten incoming JSON during a pre-processing step and explicitly source output column data from the resulting structure.
JSON subtypes
JSON data may arrive wrapped in transport- or service-specific formats.
Hydrolix provides subtypes to unwrap these outer formats during ingestion. A subtype processes incoming JSON to extract the actual data payload before Hydrolix applies your transform's column mappings.
Hydrolix supports the following JSON subtypes:
firehose- Unwraps Amazon Data Firehose message formatfirehose/gzip- Unwraps gzip-compressed Amazon Data Firehose messagescloudwatch- Unwraps Amazon CloudWatch log formatmPulse- Unwraps Akamai mPulse data format
Data uses a single subtype
To process data with a single subtype, include the subtype field in your transform's format_details.
{
"name": "firehose_transform",
"type": "json",
"settings": {
"format_details": {
"subtype": "firehose"
},
...
}
}
Data uses multiple subtypes
When data passes through multiple services before reaching Hydrolix, use pre-transforms to chain multiple subtypes together.
- Fields
pretransformsandsubtypeare mutually exclusive. - The pre-transforms feature is available in Hydrolix v5.6 and later.
Datatype Source
By default, Hydrolix uses the names of the top-level keys to establish the mapping between output_columns and the source data. For example, if your source data contains a top-level property named employees that you wish to ingest, you must name the corresponding column definition in your transform employees. This convention also applies to flattened property names.
You can override this default mapping for any column definition in output_columns with the datatype.source field:
| Source | Syntax |
|---|---|
| Input data from a single field | "source": { "from_input_field": "field-name" } |
| Input data from multiple fields | "source": { "from_input_fields": ["field-name-1", "field-name-2", ...] } |
| Input data by index, in this example field 3 | "source": { "from_input_index": 3 } |
| Input data in a nested JSON structure | "source": { "from_json_pointers": ["/path/alternative/1", "/path/alternative/2", ...] } |
| Variable | "source": { "from_automatic_value": "variable-name" } |
Create a New Column from an Input
The following example creates a new column named test from the data in the third column of input CSV data:
{
"name": "copy_example",
"type": "csv",
"table": "{{tableid}}",
"settings": {
"is_default": true,
"compression": "none",
"output_columns": [
{
"name": "test",
"datatype": {
"type": "string",
"index": false,
"source": { "from_input_index": 3 }
}
]
}
}
The following example creates a new column named test from the data in a field called initial_name in input JSON data:
{
"name": "copy_example",
"type": "json",
"table": "{{tableid}}",
"settings": {
"is_default": true,
"compression": "none",
"output_columns": [
{
"name": "test",
"datatype": {
"type": "string",
"index": true,
"source": {
"from_input_field": "initial_name"
}
}
}
]
}
}
Copy Data
By copying data into multiple columns, you can index the same data multiple times with different types.
Create a New Column from Nested Input
Expressions can also query data stored in a nested JSON structure using JSON Pointer syntax. Assume input data stored in the following structure:
- A JSON object called
vegetables. - Within
vegetables, a nested JSON object calledlegumes. - Within
legumes, a field calledspecies.
{
"vegetables": {
"legumes": {
species: "example_data"
}
}
}
Given this input data, you could use the following transform to define an output column named example_species:
{
"name": "copy_example",
"type": "json",
"table": "{{tableid}}",
"settings": {
"is_default": true,
"compression": "none",
"output_columns": [
{
"name": "example_species",
"datatype": {
"type": "string",
"index": true,
"source": {
"from_json_pointers": ["/vegetables/legumes/species"]
}
}
}
]
}
}
Flattening
When accepting JSON source data, you may optionally flatten each incoming object prior to ingestion. This can transform complex, multi-level JSON structures into simple objects comprising one level of key/value pairs, ready for storage in a single table row.
To do this, define a flattening property within your transform's format_details. Set the value to an object with the following properties:
| Property | Value |
|---|---|
active | If 1 (or any other true value), Hydrolix will flatten incoming JSON objects before ingesting them as rows. |
map_flattening_strategy | Configuration for flattening any JSON objects within each row's main object. |
slice_flattening_strategy | Configuration for flattening any JSON arrays within each row's main object. |
depth | Configuration for specifying how "deep" flattening goes. Use a value of 0 to impose no limit. |
The two strategy properties accept an object that defines the rules that Hydrolix should follow to create new key names for the resulting, flattened JSON object.
| Property | Value |
|---|---|
| left | The substring to use when concatenating an element's key with its parent's key. |
| right | The substring to use when concatenating an element's key with its child's key. |
Not defining (or defining as null) either of the "strategy" properties will deactivate flattening for either objects or arrays, respectively.
JSON Flattening Impacts Source Naming
Use the flattened version of a property name when defining output columns. For example, consider the following input JSON:
{ "employees" : { "departments" : [ "produce", "meat" ] } }Assume the following flattening configuration:
"flattening": { "active": true, "map_flattening_strategy": { "left": ".", "right": "" }, "slice_flattening_strategy": { "left": "[", "right": "]" } }To refer to the first value in the array stored in
employees.departments, use the field name"employees.departments[0]".
Example
Consider the following JSON object:
{
"date": "2020-01-01",
"data": {
"oranges": [ 1, 2, 3 ],
"apples": [
{
"cortland": 6,
"honeycrisp": [ 7, 8, 9 ]
},
[ 10, 11, 12 ]
]
}
}
Assume the following flattening configuration in the ingestion transform:
"settings": {
"format_details": {
"flattening": {
"active": true,
"map_flattening_strategy": {
"left": ".",
"right": ""
},
"slice_flattening_strategy": {
"left": "[",
"right": "]"
}
}
},
...
}
After applying JSON flattening, Hydrolix ingests the following, single-level JSON object:
{
"date": "2020-01-01",
"data.oranges[0]": 1,
"data.oranges[1]": 2,
"data.oranges[2]": 3,
"data.apples[0].cortland": 6,
"data.apples[0].honeycrisp[0]": 7,
"data.apples[0].honeycrisp[1]": 8,
"data.apples[0].honeycrisp[2]": 9,
"data.apples[1][0]": 10,
"data.apples[1][1]": 11,
"data.apples[1][2]": 12
}
Depth Example
Consider the following transform that specifies a flattening depth of 1:
"settings": {
"format_details": {
"flattening": {
"active": true,
"depth": 1,
"map_flattening_strategy": {
"left": ".",
"right": ""
},
"slice_flattening_strategy": {
"left": "[",
"right": "]"
}
}
},
...
}
Applying the transform produces the following JSON:
{
"date": "2020-01-01",
"data.apples":[
{
"cortland":6,
"honeycrisp":[7,8,9]
},
[10,11,12]
],
"data.oranges":[1,2,3]
}
This is useful when you want to leverage the map datatype and only flatten at a specific level.
Compression
The compression property in a transform describes one or more compression algorithms that Hydrolix should expect to find already applied to input. Hydrolix uses the listed algorithms to decompress input into parseable data.
For example, setting the transform's compression property to "gzip" means that you expect the source data, in its entirety, to have had the GZIP compression method applied to it prior to its receipt by Hydrolix.
Algorithms
Valid values for the compression property include the following:
| Value | Meaning |
|---|---|
gzip | Content is compressed via gzip (LZ77 with 32-bit CRC). |
zip | Content is ZIP-encoded via zlib (RFC 1950) |
deflate | Content is encoded in zlib structure and the deflate compression algorithm. |
bzip2 | Content is compressed with the bzip2 algorithm. |
none | Content isn't compressed. (Equivalent to not specifying compression at all.) |
During streaming ingestion, the request may also use the content-encoding header. Hydrolix automatically handles that encoding during the streaming process.
Layering
To define multiple layers of compression, specify them in a string separated by , :
"compression": "gzip, bzip2, zip"
In the above example, Hydrolix first decompresses received data with zlib, then bzip2, and finally with gzip.
Layered Order Matters
Hydrolix decompresses data using algorithms in right-to-left (<-) order.
Updated 10 days ago