Format Options
Hydrolix can process input data structured in many different ways. It currently supports data in CSV or JSON format, optionally compressed with any of several algorithms.
CSV
The CSV (Character-Separated Values) format encodes data as collections of lines containing columns separated by a specified delimiter character.
To create a transform schema that handles CSV input, set the type
property to "csv"
and the format_details
field to an object that includes the following configuration options:
Element | Type | Default | Description |
---|---|---|---|
delimiter | number, string | | The delimiter substring, e.g. , or \t. |
escape | number, string | " | The escape character. |
skip_head | number | 0 | The number of rows to skip at the beginning of any ingestion. |
quote | single-character string | " | The character that specifies the beginning and end of a string literal. |
comment | single-character string | # | The character that marks the beginning of a comment line. |
skip_comments | boolean | false | Whether or not to ignore lines beginning with the comment character. |
windows_ending | boolean | false | Whether or not to allow Windows-style (CR-LF) line endings. |
CSV Example
{
"name": "my_special_transform",
"type": "csv",
"settings": {
"format_details": {
"skip_head": 2,
"delimiter": ","
},
...
}
}
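For reference, here is a sketch of a format_details object that sets several of the remaining options. The specific values are illustrative assumptions, not recommended defaults:
"format_details": {
  "delimiter": "\t",
  "escape": "\\",
  "quote": "\"",
  "skip_head": 1,
  "comment": "#",
  "skip_comments": true,
  "windows_ending": true
}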
JSON
To create a transform schema that handles JSON input, set the type
property to "json"
. Because JSON input allows for complex nested structures, Hydrolix provides options to flatten incoming JSON during a pre-processing step and explicitly source output column data from the resulting structure.
Datatype Source
By default, Hydrolix uses the names of the top-level keys to establish the mapping between output_columns
and the source data. For example, if your source data contains a top-level property named employees
that you wish to ingest, you must name the corresponding column definition in your transform employees
. This convention also applies to flattened property names.
You can override this default mapping for any column definition in output_columns
with the datatype.source
field:
Source | Syntax |
---|---|
input data by field | "source": { "from_input_fields": ["field-name-1", "field-name-2", ...] } |
input data by index | "source": { "from_input_index": n } |
input data in a nested JSON structure | "source": { "from_json_pointers": ["/path/alternative/1", "/path/alternative/2", ...] } |
variable | "source": { "from_automatic_value": "variable-name" } |
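A column can also be filled from a variable rather than from the input itself. The sketch below assumes an automatic value named current_time is available in your deployment; the column name and the variable name are both illustrative:
{
  "name": "ingested_at",
  "datatype": {
    "type": "string",
    "index": false,
    "source": { "from_automatic_value": "current_time" }
  }
}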
Create a New Column from an Input
The following example creates a new column named test
from the data in the third column of input CSV data:
{
"name": "copy_example",
"type": "csv",
"table": "{{tableid}}",
"settings": {
"is_default": true,
"compression": "none",
"output_columns": [
{
"name": "test",
"datatype": {
"type": "string",
"index": false,
"source": { "from_input_index": 3 }
}
}
]
}
}
The following example creates a new column named test
from the data in a field called initial_name
in input JSON data:
{
"name": "copy_example",
"type": "json",
"table": "{{tableid}}",
"settings": {
"is_default": true,
"compression": "none",
"output_columns": [
{
"name": "test",
"datatype": {
"type": "string",
"index": true,
"source": {
"from_input_field": "initial_name"
}
}
}
]
}
}
Copy Data
By copying data into multiple columns, you can index the same data multiple times with different types.
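As a minimal sketch of that pattern, the following output_columns definition reads the same (assumed) input field, status, into both a string column and a numeric column:
"output_columns": [
  {
    "name": "status_text",
    "datatype": {
      "type": "string",
      "index": true,
      "source": { "from_input_fields": ["status"] }
    }
  },
  {
    "name": "status_code",
    "datatype": {
      "type": "uint64",
      "index": true,
      "source": { "from_input_fields": ["status"] }
    }
  }
]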
Create a New Column from Nested Input
Expressions can also query data stored in a nested JSON structure using JSON Pointer syntax. Assume input data stored in the following structure:
- A JSON object called vegetables.
- Within vegetables, a nested JSON object called legumes.
- Within legumes, a field called species.
{
"vegetables": {
"legumes": {
"species": "example_data"
}
}
}
Given this input data, you could use the following transform to define an output column named example_species
:
{
"name": "copy_example",
"type": "json",
"table": "{{tableid}}",
"settings": {
"is_default": true,
"compression": "none",
"output_columns": [
{
"name": "example_species",
"datatype": {
"type": "string",
"index": true,
"source": {
"from_json_pointers": ["/vegetables/legumes/species"]
}
}
}
]
}
}
Flattening
When accepting JSON source data, you may optionally flatten each incoming object prior to ingestion. This can transform complex, multi-level JSON structures into simple objects comprising one level of key/value pairs, ready for storage in a single table row.
To do this, define a flattening
property within your transform's format_details
. Set the value to an object with the following properties:
Property | Value |
---|---|
active | If 1 (or any other true value), Hydrolix will flatten incoming JSON objects before ingesting them as rows. |
map_flattening_strategy | Configuration for flattening any JSON objects within each row's main object. |
slice_flattening_strategy | Configuration for flattening any JSON arrays within each row's main object. |
depth | Configuration for specifying how "deep" flattening goes. Use a value of 0 to impose no limit. |
The two strategy
properties accept an object that defines the rules that Hydrolix should follow to create new key names for the resulting, flattened JSON object.
Property | Value |
---|---|
left | The substring inserted between the parent key and the nested element's key or array index. |
right | The substring appended after the nested element's key or array index. |
Omitting either of the "strategy" properties (or setting it to null) deactivates flattening for objects or arrays, respectively.
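For example, to flatten nested objects while leaving arrays untouched, you could define only the map strategy. This sketch uses the same configuration shape as the examples that follow:
"flattening": {
  "active": true,
  "map_flattening_strategy": { "left": ".", "right": "" },
  "slice_flattening_strategy": null
}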
JSON Flattening Impacts Source Naming
Use the flattened version of a property name when defining output columns. For example, consider the following input JSON:
{ "employees" : { "departments" : [ "produce", "meat" ] } }
Assume the following flattening configuration:
"flattening": { "active": true, "map_flattening_strategy": { "left": ".", "right": "" }, "slice_flattening_strategy": { "left": "[", "right": "]" } }
To refer to the first value in the array stored in employees.departments, use the field name "employees.departments[0]".
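Continuing that example, an output column that captures the first department could reference the flattened name directly (the column name is illustrative):
{
  "name": "first_department",
  "datatype": {
    "type": "string",
    "index": true,
    "source": { "from_input_fields": ["employees.departments[0]"] }
  }
}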
Example
Consider the following JSON object:
{
"date": "2020-01-01",
"data": {
"oranges": [ 1, 2, 3 ],
"apples": [
{
"cortland": 6,
"honeycrisp": [ 7, 8, 9 ]
},
[ 10, 11, 12 ]
]
}
}
Assume the following flattening
configuration in the ingestion transform:
"settings": {
"format_details": {
"flattening": {
"active": true,
"map_flattening_strategy": {
"left": ".",
"right": ""
},
"slice_flattening_strategy": {
"left": "[",
"right": "]"
}
}
},
...
}
After applying JSON flattening, Hydrolix ingests the following, single-level JSON object:
{
"date": "2020-01-01",
"data.oranges[0]": 1,
"data.oranges[1]": 2,
"data.oranges[2]": 3,
"data.apples[0].cortland": 6,
"data.apples[0].honeycrisp[0]": 7,
"data.apples[0].honeycrisp[1]": 8,
"data.apples[0].honeycrisp[2]": 9,
"data.apples[1][0]": 10,
"data.apples[1][1]": 11,
"data.apples[1][2]": 12
}
Depth Example
Consider the following transform that specifies a flattening depth
of 1:
"settings": {
"format_details": {
"flattening": {
"active": true,
"depth": 1,
"map_flattening_strategy": {
"left": ".",
"right": ""
},
"slice_flattening_strategy": {
"left": "[",
"right": "]"
}
}
},
...
}
Applying the transform produces the following JSON:
{
"date": "2020-01-01",
"data.apples":[
{
"cortland":6,
"honeycrisp":[7,8,9]
},
[10,11,12]
],
"data.oranges":[1,2,3]
}
This is useful when you want to leverage the map datatype and only flatten at a specific level.
Compression
The compression
property in a transform describes one or more compression algorithms that Hydrolix should expect to find already applied to input. Hydrolix uses the listed algorithms to decompress input into parseable data.
For example, setting the transform's compression
property to "gzip"
means that you expect the source data, in its entirety, to have had the GZIP compression method applied to it prior to its receipt by Hydrolix.
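For instance, a transform expecting gzip-compressed input would set the property inside settings, alongside the fields shown in the earlier examples:
"settings": {
  "compression": "gzip",
  ...
}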
Algorithms
Valid values for the compression
property include the following:
Value | Meaning |
---|---|
gzip | Content is compressed via gzip (LZ77 with 32-bit CRC). |
zip | Content is ZIP-encoded via zlib (RFC 1950). |
deflate | Content is wrapped in the zlib structure and compressed with the deflate algorithm. |
bzip2 | Content is compressed with the bzip2 algorithm. |
none | Content is not compressed. (Equivalent to not specifying compression at all.) |
During streaming ingestion, the request may also use the content-encoding
header. Hydrolix automatically handles that encoding during the streaming process.
Layering
To define multiple layers of compression, specify them as a comma-separated string:
"compression": "gzip, bzip2, zip"
In the above example, Hydrolix first decompresses received data with zlib, then bzip2, and finally with gzip.
Layered Order Matters
Hydrolix decompresses data using algorithms in right-to-left (<-) order.