Format Options

Hydrolix can process input data structured in many different ways. Hydrolix currently supports data in CSV or JSON formats compressed with several different algorithms.

CSV

The CSV (Character-Separated Values) format encodes data as collections of lines containing columns separated by a specified delimiter character.

To create a transform schema that handles CSV input, set the type property to "csv" and the format_details field to an object that includes the following configuration options:

ElementTypeDefaultDescription
delimiternumber, stringThe delimiter substring, e.g. , or \t.
escapenumber, string"The escape character.
skip_headnumber0The number of rows to skip at the beginning of any ingestion.
quotesingle-character string"The character that specifies the beginning and end of a string literal.
commentsingle-character string#The character that marks the beginning of a comment line.
skip_commentsbooleanfalseWhether or not to ignore lines beginning with the comment character.
windows_endingbooleanfalseWhether or not to allow Windows-style (CR-LF) line endings.

CSV Example

{
    "name": "my_special_transform",
    "type": "csv",
    "settings": {
        "format_details": {
            "skip_head": 2,
            "delimiter": ","
        },
        ...
    }
}

JSON

To create a transform schema that handles JSON input, set the type property to "json". Because JSON input allows for complex nested structures, Hydrolix provides options to flatten incoming JSON during a pre-processing step and explicitly source output column data from the resulting structure.

Datatype Source

By default, Hydrolix uses the names of the top-level keys to establish the mapping between output_columns and the source data. For example, if your source data contains a top-level property named employees that you wish to ingest, you must name the corresponding column definition in your transform employees. This convention also applies to flattened property names.

You can override this default mapping for any column definition in output_columns with the datatype.source field:

SourceSyntax
input data by field"source": { "from_input_fields": ["field-name-1", "field-name-2", ...] }
input data by index"source": { "from_input_index": n }
input data in a nested JSON structure"source": { "from_json_pointers": ["/path/alternative/1", "/path/alternative/2", ...] }
variable"source": { "from_automatic_value": "variable-name" }

Create a New Column from an Input

The following example creates a new column named test from the data in the third column of input CSV data:

{
  	"name": "copy_example",
  	"type": "csv",
  	"table": "{{tableid}}",
  	"settings": {
  	    "is_default": true,
  	    "compression": "none",
  	    "output_columns": [
  	        {
  	            "name": "test",
  	            "datatype": {
  	      	        "type": "string",
  	      	        "index": false,
  	                "source": { "from_input_index": 3 }
  					}
				]
 		}
}

The following example creates a new column named test from the data in a field called initial_name in input JSON data:

{
    "name": "copy_example",
    "type": "json",
    "table": "{{tableid}}",
    "settings": {
        "is_default": true,
        "compression": "none",
        "output_columns": [
            {
                "name": "test",
                "datatype": {
                    "type": "string",
                    "index": true,
                    "source": {
                        "from_input_field": "initial_name"
                    }
                }
            }
        ]
    }
}

πŸ“˜

Copy Data

By copying data into multiple columns, you can index the same data multiple times with different types.

Create a New Column from Nested Input

Expressions can also query data stored in a nested JSON structure using JSON Pointer syntax. Assume input data stored in the following structure:

  1. A JSON object called vegetables.
  2. Within vegetables, a nested JSON object called legumes.
  3. Within legumes, a field called species.
{
  "vegetables": {
    "legumes": {
      species: "example_data"
    }
  }
}

Given this input data, you could use the following transform to define an output column named example_species:

{
    "name": "copy_example",
    "type": "json",
    "table": "{{tableid}}",
    "settings": {
        "is_default": true,
        "compression": "none",
        "output_columns": [
            {
                "name": "example_species",
                "datatype": {
                    "type": "string",
                    "index": true,
                    "source": {
                       "from_json_pointers": ["/vegetables/legumes/species"]
                    }
                }
            }
        ]
    }
}

Flattening

When accepting JSON source data, you may optionally flatten each incoming object prior to ingestion. This can transform complex, multi-level JSON structures into simple objects comprising one level of key/value pairs, ready for storage in a single table row.

To do this, define a flattening property within your transform's format_details. Set the value to an object with the following properties:

PropertyValue
activeIf 1 (or any other true value), Hydrolix will flatten incoming JSON objects before ingesting them as rows.
map_flattening_strategyConfiguration for flattening any JSON objects within each row's main object.
slice_flattening_strategyConfiguration for flattening any JSON arrays within each row's main object.
depthConfiguration for specifying how "deep" flattening goes. Use a value of 0 to impose no limit.

The two strategy properties accept an object that defines the rules that Hydrolix should follow to create new key names for the resulting, flattened JSON object.

PropertyValue
leftThe substring to use when concatenating an element's key with its parent's key.
rightThe substring to use when concatenating an element's key with its child's key.

Not defining (or defining as null) either of the "strategy" properties will deactivate flattening for either objects or arrays, respectively.

πŸ“˜

JSON Flattening Impacts Source Naming

Use the flattened version of a property name when defining output columns. For example, consider the following input JSON:

{
  "employees" : {
    "departments" : [
      	"produce",
      	"meat"
      ]
  }
}

Assume the following flattening configuration:

"flattening": {
  "active": true,
  "map_flattening_strategy": {
    "left": ".",
    "right": ""
  },
  "slice_flattening_strategy": {
    "left": "[",
    "right": "]"
  }
}

To refer to the first value in the array stored in employees.departments, use the field name "employees.departments[0]".

Example

Consider the following JSON object:

{
  "date": "2020-01-01",
  "data": {
    "oranges": [ 1, 2, 3 ],
    "apples": [
      {
        "cortland": 6,
        "honeycrisp": [ 7, 8, 9 ]
      },
      [ 10, 11, 12 ]
    ]
  }
}

Assume the following flattening configuration in the ingestion transform:

"settings": {
    "format_details": {
        "flattening": {
            "active": true,
            "map_flattening_strategy": {
                "left": ".",
                "right": ""
            },
            "slice_flattening_strategy": {
                "left": "[",
                "right": "]"
            }
        }
    },
    ...
}

After applying JSON flattening, Hydrolix ingests the following, single-level JSON object:

{
    "date": "2020-01-01",
    "data.oranges[0]": 1,
    "data.oranges[1]": 2,
    "data.oranges[2]": 3,
    "data.apples[0].cortland": 6,
    "data.apples[0].honeycrisp[0]": 7,
    "data.apples[0].honeycrisp[1]": 8,
    "data.apples[0].honeycrisp[2]": 9,
    "data.apples[1][0]": 10,
    "data.apples[1][1]": 11,
    "data.apples[1][2]": 12
}

Depth Example

Consider the following transform that specifies a flattening depth of 1:

"settings": {
    "format_details": {
        "flattening": {
            "active": true,
            "depth": 1,
            "map_flattening_strategy": {
                "left": ".",
                "right": ""
            },
            "slice_flattening_strategy": {
                "left": "[",
                "right": "]"
            }
        }
    },
    ...
}

Applying the transform produces the following JSON:

{
    "date": "2020-01-01",
    "data.apples":[
        {
            "cortland":6,
            "honeycrisp":[7,8,9]
        },
        [10,11,12]
    ],
    "data.oranges":[1,2,3]
}

This is useful when you want to leverage the map datatype and only flatten at a specific level.

Compression

The compression property in a transform describes one or more compression algorithms that Hydrolix should expect to find already applied to input. Hydrolix uses the listed algorithms to decompress input into parseable data.

For example, setting the transform's compression property to "gzip" means that you expect the source data, in its entirety, to have had the GZIP compression method applied to it prior to its receipt by Hydrolix.

Algorithms

Valid values for the compression property include the following:

ValueMeaning
gzipContent is compressed via gzip (LZ77 with 32-bit CRC).
zipContent is ZIP-encoded via zlib (RFC 1950)
deflateContent is encoded in zlib structure and the deflate compression algorithm.
bzip2Content is compressed with the bzip2 algorithm.
noneContent is not compressed. (Equivalent to not specifying compression at all.)

During streaming ingestion, the request may also use the content-encoding header. Hydrolix automatically handles that encoding during the streaming process.

Layering

To define multiple layers of compression, specify them in a string separated by , :

"compression": "gzip, bzip2, zip"

In the above example, Hydrolix first decompresses received data with zlib, then bzip2, and finally with gzip.

🚧

Layered Order Matters

Hydrolix decompresses data using algorithms in right-to-left (<-) order.