Skip to content

Format Options

Hydrolix can process input data structured in many different ways. Hydrolix currently supports data in CSV or JSON formats compressed with several different algorithms.

CSV⚓︎

The CSV (Character-Separated Values) format encodes data as collections of lines containing columns separated by a specified delimiter character.

To create a transform schema that handles CSV input, set the type property to "csv" and the format_details field to an object that includes the following configuration options:

Element Type Default Description
delimiter number, string The delimiter substring, for example , or \t.
escape number, string " The escape character.
skip_head number 0 The number of rows to skip at the beginning of any ingestion.
quote single-character string " The character that specifies the beginning and end of a string literal.
comment single-character string # The character that marks the beginning of a comment line.
skip_comments boolean false Whether or not to ignore lines beginning with the comment character.
windows_ending boolean false Whether or not to allow Windows-style (CR-LF) line endings.

CSV Example⚓︎

{
    "name": "my_special_transform",
    "type": "csv",
    "settings": {
        "format_details": {
            "skip_head": 2,
            "delimiter": ","
        },
        ...
    }
}

JSON⚓︎

To create a transform schema that handles JSON input, set the type property to "json". Because JSON input allows for complex nested structures, Hydrolix provides options to flatten incoming JSON during a pre-processing step and explicitly source output column data from the resulting structure.

Datatype Source⚓︎

By default, Hydrolix uses the names of the top-level keys to establish the mapping between output_columns and the source data. For example, if your source data contains a top-level property named employees that you wish to ingest, you must name the corresponding column definition in your transform employees. This convention also applies to flattened property names.

You can override this default mapping for any column definition in output_columns with the datatype.source field:

Source Syntax
Input data from a single field "source": { "from_input_field": "field-name" }
Input data from multiple fields "source": { "from_input_fields": ["field-name-1", "field-name-2", ...] }
Input data by index, in this example field 3 "source": { "from_input_index": 3 }
Input data in a nested JSON structure "source": { "from_json_pointers": ["/path/alternative/1", "/path/alternative/2", ...] }
Variable "source": { "from_automatic_value": "variable-name" }

Create a New Column from an Input⚓︎

The following example creates a new column named test from the data in the third column of input CSV data:

{
    "name": "copy_example",
    "type": "csv",
    "table": "{{tableid}}",
    "settings": {
        "is_default": true,
        "compression": "none",
        "output_columns": [
            {
                "name": "test",
                "datatype": {
                    "type": "string",
                    "index": false,
                    "source": { "from_input_index": 3 }
                    }
                ]
        }
}

The following example creates a new column named test from the data in a field called initial_name in input JSON data:

{
    "name": "copy_example",
    "type": "json",
    "table": "{{tableid}}",
    "settings": {
        "is_default": true,
        "compression": "none",
        "output_columns": [
            {
                "name": "test",
                "datatype": {
                    "type": "string",
                    "index": true,
                    "source": {
                        "from_input_field": "initial_name"
                    }
                }
            }
        ]
    }
}

Copy Data

By copying data into multiple columns, you can index the same data multiple times with different types.

Create a New Column from Nested Input⚓︎

Expressions can also query data stored in a nested JSON structure using JSON Pointer syntax. Assume input data stored in the following structure:

  1. A JSON object called vegetables.
  2. Within vegetables, a nested JSON object called legumes.
  3. Within legumes, a field called species.

    1
    2
    3
    4
    5
    6
    7
    {
      "vegetables": {
        "legumes": {
          species: "example_data"
        }
      }
    }
    

Given this input data, you could use the following transform to define an output column named example_species:

{
    "name": "copy_example",
    "type": "json",
    "table": "{{tableid}}",
    "settings": {
        "is_default": true,
        "compression": "none",
        "output_columns": [
            {
                "name": "example_species",
                "datatype": {
                    "type": "string",
                    "index": true,
                    "source": {
                       "from_json_pointers": ["/vegetables/legumes/species"]
                    }
                }
            }
        ]
    }
}

Flattening⚓︎

When accepting JSON source data, you may optionally flatten each incoming object prior to ingestion. This can transform complex, multi-level JSON structures into simple objects comprising one level of key/value pairs, ready for storage in a single table row.

To do this, define a flattening property within your transform's format_details. Set the value to an object with the following properties:

Property Value
active If 1 (or any other true value), Hydrolix will flatten incoming JSON objects before ingesting them as rows.
map_flattening_strategy Configuration for flattening any JSON objects within each row's main object.
slice_flattening_strategy Configuration for flattening any JSON arrays within each row's main object.
depth Configuration for specifying how "deep" flattening goes. Use a value of 0 to impose no limit.

The two strategy properties accept an object that defines the rules that Hydrolix should follow to create new key names for the resulting, flattened JSON object.

Property Value
left The substring to use when concatenating an element's key with its parent's key.
right The substring to use when concatenating an element's key with its child's key.

Not defining (or defining as null) either of the "strategy" properties will deactivate flattening for either objects or arrays, respectively.

JSON Flattening Impacts Source Naming

Use the flattened version of a property name when defining output columns. For example, consider the following input JSON:

1
2
3
4
5
6
7
8
{
"employees" : {
"departments" : [
"produce",
"meat"
]
}
}
Assume the following flattening configuration:
"flattening": {
"active": true,
"map_flattening_strategy": {
"left": ".",
"right": ""
},
"slice_flattening_strategy": {
"left": "[",
"right": "]"
}
}
To refer to the first value in the array stored in employees.departments, use the field name "employees.departments[0]".

Example⚓︎

Consider the following JSON object:

{
  "date": "2020-01-01",
  "data": {
    "oranges": [ 1, 2, 3 ],
    "apples": [
      {
        "cortland": 6,
        "honeycrisp": [ 7, 8, 9 ]
      },
      [ 10, 11, 12 ]
    ]
  }
}

Assume the following flattening configuration in the ingestion transform:

"settings": {
    "format_details": {
        "flattening": {
            "active": true,
            "map_flattening_strategy": {
                "left": ".",
                "right": ""
            },
            "slice_flattening_strategy": {
                "left": "[",
                "right": "]"
            }
        }
    },
    ...
}

After applying JSON flattening, Hydrolix ingests the following, single-level JSON object:

{
    "date": "2020-01-01",
    "data.oranges[0]": 1,
    "data.oranges[1]": 2,
    "data.oranges[2]": 3,
    "data.apples[0].cortland": 6,
    "data.apples[0].honeycrisp[0]": 7,
    "data.apples[0].honeycrisp[1]": 8,
    "data.apples[0].honeycrisp[2]": 9,
    "data.apples[1][0]": 10,
    "data.apples[1][1]": 11,
    "data.apples[1][2]": 12
}

Depth Example⚓︎

Consider the following transform that specifies a flattening depth of 1:

"settings": {
    "format_details": {
        "flattening": {
            "active": true,
            "depth": 1,
            "map_flattening_strategy": {
                "left": ".",
                "right": ""
            },
            "slice_flattening_strategy": {
                "left": "[",
                "right": "]"
            }
        }
    },
    ...
}

Applying the transform produces the following JSON:

{
    "date": "2020-01-01",
    "data.apples":[
        {
            "cortland":6,
            "honeycrisp":[7,8,9]
        },
        [10,11,12]
    ],
    "data.oranges":[1,2,3]
}

This is useful when you want to leverage the map datatype and only flatten at a specific level.

Compression⚓︎

The compression property in a transform describes one or more compression algorithms that Hydrolix should expect to find already applied to input. Hydrolix uses the listed algorithms to decompress input into parseable data.

For example, setting the transform's compression property to "gzip" means that you expect the source data, in its entirety, to have had the GZIP compression method applied to it prior to its receipt by Hydrolix.

Algorithms⚓︎

Valid values for the compression property include the following:

Value Meaning
gzip Content is compressed via gzip (LZ77 with 32-bit CRC).
zip Content is ZIP-encoded via zlib (RFC 1950)
deflate Content is encoded in zlib structure and the deflate compression algorithm.
bzip2 Content is compressed with the bzip2 algorithm.
none Content isn't compressed. (Equivalent to not specifying compression at all.)

During streaming ingestion, the request may also use the content-encoding header. Hydrolix automatically handles that encoding during the streaming process.

Layering⚓︎

To define multiple layers of compression, specify them in a string separated by ,:

"compression": "gzip, bzip2, zip"

In the above example, Hydrolix first decompresses received data with zlib, then bzip2, and finally with gzip.

Layered Order Matters

Hydrolix decompresses data using algorithms in right-to-left (\<-) order.