
Format Options

Hydrolix can process input data structured in many different ways. Hydrolix currently supports data in CSV, JSON, or Parquet format, optionally compressed with several different algorithms.

CSV⚓︎

The CSV (Character-Separated Values) format encodes data as collections of lines containing columns separated by a specified delimiter character.

To create a transform schema that handles CSV input, set the type property to "csv" and the format_details field to an object that includes the following configuration options:

| Element | Type | Default | Description |
| --- | --- | --- | --- |
| delimiter | number, string | `,` | The delimiter substring, for example `,` or `\t`. |
| escape | number, string | `"` | The escape character. |
| skip_head | number | `0` | The number of rows to skip at the beginning of any ingestion. |
| quote | single-character string | `"` | The character that specifies the beginning and end of a string literal. |
| comment | single-character string | `#` | The character that marks the beginning of a comment line. |
| skip_comments | boolean | `false` | Whether or not to ignore lines beginning with the comment character. |
| windows_ending | boolean | `false` | Whether or not to allow Windows-style (CR-LF) line endings. |

The values of the delimiter and escape fields must be either one-byte character strings or the ASCII numeric equivalent.

The values of the quote and comment fields must be one-byte character strings.

{
    "name": "my_special_transform",
    "type": "csv",
    "settings": {
        "format_details": {
            "skip_head": 2,
            "delimiter": ","
        },
        ...
    }
}
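
Because the delimiter and escape fields also accept the ASCII numeric equivalent, a transform for tab-delimited input could specify the delimiter by its ASCII code (9 is horizontal tab). This is a sketch; the transform name is hypothetical:

```json
{
    "name": "tab_delimited_transform",
    "type": "csv",
    "settings": {
        "format_details": {
            "delimiter": 9,
            "skip_head": 1
        },
        ...
    }
}
```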

JSON⚓︎

To create a transform schema that handles JSON input, set the type property to "json". Because JSON input allows for complex nested structures, Hydrolix provides options to explicitly source output column data from nested structures and to flatten incoming data during a pre-processing step.

Datatype source⚓︎

By default, Hydrolix uses the names of the top-level keys to establish the mapping between output_columns and the source data. For example, if your source data contains a top-level property named employees that you wish to ingest, you must name the corresponding column definition in your transform employees. This convention also applies to flattened property names.

You can override this default mapping for any column definition in output_columns with the datatype.source field:

| Source | Syntax |
| --- | --- |
| Input data from a single field | `"source": { "from_input_field": "field-name" }` |
| Input data from multiple fields | `"source": { "from_input_fields": ["field-name-1", "field-name-2", ...] }` |
| Input data by index (in this example, field 3) | `"source": { "from_input_index": 3 }` |
| Input data in a nested JSON structure | `"source": { "from_json_pointers": ["/path/alternative/1", "/path/alternative/2", ...] }` |
| Variable | `"source": { "from_automatic_value": "variable-name" }` |
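
As a sketch of the multiple-field form (the column and field names here are hypothetical), a column definition sourcing data from two input fields might look like:

```json
{
    "name": "full_name",
    "datatype": {
        "type": "string",
        "index": true,
        "source": {
            "from_input_fields": ["first_name", "last_name"]
        }
    }
}
```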

Create a New Column from an Input⚓︎

The following example creates a new column named test from the data in the third column of input CSV data:

{
   "name" : "copy_example",
   "settings" : {
      "compression" : "none",
      "is_default" : true,
      "output_columns" : [
         {
            "datatype" : {
               "index" : false,
               "source" : {
                  "from_input_index" : 3
               },
               "type" : "string"
            },
            "name" : "test"
         }
      ]
   },
   "table" : "{{tableid}}",
   "type" : "csv"
}

The following example creates a new column named test from the data in a field called initial_name in input JSON data:

{
    "name": "copy_example",
    "type": "json",
    "table": "{{tableid}}",
    "settings": {
        "is_default": true,
        "compression": "none",
        "output_columns": [
            {
                "name": "test",
                "datatype": {
                    "type": "string",
                    "index": true,
                    "source": {
                        "from_input_field": "initial_name"
                    }
                }
            }
        ]
    }
}

Copy Data

By copying data into multiple columns, you can index the same data multiple times with different types.
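
For instance (a hedged sketch with hypothetical column and field names), the same input field could feed both a string column and a numeric column:

```json
"output_columns": [
    {
        "name": "status_text",
        "datatype": {
            "type": "string",
            "index": true,
            "source": { "from_input_field": "status" }
        }
    },
    {
        "name": "status_code",
        "datatype": {
            "type": "uint64",
            "index": true,
            "source": { "from_input_field": "status" }
        }
    }
]
```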

Create a New Column from Nested Input⚓︎

Expressions can also query data stored in a nested JSON structure using JSON Pointer syntax. Assume input data stored in the following structure:

  1. A JSON object called vegetables.
  2. Within vegetables, a nested JSON object called legumes.
  3. Within legumes, a field called species.

    {
      "vegetables": {
        "legumes": {
          "species": "example_data"
        }
      }
    }

Given this input data, you could use the following transform to define an output column named example_species:

{
    "name": "copy_example",
    "type": "json",
    "table": "{{tableid}}",
    "settings": {
        "is_default": true,
        "compression": "none",
        "output_columns": [
            {
                "name": "example_species",
                "datatype": {
                    "type": "string",
                    "index": true,
                    "source": {
                       "from_json_pointers": ["/vegetables/legumes/species"]
                    }
                }
            }
        ]
    }
}
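
Because from_json_pointers accepts a list of alternative paths, a single column definition can accommodate more than one possible input layout. In this sketch, the second path is hypothetical:

```json
"source": {
    "from_json_pointers": [
        "/vegetables/legumes/species",
        "/produce/legumes/species"
    ]
}
```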

JSON flattening⚓︎

JSON input can contain complex, multi-level nested objects and arrays. You can optionally flatten nested JSON data into single-level key/value pairs during ingestion, making it easier to map onto table columns. Configure flattening in the transform's format_details. See Flattening for configuration options and examples.

JSON compression⚓︎

For JSON transforms, set the compression property in the transform's settings to specify the compression algorithm applied to the input data. See Compression for supported algorithms and layering options.

During HTTP streaming ingestion, you can also use the Content-Encoding header for transport-level compression.

Parquet⚓︎

Apache Parquet is a columnar storage format used in data engineering pipelines and analytics systems such as Apache Spark, Apache Hive, and PyArrow. Parquet files carry their own schema: they're self-describing and don't require format-level configuration such as delimiters or quote characters.

Hydrolix supports ingesting Parquet data; however, it doesn't offer full Parquet table support.

To create a transform schema that handles Parquet input, set the type property to "parquet".

Parquet ingestion is supported across all Hydrolix intake methods. When ingesting via HTTP streaming, set the request header Content-Type: application/vnd.apache.parquet.

Example Parquet transform
{
    "name": "parquet_example",
    "type": "parquet",
    "settings": {
        "is_default": false,
        "output_columns": [
            {
                "name": "timestamp",
                "datatype": {
                    "type": "epoch",
                    "primary": true,
                    "format": "s",
                    "resolution": "s"
                }
            },
            {
                "name": "src_ip",
                "datatype": {
                    "type": "string",
                    "index": true
                }
            },
            {
                "name": "bytes_sent",
                "datatype": {
                    "type": "uint64",
                    "index": true
                }
            }
        ]
    }
}

Column mapping⚓︎

Parquet uses name-based column mapping. By default, the column names in the Parquet file must match the name field of each output column definition in the transform.

You can override the default mapping for any output column using the datatype.source field:

| Source | Syntax |
| --- | --- |
| Input data from a single field | `"source": { "from_input_field": "field-name" }` |
| Input data from multiple fields | `"source": { "from_input_fields": ["field-name-1", "field-name-2", ...] }` |
| Input data in a nested structure | `"source": { "from_json_pointers": ["/path/to/field"] }` |

For example, to map a Parquet column named source_address to an output column named src_ip:

Mapping a Parquet Column to an Output Column with a Different Name
{
    "name": "src_ip",
    "datatype": {
        "type": "string",
        "index": true,
        "source": {
            "from_input_field": "source_address"
        }
    }
}

To extract a value from a nested Parquet structure, use from_json_pointers with the path to the target field. Despite its name, this option isn't JSON-specific. It uses RFC 6901 JSON Pointer syntax to navigate any nested structure, including nested Parquet columns.

Extracting a Value from a Nested Parquet Column by Path
{
    "name": "region",
    "datatype": {
        "type": "string",
        "index": true,
        "source": {
            "from_json_pointers": ["/metadata/geo/region"]
        }
    }
}

Unmatched Parquet Columns

Parquet columns present in the file but not matched to any output column in the transform are silently ignored.

Nested structures and flattening⚓︎

For complex nested Parquet data, you can optionally flatten multi-level records into single-level key/value pairs during ingestion, making them easier to map onto table columns. Parquet transforms support the same format_details options as JSON, including flattening, pretransforms, and subtype. See Flattening for configuration options and examples.

Data type mapping⚓︎

Each Parquet column is initially mapped from its physical type to an intermediate representation.

| Parquet Physical Type | Intermediate Representation | Notes |
| --- | --- | --- |
| BOOLEAN | boolean | |
| INT32 | 64-bit integer | Widened from 32-bit |
| INT64 | 64-bit integer | |
| FLOAT | 64-bit float | Widened from 32-bit |
| DOUBLE | 64-bit float | |
| BYTE_ARRAY | string | |
| All other types | string | Converted through string representation |

From there, each column is converted to the Hydrolix type specified in the output column definition. See Data Types for the full list of supported Hydrolix column types.

Parquet Logical Types Aren't Interpreted

Hydrolix reads Parquet physical types only. Parquet logical type annotations such as DATE, TIME, TIMESTAMP_MILLIS, TIMESTAMP_MICROS, and DECIMAL are not interpreted during ingestion.

Parquet file compression⚓︎

Hydrolix supports all available Parquet file compression codecs except LZ4. To ingest files compressed with LZ4, re-compress them with a supported codec before sending them to Hydrolix.

Separately from Parquet's internal page-level compression, the HTTP request itself can be compressed using the Content-Encoding header. See Compression for supported request-level algorithms and layering options.

Parquet ingestion limitations⚓︎

  • BYTE_STREAM_SPLIT encoding isn't supported. This encoding is opt-in and doesn't affect files written with default settings.
  • Deprecated two-level list structures aren't supported. These are produced by older Hadoop, Hive, and Avro writers or by setting parquet.avro.write-old-list-structure=true in Apache Spark. Modern tools use the standard three-level list structure.
  • File size is constrained by available memory during both streaming and batch ingestion, as the entire file is read into memory.
  • Parquet files sent to Hydrolix with null metadata values may not be successfully ingested. The Parquet parsing library used by Hydrolix incorrectly treats the optional KeyValue.value metadata field as required. If you encounter a missing required field error in the intake job logs, rewrite the file with clean metadata before ingesting. For example, remove null metadata values or replace them with empty strings.

These limitations primarily affect files written with non-default settings or by legacy Hadoop-era tools.

Flattening⚓︎

When accepting JSON or Parquet source data, you may optionally flatten each incoming object prior to ingestion. This can transform complex, multi-level JSON structures into simple objects comprising one level of key/value pairs, ready for storage in a single table row.

To do this, define a flattening property within your transform's format_details. Set the value to an object with the following properties:

| Property | Value |
| --- | --- |
| active | If 1 (or any other true value), Hydrolix will flatten incoming records before ingesting them as rows. |
| map_flattening_strategy | Configuration for flattening any JSON objects within each row's main object. |
| slice_flattening_strategy | Configuration for flattening any JSON arrays within each row's main object. |
| depth | Configuration for specifying how "deep" flattening goes. Use a value of 0 to impose no limit. |

The two strategy properties accept an object that defines the rules that Hydrolix should follow to create new key names for the resulting, flattened JSON object.

| Property | Value |
| --- | --- |
| left | The substring to use when concatenating an element's key with its parent's key. |
| right | The substring to use when concatenating an element's key with its child's key. |

Omitting either "strategy" property (or setting it to null) disables flattening for objects or arrays, respectively.

Flattening Impacts Source Naming

Use the flattened version of a property name when defining output columns. For example, consider the following input:

{
  "employees": {
    "departments": [
      "produce",
      "meat"
    ]
  }
}

Assume the following flattening configuration:

"flattening": {
    "active": true,
    "map_flattening_strategy": {
        "left": ".",
        "right": ""
    },
    "slice_flattening_strategy": {
        "left": "[",
        "right": "]"
    }
}

To refer to the first value in the array stored in employees.departments, use the field name "employees.departments[0]".
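
Continuing the example, an output column sourced from that flattened name could be defined as follows (a sketch; the column name is hypothetical):

```json
{
    "name": "first_department",
    "datatype": {
        "type": "string",
        "index": true,
        "source": { "from_input_field": "employees.departments[0]" }
    }
}
```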

Example⚓︎

Consider the following JSON object:

{
  "date": "2020-01-01",
  "data": {
    "oranges": [ 1, 2, 3 ],
    "apples": [
      {
        "cortland": 6,
        "honeycrisp": [ 7, 8, 9 ]
      },
      [ 10, 11, 12 ]
    ]
  }
}

Assume the following flattening configuration in the ingestion transform:

"settings": {
    "format_details": {
        "flattening": {
            "active": true,
            "map_flattening_strategy": {
                "left": ".",
                "right": ""
            },
            "slice_flattening_strategy": {
                "left": "[",
                "right": "]"
            }
        }
    },
    ...
}

After applying flattening, Hydrolix ingests the following, single-level JSON object:

{
    "date": "2020-01-01",
    "data.oranges[0]": 1,
    "data.oranges[1]": 2,
    "data.oranges[2]": 3,
    "data.apples[0].cortland": 6,
    "data.apples[0].honeycrisp[0]": 7,
    "data.apples[0].honeycrisp[1]": 8,
    "data.apples[0].honeycrisp[2]": 9,
    "data.apples[1][0]": 10,
    "data.apples[1][1]": 11,
    "data.apples[1][2]": 12
}

Depth example⚓︎

Consider the following transform that specifies a flattening depth of 1:

"settings": {
    "format_details": {
        "flattening": {
            "active": true,
            "depth": 1,
            "map_flattening_strategy": {
                "left": ".",
                "right": ""
            },
            "slice_flattening_strategy": {
                "left": "[",
                "right": "]"
            }
        }
    },
    ...
}

Applying the transform produces the following JSON:

{
    "date": "2020-01-01",
    "data.apples":[
        {
            "cortland":6,
            "honeycrisp":[7,8,9]
        },
        [10,11,12]
    ],
    "data.oranges":[1,2,3]
}

This is useful when you want to leverage the map datatype and only flatten at a specific level.

Compression⚓︎

Hydrolix supports common compression algorithms and offers multiple mechanisms to influence decompression of received data.

Algorithms⚓︎

Valid values for the compression property:

| Value | Meaning |
| --- | --- |
| gzip | Content is compressed using gzip (LZ77 with 32-bit CRC). |
| zip | Content is ZIP-encoded using zlib (RFC 1950). |
| deflate | Content is encoded in zlib structure with the deflate compression algorithm. |
| bzip2 | Content is compressed with the bzip2 algorithm. |
| none | Content isn't compressed. (Equivalent to not specifying compression at all.) |

Transform-level decompression⚓︎

Use the compression property to specify the compression algorithms applied to the input. Hydrolix decompresses received data accordingly. The decompressed data is treated as the type specified in the transform definition.

For example, a JSON transform with compression set to gzip means the entire source payload is expected to be gzip-compressed and, once expanded, to contain JSON-formatted data.
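
A minimal sketch of such a transform (the name is hypothetical; other settings are elided):

```json
{
    "name": "gzip_json_transform",
    "type": "json",
    "settings": {
        "compression": "gzip",
        ...
    }
}
```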

This is honored by all ingestion systems including batch, autoingest, and HTTP Stream API.

Transform-level compression settings are independent of streaming decompression.

Streaming decompression⚓︎

Streaming decompression is only available in the HTTP Stream API. This feature is available to all clients of the HTTP Stream API, including push methods like Amazon Data Firehose and Google Pub/Sub.

The HTTP header Content-Encoding allows an HTTP client to signal the use of on-the-fly network transport compression to improve throughput efficiency or reduce costs.

Use the Content-Encoding header to decompress data before any compression settings from the transform apply.

Layering⚓︎

Define multiple layers of decompression by listing the compression algorithms in the order they were applied to the original data. Separate each algorithm name from the next with a comma and a space, whether you configure compression with the Content-Encoding header or the compression field of a transform.

Hydrolix decompresses starting from the last-applied compression algorithm.

Pseudocode Demonstration of Layered Compression Algorithms
# Client compresses data multiple times, first with A, then B, and last C
encoded_data = C(B(A(data)))

# If using HTTP Stream API only, specify compression order in Content-Encoding header
Content-Encoding: A, B, C

# If original data files are compressed, specify the order in `settings.compression`
"compression": "A, B, C"

# Decompression order is reverse compression order: C, then B, then A
decoded_data = decodeA(decodeB(decodeC(encoded_data)))

For example, the following shell session layers three compression algorithms on a JSON file:

$ gzip -c < data.json > data.json.gz
$ bzip2 -c < data.json.gz  > data.json.gz.bz2
$ echo data.json.gz.bz2 | zip  --names-stdin data.json.gz.bz2.zip
$ ls -lrt
-rw-rw-r-- 1 user group 12655 Feb 26 06:30 data.json
-rw-rw-r-- 1 user group  2100 Feb 26 06:31 data.json.gz
-rw-rw-r-- 1 user group  2505 Feb 26 06:32 data.json.gz.bz2
-rw-rw-r-- 1 user group  2687 Feb 26 06:36 data.json.gz.bz2.zip

Specify all decompression algorithms in the transform settings.

{
    "name": "decompression_multi",
    "type": "json",
    "settings": {
        "compression": "gzip, bzip2, zip",
        "output_columns": [ ... ]
    }
}

Specify all decompression algorithms in the Content-Encoding header.

curl --fail --silent \
  --header "X-Hdx-Table: news.requests" \
  --header "Content-Type: application/json" \
  --header "Authorization: Bearer ${HDX_TOKEN}" \
  --header "Content-Encoding: gzip, bzip2, zip" \
  --data-binary @data.json.gz.bz2.zip \
  -- "${HDX_HYDROLIX_URL}/ingest/event"

When using multiple compression algorithms, it's acceptable to mix the usage of Content-Encoding and the compression setting in a transform.

In the above example, the following compression settings could be used instead:

  • streaming Content-Encoding is bzip2, zip
  • transform-level compression is gzip (first)

With this configuration, the HTTP Stream API runs zip and bzip2 decompression before passing the stream to the transform, which applies gzip decompression.
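
A sketch of that split, showing only the transform fragment:

```json
"settings": {
    "compression": "gzip",
    ...
}
```

with the accompanying HTTP request carrying the header Content-Encoding: bzip2, zip.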

Streaming decompression allows independent configuration of compression algorithms to support network transport compression. This allows a single transform to be used in the batch or autoingest systems while supporting network transport compression in HTTP Stream API.

Mismatch errors⚓︎

When the incoming data can't be decompressed using the specified algorithm, the ingestion system produces one of the following error messages.

  • `bzip2 data invalid: bad magic value`: the specified compression is bzip2, but the incoming data isn't compressed with bzip2
  • `failed to create zip reader: failed to open local file: zip: not a valid zip file`: the specified compression is zip, but the incoming data isn't a zipfile
  • `unable to decode data: invalid character ',' looking for beginning of value`: an incorrect decompression algorithm was applied
  • `EOF`: end of file, another general class of error for compression mismatch

When mismatch errors are encountered in the HTTP Stream API, the response status code is HTTP 400.

Layered Order Matters

Hydrolix decompresses data by applying the listed algorithms in right-to-left order, the reverse of the order in which they were applied.