# Format Options
Hydrolix can process input data structured in many different ways. Hydrolix currently supports data in CSV, JSON, or Parquet format, compressed with several different algorithms.
## CSV
The CSV (Character-Separated Values) format encodes data as collections of lines containing columns separated by a specified delimiter character.
To create a transform schema that handles CSV input, set the `type` property to `"csv"` and the `format_details` field to an object that includes the following configuration options:
| Element | Type | Default | Description |
|---|---|---|---|
| `delimiter` | number, string | `,` | The delimiter character, for example `,` or `\t`. |
| `escape` | number, string | `"` | The escape character. |
| `skip_head` | number | `0` | The number of rows to skip at the beginning of any ingestion. |
| `quote` | single-character string | `"` | The character that specifies the beginning and end of a string literal. |
| `comment` | single-character string | `#` | The character that marks the beginning of a comment line. |
| `skip_comments` | boolean | `false` | Whether to ignore lines beginning with the comment character. |
| `windows_ending` | boolean | `false` | Whether to allow Windows-style (CR-LF) line endings. |
The values of the `delimiter` and `escape` fields must be either one-byte character strings or the ASCII numeric equivalent.
The values of the `quote` and `comment` fields must be one-byte character strings.
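For example, a transform fragment for tab-delimited files with one header row might look like the following (the field values shown are illustrative, not defaults):

```json
{
  "type": "csv",
  "format_details": {
    "delimiter": "\t",
    "escape": "\"",
    "skip_head": 1,
    "quote": "\"",
    "comment": "#",
    "skip_comments": true,
    "windows_ending": false
  }
}
```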
## JSON
To create a transform schema that handles JSON input, set the `type` property to `"json"`. Because JSON input allows for complex nested structures, Hydrolix provides options to explicitly source output column data from nested structures and to flatten incoming data during a pre-processing step.
### Datatype source
By default, Hydrolix uses the names of the top-level keys to establish the mapping between `output_columns` and the source data. For example, if your source data contains a top-level property named `employees` that you wish to ingest, you must name the corresponding column definition in your transform `employees`. This convention also applies to flattened property names.
You can override this default mapping for any column definition in `output_columns` with the `datatype.source` field:
| Source | Syntax |
|---|---|
| Input data from a single field | `"source": { "from_input_field": "field-name" }` |
| Input data from multiple fields | `"source": { "from_input_fields": ["field-name-1", "field-name-2", ...] }` |
| Input data by index, in this example field 3 | `"source": { "from_input_index": 3 }` |
| Input data in a nested JSON structure | `"source": { "from_json_pointers": ["/path/alternative/1", "/path/alternative/2", ...] }` |
| Variable | `"source": { "from_automatic_value": "variable-name" }` |
### Create a New Column from an Input
The following example creates a new column named test from the data in the third column of input CSV data:
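A sketch of such an output column definition, assuming `from_input_index` is zero-based (so the third column is index 2) and assuming a `string` datatype:

```json
{
  "name": "test",
  "datatype": {
    "type": "string",
    "source": { "from_input_index": 2 }
  }
}
```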
The following example creates a new column named test from the data in a field called initial_name in input JSON data:
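A sketch of the corresponding column definition (the `string` datatype is an assumption):

```json
{
  "name": "test",
  "datatype": {
    "type": "string",
    "source": { "from_input_field": "initial_name" }
  }
}
```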
**Copy Data**
By copying data into multiple columns, you can index the same data multiple times with different types.
### Create a New Column from Nested Input
Expressions can also query data stored in a nested JSON structure using JSON Pointer syntax. Assume input data stored in the following structure:
- A JSON object called `vegetables`.
- Within `vegetables`, a nested JSON object called `legumes`.
- Within `legumes`, a field called `species`.
Given this input data, you could use the following transform to define an output column named example_species:
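A sketch of such a column definition, using a JSON Pointer path through the structure described above (the `string` datatype is an assumption):

```json
{
  "name": "example_species",
  "datatype": {
    "type": "string",
    "source": { "from_json_pointers": ["/vegetables/legumes/species"] }
  }
}
```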
### JSON flattening
JSON input can contain complex, multi-level nested objects and arrays. You can optionally flatten nested JSON data into single-level key/value pairs during ingestion, making it easier to map onto table columns. Configure flattening in the transform's `format_details`. See Flattening for configuration options and examples.
### JSON compression
For JSON transforms, set the `compression` property in the transform's `settings` to specify the compression algorithm applied to the input data. See Compression for supported algorithms and layering options.
During HTTP streaming ingestion, you can also use the `Content-Encoding` header for transport-level compression.
## Parquet
Apache Parquet is a columnar storage format used in data engineering pipelines and analytics systems such as Apache Spark, Apache Hive, and PyArrow. Parquet files carry their own schema: they're self-describing and don't require format-level configuration such as delimiters or quote characters.
Hydrolix supports ingesting Parquet data; however, it doesn't have full Parquet table support.
To create a transform schema that handles Parquet input, set the `type` property to `"parquet"`.
Parquet ingestion is supported across all Hydrolix intake methods. When ingesting via HTTP streaming, set the request header `Content-Type: application/vnd.apache.parquet`.
### Column mapping
Parquet uses name-based column mapping. By default, the column names in the Parquet file must match the `name` field of each output column definition in the transform.
You can override the default mapping for any output column using the `datatype.source` field:
| Source | Syntax |
|---|---|
| Input data from a single field | `"source": { "from_input_field": "field-name" }` |
| Input data from multiple fields | `"source": { "from_input_fields": ["field-name-1", "field-name-2", ...] }` |
| Input data in a nested structure | `"source": { "from_json_pointers": ["/path/to/field"] }` |
For example, to map a Parquet column named source_address to an output column named src_ip:
**Mapping a Parquet Column to an Output Column with a Different Name**
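A sketch of the mapping (the `string` datatype is an assumption):

```json
{
  "name": "src_ip",
  "datatype": {
    "type": "string",
    "source": { "from_input_field": "source_address" }
  }
}
```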
To extract a value from a nested Parquet structure, use `from_json_pointers` with the path to the target field. Despite its name, this option isn't JSON-specific: it uses RFC 6901 JSON Pointer syntax to navigate any nested structure, including nested Parquet columns.
**Extracting a Value from a Nested Parquet Column by Path**
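A sketch under assumed names (`outer_record` and `inner_field` are hypothetical, as is the `string` datatype):

```json
{
  "name": "example_field",
  "datatype": {
    "type": "string",
    "source": { "from_json_pointers": ["/outer_record/inner_field"] }
  }
}
```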
**Unmatched Parquet Columns**
Parquet columns present in the file but not matched to any output column in the transform are silently ignored.
- To capture unmatched columns, set `"catch_all": true` on an output column.
- To capture type conversion errors from non-primary columns, set `"catch_rejects": true`.
### Nested structures and flattening
For complex nested Parquet data, you can optionally flatten multi-level records into single-level key/value pairs during ingestion, making them easier to map onto table columns. Parquet transforms support the same `format_details` options as JSON, including flattening, pretransforms, and `subtype`. See Flattening for configuration options and examples.
### Data type mapping
Each Parquet column is initially mapped from its physical type to an intermediate representation.
| Parquet Physical Type | Intermediate Representation | Notes |
|---|---|---|
| `BOOLEAN` | boolean | |
| `INT32` | 64-bit integer | Widened from 32-bit |
| `INT64` | 64-bit integer | |
| `FLOAT` | 64-bit float | Widened from 32-bit |
| `DOUBLE` | 64-bit float | |
| `BYTE_ARRAY` | string | |
| All other types | string | Converted through string representation |
From there, each column is converted to the Hydrolix type specified in the output column definition. See Data Types for the full list of supported Hydrolix column types.
**Parquet Logical Types Aren't Interpreted**
Hydrolix reads Parquet physical types only. Parquet logical type annotations such as `DATE`, `TIME`, `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, and `DECIMAL` aren't interpreted during ingestion.
### Parquet file compression
Hydrolix supports all available Parquet file compression codecs except LZ4. To ingest files compressed with LZ4, re-compress them with a supported codec before sending them to Hydrolix.
Separately from Parquet's internal page-level compression, the HTTP request itself can be compressed using the Content-Encoding header. See Compression for supported request-level algorithms and layering options.
### Parquet ingestion limitations
- `BYTE_STREAM_SPLIT` encoding isn't supported. This encoding is opt-in and doesn't affect files written with default settings.
- Deprecated two-level list structures aren't supported. These are produced by older Hadoop, Hive, and Avro writers, or by setting `parquet.avro.write-old-list-structure=true` in Apache Spark. Modern tools use the standard three-level list structure.
- File size is constrained by available memory during both streaming and batch ingestion, as the entire file is read into memory.
- Parquet files sent to Hydrolix with `null` metadata values may not be successfully ingested. The Parquet parsing library used by Hydrolix incorrectly treats the optional `KeyValue.value` metadata field as required. If you encounter a missing required field error in the intake job logs, rewrite the file with clean metadata before ingesting; for example, remove null metadata values or replace them with empty strings.
These limitations primarily affect files written with non-default settings or by legacy Hadoop-era tools.
## Flattening
When accepting JSON or Parquet source data, you can optionally flatten each incoming object prior to ingestion. This transforms complex, multi-level JSON structures into simple objects comprising one level of key/value pairs, ready for storage in a single table row.
To do this, define a `flattening` property within your transform's `format_details`. Set the value to an object with the following properties:
| Property | Value |
|---|---|
| `active` | If `1` (or any other true value), Hydrolix flattens incoming records before ingesting them as rows. |
| `map_flattening_strategy` | Configuration for flattening any JSON objects within each row's main object. |
| `slice_flattening_strategy` | Configuration for flattening any JSON arrays within each row's main object. |
| `depth` | Configuration for specifying how "deep" flattening goes. Use a value of `0` to impose no limit. |
The two strategy properties accept an object that defines the rules Hydrolix follows to create new key names for the resulting flattened JSON object.

| Property | Value |
|---|---|
| `left` | The substring to use when concatenating an element's key with its parent's key. |
| `right` | The substring to use when concatenating an element's key with its child's key. |
Omitting either strategy property, or setting it to `null`, deactivates flattening for objects or arrays, respectively.
**Flattening Impacts Source Naming**
Use the flattened version of a property name when defining output columns. For example, consider the following input:
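For instance (hypothetical data):

```json
{
  "employees": {
    "departments": ["sales", "engineering"]
  }
}
```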
Assume the following flattening configuration:
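A sketch of such a configuration, using `.` as the map separator and `[`/`]` to bracket array indexes (the separator choices are assumptions consistent with the field name described next):

```json
{
  "flattening": {
    "active": true,
    "map_flattening_strategy": { "left": ".", "right": "." },
    "slice_flattening_strategy": { "left": "[", "right": "]" }
  }
}
```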
To refer to the first value in the array stored in `employees.departments`, use the field name `"employees.departments[0]"`.
### Example
Consider the following JSON object:
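For instance (hypothetical data):

```json
{
  "timestamp": "2024-01-01T12:00:00Z",
  "reading": {
    "temperature": 67,
    "humidity": 0.6
  },
  "tags": ["sensor", "rooftop"]
}
```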
Assume the following flattening configuration in the ingestion transform:
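A sketch of such a configuration, assuming `.` as the map separator and `[`/`]` for array indexes:

```json
{
  "format_details": {
    "flattening": {
      "active": true,
      "map_flattening_strategy": { "left": ".", "right": "." },
      "slice_flattening_strategy": { "left": "[", "right": "]" }
    }
  }
}
```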
After applying flattening, Hydrolix ingests the following, single-level JSON object:
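Assuming an input object with a nested `reading` map and a `tags` array, flattened with `.` as the map separator and `[`/`]` for array indexes, the result would look like:

```json
{
  "timestamp": "2024-01-01T12:00:00Z",
  "reading.temperature": 67,
  "reading.humidity": 0.6,
  "tags[0]": "sensor",
  "tags[1]": "rooftop"
}
```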
### Depth example
Consider the following transform that specifies a flattening depth of 1:
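A sketch of the relevant fragment (the map separator choice is an assumption):

```json
{
  "format_details": {
    "flattening": {
      "active": true,
      "depth": 1,
      "map_flattening_strategy": { "left": ".", "right": "." }
    }
  }
}
```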
Applying the transform produces the following JSON:
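As a hypothetical illustration, given an input of `{"a": {"b": {"c": 1}}}`, one plausible outcome is that flattening stops after one level and leaves the deeper object intact:

```json
{
  "a.b": { "c": 1 }
}
```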
This is useful when you want to leverage the `map` datatype and flatten only to a specific level.
## Compression
Hydrolix supports common compression algorithms and offers multiple mechanisms to influence decompression of received data.
### Algorithms
Valid values for the `compression` property:

| Value | Meaning |
|---|---|
| `gzip` | Content is compressed using gzip (LZ77 with a 32-bit CRC). |
| `zip` | Content is ZIP-encoded using zlib (RFC 1950). |
| `deflate` | Content is encoded in the zlib structure with the deflate compression algorithm. |
| `bzip2` | Content is compressed with the bzip2 algorithm. |
| `none` | Content isn't compressed. (Equivalent to not specifying compression at all.) |
### Transform-level decompression
Use the `compression` property to specify the compression algorithms applied to the input. Hydrolix decompresses received data accordingly. The decompressed data is treated as the type specified in the transform definition.
For example, a transform handling `json` data with `compression` set to `gzip` means that you expect the entire source to be GZIP-compressed and, when expanded, to be in JSON format.
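A fragment expressing that expectation might look like:

```json
{
  "type": "json",
  "settings": {
    "compression": "gzip"
  }
}
```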
This is honored by all ingestion systems including batch, autoingest, and HTTP Stream API.
Transform-level compression settings are independent of streaming decompression.
### Streaming decompression
Streaming decompression is available only in the HTTP Stream API. This feature is available to all clients of the HTTP Stream API, including push methods like Amazon Data Firehose and Google Pub/Sub.
The HTTP header `Content-Encoding` allows an HTTP client to signal the use of on-the-fly network transport compression to improve throughput efficiency or reduce costs.
Hydrolix decompresses data according to the `Content-Encoding` header before applying any compression settings from the transform.
### Layering
Define multiple layers of decompression by listing the compression algorithms in the order they were applied to the original data. Separate each algorithm name from the next with a comma and a space, whether you configure compression using the `Content-Encoding` header or the `compression` field of a transform.
Hydrolix decompresses starting from the last-applied compression algorithm.
Specify all decompression algorithms in the transform settings.
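A sketch, assuming the source was compressed with gzip first, then zip, then bzip2:

```json
{
  "settings": {
    "compression": "gzip, zip, bzip2"
  }
}
```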
Specify all decompression algorithms in the Content-Encoding header.
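The equivalent header, under the same assumption (gzip applied first, then zip, then bzip2):

```http
Content-Encoding: gzip, zip, bzip2
```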
When using multiple compression algorithms, it's acceptable to mix the usage of `Content-Encoding` and the `compression` setting in a transform.
In the above example, the following compression settings could be used instead:
- streaming `Content-Encoding` is `bzip2, zip`
- transform-level `compression` is `gzip` (first)

With this configuration, the HTTP Stream API runs zip and bzip2 decompression before passing the stream to the transform, which applies gzip decompression.
Streaming decompression allows independent configuration of compression algorithms to support network transport compression. This allows a single transform to be used in the batch or autoingest systems while supporting network transport compression in HTTP Stream API.
### Mismatch errors
When the incoming data can't be decompressed using the specified algorithm, the ingestion system produces one of the following error messages:

- `bzip2 data invalid: bad magic value`: compression specified is `bzip2` but the incoming data isn't compressed with bzip2
- `failed to create zip reader: failed to open local file: zip: not a valid zip file`: compression specified is `zip` but the incoming data isn't a zipfile
- `unable to decode data: invalid character ',' looking for beginning of value`: incorrect decompression algorithm applied
- `EOF`: end of file, another general class of error for compression mismatch
When mismatch errors are encountered in the HTTP Stream API, the response status code is HTTP 400.
**Layered Order Matters**
Hydrolix decompresses data using algorithms in right-to-left order.