Self Described Messages

Instead of pre-publishing a transform, you can send Hydrolix “self described” messages, which carry their own embedded schema.

To stream a batch of self described messages to Hydrolix, send an HTTP POST request to the streaming_url for the deployment.

Required HTTP Headers

Name | Value
x-hdx-streaming-format | json-batch
x-hdx-streaming-type | described

Message attributes

Self described messages must be associated with a destination table. Each batch has the following top-level attributes:

Attribute | Purpose
namespace_id | To be provided by Hydrolix.
settings | Contains the format of the data elements later in the document.
data | An array of individual messages conforming to the schema described in settings.
{
  "namespace_id": "to be given by Hydrolix",
  "settings": {
    "schema": []
  },
  "data": []
}
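
The envelope above can be assembled programmatically. The following is a minimal Python sketch; the helper name build_message is hypothetical and not part of any Hydrolix client library, and the namespace_id value is the same placeholder used above.

```python
import json

def build_message(namespace_id, schema, rows):
    """Assemble a self described batch from its three top-level attributes."""
    return {
        "namespace_id": namespace_id,     # placeholder; the real value comes from Hydrolix
        "settings": {"schema": schema},   # column descriptions
        "data": rows,                     # one array per message, in schema order
    }

message = build_message("to be given by Hydrolix", [], [])
body = json.dumps(message)  # JSON text to use as the POST body
```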

schema element

The settings attribute contains a schema: an array that describes each data element expected in the message and how each field should be treated. For example:

"schema": [
  {
    "name": "timestamp",
    "type": "datetime",
    "format": "2006-01-02 15:04:05 MST",
    "treatment": "primary"
  },
  {
    "name": "clientId",
    "type": "uint64",
    "treatment": "tag",
    "default": 0
  }
]

Each element has five potential attributes:

Attribute | Purpose | Required
name | The column name Turbine should use. | Yes
type | The data type of the column. | Yes
treatment | How to index the column. More on this below. | Yes
format | For datetime columns, the pattern Turbine uses to parse the value in the source data, written as a Go time layout. | Only for datetime
default | The value to use when a row has no value for the column. | No
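
The attribute rules in this table can be checked locally before a batch is sent. The sketch below is only an illustration: check_element is a hypothetical helper, and flagging unknown attributes is an assumption of this example, not documented Hydrolix behavior.

```python
REQUIRED = ("name", "type", "treatment")
ALLOWED = REQUIRED + ("format", "default")

def check_element(element):
    """Return a list of problems found in one schema element (empty if none)."""
    problems = [
        f"missing required attribute: {key}"
        for key in REQUIRED
        if key not in element
    ]
    # Assumption: anything outside the five documented attributes is a mistake.
    problems += [
        f"unknown attribute: {key}" for key in element if key not in ALLOWED
    ]
    # format is optional except for datetime columns.
    if element.get("type") == "datetime" and "format" not in element:
        problems.append("datetime columns need a format")
    return problems
```

An element that passes returns an empty list, so the result can be used directly in a condition such as "if check_element(element):".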

name

The name of the column in Turbine. Using snake_case is recommended for readability.

type

See the list of data types currently supported in Turbine.

treatment

Every output_column requires a treatment, which specifies how the column is indexed or used. A column can have only one treatment.

Every data set must have a datetime column that is treated as primary. A table can define more than one datetime column, but only one can be primary. The primary column is used as the main index for the table.

Note: A treatment is not directly linked to a data type, though there are restrictions on what can be a tag or a metric. A common assumption is that numeric values should always be metrics and strings should always be tags. With Turbine it is more useful to think of tags and metrics as indexed and non-indexed columns. Any column that is a tag can be indexed, regardless of cardinality.

Treatment | Supported data types | Description
primary | datetime | The column to be used as the main index. It is used in determining the HDX partition. Each data set must have EXACTLY ONE primary column. The primary column may not have any null values.
tag | datetime, uint64, string | Indexed metadata about a metric. Used for filtering.
metric | uint64, double | Non-indexed values.
ignore | Any | This column is ignored and will not be written in HDX format.
virtual | Any | A field that is not part of the original data set but should be kept with the data. An example would be a table name where tables are partitioned by data to handle the limitations of local attached storage.
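
The treatment rules can be expressed as a small lookup plus a count of primary columns. This is a sketch for local checking only; check_treatments and SUPPORTED_TYPES are hypothetical names, and the type sets are copied from the table above.

```python
# Data types each treatment accepts; None means any type is allowed.
SUPPORTED_TYPES = {
    "primary": {"datetime"},
    "tag": {"datetime", "uint64", "string"},
    "metric": {"uint64", "double"},
    "ignore": None,
    "virtual": None,
}

def check_treatments(schema):
    """Return a list of treatment problems for a whole schema (empty if none)."""
    problems = []
    for element in schema:
        name, treatment = element["name"], element["treatment"]
        if treatment not in SUPPORTED_TYPES:
            problems.append(f"{name}: unknown treatment {treatment!r}")
            continue
        allowed = SUPPORTED_TYPES[treatment]
        if allowed is not None and element["type"] not in allowed:
            problems.append(f"{name}: {element['type']} cannot be a {treatment}")
    # Each data set needs exactly one primary column.
    primaries = sum(1 for e in schema if e["treatment"] == "primary")
    if primaries != 1:
        problems.append(f"expected exactly one primary column, found {primaries}")
    return problems
```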

data

The data element is an array containing individual messages. Each message is represented as an array containing values in the order in which the columns were described in the schema. If a message in the batch contains a null or empty value, that value is replaced with the column's “default” value, if one is declared.

"data": [
    [ "2020-02-26 16:01:27 PST", 29991, "1.2.3.4/24", 1223, "1.4.5.7", 1.234 ],
    [ "2020-02-26 16:01:28 PST", 29989, "1.2.3.5/24", 9190, "1.4.5.7", 1.324 ],
    [ "2020-02-26 16:01:28 PST", 29990, "1.2.3.5/24", null, "1.4.5.7", 12.34 ]
  ]
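
The default substitution described above happens on the Hydrolix side. The following Python sketch only illustrates the rule for a single row; apply_defaults is a hypothetical helper, and treating an empty string as an "empty value" is an assumption of this example.

```python
def apply_defaults(schema, row):
    """Return a copy of row with null or empty values replaced by declared defaults."""
    if len(row) != len(schema):
        raise ValueError(f"expected {len(schema)} values, got {len(row)}")
    filled = []
    # Values are positional: the i-th value belongs to the i-th schema element.
    for element, value in zip(schema, row):
        if (value is None or value == "") and "default" in element:
            value = element["default"]
        filled.append(value)
    return filled
```

A column with no declared default keeps its null value, which is why the primary column must always be populated by the sender.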

Example API Call

$ curl -s \
     -H "x-hdx-streaming-format: json-batch" \
     -H "x-hdx-streaming-type: described" \
     ${streaming_url} -X POST -d '
{
  "namespace_id": "to be provided",
  "settings": {
    "schema": [
      {
        "name": "timestamp",
        "type": "datetime",
        "format": "2006-01-02 15:04:05 MST",
        "treatment": "primary"
      },
      {
        "name": "clientId",
        "type": "uint64",
        "treatment": "tag",
        "default": 0
      },
      {
        "name": "clientIp",
        "type": "string",
        "treatment": "tag",
        "default": "0.0.0.0"
      },
      {
        "name": "clientCityCode",
        "type": "uint64",
        "treatment": "tag",
        "default": 0
      },
      {
        "name": "resolverIp",
        "type": "string",
        "treatment": "tag",
        "default": "0.0.0.0"
      },
      {
        "name": "resolveDuration",
        "type": "double",
        "treatment": "metric",
        "default": -1.0
      }
    ]
  },
  "data": [
    [ "2020-02-26 16:01:27 PST", 29991, "1.2.3.4/24", 1223, "1.4.5.7", 1.234 ],
    [ "2020-02-26 16:01:28 PST", 29989, "1.2.3.5/24", 9190, "1.4.5.7", 1.324 ],
    [ "2020-02-26 16:01:28 PST", 29990, "1.2.3.5/24", null, "1.4.5.7", 12.34 ]
  ]
}'
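
The same request can be made from Python with the standard library. This is a minimal sketch, not an official client: build_request and send are hypothetical helper names, and the streaming URL must be supplied by the caller.

```python
import json
import urllib.request

def build_request(streaming_url, message):
    """Build a POST request carrying a self described batch."""
    return urllib.request.Request(
        streaming_url,
        data=json.dumps(message).encode("utf-8"),
        method="POST",
        headers={
            "x-hdx-streaming-format": "json-batch",
            "x-hdx-streaming-type": "described",
        },
    )

def send(request):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping request construction separate from sending makes it possible to inspect the headers and body before anything goes over the network.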