Stream Settings

When using a streaming method (HTTP Stream API, Kafka, AWS Kinesis) to ingest data, Hydrolix provides several customizations for how and when data is written into tables. There is also an option for table-specific stream authentication.

Configuring How and When Data is Written

Streaming event data falls into two categories:

  • Hot data that needs to be made available as soon as possible.
  • Cold data that arrives late or out of order.

A good example is CDN or appliance logs, where 95% of logs arrive within a 15-minute window and the last 5% trickle in over the following 24 hours. (Don't confuse this with "hot" storage vs. "cold" storage, which is a different concept.)

Hot and cold data can be configured independently, with settings that determine how often event partitions are flushed to storage and how many partitions are processed at any one time.

🚧

Defaults

We recommend trying out the system defaults before adjusting any of these settings. The default settings meet the majority of customer use cases.

Configuring Data Write Settings

The settings for how data is written to a table are configured either through the portal, under the Sources menu, or via the RESTful Tables API.

To effectively manage incoming stream data, Hydrolix Streaming has configurable options for Hot and Cold data.

  • Hot data is near-term data: an event received within the hot_data_max_age_minutes window of now().
  • Cold data is late-arriving event data: older than hot_data_max_age_minutes but within cold_data_max_age_days.
  • Data older than cold_data_max_age_days is rejected and ignored (the sketch below illustrates this classification).
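
As an illustration of this classification (not the actual ingest code), the following Python sketch shows how an event's primary timestamp might be bucketed as hot, cold, or rejected; the setting values are placeholders for your table's actual stream settings:

from datetime import datetime, timedelta, timezone

# Illustrative values only; use your table's actual stream settings.
HOT_DATA_MAX_AGE_MINUTES = 3
COLD_DATA_MAX_AGE_DAYS = 365

def classify_event(primary: datetime) -> str:
    """Bucket an event by the age of its primary (datetime) column."""
    age = datetime.now(timezone.utc) - primary
    if age <= timedelta(minutes=HOT_DATA_MAX_AGE_MINUTES):
        return "hot"        # recent data, made available as soon as possible
    if age <= timedelta(days=COLD_DATA_MAX_AGE_DAYS):
        return "cold"       # late or out-of-order data
    return "rejected"       # older than cold_data_max_age_days; ignored

print(classify_event(datetime.now(timezone.utc) - timedelta(minutes=1)))  # hot
print(classify_event(datetime.now(timezone.utc) - timedelta(hours=6)))    # cold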

Hot Settings

The following settings define Hot data behavior:

  • hot_data_max_age_minutes: How long data is considered hot, measured back from now(). The primary (datetime) column of each incoming event is inspected; events older than this window are too old to be "hot". Default: 3
  • hot_data_max_active_partitions: The maximum number of partitions to keep open on the server at any one time. Default: 3
  • hot_data_max_rows_per_partition: The maximum size (measured in number of rows) that any open partition is allowed to reach. Default: 12288000
  • hot_data_max_minutes_per_partition: The maximum width, in minutes, of a partition; that is, the maximum allowable distance between the newest and oldest primary values of rows in the partition. Default: 1
  • hot_data_max_open_seconds: The maximum duration (in wall-clock seconds) to keep a recent-data partition open waiting for events to trickle in. Default: 60
  • hot_data_max_idle_seconds: The maximum duration, in seconds, to wait after the last received event before automatically closing an open partition. Default: 30
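
These limits interact: an open partition is flushed as soon as it hits any one of them. The following Python sketch illustrates that rule using the hot defaults; it is a simplification for clarity, not Hydrolix's internal implementation:

# Illustrative only: an open hot partition is flushed as soon as any limit
# from the list above is reached.
def should_close_hot_partition(row_count, minutes_span, open_seconds, idle_seconds,
                               max_rows=12288000,      # hot_data_max_rows_per_partition
                               max_minutes=1,          # hot_data_max_minutes_per_partition
                               max_open_seconds=60,    # hot_data_max_open_seconds
                               max_idle_seconds=30):   # hot_data_max_idle_seconds
    return (row_count >= max_rows
            or minutes_span >= max_minutes
            or open_seconds >= max_open_seconds
            or idle_seconds >= max_idle_seconds)

# A partition idle for 45 seconds closes even though it is far below the row limit.
print(should_close_hot_partition(row_count=10_000, minutes_span=0.5,
                                 open_seconds=50, idle_seconds=45))  # True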

Cold Settings

The following settings define Cold data behavior:

  • cold_data_max_age_days: How long, in days, data is still considered cold and therefore worth indexing. The primary (datetime) column of each incoming event is inspected; events older than this are too old to index at all and are consigned to the scrap heap of history (well, Rejects). Default: 365
  • cold_data_max_active_partitions: The maximum number of partitions to keep open at any one time. Default: 50
  • cold_data_max_rows_per_partition: The maximum size (measured in number of rows) that any open partition is allowed to reach. Default: 12288000
  • cold_data_max_minutes_per_partition: The maximum width, in minutes, of a partition; that is, the maximum allowable distance between the newest and oldest primary values of rows in the partition. Default: 60
  • cold_data_max_open_seconds: The maximum duration (in wall-clock seconds) to keep a partition open waiting for events to trickle in. Default: 300
  • cold_data_max_idle_seconds: The maximum duration, in seconds, to wait for new data to appear before automatically closing an open partition. Default: 60

Other Settings

The following settings define all data behavior, hot and cold:

  • message_queue_max_rows: The maximum number of rows to pass to the internal message queue in a single message when Hydrolix receives data. The internal message queue can handle a maximum message size of 1MB; this setting helps keep messages below that size. Default: 0
  • sample_rate: The sampling rate of Hydrolix's ingest tier for this table: only the given fraction of incoming data is ingested. For example, a value of 0.4 keeps 40% of your data and discards the rest. A value of 0 or less keeps all data, as do values of 1 or greater. Default: 1
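
The clamping behaviour of sample_rate can be summarized in a short Python sketch (illustrative only, not the ingest tier's actual code):

import random

def keep_event(sample_rate: float) -> bool:
    """Mirror the sample_rate rules above (illustrative only)."""
    # Values of 0 or less, and values of 1 or greater, keep all data.
    if sample_rate <= 0 or sample_rate >= 1:
        return True
    # Otherwise keep roughly sample_rate of incoming events.
    return random.random() < sample_rate

# With sample_rate = 0.4, about 40% of events are kept.
kept = sum(keep_event(0.4) for _ in range(10_000))
print(kept / 10_000)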

Configuring Stream Authentication Settings

By default, the traefik service uses basic authentication, but Hydrolix also supports token-based authentication for tables. When enabled, API calls require a token to be passed in the query string or an HTTP header, and Hydrolix only performs the operation if the token matches one of the tokens defined for the table.

To enable token-based authentication, set the following values in the stream field of your table settings:

  • Set token_auth_enabled to true.
  • Set token_list to an array of tokens. This allows for easy token rotation as well as using multiple tokens at the same time.

The following snippet shows a full example of a stream authentication configuration for a table:

"settings": {
  ...
  "stream": {
    "token_auth_enabled": true,
    "token_list": ['token1', 'token2']
  }
  ...
}
  • token_auth_enabled: Boolean indicating whether token-based authentication is enabled for the table. Default: false
  • token_list: An array of tokens. If a token is provided in the query string (?token=token1) or in the x-hdx-token HTTP header and it matches a token in token_list, the operation is authorized. Default: []
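
For example, once token authentication is enabled, each streaming ingest request must carry one of the configured tokens. The sketch below uses Python's requests library; the host, table, and transform names are placeholders, and the /ingest/event path and the x-hdx-table / x-hdx-transform headers are assumptions here rather than values documented on this page:

import requests

# Placeholder values; substitute your own host, table, transform, and token.
url = "https://myhost/ingest/event"               # HTTP Stream API endpoint (assumed)
headers = {
    "content-type": "application/json",
    "x-hdx-table": "my_project.mytable",          # target table (assumed header)
    "x-hdx-transform": "my_transform",            # transform to apply (assumed header)
    "x-hdx-token": "token1",                      # must match an entry in token_list
}
event = {"timestamp": "2023-01-01T00:00:00Z", "message": "hello"}

resp = requests.post(url, json=event, headers=headers)
resp.raise_for_status()

# The token can also be passed in the query string instead of the header:
# requests.post(url + "?token=token1", json=event, headers=headers)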

Configuring via the API

To configure a table via the API, use the Tables API endpoints.

The following is an example:

POST http://myhost/config/v1/orgs/my_org_uuid/projects/my_project_uuid/tables/

{
      "project": "{{project_uuid}}",
      "name": "mytable",
      "description": "An example table",
      "settings": {
         "stream": {
            "hot_data_max_age_minutes": 15,
            "hot_data_max_active_partitions": 4,
            "hot_data_max_rows_per_partition": 1000000,
            "hot_data_max_minutes_per_partition": 5,
            "hot_data_max_open_seconds": 60,
            "hot_data_max_idle_seconds": 30,
            "cold_data_max_age_days": 365,
            "cold_data_max_active_partitions": 5,
            "cold_data_max_rows_per_partition": 1000000,
            "cold_data_max_minutes_per_partition": 15,
            "cold_data_max_open_seconds": 60,
            "cold_data_max_idle_seconds": 30
         }
      }
}
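
The same request can be issued programmatically. The sketch below uses Python's requests library; the host, UUIDs, and token are placeholders, and the authentication scheme (a bearer token in the Authorization header) is an assumption here, not something documented on this page:

import requests

base = "http://myhost/config/v1"                   # placeholder host
org, project = "my_org_uuid", "my_project_uuid"    # placeholder UUIDs

table = {
    "project": project,
    "name": "mytable",
    "description": "An example table",
    "settings": {
        "stream": {
            "hot_data_max_age_minutes": 15,
            "cold_data_max_age_days": 365,
        }
    },
}

resp = requests.post(
    f"{base}/orgs/{org}/projects/{project}/tables/",
    json=table,
    headers={"Authorization": "Bearer <access token>"},  # assumed auth scheme
)
resp.raise_for_status()
print(resp.json())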

Interdependence with Merge

The volume of partitions written directly by the stream affects the merge service, giving it more or fewer partitions to merge later. See the Merge documentation for more information.

If you are migrating a large amount of data, the higher partition count during the initial load may require more merge servers, larger merge servers, or both. When the initial ingest stream creates a large number of partitions, make sure enough merge peers are deployed to merge the resulting data.