Stream Settings
When using a Streaming method (HTTP Stream API, Kafka, AWS Kinesis) to ingest data, Hydrolix provides several customisations for how and when data is written into the table, as well as for stream authentication.
Configuring How and When Data is Written
Hydrolix provides several customisations for how and when streamed data is written into the table. These exist because streaming event data often mixes "hot" data that needs to be made available as soon as possible with "cold" data that arrives late or out of order. CDN or appliance logs are a good example: 95% of logs typically arrive within a 15-minute window, with the remaining 5% supplied over the following 24 hours.
Settings for Hot and Cold data can be independently configured and cover a variety of "tunables" that determine how often event partitions are flushed to storage and how many partitions are being worked on at any one time.
Defaults
It is recommended to start with the default settings before changing anything; they have been developed to meet the majority of use cases.
Configuring Data Write Settings
The settings for how data is written to the table are configured either through the portal, under the Sources menu, or via the RESTful Tables API.
To effectively manage incoming stream data, Hydrolix Streaming has configurable options for Hot and Cold data.
- Hot data is near-term data, defined as an event received within the `hot_data_max_age_minutes` window from `now()`.
- Cold data is late-arriving event data that is older than the `hot_data_max_age_minutes` window but within `cold_data_max_age_days`.
- Data older than `cold_data_max_age_days` is rejected and ignored (see the sketch below).
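As a rough illustration of these windows, the sketch below classifies an event's primary timestamp as hot, cold, or rejected. It is not Hydrolix code; the 15-minute and 365-day values are example settings and the function is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Example values for the two window settings described above.
HOT_MAX_AGE = timedelta(minutes=15)    # hot_data_max_age_minutes
COLD_MAX_AGE = timedelta(days=365)     # cold_data_max_age_days

def classify_event(primary: datetime) -> str:
    """Illustrative only: mirrors the hot/cold window rules described above."""
    age = datetime.now(timezone.utc) - primary
    if age <= HOT_MAX_AGE:
        return "hot"       # handled with the hot_data_* settings
    if age <= COLD_MAX_AGE:
        return "cold"      # handled with the cold_data_* settings
    return "rejected"      # too old to index; sent to Rejects

now = datetime.now(timezone.utc)
print(classify_event(now - timedelta(minutes=5)))   # hot
print(classify_event(now - timedelta(hours=6)))     # cold
print(classify_event(now - timedelta(days=400)))    # rejected
```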
Hot Settings
The following settings are provided to define Hot data and how it is written.
Element | Description | Default |
---|---|---|
hot_data_max_age_minutes | The window, measured back from now(), within which data is considered Hot. Incoming events have their primary (datetime) column inspected; events older than this are considered too old to be "hot". | 3 |
hot_data_max_active_partitions | The maximum number of partitions to keep open on the server at any one time. | 3 |
hot_data_max_rows_per_partition | The maximum size (measured in number of rows) to allow any open partition to reach. | 12288000 |
hot_data_max_minutes_per_partition | The maximum width in time of a partition. This is the maximum allowable distance between the newest and oldest primary (datetime) values of rows in the partition. | 1 |
hot_data_max_open_seconds | The maximum duration (in wall clock time) to wait for events to trickle in for a recent-data partition. | 60 |
hot_data_max_idle_seconds | The maximum duration to wait from the last received event before automatically closing an open partition. | 30 |
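Taken together, these limits bound how large and how long-lived an open hot partition can be. The sketch below is a simplified mental model, not Hydrolix code; it assumes a partition is flushed as soon as any one of the limits above is reached.

```python
from dataclasses import dataclass

# Default hot settings from the table above.
MAX_ROWS = 12_288_000      # hot_data_max_rows_per_partition
MAX_MINUTES = 1            # hot_data_max_minutes_per_partition
MAX_OPEN_SECONDS = 60      # hot_data_max_open_seconds
MAX_IDLE_SECONDS = 30      # hot_data_max_idle_seconds

@dataclass
class OpenPartition:
    rows: int              # rows written so far
    span_minutes: float    # newest primary minus oldest primary
    open_seconds: float    # wall-clock time since the partition opened
    idle_seconds: float    # wall-clock time since the last event arrived

def should_flush(p: OpenPartition) -> bool:
    """Simplified rule: flush once any single hot limit is reached."""
    return (p.rows >= MAX_ROWS
            or p.span_minutes >= MAX_MINUTES
            or p.open_seconds >= MAX_OPEN_SECONDS
            or p.idle_seconds >= MAX_IDLE_SECONDS)

# A small partition that has been idle for 45 seconds is still flushed.
print(should_flush(OpenPartition(rows=10_000, span_minutes=0.5,
                                 open_seconds=50, idle_seconds=45)))  # True
```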
Cold Settings
The following settings are provided to define Cold data and how it is written.
Element | Description | Default |
---|---|---|
cold_data_max_age_days | The outer limit of the Cold window, measured back from now(). Data older than hot_data_max_age_minutes but within this limit is treated as Cold. Incoming events have their primary (datetime) column inspected; events older than this limit are considered too old to be worth indexing at all and are consigned to the scrap heap of history (well, Rejects). | 365 |
cold_data_max_active_partitions | The maximum number of partitions to keep open at any one time. | 50 |
cold_data_max_rows_per_partition | The maximum size (measured in number of rows) to allow any open partition to reach. | 12288000 |
cold_data_max_minutes_per_partition | The maximum width in time of a partition. This is the maximum allowable distance between the newest and oldest primary (datetime) values of rows in the partition. | 60 |
cold_data_max_open_seconds | The maximum duration (in wall clock time) to wait for events to trickle in for a late-data partition. | 300 |
cold_data_max_idle_seconds | The maximum duration to wait for new data to appear at all before automatically closing an open partition. | 60 |
Configuring Stream Authentication Settings
By default, basic authentication is done by the Traefik service, but Hydrolix also supports token-based authentication for tables. When token-based authentication is enabled, API function calls require a token to be passed in the query string or the HTTP header. If the token matches one of the tokens on the table, the operation is performed.
To enable token-based authentication, set token_auth_enabled to true in the table's stream settings and set token_list to an array of tokens. Using an array of tokens allows for easy token rotation as well as using multiple tokens at the same time.
"settings": {
"stream": {
"token_auth_enabled": true,
"token_list": ['token1', 'token2']
}
Element | Description | Default |
---|---|---|
token_auth_enabled | Boolean indicating if token-based authentication is enabled for the table. | false |
token_list | An array of tokens. If a token is provided in the query string (?token=token1) or in the HTTP header (x-hdx-token) and it matches a token in the token_list, then the operation is authorized. | [] |
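Once token authentication is enabled, ingest calls must carry one of the listed tokens, either in the query string or in the x-hdx-token header as described above. The sketch below uses Python's requests library; the ingest URL, table name, and token values are placeholders for your own deployment.

```python
import requests

INGEST_URL = "https://myhost/ingest/event"   # placeholder ingest endpoint
EVENT = '{"timestamp": "2023-01-01T00:00:00Z", "message": "example"}'
HEADERS = {
    "content-type": "application/json",
    "x-hdx-table": "my_project.mytable",     # placeholder project.table
}

# Option 1: pass the token in the query string.
requests.post(INGEST_URL, params={"token": "token1"}, headers=HEADERS, data=EVENT)

# Option 2: pass the token in the x-hdx-token header.
requests.post(INGEST_URL, headers={**HEADERS, "x-hdx-token": "token2"}, data=EVENT)
```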
Configuring via the API
To configure the table via the API, use the Tables API endpoints.
The following is an example:
POST http://myhost/config/v1/orgs/my_org_uuid/projects/my_project_uuid/tables/
{
"project": "{{project_uuid}}",
"name": "mytable",
"description": "An example table",
"settings": {
"stream": {
"hot_data_max_age_minutes": 15,
"hot_data_max_active_partitions": 4,
"hot_data_max_rows_per_partition": 1000000,
"hot_data_max_minutes_per_partition": 5,
"hot_data_max_open_seconds": 60,
"hot_data_max_idle_seconds": 30,
"cold_data_max_age_days": 365,
"cold_data_max_active_partitions": 5,
"cold_data_max_rows_per_partition": 1000000,
"cold_data_max_minutes_per_partition": 15,
"cold_data_max_open_seconds": 60,
"cold_data_max_idle_seconds": 30
}
}
}
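As a sketch of issuing the same request programmatically, the example below uses Python's requests library. The host, org, and project UUIDs are the placeholders from the example above; the bearer token header is an assumption, so use whatever authentication your deployment's config API requires.

```python
import requests

BASE = "http://myhost/config/v1"
ORG = "my_org_uuid"          # placeholder org UUID
PROJECT = "my_project_uuid"  # placeholder project UUID

table = {
    "project": PROJECT,
    "name": "mytable",
    "description": "An example table",
    "settings": {
        "stream": {
            "hot_data_max_age_minutes": 15,
            "cold_data_max_age_days": 365,
        }
    },
}

resp = requests.post(
    f"{BASE}/orgs/{ORG}/projects/{PROJECT}/tables/",
    json=table,
    headers={"Authorization": "Bearer <access_token>"},  # assumed auth scheme
)
resp.raise_for_status()
print(resp.json())
```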
Interdependence with Merge
More information on merge can be found here. Note that the volume of partitions written directly by the stream affects the merge service: more partitions written at ingest means more partitions to merge later.
Where a higher number of partitions is being created during the initial ingest stream, it is important to ensure a sufficient number of merge peers are available to merge the resultant data. A higher partition count in the initial loading process may require a higher count (or size) of merge servers.