Streaming(HTTP): Table Configuration

Streaming Ingest API

When a Streaming method (HTTP API or Kafka) is used to ingest data into a table, Hydrolix provides several customisations that control how and when data is written to storage for querying. These customisations exist because streaming event data often needs its near-term portion to be accessible quickly, while some of the incoming data arrives out of order or late. CDN logs are a good example: often 95% of logs are delivered within 15 minutes and the last 5% within a 24-hour period.

With Streaming Ingest, the following terms are used to describe the data coming into the system:

  • "Hot" data is near-term or recent data.
  • "Cold" data is data that is received later than a specified window.

An event is classified as Hot or Cold based on the primary datetime value in the incoming event. Hot data is an event whose primary datetime falls within hot_data_max_age_minutes of now. Cold data is late-arriving event data that is older than hot_data_max_age_minutes but still within cold_data_max_age_days. Data older than cold_data_max_age_days is rejected and ignored.
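The classification rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not Hydrolix code: the function name classify_event is hypothetical, and the treatment of boundary values and of timestamps in the future is an assumption.

```python
from datetime import datetime, timedelta, timezone


def classify_event(event_time, now,
                   hot_data_max_age_minutes=15,
                   cold_data_max_age_days=365):
    """Return "hot", "cold", or "rejected" for an event's primary datetime.

    Hypothetical helper. Boundaries are treated as inclusive, and events
    with a future timestamp fall through to "hot"; both are assumptions.
    """
    age = now - event_time
    if age <= timedelta(minutes=hot_data_max_age_minutes):
        return "hot"
    if age <= timedelta(days=cold_data_max_age_days):
        return "cold"
    return "rejected"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(classify_event(now - timedelta(minutes=5), now))   # hot
print(classify_event(now - timedelta(hours=6), now))     # cold
print(classify_event(now - timedelta(days=400), now))    # rejected
```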

Settings for Hot and Cold data can be independently configured and cover a variety of "tunables" that determine how often event partitions are flushed to storage and how many partitions are being worked on at any one time.

Example configuration settings:

"settings": {
    "stream": {
        "hot_data_max_age_minutes": 15,
        "hot_data_max_active_partitions": 4,
        "hot_data_max_rows_per_partition": 1000000,
        "hot_data_max_minutes_per_partition": 5,
        "hot_data_max_open_seconds": 60,
        "hot_data_max_idle_seconds": 30,
        "cold_data_max_age_days": 365,
        "cold_data_max_active_partitions": 5,
        "cold_data_max_rows_per_partition": 1000000,
        "cold_data_max_minutes_per_partition": 15,
        "cold_data_max_open_seconds": 60,
        "cold_data_max_idle_seconds": 30
    }
}
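Before submitting settings like the example above, it can help to sanity-check them locally. The sketch below is a hypothetical pre-flight check, not part of the Hydrolix API: the helper check_stream_settings and its two rules (every value is a positive integer, and the hot window is shorter than the cold window) are assumptions drawn from the definitions on this page.

```python
import json

# The example settings from this page, wrapped in braces to form valid JSON.
STREAM_SETTINGS_JSON = """
{
  "settings": {
    "stream": {
      "hot_data_max_age_minutes": 15,
      "hot_data_max_active_partitions": 4,
      "hot_data_max_rows_per_partition": 1000000,
      "hot_data_max_minutes_per_partition": 5,
      "hot_data_max_open_seconds": 60,
      "hot_data_max_idle_seconds": 30,
      "cold_data_max_age_days": 365,
      "cold_data_max_active_partitions": 5,
      "cold_data_max_rows_per_partition": 1000000,
      "cold_data_max_minutes_per_partition": 15,
      "cold_data_max_open_seconds": 60,
      "cold_data_max_idle_seconds": 30
    }
  }
}
"""


def check_stream_settings(stream):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for key, value in stream.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key} should be a positive integer, got {value!r}")
    hot_minutes = stream.get("hot_data_max_age_minutes")
    cold_days = stream.get("cold_data_max_age_days")
    if isinstance(hot_minutes, int) and isinstance(cold_days, int):
        # Cold data lies beyond the hot window, so the cold window must be longer.
        if hot_minutes >= cold_days * 24 * 60:
            problems.append("hot_data_max_age_minutes should be shorter than "
                            "cold_data_max_age_days")
    return problems


stream = json.loads(STREAM_SETTINGS_JSON)["settings"]["stream"]
print(check_stream_settings(stream))
```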

Hot Data.

Element | Description | Default
hot_data_max_age_minutes | The maximum age, in minutes back from now, at which data is still considered Hot. The primary (datetime) column of each incoming event is inspected, and events older than this are considered too old to be "hot" | 15
hot_data_max_minutes_per_partition | The maximum width in time (in minutes) of a partition. This is the maximum allowable distance between the newest and oldest primary datetime values of the rows in the partition | 5
hot_data_max_active_partitions | The maximum number of partitions to keep open on the server at any one time | 4
hot_data_max_rows_per_partition | The maximum size (measured in number of rows) that any open partition is allowed to reach | 1000000
hot_data_max_open_seconds | The maximum duration (in wall clock time) to wait for events to trickle in for a recent-data partition | 60
hot_data_max_idle_seconds | The maximum duration to wait from the last received event before automatically closing an open partition | 30
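One way to read the row, open-time, and idle-time limits together is as three independent conditions, any of which closes an open partition. The sketch below illustrates that reading with the hot-data defaults. It is an assumption about how the tunables combine, not a description of the Hydrolix implementation; OpenPartition and close_reason are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class OpenPartition:
    rows: int             # rows buffered so far
    opened_at: float      # wall-clock seconds when the partition was opened
    last_event_at: float  # wall-clock seconds when the last event arrived


def close_reason(partition, now,
                 max_rows_per_partition=1_000_000,
                 max_open_seconds=60,
                 max_idle_seconds=30):
    """Return the name of the limit that was reached, or None to keep waiting."""
    if partition.rows >= max_rows_per_partition:
        return "max_rows_per_partition"
    if now - partition.opened_at >= max_open_seconds:
        return "max_open_seconds"
    if now - partition.last_event_at >= max_idle_seconds:
        return "max_idle_seconds"
    return None


# Opened 40 s ago, last event 35 s ago: the idle limit is reached first.
p = OpenPartition(rows=1200, opened_at=100.0, last_event_at=105.0)
print(close_reason(p, now=140.0))  # max_idle_seconds
```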

Cold Data.

Element | Description | Default
cold_data_max_age_days | The maximum age, in days back from now, at which data is still accepted as Cold. The primary (datetime) column of each incoming event is inspected, and events older than this are considered too old to be worth indexing at all and are rejected | 365
cold_data_max_active_partitions | The maximum number of partitions to keep open at any one time | 5
cold_data_max_rows_per_partition | The maximum size (measured in number of rows) that any open partition is allowed to reach | 1000000
cold_data_max_minutes_per_partition | The maximum width in time (in minutes) of a partition. This is the maximum allowable distance between the newest and oldest primary datetime values of the rows in the partition | 15
cold_data_max_open_seconds | The maximum duration (in wall clock time) to wait for events to trickle in for a cold-data partition | 60
cold_data_max_idle_seconds | The maximum duration to wait for new data to appear at all before automatically closing an open partition | 30
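The partition width limit can be pictured as a check on the time span a partition would cover if a new event were added to it. The following sketch uses the cold-data default of 15 minutes. The function fits_partition is hypothetical, and how Hydrolix actually assigns events to partitions is not described on this page.

```python
from datetime import datetime, timedelta


def fits_partition(oldest, newest, event_time, max_minutes_per_partition=15):
    """Return True if adding event_time keeps the partition within its width limit.

    oldest and newest are the current minimum and maximum primary datetime
    values of the rows already in the partition.
    """
    new_oldest = min(oldest, event_time)
    new_newest = max(newest, event_time)
    return new_newest - new_oldest <= timedelta(minutes=max_minutes_per_partition)


oldest = datetime(2024, 1, 1, 9, 0)
newest = datetime(2024, 1, 1, 9, 10)
print(fits_partition(oldest, newest, datetime(2024, 1, 1, 9, 14)))  # True
print(fits_partition(oldest, newest, datetime(2024, 1, 1, 9, 20)))  # False
```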

Deprecated Elements

The following elements may still be found within the API, but they are deprecated and no longer affect the Stream.

Element
max_minutes_per_partition
max_rows_per_partition

Interdependence with Merge.

More information on merge can be found in the merge documentation. Note that the volume of partitions written directly by the Stream affects the merge service.
Where a higher number of partitions is created during the initial ingest stream, it is important to ensure that a sufficient number of merge peers is available to optimize the storage of the resultant data. A higher partition count in the initial loading process may require a higher count (or size) of merge servers.

