The Hydrolix Data Platform
The Challenge
As technologies, products, and services have progressed, so has the need to collect more data, in more detail, for longer periods. As a result, the traditional enterprise data footprint has ballooned from millions of rows to many billions. Even trillion-row datasets have become commonplace.
This explosion of data has left traditional data warehouses and databases behind. Many are built on decades-old technology and fail to take recent advancements into account. Simple mitigations such as selective indexing, clustering, partitioning, and aggregation were once sufficient to manage big datasets, but they no longer are. Companies now resort to an expensive blend of tradeoffs and architectural complexity to meet demand.
The tradeoffs in the industry are well known and have led to expectations that include:
- A short data history – Historical data has value; how much data can be kept for what period? The tradeoff of 'how long to keep data to manage data footprint' forces enterprises to keep data for short periods of time. This damages their ability to analyze behavioral history.
- Aggregated and diluted data, devoid of the detail – Data value is often buried in the details. The tradeoff of 'throwing away data by sampling or aggregating' reduces data footprint while preserving some view of the past. Choosing what to throw away and roll up is a complex task; businesses change over time, as do the demands of the data. Will yesterday's roll-up be useful today? How much can you sample without losing valuable data? Do you need raw data to build models?
- Inflexible data structures – Data structures evolve over time. Not only in row count, but also in width. Tags change over time. So too does the need to search those tags quickly and efficiently. The tradeoff of data structure design involves many complex decisions. Which columns to index? Which tags to add or remove? Design decisions can easily lead to inflexible data structures with slow adaptation to requirements.
- Not-at-all real-time data – Data is often required quickly to make effective decisions. But with requirements including loading, enriching, and transforming high-volume datasets, pipelines have become increasingly complex. Long daisy-chains of services add latency at every step, making data slow to ingest and slow to view. The tradeoff of 'how fast is fast enough' is a continuous battle.
- Strict, expensive, inflexible architectures – Data ebbs and flows, so systems must gracefully handle the highest peaks and the lowest troughs. This leads to a tradeoff of capacity: do you over-provision resources, incurring extra cost during troughs? Or do you under-provision, suffering periods of dropped data or slow queries? Tightly coupled data pipelines and components often force a choice.
Hydrolix Overview
Hydrolix solves these challenges.
Hydrolix is a high-performance, petabyte-scale, append-only, time-series database platform that is dedicated to changing the expectations and economics of acquisition, storage, and performance of massive datasets.
Hydrolix combines commodity cloud compute with commodity storage such as Google Cloud Storage, Amazon S3, and Azure Blob Storage, delivering SSD-like performance across massive datasets in a distributed systems environment.
By separating compute from storage, Hydrolix achieves workload independence. Ingest, query, lifecycle, and other services can each be auto-scaled separately without impacting one another. All data is located in high-performance, high-capacity object stores, where every workload can access a common set of data without the need for data duplication.
Traditional databases assume that distributed object storage is slow due to network latency, so it is always fastest to work with local storage or data cached in memory. This limits the volume of data such platforms can store and query. Hydrolix takes an alternative approach. By designing specifically for the strengths of object storage, using advanced indexing, request, and compression techniques, Hydrolix has developed a new patented storage format and data retrieval engine that provides access at SSD-like speeds across the whole data footprint. Hot and cold data are treated equally: data stored years in the past has the same fast performance as data stored just ten minutes ago. This opens up data archives to new opportunities while still making the data of now() accessible.
While processing data, massive parallel compute is utilized against the object store, breaking data analysis tasks down into quickly executed sub-tasks. The final result is reconstituted by recombining these sub-tasks, meaning vast volumes of data can be scanned and analyzed quickly. Performance can be determined by “how much” compute capacity a job should have applied to it. On completion of tasks, this compute can be thrown away as no residual data is required “on-box.” Compute is treated as stateless, allowing fast recovery from failure and improved resilience of components.
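The split-and-recombine pattern described above can be illustrated with a minimal Python sketch. It is not Hydrolix code: the function names, the in-memory `partitions` list (standing in for data held in object storage), and the `status`/`bytes` fields are all assumptions made for illustration. Each sub-task computes a partial aggregate over one partition, and the caller chooses how much compute to apply through `workers`.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(rows):
    """Sub-task: partial count and byte total for rows with status 500."""
    matching = [r["bytes"] for r in rows if r["status"] == 500]
    return len(matching), sum(matching)


def parallel_scan(partitions, workers):
    """Fan sub-tasks out to `workers` threads, then merge the partials."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(scan_partition, partitions))
    # Recombine: the workers keep no state, so merging the partial
    # results is all that is needed to produce the final answer.
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total


# Hypothetical data: three partitions of request-log rows.
partitions = [
    [{"status": 200, "bytes": 10}, {"status": 500, "bytes": 20}],
    [{"status": 500, "bytes": 30}],
    [{"status": 404, "bytes": 40}],
]
result = parallel_scan(partitions, workers=3)
```

Changing `workers` changes only how many sub-tasks run at once, not the result, which is the property that lets compute be sized per job and discarded afterwards.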
Hydrolix is a polyglot for ingesting and querying data. Data can be ingested using a variety of interfaces and methodologies: Kafka, Kinesis, HTTP Streaming, and Batch Loading are all supported. Transformation, enrichment, and summarization can all occur at time of ingest without additional latency or the loss of raw data. Indexing is applied across all columns by default without bloating the storage footprint. Schema flexibility is supplied through the separation of write and read schemas, so additional columns can be added quickly and easily.
Queries are served through JDBC, an HTTP API, and native interfaces, all supporting ANSI-compliant SQL. Hydrolix can also be plugged into Spark-based environments, extending the range of analysis available for data tasks.
Why Hydrolix?
- Massive improvements in the economics of data. 75% cost reductions are common amongst Hydrolix customers.
- Exceptional "needle in a haystack" query performance. Per-column indexes enable Hydrolix query servers to selectively read relevant byte-ranges within each column, avoiding the kind of brute-force full table and full column scans associated with serverless databases and data lake technology. This minimizes the amount of data transferred on each query, reducing costs and caching dependency.
- Sandbox isolation. Hydrolix avoids "noisy neighbor" problems with per-team compute resources. Configuration of query pools isolates workloads from one another while sharing only a single copy of the data, allowing data consistency across teams.
- True real-time ingest, enrichment, and summarization. No long pipelines that cause slow-to-view ingest or inconsistencies between raw and summarized data.
- Workload independence. Ingest, Query and Lifecycle infrastructure can scale independently and dynamically resize without downtime. All services within Hydrolix are optimized for stateless operations and rely on cloud object storage as a centralized "source of truth".
- Flexible Massive Parallel Processing (FMPP). Data is not statically partitioned, so customers can decide on a per-query basis how massive they want each query to be without the time (and expense) of shard re-balancing.
- Reduced storage and indexing costs. Hydrolix fully indexes every dimension of data by default and still reduces storage costs by up to 95% (on average, 55GB per 1TB of raw data). With Hydrolix, there is no need to choose which columns to index.
- Improved reliability and data durability. Stateless computing means that data is stored solely in decoupled object storage. Individual resources can therefore be scaled down, upgraded, or destroyed at any time without affecting data integrity.
- Polyglot ingest and query interfaces. ANSI-compliant SQL, Spark, HTTP API, and JDBC interfaces are all available. Plug in your favorite tool to access, review, and analyze data. Ingest data with HTTP Streams, Kafka, and Kinesis in a number of formats.
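The "needle in a haystack" point above, reading only the relevant byte ranges of a column instead of scanning it, can be sketched in Python. This is a toy illustration, not the patented Hydrolix storage format: `build_column`, `ranged_read`, and `lookup` are hypothetical names, the column is a plain in-memory byte string, and `ranged_read` stands in for a ranged request against an object store.

```python
from collections import defaultdict


def build_column(values):
    """Serialize values into one blob and index each value's byte ranges."""
    blob = bytearray()
    index = defaultdict(list)
    for row, value in enumerate(values):
        encoded = value.encode("utf-8")
        # Record (row number, byte offset, byte length) for this value.
        index[value].append((row, len(blob), len(encoded)))
        blob += encoded
    return bytes(blob), dict(index)


def ranged_read(blob, offset, length):
    """Stand-in for a byte-range request against object storage."""
    return blob[offset:offset + length]


def lookup(blob, index, needle):
    """Return matching row numbers and the number of bytes actually read."""
    hits = index.get(needle, [])
    rows = []
    bytes_read = 0
    for row, offset, length in hits:
        chunk = ranged_read(blob, offset, length)
        bytes_read += len(chunk)
        if chunk.decode("utf-8") == needle:
            rows.append(row)
    return rows, bytes_read


# Hypothetical column of HTTP methods.
blob, index = build_column(["GET", "POST", "GET", "DELETE"])
rows, bytes_read = lookup(blob, index, "DELETE")
```

In this sketch, finding the single `DELETE` row touches 6 of the column's 16 bytes. A real index would be far more compact than a per-value dictionary, but the effect on data transferred per query is the same in kind.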
Services
The following services make up the Hydrolix platform: