Catalog Metadata
Every table makes metadata about itself available through special virtual tables that you can query through Hydrolix's SQL API or query UI. These include catalogs of the files and directories that hold tables' data and indexes, and descriptions of the columns and time-intervals held by each table's constituent partitions.
Catalog
A catalog virtual table contains one row for every partition used by its associated fluid table. The row's data include that partition's filesystem path, the span of time-series data it contains, and the amount of storage space it takes up.
Catalog columns
In the following table, "this partition" means the particular partition of fluid table data that each catalog row describes.
Name | Purpose |
---|---|
partition | This partition's filesystem path, relative to its table's cloud-storage location. |
min_timestamp | The earliest timestamp of the contiguous time-series data held by this partition. |
max_timestamp | The latest timestamp of the contiguous time-series data held by this partition. |
manifest_size | The size of this partition's manifest file, in bytes. |
data_size | The size of this partition's data file, in bytes. |
index_size | The size of this partition's index file, in bytes. |
Querying a table's catalog
To query a table's catalog, refer to it in a query as TABLE-NAME#.catalog
.
For example, to view the entire catalog of the table my-data
in the project my-project
:
SELECT * FROM `my-project`.`my-data#.catalog`
To view the total size, in bytes, of the table my-data
:
SELECT sum(manifest_size + data_size + index_size) as total_size
FROM `my-project`.`my-data#.catalog`
Useful for Partition Sizing
Catalog access is useful when determining the size of partitions to be stored within your cluster. Each row of data is automatically stored within a partition. To ensure performance, Hydrolix recommends that new partitions should be between 1 GB and 3 GB. Later, one day after ingest, almost all of your partitions should be between 2 GB and 4 GB due to merge cleanup.
A summary query to help visualize this is shown below. It breaks down the last ten days of partitions into categories based on size for a quick view of the distribution of your partition sizes per day:
WITH 1048576 as mb, 1073741824 as gb SELECT toStartOfDay(min_timestamp) as time, count(rows) as partitions, countIf(mem_size < 1 * mb) as "< 1MB", countIf(mem_size > 1 * mb and mem_size < 5 * mb) as "< 5MB", countIf(mem_size > 5 * mb and mem_size < 50 * mb) as " < 50MB", countIf(mem_size > 50 * mb and mem_size < 500 * mb) as " < 500MB", countIf(mem_size > 500 * mb and mem_size < gb) as " < 1 GB", countIf(mem_size > gb and mem_size < 2 * gb) as " < 2 GB", countIf(mem_size > 2 * gb and mem_size < 3 * gb) as " < 3 GB", countIf(mem_size > 3 * gb and mem_size < 4 * gb) as " < 4 GB" FROM project."table#.catalog" where min_timestamp >= now() - INTERVAL 10 DAY GROUP BY time ORDER BY time DESC
## Metadata
A metadata virtual table contains several rows for every partition used by its associated fluid table, with one row for every column handled by that partition. In other words, every row in this virtual table represents one partition-and-column tuple from the files that hold your table's data.
Because a Hydrolix table can have multiple [transform schemas](/docs/transforms-and-write-schema) associated with it, the partitions of a given table might contain different lists of columns from one another. Querying a table's metadata lets you see exactly which columns of data each of its partitions handle, and how they handle them.
### Metadata columns
In the following table, "this partition" means the particular partition of fluid table data that each metadata row describes, and "this column" means the specific column within that partition that the metadata row describes.
| Name | Purpose |
| ----------------------- | --------------------------------------------------------------------------------- |
| partition | This partition's filesystem path, relative to its table's cloud-storage location. |
| min_timestamp | The earliest timestamp of the contiguous time-series data held by this partition. |
| max_timestamp | The latest timestamp of the contiguous time-series data held by this partition. |
| column_name | The name of this column. |
| column_index | `1` if Hydrolix treats this column's data as indexible; `0` otherwise. |
| column_type | |
| column_unique_values | |
| column_order | |
| dict_compressed_bytes | |
| dict_uncompressed_bytes | |
### Querying a table's metadata
To query a table's metadata, refer to it in a query as `TABLE-NAME#.metadata`.
For example, to view all the metadata of the table `my-data` in the project `my-project`:
```sql
SELECT * FROM `my-project`.`my-data#.metadata`
And this query would list all the columns that a given table has ever handled, across all its partitions, including the earliest and latest times each was used:
SELECT column_name, min(min_timestamp) AS first_appearance, max(max_timestamp) AS last_appearance
FROM `my-project`.`my-data#.metadata`
GROUP BY column_name
ORDER BY first_appearance, column_name
Updated about 2 months ago