Regexp Tree Dictionaries
Reduce high-cardinality, unstructured field data into more comprehensible categories by using regexp tree dictionaries.
Unstructured field data is difficult to search with exact-key lookups. Regexp tree dictionaries help categorize your data for better and faster query results.
A regexp tree dictionary matches an input string against a set of regular expressions. It then returns attributes from the patterns that match. Doing this during ingest makes your data ready to query right away.
A common use is to categorize User-Agent strings during ingest. This page demonstrates that with an example. For a fully featured dictionary, see this dictionary file in the Hydrolix Examples GitHub repository.
If you only need exact-key lookups, use the hashed or complex_key_hashed layouts described on the Custom Dictionaries page.
How regexp tree dictionaries work
A regexp tree dictionary stores patterns as a tree of nodes, where each node must carry a regexp key and one or more attributes. To look up a value, Hydrolix:
- Matches the input string against the regexp nodes in the order they appear in the file.
- Selects the first node that matches.
- Returns that node's attributes, including any inherited from parent nodes.
Child nodes inherit attributes from their parents unless they set their own. Regexp patterns can use capture groups, and attribute values can reference those groups with back-references such as \1.
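The lookup steps above can be modeled in a few lines of Python. This is a simplified sketch of the behavior this page describes, not the engine's implementation. The Node class, the lookup function, the sample patterns, and the attribute names site, kind, and version are all hypothetical, and the sketch assumes unanchored matching and that a child is only tested after its parent matches.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical regexp tree."""
    regexp: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def lookup(nodes, text, inherited=None):
    """Return the merged attributes of the first matching node, or None."""
    inherited = inherited or {}
    for node in nodes:  # nodes are tried in file order
        m = re.search(node.regexp, text)
        if m is None:
            continue
        # Expand back-references such as \1 against this node's own match.
        own = {name: m.expand(value) for name, value in node.attrs.items()}
        # A child's own attributes override the ones inherited from its parent.
        merged = {**inherited, **own}
        deeper = lookup(node.children, text, merged)
        return deeper if deeper is not None else merged
    return None


# Hypothetical tree: one parent with one child, then a catch-all node.
TREE = [
    Node(r"example\.com", {"site": "example.com", "kind": "web"}, [
        Node(r"^api\.example\.com/(v\d+)", {"kind": "api", "version": r"\1"}),
    ]),
    Node(r".*", {"site": "other", "kind": "unknown"}),
]

print(lookup(TREE, "api.example.com/v2/users"))
# {'site': 'example.com', 'kind': 'api', 'version': 'v2'}
```

The child overrides kind, fills version from its capture group, and inherits site from the parent. An input that matches only the parent, such as www.example.com/, resolves to the parent's attributes alone.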
Prerequisites
Make sure you have these RBAC permissions to perform operations on dictionaries:
- add_dictionary
- change_dictionary
- delete_dictionary
- view_dictionary
- dictGet_sql
To modify dictionary files, you also need these permissions:
- add_dictionaryfile
- change_dictionaryfile
- delete_dictionaryfile
- view_dictionaryfile
For more information, see Account Permissions (RBAC).
This page assumes you're comfortable with YAML syntax and regular expressions.
Create a regexp tree dictionary
Creating a regexp tree dictionary takes three steps:
- Write the pattern file
- Upload the pattern file
- Define and create the dictionary
Write the pattern file
Define the regexp tree in a YAML file. This pattern file is separate from the dictionary schema you define later. Each node needs a regexp key containing the pattern to match, plus any attributes you want to return. Nest child nodes to inherit attributes from a parent, and reference capture groups with back-references such as \1.
In this example, originTypes contains two child nodes. Hydrolix uses the nesting structure, not the arbitrary originTypes key name.
An input origin.shop.example.com/some-path matches the 'origin\.shop\.example\.com(\/.*)?' regular expression. That node sets originType, version, and path, and inherits host from its parent, so it resolves to these attributes:
A lookup reads these attributes one at a time. The Test and confirm your dictionary section shows how dictGet and dictGetAll return them.
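As a minimal sketch of a pattern file with the shape described above, the following Python snippet holds the YAML text and writes it to disk. The parent pattern, the second child, and every attribute value are hypothetical; only the origin pattern and the attribute names host, originType, version, and path come from this page. The write_pattern_file helper is also hypothetical.

```python
from pathlib import Path

# Hypothetical pattern file. The parent sets host; the two children nested
# under the arbitrary originTypes key set originType, version, and path.
PATTERN_YAML = r"""- regexp: 'shop\.example\.com'
  host: 'shop.example.com'
  originTypes:
    - regexp: 'origin\.shop\.example\.com(\/.*)?'
      originType: 'origin'
      version: '1'
      path: '\1'
    - regexp: 'cdn\.shop\.example\.com(\/.*)?'
      originType: 'cdn'
      version: '1'
      path: '\1'
"""


def write_pattern_file(path):
    """Write the sketch to disk so it can be uploaded as a data file."""
    target = Path(path)
    target.write_text(PATTERN_YAML, encoding="utf-8")
    return target


# Usage (file name is an example): write_pattern_file("my_regexp_dict_v1.yaml")
```

The single-quoted YAML strings keep backslashes literal, so the patterns and the \1 back-reference reach the dictionary unchanged.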
Upload the pattern file
Upload the YAML pattern file to Hydrolix using the Upload the dictionary file endpoint, or in the UI:
- Select + Add new > Data file.
- Select the project.
- Enter a file name, such as my_regexp_dict_v1.
- Browse to the pattern file and select it.
- Select Upload.
With the API, post the file as multipart form data. The name form field is the filename you reference when you create the dictionary.
[Code example: Upload the Dictionary File]
Define and create the dictionary
Create the dictionary definition with the Create a dictionary API method. Alternatively, if you're using the UI, select + Add new > Data dictionary.
Set these fields:
| Field | Value |
|---|---|
| Format | Regexp |
| Layout | regexp_tree |
| Primary Key | regexp |
| Lifetime (seconds) | 5 |
| Filename | The uploaded pattern file |
The primary key must be regexp. Every node in the pattern file holds its pattern under a key named regexp, and the primary key references that column.
The lifetime_seconds value sets how often Hydrolix refreshes the dictionary from the uploaded file. The recommended value is 5 seconds.
The definition also includes a schema that lists the columns the dictionary returns. The schema is JSON, separate from the YAML pattern file. In the UI, enter it in the Schema field. With the API, include it as the output_columns in the dictionary settings. Include regexp and every attribute used in the pattern file. Attribute names must match the pattern file's keys exactly, including casing, or the dictionary returns empty values.
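Because a casing mismatch between the schema and the pattern file produces empty values, it can help to compare the two sets of names before creating the dictionary. The sketch below assumes the pattern file has already been parsed into nested Python lists and dictionaries, and that the schema is reduced to a plain list of column names. The function names and sample data are hypothetical.

```python
def pattern_attribute_names(nodes):
    """Collect every scalar key used in a parsed pattern tree, including regexp."""
    names = set()
    for node in nodes:
        for key, value in node.items():
            if isinstance(value, list):
                # A list holds nested child nodes; its key name is arbitrary.
                names |= pattern_attribute_names(value)
            else:
                names.add(key)
    return names


def schema_mismatches(nodes, schema_columns):
    """Compare pattern-file keys with schema column names, case-sensitively."""
    used = pattern_attribute_names(nodes)
    declared = set(schema_columns)
    return {
        "missing_from_schema": sorted(used - declared),
        "unused_in_pattern_file": sorted(declared - used),
    }


# Hypothetical parsed pattern file and a schema with one casing mistake.
PATTERNS = [{
    "regexp": r"shop\.example\.com",
    "host": "shop.example.com",
    "originTypes": [{
        "regexp": r"origin\.shop\.example\.com(\/.*)?",
        "originType": "origin",
        "version": "1",
        "path": r"\1",
    }],
}]
SCHEMA = ["regexp", "host", "origintype", "version", "path"]

print(schema_mismatches(PATTERNS, SCHEMA))
# {'missing_from_schema': ['originType'], 'unused_in_pattern_file': ['origintype']}
```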
To create the dictionary, select Save in the UI, or post the definition to the Create a dictionary API method. The settings object carries the same fields and schema set in the previous steps.
Enrich data at ingest
The main use of a regexp tree dictionary is to categorize data as it arrives. Add a dictGet call to the sql_transform of a table's transform to derive a new column at ingest time. Queries then read the stored column directly, with no dictionary lookup at query time. For more on SQL transforms, see Data Enrichment.
This transform reads incoming JSON data, creates and populates an in-memory field called agent, and then runs the sql_transform. The sql_transform executes the dictGet call, which reads the agent field and populates the user_agent_category field, which is also listed in the transform's output_columns.
Hydrolix writes timestamp, agent (the original string), and user_agent_category (the derived string) to the table.
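The effect of ingest-time enrichment can be illustrated outside Hydrolix with a short Python model. This is only a sketch of the idea: the two patterns and their category values are hypothetical stand-ins for a real pattern file, and returning an empty string when nothing matches is an assumption.

```python
import re

# Hypothetical, much-reduced stand-in for a user-agent pattern file.
PATTERNS = [
    (re.compile(r"GPTBot"), "ai_crawler"),
    (re.compile(r"Googlebot"), "search_crawler"),
]


def categorize(agent):
    """Return the category of the first matching pattern."""
    for pattern, category in PATTERNS:
        if pattern.search(agent):
            return category
    return ""  # assumption: an unmatched string yields an empty value


def enrich(record):
    """Derive user_agent_category once, as the record is ingested."""
    return {
        "timestamp": record["timestamp"],
        "agent": record["agent"],
        "user_agent_category": categorize(record["agent"]),
    }


row = enrich({
    "timestamp": "2024-01-01T00:00:00Z",
    "agent": "Mozilla/5.0 (compatible; GPTBot/1.0)",
})
print(row["user_agent_category"])  # ai_crawler
```

Every stored row already carries its category, so later queries filter or group on that column without evaluating any regular expression.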
Test and confirm your dictionary
Use the dictGet family of SQL functions to look up values and confirm the dictionary returns what you expect. When you call these functions, join the project and dictionary names with an underscore, such as myproject_my_dict. For a description of each function, see the dictionary functions reference.
The dictGet function returns the value of the attribute you name, taken from the first matching node:
[Code example: Get an Attribute for the First Match]
If the matching node doesn't set that attribute, dictGet returns the attribute type's default, such as an empty string or 0. Use dictGetOrDefault to supply your own fallback.
The dictGetAll function returns attribute values from every matching node in the tree, walking from the child up through its parents:
[Code example: Get Attributes for All Matching Nodes]
The array holds one value for each matching node that sets the attribute, with the child value before the parent value. A node that doesn't set the attribute contributes nothing, so the array can be shorter than the number of matching nodes. The input origin.shop.example.com/some-path matches both the origin child node and its parent, but only the parent sets host:
[Code example: Get Attributes When a Node Omits One]
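The array behavior described above can be modeled in Python. This sketch follows this page's description only: it collects values from the matching child up through its parents and skips nodes that don't set the attribute. The tree layout, the parent pattern, the attribute values, and the parent's originType value are hypothetical.

```python
import re

# Hypothetical tree: each node is (regexp, attributes, children).
TREE = [
    (r"shop\.example\.com",
     {"host": "shop.example.com", "originType": "unknown"},
     [
         (r"origin\.shop\.example\.com(\/.*)?",
          {"originType": "origin", "version": "1", "path": r"\1"},
          []),
     ]),
]


def matching_chain(nodes, text):
    """Return [(match, attrs), ...] from the outermost matching node inward."""
    for regexp, attrs, children in nodes:
        m = re.search(regexp, text)
        if m:
            return [(m, attrs)] + matching_chain(children, text)
    return []


def dict_get_all(nodes, attr, text):
    """Model of dictGetAll: child value first, then parents; skip nodes without attr."""
    chain = matching_chain(nodes, text)
    return [m.expand(attrs[attr]) for m, attrs in reversed(chain) if attr in attrs]


value = "origin.shop.example.com/some-path"
print(dict_get_all(TREE, "host", value))        # ['shop.example.com']
print(dict_get_all(TREE, "originType", value))  # ['origin', 'unknown']
```

Two nodes match the input, but the host array holds one value because only the parent sets host. The originType array holds both values, child first.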
Request several attributes at once by passing a parenthesized list of attribute names:
[Code example: Get Multiple Attributes]
The dictGetOrDefault function returns a fallback when a node doesn't set the requested attribute:
[Code example: Get Attributes With Defaults]
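To run these checks from a script, a small helper can assemble the query text for each function. The helper names below are hypothetical, as are the project name, dictionary name, and attribute choices. The sketch assumes the usual argument order of dictionary name, attribute name or names, then input value, with an optional default last. It only handles a string default for a single attribute.

```python
def sql_literal(value):
    """Quote a string as a SQL literal, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def attr_arg(attrs):
    """Render one attribute name, or a parenthesized list of names."""
    if isinstance(attrs, str):
        return sql_literal(attrs)
    return "(" + ", ".join(sql_literal(a) for a in attrs) + ")"


def dict_query(function, project, dictionary, attrs, value, default=None):
    """Build a SELECT that calls one of the dictGet family of functions."""
    name = f"{project}_{dictionary}"  # project and dictionary joined by an underscore
    args = [sql_literal(name), attr_arg(attrs), sql_literal(value)]
    if default is not None:
        args.append(sql_literal(default))
    return f"SELECT {function}({', '.join(args)})"


value = "origin.shop.example.com/some-path"
print(dict_query("dictGet", "myproject", "my_dict", "originType", value))
print(dict_query("dictGetAll", "myproject", "my_dict", "host", value))
print(dict_query("dictGet", "myproject", "my_dict", ("originType", "version"), value))
print(dict_query("dictGetOrDefault", "myproject", "my_dict", "version", value, default="unknown"))
```

The first call prints SELECT dictGet('myproject_my_dict', 'originType', 'origin.shop.example.com/some-path'). The others differ only in the function name, the attribute argument, and the trailing default.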
The dictionary is set up correctly when these checks pass:
- The dictionary appears under Data > Dictionaries with the regexp_tree layout in the Hydrolix UI.
- dictGet returns the expected attribute for a known matching value.
- dictGetAll returns an array with both the child and parent values for a known child value.
Troubleshooting
| Issue | Solution |
|---|---|
| Dictionary returns empty strings | Confirm the attribute names in the schema match the YAML keys exactly, including casing. |
| dictGet returns empty for a known value | Confirm the pattern is valid and matches the input. Test with SELECT match('your-input', 'your-regexp'). |
| Child node attributes aren't returned | Check that the child regexp is nested under the parent in the YAML and that attributes are defined at the right level. |
| File upload fails | Confirm the YAML is valid with a linter. Use spaces for indentation, not tabs. |
| dictGetAll returns one entry | Confirm the child value also matches a parent pattern. dictGetAll traverses the full tree. |
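Before uploading, you can also pre-check patterns locally. The sketch below uses Python's re module, whose regular expression dialect may differ from the one the dictionary engine uses, so treat it as a first pass and confirm with the match query from the table. The function name, patterns, and sample inputs are hypothetical.

```python
import re


def check_patterns(patterns, samples):
    """Report patterns that fail to compile and samples that no pattern matches."""
    compiled, invalid = [], []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            invalid.append((pattern, str(exc)))
    unmatched = [s for s in samples if not any(c.search(s) for c in compiled)]
    return invalid, unmatched


invalid, unmatched = check_patterns(
    [r"origin\.shop\.example\.com(\/.*)?", r"(unclosed"],
    ["origin.shop.example.com/some-path", "other.example.org"],
)
print([pattern for pattern, _ in invalid])  # ['(unclosed']
print(unmatched)                            # ['other.example.org']
```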
Example: Categorize user agents
This example categorizes user-agent strings into bot and crawler types, including per-vendor AI crawlers. The dictionary file, user_agent_regex_dict.yaml, is available in the hydrolix_examples repository.
Each node matches a group of user agent signatures and sets a category. This excerpt shows two nodes:
[Code example: User-Agent Dictionary Excerpt]
The schema declares the regexp key and the three attributes:
To categorize the user agent at ingest, add a dictGet lookup to the transform that reads the user agent column:
[Code example: Categorize User Agents at Ingest]
With the category stored as a column, queries can group traffic by category or filter out bots without scanning the raw user-agent string.