Usage
Work with data, projects, transforms, and many other Hydrolix features using the HDXCLI
See the table of contents for an overview of the topics covered on this page.
init
Command
init
The init command is used to set up the initial configuration for hdxcli. It creates the necessary configuration directory and default profile, allowing you to start using the CLI with your specific environment. By default, the configuration file is stored in a directory created automatically by the tool, but you can customize its location by setting the HDX_CONFIG_DIR environment variable.
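For example, to keep the configuration in a custom location (the path below is illustrative):
export HDX_CONFIG_DIR=/home/user/.hdx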
Usage
When you run hdxcli init, you will be prompted to enter the following details:
- Cluster Hostname: Enter the hostname of the cluster you will be connecting to.
- Username: Provide your cluster's username, typically your email address.
- Protocol: Specify whether you will be using HTTPS by typing Y or N.
After entering these details, a configuration file will be generated with the profile named default and saved at the specified location (e.g., /path/to/your/config.toml).
Example
$ hdxcli init
================== HDXCLI Init ===================
A new configuration will be created now.
Please, type the host name of your cluster: my-cluster.example.com
Please, type the user name of your cluster: user@example.com
Will you be using https (Y/N): Y
Your configuration with profile [default] has been created at /path/to/your/config.toml
This command must be run before using other commands in hdxcli, as it sets up the essential connection parameters.
Command-line Tool Organization
Most commands follow the general invocation form:
hdxcli <resource> [subresource] <verb> [resource_name]
Table and project resources have defaults that depend on the profile you are working with, so they can be omitted if you previously used the set command.
For all other resources, you can use --transform, --dictionary, --source, etc. Please see the command-line help for more information.
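For example, assuming a project named my_project and a table named my_table, you can store them as profile defaults with set and then omit the --project and --table options in later commands:
hdxcli set my_project my_table
hdxcli transform list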
Profiles
hdxcli supports multiple profiles. You can use a default profile or use the --profile option to operate on a non-default profile.
When invoking a command, if a login to the server is necessary, a prompt will be shown and the token will be cached.
Listing and Showing Profiles
Listing profiles:
hdxcli profile list
Showing default profile:
hdxcli profile show
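Using a non-default profile with any command (the profile name staging is illustrative):
hdxcli --profile staging project list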
Projects, Tables, and Transforms
The basic operations you can do with these resources are:
- list them
- create a new resource
- delete an existing resource
- modify an existing resource
- show a resource in raw JSON format
- show settings from a resource
- write a setting
- show a single setting
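Applying the general invocation form above to projects, for example (my_project is an illustrative name; check the command-line help for the exact verbs each resource supports):
hdxcli project list
hdxcli project create my_project
hdxcli project show my_project
hdxcli project delete my_project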
Working with Transforms
You can create and override transforms with the following commands.
Create a transform:
hdxcli transform create -f <transform-settings-file>.json <transform-name>
Remember that a transform is applied to a table in a project, so whatever you set with the command-line tool will be the target of your transform.
If you want to override that default, specify the table name with the --table option:
hdxcli transform --project <project-name> --table <table-name> create -f <transform-settings>.json <transform-name>
For an example of a valid transform file structure, see our Transform Structure page.
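As a rough orientation only, a minimal transform settings file could look like the sketch below. The column names, datetime format, and layout here are illustrative assumptions; the Transform Structure page remains the authoritative reference.
{
  "name": "my_transform",
  "type": "json",
  "settings": {
    "is_default": true,
    "output_columns": [
      {
        "name": "timestamp",
        "datatype": {
          "type": "datetime",
          "primary": true,
          "format": "2006-01-02 15:04:05"
        }
      },
      {
        "name": "message",
        "datatype": {
          "type": "string"
        }
      }
    ]
  }
}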
Data Migration Command for Hydrolix Tables
This command provides a way to migrate Hydrolix tables and their data to a target cluster, or even within the same cluster. You only need to specify the source and target table names in the format project_name.table_name, plus the RClone service information. The migration process will handle creating the project, functions, dictionaries, table, and transforms at the target location. It will then copy the partitions from the source bucket to the target bucket and finally upload the catalog so that Hydrolix can associate the created table with the migrated partitions.
Usage
hdxcli migrate [OPTIONS] SOURCE_TABLE TARGET_TABLE RCLONE_HOST
Options
-tp, --target-profile Name of the profile to use for the target cluster connection.
-h, --target-hostname Hostname of the target cluster.
-u, --target-username Username for the target cluster, typically an email address.
-p, --target-password Password for the target cluster.
-s, --target-uri-scheme URI scheme for the target cluster (http or https).
--allow-merge Allow migration if the merge setting is enabled.
--only The migration type: "resources" or "data".
--from-date Minimum timestamp for filtering partitions in YYYY-MM-DD HH:MM:SS format.
--to-date Maximum timestamp for filtering partitions in YYYY-MM-DD HH:MM:SS format.
--reuse-partitions Perform a dry migration without moving partitions. Both clusters must share the bucket(s) where the partitions are stored.
--rc-user The username for authenticating with the RClone server.
--rc-pass The password for authenticating with the RClone server.
--concurrency Number of concurrent requests during file migration. Default is 20.
--temp-catalog Use a previously downloaded catalog stored in a temporary file, instead of downloading it again.
--help Show this message and exit.
--target-profile
This option must be used to provide the name of the profile for the target cluster connection during the migration. You can specify an existing profile if it has already been created, or alternatively, you can provide the required connection options manually, such as --target-hostname, --target-username, --target-password, and --target-uri-scheme.
--allow-merge
This flag skips the validation that the merge setting is disabled on the source table, allowing migration while merge is enabled.
--only
This option expects either resources or data. If resources is selected, only the resources (project, functions, dictionaries, table, and transforms) will be migrated. If data is selected, only the data will be migrated, and the resources must already exist.
--from-date and --to-date
These options help filter the partitions to be migrated. They expect dates in the format YYYY-MM-DD HH:MM:SS.
--reuse-partitions
This option enables dry migration. Both the source and target clusters must share the storage where the table's partitions are located. This allows migrating the table to the target cluster while reusing the partitions from the source cluster without creating new ones. This results in an almost instant migration but requires that the same partitions are shared by different tables across clusters.
Note: Modifying data in one table may cause issues in the other.
--rc-user and --rc-pass
These options specify the credentials required to authenticate with the RClone service. Ensure you provide valid credentials to enable file migration functionality.
--concurrency
This option allows manually setting the concurrency level for partition migration. The default value is 20, with a minimum of 1 and a maximum of 50.
Note: Generally, higher concurrency is beneficial when migrating a large number of small partitions.
--temp-catalog
This option uses a temporarily saved version of the table catalog stored in the /tmp directory, if it exists. This is particularly useful when handling large catalogs, as it avoids downloading the catalog multiple times.
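Putting several options together, an end-to-end invocation could look like this sketch (the table names, RClone host, credentials, and values are all illustrative):
hdxcli migrate --target-profile prod \
--rc-user rclone_user --rc-pass rclone_secret \
--from-date "2024-01-01 00:00:00" \
--to-date "2024-06-30 23:59:59" \
my_project.logs my_project.logs_copy rclone-host.example.com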
Supported Cloud Storages:
- AWS
- GCP
- Azure
- Linode
During the migration process, credentials to access these clouds will likely be required. These credentials need to be provided when prompted:
- GCP: You need the path to the JSON file of the service account with access to the bucket.
- AWS and Linode: Requires access key and secret key.
- Azure: Account and key must be provided.
Pre-Migration Checks and Validations
Before creating resources and migrating partitions, the following checks are performed:
- The source table does not have the merge setting enabled (use --allow-merge to skip this validation)
- There are no running alter jobs on the source table
- If filtering is applied, it validates that there are partitions remaining to migrate after filtering
- If using the --reuse-partitions option, it checks that the storage where the partitions are located is shared between both clusters
Migrating Resources
This command migrates resources from one cluster to another (or even within the same cluster). It supports the following resources: projects, tables, transforms, functions, and dictionaries. These resources are cloned with the same settings to ensure uniqueness in the target cluster.
General Command Syntax
hdxcli --profile <source-profile> project --project <project-name> migrate <new-project-name> --target-profile <target-profile>
Explanation
The above command migrates a project (<project-name>) from the --profile specified as <source-profile> to a new project (<new-project-name>) in the <target-profile>. By default, it migrates all related resources in the project's configuration tree, including tables and transforms.
Flags to Customize Behavior
- --only: Migrates only the project without its related configuration tree resources (tables + transforms).
- --functions: Includes functions associated with the project during migration.
- --dictionaries: Includes dictionaries associated with the project during migration.
- --no-rollback: Disables the rollback process in case an issue occurs during migration.
Cluster Connection Details
If you need to specify the target cluster's connection information directly:
hdxcli --profile <source-profile> project --project <project-name> migrate <new-project-name> \
--target-cluster-hostname <target-cluster-hostname> \
--target-cluster-username <target-cluster-username> \
--target-cluster-password <target-cluster-password> \
--target-cluster-uri-scheme <http/https>
Examples
Project Migration
- Migrate a project with tables and transforms
Migrates the project hydrolix from the default profile to the test profile. The new project name will be new_hydrolix. This includes the project's tables and transforms.
hdxcli --profile default project --project hydrolix migrate new_hydrolix --target-profile test
- Include functions and dictionaries
Same as above, but also migrates functions and dictionaries associated with the project.
hdxcli --profile default project --project hydrolix migrate new_hydrolix --target-profile test --functions --dictionaries
- Migrate only the project
Migrates only the hydrolix project without any related tables or transforms.
hdxcli --profile default project --project hydrolix migrate new_hydrolix --target-profile test --only
Table Migration
- Migrate a table with transforms
Migrates the table logs (within the hydrolix project) from the default profile to the test profile. The new table name will be new_logs (within the new_hydro project). This includes any transforms associated with the table.
hdxcli --profile default table --project hydrolix --table logs migrate new_hydro new_logs --target-profile test
- Migrate only the table
Migrates only the logs table without any associated transforms.
hdxcli --profile default table --project hydrolix --table logs migrate new_hydro new_logs --target-profile test --only
Handling Interactive Prompts During Migration
In some scenarios, the CLI requires user input to handle specific resource configurations during the migration process. These cases ensure that important settings are either preserved, updated, or removed based on the user's decision.
Common Scenarios Requiring User Input
- Default Storage Settings. If a table has a default storage configuration, the CLI prompts the user to choose how to handle it:
- Preserve the current settings
- Specify a new default storage ID
- Remove the settings and use the cluster's default storage
- Auto-Ingest Settings. For tables with auto-ingest configurations, the user can choose whether to keep or remove these settings during the migration.
- Merge Pool Names. Tables with merge pool configurations prompt the user to specify how to handle these settings.
- Summary Tables. If a table is a summary table, the CLI will request the new parent project.table for the summary query.
Example: Interactive Migration Workflow
Here is an example of how the CLI handles these prompts during a project and table migration:
hdxcli project --project hydrolix migrate new_hydrolix --target-profile test
[INFO] Migrating project 'new_hydrolix'...
[SUCCESS] Project 'new_hydrolix' Migrated
[INFO] Migrating table 'logs'...
[WARNING] Storage settings found in the table 'logs'
Default Storage ID: 24aa950d-71cc-4940-a34d-da4567cf838a
Column Name: None
Column Value Mapping: -
[PROMPT] How would you like to proceed?
1) Preserve all existing settings without any changes
2) Specify a new default storage ID
3) Remove the storage settings (use cluster default)
Please enter your choice (1/2/3): 2
Please enter the new default storage ID: 0d42b1e9-a1e7-4e5a-96c3-72bd05e580a8
[SUCCESS] Table 'logs' Migrated
[INFO] Migrating table 'summary'...
[WARNING] Summary settings found in the table 'summary'
The current parents for the summary table are: hydrolix.logs
[PROMPT] Please enter a new project and table in 'project.table' format (leave blank to keep current)
New parents 'project.table': new_hydrolix.logs
[WARNING] Storage settings found in the table 'summary'
Default Storage ID: 24aa950d-71cc-4940-a34d-da4567cf838a
Column Name: None
Column Value Mapping: -
[PROMPT] How would you like to proceed?
1) Preserve all existing settings without any changes
2) Specify a new default storage ID
3) Remove the storage settings (use cluster default)
Please enter your choice (1/2/3): 3
[SUCCESS] Table 'summary' Migrated
Mapping DDLs to a Hydrolix Transform
The command transform map-from consumes data definition languages (DDLs) such as SQL, Elastic, and others and creates a Hydrolix transform from them.
hdxcli transform map-from --ddl-custom-mapping <sql_to_hdx_mapping>.json <ddl_file> <transform-name>
There are three things involved:
- The mapping file sql_to_hdx_mapping.json: tells how to map simple and compound types from the source DDL into Hydrolix.
- The input file ddl_file: in SQL, Elastic, or another language.
- The target table: the table to which the transform is applied. The current table is used if none is provided.
Mapping File
The mapping file contains two sections:
- simple_datatypes
- compound_datatypes
An example of this for SQL could be the following:
{
"simple_datatypes": {
"INT": ["uint8", "uint16", "uint32", "int8", "int16", "int32_optimal"],
"BIGINT":["uint64", "int64_optimal"],
"STRING": "string",
"DATE": "datetime",
"REAL": "double",
"DOUBLE": "double",
"FLOAT": "double",
"BOOLEAN": "boolean",
"TIMESTAMP": "datetime",
"TEXT": "string"
},
"compound_datatypes": {
"ARRAY": "array",
"STRUCT": "map",
"MAP": "map",
"VARCHAR": "string"
}
}
For mappings whose value is a list, the type ending in _optimal is used. The suffix is just a stand-in for a comment, since JSON does not allow comments.
Parsing compound_datatypes requires additional code support. This is where the extensible interfaces for DDLs come in.
DDL File as Input
The ddl_file specifies the structure of a data storage entity, such as a SQL table or an Elasticsearch mapping. It should encompass fields, their associated data types, and any additional constraints. This specification is pivotal for accurate mapping to Hydrolix transforms.
The following DDL SQL could be an example of this:
CREATE TABLE a_project.with_a_table (
account_id STRING,
device_id STRING,
playback_session_id STRING,
user_session_id STRING,
user_agent STRING,
timestamp TIMESTAMP,
start_timestamp BIGINT,
another_time TIMESTAMP PRIMARY KEY,
end_timestamp BIGINT,
video_ranges ARRAY<STRING>,
playback_started BOOLEAN,
average_peak_bitrate BIGINT,
bitrate_changed_count INT,
bitrate_change_list ARRAY<BIGINT>,
bitrate_change_timestamp_list ARRAY<BIGINT>,
start_bitrate BIGINT,
fetch_and_render_media_segment_duration BIGINT,
fetch_and_render_media_segment_start_timestamp BIGINT,
re_buffer_start_timestamp_list ARRAY<BIGINT>,
exit_before_ad_start_flag BOOLEAN,
ingest_utc_hour INT,
ingest_utc_date DATE)
USING delta
PARTITIONED BY (ingest_utc_date, ingest_utc_hour)
LOCATION 's3://...'
TBLPROPERTIES ('delta.minReaderVersion' = '1',
'delta.minWriterVersion' = '2');
User Choices File
After transform map-from does type mappings, it might need some tweaks that are user choices. There are two ways to provide these choices: interactively (the default), or through a user choices file passed with the --user-choices option. The file is a JSON file with key-value pairs specifying the user's preferences for the transform process. These key-value pairs define options such as ingest types, CSV delimiters, and Elastic-specific settings (described below).
The user choices are applied as a post-processing step. Some of these options are shared by all DDLs, while others are specific to a single DDL. Examples of user choices are:
- General: the ingest index for each field if CSV is used
- Elastic: fields whose cardinality may be more than one; the algorithm's output is adjusted to map those simple types to arrays
There are two kinds of user choices: the general ones and the DDL-specific ones. The DDL-specific user choices are prefixed with the name of the DDL (the same name you pass to the -s option on the command line). For Elastic it would be elastic, and for SQL, sql.
| User Choice Key | Example Value | Purpose |
|---|---|---|
| ingest_type | json (valid values are json and csv) | Tell the transform the expected data type for ingestion. |
| csv_indexes | [["field1", 0], ["field2", 1], …] (array of arrays with field name and index) | Know where ingest happens for which fields in the transform. Applies to csv. |
| csv_delimters | , (a string that separates CSV fields) | Delimit fields in csv format. |
| elastic.array_fields | ["field1", "field2", "some_field.*regex"] (which fields are considered arrays in Elastic mappings) | In Elastic, all fields have 0..* cardinality by default. Hydrolix will map all to cardinality one except the ones indicated in this user choice, which will be mapped to arrays of that type. |
| compression | gzip | Which compression algorithm to use. |
| primary_key | your_key | The primary key for the transform. |
| add_ignored_fields_as_string_columns | true or false | Whether ignored fields should be added as string columns. |
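As an illustrative sketch, a user choices file combining some of the keys above might look like this (the field names and values are hypothetical):
{
  "ingest_type": "csv",
  "csv_indexes": [["account_id", 0], ["timestamp", 1]],
  "csv_delimters": ",",
  "compression": "gzip",
  "primary_key": "timestamp"
}
It would then be passed with the --user-choices option, for example:
hdxcli transform map-from --ddl-custom-mapping sql_to_hdx_mapping.json --user-choices user_choices.json ddl_file.sql my_transform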
Ingest
Batch Job
Create a batch job:
hdxcli job batch ingest <job-name> <job-settings>.json
job-name is the name of the job that will be displayed when listing batch jobs. job-settings is the path to the file containing the specifications required to create that ingestion (for more information on the required specifications, see the Hydrolix API Reference).
In this case, the project, table, and transform are being omitted. hdxcli will use the default transform within the project and table previously configured in the profile with the set command. Otherwise, you can add --project <project-name> --table <table-name> --transform <transform-name>.
This allows you to execute the command as follows:
hdxcli job batch --project <project-name> --table <table-name> --transform <transform-name> ingest <job-name> <job-settings>.json
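For instance, with illustrative names filled in:
hdxcli job batch --project website --table logs --transform logs_transform ingest daily_load ./job-settings.json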
Stream
Create the streaming ingest as follows:
hdxcli stream --project <project-name> --table <table-name> --transform <transform-name> ingest <data-file>
data-file is the path of the data file to be used for the ingest. This can be a .csv, .json, or compressed file. The transform must be configured to match it (ingest type and compression).
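For instance, with illustrative names:
hdxcli stream --project website --table logs --transform logs_transform ingest ./sample-data.json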