Usage
Work with data, projects, transforms, and many other Hydrolix features using HDXCLI.
See the table of contents for an overview of this page.
init Command
The init command sets up the initial configuration for hdxcli. It creates the necessary configuration directory and default profile, allowing you to start using the CLI with your specific environment. By default, the configuration file is stored in a directory created automatically by the tool, but you can customize its location by setting the HDX_CONFIG_DIR environment variable.
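As an illustration of the HDX_CONFIG_DIR override, the following Python sketch launches hdxcli init with the variable pointing at a custom directory. The helper names are hypothetical; only the hdxcli init command and the HDX_CONFIG_DIR variable come from this page, and whether the directory must already exist is not stated here.

```python
import os
import subprocess


def env_with_config_dir(config_dir, base_env=None):
    """Return a copy of the environment with HDX_CONFIG_DIR set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HDX_CONFIG_DIR"] = str(config_dir)
    return env


def run_init(config_dir):
    """Run the interactive `hdxcli init` with a custom configuration directory.

    The prompts are answered on the inherited terminal, and hdxcli is
    assumed to be available on PATH.
    """
    return subprocess.run(
        ["hdxcli", "init"],
        env=env_with_config_dir(config_dir),
        check=True,
    )
```

Calling run_init with a directory of your choice starts the same prompts shown in the example further down this section, with the configuration written under that directory instead of the default one.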
Usage
When you run hdxcli init, you will be prompted to enter the following details:
- Cluster Hostname: Enter the hostname of the cluster you will be connecting to.
- Username: Provide your cluster's username, typically your email address.
- Protocol: Specify whether you will be using HTTPS by typing Y or N.
After entering these details, a configuration file will be generated with the profile named default and saved at the specified location (e.g., /path/to/your/config.toml).
Example
$ hdxcli init
================== HDXCLI Init ===================
A new configuration will be created now.
Please, type the host name of your cluster: my-cluster.example.com
Please, type the user name of your cluster: [email protected]
Will you be using https (Y/N): Y
Your configuration with profile [default] has been created at /path/to/your/config.toml
This command must be run before using other commands in hdxcli, as it sets up the essential connection parameters.
Command-line Tool Organization
Most commands follow this general invocation form:
hdxcli <resource> [subresource] <verb> [resource_name]
Table and project resources have defaults that depend on the profile you are working with, so they can be omitted if you previously used the set command.
For all other resources, you can use --transform, --dictionary, --source, etc. Please see the command line help for more information.
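The invocation form can be sketched as a small Python helper that assembles the argument list for a command. The function name and the example values are hypothetical; the position of the --project and --table options (after the resource and before the verb) follows the command examples on this page.

```python
def build_command(resource, verb, *verb_args, subresource=None, options=None):
    """Assemble argv for:
    hdxcli <resource> [subresource] [--option value ...] <verb> [args...]
    """
    argv = ["hdxcli", resource]
    if subresource:
        argv.append(subresource)
    # Resource selectors such as --project or --table go before the verb.
    for name, value in (options or {}).items():
        argv += [f"--{name}", value]
    argv.append(verb)
    # Verb arguments end with the resource name when the verb takes one.
    argv += list(verb_args)
    return argv


create_transform = build_command(
    "transform", "create", "-f", "transform.json", "my_transform",
    options={"project": "my_project", "table": "my_table"},
)
```

The resulting list can be passed to subprocess.run, or joined with spaces to print the equivalent shell command.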
Profiles
hdxcli supports multiple profiles. You can use a default profile or use the --profile option to operate on a non-default profile.
When invoking a command, if a login to the server is necessary, a prompt will be shown and the token will be cached.
Listing and Showing Profiles
Listing profiles:
hdxcli profile list
Showing default profile:
hdxcli profile show
Projects, Tables, and Transforms
The basic operations you can do with these resources are:
- list them
- create a new resource
- delete an existing resource
- modify an existing resource
- show a resource in raw JSON format
- show settings from a resource
- write a setting
- show a single setting
Working with Transforms
You can create and override transforms with the following commands.
Create a transform:
hdxcli transform create -f <transform-settings-file>.json <transform-name>
Remember that a transform is applied to a table in a project, so the project and table you selected with the set command will be the target of your transform.
If you want to override them, specify the names with the --project and --table options:
hdxcli transform --project <project-name> --table <table-name> create -f <transform-settings>.json <transform-name>
For an example of a valid transform file structure, see our Transform Structure page.
Migration Command for Hydrolix Tables
This command migrates a Hydrolix table and its data to a target cluster, or to another location within the same cluster. You only need to specify the source and target table names in the format project_name.table_name and the RClone service information. The migration process handles creating the project, functions, dictionaries, table, and transforms at the target location. It then copies the partitions from the source bucket to the target bucket and finally uploads the catalog so that Hydrolix can associate the created table with the migrated partitions.
Usage
hdxcli migrate [OPTIONS] SOURCE_TABLE TARGET_TABLE RCLONE_HOST
Options
-tp, --target-profile
-h, --target-hostname
-u, --target-username
-p, --target-password
-s, --target-uri-scheme
--allow-merge Allow migration if the merge setting is enabled.
--only The migration type: "resources" or "data".
--from-date Minimum timestamp for filtering partitions in YYYY-MM-DD HH:MM:SS format.
--to-date Maximum timestamp for filtering partitions in YYYY-MM-DD HH:MM:SS format.
--reuse-partitions Perform a dry migration without moving partitions. Both clusters must share the bucket(s) where the partitions are stored.
--rc-user The username for authenticating with the RClone server.
--rc-pass The password for authenticating with the RClone server.
--concurrency Number of concurrent requests during file migration. Default is 20.
--temp-catalog Use a previously downloaded catalog stored in a temporary file, instead of downloading it again.
--help Show this message and exit.
--target-profile
This option provides the name of the profile for the target cluster connection during the migration. You can specify an existing profile if it has already been created, or alternatively, you can provide the required connection options manually, such as --target-hostname, --target-username, --target-password, and --target-uri-scheme.
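The two ways of identifying the target cluster can be sketched in Python as a helper that assembles the hdxcli migrate argument list. The function names are hypothetical, and requiring either a profile or manual options is this sketch's reading of the paragraph above rather than documented CLI behavior; the flags and the positional order come from the usage line.

```python
def split_table_ref(ref):
    """Split 'project_name.table_name' at the first dot into (project, table)."""
    project, sep, table = ref.partition(".")
    if not sep or not project or not table:
        raise ValueError(f"expected project_name.table_name, got {ref!r}")
    return project, table


def build_migrate_argv(source, target, rclone_host,
                       target_profile=None, target_options=None):
    """Assemble argv for: hdxcli migrate [OPTIONS] SOURCE_TABLE TARGET_TABLE RCLONE_HOST."""
    split_table_ref(source)  # validate the format only
    split_table_ref(target)
    argv = ["hdxcli", "migrate"]
    if target_profile:
        argv += ["--target-profile", target_profile]
    elif target_options:
        # Manual connection options, emitted in a fixed order.
        for key in ("hostname", "username", "password", "uri-scheme"):
            if key in target_options:
                argv += [f"--target-{key}", target_options[key]]
    else:
        raise ValueError("a target profile or manual connection options are required")
    return argv + [source, target, rclone_host]
```

For example, build_migrate_argv("src_project.src_table", "dst_project.dst_table", "rclone.example.com", target_profile="target") yields the arguments for a profile-based migration.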
--allow-merge
This flag skips the pre-migration check that the source table does not have the merge setting enabled.
--only
This option expects either resources or data. If resources is selected, only the resources (project, functions, dictionaries, table, and transforms) will be migrated. If data is selected, only the data will be migrated, and the resources must already exist.
--from-date and --to-date
These options help filter the partitions to be migrated. They expect dates in the format: YYYY-MM-DD HH:MM:SS.
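Before passing values to --from-date and --to-date, the expected format can be checked locally. The Python sketch below is one way to do that; the helper names are hypothetical, and rejecting a start later than the end is a sanity check added by this sketch, not documented CLI behavior.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD HH:MM:SS


def parse_filter_timestamp(text):
    """Parse a timestamp, accepting only the zero-padded documented format."""
    parsed = datetime.strptime(text, TIMESTAMP_FORMAT)
    # strptime tolerates missing zero padding, so round-trip to be strict.
    if parsed.strftime(TIMESTAMP_FORMAT) != text:
        raise ValueError(f"expected YYYY-MM-DD HH:MM:SS, got {text!r}")
    return parsed


def check_date_range(from_date=None, to_date=None):
    """Validate the optional bounds and return them as datetime objects."""
    start = parse_filter_timestamp(from_date) if from_date else None
    end = parse_filter_timestamp(to_date) if to_date else None
    if start and end and start > end:
        raise ValueError("--from-date is later than --to-date")
    return start, end
```

Both helpers raise ValueError on malformed input, so a wrapper script can stop before starting a long-running migration.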
--reuse-partitions
This option enables dry migration. Both the source and target clusters must share the storage where the table's partitions are located. This allows migrating the table to the target cluster while reusing the partitions from the source cluster without creating new ones. This results in an almost instant migration but requires that the same partitions are shared by different tables across clusters.
Note: Modifying data in one table may cause issues in the other.
--rc-user and --rc-pass
These options specify the credentials required to authenticate with the RClone service. Ensure you provide valid credentials to enable file migration functionality.
--concurrency
This option allows manually setting the concurrency level for partition migration. The default value is 20, with a minimum of 1 and a maximum of 50.
Note: Generally, higher concurrency is beneficial when migrating a large number of small partitions.
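The documented bounds can be checked before launching a migration. This Python sketch uses hypothetical names; only the limits (1 to 50) and the default (20) come from the description above.

```python
DEFAULT_CONCURRENCY = 20
MIN_CONCURRENCY = 1
MAX_CONCURRENCY = 50


def concurrency_args(value=None):
    """Return the --concurrency arguments, or [] to rely on the CLI default."""
    if value is None:
        return []
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("concurrency must be an integer")
    if not MIN_CONCURRENCY <= value <= MAX_CONCURRENCY:
        raise ValueError(
            f"concurrency must be between {MIN_CONCURRENCY} and {MAX_CONCURRENCY}"
        )
    return ["--concurrency", str(value)]
```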
--temp-catalog
This option uses a temporarily saved version of the table catalog stored in the /tmp directory, if it exists. This is particularly useful when handling large catalogs, as it avoids downloading the catalog multiple times.
Supported Cloud Storages:
- AWS
- GCP
- Azure
- Linode
During the migration process, credentials to access these clouds will likely be required. These credentials need to be provided when prompted:
- GCP: The path to the JSON file of the service account with access to the bucket.
- AWS and Linode: The access key and secret key.
- Azure: The account and key.
Pre-Migration Checks and Validations
Before creating resources and migrating partitions, the following checks are performed:
- The source table does not have the merge setting enabled (use --allow-merge to skip this validation)
- There are no running alter jobs on the source table
- If filtering is applied, it validates that there are partitions remaining to migrate after filtering
- If using the --reuse-partitions option, it checks that the storage where the partitions are located is shared between both clusters
Migrating Individual Resources
This command migrates an individual resource from one cluster to another, or within the same cluster. The migrate command is available for almost all resources (project, table, transform, function, dictionary, storage) and clones the resource to the target cluster with the same settings but a different UUID.
hdxcli project migrate <project-name> --target-cluster-hostname <target-cluster> --target-cluster-username <username> --target-cluster-password <password> --target-cluster-uri-scheme <http/https>
You can also use an existing profile as the target cluster by passing it with the -tp, --target-cluster option.
hdxcli project migrate <project-name> -tp <profile-name>
When migrating a table, you must provide the name of the project in the target cluster that the table will be migrated to, using the -P, --target-project-name option.
hdxcli table --project <project-name> migrate <table-name> -tp <profile-name> -P <project-name>
Mapping DDLs to a Hydrolix Transform
The transform map-from command consumes data definition languages (DDLs) such as SQL, Elastic, and others, and creates a Hydrolix transform from them.
hdxcli transform map-from --ddl-custom-mapping <sql_to_hdx_mapping>.json <ddl_file> <transform-name>
There are three things involved:
- The mapping file sql_to_hdx_mapping.json: tells how to map simple and compound types from the source DDL into Hydrolix.
- The input file ddl_file: in SQL, Elastic or other language.
- The target table: the table in which the transform is applied. The current table is used if none is provided.
Mapping File
The mapping file contains two sections:
- simple_datatypes
- compound_datatypes
An example of this for SQL could be the following:
{
    "simple_datatypes": {
        "INT": ["uint8", "uint16", "uint32", "int8", "int16", "int32_optimal"],
        "BIGINT": ["uint64", "int64_optimal"],
        "STRING": "string",
        "DATE": "datetime",
        "REAL": "double",
        "DOUBLE": "double",
        "FLOAT": "double",
        "BOOLEAN": "boolean",
        "TIMESTAMP": "datetime",
        "TEXT": "string"
    },
    "compound_datatypes": {
        "ARRAY": "array",
        "STRUCT": "map",
        "MAP": "map",
        "VARCHAR": "string"
    }
}
When a mapping has a list as its value, the entry ending in _optimal is the one used. The other entries in the list only stand in for comments, since JSON does not allow comments.
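The _optimal convention can be sketched in Python as follows. This is not hdxcli's implementation: the function name is hypothetical, and stripping the _optimal suffix to obtain the final type name is an assumption of this sketch.

```python
import json

OPTIMAL_SUFFIX = "_optimal"


def resolve_simple_type(mapping_value):
    """Return the type selected by a simple_datatypes entry."""
    if isinstance(mapping_value, str):
        return mapping_value
    # A list documents the candidates; the marked entry is the one used.
    marked = [c for c in mapping_value if c.endswith(OPTIMAL_SUFFIX)]
    if len(marked) != 1:
        raise ValueError(f"expected exactly one {OPTIMAL_SUFFIX} entry in {mapping_value}")
    # Assumption: the suffix is only a marker, not part of the type name.
    return marked[0][: -len(OPTIMAL_SUFFIX)]


mapping = json.loads("""
{
    "simple_datatypes": {
        "INT": ["uint8", "uint16", "uint32", "int8", "int16", "int32_optimal"],
        "BIGINT": ["uint64", "int64_optimal"],
        "STRING": "string"
    }
}
""")

resolved = {
    source_type: resolve_simple_type(value)
    for source_type, value in mapping["simple_datatypes"].items()
}
```

With the excerpt above, INT resolves to int32, BIGINT to int64, and STRING stays string.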
The compound_datatypes entries cannot be fully parsed from the mapping alone; they need supporting code. This is where the extensible interfaces for DDLs come in.
DDL File as Input
The ddl_file describes the structure of the source data store, for example a SQL table or an Elasticsearch mapping. It should include the fields, their associated data types, and any additional constraints. The accuracy of the mapping to a Hydrolix transform depends on this specification.
The following SQL DDL is an example:
CREATE TABLE a_project.with_a_table (
account_id STRING,
device_id STRING,
playback_session_id STRING,
user_session_id STRING,
user_agent STRING,
timestamp TIMESTAMP,
start_timestamp BIGINT,
another_time TIMESTAMP PRIMARY KEY,
end_timestamp BIGINT,
video_ranges ARRAY<STRING>,
playback_started BOOLEAN,
average_peak_bitrate BIGINT,
bitrate_changed_count INT,
bitrate_change_list ARRAY<BIGINT>,
bitrate_change_timestamp_list ARRAY<BIGINT>,
start_bitrate BIGINT,
fetch_and_render_media_segment_duration BIGINT,
fetch_and_render_media_segment_start_timestamp BIGINT,
re_buffer_start_timestamp_list ARRAY<BIGINT>,
exit_before_ad_start_flag BOOLEAN,
ingest_utc_hour INT,
ingest_utc_date DATE)
USING delta
PARTITIONED BY (ingest_utc_date, ingest_utc_hour)
LOCATION 's3://...'
TBLPROPERTIES ('delta.minReaderVersion' = '1',
'delta.minWriterVersion' = '2');
User Choices File
After transform map-from maps the types, the result might need adjustments that depend on user choices. There are two ways to provide these choices: interactively (the default), or through a user choices file passed with the --user-choices option. The file is a JSON file with key-value pairs specifying the user's preferences for the transform process. These key-value pairs define options such as ingest types, CSV delimiters, and Elastic-specific settings (described below).
User choices are applied as a post-processing step. Some options are shared by all DDLs, while others are specific to a single DDL. Examples of user choices are:
- General: the ingest index for each field if CSV is used
- Elastic: the fields whose cardinality may be greater than one, so that the algorithm maps them to arrays instead of simple types
DDL-specific user choices are prefixed with the name of the DDL (the same name you pass to the -s option on the command line). For Elastic the prefix is elastic, and for SQL it is sql.
User Choice Key | Example Value | Purpose |
---|---|---|
ingest_type | json | Type of ingest. Valid values are json and csv. Tells the transform the expected data type for ingestion. |
csv_indexes | [["field1", 0], ["field2", 1], …] | Array of [field name, index] pairs. Tells the transform which CSV position each field is ingested from. Applies to csv. |
csv_delimters | , | A string that separates CSV fields. Delimits fields in csv format. |
elastic.array_fields | ["field1", "field2", "some_field.*regex"] | Which fields are considered arrays in Elastic mappings. In Elastic, all fields have 0..* cardinality by default. Hydrolix maps every field to cardinality one except the ones listed in this user choice, which are mapped to arrays of that type. |
compression | gzip | Which compression algorithm to use. |
primary_key | your_key | The primary key for the transform. |
add_ignored_fields_as_string_columns | true or false | Whether ignored fields should be added as string columns. |
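A user choices file can be generated with a short script. The Python sketch below uses only keys listed in the table; the helper names and example values are hypothetical, and which keys are mandatory is not stated on this page, so the sketch treats everything except ingest_type as optional.

```python
import json

VALID_INGEST_TYPES = {"json", "csv"}


def build_user_choices(ingest_type, primary_key=None, csv_indexes=None,
                       elastic_array_fields=None, compression=None,
                       add_ignored_fields_as_string_columns=None):
    """Build the key-value pairs for a --user-choices file."""
    if ingest_type not in VALID_INGEST_TYPES:
        raise ValueError(f"ingest_type must be one of {sorted(VALID_INGEST_TYPES)}")
    choices = {"ingest_type": ingest_type}
    if primary_key is not None:
        choices["primary_key"] = primary_key
    if csv_indexes is not None:
        # Each entry is a [field name, index] pair, as in the table's example.
        choices["csv_indexes"] = [[name, index] for name, index in csv_indexes]
    if elastic_array_fields is not None:
        # DDL-specific choices carry the DDL name as a prefix.
        choices["elastic.array_fields"] = list(elastic_array_fields)
    if compression is not None:
        choices["compression"] = compression
    if add_ignored_fields_as_string_columns is not None:
        choices["add_ignored_fields_as_string_columns"] = bool(
            add_ignored_fields_as_string_columns
        )
    return choices


def write_user_choices(path, choices):
    """Write the choices as JSON to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(choices, handle, indent=2)
```

The written file can then be passed to transform map-from with the --user-choices option.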
Ingest
Batch Job
Create a batch job:
hdxcli job batch ingest <job-name> <job-settings>.json
job-name is the name of the job that will be displayed when listing batch jobs. job-settings is the path to the file containing the specifications required to create that ingestion (for more information on the required specifications, see the Hydrolix API Reference).
In this case, the project, table, and transform are omitted, so hdxcli uses the default transform within the project and table previously configured in the profile with the set command. To target a different project, table, or transform, add --project <project-name> --table <table-name> --transform <transform-name>.
With those options, the command looks like this:
hdxcli job batch --project <project-name> --table <table-name> --transform <transform-name> ingest <job-name> <job-settings>.json
Stream
Create the streaming ingest as follows:
hdxcli stream --project <project-name> --table <table-name> --transform <transform-name> ingest <data-file>
data-file is the path of the data file to be used for the ingest. This can be a .csv file, a .json file, or a compressed file. The transform must be configured with the matching type and compression.