Usage

Work with data, projects, transforms, and many other Hydrolix features using the HDXCLI.

See the table of contents for an overview of the topics covered on this page.

init Command

The init command is used to set up the initial configuration for hdxcli. It creates the necessary configuration directory and default profile, allowing you to start using the CLI with your specific environment. The configuration file, by default, is stored in a directory created automatically by the tool, but you can customize its location by setting the HDX_CONFIG_DIR environment variable.

Usage

When you run hdxcli init, you will be prompted to enter the following details:

  1. Cluster Hostname: Enter the hostname of the cluster you will be connecting to.
  2. Username: Provide the username for your cluster, typically your email address.
  3. Protocol: Specify whether you will be using HTTPS by typing Y or N.

After entering these details, a configuration file containing a profile named default will be generated and saved in the configuration directory (e.g., /path/to/your/config.toml).

Example

$ hdxcli init
================== HDXCLI Init ===================
A new configuration will be created now.

Please, type the host name of your cluster: my-cluster.example.com
Please, type the user name of your cluster: user@example.com
Will you be using https (Y/N): Y

Your configuration with profile [default] has been created at /path/to/your/config.toml

This command must be run before using other commands in hdxcli, as it sets up the essential connection parameters.

Command-line Tool Organization

The tool is mostly organized around the general invocation form:

hdxcli <resource> [subresource] <verb> [resource_name]

Table and project resources have defaults that depend on the profile you are working with, so they can be omitted if you previously used the set command.

For all other resources, you can use --transform, --dictionary, --source, etc. Please see the command line help for more information.
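
For example, assuming the set command takes the project and table names as positional arguments (my_project and my_table are placeholder names):

hdxcli set my_project my_table

hdxcli transform list

After set, the transform list invocation operates on my_project.my_table without needing --project and --table.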

Profiles

hdxcli supports multiple profiles. You can use a default profile or use the --profile option to operate on a non-default profile.

When invoking a command, if a login to the server is necessary, a prompt will be shown and the token will be cached.
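
For example, assuming the global --profile option is passed before the resource and staging is the name of an existing profile:

hdxcli --profile staging project list

If no cached token exists for that profile, you will be prompted to log in first.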

Listing and Showing Profiles

Listing profiles:

hdxcli profile list

Showing default profile:

hdxcli profile show

Projects, Tables, and Transforms

The basic operations you can do with these resources are listed below, followed by a few example invocations:

  • list them
  • create a new resource
  • delete an existing resource
  • modify an existing resource
  • show a resource in raw JSON format
  • show settings from a resource
  • write a setting
  • show a single setting

Working with Transforms

You can create and override transforms with the following commands.

Create a transform:

hdxcli transform create -f <transform-settings-file>.json <transform-name>

Remember that a transform is applied to a table in a project, so the project and table you previously set with the command-line tool will be the target of your transform.

If you want to override those defaults, specify the project and table with the --project and --table options:

hdxcli transform --project <project-name> --table <table-name> create -f <transform-settings>.json <transform-name>

Migration Command for Hydrolix Tables

This command provides a way to migrate Hydrolix tables and their data to a target cluster, or even within the same cluster. You only need to pass the source and target table names in the format project_name.table_name. The migration process handles creating the project, table, and transforms at the target location. It then copies the partitions from the source bucket to the target bucket and finally loads the catalog so that Hydrolix can associate the created table with the migrated partitions.

If neither --target-profile nor the --target-hostname, --target-username, --target-password, and --target-uri-scheme options are provided, the tool assumes that the migration is performed within the same cluster.

Usage

hdxcli migrate [OPTIONS] SOURCE_TABLE TARGET_TABLE
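
For example, to migrate a table to a cluster defined in another profile (the profile, project, and table names below are placeholders):

hdxcli migrate --target-profile production source_project.source_table target_project.target_table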

Options

-tp, --target-profile     Profile to use as the migration target
-h, --target-hostname     Hostname of the target cluster
-u, --target-username     Username for the target cluster
-p, --target-password     Password for the target cluster
-s, --target-uri-scheme   URI scheme (http or https) of the target cluster
--allow-merge             Allow migration if the merge setting is enabled
--only                    The migration type (resources or data)
--min-timestamp           Minimum timestamp for filtering partitions
--max-timestamp           Maximum timestamp for filtering partitions
--recovery                Continue a previous migration
--reuse-partitions        Perform a dry migration without moving partitions
--workers                 Number of worker threads to use for migrating partitions

--allow-merge

This flag skips the check on the source table's merge setting, allowing migration even if merge is enabled.

--only

This option expects either resources or data. If resources is selected, only the resources (project, table, and transforms) will be migrated. If data is selected, only the data will be migrated, and the resources must already exist.

--min-timestamp and --max-timestamp

These options help filter the partitions to be migrated. They expect dates in the format: YYYY-MM-DD HH:MM:SS.

--recovery

This flag allows resuming a previous migration that did not complete successfully for any reason.

--reuse-partitions

This option enables dry migration. Both the source and target clusters must share the storage where the table's partitions are located. This allows migrating the table to the target cluster while reusing the partitions from the source cluster without creating new ones. This results in an almost instant migration but requires that the same partitions are shared by different tables across clusters. Note: Modifying data in one table may cause issues in the other.

--workers

This option allows manually setting the number of workers available for partition migration. The default number of workers is 10, with a minimum of 1 and a maximum of 50. Note: Generally, having a large number of workers is beneficial when dealing with many small partitions.
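
For example, to migrate only the data for partitions within a given time range, using 20 workers (the table names and timestamps below are placeholders, and because of --only data the resources are assumed to already exist on the target):

hdxcli migrate --only data --min-timestamp "2024-01-01 00:00:00" --max-timestamp "2024-01-31 23:59:59" --workers 20 --target-profile production source_project.source_table target_project.target_table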

Supported Cloud Storages:

  • AWS
  • GCP
  • Azure
  • Linode

During the migration process, credentials to access these clouds will likely be required. These credentials need to be provided when prompted:

  • AWS: AWS_ACCESS_KEY_ID and AWS_ACCESS_SECRET_ID
  • GCP: GOOGLE_APPLICATION_CREDENTIALS
  • Azure: CONNECTION_STRING
  • Linode: AWS_ACCESS_KEY_ID and AWS_ACCESS_SECRET_ID (same format as AWS)

Pre-Migration Checks and Validations

Before creating resources and migrating partitions, the following checks are performed:

  • The source table does not have the merge setting enabled (use --allow-merge to skip this validation)
  • There are no running alter jobs on the source table
  • If filtering is applied, it validates that there are partitions remaining to migrate after filtering
  • If using the --reuse-partitions option, it checks that the storage where the partitions are located is shared between both clusters

Migrating Individual Resources

This command migrates an individual resource from one cluster to another (or within the same cluster). The migrate command is available for almost all resources (project, table, transform, function, dictionary, storage), and it clones the resource to a target cluster with the same settings but a different UUID.

hdxcli project migrate <project-name> --target-cluster-hostname <target-cluster> --target-cluster-username <username> --target-cluster-password <password> --target-cluster-uri-scheme <http/https>

This command also accepts a profile as the target cluster, specified with -tp, --target-cluster:

hdxcli project migrate <project-name> -tp <profile-name>

When migrating a table, you must provide the name of the project in the target cluster where the table will be migrated, using -P, --target-project-name:

hdxcli table --project <project-name> migrate <table-name> -tp <profile-name> -P <project-name>

Mapping DDLs to a Hydrolix Transform

The transform map-from command consumes data definition languages (DDLs) such as SQL, Elastic, and others, and creates a Hydrolix transform from them.

hdxcli transform map-from --ddl-custom-mapping <sql_to_hdx_mapping>.json <ddl_file> <transform-name>

There are three things involved:

  • The mapping file (sql_to_hdx_mapping.json): tells how to map simple and compound types from the source DDL into Hydrolix.
  • The input file (ddl_file): written in SQL, Elastic, or another supported language.
  • The target table: the table to which the transform is applied. The table currently set in the profile is used if none is provided.
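
For example, following the same pattern as other transform commands, the target table can be set explicitly with --project and --table (the file and resource names below are placeholders):

hdxcli transform --project my_project --table my_table map-from --ddl-custom-mapping sql_to_hdx_mapping.json create_table.sql my_sql_transform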

Mapping File

The mapping file contains two sections:

  • simple_datatypes
  • compound_datatypes

An example of this for SQL could be the following:

{  
  "simple_datatypes": {  
    "INT": ["uint8", "uint16", "uint32", "int8", "int16", "int32_optimal"],  
    "BIGINT":["uint64", "int64_optimal"],  
    "STRING": "string",  
    "DATE": "datetime",  
    "REAL": "double",  
    "DOUBLE": "double",  
    "FLOAT": "double",  
    "BOOLEAN": "boolean",  
    "TIMESTAMP": "datetime",  
    "TEXT": "string"  
  },  
  "compound_datatypes": {  
    "ARRAY": "array",  
    "STRUCT": "map",  
    "MAP": "map",  
    "VARCHAR": "string"  
  }  
}

For mappings that have a list as the value, the entry ending in _optimal is the one that is used; the other entries are just a stand-in for comments, since JSON does not allow comments.

Parsing compound_datatypes requires additional code support. This is where the extensible interfaces for DDLs come in.

DDL File as Input

The ddl_file specifies the structure of the source data in its native definition language, such as SQL or Elasticsearch. It should encompass the fields, their associated data types, and any additional constraints. This specification is essential for accurate mapping to Hydrolix transforms.

The following SQL DDL could be an example of this:

CREATE TABLE a_project.with_a_table (  
  account_id STRING,  
  device_id STRING,  
  playback_session_id STRING,  
  user_session_id STRING,  
  user_agent STRING,  
  timestamp TIMESTAMP,  
  start_timestamp BIGINT,  
  another_time TIMESTAMP PRIMARY KEY,  
  end_timestamp BIGINT,  
  video_ranges ARRAY<STRING>,  
  playback_started BOOLEAN,  
  average_peak_bitrate BIGINT,  
  bitrate_changed_count INT,  
  bitrate_change_list ARRAY<BIGINT>,  
  bitrate_change_timestamp_list ARRAY<BIGINT>,  
  start_bitrate BIGINT,  
  fetch_and_render_media_segment_duration BIGINT,  
  fetch_and_render_media_segment_start_timestamp BIGINT,  
  re_buffer_start_timestamp_list ARRAY<BIGINT>,  
  exit_before_ad_start_flag BOOLEAN,  
  ingest_utc_hour INT,  
  ingest_utc_date DATE)  
USING delta  
PARTITIONED BY (ingest_utc_date, ingest_utc_hour)  
LOCATION 's3://...'  
TBLPROPERTIES ('delta.minReaderVersion' = '1',  
              'delta.minWriterVersion' = '2');

User Choices File

After transform map-from performs the type mappings, the result might still need some tweaks that are user choices. There are two ways to provide these choices: interactively (the default), or through a user choices file passed with the --user-choices option. The file is a JSON file containing the choices as key/value pairs.

The user choices are applied as a post-processing step. Some of the options are shared by all DDLs, while others are specific to a single DDL. Examples of user choices are:

  • general: the ingest index for each field when CSV is used
  • Elastic: fields whose cardinality can potentially be more than one, since the output of the algorithm needs to be adjusted to change simple types to an array mapping

There are two kinds of user choices: general ones and DDL-specific ones. The DDL-specific user choices are prefixed with the name of the DDL (the same name you pass to the -s option on the command line); for Elastic it would be elastic, and for SQL, sql.

Each user choice below is listed with its key, an example value, and its purpose:

  • ingest_type (example: 'json'): Type of ingest; valid values are 'json' and 'csv'. Tells the transform the expected data type for ingestion.
  • csv_indexes (example: [["field1", 0], ["field2", 1], …]): Array of arrays with field name and index, telling the transform where ingest happens for which fields. Applies to CSV.
  • csv_delimters (example: ','): A string that separates CSV fields; delimits fields in CSV format.
  • elastic.array_fields (example: ['field1', 'field2', 'some_field.*regex']): Which fields are considered arrays in Elastic mappings. In Elastic, all fields have 0..* cardinality by default; Hydrolix will map all of them to cardinality one except the ones indicated in this user choice, which will be mapped to arrays of that type.
  • compression (example: 'gzip'): Which compression algorithm to use.
  • primary_key (example: 'your_key'): The primary key for the transform.
  • add_ignored_fields_as_string_columns (example: true|false): Whether ignored fields should be added as string columns.
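
As a sketch, a user choices file for a CSV ingest could look like the following (the field names and values are placeholders, using only the keys listed above):

{
  "ingest_type": "csv",
  "csv_indexes": [["account_id", 0], ["device_id", 1], ["timestamp", 2]],
  "csv_delimters": ",",
  "compression": "gzip",
  "primary_key": "timestamp"
}

The file would then be passed to transform map-from with the --user-choices option.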

Ingest

Batch Job

Create a batch job:

hdxcli job batch ingest <job-name> <job-settings>.json

job-name is the name of the job that will be displayed when listing batch jobs. job-settings is the path to the file containing the specifications required to create that ingestion (for more information on the required specifications, see Hydrolix API Reference).

In this case, the project, table, and transform are omitted; hdxcli will use the default transform within the project and table previously configured in the profile with the set command. Otherwise, you can add --project <project-name> --table <table-name> --transform <transform-name>, which allows you to execute the command as follows:

hdxcli job batch --project <project-name> --table <table-name> --transform <transform-name> ingest <job-name> <job-settings>.json

Stream

Create the streaming ingest as follows:

hdxcli stream --project <project-name> --table <table-name> --transform <transform-name> ingest <data-file>

data-file is the path of the data file to be used for the ingest. It can be a .csv file, a .json file, or a compressed file; the transform must be configured to match (format type and compression).
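
For example, assuming a transform configured for gzip-compressed JSON data (all names below are placeholders):

hdxcli stream --project my_project --table my_table --transform my_json_transform ingest events.json.gz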


What’s Next

Review a full reference of HDXCLI commands