
Report Historical Usage Data

Overview

The Report Usage Tool is used to upload historical Usagemeter data to a Hydrolix cluster's stream ingest endpoint. This data can be gathered from any cluster and sent to any other cluster, but the primary use case is to gather data from a customer's cluster and send their usage data to Argus. The time range is configurable, as are the project, table, and transform to which data can be sent. The Report Usage Tool itself must be run on the cluster for which usage data is being reported.

Setup

Prerequisites

  1. You should have already deployed a Tooling pod by following the steps on the tooling pod page.
  2. You should have access to the Hydrolix DevAccess AWS account.
  3. You will need an AWS IAM security credential (Access Key ID and Secret Access Key) attached to a user with the following policies: AdministratorAccess, AmazonS3FullAccess

Instructions

Security and Shell History

This method of determining the URL of the latest build:intake job requires running a shell command that assigns a GitLab personal access token to an environment variable, which risks saving your personal access token in your shell history. Some simple ways to avoid this are, for example:

Bash:

    export HISTCONTROL=ignorespace

Zsh:

    setopt HIST_IGNORE_SPACE

Both allow you to execute the command, preceded by a space, to indicate to your shell not to retain the command in its history.

  1. Create a GitLab Personal Access Token with permission scope: read_api.
  2. In your local shell, execute the following (preceded by a space, if you've applied one of the settings above, so the command isn't saved in your shell history):

     export GITLAB_TOKEN="{your-personal-access-token}"
    
  3. Create a file named get-latest-build-intake-job.sh with the following contents:

    #!/bin/bash
    # Query the GitLab API for the most recent build:intake job in the Turbine project.
    GITLAB_PROJECT_ID="16453416"
    GITLAB_API_URL="https://gitlab.com/api/v4/projects/$GITLAB_PROJECT_ID/jobs"
    LATEST_JOB=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" "$GITLAB_API_URL" | jq -r '[.[] | select(.name == "build:intake")][0].id')

    if [ "$LATEST_JOB" != "null" ] && [ -n "$LATEST_JOB" ]; then
        echo "Last 'build:intake' job: https://gitlab.com/hydrolix/turbine/-/jobs/$LATEST_JOB"
    else
        echo "Could not get the last 'build:intake' job"
    fi
    
  4. Now execute the script with ./get-latest-build-intake-job.sh. You should receive output containing the build job URL, for example:

    Last 'build:intake' job: https://gitlab.com/hydrolix/turbine/-/jobs/9053393302
    

Permissions

When you run ./get-latest-build-intake-job.sh, if the output is similar to

    permission denied: ./get-latest-build-intake-job.sh

then make sure you have read/execute permissions by running:

    chmod +rx get-latest-build-intake-job.sh

The value for GITLAB_PROJECT_ID is the project ID of Turbine, the project that contains the job that builds the artifact we will be using.

If these steps did not work, follow the steps catalogued under Determine the Latest build:intake Job URL (Manual) below.

Determine the Latest build:intake Job URL (Manual)

Construct the URL to the latest binary generated from the build:intake job.

  • Navigate to the Turbine Jobs UI
  • Press cmd+f (Mac) or ctrl+f (Windows/Linux) and search for "build:intake" to locate the latest completed build:intake job, then select it.
  • This should take you to a URL that is similar to https://gitlab.com/hydrolix/turbine/-/jobs/9055185069.

Determine S3 Location of Build Artifact

  • Navigate to the URL you obtained previously and verify the existence of a line similar to UPLOAD_PATH="s3://gitlab-drop/builds/$INTAKE_ID" in the log output.
  • Locate the value of the variable $INTAKE_ID. Using this job output as an example, that value would be turbine-intake-46653f7d.tgz. The full URL of the build artifact containing the Report Usage tool is therefore s3://gitlab-drop/builds/turbine-intake-46653f7d.tgz
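If you prefer to pull this value from the command line, the job log is also available through the GitLab jobs API. A minimal sketch, assuming $GITLAB_TOKEN is still set from earlier and that the expanded artifact name appears in the log; the job ID below is just the example ID from the previous step:

    JOB_ID="9055185069"  # replace with the job ID you located above
    # Fetch the job log and extract the intake artifact filename
    curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
        "https://gitlab.com/api/v4/projects/16453416/jobs/$JOB_ID/trace" \
        | grep -o 'turbine-intake-[a-z0-9]*\.tgz' | head -n 1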

  • Shell into the tooling pod within the cluster whose usage data you want to send. You can do this by connecting to the cluster using K9s, then running the command :pods, navigating to the Tooling pod (for example, tooling-xxxxxxxx-xxxxx), and pressing s. Within the pod, execute the following commands:

  • curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" --output awscliv2.zip

  • unzip awscliv2.zip
  • ./aws/install
  • aws configure
  • Provide the following:
    • AWS Access Key ID
    • AWS Secret Access Key
    • Region: us-east-2
    • Default output format [None]: {press enter for None}
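If you prefer to supply these values non-interactively, the aws configure set subcommand writes the same configuration (the credential values below are placeholders):

    # Placeholder credentials; substitute your own Access Key ID and Secret Access Key
    aws configure set aws_access_key_id "{your-access-key-id}"
    aws configure set aws_secret_access_key "{your-secret-access-key}"
    aws configure set region us-east-2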

AWS Credential: Required Permissions

Remember that you need your Access Key ID and Secret Access Key credentials to be linked to a user with the following attached policies: AdministratorAccess, AmazonS3FullAccess

  • aws s3 cp s3://gitlab-drop/builds/turbine-intake-{build_id}.tgz .

You Can Supply Region as a Flag

You can append --region us-east-2 to the aws s3 cp command to set the region.

  • tar -xvzf turbine-intake-{build_id}.tgz
  • cd bin

Execution

Now that you are in the bin directory of the extracted archive, you can run the Report Usage tool with a command similar to the following.

report-usage --start={YYYY-MM-DD}T{HH:MM:SS}Z --end={YYYY-MM-DD}T{HH:MM:SS}Z \
    --db-host=$CATALOG_DB_HOST --db-pass=$ROOT_DB_PASSWORD \
    --db-name=catalog --db-user=turbine --db-conn-min-num=1 \
    --usagemeter-url=https://{hdx-host}/ingest/event \
    --usagemeter-table={project}.{table} \
    --usagemeter-transform={usagemeter_transform} \
    --usagemeter-request-username=$USAGEMETER_REQUEST_USERNAME \
    --usagemeter-request-password=$USAGEMETER_REQUEST_PASSWORD \
    --usagemeter-source-cluster=$HDX_URL \
    --usagemeter-source-cloud=$USAGEMETER_REPORTING_CLOUD \
    --usagemeter-source-region=$USAGEMETER_REPORTING_REGION

You should at minimum replace values for the following flags:

--start: the start time for the reporting period
--end: the end time for the reporting period
--usagemeter-url: the remote cluster url to send usage data to
--usagemeter-table: the table on the remote cluster in which to store usage data
--usagemeter-transform: the transform on the remote table which will be used to structure incoming usage data
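
For example, a report covering January 2024 might look like the following; the hostname argus.example.hydrolix.live, the table usage.events, and the transform usagemeter_transform are illustrative placeholders, not real endpoints:

    report-usage --start=2024-01-01T00:00:00Z --end=2024-02-01T00:00:00Z \
        --db-host=$CATALOG_DB_HOST --db-pass=$ROOT_DB_PASSWORD \
        --db-name=catalog --db-user=turbine --db-conn-min-num=1 \
        --usagemeter-url=https://argus.example.hydrolix.live/ingest/event \
        --usagemeter-table=usage.events \
        --usagemeter-transform=usagemeter_transform \
        --usagemeter-request-username=$USAGEMETER_REQUEST_USERNAME \
        --usagemeter-request-password=$USAGEMETER_REQUEST_PASSWORD \
        --usagemeter-source-cluster=$HDX_URL \
        --usagemeter-source-cloud=$USAGEMETER_REPORTING_CLOUD \
        --usagemeter-source-region=$USAGEMETER_REPORTING_REGION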

The remaining flags are optional; they default to empty strings, and each can be set via its corresponding environment variable:

--usagemeter-request-username: USAGEMETER_REQUEST_USERNAME
--usagemeter-request-password: USAGEMETER_REQUEST_PASSWORD
--usagemeter-source-cluster: HDX_URL
--usagemeter-source-cloud: USAGEMETER_REPORTING_CLOUD
--usagemeter-source-region: USAGEMETER_REPORTING_REGION

These environment variables are used as follows:

USAGEMETER_REQUEST_USERNAME: Username set on the request sent to the remote cluster
USAGEMETER_REQUEST_PASSWORD: Password set on the request sent to the remote cluster
HDX_URL: Source cluster (for example, https://<cluster>.hydrolix.live) from which to gather usage data.
USAGEMETER_REPORTING_CLOUD: Source cloud value to use in reporting from this cluster
USAGEMETER_REPORTING_REGION: Source region value to use in reporting from this cluster
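
For example, you might export these in the tooling pod before running the tool (all values below are placeholders):

    # Placeholder values; substitute your cluster's details
    export USAGEMETER_REQUEST_USERNAME="{username}"
    export USAGEMETER_REQUEST_PASSWORD="{password}"
    export HDX_URL="https://{cluster}.hydrolix.live"
    export USAGEMETER_REPORTING_CLOUD="aws"
    export USAGEMETER_REPORTING_REGION="us-east-2"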

User Commands

Usage: report-usage --start=TIME --end=TIME --db-host=STRING --db-pass=STRING [flags]

Flags:
  -h, --help                              Show context-sensitive help.
  -d, --dry-run                           Print what would be done, but don't actually do anything
  -s, --start=TIME                        Start time (format: 2006-01-02T15:04:05Z)
  -e, --end=TIME                          End time (format: 2006-01-02T15:04:05Z)
  -c, --config-root=STRING                Config root path ($CONFIG_ROOT_PATH)
      --db-host=STRING                    Catalog DB Host ($CATALOG_DB_HOST)
      --db-pass=STRING                    Catalog DB Password ($CATALOG_DB_PASSWORD)

Included are the following catalog-related options:

      --db-port=STRING                    Catalog DB Port ($CATALOG_DB_PORT)
      --db-user=STRING                    Catalog DB Username ($CATALOG_DB_USER)
      --db-name=STRING                    Catalog DB Database name ($CATALOG_DB_NAME)
      --db-conn-max-num=UINT              Catalog DB Max DB Conns ($CATALOG_DB_CONN_MAX_NUM)
      --db-conn-min-num=UINT              Catalog DB Min DB Conns ($CATALOG_DB_CONN_MIN_NUM)
      --db-conn-max-lifetime=DURATION     Catalog DB Max Lifetime Duration ($CATALOG_DB_CONN_MAX_LIFETIME)
      --db-conn-max-idle-time=DURATION    Catalog DB Max Idle Time Duration ($CATALOG_DB_CONN_MAX_IDLE_TIME)
      --db-conn-check-writable            Catalog DB Check Writable After Connect ($CATALOG_DB_CONN_CHECK_WRITABLE)
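
Because the tool accepts -d/--dry-run, you can preview a run without sending any data. A minimal sketch using the example dates from above:

    # Prints what would be done without actually doing anything
    report-usage --dry-run --start=2024-01-01T00:00:00Z --end=2024-02-01T00:00:00Z \
        --db-host=$CATALOG_DB_HOST --db-pass=$ROOT_DB_PASSWORD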

Example Transform

If you are sending Usagemeter data to a Hydrolix cluster that doesn't already have a transform compatible with Usagemeter-formatted data, you can use or adapt the following transform:

{
    "name": "usagemeter_transform",
    "type": "json",
    "settings": {
        "output_columns": [
            {
                "name": "timestamp",
                "datatype": {
                    "type": "datetime",
                    "format": "2006-01-02T15:04:05Z",
                    "primary": true
                }
            },
            {
                "name": "cluster_hostname",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "cluster_cloud",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "cluster_region",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "billing_reference",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "min_created",
                "datatype": {
                    "type": "datetime",
                    "format": "2006-01-02T15:04:05Z"
                }
            },
            {
                "name": "max_created",
                "datatype": {
                    "type": "datetime",
                    "format": "2006-01-02T15:04:05Z"
                }
            },
            {
                "name": "project_id",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "project_name",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "table_id",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "table_name",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "storage_id",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "storage_name",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "storage_cloud",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "storage_region",
                "datatype": {
                    "type": "string"
                }
            },
            {
                "name": "bytes",
                "datatype": {
                    "type": "int64"
                }
            },
            {
                "name": "rows",
                "datatype": {
                    "type": "int64"
                }
            },
            {
                "name": "partition_bytes",
                "datatype": {
                    "type": "int64"
                }
            }
        ]
    }
}
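
One way to register this transform on the receiving cluster is through the Hydrolix configuration API. A minimal sketch, assuming the transform JSON above is saved as usagemeter_transform.json and that $HDX_API_TOKEN holds a valid API bearer token; the org, project, and table IDs are placeholders, and the exact path may vary with your Hydrolix version:

    # Placeholder IDs and token; substitute your own org, project, and table values
    curl -s -X POST \
        "https://{hdx-host}/config/v1/orgs/{org_id}/projects/{project_id}/tables/{table_id}/transforms/" \
        -H "Authorization: Bearer $HDX_API_TOKEN" \
        -H "Content-Type: application/json" \
        --data @usagemeter_transform.json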