Altering Data

Hydrolix provides a way to Update and Delete specific rows from a table

Hydrolix is an append-only system, however there are occasions where data needs to changed or removed for compliance reasons, for example.

The Alter service API allows you to execute an ALTER TABLE query on a set time range of data. This query will be applied over all partitions in the time range. Allowing both update or delete changes to your data. It is a long running task, and hence performed asynchronously in the background via Alter Peers. Once complete, to make the changes visible to queries, it must be committed.

Creating an Alter Task via the API

  • Create an ALTER TABLE query via API. A job_id is returned in response.
  • Check the status of the alter job via API. The status will be one of the following:
    • ready - the job is starting and a list of partitions to work on is being calculated
    • running - alter-peer(s) are working on a partition
    • failed - see error message for more detail
    • pending - all partitions have been altered. Awaiting either commit or cancel
  • Commit the alter job via API. The status of the job is now done and changes will be visible to all users.

ALTER TABLE syntax

To change or delete specific columnar data or rows Hydrolix provides the ALTER TABLE statement using the API which allows either an UPDATE or DELETE command to change data.

📘

Note

The ALTER TABLE statement syntax is different from many other systems supporting SQL. The statement is intended to signify that unlike similar queries in OLTP databases, the operation being executed is a heavy operation and one that is not designed for frequent, on-going use.

The ALTER TABLE statement is expressed using SQL:

ALTER TABLE project.my_table
UPDATE 'client_ip' = 'secret'
WHERE timestamp > '2021-01-01 00:00:00' AND timestamp < '2022-01-01 00:00:00' AND isNotNull(client_ip)
ALTER TABLE project.my_table
DELETE
WHERE timestamp > '2021-01-01 00:00:00' AND timestamp < '2022-01-01 00:00:00' AND isNotNull(client_ip)

The above looks at all rows where the timestamp is between 2021-01-01 00:00:00 and 2022-01-01 00:00:00 and the client_ip is not null, and update the client_ip column to 'secret' in the table my_table.

📘

Alter function

  • Can only modify an existing column -- you can't add a new column in your data.*
  • Can delete entire row based on your condition.

The WHERE statement acts the same as a normal SQL query -- it can contain any expression that evaluates to a zero or non-zero value (false, or true). The query will then update or delete rows for which the expression takes a non-zero value.

🚧

Alter requires timefilter

Alter requires a time filter on the primary timestamp within the WHERE clause. Queries without this will fail.
The time filter needs to specific and shouldn't use function like now(), today() and yesterday()

We highly recommend testing your SQL statement before making any alterations to your data.
For example, you could issue a COUNT() to verify how many rows would be affected by your query.

❗️

Single ALTER at a time

To avoid contradictory SQL statements only a single ALTER job can be run per table at a time.
We don't support alter on your shard key

ALTER query returns a JSON object:

{
  "created": "2022-01-05T09:43:03.052289Z",
  "description": null,
  "details": {
    "duration_ms": 74,
    "errors": [],
    "job_id": "65724ae0-fc9b-485c-be55-59fb62cb44d5",
    "status_detail": {
      "estimated": true,
      "percent_complete": 0,
      "tasks": {
        "ALTER_LIST": {
          "RUNNING": 1
        }
      }
    }
  },
  "modified": "2022-01-05T09:43:03.158834Z",
  "name": "job_1641375783",
  "org": "e399a2b6-1b34-4f0d-bf03-5c1a2e621739",
  "settings": {
    "max_timestamp": 1641202980,
    "min_timestamp": 1641116582,
    "project_id": "970a78d3-62a7-4a81-b515-2f421537b1c5",
    "reliable_update": true,
    "shard_key": null,
    "sql_stmt": "ALTER TABLE sample.cts\nUPDATE \"data.leaf_cert.issuer.O\" = 'testamaz' \nWHERE timestamp < toString(date_sub(DAY, 2, now())) AND timestamp > toString(date_sub(DAY, 3, now())) AND data.leaf_cert.issuer.O IN ('Amazon')",
    "table_id": "adf173c1-7e24-4f52-873c-5313315f7167"
  },
  "status": "running",
  "type": "alter_table",
  "uuid": "65724ae0-fc9b-485c-be55-59fb62cb44d5"
}

In this object you can see the SQL statement, the status of the job and its UUID. To see updated status for this ALTER job you can use the API.

For example, if the alter job ID is 65724ae0-fc9b-485c-be55-59fb62cb44d5. To get the status you can use the GET request:

https://{hostname}.hydrolix.live/config/v1/orgs/{org_id}/jobs/alter/65724ae0-fc9b-485c-be55-59fb62cb44d5/

This endpoint provides stats like how many partition ALTER is working on, how many are left and the percentage of completion of my job.

📘

Commit/Cancel ALTER pending

After ALTER is done running you need to commit the change to enable reading from the modified partition.
To commit the change you need to call the API /v1/orgs/{org_id}/jobs/alter/{job_id}/commit

If you want to rollback your change you can cancel your alteration (if you didn't commit those).
To cancel the change you need to call the the API /v1/orgs/{org_id}/jobs/alter/{job_id}/cancel

Alter leverages a different host than the query infrastructure, an alter-peer is used to run the alteration process. This separation of infrastructure allows for queries to continue to be executed while data is altered. Alter-peer manage partition one by one, if you want to speed up the alter process you can scale up the number of peers.

For example if you have 1000 partitions and on average alter takes 30s per partition a single peer will take 8h.
If you scale to 50 alter peer the same alteration will take 10min.


Did this page help you?