Alter Data

Hydrolix provides a way to Update and Delete specific rows from a table

Hydrolix is an append-only system, however there are occasions where data needs to be changed or removed for compliance reasons for example.

The Alter service allows you to execute an ALTER TABLE query on a set time range of data. The query can contain either an update or delete statement. Altering data is a long running task, as all relevant data partitions have to be opened and re-written with the query applied. This task is performed asynchronously in the background via Alter Peers. The response to an Alter query request is an Alter Job job_id. The Alter Job API allows you to check on progress status, verify the changes, and finally commit. The newly altered data partitions are not visible to end users until after commit.

Create an Alter Job via the API

  1. POST an ALTER TABLE query via query api. A job_id is returned in response.
  2. Check the status of the alter job via job status api. The status will be one of the following:
    • ready - the job is starting and a list of partitions to work on is being calculated
    • running - alter-peers are working on a partition
    • failed - see error message for more detail
    • pending - all partitions have been altered, awaiting either commit or cancel
  3. Commit the alter job via the job commit API. The status of the job is now done and the altered data is visible to end users .

ALTER TABLE syntax

To update or delete specific columnar data or rows, you must send an ALTER TABLE query statement that includes the target project.table followed by either an UPDATE or DELETE command. A time range WHERE clause must be provided and any number of predicates can also be included.

📘

Note

The ALTER TABLE statement syntax is different from many other systems supporting SQL. The statement is intended to signify that unlike similar queries in OLTP databases, the operation being executed is a heavy operation and one that is not designed for frequent, on-going use.

ALTER TABLE statement with UPDATE example

ALTER TABLE project.my_table
UPDATE 'client_ip' = 'secret'
WHERE timestamp > '2021-01-01 00:00:00' AND timestamp < '2022-01-01 00:00:00' 
AND isNotNull(client_ip)
ALTER TABLE project.my_table
DELETE
WHERE timestamp > '2021-01-01 00:00:00' AND timestamp < '2022-01-01 00:00:00' AND isNotNull(client_ip)

The above looks at all rows where the timestamp is between 2021-01-01 00:00:00 and 2022-01-01 00:00:00 and the client_ip is not null, and update the client_ip column to 'secret' in the table my_table.

ALTER TABLE statement with DELETE example

ALTER TABLE project.my_table
DELETE where timestamp > '2022-02-07 00:00:00' AND timestamp < '2022-02-08 00:00:00'
AND customer_id = 'f4957a96-2dac-11ed-a261-0242ac120002'

This will delete all rows in the specified time range where the customer_id predicate returns true.

📘

Alter function

  • Can only modify an existing column -- you can't add a new column in your data.*
  • Can delete entire row based on your condition.

The WHERE statement acts the same as a normal SQL query -- it can contain any expression that evaluates to a zero or non-zero value (false, or true). The query will then update or delete rows for which the expression takes a non-zero value.

🚧

Alter requires timefilter

Alter requires a time filter on the primary timestamp within the WHERE clause. Queries without this will fail. The time filter needs to be specific and shouldn't use functions like now(), today() and yesterday()

We highly recommend testing your SQL statement before making any alterations to your data.
For example, you could issue a COUNT() to verify how many rows would be affected by your query.

❗️

Single ALTER at a time

To avoid contradictory SQL statements only a single ALTER job can be run per table at a time.
We don't support alter on your shard key

ALTER query response example

{
  "created": "2022-01-05T09:43:03.052289Z",
  "description": null,
  "details": {
    "duration_ms": 74,
    "errors": [],
    "job_id": "65724ae0-fc9b-485c-be55-59fb62cb44d5",
    "status_detail": {
      "estimated": true,
      "percent_complete": 0,
      "tasks": {
        "ALTER_LIST": {
          "RUNNING": 1
        }
      }
    }
  },
  "modified": "2022-01-05T09:43:03.158834Z",
  "name": "job_1641375783",
  "org": "e399a2b6-1b34-4f0d-bf03-5c1a2e621739",
  "settings": {
    "max_timestamp": 1641202980,
    "min_timestamp": 1641116582,
    "project_id": "970a78d3-62a7-4a81-b515-2f421537b1c5",
    "reliable_update": true,
    "shard_key": null,
    "sql_stmt": "ALTER TABLE sample.cts\nUPDATE \"data.leaf_cert.issuer.O\" = 'testamaz' \nWHERE timestamp < toString(date_sub(DAY, 2, now())) AND timestamp > toString(date_sub(DAY, 3, now())) AND data.leaf_cert.issuer.O IN ('Amazon')",
    "table_id": "adf173c1-7e24-4f52-873c-5313315f7167"
  },
  "status": "running",
  "type": "alter_table",
  "uuid": "65724ae0-fc9b-485c-be55-59fb62cb44d5"
}

In this object you can see the SQL statement, the status of the job and its UUID.

Alter Job Status

To see updated status for an ALTER job you can use the API.

https://{hostname}.hydrolix.live/config/v1/orgs/{org_id}/jobs/alter/{job_id}/status

Alter Job Status example - annotated to highlight key information

{
	"name": "job_1660812184",
	"description": null,
	"uuid": "b40576a2-8245-4964-895b-d2b249837f19",
	"created": "2022-08-18T08:43:04.952123Z",
	"modified": "2022-08-18T08:44:06.400328Z",
	"settings": {
        "max_timestamp": 1641202980,
        "min_timestamp": 1641116582,
		"table_id": "260bb873-5d72-422f-8e05-1b5acdd455ab",
        "project_id": "970a78d3-62a7-4a81-b515-2f421537b1c5",
        "reliable_update": true,
        "shard_key": null,
        "sql_stmt": "ALTER TABLE sample.cts\nUPDATE \"data.leaf_cert.issuer.O\" = 'testamaz' \nWHERE timestamp < toString(date_sub(DAY, 2, now())) AND timestamp > toString(date_sub(DAY, 3, now())) AND data.leaf_cert.issuer.O IN ('Amazon')",
		"reliable_update": true
	},
	"status": "running", # <-- alter-peers are processing the request
	"type": "alter_table",
	"org": "0ffa6312-61ba-4620-8d57-96514a7f3859",
	"details": {
		"errors": [],
		"job_id": "b40576a2-8245-4964-895b-d2b249837f19",
		"duration_ms": 53108, # <-- current elapsed processing time
		"status_detail": {
			"tasks": {
				"ALTER": {  # < -- 105 partitions will be impacted. See recommendation running a select query before to get count
					"DONE": 86, # < -- already completed
					"READY": 18, # <-- not yet started
					"RUNNING": 1 # <-- with a scale of 1 alter-peer, runs 1 at a time
				},
				"ALTER_LIST": {
					"DONE": 1 # <--- initial step, list all the partitions for alter-peers work
				}
			},
			"percent_complete": 0.8207547, # <- 82% complete
			"estimated": true
		}
	}
}

📘

Commit/Cancel ALTER pending

After ALTER is done running you need to commit the change to enable reading from the modified partition.
To commit the change you need to call the API /v1/orgs/{org_id}/jobs/alter/{job_id}/commit

If you want to rollback your change you can cancel your alteration (if you didn't commit those).
To cancel the change you need to call the the API /v1/orgs/{org_id}/jobs/alter/{job_id}/cancel

Alter leverages a different host than the query infrastructure, an alter-peer is used to run the alteration process. This separation of infrastructure allows for queries to continue to be executed while data is altered. Alter-peer manage partition one by one, if you want to speed up the alter process you can scale up the number of peers.

For example if you have 1000 partitions and on average alter takes 30s per partition a single peer will take 8h.
If you scale to 50 alter peer the same alteration will take 10min.

Verify Alter Before Commit

Once Alter has ran through all the partition but you didn't commit the change yet, you can verify the data by using a special SQL function.

SELECT client_ip
FROM alter_new('project.my_table','65724ae0-fc9b-485c-be55-59fb62cb44d5')
WHERE timestamp > '2021-01-01 00:00:00' AND timestamp < '2022-01-01 00:00:00' 
AND isNotNull(client_ip)
SELECT client_ip
FROM alter_old('project.my_table','65724ae0-fc9b-485c-be55-59fb62cb44d5')
WHERE timestamp > '2021-01-01 00:00:00' AND timestamp < '2022-01-01 00:00:00' 
AND isNotNull(client_ip)

The alter_new function takes 2 parameters:

  • 'projectname.tablename'
  • 'job_id' from the alter API endpoint

🚧

The alter_new function returns all rows in modified partitions, which may include non-modified rows.

If you'd like to see only the rows that would be modified, make sure to use the same WHERE as you did in your original ALTER TABLE query

You can also use the alter_old function which uses the same parameter, it'll allow to compare the data before and after.

Technically those function only retrieve partitions that are modified by the alter job, any partition not selected by the alter job won't be used by the SQL query.

Commit a Completed Alter Job

If you are satisfied by the result you can then use the Alter API endpoint to commit the change:
POST /v1/orgs/{org_id}/jobs/alter/{job_id}/commit