Alter Data
Hydrolix provides a way to Update and Delete specific rows from a table
Hydrolix is an append-only system, however there are occasions where data needs to be changed or removed for compliance reasons for example.
The Alter service allows you to execute an ALTER TABLE
query on a set time range of data. The query can contain either an update or delete statement. Altering data is a long running task, as all relevant data partitions have to be opened and re-written with the query applied. This task is performed asynchronously in the background via Alter Peers
. The response to an Alter query request is an Alter Job job_id
. The Alter Job API allows you to check on progress status, verify the changes, and finally commit. The newly altered data partitions are not visible to end users until after commit.
Create an Alter Job via the API
- POST an
ALTER TABLE
query via query api. Ajob_id
is returned in response. - Check the status of the alter job via job status api. The status will be one of the following:
ready
- the job is starting and a list of partitions to work on is being calculatedrunning
- alter-peers are working on a partitionfailed
- see error message for more detailpending
- all partitions have been altered, awaiting either commit or cancel
- Commit the alter job via the job commit API. The status of the job is now
done
and the altered data is visible to end users .
ALTER TABLE syntax
To update or delete specific columnar data or rows, you must send an ALTER TABLE
query statement that includes the target project.table
followed by either an UPDATE
or DELETE
command. A time range WHERE
clause must be provided and any number of predicates can also be included.
Note
The
ALTER TABLE
statement syntax is different from many other systems supporting SQL. The statement is intended to signify that unlike similar queries in OLTP databases, the operation being executed is a heavy operation and one that is not designed for frequent, on-going use.
ALTER TABLE
statement with UPDATE
example
ALTER TABLE project.my_table
UPDATE 'client_ip' = 'secret'
WHERE timestamp > '2021-01-01 00:00:00' AND timestamp < '2022-01-01 00:00:00'
AND isNotNull(client_ip)
ALTER TABLE project.my_table
DELETE
WHERE timestamp > '2021-01-01 00:00:00' AND timestamp < '2022-01-01 00:00:00' AND isNotNull(client_ip)
The above looks at all rows where the timestamp
is between 2021-01-01 00:00:00 and 2022-01-01 00:00:00 and the client_ip
is not null, and update the client_ip
column to 'secret'
in the table my_table
.
ALTER TABLE
statement with DELETE
example
ALTER TABLE project.my_table
DELETE where timestamp > '2022-02-07 00:00:00' AND timestamp < '2022-02-08 00:00:00'
AND customer_id = 'f4957a96-2dac-11ed-a261-0242ac120002'
This will delete all rows in the specified time range where the customer_id predicate returns true.
Alter function
- Can only modify an existing column -- you can't add a new column in your data.*
- Can delete entire row based on your condition.
The WHERE statement acts the same as a normal SQL query -- it can contain any expression that evaluates to a zero or non-zero value (false, or true). The query will then update or delete rows for which the expression takes a non-zero value.
Alter requires timefilter
Alter requires a time filter on the primary timestamp within the WHERE clause. Queries without this will fail. The time filter needs to be specific and shouldn't use functions like now(), today() and yesterday()
We highly recommend testing your SQL statement before making any alterations to your data.
For example, you could issue a COUNT()
to verify how many rows would be affected by your query.
Single ALTER at a time
To avoid contradictory SQL statements only a single ALTER job can be run per table at a time.
We don't support alter on your shard key
ALTER query response example
{
"created": "2022-01-05T09:43:03.052289Z",
"description": null,
"details": {
"duration_ms": 74,
"errors": [],
"job_id": "65724ae0-fc9b-485c-be55-59fb62cb44d5",
"status_detail": {
"estimated": true,
"percent_complete": 0,
"tasks": {
"ALTER_LIST": {
"RUNNING": 1
}
}
}
},
"modified": "2022-01-05T09:43:03.158834Z",
"name": "job_1641375783",
"org": "e399a2b6-1b34-4f0d-bf03-5c1a2e621739",
"settings": {
"max_timestamp": 1641202980,
"min_timestamp": 1641116582,
"project_id": "970a78d3-62a7-4a81-b515-2f421537b1c5",
"reliable_update": true,
"shard_key": null,
"sql_stmt": "ALTER TABLE sample.cts\nUPDATE \"data.leaf_cert.issuer.O\" = 'testamaz' \nWHERE timestamp < toString(date_sub(DAY, 2, now())) AND timestamp > toString(date_sub(DAY, 3, now())) AND data.leaf_cert.issuer.O IN ('Amazon')",
"table_id": "adf173c1-7e24-4f52-873c-5313315f7167"
},
"status": "running",
"type": "alter_table",
"uuid": "65724ae0-fc9b-485c-be55-59fb62cb44d5"
}
In this object you can see the SQL statement, the status of the job and its UUID.
Alter Job Status
To see updated status for an ALTER job you can use the API.
https://{hostname}.hydrolix.live/config/v1/orgs/{org_id}/jobs/alter/{job_id}/status
Alter Job Status example - annotated to highlight key information
{
"name": "job_1660812184",
"description": null,
"uuid": "b40576a2-8245-4964-895b-d2b249837f19",
"created": "2022-08-18T08:43:04.952123Z",
"modified": "2022-08-18T08:44:06.400328Z",
"settings": {
"max_timestamp": 1641202980,
"min_timestamp": 1641116582,
"table_id": "260bb873-5d72-422f-8e05-1b5acdd455ab",
"project_id": "970a78d3-62a7-4a81-b515-2f421537b1c5",
"reliable_update": true,
"shard_key": null,
"sql_stmt": "ALTER TABLE sample.cts\nUPDATE \"data.leaf_cert.issuer.O\" = 'testamaz' \nWHERE timestamp < toString(date_sub(DAY, 2, now())) AND timestamp > toString(date_sub(DAY, 3, now())) AND data.leaf_cert.issuer.O IN ('Amazon')",
"reliable_update": true
},
"status": "running", # <-- alter-peers are processing the request
"type": "alter_table",
"org": "0ffa6312-61ba-4620-8d57-96514a7f3859",
"details": {
"errors": [],
"job_id": "b40576a2-8245-4964-895b-d2b249837f19",
"duration_ms": 53108, # <-- current elapsed processing time
"status_detail": {
"tasks": {
"ALTER": { # < -- 105 partitions will be impacted. See recommendation running a select query before to get count
"DONE": 86, # < -- already completed
"READY": 18, # <-- not yet started
"RUNNING": 1 # <-- with a scale of 1 alter-peer, runs 1 at a time
},
"ALTER_LIST": {
"DONE": 1 # <--- initial step, list all the partitions for alter-peers work
}
},
"percent_complete": 0.8207547, # <- 82% complete
"estimated": true
}
}
}
Commit/Cancel ALTER pending
After ALTER is done running you need to commit the change to enable reading from the modified partition.
To commit the change you need to call the API/v1/orgs/{org_id}/jobs/alter/{job_id}/commit
If you want to rollback your change you can cancel your alteration (if you didn't commit those).
To cancel the change you need to call the the API/v1/orgs/{org_id}/jobs/alter/{job_id}/cancel
Alter leverages a different host than the query infrastructure, an alter-peer
is used to run the alteration process. This separation of infrastructure allows for queries to continue to be executed while data is altered. Alter-peer manage partition one by one, if you want to speed up the alter process you can scale up the number of peers.
For example if you have 1000 partitions and on average alter takes 30s per partition a single peer will take 8h.
If you scale to 50 alter peer the same alteration will take 10min.
Verify Alter Before Commit
Once Alter has ran through all the partition but you didn't commit the change yet, you can verify the data by using a special SQL function.
SELECT client_ip
FROM alter_new('project.my_table','65724ae0-fc9b-485c-be55-59fb62cb44d5')
WHERE timestamp > '2021-01-01 00:00:00' AND timestamp < '2022-01-01 00:00:00'
AND isNotNull(client_ip)
SELECT client_ip
FROM alter_old('project.my_table','65724ae0-fc9b-485c-be55-59fb62cb44d5')
WHERE timestamp > '2021-01-01 00:00:00' AND timestamp < '2022-01-01 00:00:00'
AND isNotNull(client_ip)
The alter_new
function takes 2 parameters:
- 'projectname.tablename'
- 'job_id' from the alter API endpoint
The alter_new function returns all rows in modified partitions, which may include non-modified rows.
If you'd like to see only the rows that would be modified, make sure to use the same
WHERE
as you did in your originalALTER TABLE
query
You can also use the alter_old
function which uses the same parameter, it'll allow to compare the data before and after.
Technically those function only retrieve partitions that are modified by the alter job, any partition not selected by the alter job won't be used by the SQL query.
Commit a Completed Alter Job
If you are satisfied by the result you can then use the Alter API endpoint to commit the change:
POST /v1/orgs/{org_id}/jobs/alter/{job_id}/commit
Updated about 2 months ago