Scripting
Hydrolix transforms can execute scripts to manipulate incoming data.
Add the script property to a column's datatype to execute a JavaScript expression for every row of data ingested. The expression's output becomes the stored value for that column. Within the expression, you can access other columns from the same row with the following syntax:
"datatype": { "script": "row['field-name']" }
Scripts run:
- after assigning non-scripted values (including defaults) to the row
- in the order defined in the transform
Therefore, script expressions have access to any non-scripted value in the same row as well as the expression output for any script defined earlier within the transform.
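The ordering rules above can be sketched in plain JavaScript. This is only an illustrative simulation, not Hydrolix's implementation: the buildRow helper, the column list, and the sample input are all hypothetical. It shows that a script can read a non-scripted column defined later in the transform (status) and the output of a script defined earlier (kilobytes).

```javascript
// Hypothetical simulation of the evaluation order described above.
// buildRow and the column definitions are illustrative assumptions,
// not part of the Hydrolix API.
function buildRow(input, columns) {
  const row = {};
  // Pass 1: assign every non-scripted value, falling back to the default.
  for (const col of columns) {
    if (col.script === undefined) {
      row[col.name] = input[col.name] !== undefined ? input[col.name] : col.default;
    }
  }
  // Pass 2: evaluate scripts in definition order, so a later script
  // can read the output of an earlier one.
  for (const col of columns) {
    if (col.script !== undefined) {
      const evaluate = new Function('row', `return (${col.script});`);
      row[col.name] = evaluate(row);
    }
  }
  return row;
}

const columns = [
  { name: 'bytes', default: 0 },
  { name: 'kilobytes', script: "row['bytes'] / 1024" },
  // Reads the output of the earlier kilobytes script.
  { name: 'is_large', script: "row['kilobytes'] > 1" },
  // Reads status, a non-scripted column defined after this one.
  { name: 'is_error', script: "row['status'] >= 500" },
  { name: 'status', default: 200 },
];

const result = buildRow({ bytes: 4096 }, columns);
console.log(result.kilobytes, result.is_large, result.is_error);
```

In this sketch, swapping the order of kilobytes and is_large would make is_large read an undefined value, which is why definition order matters for scripts that depend on each other.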
Performance
Avoid excessive scripting in transforms. Every script executes once for each row Hydrolix ingests, so each scripted column adds per-row work at ingest time.
Create a New Column from an Existing Column
The following example creates a new column named ts_millis from data in the existing column named timestamp:
....
"settings": {
"output_columns": [
{
"name": "timestamp",
"datatype": {
"type": "epoch",
"primary": true,
"format": "s"
}
},
{
"name": "ts_millis",
"datatype": {
"type": "uint64",
"virtual": true,
"script": "new Date(row['timestamp']).getMilliseconds()"
}
}
]
}
....
The virtual property ensures that any incoming field named ts_millis is never used by accident. The column always takes the value derived by the script instead.
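When adapting this script, it helps to keep the standard JavaScript Date semantics in mind. The snippet below is plain JavaScript with a hypothetical sample value. It makes no claim about how Hydrolix presents row['timestamp'] to the script, which this page does not specify.

```javascript
// Plain JavaScript Date behavior; epochSeconds is a hypothetical sample value.
const epochSeconds = 1700000000;

// new Date(number) interprets the number as milliseconds since the Unix epoch,
// so a value in seconds must be multiplied by 1000 first.
const asMillis = new Date(epochSeconds).toISOString();
const asSeconds = new Date(epochSeconds * 1000).toISOString();
console.log(asMillis);  // "1970-01-20T16:13:20.000Z"
console.log(asSeconds); // "2023-11-14T22:13:20.000Z"

// getMilliseconds() returns only the 0-999 millisecond component,
// while getTime() returns the full epoch value in milliseconds.
const d = new Date(epochSeconds * 1000 + 123);
const msComponent = d.getMilliseconds();
const fullMillis = d.getTime();
console.log(msComponent); // 123
console.log(fullMillis);  // 1700000000123
```

If the goal is a full epoch-milliseconds value rather than the millisecond component, getTime() or a multiplication by 1000 is the usual choice in plain JavaScript.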
Create a New Column from Multiple Existing Columns
Logs in W3C extended log file format use two separate, tab-delimited fields for date and time.
Use the script property to create a virtual column that combines the date and time fields (named date and hour in the transform below) into a single timestamp to use as a primary key:
{
"name": "aws_cloudfront_transform",
"type": "csv",
"table": "{{tableid}}",
"settings": {
"is_default": true,
"compression": "gzip",
"output_columns": [
{
"name": "timestamp",
"source": { "from_input_index": 0 },
"datatype": {
"type": "datetime",
"script": "new Date(row['date'] + ' ' + row['hour'])",
"format": "2006-01-02 15:04:05",
"virtual": true,
"primary": true
}
},
{
"name": "date",
"source": { "from_input_index": 1 },
"datatype": {
"type": "string"
}
},
{
"name": "hour",
"source": { "from_input_index": 2 },
"datatype": {
"type": "string"
}
}
]
}
...
As in the previous example, the virtual property ensures that any incoming field named timestamp is never used by accident. The column always takes the value derived by the script instead.
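The string concatenation in the script above can be tried outside Hydrolix in plain JavaScript. The sample row values below are hypothetical. The ISO 8601 variant is an alternative sketch that assumes the log times are in UTC; it is not taken from the transform above.

```javascript
// Hypothetical sample values for the date and hour columns.
const row = { date: '2023-01-15', hour: '10:30:00' };

// The transform's script joins the two strings with a space.
// How new Date() parses this non-ISO form is implementation-defined,
// and engines that accept it typically treat it as local time.
const combined = row['date'] + ' ' + row['hour'];
console.log(combined); // "2023-01-15 10:30:00"

// Alternative sketch: build an ISO 8601 string with an explicit UTC
// designator, which every JavaScript engine parses the same way.
const isoTimestamp = new Date(row['date'] + 'T' + row['hour'] + 'Z');
console.log(isoTimestamp.toISOString()); // "2023-01-15T10:30:00.000Z"
```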