Scripting & Auto Values
Scripting
Hydrolix transforms can execute a script to manipulate incoming data. The `script` property, when present in a column definition, contains a JavaScript expression that executes for every row of data ingested. The expression's output becomes the stored value for that column.

The expression may, as part of its computation, read the values of other fields in the same row. Hydrolix runs each new row's JavaScript expressions in the order the transform defines them, after it has set all of the row's non-scripted values (including defaults). A `script`-defined expression thus has access to any non-scripted value in the same row, as well as the row's value for any scripted column defined earlier within the transform.
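For instance, here is a minimal sketch of that ordering (the column names and expressions are hypothetical, not from the original examples): a second scripted column reads the output of an earlier one.

```json
"output_columns": [
  {
    "name": "bytes",
    "datatype": { "type": "uint64" }
  },
  {
    "name": "kilobytes",
    "datatype": {
      "type": "double",
      "virtual": true,
      "script": "row['bytes'] / 1024"
    }
  },
  {
    "name": "kilobytes_rounded",
    "datatype": {
      "type": "uint64",
      "virtual": true,
      "script": "Math.round(row['kilobytes'])"
    }
  }
]
```

Because `kilobytes` is defined before `kilobytes_rounded`, the second expression can rely on its value; reversing the two column definitions would not work.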
Performance
Be wary of using too much scripting in your transform; the script is executed for every row being ingested.
In the following example, the field named `ts_millis` derives its value from the field named `timestamp`:
```json
"settings": {
  "output_columns": [
    {
      "name": "timestamp",
      "datatype": {
        "type": "epoch",
        "primary": true,
        "format": "s"
      }
    },
    {
      "name": "ts_millis",
      "datatype": {
        "type": "uint64",
        "virtual": true,
        "script": "new Date(row['timestamp']).getMilliseconds()"
      }
    }
  ]
}
```
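As a rough plain-JavaScript illustration of what that expression computes (the sample value below is made up, and exactly what raw value Hydrolix hands the script for an epoch column is not shown here):

```javascript
// Hypothetical raw value for row['timestamp']
const row = { timestamp: 1640995200500 };

// getMilliseconds() returns the milliseconds component (0-999) of the Date
new Date(row['timestamp']).getMilliseconds(); // -> 500
```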
Another example: logs in the W3C extended log file format have two separate fields for date and time, separated by a tab. We can use the `script` property to create a virtual field that combines the date and time fields into a single timestamp, which is then used as the primary key:
```json
{
  "name": "aws_cloudfront_transform",
  "type": "csv",
  "table": "{{tableid}}",
  "settings": {
    "is_default": true,
    "compression": "gzip",
    "output_columns": [
      {
        "name": "timestamp",
        "position": 0,
        "datatype": {
          "type": "datetime",
          "script": "new Date(row['date'] + ' ' + row['hour'])",
          "format": "2006-01-02 15:04:05",
          "virtual": true,
          "primary": true
        }
      },
      {
        "name": "date",
        "position": 0,
        "datatype": {
          "type": "string"
        }
      },
      {
        "name": "hour",
        "position": 1,
        "datatype": {
          "type": "string"
        }
      }
    ]
  }
}
```
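A quick plain-JavaScript sketch of the concatenation that script performs, using made-up field values:

```javascript
// Hypothetical W3C-style row: date and time arrive as separate strings
const row = { date: '2023-05-01', hour: '12:34:56' };

// The script joins them so they parse as a single timestamp
// (non-ISO strings are parsed as local time; details are engine-specific)
new Date(row['date'] + ' ' + row['hour']); // -> 2023-05-01 12:34:56
```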
You can also create a new column that copies an existing one from your data. The following transform creates a new column named `test` that is a copy of the CSV column at input index 3:
```json
{
  "name": "copy_example",
  "type": "csv",
  "table": "{{tableid}}",
  "settings": {
    "is_default": true,
    "compression": "none",
    "output_columns": [
      {
        "name": "test",
        "datatype": {
          "type": "string",
          "index": false,
          "source": { "from_input_index": 3 }
        }
      }
    ]
  }
}
```
And finally, the same example for JSON input: this copies the field `initial_name` into a new column called `test`.
```json
{
  "name": "copy_example",
  "type": "json",
  "table": "{{tableid}}",
  "settings": {
    "is_default": true,
    "compression": "none",
    "output_columns": [
      {
        "name": "test",
        "datatype": {
          "type": "string",
          "index": true,
          "source": {
            "from_input_field": "initial_name"
          }
        }
      }
    ]
  }
}
```
This approach also lets you index the same data multiple times with different types.
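For example, a sketch (the field and column names are hypothetical) that stores the same input field both as a string and as a number:

```json
"output_columns": [
  {
    "name": "status_str",
    "datatype": {
      "type": "string",
      "source": { "from_input_field": "status" }
    }
  },
  {
    "name": "status_num",
    "datatype": {
      "type": "uint32",
      "source": { "from_input_field": "status" }
    }
  }
]
```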
Auto Values
Hydrolix has some automatic values that can be included within a transform.
| Auto Value Name | Description | Stream | Kafka | Batch |
|---|---|---|---|---|
| `current_time` | Current time | Yes | Yes | Yes |
| `index_pool` | Pool name used by the indexer (stream peer, kafka peer, batch peer) | Yes | Yes | Yes |
| `input_filename` | For batch only, the filename indexed. The filename is the full s3 path | No | No | Yes |
| `intake_method` | The intake method (Batch, Stream / Kafka) | Yes | Yes | Yes |
| `intake_pool` | Pool name used by the intake service (stream head, batch head) | Yes | Yes | Yes |
| `kafka_topic` | The Kafka topic used to index the data | No | Yes | No |
| `project_name` | Name of the project where the data will be indexed | Yes | Yes | Yes |
| `project_revision` | Revision number of the project where the data will be indexed | Yes | Yes | Yes |
| `project_uuid` | UUID of the project where the data will be indexed | Yes | Yes | Yes |
| `receive_time` | Time when the data is received | Yes | Yes | Yes |
| `table_name` | Name of the table where the data will be indexed | Yes | Yes | Yes |
| `table_revision` | Revision number of the table where the data will be indexed | Yes | Yes | Yes |
| `table_uuid` | UUID of the table where the data will be indexed | Yes | Yes | Yes |
| `transform_name` | Name of the transform used to index the data | Yes | Yes | Yes |
| `transform_revision` | Revision number of the transform used to index the data | Yes | Yes | Yes |
| `transform_type` | Transform type (csv or json) | Yes | Yes | Yes |
| `transform_uuid` | UUID of the transform used to index the data | Yes | Yes | Yes |
| | Extract part of the filename with batch/autoingest, split on a separator | No | No | Yes |
To use one of these automatic values, add the following to your transform:
```json
{
  "name": "filename",
  "datatype": {
    "type": "string",
    "index": true,
    "source": {
      "from_automatic_value": "input_filename"
    }
  },
  "position": 0
}
```
The source should be `from_automatic_value`, followed by the value name from the table above.
CSV Specific transform
CSV transforms require a `position` for each column, but for automatic values the position isn't actually used. You can even specify a position already taken by your own data: that column won't be overridden; a new column with the specified name is simply added.
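For instance, a sketch (the column names are hypothetical) where an automatic value declares the same position as a real data column without conflict:

```json
"output_columns": [
  {
    "name": "raw_value",
    "position": 0,
    "datatype": { "type": "string" }
  },
  {
    "name": "filename",
    "position": 0,
    "datatype": {
      "type": "string",
      "source": { "from_automatic_value": "input_filename" }
    }
  }
]
```

Both columns are created; `raw_value` still receives the CSV field at position 0.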