Scripting & Auto Values

Scripting

Hydrolix transforms can execute a script to manipulate incoming data.

The script property, when present in a column definition, contains a JavaScript expression that executes for every row of data ingested. The expression's output becomes the stored value for that column.

The expression may, as part of its computation, read the values of other fields in the same row. Hydrolix runs each new row's JavaScript expressions in the order the transform defines them -- after it has set all of the row's non-scripted values (including defaults). A script-defined expression thus has access to any non-scripted value in the same row, as well as the row's value for any scripted column defined earlier within the transform.
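
For instance, here is a minimal sketch of that ordering (the column names status, is_error, and severity are illustrative, and it assumes earlier scripted values are read through row[...] just like non-scripted values): is_error reads the non-scripted status column, and severity reads the earlier scripted is_error column.

"settings": {
   "output_columns": [
        {
            "name": "status",
            "datatype": {
                "type": "uint32"
            }
        },
        {
            "name": "is_error",
            "datatype": {
                "type": "uint8",
                "virtual": true,
                "script": "row['status'] >= 500 ? 1 : 0"
            }
        },
        {
            "name": "severity",
            "datatype": {
                "type": "string",
                "virtual": true,
                "script": "row['is_error'] ? 'high' : 'low'"
            }
        }
    ]
}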

πŸ“˜

Performance

Be wary of using too much scripting in your transform: the script executes for every row being ingested.

In the following example, the field named "ts_millis" derives its value based on data in the field named "timestamp":

"settings": {
   "output_columns": [
        {
            "name": "timestamp",
            "datatype": {
                "type": "epoch",
                "primary": true,
                "format": "s"
            }
        },
        {
            "name": "ts_millis",
            "datatype": {
                "type": "uint64",
                "virtual": true,
                "script": "new Date(row['timestamp']).getMilliseconds()"
            }
        }
    ]
}

Another example: logs in the W3C extended log file format have two separate fields for date and time, separated by a tab.

You can use the script property to create a virtual field that combines the date and time fields into a single timestamp, which is then used as the primary key:

{
  "name": "aws_cloudfront_transform",
  "type": "csv",
  "table": "{{tableid}}",
  "settings": {
      "is_default": true,
      "compression": "gzip",
      "output_columns": [
          {
              "name": "timestamp",
              "position": 0,
              "datatype": {
                  "type": "datetime",
                  "script": "new Date(row['date'] +' '+ row['hour'])",
                  "format": "2006-01-02 15:04:05",
                  "virtual": true,
                  "primary": true
              }
          },
          {
              "name": "date",
              "position": 0,
              "datatype": {
                  "type": "string"
              }
          },
          {
              "name": "hour",
              "position": 1,
              "datatype": {
                  "type": "string"
              }
          }
      ]
  }
}
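
As a rough illustration with hypothetical values: given a tab-separated input row whose date field is 2019-11-14 and whose hour field is 12:30:10, the script evaluates new Date('2019-11-14 12:30:10'), and the virtual timestamp column stores 2019-11-14 12:30:10 as the primary key value.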

You can also create a new column that copies an existing one from your data. The following transform creates a new column named test, populated as a copy of input column 3 of each CSV record:

{
  "name": "copy_example",
  "type": "csv",
  "table": "{{tableid}}",
  "settings": {
      "is_default": true,
      "compression": "none",
      "output_columns": [
          {
              "name": "test",
              "datatype": {
                  "type": "string",
                  "index": false,
                  "source": { "from_input_index": 3 }
              }
          }
      ]
  }
}

Finally, here is the same example for JSON input. This copies the field initial_name into a new column named test.

{
    "name": "copy_example",
    "type": "json",
    "table": "{{tableid}}",
    "settings": {
        "is_default": true,
        "compression": "none",
        "output_columns": [
            {
                "name": "test",
                "datatype": {
                    "type": "string",
                    "index": true,
                    "source": {
                        "from_input_field": "initial_name"
                    }
                }
            }
          ]
        }
    }

This approach also lets you index the same data multiple times with different types.
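
For example, here is a minimal sketch (the input field code and the column names are illustrative) that indexes the same input field both as a string and as a number:

"settings": {
    "output_columns": [
        {
            "name": "code_str",
            "datatype": {
                "type": "string",
                "index": true,
                "source": { "from_input_field": "code" }
            }
        },
        {
            "name": "code_num",
            "datatype": {
                "type": "uint32",
                "index": true,
                "source": { "from_input_field": "code" }
            }
        }
    ]
}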

Auto Values

Hydrolix provides several automatic values that can be included within a transform. They are listed below, along with the ingestion methods that support them:

| Auto Value Name | Description | Stream | Kafka | Batch |
| --- | --- | --- | --- | --- |
| current_time | Current time | Yes | Yes | Yes |
| index_pool | Pool name used by the indexer (stream peer, Kafka peer, batch peer) | Yes | Yes | Yes |
| input_filename | Batch only: the filename indexed, as the full S3 path, e.g. s3://bucket/folder/subfolder/filename.csv | No | No | Yes |
| intake_method | The intake method (batch, stream, or Kafka) | Yes | Yes | Yes |
| intake_pool | Pool name used by the intake service (stream head, batch head) | Yes | Yes | Yes |
| kafka_topic | The Kafka topic used to index the data | Yes | No | No |
| project_name | Name of the project where the data will be indexed | Yes | Yes | Yes |
| project_revision | Revision number of the project where the data will be indexed | Yes | Yes | Yes |
| project_uuid | UUID of the project where the data will be indexed | Yes | Yes | Yes |
| receive_time | Time when the data is received | Yes | Yes | Yes |
| table_name | Name of the table where the data will be indexed | Yes | Yes | Yes |
| table_revision | Revision number of the table where the data will be indexed | Yes | Yes | Yes |
| table_uuid | UUID of the table where the data will be indexed | Yes | Yes | Yes |
| transform_name | Name of the transform used to index the data | Yes | Yes | Yes |
| transform_revision | Revision number of the transform used to index the data | Yes | Yes | Yes |
| transform_type | Transform type (csv or json) | Yes | Yes | Yes |
| transform_uuid | UUID of the transform used to index the data | Yes | Yes | Yes |
| input_filename_segment_[0-12] | Batch/autoingest only: extracts one /-separated segment of the filename. For s3://bucket/folder/subfolder/filename.csv: s3 = segment 0, bucket = segment 2, folder = segment 3, subfolder = segment 4, filename.csv = segment 5 (segment 1 is the empty string between the two slashes) | No | No | Yes |

To use these automatic values, add the following to your transform:

{
    "name": "filename",
    "datatype": {
        "type": "string",
        "index": true,
        "source": {
            "from_automatic_value": "input_filename"
        }
    },
    "position": 0
},

Set source to from_automatic_value with one of the value names from the table above.
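
For example, here is a hedged sketch combining this with input_filename_segment_2 from the table above to store the bucket name of each batch-ingested file (the column name bucket is illustrative):

{
    "name": "bucket",
    "datatype": {
        "type": "string",
        "index": true,
        "source": {
            "from_automatic_value": "input_filename_segment_2"
        }
    },
    "position": 1
},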

πŸ“˜

CSV-specific transforms

CSV transforms require a position for each column, but for automatic values the position is not actually used. You can specify a position already taken by your own data; that data won't be overridden. Hydrolix simply adds a new column with the specified name.
