Elastic Common Schema

Elastic has been working for a while now on standardising event data. This effort is called the Elastic Common Schema, which we’ll refer to as ECS in this post.

The principle is to map events to specific fields and to create a naming convention that everyone can use. Elastic provides plenty of examples on their blog, and ECS currently defines around 750 fields.
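
To make that concrete, here is one small illustration of an event expressed with ECS field names. The field names below are real ECS fields; the values are just made-up sample data.

# A raw web request re-expressed with ECS field names, so any ECS-aware
# tool can query the same keys (source.ip, url.path, ...). Values are fictional.
ecs_event = {
    "@timestamp": "2021-07-01T12:00:00.000Z",
    "event.category": "web",
    "source.ip": "203.0.113.10",
    "url.path": "/index.html",
    "http.response.status_code": 200,
    "user_agent.original": "Mozilla/5.0",
}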

Some applications now use ECS, and on top of that Elastic provides various data shippers, such as the Beats family, that send data using ECS.

Generate a transform from ECS

To simplify onboarding data sources that use ECS, we have been working on turning the common schema into a Hydrolix transform.

The principle is to transpose each Elastic data type into a Hydrolix one. To do that we created a simple Python script which takes ECS and generates the transform.
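
Conceptually, the script reads the ECS fields.csv, looks at each field’s Elastic type and picks the closest Hydrolix datatype. Below is a simplified sketch of that idea; the type correspondence and the CSV column names are assumptions made for illustration, and the real script is the reference.

# Simplified, illustrative sketch of the ECS -> Hydrolix type transposition.
import csv
import io
import urllib.request

ECS_CSV = "https://raw.githubusercontent.com/elastic/ecs/1.10/generated/csv/fields.csv"

# Assumed correspondence between Elastic field types and Hydrolix datatypes.
TYPE_MAP = {
    "keyword": "string",
    "text": "string",
    "ip": "string",
    "date": "datetime",
    "long": "int64",
    "integer": "int32",
    "float": "double",
    "boolean": "boolean",
}

def ecs_columns(url=ECS_CSV):
    """Yield (ecs_field_name, hydrolix_type) pairs from the ECS fields.csv."""
    raw = urllib.request.urlopen(url).read().decode("utf-8")
    for row in csv.DictReader(io.StringIO(raw)):
        # Assumption: the generated CSV exposes 'Field' and 'Type' columns.
        yield row["Field"], TYPE_MAP.get(row["Type"], "string")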

It’s available in our GitLab repository, and its usage is very simple:

./ecs_to_transform.py 
Please enter the raw csv URL for the mapping example if empty we'll use: https://raw.githubusercontent.com/elastic/ecs/1.10/generated/csv/fields.csv

Please enter your tableid generated in our portal: 1234

This will output the full transform to use in your Hydrolix table.

So if you are already using ECS, migrating to Hydrolix will be transparent, as you’ll be able to import your data without changing the schema!

And by using our tool we’ve got you covered: we map every single field that currently exists in ECS!

Ingesting Data using ECS

There are several data shippers that use ECS, and the Beats family is definitely one of them.

To ingest Beats data into Hydrolix there are two different solutions we can use:

  1. Beats agents ship data to Kafka, and Hydrolix pulls from Kafka directly (this is the fastest way to set up if you already have Kafka in place)
  2. Beats agents ship to a Logstash or Fluentd server, which then sends the data to Hydrolix using HTTP streaming

Ingest via Kafka

Elastic has a lot of documentation on how to set up the different Beats to push data to Kafka brokers.
You can check their documentation for Filebeat as an example.

In this example I’ve installed Packetbeat and I’m using the Kafka output:

output.kafka:
  hosts: ["xxxxx:yyy"]
  topic: 'packetbeat'
  required_acks: 1
  max_message_bytes: 1000000
  compression: "gzip"

Here I’m sending data to my broker on a topic called packetbeat, the maximum message size is 1 MB, and messages are compressed with gzip.

I can now configure Hydrolix to pull the data from Kafka and push it into my table using the ECS transform schema.

If you are pushing data using gzip, don’t forget to specify in the Hydrolix transform that the data is compressed; we have some good examples in our documentation.
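
In the generated transform this typically comes down to a single compression entry in the settings. The fragment below shows roughly where it sits; the exact key name and placement are from memory, so treat it as an assumption and check our documentation for the authoritative shape.

# Hypothetical fragment: where the compression hint would live in the transform.
transform_settings_fragment = {
    "settings": {
        "compression": "gzip",  # assumption: tells Hydrolix the Kafka messages are gzip-compressed
        # ... output_columns, format details, etc.
    },
}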

Ingest via Logstash

Hydrolix supports HTTP streaming to ingest data, so we can use the HTTP Output plugin from Logstash to stream data into Hydrolix.

In this part I won’t cover how to set up and scale Logstash to ingest data from Beats; there is plenty of good documentation around that topic.

I’ll focus on the HTTP output plugin to push data into Hydrolix.

Hydrolix expects HTTP bulk requests with two extra headers:

X-HDX-Table: $project.$table
X-HDX-Transform: $transformname
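
Before wiring up Logstash, you can sanity-check the endpoint by hand. Here is a minimal sketch using Python and the requests library; $host, $project, $table and $transformname are placeholders for your own values, and the single test event is made up.

# Minimal sketch: send one ECS-shaped event to the Hydrolix HTTP streaming endpoint.
import json
import requests

url = "https://$host.hydrolix.live/ingest/event"
headers = {
    "Content-Type": "application/json",
    "X-HDX-Table": "$project.$table",        # target project and table
    "X-HDX-Transform": "$transformname",     # the ECS transform generated earlier
}

event = {"@timestamp": "2021-07-01T12:00:00.000Z", "source.ip": "203.0.113.10"}

resp = requests.post(url, headers=headers, data=json.dumps(event))
print(resp.status_code, resp.text)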

So in our Logstash pipeline we need to use the HTTP output plugin and specify those headers:

output {
    http {
         url => "https://$host.hydrolix.live/ingest/event"
         http_method => "post"
         content_type => "application/json"
         headers => {
           "X-HDX-Table" => "$project.$table"
           "X-HDX-Transform" => "$transformname"
         }
         format => "json"
    }
}

That’s it!

If you want to test Hydrolix in parallel by ingesting into Elasticsearch, you can use the pipeline-to-pipeline feature in Logstash to isolate each output and avoid any blocking.
The documentation is available here; this way you can safely send data to both environments in real time!