Kinesis

Hydrolix Projects and Tables can continuously ingest data from one or more AWS Kinesis streams. The cluster can run on AWS or any other cloud provider.

Basic steps⚓︎

Create a Project/Table.
Create a Transform.
Configure Kinesis access.
Configure a checkpoint database. The Hydrolix Kinesis client needs a checkpoint database to track progress.
- Use AWS DynamoDB.
- Use GCP Datastore.
Configure the Hydrolix Kinesis source and adjust the scale.

These instructions assumes that project, table, and transform are already configured.

Configure Kinesis access⚓︎

To access the Kinesis queue, your Hydrolix cluster requires an AWS user and role holding read and write permissions to these services in your account.

Create an AWS user and role⚓︎

This can be done from the AWS IAM Console. See AWS Identity and Access Management (IAM) Documentation.

Make sure to record your AWS Access Key ID and Secret Access Key

Hydrolix requires these to connect to the Kinesis service.

AWS shows this to you only once during key creation.

Assign Kinesis permissions⚓︎

For Kinesis access, ensure the user has permission to perform these actions:

kinesis:GetRecords
kinesis:GetShardIterator
kinesis:DescribeStream
kinesis:ListShards

If you are also using an AWS DynamoDB checkpoint, assign DynamoDB permissions.

Configure a checkpoint database⚓︎

AWS DynamoDB⚓︎

Use these instructions to create an AWS DynamoDB table.

DynamoDB is a good choice if your Hydrolix cluster is running on Amazon Elastic Kubernetes Service (EKS).

AWS DynamoDB prerequisites⚓︎

To load data into Hydrolix from Kinesis, you will need the following in your AWS Account:

A DynamoDB table to store checkpoint information
An AWS user/role with access to DynamoDB and Kinesis
The Kinesis Amazon Resource Name (ARN) and region of your Kinesis stream

Create the DynamoDB table⚓︎

Follow instructions at Setting up DynamoDB to create the DynamoDB table.

These are Hydrolix recommended options:

Option	Value
Table Name	Hydrolix suggests using your client Id with the string `_kinesis_check_point` appended for example, `hdxcli-123456_kinesis_check_point`
Partition Key	`StreamShard`
Settings	Select 'Customize settings'
Table Class	`DynamoDB Standard`
Read/Write capacity settings	`On-demand`

Record your DynamoDB ARN

Make sure to record your new DynamoDB ARN for use in the Kinesis table source checkpointer setting.

Assign DynamoDB permissions⚓︎

Assign DynamoDB permissions to the AWS user with Kinesis access, configured in an earlier step.

Ensure the user also has permission to perform these DynamoDB actions:

dynamodb:PutItem
dynamodb:DescribeTable
dynamodb:GetItem

Record Your Kinesis ARN⚓︎

Retrieve your Kinesis ARN from the AWS console. You will need this later when you configure the Hydrolix platform.

GCP Datastore⚓︎

Use these instructions to create a GCP Datastore table.

GCP Datastore is a good choice if your Hydrolix cluster is running on Google Kubernetes Engine (GKE).

GCP Datastore prerequisites⚓︎

A Datastore to hold checkpoint information
The Google service account used by the Hydrolix cluster must have access to the Datastore
The GCP Datastore table name

Add Datastore Role⚓︎

Edit the Google Cloud service account created and add the role Cloud Datastore Owner.

Dialog for configuring a Google Datastore to store Kinesis checkpoints

Configure the Hydrolix Kinesis Source⚓︎

Create your Kinesis source in Hydrolix using the API endpoint Create Kinesis Source within the API.

A Kinesis table source object requires these properties

Property	Description	Example
`name`	Name of your Kinesis source	`myKinesisSource`
`pool_name`	Name for the pool that will service the source	`myKinesisPool`
`k8s_deployment`	Object describing the cpu/memory and service for the pool (kinesis-peer)	`{"cpu": 1,"memory": "10Gi","service": "kinesis-peer"}`
`type`	The method to retrieve stream data. For Kinesis, always `pull`	`pull`
`subtype`	The type of data source. For Kinesis, always `kinesis`	`kinesis`
`transform`	The name of the transform to use in ingesting data	`myTransform`
`table`	The project and table to import the data into	`myproject.mytable`
`settings`	A mapping containing detailed settings information

Checkpointer settings⚓︎

AWS DynamoDB checkpointerGCP Datastore checkpointer

Setting	Description	Example
`stream_name`	ARN for the Kinesis stream	`{"stream_name": "arn:aws:kinesis:us-east-2:1234567890:stream/test-kinesis"}`
`region`	Region for the Kinesis stream	`{"region": "us-east-2"}`
`checkpointer`	ARN for the DynamoDB for AWS	`{"checkpointer": {"name": "arn:aws:kinesis:us-east-2:1234567890:stream/test-kinesis"}`
`aws_key`	AWS Key used to connect to the Kinesis stream
`aws_secret`	AWS Secret used to connect to the Kinesis stream

Setting	Description	Example
`stream_name`	ARN for the Kinesis stream	`{"stream_name": "arn:aws:kinesis:us-east-2:1234567890:stream/test-kinesis"}`
`region`	Region for the Kinesis stream	`{"region": "us-east-2"}`
`checkpointer`	Datastore table name for GCP	`{"checkpointer": {"name": "https://datastore.googleapis.com/hdx-kinesis"}`

Create a Kinesis source⚓︎

Use the access key and checkpoint database identifier in the Create kinesis source endpoint.

Hydrolix adds the credentials into a Kubernetes secret.

AWS DynamoDB checkpointerGCP Datastore checkpointer

Create Kinesis Table Source Using API Endpoint
POST {{base_url}}/v1/orgs/{ org_id}}/projects/{{project_id}}/tables/{{table_id}}/sources/kinesis/
Authorization: Bearer {{token}}
Content-Type: application/json

{
   "name": "kinesissource",
   "pool_name": "kinesispool",
   "k8s_deployment": {
      "cpu": 1,
      "memory": "10Gi",
      "service": "kinesis-peer"
   },
   "type": "pull",
   "subtype": "kinesis",
   "transform": "{{transform_name}}",
   "table": "{{project_name}}.{{table_name}}",
   "settings": {
      "stream_name": "arn:aws:kinesis:us-east-2:1234567890:stream/test-kinesis",
      "region": "us-east-2",
      "checkpointer": {
         "name": "arn:aws:dynamodb:us-east-2:1234567890:table/test-kinesis"
      },
      "aws_key": "AWS_ACCESS_KEY_ID",
      "aws_secret": "AWS_SECRET_KEY"
   }
}

You can now create a new Kinesis datasource using the GCP Datastore for storing checkpoints. Here's an API call example:

POST {{base_url}}/v1/orgs/{ org_id}}/projects/{{project_id}}/tables/{{table_id}}/sources/kinesis/
Authorization: Bearer {{token}}
Content-Type: application/json

{
  "name": "kinesissource",
  "pool_name": "kinesispool",
  "k8s_deployment": {
      "service": "kinesis-peer",
      "replicas": 1
   },
  "transform": "transform",
  "settings": {
    "stream_name": "arn:aws:kinesis:us-east-2:1234567890:stream/test-kinesis",
    "region": "us-east-2",
    "checkpointer": {
      "name": "https://datastore.googleapis.com/hdx-kinesis"
    },
    "aws_key": "AWS_ACCESS_KEY_ID",
    "aws_secret": "AWS_SECRET_KEY"
  }
}

This will create a new Datastore table named hdx-kinesis, which will store the checkpoint for the Kinesis stream named test-kinesis.

Kinesis Pool Creation Response Example When Using DynamoDB

{
   "name" : "kinesissource",
   "settings" : {
      "aws_creds_key" : "K2447EFAD0B5F4E8C91A13CDD0BD5873D",
      "checkpointer" : {
         "name" : "arn:aws:dynamodb:us-east-2:1234567890:table/test-kinesis"
      },
      "organizer" : {
         "root_path" : "/intake/stream/{{id}}"
      },
      "pool" : {
         "created" : "2023-03-22T06:45:42.401740Z",
         "description" : null,
         "location" : "880ecba8-6895-4892-a429-eb6c6d88de64",
         "modified" : "2023-03-22T06:45:42.401766Z",
         "name" : "kinesispool-db9feb26-2a8c-4735-9b8a-6e6b7f207cbe",
         "settings" : {
            "is_default" : false,
            "is_head" : false,
            "k8s_deployment" : {
               "replicas" : "1",
               "service" : "kinesis-peer"
            }
         },
         "tag" : "pool",
         "type" : "elastic",
         "uuid" : "516b0001-ead9-48ae-b413-3ee03f5e9ff2"
      },
      "region" : "us-east-2",
      "stream_name" : "arn:aws:kinesis:us-east-2:1234567890:stream/test-kinesis"
   },
   "subtype" : "kinesis",
   "table" : "{{table_id}}",
   "transform" : "transform",
   "type" : "pull",
   "url" : "{{base_url}}/v1/orgs/{{ org_id}}/projects/{{project_id}}/tables/{{table_id}}/sources/kinesis/{{kinesis_id}}",
   "uuid" : "22b13359-7897-40d7-8e4d-5bcc0e7ce686"
}

aws_creds_key is the name of the secret in Kubernetes that stores your credentials.

Scale the Kinesis Service⚓︎

The default scale profiles for Hydrolix set the scale of the kinesis-peers to zero, since they're not needed for every Hydrolix cluster. To scale the service, edit hydrolixcluster.yaml:

Edit Hydrolix Cluster Spec Directly with KubectlApply a Modified Hydrolix Cluster Spec with Kubectl

1	`kubectl edit hydrolixcluster`

kubectl get hydrolixclusters <NAMESPACE> -o yaml > hydrolix-cluster.yaml
vim hydrolix-cluster.yaml
kubectl apply -f hydrolix-cluster.yaml

Edit the replica count to adjust resources devoted to Kinesis processing using the full pool name from the API response.

Example Kinesis Pool Requesting One Replica

  owner: admin
  pools:
  - cpu: "1"
    memory: 10Gi
    name: kinesispool-db9feb26-2a8c-4735-9b8a-6e6b7f207cbe
    replicas: "1"
    service: kinesis-peer
    storage: 10Gi
  region: us-central1
  scale:

See Scale your Cluster and Scale Profiles for more information.

Special Considerations⚓︎

Specify an Alternative AWS Role⚓︎

If your Kinesis configuration and your Hydrolix configuration don't share AWS roles, you need a way to specify an alternative AWS role ARN for accessing your Kinesis stream. Create a field named aws_role_arn within your Kinesis configuration settings. Use the full ARN as the value:

   "settings": {
      "stream_name": "arn:aws:kinesis:us-east-2:example:stream/my-example-stream",
      "region": "us-east-2",
      "aws_role_arn": "arn:aws:iam::1776422013:role/example",
      "checkpointer": {
         "name": "arn:aws:dynamodb:us-east-2:1234567890:table/test-kinesis"
      }
   }

You can also transmit credentials in the aws_key and aws_secret fields of your Kinesis configuration settings. Hydrolix stores these values in a Kubernetes secret as a JSON object. You can access this secret using the aws_creds_key value.

   "settings": {
      "stream_name": "arn:aws:kinesis:us-east-2:example:stream/my-example-stream",
      "region": "us-east-2",
      "checkpointer": {
          "name": "arn:aws:dynamodb:us-east-2:1234567890:table/test-kinesis"
      },
      "aws_key": "AWS_ACCESS_KEY_ID",
      "aws_secret": "AWS_SECRET_KEY"
   }

You can also specify the ARN, aws_key, and aws_secret. Simply add all three fields given in the two examples above.

Credential Order

Hydrolix first attempts to connect to your Kinesis stream using the ARN stored in aws_role_arn. If that doesn't work, Hydrolix attempts to connect using the environment variable stored in aws_creds. If neither method works, Hydrolix attempts to connect to the Kinesis stream without authentication.

AWS CloudWatch Ingest using Kinesis⚓︎

AWS CloudWatch uses a special format that passes multiple logs at a time through the logEvents message field. This requires special handling in Hydrolix.

Hydrolix supports a special subtype of transform to handle this log form. In your transform, assign settings.format_details.subtype a value of "cloudwatch":

{
  "name": "my_transform",
  "type": "json",
  "settings": {
    "format_details": {
      "subtype": "cloudwatch",
      "flattening": ...,
      ...
      }
    },
    ...
  }
}

When you use the cloudwatch subtype, each element of the array in the logEvents field becomes a separate ingested row, handled separately by your transform. Hydrolix deserializes the JSON in the "message" field of each logEvents element. This becomes the value of the logEvent field in the new log messages. Consider another example of the CloudWatch format: following logEvents array within a CloudWatch message:

{
  "owner": "012345678901",
  "logGroup": "/aws/apigateway/test-api-gateway-mgt-api",
  "logEvents": [
    {
      "id": "37469574158817812474116640748123909459676719559957151744",
      "message": "{\"apiName\":\"test-api-gateway-mgt-api\",\"authMS\":\"18\",\"caller\":\"-\",\"httpMethod\":\"POST\",\"ip\":\"42.188.213.42\",\"jwtActorClientId\":\"7501d544-6398-4c50-a3fc-b55f53616e54\",\"jwtActorEnvId\":\"d991999a-0af5-43e4-a4dc-851ec7d2ff9f\",\"jwtActorOrgId\":\"5d9fbc79-b254-449d-9cb8-6d1f40904cfb\",\"jwtActorUserId\":\"-\",\"pathEnvId\":\"a221fb90-1804-4a12-9aa4-8cf63b430ada\",\"pathOrgId\":\"-\",\"protocol\":\"HTTP/1.1\",\"proxyRoundTripMS\":\"365\",\"regionTag\":\"us-east-2\",\"requestId\":\"37473422-3316-4622-bef6-d1d1b74734c9\",\"requestPath\":\"/v1/environments/a221fb90-1804-4a12-9aa4-8cf63b430ada/users/554a405f-4847-44c7-9aed-851206c11ec7/roleAssignments\",\"requestTime\":\"30/Mar/2023:16:30:15 +0000\",\"resourcePath\":\"/environments/{environmentid}/users/{userid}/roleAssignments\",\"responseLength\":\"909\",\"roundTripMS\":\"392\",\"status\":\"201\",\"user\":\"-\",\"userAgent\":\"Apache-HttpClient/4.5.14 (Java/11.0.18)\",\"xPingTrueClientIP\":\"42.188.213.42\"}",
      "timestamp": 1680193815400
    },
    {
      "id": "37469574158817812474116640748123909459676719559957151744",
      "message": "{\"apiName\":\"test-api-gateway-mgt-api\",\"authMS\":\"18\",\"caller\":\"-\",\"httpMethod\":\"POST\",\"ip\":\"42.188.213.42\",\"jwtActorClientId\":\"7501d544-6398-4c50-a3fc-b55f53616e54\",\"jwtActorEnvId\":\"d991999a-0af5-43e4-a4dc-851ec7d2ff9f\",\"jwtActorOrgId\":\"5d9fbc79-b254-449d-9cb8-6d1f40904cfb\",\"jwtActorUserId\":\"-\",\"pathEnvId\":\"a221fb90-1804-4a12-9aa4-8cf63b430ada\",\"pathOrgId\":\"-\",\"protocol\":\"HTTP/1.1\",\"proxyRoundTripMS\":\"365\",\"regionTag\":\"us-east-2\",\"requestId\":\"37473422-3316-4622-bef6-d1d1b74734c9\",\"requestPath\":\"/v1/environments/a221fb90-1804-4a12-9aa4-8cf63b430ada/users/554a405f-4847-44c7-9aed-851206c11ec7/roleAssignments\",\"requestTime\":\"30/Mar/2023:16:30:15 +0000\",\"resourcePath\":\"/environments/{environmentid}/users/{userid}/roleAssignments\",\"responseLength\":\"909\",\"roundTripMS\":\"392\",\"status\":\"201\",\"user\":\"-\",\"userAgent\":\"Apache-HttpClient/4.5.14 (Java/11.0.18)\",\"xPingTrueClientIP\":\"42.188.213.42\"}",
      "timestamp": 1680193815285
    }
  ],
  "logStream": "68fe326b9e008e1490dd71496307b647",
  "messageType": "DATA_MESSAGE",
  "subscriptionFilters": [
    "KinesisFilter-NewRelic"
  ]
}

The first index of the logEvents field becomes the following ingested row:

{
  "owner": "012345678901",
  "logGroup": "/aws/apigateway/test-api-gateway-mgt-api",
  "logEvent": {
    "id": "37469574158817812474116640748123909459676719559957151744",
    "message": {
      "apiName": "test-api-gateway-mgt-api",
      "authMS": "18",
      "caller": "-",
      "httpMethod": "POST",
      "ip": "42.188.213.42",
      "jwtActorClientId": "7501d544-6398-4c50-a3fc-b55f53616e54",
      "jwtActorEnvId": "d991999a-0af5-43e4-a4dc-851ec7d2ff9f",
      "jwtActorOrgId": "5d9fbc79-b254-449d-9cb8-6d1f40904cfb",
      "jwtActorUserId": "-",
      "pathEnvId": "a221fb90-1804-4a12-9aa4-8cf63b430ada",
      "pathOrgId": "-",
      "protocol": "HTTP/1.1",
      "proxyRoundTripMS": "365",
      "regionTag": "us-east-2",
      "requestId": "37473422-3316-4622-bef6-d1d1b74734c9",
      "requestPath": "/v1/environments/a221fb90-1804-4a12-9aa4-8cf63b430ada/users/554a405f-4847-44c7-9aed-851206c11ec7/roleAssignments",
      "requestTime": "30/Mar/2023:16:30:15 +0000",
      "resourcePath": "/environments/{environmentid}/users/{userid}/roleAssignments",
      "responseLength": "909",
      "roundTripMS": "392",
      "status": "201",
      "user": "-",
      "userAgent": "Apache-HttpClient/4.5.14 (Java/11.0.18)",
      "xPingTrueClientIP": "42.188.213.42"
    },
    "timestamp": 1680193815400
  },
  "logStream": "68fe326b9e008e1490dd71496307b647",
  "messageType": "DATA_MESSAGE",
  "subscriptionFilters": [
    "KinesisFilter-NewRelic"
  ]
}

Scaling⚓︎

To scale the Kubernetes deployment for the ingest pool used by this source, including the number of replicas, memory, and CPU, see the scaling resource pools page.