GCP Storage Notifications
Use Google Pub/Sub with Hydrolix autoingest to read data from a Google Cloud Storage bucket.
The instructions below describe how to configure Google Pub/Sub to notify a Hydrolix cluster whenever new files are uploaded to a Google Cloud Storage bucket. The cluster automatically reads each file and ingests its data. The Hydrolix cluster can be deployed on the Google Cloud Platform (GCP) or on another cloud.
To create notifications on your bucket path, use the gsutil command. For more information on this utility, see Pub/Sub notifications for Cloud Storage.
Before you begin
| Requirement | Description | Example | Documentation |
|---|---|---|---|
| Hydrolix cluster | A deployed Hydrolix cluster with API and UI access. | https://${MYHOST}.hydrolix.live | Cluster overview |
| Google Cloud project | A GCP project with GCS and Pub/Sub enabled. | my-gcp-project | Create a GCP project |
| GCS bucket | The storage bucket containing data to be ingested. | gs://mybucket | Create a GCS bucket |
Create a Pub/Sub topic and subscription
See Create a Pub/Sub topic for information on creating a notification topic. Use commands like the following to create a Google Pub/Sub topic and subscription:
# -- create topic
gcloud pubsub topics create ${TOPIC_NAME}
# -- create subscription
gcloud pubsub subscriptions create ${SUBSCRIPTION_NAME} --topic=${TOPIC_NAME}
For example:
gcloud pubsub topics create hdx_autoingest_topic
Created topic [projects/my-gcp-project/topics/hdx_autoingest_topic].
gcloud pubsub subscriptions create hdx_autoingest_topic_sub --topic=hdx_autoingest_topic
Subscription projects/my-gcp-project/subscriptions/hdx_autoingest_topic_sub created with topic projects/my-gcp-project/topics/hdx_autoingest_topic.
These two commands create a Pub/Sub topic and subscription using default values. See topic creation and subscription creation for information on overriding the default configuration.
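For example, a sketch that overrides two subscription defaults at creation time (the values shown are illustrative, not recommendations):
# -- create subscription with a longer ack deadline and message retention
gcloud pubsub subscriptions create hdx_autoingest_topic_sub \
    --topic=hdx_autoingest_topic \
    --ack-deadline=60 \
    --message-retention-duration=7d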
Create a notification
Create and configure a new notification to watch a specified path in your storage bucket. The notification system publishes a new message to the Pub/Sub topic when a new object is successfully created in the bucket, or when an existing object is copied, rewritten, or restored.
The following command creates a Pub/Sub notification configuration:
gsutil notification create -f json -e OBJECT_FINALIZE -t ${TOPIC_ID} -p ${PATH} gs://${BUCKET}
For example:
gsutil notification create -f json -e OBJECT_FINALIZE -t projects/my-gcp-project/topics/hdx_autoingest_topic -p mypath/ gs://mybucket
Created notification config projects/_/buckets/mybucket/notificationConfigs/1
The flags in the command are as follows:
- -f json: Specifies JSON as the notification payload format.
- -e OBJECT_FINALIZE: Fires an event only after an object is successfully written to the bucket and becomes visible. See Event Types.
- -t ${TOPIC_ID}: The ID of the Pub/Sub topic to which notifications are sent.
- -p ${PATH}: The bucket path prefix to watch. When objects are written under this path, notifications are generated and sent to the destination topic.
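To confirm which notification configurations exist on the bucket, list them with gsutil:
gsutil notification list gs://mybucket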
Authorization
The Hydrolix cluster needs to be authorized to read both from the notification queue and the storage bucket.
Authorization for a Hydrolix cluster running outside GCP
Use these instructions to authorize autoingest if your Hydrolix cluster is deployed outside GCP.
To allow Hydrolix to automatically ingest data from Google Cloud Storage (GCS) using Pub/Sub notifications, you’ll need to:
- Create and configure a Google service account
- Grant access to the Pub/Sub topic and GCS bucket
- Create and download a service account key
- Import the key as a credential in the Hydrolix cluster
- Add the credential to the autoingest table settings
Create a Google service account
Create a dedicated service account in your Google Cloud project to grant the Hydrolix cluster access to the GCS bucket and Pub/Sub topic.
Follow the steps in Create service accounts to create a service account. Name the account something descriptive such as hydrolix-autoingest.
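For example, a minimal sketch using the gcloud CLI (the account name follows the suggestion above; ${PROJECT_ID} is your GCP project):
# -- create a dedicated service account for autoingest
gcloud iam service-accounts create hydrolix-autoingest \
    --display-name="Hydrolix autoingest" \
    --project=${PROJECT_ID}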
Grant access to the Pub/Sub topic and GCS bucket
Assign the following roles to the service account:
| Resource | Required Role | Purpose |
|---|---|---|
| GCS bucket | roles/storage.objectAdmin | Read data objects for ingestion |
| Pub/Sub topic | roles/pubsub.subscriber | Subscribe to change notifications |
| Cloud monitoring | roles/monitoring.viewer | Required if you use HDX Autoscaler. Without this role, the autoingest_queue_messages metric is always 0; the hdxscaler component relies on this metric and will periodically log an autoingest-related error without it. |
You can grant these permissions using the Google Cloud Console, gcloud CLI, or Terraform. See Grant a single IAM role for instructions.
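For example, a sketch using the gcloud CLI and gsutil, assuming the service account created above and the bucket from the earlier examples:
SA_EMAIL="hydrolix-autoingest@${PROJECT_ID}.iam.gserviceaccount.com"
# -- read access to the data objects in the bucket
gsutil iam ch serviceAccount:${SA_EMAIL}:roles/storage.objectAdmin gs://mybucket
# -- subscriber access for the notification subscription
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/pubsub.subscriber"
# -- monitoring access (only needed with HDX Autoscaler)
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/monitoring.viewer"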
It's possible to use two service accounts for Pub/Sub and storage access
Some deployments may prefer to create two separate service accounts: one with access to the GCS bucket and another with access to the Pub/Sub topic. If you do this, you’ll create two credentials in Hydrolix (one for each).
Create and download a service account key
See Create a service account key to create a new key for the service account. Download the JSON-formatted key file once it has been created. This file contains the credentials Hydrolix will use to authenticate to GCP.
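For example, a sketch with the gcloud CLI (the key file name is illustrative):
# -- create and download a JSON key for the service account
gcloud iam service-accounts keys create hydrolix-autoingest-key.json \
    --iam-account="hydrolix-autoingest@${PROJECT_ID}.iam.gserviceaccount.com"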
Create a credential in Hydrolix
Upload the JSON service account key file to your Hydrolix cluster as a new credential. This is most easily done using the Hydrolix cluster UI.
- In the Hydrolix UI, go to + Add new > Credential.
- Name the credential (for example: autoingest-gcp-access-key).
- Select Cloud Provider Type: gcp_service_account_keys.
- Next to Upload Credential JSON (optional), select Browse files and select the JSON key file.
- Select Create credential to upload the service account JSON to the Hydrolix cluster and create the credential.
Alternatively, you can create the credential via the API using the create credential endpoint. Note that in the example below, the contents of the service account key file are passed as a single JSON-escaped string in the details field:
curl -X POST "https://${MYHOST}.hydrolix.live/api/v1/config/orgs/{org}/projects/{project}/credentials" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"type": "gcp_service_account_keys",
"name": "autoingest-gcp-access-key",
"cloud": "gcp",
"details": "{ "type": "service_account",
"project_id": "my-gcp-project",
"private_key_id": "BOGUS3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0",
"private_key": "-----BEGIN PRIVATE KEY-----\nBOGUSAIBADANBgkqhkiG9w0BAQEFAASC...\n-----END PRIVATE KEY-----\n",
"client_email": "[email protected]",
"client_id": "123456789012345678901",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/hydrolix-autoingest%40my-gcp-project.iam.gserviceaccount.com",
"universe_domain": "googleapis.com" }"
}'
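To confirm the credential exists, you can list credentials with a GET against the same endpoint (assuming the collection endpoint supports GET, as is typical for this API shape):
curl -s "https://${MYHOST}.hydrolix.live/api/v1/config/orgs/{org}/projects/{project}/credentials" \
  -H "Authorization: Bearer $TOKEN"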
Authorization for a Hydrolix cluster running in GCP
Use these instructions to authorize autoingest if your Hydrolix cluster is deployed in GCP. An existing cluster in GCP has an attached service account after following the instructions in Create a service account for the cluster. The following steps expand that service account's permissions to grant the Hydrolix cluster access to the Pub/Sub topic receiving notifications and the external storage bucket from which the cluster reads objects.
Add bucket and Pub/Sub access to the Hydrolix service account
The service account name can be found in the env.sh file used when building the Hydrolix cluster. The basic format is: hdx-${CLIENT_ID}-sa@${PROJECT_ID}.iam.gserviceaccount.com
# -- allow the service account to consume Pub/Sub subscriptions
gcloud projects add-iam-policy-binding ${PROJECT_ID} --member="serviceAccount:${GCP_STORAGE_SA}" --role='roles/pubsub.subscriber'
# -- allow the service account to read objects in the bucket
gsutil iam ch serviceAccount:${GCP_STORAGE_SA}:roles/storage.objectAdmin gs://mybucket
# -- allow the service account to read Cloud Monitoring metrics (required by HDX Autoscaler)
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${GCP_STORAGE_SA}" \
    --role="roles/monitoring.viewer"
See the Google Cloud SDK reference for more information on adding an IAM policy binding to a service account or Predefined roles for more information on the monitoring viewer role.
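To verify the resulting project-level bindings for the service account, a quick check:
# -- list the roles granted to the service account on the project
gcloud projects get-iam-policy ${PROJECT_ID} \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:${GCP_STORAGE_SA}" \
    --format="table(bindings.role)"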
Create or update the destination table configuration
Finally, configure your Hydrolix destination table to enable batch autoingest. If your cluster is running outside GCP, you also need to associate the previously created credentials by setting source_credential_id and bucket_credential_id.
Use the Update Table API to set the following fields:
{
"name": "table_name",
"description": "autoingest example",
"settings": {
"autoingest": [{
"enabled": true,
"source":"pubsub://my-gcp-project/hdx_autoingest_topic_sub",
"pattern": "^gs://mybucket/mypath"
}]
},
"source_credential_id": "autoingest-gcp-access-key",
"bucket_credential_id": "autoingest-gcp-access-key"
}
- settings.autoingest.enabled: Enables automated batch ingestion from GCS into the Hydrolix cluster.
- settings.autoingest.source: See Table autoingest attributes.
- settings.autoingest.pattern: See Table autoingest attributes.
- source_credential_id: The name of the Hydrolix credential used to access the Pub/Sub subscription.
- bucket_credential_id: The name of the Hydrolix credential used to access the GCS bucket.
If you’re using different service accounts to access the Pub/Sub topic and GCS bucket, set source_credential_id and bucket_credential_id to the respective Hydrolix credential IDs.
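For example, a sketch of applying these settings with the Update Table API (the endpoint path here mirrors the credential endpoint used earlier and is an assumption; adjust for your cluster):
curl -X PATCH "https://${MYHOST}.hydrolix.live/api/v1/config/orgs/{org}/projects/{project}/tables/{table}" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "autoingest": [{
        "enabled": true,
        "source": "pubsub://my-gcp-project/hdx_autoingest_topic_sub",
        "pattern": "^gs://mybucket/mypath"
      }]
    },
    "source_credential_id": "autoingest-gcp-access-key",
    "bucket_credential_id": "autoingest-gcp-access-key"
  }'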
Table autoingest attributes
Autoingest is defined in the settings.autoingest object within the table JSON request.
| Element | Purpose |
|---|---|
| enabled | Enables autoingest. Default is false. |
| source | The Pub/Sub subscription containing the storage notifications. The name must be prefixed with pubsub://. For example: pubsub://project-1/mysubscription |
| pattern | The gs event notification regex pattern. Default is an empty string. For example: "^gs://mybucket/mypath/.*/log_.*.gz". See Table pattern. |
| max_rows_per_partition | The maximum row count per partition. Default 33554432. |
| max_minutes_per_partition | The maximum number of minutes of data to hold in a partition. Default 15. |
| max_active_partitions | The maximum number of active partitions. Default 576. |
| input_aggregation | Controls how much data is considered a single unit of work, which ultimately drives the size of the partition. Default 1536000000. |
| dry_run | Default is false. |
See The settings object for more information on autoingest settings.
Table pattern
Provide a tightly scoped regex for settings.autoingest.pattern. The autoingest service may be handling multiple autoingest streams to various tables and will dispatch ingest requests to the first matching table pattern.
Google event notifications contain the full gs:// object path. Therefore, regex patterns using a beginning-of-line anchor should start with ^gs://. Given the following example path:
gs://mybucket/level1/2020/01/app1/pattern_xyz.gz
Possible patterns could be:
- ^.*\\.gz$ (not recommended: too wide a match)
- ^gs://mybucket/level1/2020/\\d{2}/app1/.*.gz
- ^.*/level1/2020/\\d{2}/app1/.*.gz
- ^.*/level1/2020/\\d{2}/.*/pattern_\\w{3}.gz
- ^.*/level1/2020/\\d{2}/.*/pattern_.*.gz
Note that the pattern is submitted in a JSON document, and JSON requires \ characters to be escaped, hence the \\ in the examples above.
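As a quick sanity check, you can test the unescaped form of a pattern against a sample path before embedding it in JSON. A sketch using grep (note that grep -E lacks \d, so the character class is spelled out):
# -- prints the path if the raw (single-backslash) pattern matches
echo "gs://mybucket/level1/2020/01/app1/pattern_xyz.gz" | \
    grep -E '^gs://mybucket/level1/2020/[0-9]{2}/app1/.*\.gz'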
See Ingest file paths for more information on autoingest file paths.