Superset Integration

Your Hydrolix stack includes Superset, an open-source data visualization platform. You can use Superset to query, visualize, and analyze data ingested into Hydrolix.

Accessing your stack's Superset

Your stack's Superset instance runs its web interface at port 9088. To visit it in your web browser, go to https://YOUR-HYDROLIX-HOSTNAME.hydrolix.live:9088.

The first time you visit your stack's Superset, log in with username admin and password admin, for release higher than 2.11.7 the default password is $host-rogue-derby. You should then proceed to update that account's password by selecting Settings > Info from Superset's main menu bar, then clicking the 🔒 Reset Password button. If you wish, you can also create non-admin accounts via Settings > List Users.

Connecting Superset to Hydrolix

Superset has a default connection to Hydrolix created:

2632

This connection is using SQL Alchemy driver and we are connecting via a local NGINX proxy which load balances the request to the head service.
The connection URL to use is:

clickhouse+native://127.0.0.1/sample_project

This will instruct superset to connect locally on the load balancer, and to use the default project sample_project

If you find yourself unable to establish a connection here, please contact Hydrolix support.

Working with your data in Superset

Once Superset recognizes your Hydrolix stack as a datasource, you can use it to power every Superset feature from simple SQL queries to complex visualizations.

To do this, when setting up a new query or dataset within Superset, select your Hydrolix-based database as its datasource:

Selecting a Hydrolix-based datasource

Note that Superset attaches the label clickhouse to the name you chose for your Hydrolix datasource. This references the Clickhouse SQL driver that Superset uses to communicate with Hydrolix's query engine.

Customise your superset

By default Hydrolix is using a standard configuration for superset:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#
# This file is included in the final Docker image and SHOULD be overridden when
# deploying the image to prod. Settings configured here are intended for use in local
# development environments. Also note that superset_config_docker.py is imported
# as a final step as a means to override "defaults" configured here
#
import logging
import os

from cachelib.redis import RedisCache
from celery.schedules import crontab

logger = logging.getLogger()


def get_env_variable(var_name, default=None):
    """Get the environment variable or raise exception."""
    try:
        return os.environ[var_name]
    except KeyError:
        if default is not None:
            return default
        else:
            error_msg = "The environment variable {} was missing, abort...".format(var_name)
            raise EnvironmentError(error_msg)


DATABASE_DIALECT = get_env_variable("DATABASE_DIALECT")
DATABASE_USER = get_env_variable("DATABASE_USER")
DATABASE_PASSWORD = get_env_variable("DATABASE_PASSWORD")
DATABASE_HOST = get_env_variable("DATABASE_HOST")
DATABASE_PORT = get_env_variable("DATABASE_PORT")
DATABASE_DB = get_env_variable("DATABASE_DB")
SECRET_KEY = get_env_variable("SUPERSET_SECRET_KEY")

# The SQLAlchemy connection string.
SQLALCHEMY_DATABASE_URI = "%s://%s:%s@%s:%s/%s" % (
    DATABASE_DIALECT,
    DATABASE_USER,
    DATABASE_PASSWORD,
    DATABASE_HOST,
    DATABASE_PORT,
    DATABASE_DB,
)

REDIS_HOST = get_env_variable("REDIS_HOST")
REDIS_PORT = get_env_variable("REDIS_PORT")
REDIS_CELERY_DB = get_env_variable("REDIS_CELERY_DB", 0)
REDIS_RESULTS_DB = get_env_variable("REDIS_RESULTS_DB", 1)


class CeleryConfig(object):
    BROKER_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_CELERY_DB}"
    CELERY_IMPORTS = ("superset.sql_lab", "superset.tasks")
    CELERY_RESULT_BACKEND = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_RESULTS_DB}"
    CELERYD_LOG_LEVEL = "DEBUG"
    CELERYD_PREFETCH_MULTIPLIER = 1
    CELERY_ACKS_LATE = False
    CELERYBEAT_SCHEDULE = {
        "reports.scheduler": {
            "task": "reports.scheduler",
            "schedule": crontab(minute="*", hour="*"),
        },
        "reports.prune_log": {
            "task": "reports.prune_log",
            "schedule": crontab(minute=10, hour=0),
        },
    }


CELERY_CONFIG = CeleryConfig

FEATURE_FLAGS = {"ALERT_REPORTS": True}
ALERT_REPORTS_NOTIFICATION_DRY_RUN = True
WEBDRIVER_BASEURL = "http://superset:8088/"
# The base URL for the email report hyperlinks.
WEBDRIVER_BASEURL_USER_FRIENDLY = WEBDRIVER_BASEURL

SQLLAB_CTAS_NO_LIMIT = True

RESULTS_BACKEND = RedisCache(host=REDIS_HOST, port=REDIS_PORT, db=REDIS_RESULTS_DB)

CACHE_CONFIG = {
    "CACHE_TYPE": "redis",
    "CACHE_DEFAULT_TIMEOUT": 60 * 60 * 8,  # seconds
    "CACHE_KEY_PREFIX": "superset_results",
    "CACHE_REDIS_URL": f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_RESULTS_DB}",
}

DATA_CACHE_CONFIG = {
    "CACHE_TYPE": "redis",
    "CACHE_DEFAULT_TIMEOUT": 60 * 60 * 8,  # seconds
    "CACHE_KEY_PREFIX": "superset_results",
    "CACHE_REDIS_URL": f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_RESULTS_DB}",
}

ENABLE_PROXY_FIX = True
PROXY_FIX_CONFIG = {"x_for": 1, "x_proto": 1, "x_host": 1, "x_port": 1, "x_prefix": 0}

FEATURE_FLAGS = {"DASHBOARD_NATIVE_FILTERS": True, "ENABLE_TEMPLATE_PROCESSING": True}


try:
    import superset_config_local
    from superset_config_local import *  # noqa

    logger.info(f"Loaded your Docker configuration at " f"[{superset_config_local.__file__}]")
except ImportError:
    logger.info("Using default Docker config...")
except Exception:
    logger.exception("error loading local config")

You can update this configuration file to include specific features which fits your need.

For example you can add the following variable:
MAPBOX_API_KEY = '' with API Key to provide maps into your dashboards.

You may also want to use a different authentication mechanism such as OAuth

You can create your own configuration file and override completely the one provided by Hydrolix, using hdxctl.

An example would be:

hdxctl files set hdxcli-xxxx superset_config.py superset

📘

Don't override the whole config

By default superset allows you to add extra functionality but you don't need to add the full config, you can add only the variable you need to set.

🚧

This doesn't update the cluster

The command only saves the configuration file into S3, to apply this configuration to Grafana you need to update your cluster to restart Grafana and uses the new settings.
hdxctl update hdxcli-abcde hdx-xyz

Setting up OAuth

This is an example setup using Google OAuth:

  • Go to https://console.developers.google.com/apis/credentials
  • Click Create Credentials, then click OAuth Client ID in the drop-down menu
  • Enter the following:
    • Application Type: Web Application
    • Name: Superset
    • Authorized JavaScript Origins: https://$host.hydrolix.live
    • Authorized Redirect URLs: https://$host.hydrolix.live:9088/oauth-authorized/google
  • Click Create
    • Copy the Client ID and Client Secret from the ‘OAuth Client’ modal
  • On your the host running HDXCTL create a new file like the following (superset_config.py):
from flask_appbuilder.security.manager import AUTH_OID, AUTH_REMOTE_USER, AUTH_DB, AUTH_LDAP, AUTH_OAUTH

CSRF_ENABLED = True
AUTH_USER_REGISTRATION = True
AUTH_USER_REGISTRATION_ROLE = 'Admin'
AUTH_TYPE = AUTH_OAUTH
OAUTH_PROVIDERS = [
    {
        'name': 'google',
        'icon': 'fa-google',
        'token_key': 'access_token',
        'remote_app': {
            'client_id' : 'XXXXXXXXXXXXXXX.apps.googleusercontent.com',
            'client_secret' : 'YYYYYYYYYYYYYYY',
            'api_base_url': 'https://www.googleapis.com/oauth2/v2/',
            'client_kwargs': {
                'scope': 'email profile'
            },
            'request_token_url': None,
            'access_token_url': 'https://accounts.google.com/o/oauth2/token',
            'authorize_url': 'https://accounts.google.com/o/oauth2/auth',
        }
    }
]

This will instruct superset to allow new signup users using the OAuth ID created in google cloud console, and the new users will be registered as Admin, you can change the role to something else by modifying the configuration.

  • Save your superset configuration file:
hdxctl files set hdxcli-xxxx superset_config.py superset
  • Update your cluster to take into account the new configuration file:
hdxctl update hdxcli-xxxx hdx-xyz

Changing default timeout

By default superset is configured with a timeout of 60s for charts and dashboards, but sometime you may requesting more data and this timeout might be too short.

You can change the default timeout value by changing the configuration file and include the following variable:

SUPERSET_WEBSERVER_TIMEOUT = 300

Here I'm changing the default timeout from 60s to 300s.
Once you have modified your configuration file you need to set it, like in the example above

After setting your configuration you also need to update the worker timeout, you can do that by using hdxctl

hdxctl update --superset-timeout 300 hdxcli-xxxx hdx-xyz

Using Query Pools with User Identification

Hydrolix allow you to customise query pool and add some comment in the query by providing custom settings in the query.

In this example we change the default query pool for users who have the role soc and use a dedicated query pool. We also add automatically the name of user generating the query (wether it's on SQL Labs or in dashboards).

import re
def SQL_QUERY_MUTATOR(sql, username, security_manager, database):
    ## Enable query modification only for SELECT statement
    select_search = re.search(r'^SELECT', sql, flags=re.I)
    if select_search:
        ## Get current user information
        user = security_manager.find_user(username=username)
        ## Get user current roles
        user_roles = security_manager.get_user_roles(user)
        ## Check if user has the role named SOC and assigned pool SOC
        for user_role in user_roles:
            if user_role.name == "soc":
                pool = "soc"
            else:
                ## otherwise default query pool
                pool = "query-peer0"
        ## Check if the query already contains SETTINGS instruction
        settings_search = re.search(r'.*(SETTINGS\s+hdx_.*)\s+LIMIT', sql, flags=re.I|re.M)
        to_ret = ""
        ## If query has settings, extract settings information and query
        ## If query has settings adding a comma to separate with our own
        ## extra settings that will be added
        if settings_search:
            settings = settings_search.group(1)
            query = re.sub(settings, '', sql, flags=re.I|re.M)
            to_ret += f"{query} {settings}, "
        else:
             to_ret += f"{sql} SETTINGS "
        ## Add user information settings and query pool to query
        to_ret += f"hdx_query_admin_comment='User: {user}', hdx_query_pool_name='{pool}'"
        return to_ret
    else:
        return sql

By providing user information in the comment settings it allows you to monitor active queries and the person running the query. This is part of our monitoring query dashboard:

3184

Getting more help

If you need more help with having your Superset and Hydrolix installations working together, or you'd just like to learn more about this integration, please contact Hydrolix support.