Druid Monitoring

Praneeth Kumar Reddy Ballarapu
Sep 3, 2023


Monitoring Druid query execution, ingestion, and coordination is essential for production clusters that power user-facing dashboards. These dashboards/widgets are expected to load in a few seconds with near-real-time data, so keeping an eye on the health of the Druid cluster is a must for production setups.

How can I emit the monitoring metrics?

We can configure Druid to emit metrics that are essential for monitoring query execution, ingestion, coordination, and so on.

To learn more about all the available metrics, refer to the official documentation.
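Once an emitter is configured (see the emitter configuration below), each Druid process periodically pushes batches of JSON events to the configured sink. As a rough illustration (the values and the exact set of dimensions here are made up; real events carry dimensions specific to each metric), an emitted query metric looks something like this:

[
  {
    "feed": "metrics",
    "timestamp": "2023-09-03T10:15:30.000Z",
    "service": "druid/broker",
    "host": "broker-1:8082",
    "metric": "query/time",
    "value": 743,
    "dataSource": "wikipedia"
  }
]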

Monitors Configuration:

Druid collects metrics through monitors, and each process should run the monitors relevant to it. Add this config in common.runtime.properties:

druid.monitoring.monitors=["org.apache.druid.client.cache.CacheMonitor", "org.apache.druid.java.util.metrics.JvmMonitor", "org.apache.druid.java.util.metrics.CpuAcctDeltaMonitor", "org.apache.druid.java.util.metrics.JvmThreadsMonitor", "org.apache.druid.server.metrics.EventReceiverFirehoseMonitor", "org.apache.druid.server.metrics.TaskCountStatsMonitor"]

Most production Druid clusters run as a clustered deployment, with the master (Coordinator, Overlord), data (MiddleManager, Historical), and query (Router, Broker) processes running separately on different VMs.

So we need to configure the druid.monitoring.monitors property according to the process we are running.

The process-level monitors supported by each Druid process are listed in the official documentation.

Example:

Master — (Coordinator-Overlord)

druid.monitoring.monitors=["org.apache.druid.client.cache.CacheMonitor", "org.apache.druid.java.util.metrics.JvmMonitor", "org.apache.druid.java.util.metrics.CpuAcctDeltaMonitor", "org.apache.druid.java.util.metrics.JvmThreadsMonitor", "org.apache.druid.server.metrics.EventReceiverFirehoseMonitor", "org.apache.druid.server.metrics.TaskCountStatsMonitor"]

Data — (Historical, MiddleManager)

druid.monitoring.monitors=["org.apache.druid.client.cache.CacheMonitor", "org.apache.druid.java.util.metrics.JvmMonitor", "org.apache.druid.java.util.metrics.CpuAcctDeltaMonitor", "org.apache.druid.java.util.metrics.JvmThreadsMonitor", "org.apache.druid.server.metrics.EventReceiverFirehoseMonitor"]

Query — (Broker, Router)

druid.monitoring.monitors=["org.apache.druid.client.cache.CacheMonitor", "org.apache.druid.java.util.metrics.JvmMonitor", "org.apache.druid.java.util.metrics.CpuAcctDeltaMonitor", "org.apache.druid.java.util.metrics.JvmThreadsMonitor", "org.apache.druid.server.metrics.EventReceiverFirehoseMonitor", "org.apache.druid.server.metrics.QueryCountStatsMonitor"]

Emitter Configuration:

We need to add/edit the following config in the common.runtime.properties file to enable the emitter that pushes metrics to an HTTP endpoint:

druid.emitter=http
druid.emitter.http.recipientBaseUrl=http://{{druid_exporter_host}}:{{druid_exporter_port}}/druid
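Optionally, the HTTP emitter's buffering can be tuned. Per the Druid documentation, the properties below default to the values shown, so only change them if you need faster or less frequent flushes:

druid.emitter.http.flushMillis=60000
druid.emitter.http.flushCount=500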

Restart the Druid processes after adding the emitter and monitoring config above.
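How you restart depends on how the cluster is managed. For example, if the Druid processes run as systemd services, it would look something like the following (the unit names are placeholders for whatever your installation uses):

# master node
systemctl restart druid-coordinator-overlord
# data node
systemctl restart druid-historical druid-middlemanager
# query node
systemctl restart druid-broker druid-router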

Druid Exporter:

To collect these events and expose them in Prometheus format, we can use the open-source Druid exporter from Opstree: https://github.com/opstree/druid-exporter

Installation:

We can deploy the druid-exporter in one of the following ways:

  1. releases — https://github.com/opstree/druid-exporter/releases
  2. Kubernetes — https://github.com/opstree/druid-exporter/tree/master/manifests
  3. docker-compose — https://github.com/opstree/druid-exporter/tree/master/compose

I will go with the release-based deployment, running the druid-exporter on one of the master nodes.

wget https://github.com/opstree/druid-exporter/releases/download/v0.11/druid-exporter-v0.11-linux-amd64.tar.gz
tar -xvzf druid-exporter-v0.11-linux-amd64.tar.gz

Let’s create a systemd unit file to run the druid-exporter:

[Unit]
Description=Druid Exporter
Documentation=https://github.com/opstree/druid-exporter
Requires=network.target
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt
User=root
Group=root
ExecStart=/PATH_TO_DOWNLOADED_FOLDER/druid-exporter -p 8020 -d DRUID_COORDINATOR_OR_ROUTER_URL --druid.user="" --druid.password="" --metrics-cleanup-ttl=15 --no-histogram

[Install]
WantedBy=default.target

Available options and flags — https://github.com/opstree/druid-exporter#available-options-or-flags
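Assuming the unit file above is saved as /etc/systemd/system/druid-exporter.service (the path and file name are up to you), reload systemd so it picks up the new unit:

systemctl daemon-reload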

Enable the druid-exporter service

systemctl enable druid-exporter.service

Start the druid-exporter

systemctl start druid-exporter

Verify the installation by checking the metrics endpoint

curl http://druid_exporter_host:8020/metrics
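At this point Prometheus still needs to be told to scrape the exporter. A minimal scrape job might look like the sketch below (the job name and target are placeholders; port 8020 matches the -p flag in the systemd unit above):

scrape_configs:
  - job_name: "druid-exporter"
    scrape_interval: 30s
    static_configs:
      - targets: ["druid_exporter_host:8020"]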

Grafana dashboard:

Cool, we have emitted, collected, and exported the monitoring metrics to Prometheus. Now it’s time to visualize and create alerts (if needed).

The dashboard includes panels for the Druid overview, broker query time and bytes, the Historical overview, JVM usage, ingestion counts, and ingestion lag.

I have added more panels and visualizations to the existing dashboard provided by Opstree. Download the Grafana dashboard JSON from here.
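For alerting, a simple Prometheus rule can be layered on top of the exporter's metrics. The sketch below assumes the exporter exposes a druid_health_status gauge (1 = healthy); confirm the exact metric names against your own /metrics output before relying on it:

groups:
  - name: druid
    rules:
      - alert: DruidUnhealthy
        # druid_health_status is assumed from the exporter's output; verify the name in your setup
        expr: druid_health_status == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Druid health check failing on {{ $labels.instance }}"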

Corrections/suggestions are welcome. Thanks for reading.

References:

https://github.com/opstree/druid-exporter

https://druid.apache.org/docs/latest/configuration/#enabling-metrics

https://druid.apache.org/docs/latest/operations/metrics
