Metrics

You can configure Druid to emit metrics that are essential for monitoring query execution, ingestion, coordination, and so on.

All Druid metrics share a common set of fields:

  • timestamp: the time the metric was created
  • metric: the name of the metric
  • service: the service name that emitted the metric
  • host: the host name that emitted the metric
  • value: some numeric value associated with the metric

Metrics may have additional dimensions beyond those listed above.
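As an illustration of the field layout described above, here is a minimal Python sketch that splits a metric event into the common fields and its additional dimensions. It assumes events arrive as flat JSON-like dictionaries; the `MetricEvent` class, the `parse_metric_event` helper, and the sample values are hypothetical and not part of Druid.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# The five fields that the text above lists as common to all metrics.
COMMON_FIELDS = ("timestamp", "metric", "service", "host", "value")


@dataclass
class MetricEvent:
    """Hypothetical container for one emitted metric event."""
    timestamp: str
    metric: str
    service: str
    host: str
    value: float
    dimensions: Dict[str, Any] = field(default_factory=dict)


def parse_metric_event(raw: Dict[str, Any]) -> MetricEvent:
    """Split a flat event dict into common fields and extra dimensions."""
    missing = [name for name in COMMON_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"metric event is missing common fields: {missing}")
    # Anything that is not a common field is treated as a dimension.
    dimensions = {k: v for k, v in raw.items() if k not in COMMON_FIELDS}
    return MetricEvent(
        timestamp=raw["timestamp"],
        metric=raw["metric"],
        service=raw["service"],
        host=raw["host"],
        value=float(raw["value"]),
        dimensions=dimensions,
    )


# Made-up sample event, for illustration only.
sample = {
    "timestamp": "2024-01-01T00:00:00.000Z",
    "metric": "query/time",
    "service": "druid/broker",
    "host": "broker-1:8082",
    "value": 42,
    "dataSource": "wikipedia",
    "type": "timeseries",
}
event = parse_metric_event(sample)
```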

Most metric values reset each emission period, as specified in druid.monitoring.emissionPeriod.
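Because such values restart every period, a consumer that wants a running total has to sum the per-period values itself. The following Python sketch shows one way to do that, assuming events are plain dictionaries with the common fields; the `accumulate_totals` helper and the sample events are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, str]  # (service, host, metric)


def accumulate_totals(events: Iterable[dict]) -> Dict[Key, float]:
    """Sum per-period values into running totals per service, host, and metric."""
    totals: Dict[Key, float] = defaultdict(float)
    for event in events:
        key = (event["service"], event["host"], event["metric"])
        totals[key] += event["value"]
    return dict(totals)


# Two consecutive emission periods of the same metric (made-up values).
periods = [
    {"service": "druid/broker", "host": "broker-1", "metric": "query/count", "value": 5},
    {"service": "druid/broker", "host": "broker-1", "metric": "query/count", "value": 7},
]
totals = accumulate_totals(periods)
```

This only makes sense for counter-style metrics; gauge-style values such as queue sizes should not be summed across periods.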

Query metrics

Router

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| query/time | Milliseconds taken to complete a query. | Native Query: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. | < 1s |

Broker

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| query/time | Milliseconds taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | < 1s |
| query/bytes | The total number of bytes returned to the requesting client in the query response from the Broker. Other services report the total bytes for their portion of the query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | |
| query/node/time | Milliseconds taken to query individual historical/realtime processes. | id, status, server | < 1s |
| query/node/bytes | Number of bytes returned from querying individual historical/realtime processes. | id, status, server | |
| query/node/ttfb | Time to first byte. Milliseconds elapsed until the Broker starts receiving the response from individual historical/realtime processes. | id, status, server | < 1s |
| query/node/backpressure | Milliseconds that the channel to this process has spent suspended due to backpressure. | id, status, server | |
| query/count | Number of total queries. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/success/count | Number of queries successfully processed. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/failed/count | Number of failed queries. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/interrupted/count | Number of queries interrupted due to cancellation. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/timeout/count | Number of timed out queries. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| mergeBuffer/pendingRequests | Number of requests waiting to acquire a batch of buffers from the merge buffer pool. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/segments/count | Number of segments that will be touched by the query. This metric is not enabled by default; see the QueryMetrics Interface for how to enable it. The Broker plans how to distribute the query to realtime tasks and Historicals based on a snapshot of the segment distribution state. If segments move after this snapshot is created, some Historicals and realtime tasks can report those segments as missing to the Broker, which then resends the query to the servers that serve those segments after the move. In this case, those segments can be counted more than once in this metric. | | Varies |
| query/priority | Assigned lane and priority, emitted only if a laning strategy is enabled. Refer to Laning strategies. | lane, dataSource, type | 0 |
| sqlQuery/time | Milliseconds taken to complete a SQL query. | id, nativeQueryIds, dataSource, remoteAddress, success, engine | < 1s |
| sqlQuery/planningTimeMs | Milliseconds taken to plan a SQL query into a native query. | id, nativeQueryIds, dataSource, remoteAddress, success, engine | |
| sqlQuery/bytes | Number of bytes returned in the SQL query response. | id, nativeQueryIds, dataSource, remoteAddress, success, engine | |
| serverview/init/time | Time taken to initialize the Broker server view. Useful to detect if Brokers are taking too long to start. | | Depends on the number of segments. |
| metadatacache/init/time | Time taken to initialize the Broker segment metadata cache. Useful to detect if Brokers are taking too long to start. | | Depends on the number of segments. |
| metadatacache/refresh/count | Number of segments to refresh in the Broker segment metadata cache. | dataSource | |
| metadatacache/refresh/time | Time taken to refresh segments in the Broker segment metadata cache. | dataSource | |
| metadatacache/schemaPoll/count | Number of Coordinator polls to fetch datasource schema. | | |
| metadatacache/schemaPoll/failed | Number of failed Coordinator polls to fetch datasource schema. | | |
| metadatacache/schemaPoll/time | Time taken for Coordinator polls to fetch datasource schema. | | |
| serverview/sync/healthy | Sync status of the Broker with a segment-loading server such as a Historical or Peon. Emitted only when HTTP-based server view is enabled. This metric can be used in conjunction with serverview/sync/unstableTime to debug slow startup of Brokers. | server, tier | 1 for fully synced servers, 0 otherwise |
| serverview/sync/unstableTime | Time in milliseconds for which the Broker has been failing to sync with a segment-loading server. Emitted only when HTTP-based server view is enabled. | server, tier | Not emitted for synced servers. |
| subquery/rows | Number of rows materialized by the subquery's results. | id, subqueryId | Varies |
| subquery/bytes | Number of bytes materialized by the subquery's results. This metric is only emitted if the query uses byte-based subquery guardrails. | id, subqueryId | Varies |
| subquery/rowLimit/count | Number of subqueries whose results are materialized as rows (Java objects on heap). This metric is only available if the SubqueryCountStatsMonitor module is included. | | |
| subquery/byteLimit/count | Number of subqueries whose results are materialized as frames (Druid's internal byte representation of rows). This metric is only available if the SubqueryCountStatsMonitor module is included. | | |
| subquery/fallback/count | Number of subqueries that cannot be materialized as frames. This metric is only available if the SubqueryCountStatsMonitor module is included. | | |
| subquery/fallback/insufficientType/count | Number of subqueries that cannot be materialized as frames due to insufficient type information in the row signature. This metric is only available if the SubqueryCountStatsMonitor module is included. | | |
| subquery/fallback/unknownReason/count | Number of subqueries that cannot be materialized as frames due to other reasons. This metric is only available if the SubqueryCountStatsMonitor module is included. | | |
| query/rowLimit/exceeded/count | Number of queries whose inlined subquery results exceeded the given row limit. This metric is only available if the SubqueryCountStatsMonitor module is included. | | |
| query/byteLimit/exceeded/count | Number of queries whose inlined subquery results exceeded the given byte limit. This metric is only available if the SubqueryCountStatsMonitor module is included. | | |

Historical

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| query/time | Milliseconds taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | < 1s |
| query/segment/time | Milliseconds taken to query individual segment. Includes time to page in the segment from disk. | id, status, segment, vectorized | several hundred milliseconds |
| query/wait/time | Milliseconds spent waiting for a segment to be scanned. | id, segment | < several hundred milliseconds |
| segment/scan/pending | Number of segments in queue waiting to be scanned. | | Close to 0 |
| segment/scan/active | Number of segments currently scanned. This metric also indicates how many threads from druid.processing.numThreads are currently being used. | | Close to druid.processing.numThreads |
| query/segmentAndCache/time | Milliseconds taken to query individual segment or hit the cache (if it is enabled on the Historical process). | id, segment | several hundred milliseconds |
| query/cpu/time | Microseconds of CPU time taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | Varies |
| query/count | Total number of queries. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/success/count | Number of queries successfully processed. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/failed/count | Number of failed queries. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/interrupted/count | Number of queries interrupted due to cancellation. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/timeout/count | Number of timed out queries. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| mergeBuffer/pendingRequests | Number of requests waiting to acquire a batch of buffers from the merge buffer pool. This metric is only available if the QueryCountStatsMonitor module is included. | | |

Real-time

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| query/time | Milliseconds taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | < 1s |
| query/wait/time | Milliseconds spent waiting for a segment to be scanned. | id, segment | several hundred milliseconds |
| segment/scan/pending | Number of segments in queue waiting to be scanned. | | Close to 0 |
| segment/scan/active | Number of segments currently scanned. This metric also indicates how many threads from druid.processing.numThreads are currently being used. | | Close to druid.processing.numThreads |
| query/cpu/time | Microseconds of CPU time taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | Varies |
| query/count | Number of total queries. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/success/count | Number of queries successfully processed. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/failed/count | Number of failed queries. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/interrupted/count | Number of queries interrupted due to cancellation. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| query/timeout/count | Number of timed out queries. This metric is only available if the QueryCountStatsMonitor module is included. | | |
| mergeBuffer/pendingRequests | Number of requests waiting to acquire a batch of buffers from the merge buffer pool. This metric is only available if the QueryCountStatsMonitor module is included. | | |

Jetty

| Metric | Description | Normal value |
| --- | --- | --- |
| jetty/numOpenConnections | Number of open Jetty connections. | Not much higher than the number of Jetty threads. |
| jetty/threadPool/total | Number of total workable threads allocated. | Should equal threadPoolNumIdleThreads + threadPoolNumBusyThreads. |
| jetty/threadPool/idle | Number of idle threads. | Less than or equal to threadPoolNumTotalThreads. A non-zero number means there is less work to do than the configured capacity. |
| jetty/threadPool/busy | Number of busy threads that have work to do from the worker queue. | Less than or equal to threadPoolNumTotalThreads. |
| jetty/threadPool/isLowOnThreads | A rough indicator of whether the number of total workable threads allocated is enough to handle the work in the work queue. | 0 |
| jetty/threadPool/min | Minimum number of threads allocatable. | druid.server.http.numThreads plus a small fixed number of threads allocated for Jetty acceptors and selectors. |
| jetty/threadPool/max | Maximum number of threads allocatable. | druid.server.http.numThreads plus a small fixed number of threads allocated for Jetty acceptors and selectors. |
| jetty/threadPool/queueSize | Size of the worker queue. | Not much higher than druid.server.http.queueSize. |
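The normal values in the Jetty table can be turned into a simple sanity check. The Python sketch below assumes the latest value of each Jetty metric has been collected into a dictionary keyed by metric name; the `check_jetty_thread_pool` helper is hypothetical. Values sampled at slightly different moments may not add up exactly, so treat a mismatch as a hint rather than proof of a problem.

```python
from typing import Dict, List


def check_jetty_thread_pool(values: Dict[str, float]) -> List[str]:
    """Compare Jetty thread pool metrics against the normal values in the table."""
    problems = []
    total = values["jetty/threadPool/total"]
    idle = values["jetty/threadPool/idle"]
    busy = values["jetty/threadPool/busy"]
    # The total is expected to equal idle plus busy threads.
    if total != idle + busy:
        problems.append(f"total ({total}) != idle ({idle}) + busy ({busy})")
    # isLowOnThreads is normally 0.
    if values.get("jetty/threadPool/isLowOnThreads", 0) != 0:
        problems.append("thread pool reports it is low on threads")
    return problems


healthy = {
    "jetty/threadPool/total": 40,
    "jetty/threadPool/idle": 30,
    "jetty/threadPool/busy": 10,
    "jetty/threadPool/isLowOnThreads": 0,
}
starved = {
    "jetty/threadPool/total": 40,
    "jetty/threadPool/idle": 0,
    "jetty/threadPool/busy": 39,
    "jetty/threadPool/isLowOnThreads": 1,
}
```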

Cache

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| query/cache/delta/* | Cache metrics since the last emission. | | N/A |
| query/cache/total/* | Total cache metrics. | | N/A |
| */numEntries | Number of cache entries. | | Varies |
| */sizeBytes | Size in bytes of cache entries. | | Varies |
| */hits | Number of cache hits. | | Varies |
| */misses | Number of cache misses. | | Varies |
| */evictions | Number of cache evictions. | | Varies |
| */hitRate | Cache hit rate. | | ~40% |
| */averageByte | Average cache entry byte size. | | Varies |
| */timeouts | Number of cache timeouts. | | 0 |
| */errors | Number of cache errors. | | 0 |
| */put/ok | Number of new cache entries successfully cached. | | Varies, but more than zero |
| */put/error | Number of new cache entries that could not be cached due to errors. | | Varies, but more than zero |
| */put/oversized | Number of potential new cache entries that were skipped due to being too large (based on druid.{broker,historical,realtime}.cache.maxEntrySize properties). | | Varies |
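To relate the hit and miss counters to the hit rate, the following Python sketch computes a rate from the two counts. It assumes the rate is hits divided by total lookups (hits plus misses), which the table does not state explicitly; the `cache_hit_rate` helper is hypothetical.

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Return hits / (hits + misses), or 0.0 when there were no lookups."""
    lookups = hits + misses
    if lookups == 0:
        return 0.0
    return hits / lookups


# Example: 40 hits and 60 misses in one emission period.
rate = cache_hit_rate(40, 60)
```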

Memcached only metrics

Memcached client metrics are reported as follows. These metrics come directly from the client rather than from the cache retrieval layer.

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| query/cache/memcached/total | Cache metrics unique to memcached (only if druid.cache.type=memcached) as their actual values. | Variable | N/A |
| query/cache/memcached/delta | Cache metrics unique to memcached (only if druid.cache.type=memcached) as their delta from the prior event emission. | Variable | N/A |

SQL Metrics

If SQL is enabled, the Broker will emit the following metrics for SQL.

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| sqlQuery/time | Milliseconds taken to complete a SQL query. | id, nativeQueryIds, dataSource, remoteAddress, success | < 1s |
| sqlQuery/planningTimeMs | Milliseconds taken to plan a SQL query into a native query. | id, nativeQueryIds, dataSource, remoteAddress, success | |
| sqlQuery/bytes | Number of bytes returned in the SQL query response. | id, nativeQueryIds, dataSource, remoteAddress, success | |

Ingestion metrics

General native ingestion metrics

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| ingest/count | Count of 1 every time an ingestion job runs (includes compaction jobs). Aggregate using dimensions. | dataSource, taskId, taskType, groupId, taskIngestionMode, tags | Always 1. |
| ingest/segments/count | Count of final segments created by job (includes tombstones). | dataSource, taskId, taskType, groupId, taskIngestionMode, tags | At least 1. |
| ingest/tombstones/count | Count of tombstones created by job. | dataSource, taskId, taskType, groupId, taskIngestionMode, tags | Zero or more for replace. Always zero for non-replace tasks (always zero for legacy replace, see below). |

The taskIngestionMode dimension includes the following modes:

  • APPEND: a native ingestion job appending to existing segments
  • REPLACE_LEGACY: the original replace before tombstones
  • REPLACE: a native ingestion job replacing existing segments using tombstones

The mode is decided using the values of the isAppendToExisting and isDropExisting flags in the task's IOConfig as follows:

| isAppendToExisting | isDropExisting | Mode |
| --- | --- | --- |
| true | false | APPEND |
| true | true | Invalid combination, exception thrown. |
| false | false | REPLACE_LEGACY. The default for JSON-based batch ingestion. |
| false | true | REPLACE |
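The decision table above can be expressed as a small function. This is a Python sketch of the mapping only, not Druid's implementation; the `task_ingestion_mode` helper and the choice of `ValueError` for the invalid combination are assumptions made for illustration.

```python
def task_ingestion_mode(is_append_to_existing: bool, is_drop_existing: bool) -> str:
    """Map the two IOConfig flags to the taskIngestionMode value from the table."""
    if is_append_to_existing and is_drop_existing:
        # The table marks this combination as invalid.
        raise ValueError("isAppendToExisting and isDropExisting cannot both be true")
    if is_append_to_existing:
        return "APPEND"
    if is_drop_existing:
        return "REPLACE"
    # Both flags false: the default for JSON-based batch ingestion.
    return "REPLACE_LEGACY"
```

For example, `task_ingestion_mode(False, True)` yields `"REPLACE"`, the mode in which tombstones can be created.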

The tags dimension is reported only for metrics emitted from ingestion tasks whose ingestion spec specifies the tags field in its context field. tags is expected to be a map of string to object.

Ingestion metrics for Kafka

These metrics apply to the Kafka indexing service.

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| ingest/kafka/lag | Total lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource, stream, tags | Greater than 0, should not be a very high number. |
| ingest/kafka/maxLag | Max lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource, stream, tags | Greater than 0, should not be a very high number. |
| ingest/kafka/avgLag | Average lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource, stream, tags | Greater than 0, should not be a very high number. |
| ingest/kafka/partitionLag | Partition-wise lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers. Minimum emission period for this metric is a minute. | dataSource, stream, partition, tags | Greater than 0, should not be a very high number. |
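The relationship between the partition-wise lag and the three aggregate lag metrics can be sketched as follows. This Python example assumes the aggregates are the sum, maximum, and arithmetic mean of the per-partition lags, as their descriptions suggest; the `summarize_lag` helper and the sample lags are hypothetical.

```python
from typing import Dict


def summarize_lag(partition_lag: Dict[int, int]) -> Dict[str, float]:
    """Derive total, max, and average lag from per-partition lag values."""
    if not partition_lag:
        return {"lag": 0, "maxLag": 0, "avgLag": 0.0}
    lags = list(partition_lag.values())
    return {
        "lag": sum(lags),                # corresponds to ingest/kafka/lag
        "maxLag": max(lags),             # corresponds to ingest/kafka/maxLag
        "avgLag": sum(lags) / len(lags), # corresponds to ingest/kafka/avgLag
    }


# Made-up lags for three partitions.
summary = summarize_lag({0: 10, 1: 30, 2: 20})
```

A large gap between maxLag and avgLag would point at a single slow partition rather than uniformly slow ingestion.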

Ingestion metrics for Kinesis

These metrics apply to the Kinesis indexing service.

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| ingest/kinesis/lag/time | Total lag time in milliseconds between the current message sequence number consumed by the Kinesis indexing tasks and latest sequence number in Kinesis across all shards. Minimum emission period for this metric is a minute. | dataSource, stream, tags | Greater than 0, up to max Kinesis retention period in milliseconds. |
| ingest/kinesis/maxLag/time | Max lag time in milliseconds between the current message sequence number consumed by the Kinesis indexing tasks and latest sequence number in Kinesis across all shards. Minimum emission period for this metric is a minute. | dataSource, stream, tags | Greater than 0, up to max Kinesis retention period in milliseconds. |
| ingest/kinesis/avgLag/time | Average lag time in milliseconds between the current message sequence number consumed by the Kinesis indexing tasks and latest sequence number in Kinesis across all shards. Minimum emission period for this metric is a minute. | dataSource, stream, tags | Greater than 0, up to max Kinesis retention period in milliseconds. |
| ingest/kinesis/partitionLag/time | Partition-wise lag time in milliseconds between the current message sequence number consumed by the Kinesis indexing tasks and latest sequence number in Kinesis. Minimum emission period for this metric is a minute. | dataSource, stream, partition, tags | Greater than 0, up to max Kinesis retention period in milliseconds. |

Compaction metrics

Compaction tasks emit the following metrics.

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| compact/segmentAnalyzer/fetchAndProcessMillis | Time taken to fetch and process segments to infer the schema for the compaction task to run. | dataSource, taskId, taskType, groupId, tags | Varies. A high value indicates compaction tasks will speed up from explicitly setting the data schema. |

Other ingestion metrics

Streaming ingestion tasks and certain types of batch ingestion emit the following metrics. These metrics are deltas for each emission period.

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| ingest/events/processed | Number of events processed per emission period. | dataSource, taskId, taskType, groupId, tags | Equal to the number of events per emission period. |
| ingest/events/processedWithError | Number of events processed with some partial errors per emission period. Events processed with partial errors are counted towards both this metric and ingest/events/processed. | dataSource, taskId, taskType, groupId, tags | 0 |
| ingest/events/unparseable | Number of events rejected because the events are unparseable. | dataSource, taskId, taskType, groupId, tags | 0 |
| ingest/events/thrownAway | Number of events rejected because they are null, or filtered by transformSpec, or outside one of lateMessageRejectionPeriod, earlyMessageRejectionPeriod, or windowPeriod. | dataSource, taskId, taskType, groupId, tags | 0 |
| ingest/events/duplicate | Number of events rejected because the events are duplicated. | dataSource, taskId, taskType, groupId, tags | 0 |
| ingest/input/bytes | Number of bytes read from input sources, after decompression but prior to parsing. This covers all data read, including data that does not end up being fully processed and ingested. For example, this includes data that ends up being rejected for being unparseable or filtered out. | dataSource, taskId, taskType, groupId, tags | Depends on the amount of data read. |
| ingest/rows/output | Number of Druid rows persisted. | dataSource, taskId, taskType, groupId | Your number of events with rollup. |
| ingest/persists/count | Number of times persist occurred. | dataSource, taskId, taskType, groupId, tags | Depends on the configuration. |
| ingest/persists/time | Milliseconds spent doing intermediate persist. | dataSource, taskId, taskType, groupId, tags | Depends on the configuration. Generally a few minutes at most. |
| ingest/persists/cpu | CPU time in nanoseconds spent on doing intermediate persist. | dataSource, taskId, taskType, groupId, tags | Depends on the configuration. Generally a few minutes at most. |
| ingest/persists/backPressure | Milliseconds spent creating persist tasks and blocking waiting for them to finish. | dataSource, taskId, taskType, groupId, tags | 0 or very low |
| ingest/persists/failed | Number of persists that failed. | dataSource, taskId, taskType, groupId, tags | 0 |
| ingest/handoff/failed | Number of handoffs that failed. | dataSource, taskId, taskType, groupId, tags | 0 |
| ingest/merge/time | Milliseconds spent merging intermediate segments. | dataSource, taskId, taskType, groupId, tags | Depends on the configuration. Generally a few minutes at most. |
| ingest/merge/cpu | CPU time in nanoseconds spent on merging intermediate segments. | dataSource, taskId, taskType, groupId, tags | Depends on the configuration. Generally a few minutes at most. |
| ingest/handoff/count | Number of handoffs that happened. | dataSource, taskId, taskType, groupId, tags | Varies. Generally greater than 0 once every segment granular period if the cluster is operating normally. |
| ingest/sink/count | Number of sinks not handed off. | dataSource, taskId, taskType, groupId, tags | 1~3 |
| ingest/events/messageGap | Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. If the value is increasing but lag is low, Druid may not be receiving new data. This metric is reset as new tasks spawn. | dataSource, taskId, taskType, groupId, tags | Greater than 0, depends on the time carried in event. |
| ingest/notices/queueSize | Number of pending notices to be processed by the coordinator. | dataSource, tags | Typically 0 and occasionally in lower single digits. Should not be a very high number. |
| ingest/notices/time | Milliseconds taken to process a notice by the supervisor. | dataSource, tags | < 1s |
| ingest/pause/time | Milliseconds spent by a task in a paused state without ingesting. | dataSource, taskId, tags | < 10 seconds |
| ingest/handoff/time | Total number of milliseconds taken to hand off a set of segments. | dataSource, taskId, taskType, groupId, tags | Depends on the coordinator cycle time. |
| task/autoScaler/requiredCount | Count of required tasks based on the calculations of the lagBased auto scaler. | dataSource, stream, scalingSkipReason | Depends on auto scaler config. |

If the JVM does not support CPU time measurement for the current thread, ingest/merge/cpu and ingest/persists/cpu will be 0.

Indexing service

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| task/run/time | Milliseconds taken to run a task. | dataSource, taskId, taskType, groupId, taskStatus, tags | Varies |
| task/pending/time | Milliseconds a task spent waiting to run. | dataSource, taskId, taskType, groupId, tags | Varies |
| task/action/log/time | Milliseconds taken to log a task action to the audit log. | dataSource, taskId, taskType, groupId, taskActionType, tags | < 1000 (subsecond) |
| task/action/run/time | Milliseconds taken to execute a task action. | dataSource, taskId, taskType, groupId, taskActionType, tags | Varies from subsecond to a few seconds, based on action type. |
| task/action/success/count | Number of task actions that were executed successfully during the emission period. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskId, taskType, groupId, taskActionType, tags | Varies |
| task/action/failed/count | Number of task actions that failed during the emission period. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskId, taskType, groupId, taskActionType, tags | Varies |
| task/action/batch/queueTime | Milliseconds spent by a batch of task actions in queue. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskActionType, interval | Varies based on the batchAllocationWaitTime and number of batches in queue. |
| task/action/batch/runTime | Milliseconds taken to execute a batch of task actions. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskActionType, interval | Varies from subsecond to a few seconds, based on action type and batch size. |
| task/action/batch/size | Number of task actions in a batch that was executed during the emission period. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskActionType, interval | Varies based on number of concurrent task actions. |
| task/action/batch/attempts | Number of execution attempts for a single batch of task actions. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskActionType, interval | 1 if there are no failures or retries. |
| task/segmentAvailability/wait/time | Milliseconds a batch indexing task waited for newly created segments to become available for querying. | dataSource, taskType, groupId, taskId, segmentAvailabilityConfirmed, tags | Varies |
| segment/added/bytes | Size in bytes of new segments created. | dataSource, taskId, taskType, groupId, interval, tags | Varies |
| segment/moved/bytes | Size in bytes of segments moved/archived via the Move Task. | dataSource, taskId, taskType, groupId, interval, tags | Varies |
| segment/nuked/bytes | Size in bytes of segments deleted via the Kill Task. | dataSource, taskId, taskType, groupId, interval, tags | Varies |
| task/success/count | Number of successful tasks per emission period. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource | Varies |
| task/failed/count | Number of failed tasks per emission period. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource | Varies |
| task/running/count | Number of current running tasks. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource | Varies |
| task/pending/count | Number of current pending tasks. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource | Varies |
| task/waiting/count | Number of current waiting tasks. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource | Varies |
| taskSlot/total/count | Number of total task slots per emission period. This metric is only available if the TaskSlotCountStatsMonitor module is included. | category | Varies |
| taskSlot/idle/count | Number of idle task slots per emission period. This metric is only available if the TaskSlotCountStatsMonitor module is included. | category | Varies |
| taskSlot/used/count | Number of busy task slots per emission period. This metric is only available if the TaskSlotCountStatsMonitor module is included. | category | Varies |
| taskSlot/lazy/count | Number of total task slots in lazy marked Middle Managers and Indexers per emission period. This metric is only available if the TaskSlotCountStatsMonitor module is included. | category | Varies |
| taskSlot/blacklisted/count | Number of total task slots in blacklisted Middle Managers and Indexers per emission period. This metric is only available if the TaskSlotCountStatsMonitor module is included. | category | Varies |
| worker/task/failed/count | Number of failed tasks run on the reporting worker per emission period. This metric is only available if the WorkerTaskCountStatsMonitor module is included, and is only supported for Middle Manager nodes. | category, workerVersion | Varies |
| worker/task/success/count | Number of successful tasks run on the reporting worker per emission period. This metric is only available if the WorkerTaskCountStatsMonitor module is included, and is only supported for Middle Manager nodes. | category, workerVersion | Varies |
| worker/taskSlot/idle/count | Number of idle task slots on the reporting worker per emission period. This metric is only available if the WorkerTaskCountStatsMonitor module is included, and is only supported for Middle Manager nodes. | category, workerVersion | Varies |
| worker/taskSlot/total/count | Number of total task slots on the reporting worker per emission period. This metric is only available if the WorkerTaskCountStatsMonitor module is included. | category, workerVersion | Varies |
| worker/taskSlot/used/count | Number of busy task slots on the reporting worker per emission period. This metric is only available if the WorkerTaskCountStatsMonitor module is included. | category, workerVersion | Varies |
| worker/task/assigned/count | Number of tasks assigned to an indexer per emission period. This metric is only available if the WorkerTaskCountStatsMonitor module is included. | dataSource | Varies |
| worker/task/completed/count | Number of tasks completed by an indexer per emission period. This metric is only available if the WorkerTaskCountStatsMonitor module is included. | dataSource | Varies |
| worker/task/failed/count | Number of tasks that failed on an indexer during the emission period. This metric is only available if the WorkerTaskCountStatsMonitor module is included. | dataSource | Varies |
| worker/task/success/count | Number of tasks that succeeded on an indexer during the emission period. This metric is only available if the WorkerTaskCountStatsMonitor module is included. | dataSource | Varies |
| worker/task/running/count | Number of tasks running on an indexer per emission period. This metric is only available if the WorkerTaskCountStatsMonitor module is included. | dataSource | Varies |

Shuffle metrics (Native parallel task)

The shuffle metrics can be enabled by adding org.apache.druid.indexing.worker.shuffle.ShuffleMonitor in druid.monitoring.monitors. See Enabling metrics for more details.

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| ingest/shuffle/bytes | Number of bytes shuffled per emission period. | supervisorTaskId | Varies |
| ingest/shuffle/requests | Number of shuffle requests per emission period. | supervisorTaskId | Varies |
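As a sketch of what enabling the shuffle monitor might look like, the Python snippet below renders a druid.monitoring.monitors line that includes the ShuffleMonitor class named in this section. It assumes the property value is written as a JSON array of class names; the `monitors_property` helper is hypothetical, and any monitors already configured should be kept in the list.

```python
import json
from typing import Iterable

SHUFFLE_MONITOR = "org.apache.druid.indexing.worker.shuffle.ShuffleMonitor"


def monitors_property(monitors: Iterable[str]) -> str:
    """Render a druid.monitoring.monitors line, dropping duplicate class names."""
    unique = list(dict.fromkeys(monitors))  # preserves the original order
    # Assumption: the property value is a JSON array of monitor class names.
    return "druid.monitoring.monitors=" + json.dumps(unique)


line = monitors_property([SHUFFLE_MONITOR])
```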

Coordination

These metrics are emitted by the Druid Coordinator in every run of the corresponding coordinator duty.

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| segment/assigned/count | Number of segments assigned to be loaded in the cluster. | dataSource, tier | Varies |
| segment/moved/count | Number of segments moved in the cluster. | dataSource, tier | Varies |
| segment/dropped/count | Number of segments chosen to be dropped from the cluster due to being over-replicated. | dataSource, tier | Varies |
| segment/deleted/count | Number of segments marked as unused due to drop rules. | dataSource | Varies |
| segment/unneeded/count | Number of segments dropped due to being marked as unused. | dataSource, tier | Varies |
| segment/assignSkipped/count | Number of segments that could not be assigned to any server for loading. This can occur due to replication throttling, no available disk space, or a full load queue. | dataSource, tier, description | Varies |
| segment/moveSkipped/count | Number of segments that were chosen for balancing but could not be moved. This can occur when segments are already optimally placed. | dataSource, tier, description | Varies |
| segment/dropSkipped/count | Number of segments that could not be dropped from any server. | dataSource, tier, description | Varies |
| segment/loadQueue/size | Size in bytes of segments to load. | server | Varies |
| segment/loadQueue/count | Number of segments to load. | server | Varies |
| segment/loading/rateKbps | Current rate of segment loading on a server in kbps (1000 bits per second). The rate is calculated as a moving average over the last 10 GiB or more of successful segment loads on that server. | server | Varies |
| segment/dropQueue/count | Number of segments to drop. | server | Varies |
| segment/loadQueue/assigned | Number of segments assigned for load or drop to the load queue of a server. | dataSource, server | Varies |
| segment/loadQueue/success | Number of segment assignments that completed successfully. | dataSource, server | Varies |
| segment/loadQueue/failed | Number of segment assignments that failed to complete. | dataSource, server | 0 |
| segment/loadQueue/cancelled | Number of segment assignments that were canceled before completion. | dataSource, server | Varies |
| segment/size | Total size of used segments in a data source. Emitted only for data sources to which at least one used segment belongs. | dataSource | Varies |
| segment/count | Number of used segments belonging to a data source. Emitted only for data sources to which at least one used segment belongs. | dataSource | < max |
| segment/overShadowed/count | Number of segments marked as unused due to being overshadowed. | | Varies |
| segment/unneededEternityTombstone/count | Number of non-overshadowed eternity tombstones marked as unused. | | Varies |
| segment/unavailable/count | Number of unique segments left to load until all used segments are available for queries. | dataSource | 0 |
| segment/underReplicated/count | Number of segments, including replicas, left to load until all used segments are available for queries. | tier, dataSource | 0 |
| segment/availableDeepStorageOnly/count | Number of unique segments that are only available for querying directly from deep storage. | dataSource | Varies |
| tier/historical/count | Number of available historical nodes in each tier. | tier | Varies |
| tier/replication/factor | Configured maximum replication factor in each tier. | tier | Varies |
| tier/required/capacity | Total capacity in bytes required in each tier. | tier | Varies |
| tier/total/capacity | Total capacity in bytes available in each tier. | tier | Varies |
| compact/task/count | Number of tasks issued in the auto compaction run. | | Varies |
| compactTask/maxSlot/count | Maximum number of task slots available for auto compaction tasks in the auto compaction run. | | Varies |
| compactTask/availableSlot/count | Number of available task slots that can be used for auto compaction tasks in the auto compaction run. This is the max number of task slots minus any currently running compaction tasks. | | Varies |
| killTask/availableSlot/count | Number of available task slots that can be used for auto kill tasks in the auto kill run. This is the max number of task slots minus any currently running auto kill tasks. | | Varies |
| killTask/maxSlot/count | Maximum number of task slots available for auto kill tasks in the auto kill run. | | Varies |
| kill/task/count | Number of tasks issued in the auto kill run. | | Varies |
| kill/eligibleUnusedSegments/count | The number of unused segments of a datasource that are identified as eligible for deletion from the metadata store by the coordinator. | dataSource | Varies |
kill/pendingSegments/countNumber of stale pending segments deleted from the metadata store.dataSourceVaries
segment/waitCompact/bytesTotal bytes of this datasource waiting to be compacted by auto compaction. Only intervals and segments that are eligible for auto compaction are considered.dataSourceVaries
segment/waitCompact/countTotal number of segments of this datasource waiting to be compacted by auto compaction. Only intervals and segments that are eligible for auto compaction are considered.dataSourceVaries
interval/waitCompact/countTotal number of intervals of this datasource waiting to be compacted by auto compaction. Only intervals and segments that are eligible for auto compaction are considered.dataSourceVaries
segment/compacted/bytesTotal bytes of this datasource that are already compacted with the spec set in the auto compaction config.dataSourceVaries
segment/compacted/countTotal number of segments of this datasource that are already compacted with the spec set in the auto compaction config.dataSourceVaries
interval/compacted/countTotal number of intervals of this datasource that are already compacted with the spec set in the auto compaction config.dataSourceVaries
segment/skipCompact/bytesTotal bytes of this datasource that auto compaction skips because they are not eligible for auto compaction.dataSourceVaries
segment/skipCompact/countTotal number of segments of this datasource that auto compaction skips because they are not eligible for auto compaction.dataSourceVaries
interval/skipCompact/countTotal number of intervals of this datasource that auto compaction skips because they are not eligible for auto compaction.dataSourceVaries
coordinator/timeApproximate Coordinator duty runtime in milliseconds.dutyVaries
coordinator/global/timeApproximate runtime of a full coordination cycle in milliseconds. The dutyGroup dimension indicates what type of coordination this run was. For example: Historical Management or Indexing.dutyGroupVaries
metadata/kill/supervisor/countTotal number of terminated supervisors automatically deleted from the metadata store in each run of the Coordinator kill supervisor duty. Use this metric to tune druid.coordinator.kill.supervisor.durationToRetain, depending on whether more or fewer terminated supervisors need to be deleted per cycle. Emitted only when druid.coordinator.kill.supervisor.on is set to true.Varies
metadata/kill/audit/countTotal number of audit logs automatically deleted from the metadata store in each run of the Coordinator kill audit duty. Use this metric to tune druid.coordinator.kill.audit.durationToRetain, depending on whether more or fewer audit logs need to be deleted per cycle. Emitted only when druid.coordinator.kill.audit.on is set to true.Varies
metadata/kill/compaction/countTotal number of compaction configurations automatically deleted from the metadata store in each run of the Coordinator kill compaction configuration duty. Emitted only when druid.coordinator.kill.compaction.on is set to true.Varies
metadata/kill/rule/countTotal number of rules automatically deleted from the metadata store in each run of the Coordinator kill rule duty. Use this metric to tune druid.coordinator.kill.rule.durationToRetain, depending on whether more or fewer rules need to be deleted per cycle. Emitted only when druid.coordinator.kill.rule.on is set to true.Varies
metadata/kill/datasource/countTotal number of datasource metadata entries automatically deleted from the metadata store in each run of the Coordinator kill datasource duty. Note that datasource metadata exists only for datasources created by a supervisor. Use this metric to tune druid.coordinator.kill.datasource.durationToRetain, depending on whether more or fewer datasource metadata entries need to be deleted per cycle. Emitted only when druid.coordinator.kill.datasource.on is set to true.Varies
serverview/init/timeTime taken to initialize the coordinator server view.Depends on the number of segments.
serverview/sync/healthySync status of the Coordinator with a segment-loading server such as a Historical or Peon. Emitted only when HTTP-based server view is enabled. You can use this metric in conjunction with serverview/sync/unstableTime to debug slow startup of the Coordinator.server, tier1 for fully synced servers, 0 otherwise
serverview/sync/unstableTimeTime in milliseconds for which the Coordinator has been failing to sync with a segment-loading server. Emitted only when HTTP-based server view is enabled.server, tierNot emitted for synced servers.
metadatacache/init/timeTime taken to initialize the coordinator segment metadata cache.Depends on the number of segments.
metadatacache/refresh/countNumber of segments to refresh in coordinator segment metadata cache.dataSource
metadatacache/refresh/timeTime taken to refresh segments in coordinator segment metadata cache.dataSource
metadatacache/backfill/countNumber of segments for which the schema was backfilled in the database.dataSource
metadatacache/realtimeSegmentSchema/countNumber of realtime segments for which schema is cached.Depends on the number of realtime segments in the cluster.
metadatacache/finalizedSegmentMetadata/countNumber of finalized segments for which schema metadata is cached.Depends on the number of segments in the cluster.
metadatacache/finalizedSchemaPayload/countNumber of finalized segment schemas cached.Depends on the number of distinct schemas in the cluster.
metadatacache/temporaryMetadataQueryResults/countNumber of segments for which schema was fetched by executing segment metadata query.Eventually it should be 0.
metadatacache/temporaryPublishedMetadataQueryResults/countNumber of segments for which the schema is cached after backfilling in the database.This value is reset after each database poll. Eventually it should be 0.
metadatacache/deepStorageOnly/segment/countNumber of available segments present only in deep storage.dataSource
metadatacache/deepStorageOnly/refresh/countNumber of deep storage only segments with cached schema.dataSource
metadatacache/deepStorageOnly/process/timeTime taken in milliseconds to process deep storage only segment schema.Under a minute
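
As a rough illustration of how two of the Coordinator metrics above can be combined, the following sketch estimates how long a server might take to drain its load queue, using segment/loadQueue/size (bytes) and segment/loading/rateKbps (1 kbps = 1000 bits per second). The function name and the sample values are hypothetical, and the result is only an approximation because the rate is a moving average.

```python
from typing import Optional


def estimate_load_seconds(load_queue_bytes: int, rate_kbps: float) -> Optional[float]:
    """Estimate seconds needed to drain a server's load queue.

    load_queue_bytes: value of segment/loadQueue/size for the server.
    rate_kbps: value of segment/loading/rateKbps for the same server.
    Returns None when no usable rate has been reported.
    """
    if rate_kbps <= 0:
        return None
    bits_to_load = load_queue_bytes * 8   # bytes -> bits
    bits_per_second = rate_kbps * 1000    # kbps -> bits per second
    return bits_to_load / bits_per_second


# Hypothetical sample values: 5 GB queued, loading at 400,000 kbps.
print(estimate_load_seconds(5_000_000_000, 400_000))
```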

General Health

Service Health

MetricDescriptionDimensionsNormal value
service/heartbeatMetric indicating the service is up. This metric is emitted only when ServiceStatusMonitor is enabled.leader on the Overlord and Coordinator.
workerVersion, category, status on the Middle Manager.
taskId, groupId, taskType, dataSource, tags on the Peon
1

Historical

MetricDescriptionDimensionsNormal value
segment/maxMaximum byte limit available for segments.Varies.
segment/usedBytes used for served segments.dataSource, tier, priority< max
segment/usedPercentPercentage of space used by served segments.dataSource, tier, priority< 100%
segment/countNumber of served segments.dataSource, tier, priorityVaries
segment/pendingDeleteOn-disk size in bytes of segments that are waiting to be cleared out.Varies
segment/rowCount/avgThe average number of rows per segment on a historical. SegmentStatsMonitor must be enabled.dataSource, tier, priorityVaries. See segment optimization for guidance on optimal segment sizes.
segment/rowCount/range/countThe number of segments in a bucket. SegmentStatsMonitor must be enabled.dataSource, tier, priority, rangeVaries

JVM

These metrics are only available if the JvmMonitor module is included in druid.monitoring.monitors. For more information, see Enabling Metrics.

MetricDescriptionDimensionsNormal value
jvm/pool/committedCommitted poolpoolKind, poolName, jvmVersionClose to max pool
jvm/pool/initInitial poolpoolKind, poolName, jvmVersionVaries
jvm/pool/maxMax poolpoolKind, poolName, jvmVersionVaries
jvm/pool/usedPool usedpoolKind, poolName, jvmVersion< max pool
jvm/bufferpool/countBufferpool countbufferpoolName, jvmVersionVaries
jvm/bufferpool/usedBufferpool usedbufferpoolName, jvmVersionClose to capacity
jvm/bufferpool/capacityBufferpool capacitybufferpoolName, jvmVersionVaries
jvm/mem/initInitial memorymemKind, jvmVersionVaries
jvm/mem/maxMax memorymemKind, jvmVersionVaries
jvm/mem/usedUsed memorymemKind, jvmVersion< max memory
jvm/mem/committedCommitted memorymemKind, jvmVersionClose to max memory
jvm/gc/countGarbage collection countgcName (cms/g1/parallel/etc.), gcGen (old/young), jvmVersionVaries
jvm/gc/cpuTotal CPU time, in nanoseconds, spent on garbage collection. Note: jvm/gc/cpu represents the total time over multiple GC cycles; divide by jvm/gc/count to get the mean GC time per cycle.gcName, gcGen, jvmVersionSum of jvm/gc/cpu should be within 10-30% of sum of jvm/cpu/total, depending on the GC algorithm used (reported by JvmCpuMonitor).
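
The division described for jvm/gc/cpu can be sketched as follows. This is a minimal example that assumes you have already collected the two values for the same gcName, gcGen, and emission period; the function name and sample numbers are hypothetical.

```python
def mean_gc_millis(gc_cpu_nanos: int, gc_count: int) -> float:
    """Mean GC time per cycle in milliseconds.

    gc_cpu_nanos: value of jvm/gc/cpu (nanoseconds over the period).
    gc_count: value of jvm/gc/count for the same period.
    """
    if gc_count == 0:
        return 0.0  # no collections in this period
    nanos_per_cycle = gc_cpu_nanos / gc_count
    return nanos_per_cycle / 1_000_000  # nanoseconds -> milliseconds


# Hypothetical sample: 1.5 s of GC CPU time spread over 30 collections.
print(mean_gc_millis(1_500_000_000, 30))
```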

ZooKeeper

These metrics are available only when druid.zk.service.enabled = true.

MetricDescriptionDimensionsNormal value
zk/connectedIndicator of connection status. 1 for connected, 0 for disconnected. Emitted once per monitor period.None1
zk/reconnect/timeAmount of time, in milliseconds, that a server was disconnected from ZooKeeper before reconnecting. Emitted on reconnection. Not emitted if connection to ZooKeeper is permanently lost, because in this case, there is no reconnection.NoneNot present

Sys [Deprecated]

SysMonitor is now deprecated and will be removed in a future release. Instead, use the new OSHI monitor called OshiSysMonitor. The new monitor has wider support for different machine architectures, including ARM instances.

These metrics are only available if the SysMonitor module is included.

MetricDescriptionDimensionsNormal value
sys/swap/freeFree swapVaries
sys/swap/maxMax swapVaries
sys/swap/pageInPaged in swapVaries
sys/swap/pageOutPaged out swapVaries
sys/disk/write/countWrites to diskfsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptionsVaries
sys/disk/read/countReads from diskfsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptionsVaries
sys/disk/write/sizeBytes written to disk. One indicator of the amount of paging occurring for segments.fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptionsVaries
sys/disk/read/sizeBytes read from disk. One indicator of the amount of paging occurring for segments.fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptionsVaries
sys/net/write/sizeBytes written to the networknetName, netAddress, netHwaddrVaries
sys/net/read/sizeBytes read from the networknetName, netAddress, netHwaddrVaries
sys/fs/usedFilesystem bytes usedfsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions< max
sys/fs/maxFilesystem bytes maxfsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptionsVaries
sys/mem/usedMemory used< max
sys/mem/maxMemory maxVaries
sys/storage/usedDisk space usedfsDirNameVaries
sys/cpuCPU usedcpuName, cpuTimeVaries

OshiSysMonitor

These metrics are only available if the OshiSysMonitor module is included.

MetricDescriptionDimensionsNormal value
sys/swap/freeFree swapVaries
sys/swap/maxMax swapVaries
sys/swap/pageInPaged in swapVaries
sys/swap/pageOutPaged out swapVaries
sys/disk/write/countWrites to diskdiskNameVaries
sys/disk/read/countReads from diskdiskNameVaries
sys/disk/write/sizeBytes written to disk. One indicator of the amount of paging occurring for segments.diskNameVaries
sys/disk/read/sizeBytes read from disk. One indicator of the amount of paging occurring for segments.diskNameVaries
sys/disk/queueDisk queue length. Measures the number of requests waiting to be processed by the disk.diskNameGenerally 0
sys/disk/transferTimeTransfer time to read from or write to diskdiskNameDepends on hardware
sys/net/write/sizeBytes written to the networknetName, netAddress, netHwaddrVaries
sys/net/read/sizeBytes read from the networknetName, netAddress, netHwaddrVaries
sys/net/read/packetsTotal packets read from the networknetName, netAddress, netHwaddrVaries
sys/net/write/packetsTotal packets written to the networknetName, netAddress, netHwaddrVaries
sys/net/read/errorsTotal network read errorsnetName, netAddress, netHwaddrGenerally 0
sys/net/write/errorsTotal network write errorsnetName, netAddress, netHwaddrGenerally 0
sys/net/read/droppedTotal packets dropped coming from networknetName, netAddress, netHwaddrGenerally 0
sys/net/write/collisionsTotal network write collisionsnetName, netAddress, netHwaddrGenerally 0
sys/fs/usedFilesystem bytes usedfsDevName, fsDirName< max
sys/fs/maxFilesystem bytes maxfsDevName, fsDirNameVaries
sys/fs/files/countFilesystem total inodesfsDevName, fsDirName< max
sys/fs/files/freeFilesystem free inodesfsDevName, fsDirNameVaries
sys/mem/usedMemory used< max
sys/mem/maxMemory maxVaries
sys/mem/freeMemory freeVaries
sys/storage/usedDisk space usedfsDirNameVaries
sys/cpuCPU usedcpuName, cpuTimeVaries
sys/uptimeTotal system uptimeVaries
sys/la/{i}System CPU load averages over past i minutes, where i={1,5,15}Varies
sys/tcpv4/activeOpensTotal TCP active open connectionsVaries
sys/tcpv4/passiveOpensTotal TCP passive open connectionsVaries
sys/tcpv4/attemptFailsTotal TCP active connection failuresGenerally 0
sys/tcpv4/estabResetsTotal TCP connection resetsGenerally 0
sys/tcpv4/in/segsTotal segments received in connectionVaries
sys/tcpv4/in/errsErrors while reading segmentsGenerally 0
sys/tcpv4/out/segsTotal segments sentVaries
sys/tcpv4/out/rstsTotal "out reset" packets sent to reset the connectionGenerally 0
sys/tcpv4/retrans/segsTotal segments re-transmittedVaries

S3 multi-part upload

These metrics are only available if the druid-s3-extensions module is included and at least one of the following features is in use: MSQ export to S3 or durable intermediate storage on S3.

MetricDescriptionDimensionsNormal value
s3/upload/part/queueSizeNumber of items currently waiting in queue to be uploaded to S3. Each item in the queue corresponds to a single part in a multi-part upload.Varies
s3/upload/part/queuedTimeMilliseconds spent by a single item (or part) in queue before it starts getting uploaded to S3.uploadId, partNumberVaries
s3/upload/part/timeMilliseconds taken to upload a single part of a multi-part upload to S3.uploadId, partNumberVaries
s3/upload/total/timeMilliseconds taken for uploading all parts of a multi-part upload to S3.uploadIdVaries
s3/upload/total/bytesTotal bytes uploaded to S3 during a multi-part upload.uploadIdVaries
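
One way to use the last two metrics together is to derive an average upload throughput for a given uploadId. The sketch below assumes both values were emitted for the same multi-part upload; the function name and sample values are hypothetical.

```python
from typing import Optional


def upload_throughput_mb_per_s(total_bytes: int, total_time_ms: int) -> Optional[float]:
    """Average throughput of one multi-part upload in MB/s (1 MB = 10**6 bytes).

    total_bytes: value of s3/upload/total/bytes for an uploadId.
    total_time_ms: value of s3/upload/total/time for the same uploadId.
    Returns None when the reported time is not positive.
    """
    if total_time_ms <= 0:
        return None
    seconds = total_time_ms / 1000            # milliseconds -> seconds
    return total_bytes / seconds / 1_000_000  # bytes per second -> MB/s


# Hypothetical sample: 250 MB uploaded in 20 seconds.
print(upload_throughput_mb_per_s(250_000_000, 20_000))
```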

Cgroup

These metrics are available on operating systems with the cgroup kernel feature. All the values are derived by reading from /sys/fs/cgroup.

MetricDescriptionDimensionsNormal value
cgroup/cpu/sharesRelative value of CPU time available to this process. Read from cpu.shares.Varies
cgroup/cpu/cores_quotaNumber of cores available to this process. Derived from cpu.cfs_quota_us/cpu.cfs_period_us.Varies. A value of -1 indicates there is no explicit quota set.
cgroup/cpu/usage/total/percentageTotal CPU percentage used by the cgroup of the running process.0-100
cgroup/cpu/usage/user/percentageUser CPU percentage used by the cgroup of the running process.0-100
cgroup/cpu/usage/sys/percentageSystem CPU percentage used by the cgroup of the running process.0-100
cgroup/disk/read/sizeNumber of bytes read from specific devices by the cgroup of the running process.diskNameVaries
cgroup/disk/write/sizeNumber of bytes written to specific devices by the cgroup of the running process.diskNameVaries
cgroup/disk/read/countNumber of read operations performed on specific devices by the cgroup of the running process.diskNameVaries
cgroup/disk/write/countNumber of write operations performed on specific devices by the cgroup of the running process.diskNameVaries
cgroup/memory/*Memory stats for this process, such as cache and total_swap. Each stat produces a separate metric. Read from memory.stat.Varies
cgroup/memory_numa/*/pagesMemory stats, per NUMA node, for this process, such as total and unevictable. Each stat produces a separate metric. Read from memory.numa_stat.numaZoneVaries
cgroup/memory/limit/bytesMaximum memory, in bytes, that can be used by processes in the cgroup.Varies
cgroup/memory/usage/bytesCurrent memory usage, in bytes, of processes in the cgroup (including file cache).Varies
cgroup/cpuset/cpu_countTotal number of CPUs available to the process. Derived from cpuset.cpus.Varies
cgroup/cpuset/effective_cpu_countTotal number of active CPUs available to the process. Derived from cpuset.effective_cpus.Varies
cgroup/cpuset/mems_countTotal number of memory nodes available to the process. Derived from cpuset.mems.Varies
cgroup/cpuset/effective_mems_countTotal number of active memory nodes available to the process. Derived from cpuset.effective_mems.Varies
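
The derivations named in the table above (cpu.cfs_quota_us/cpu.cfs_period_us for cgroup/cpu/cores_quota, and counting the entries of cpuset.cpus for cgroup/cpuset/cpu_count) can be sketched as follows. This is an illustration of the arithmetic only, not Druid's implementation; the function names are hypothetical, and the inputs are passed in as values rather than read from /sys/fs/cgroup.

```python
def cores_quota(cfs_quota_us: int, cfs_period_us: int) -> float:
    """Number of cores implied by the CFS quota, or -1 when no quota is set."""
    if cfs_quota_us < 0 or cfs_period_us <= 0:
        return -1.0  # a negative quota means no explicit limit
    return cfs_quota_us / cfs_period_us


def cpuset_count(cpus: str) -> int:
    """Count the CPUs in a cpuset list string such as "0-3,8"."""
    count = 0
    for part in cpus.strip().split(","):
        if not part:
            continue  # empty string or stray comma
        if "-" in part:
            first, last = part.split("-")
            count += int(last) - int(first) + 1  # inclusive range
        else:
            count += 1  # single CPU id
    return count


# Hypothetical sample values.
print(cores_quota(150_000, 100_000))
print(cores_quota(-1, 100_000))
print(cpuset_count("0-3,8"))
```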