Monitoring and Metrics

Garage exposes detailed metrics in Prometheus format, allowing you to monitor cluster health, performance, and resource usage. For information on setting up monitoring infrastructure, see the Monitoring Cookbook.

Accessing Metrics

Metrics are available via the administration API endpoint:

curl http://localhost:3903/metrics

Or configure Prometheus to scrape this endpoint automatically.
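A minimal scrape configuration might look like the sketch below. The job name `garage` and the 15s interval are illustrative choices, not values taken from Garage. The target assumes the same `localhost:3903` address as the curl example and an endpoint that is reachable without authentication.

```yaml
# prometheus.yml (fragment): scrape the Garage admin endpoint.
scrape_configs:
  - job_name: garage            # hypothetical job name
    metrics_path: /metrics      # same path as the curl example
    scrape_interval: 15s        # illustrative; pick what suits your setup
    static_configs:
      - targets:
          - "localhost:3903"    # list one host:port per Garage node
```

Listing every node as a separate target lets per-node metrics such as disk space be told apart by Prometheus's `instance` label.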

Garage System Metrics

Version Information

`garage_build_info` (counter)

Exposes the Garage version running on each node.

garage_build_info{version="1.0"} 1

Use cases:

Verify all nodes run the same version
Track upgrade progress
Detect version mismatches
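One way to detect a version mismatch is to count the distinct `version` label values across all scraped nodes. The rule below is a sketch: the alert name, group name, and the 30 minute delay (meant to leave room for a rolling upgrade) are assumptions, and it only works if every node is scraped by the same Prometheus server.

```yaml
groups:
  - name: garage-version
    rules:
      - alert: GarageVersionMismatch   # hypothetical alert name
        # The inner count yields one series per distinct version label;
        # the outer count is therefore the number of distinct versions.
        expr: count(count by (version) (garage_build_info)) > 1
        for: 30m                        # illustrative grace period for upgrades
        annotations:
          summary: "Garage nodes are running different versions"
```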

Configuration Metrics

`garage_replication_factor` (counter)

Exposes the configured replication factor.

garage_replication_factor 3

Disk Space Metrics

`garage_local_disk_avail` and `garage_local_disk_total` (gauge)

Reports available and total disk space on each node, separately for data and metadata.

garage_local_disk_avail{volume="data"} 540341960704
garage_local_disk_avail{volume="metadata"} 540341960704
garage_local_disk_total{volume="data"} 763063566336
garage_local_disk_total{volume="metadata"} 763063566336

Alert recommendations:

Alert when available space < 10% of total
Alert when metadata disk < 5GB available
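The metadata recommendation above could be expressed as the following rule. This is a sketch that assumes the metric is reported in bytes (as the sample values suggest) and treats 5GB as 5e9 bytes; the alert name and the 10 minute delay are illustrative.

```yaml
groups:
  - name: garage-disk
    rules:
      - alert: GarageMetadataDiskLow   # hypothetical alert name
        # 5e9 bytes = 5 GB (decimal); use 5 * 1024^3 if you prefer GiB.
        expr: garage_local_disk_avail{volume="metadata"} < 5e9
        for: 10m
        annotations:
          summary: "Garage metadata volume has less than 5 GB available"
```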

Cluster Health Metrics

Overall Health

`cluster_healthy` (gauge)

Indicates whether all storage nodes are connected.

cluster_healthy 1  # All nodes connected
cluster_healthy 0  # One or more nodes disconnected

Critical alert: cluster_healthy = 0 indicates that at least one storage node is unreachable.

`cluster_available` (gauge)

Indicates whether all requests can be served, even if some nodes are disconnected.

cluster_available 1  # Cluster can serve all requests
cluster_available 0  # Cluster cannot serve some requests

Critical alert: cluster_available = 0 indicates potential data unavailability.

Node Metrics

`cluster_connected_nodes` (gauge)

Number of nodes currently connected to the cluster.

cluster_connected_nodes 3

`cluster_known_nodes` (gauge)

Number of nodes that have been seen at least once in the cluster.

cluster_known_nodes 3

If cluster_connected_nodes < cluster_known_nodes, some nodes are currently offline.

`cluster_layout_node_connected` (gauge)

Connection status for individual nodes in the cluster layout.

cluster_layout_node_connected{id="62b218d848e86a64",role_capacity="1000000000",role_gateway="0",role_zone="dc1"} 1
cluster_layout_node_connected{id="a11c7cf18af29737",role_capacity="1000000000",role_gateway="0",role_zone="dc1"} 0

Values:

1 = connected
0 = disconnected

`cluster_layout_node_disconnected_time` (gauge)

Seconds since last connection to each node.

cluster_layout_node_disconnected_time{id="62b218d848e86a64",role_capacity="1000000000",role_gateway="0",role_zone="dc1"} 0
cluster_layout_node_disconnected_time{id="a11c7cf18af29737",role_capacity="1000000000",role_gateway="0",role_zone="dc1"} 3600

Alert recommendation:

Alert if disconnected_time > 300 (5 minutes)
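A sketch of that recommendation as an alert rule follows. The alert and group names are assumptions. No `for:` clause is used because the metric already encodes how long the node has been unreachable.

```yaml
groups:
  - name: garage-nodes
    rules:
      - alert: GarageNodeDisconnected   # hypothetical alert name
        # Fires once a layout node has been unreachable for more than 300 seconds.
        expr: cluster_layout_node_disconnected_time > 300
        annotations:
          summary: "Garage node {{ $labels.id }} has been disconnected for over 5 minutes"
          description: "Zone {{ $labels.role_zone }}, last connection {{ $value }}s ago"
```

If every node exports its own view of its peers, a single outage may raise one alert per scraped node; grouping by the `id` label in Alertmanager is one way to collapse them.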

Storage and Partition Metrics

`cluster_storage_nodes` (gauge)

Number of storage nodes declared in the current layout.

cluster_storage_nodes 4

`cluster_storage_nodes_ok` (gauge)

Number of storage nodes currently connected.

cluster_storage_nodes_ok 3

`cluster_partitions` (gauge)

Total number of partitions in the layout (always 256).

cluster_partitions 256

`cluster_partitions_all_ok` (gauge)

Number of partitions for which all storage nodes are connected.

cluster_partitions_all_ok 64

`cluster_partitions_quorum` (gauge)

Number of partitions with enough connected nodes to serve all requests.

cluster_partitions_quorum 256

If cluster_partitions_quorum < cluster_partitions, some data may be inaccessible.
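The comparison above translates directly into PromQL. The rule below is a sketch with an assumed alert name and delay. Because a filtering comparison keeps the left-hand sample, `$value` is the number of partitions that still have quorum.

```yaml
groups:
  - name: garage-partitions
    rules:
      - alert: GaragePartitionsWithoutQuorum   # hypothetical alert name
        # Both metrics come from the same node, so their labels match one-to-one.
        expr: cluster_partitions_quorum < cluster_partitions
        for: 1m
        annotations:
          summary: "Some Garage partitions lack quorum"
          description: "Only {{ $value }} partitions currently have quorum"
```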

API Endpoint Metrics

Admin API

`api_admin_request_counter` (counter)

Counts requests to each admin API endpoint.

api_admin_request_counter{api_endpoint="Metrics"} 127041

`api_admin_request_duration` (histogram)

Duration of admin API calls.

api_admin_request_duration_bucket{api_endpoint="Metrics",le="0.5"} 127041
api_admin_request_duration_sum{api_endpoint="Metrics"} 605.250344830999
api_admin_request_duration_count{api_endpoint="Metrics"} 127041

S3 API

`api_s3_request_counter` (counter)

Counts requests to each S3 API endpoint.

api_s3_request_counter{api_endpoint="CreateMultipartUpload"} 1
api_s3_request_counter{api_endpoint="GetObject"} 5234
api_s3_request_counter{api_endpoint="PutObject"} 1821

`api_s3_error_counter` (counter)

Counts S3 API errors by endpoint and status code.

api_s3_error_counter{api_endpoint="GetObject",status_code="404"} 39

Alert recommendations:

High rate of 500 errors indicates cluster issues
High rate of 404 errors may indicate application bugs
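An absolute error rate can be misleading on a busy cluster, so one option is to alert on the share of requests that fail. The sketch below assumes that `api_s3_request_counter` also counts requests that end in an error; the 5% threshold, alert name, and delay are illustrative.

```yaml
groups:
  - name: garage-s3-errors
    rules:
      - alert: GarageS3ServerErrorRatio   # hypothetical alert name
        # Ratio of 5xx errors to all S3 requests over the last 5 minutes.
        expr: |
          sum(rate(api_s3_error_counter{status_code=~"5.."}[5m]))
            /
          sum(rate(api_s3_request_counter[5m]))
            > 0.05
        for: 10m
        annotations:
          summary: "More than 5% of S3 requests are failing with 5xx errors"
```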

`api_s3_request_duration` (histogram)

Duration of S3 API calls.

api_s3_request_duration_bucket{api_endpoint="CreateMultipartUpload",le="0.5"} 1
api_s3_request_duration_sum{api_endpoint="CreateMultipartUpload"} 0.046340762
api_s3_request_duration_count{api_endpoint="CreateMultipartUpload"} 1

K2V API

Same metrics as S3 API but for the K2V endpoint:

api_k2v_request_counter
api_k2v_error_counter
api_k2v_request_duration

Web Endpoint Metrics

`web_request_counter` (counter)

Number of requests to the web endpoint.

web_request_counter{method="GET"} 80

`web_request_duration` (histogram)

Duration of web endpoint requests.

web_request_duration_bucket{method="GET",le="0.5"} 80
web_request_duration_sum{method="GET"} 1.0528433229999998
web_request_duration_count{method="GET"} 80

`web_error_counter` (counter)

Web endpoint errors by method and status code.

web_error_counter{method="GET",status_code="404 Not Found"} 64

Data Block Manager Metrics

I/O Metrics

`block_bytes_read`, `block_bytes_written` (counter)

Bytes read from and written to disk in the data storage directory.

block_bytes_read 120586322022
block_bytes_written 3386618077

`block_read_duration`, `block_write_duration` (histogram)

Duration of individual block read/write operations.

block_read_duration_bucket{le="0.5"} 169229
block_read_duration_sum 2761.6902550310056
block_read_duration_count 169240

Alert recommendations:

Alert if P95 read duration > 1s (slow disk)
Alert if P95 write duration > 5s (slow disk)
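These percentile thresholds can be approximated from the histogram buckets with `histogram_quantile`. The rules below are a sketch: they assume durations are recorded in seconds, that `block_write_duration` exposes `_bucket` series in the same way as the read histogram shown above, and that each node is a separate scrape target identified by Prometheus's `instance` label. Alert names and delays are illustrative.

```yaml
groups:
  - name: garage-block-io
    rules:
      - alert: GarageSlowBlockReads    # hypothetical alert name
        # Estimated 95th percentile read duration per node over 5 minutes.
        expr: |
          histogram_quantile(0.95,
            sum by (instance, le) (rate(block_read_duration_bucket[5m]))) > 1
        for: 10m
        annotations:
          summary: "P95 block read duration above 1s on {{ $labels.instance }}"

      - alert: GarageSlowBlockWrites   # hypothetical alert name
        expr: |
          histogram_quantile(0.95,
            sum by (instance, le) (rate(block_write_duration_bucket[5m]))) > 5
        for: 10m
        annotations:
          summary: "P95 block write duration above 5s on {{ $labels.instance }}"
```

The estimate is only as precise as the bucket boundaries, so treat the result as an approximation rather than an exact percentile.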

Memory Management

`block_ram_buffer_free_kb` (gauge)

Kibibytes available for buffering blocks to send to remote nodes.

block_ram_buffer_free_kb 219829

When this drops to zero, backpressure is applied. If consistently low, consider increasing available memory or reducing write rate.
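To catch a buffer that stays low rather than one that dips briefly, a rule can check the highest value seen over a window. In the sketch below, the 1024 KiB threshold, the 15 minute window, and the alert name are all assumptions to be tuned for your workload.

```yaml
groups:
  - name: garage-block-buffer
    rules:
      - alert: GarageBlockBufferLow   # hypothetical alert name
        # True only if the buffer never rose above 1024 KiB in the last 15 minutes.
        expr: max_over_time(block_ram_buffer_free_kb[15m]) < 1024
        annotations:
          summary: "Garage block RAM buffer has been nearly exhausted for 15 minutes"
```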

Configuration

`block_compression_level` (counter)

Configured block compression level.

block_compression_level 3

Block Operations

`block_delete_counter` (counter)

Number of data blocks deleted from storage.

block_delete_counter 122

Resync Operations

`block_resync_counter` (counter), `block_resync_duration` (histogram)

Number and duration of block resync operations.

block_resync_counter 308897
block_resync_duration_bucket{le="0.5"} 308892
block_resync_duration_sum 139.64204196100016
block_resync_duration_count 308897

`block_resync_queue_length` (gauge)

Number of block hashes queued for resync.

block_resync_queue_length 0

It is normal for this value to be nonzero for long periods, especially after layout changes or node failures.

`block_resync_errored_blocks` (gauge)

Number of blocks that failed to resync on the last attempt.

block_resync_errored_blocks 0

THIS SHOULD BE ZERO OR FALL TO ZERO RAPIDLY IN A HEALTHY CLUSTER. Persistent nonzero values indicate potential data loss. Investigate immediately with:

garage block list-errors

RPC Metrics

Request Metrics

`rpc_netapp_request_counter` (counter)

Number of RPC requests emitted between nodes.

rpc_request_counter{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 176

Error Metrics

`rpc_netapp_error_counter` (counter)

Communication errors (usually due to disconnected nodes).

rpc_netapp_error_counter{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 354

`rpc_timeout_counter` (counter)

Number of RPC timeouts.

rpc_timeout_counter{from="<this node>",rpc_endpoint="garage_rpc/membership.rs/SystemRpc",to="<remote node>"} 1

Should be close to zero in a healthy cluster. High timeout rates indicate network issues or overloaded nodes.
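A sustained timeout rate toward a particular node can be surfaced with a rule like the following sketch. The threshold of 0.1 timeouts per second, the alert name, and the delay are assumptions.

```yaml
groups:
  - name: garage-rpc
    rules:
      - alert: GarageRpcTimeouts   # hypothetical alert name
        # Per-destination timeout rate, summed over all senders and endpoints.
        expr: sum by (to) (rate(rpc_timeout_counter[5m])) > 0.1
        for: 10m
        annotations:
          summary: "RPC calls to Garage node {{ $labels.to }} are timing out"
```

Grouping by the `to` label points at the node that is slow to answer, which helps separate a single overloaded node from a cluster-wide network problem.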

Duration Metrics

`rpc_duration` (histogram)

Duration of RPC calls between nodes.

rpc_duration_bucket{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>",le="0.5"} 166
rpc_duration_sum{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 35.172253716
rpc_duration_count{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 174

Metadata Table Metrics

Garbage Collection

`table_gc_todo_queue_length` (gauge)

Length of the garbage collection TODO queue for each table.

table_gc_todo_queue_length{table_name="block_ref"} 0

Table Operations

`table_get_request_counter` (counter), `table_get_request_duration` (histogram)

Number and duration of get/get_range requests on each table.

table_get_request_counter{table_name="bucket_alias"} 315
table_get_request_duration_bucket{table_name="bucket_alias",le="0.5"} 315
table_get_request_duration_sum{table_name="bucket_alias"} 0.048509778000000024
table_get_request_duration_count{table_name="bucket_alias"} 315

`table_put_request_counter` (counter), `table_put_request_duration` (histogram)

Number and duration of insert/insert_many requests.

table_put_request_counter{table_name="block_ref"} 677
table_put_request_duration_bucket{table_name="block_ref",le="0.5"} 677
table_put_request_duration_sum{table_name="block_ref"} 61.617528636
table_put_request_duration_count{table_name="block_ref"} 677

Table Modifications

`table_internal_delete_counter` (counter)

Number of value deletions in the tree (due to GC or repartitioning).

table_internal_delete_counter{table_name="block_ref"} 2296

`table_internal_update_counter` (counter)

Number of value updates (creation and modification).

table_internal_update_counter{table_name="block_ref"} 5996

Merkle Tree

`table_merkle_updater_todo_queue_length` (gauge)

Merkle tree updater TODO queue length.

table_merkle_updater_todo_queue_length{table_name="block_ref"} 0

Should fall to zero rapidly. Persistent nonzero values during normal operation may indicate issues.
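A persistent backlog can be detected by requiring the queue to stay nonzero for an extended period. The rule below is a sketch; the 30 minute delay and the alert name are assumptions.

```yaml
groups:
  - name: garage-tables
    rules:
      - alert: GarageMerkleQueueStuck   # hypothetical alert name
        # The "for" clause filters out short bursts that drain on their own.
        expr: table_merkle_updater_todo_queue_length > 0
        for: 30m
        annotations:
          summary: "Merkle updater queue for table {{ $labels.table_name }} is not draining"
```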

Synchronization

`table_sync_items_received`, `table_sync_items_sent` (counter)

Data items sent to/received from other nodes during resync.

table_sync_items_received{from="<remote node>",table_name="bucket_v2"} 3
table_sync_items_sent{table_name="block_ref",to="<remote node>"} 2

Example Prometheus Alerts

groups:
  - name: garage
    interval: 60s
    rules:
      - alert: GarageClusterUnhealthy
        expr: cluster_healthy == 0
        for: 5m
        annotations:
          summary: "Garage cluster is unhealthy"
          description: "One or more nodes are disconnected"

      - alert: GarageClusterUnavailable
        expr: cluster_available == 0
        for: 1m
        annotations:
          summary: "Garage cluster is unavailable"
          description: "Cluster cannot serve all requests"

      - alert: GarageBlockResyncErrors
        expr: block_resync_errored_blocks > 0
        for: 15m
        annotations:
          summary: "Garage has block resync errors"
          description: "{{ $value }} blocks failed to resync"

      - alert: GarageDiskSpaceLow
        expr: (garage_local_disk_avail / garage_local_disk_total) < 0.1
        for: 10m
        annotations:
          summary: "Garage disk space low"
          description: "Less than 10% disk space available"

      - alert: GarageHighErrorRate
        expr: rate(api_s3_error_counter{status_code=~"5.."}[5m]) > 10
        annotations:
          summary: "High S3 API error rate"
          description: "More than 10 5xx errors per second"

Best Practices

Monitor critical metrics:
- cluster_healthy and cluster_available
- block_resync_errored_blocks
- Disk space metrics

Set up alerting for:
- Node disconnections
- Disk space < 10%
- Persistent resync errors
- High error rates

Create dashboards for:
- Cluster health overview
- API performance (latency, throughput)
- Resource usage (disk, memory)
- RPC performance

Track trends over time:
- Request rates and patterns
- Disk usage growth
- Error rates

Document your alerts and runbooks for common issues.
