Garage exposes detailed metrics in Prometheus format, allowing you to monitor cluster health, performance, and resource usage. For information on setting up monitoring infrastructure, see the Monitoring Cookbook.
Accessing Metrics
Metrics are available via the administration API endpoint.
Garage System Metrics
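As a starting point for working with the metrics described on this page, the sketch below fetches the raw Prometheus text from the administration API and parses it into samples. The URL `http://localhost:3903/metrics`, the bearer-token header, and the function names are assumptions for illustration; check the admin section of your own configuration for the actual address and token.

```python
import re
import urllib.request

# Assumed address of the admin API metrics endpoint; adjust to your setup.
ADMIN_URL = "http://localhost:3903/metrics"

# One sample line: name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url=ADMIN_URL, token=None):
    """Download the metrics page, sending a bearer token if one is given."""
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `parse_metrics(fetch_metrics(token="..."))` against a running node would return one tuple per exposed sample; the parser is deliberately minimal and does not handle every corner of the exposition format.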
Version Information
garage_build_info (counter)
Exposes the Garage version running on each node.
- Verify all nodes run the same version
- Track upgrade progress
- Detect version mismatches
Configuration Metrics
garage_replication_factor (counter)
Exposes the configured replication factor.
Disk Space Metrics
garage_local_disk_avail and garage_local_disk_total (gauge)
Reports available and total disk space on each node, separately for data and metadata.
- Alert when available space < 10% of total
- Alert when metadata disk < 5GB available
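The two disk-space conditions above can be sketched as a small check. This is an illustration only: the function name, the dictionary inputs keyed by volume name (`"data"`, `"metadata"`), and the reading of "5GB" as 5 × 10^9 bytes are assumptions, not part of Garage.

```python
GB = 10 ** 9  # assumption: "5GB" is read as decimal gigabytes


def disk_alerts(avail, total, min_free_ratio=0.10, min_meta_bytes=5 * GB):
    """Return alert messages given per-volume available and total bytes.

    `avail` and `total` map a volume name ("data", "metadata") to bytes,
    e.g. values taken from garage_local_disk_avail and garage_local_disk_total.
    """
    alerts = []
    for volume, total_bytes in total.items():
        free = avail.get(volume)
        if free is None or total_bytes <= 0:
            continue  # no usable reading for this volume
        ratio = free / total_bytes
        if ratio < min_free_ratio:
            alerts.append(f"{volume}: only {ratio:.1%} free")
    meta_free = avail.get("metadata")
    if meta_free is not None and meta_free < min_meta_bytes:
        alerts.append(f"metadata: {meta_free / GB:.1f} GB free")
    return alerts
```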
Cluster Health Metrics
Overall Health
cluster_healthy (gauge)
Indicates whether all storage nodes are connected.
cluster_available (gauge)
Indicates whether all requests can be served, even if some nodes are disconnected.
Node Metrics
cluster_connected_nodes (gauge)
Number of nodes currently connected to the cluster.
cluster_known_nodes (gauge)
Number of nodes that have been seen at least once in the cluster.
If cluster_connected_nodes < cluster_known_nodes, some nodes are currently offline.
cluster_layout_node_connected (gauge)
Connection status for individual nodes in the cluster layout.
- 1 = connected
- 0 = disconnected
cluster_layout_node_disconnected_time (gauge)
Seconds since last connection to each node.
- Alert if disconnected_time > 300 (5 minutes)
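A minimal sketch of the disconnection check, assuming the two per-node gauges have already been collected into dictionaries keyed by a node identifier. The function name and input shape are hypothetical.

```python
def offline_nodes(connected, disconnected_secs, max_secs=300):
    """Return ids of layout nodes that have been down for more than max_secs.

    `connected` maps node id -> 1 (connected) or 0 (disconnected);
    `disconnected_secs` maps node id -> seconds since the last connection.
    """
    return sorted(
        node
        for node, up in connected.items()
        if up == 0 and disconnected_secs.get(node, 0) > max_secs
    )
```

A node that is down but below the threshold is not reported, which avoids alerting on short restarts.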
Storage and Partition Metrics
cluster_storage_nodes (gauge)
Number of storage nodes declared in the current layout.
cluster_storage_nodes_ok (gauge)
Number of storage nodes currently connected.
cluster_partitions (gauge)
Total number of partitions in the layout (always 256).
cluster_partitions_all_ok (gauge)
Number of partitions for which all storage nodes are connected.
cluster_partitions_quorum (gauge)
Number of partitions with enough connected nodes to serve all requests.
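The three partition gauges can be combined into a single status. The sketch below is one possible reading of the descriptions above, and the status names are made up for illustration.

```python
def partition_status(partitions, all_ok, quorum):
    """Classify cluster state from cluster_partitions, cluster_partitions_all_ok
    and cluster_partitions_quorum."""
    if quorum < partitions:
        # At least one partition lacks enough connected nodes.
        return "unavailable"
    if all_ok < partitions:
        # Every partition has a quorum, but some storage nodes are missing.
        return "degraded"
    return "healthy"
```

For example, `partition_status(256, 200, 256)` returns `"degraded"`: all requests can still be served, but 56 partitions have at least one disconnected storage node.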
API Endpoint Metrics
Admin API
api_admin_request_counter (counter)
Counts requests to each admin API endpoint.
api_admin_request_duration (histogram)
Duration of admin API calls.
S3 API
api_s3_request_counter (counter)
Counts requests to each S3 API endpoint.
api_s3_error_counter (counter)
Counts S3 API errors by endpoint and status code.
- High rate of 500 errors indicates cluster issues
- High rate of 404 errors may indicate application bugs
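Because these metrics are counters, a "high rate" has to be derived from the difference between two scrapes. The following sketch shows the arithmetic under the assumption that you have two readings of the request and error counters taken `seconds` apart; the function names are hypothetical.

```python
def counter_rate(prev, curr, seconds):
    """Per-second increase of a counter between two scrapes.

    A decrease is treated as a counter reset (for example after a restart),
    in which case the current value is taken as the increase.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = curr - prev if curr >= prev else curr
    return delta / seconds


def error_ratio(req_prev, req_curr, err_prev, err_curr, seconds):
    """Fraction of requests that ended in an error over the interval."""
    req_rate = counter_rate(req_prev, req_curr, seconds)
    err_rate = counter_rate(err_prev, err_curr, seconds)
    return err_rate / req_rate if req_rate > 0 else 0.0
```

Comparing the ratio rather than the raw error count keeps the signal meaningful when traffic volume changes.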
api_s3_request_duration (histogram)
Duration of S3 API calls.
K2V API
Same metrics as S3 API but for the K2V endpoint:
- api_k2v_request_counter
- api_k2v_error_counter
- api_k2v_request_duration
Web Endpoint Metrics
web_request_counter (counter)
Number of requests to the web endpoint.
web_request_duration (histogram)
Duration of web endpoint requests.
web_error_counter (counter)
Web endpoint errors by method and status code.
Data Block Manager Metrics
I/O Metrics
block_bytes_read, block_bytes_written (counter)
Bytes read from and written to disk in the data storage directory.
block_read_duration, block_write_duration (histogram)
Duration of individual block read/write operations.
- Alert if P95 read duration > 1s (slow disk)
- Alert if P95 write duration > 5s (slow disk)
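A P95 value is not exposed directly; it is estimated from the histogram's cumulative buckets. The sketch below follows the usual linear-interpolation approach and assumes the buckets have already been extracted as `(upper_bound_seconds, cumulative_count)` pairs ending with an infinite bound. The function name mirrors the PromQL function of the same name but is an independent illustration, not its implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets.

    Values are assumed to be spread evenly inside each bucket, so the result
    is an approximation whose accuracy depends on the bucket layout.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With buckets taken from block_read_duration, the slow-read condition above corresponds to `histogram_quantile(0.95, buckets) > 1.0`.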
Memory Management
block_ram_buffer_free_kb (gauge)
Kibibytes available for buffering blocks to send to remote nodes.
Configuration
block_compression_level (counter)
Configured block compression level.
Block Operations
block_delete_counter (counter)
Number of data blocks deleted from storage.
Resync Operations
block_resync_counter (counter), block_resync_duration (histogram)
Number and duration of block resync operations.
block_resync_queue_length (gauge)
Number of block hashes queued for resync.
It is normal for this value to be nonzero for long periods, especially after layout changes or node failures.
block_resync_errored_blocks (gauge)
Number of blocks that failed to resync on the last attempt.
RPC Metrics
Request Metrics
rpc_netapp_request_counter (counter)
Number of RPC requests emitted between nodes.
Error Metrics
rpc_netapp_error_counter (counter)
Communication errors (usually due to disconnected nodes).
rpc_timeout_counter (counter)
Number of RPC timeouts.
Duration Metrics
rpc_duration (histogram)
Duration of RPC calls between nodes.
Metadata Table Metrics
Garbage Collection
table_gc_todo_queue_length (gauge)
Length of the garbage collection TODO queue for each table.
Table Operations
table_get_request_counter (counter), table_get_request_duration (histogram)
Number and duration of get/get_range requests on each table.
table_put_request_counter (counter), table_put_request_duration (histogram)
Number and duration of insert/insert_many requests.
Table Modifications
table_internal_delete_counter (counter)
Number of value deletions in the tree (due to GC or repartitioning).
table_internal_update_counter (counter)
Number of value updates (creation and modification).
Merkle Tree
table_merkle_updater_todo_queue_length (gauge)
Merkle tree updater TODO queue length.
Should fall to zero rapidly. Persistent nonzero values during normal operation may indicate issues.
Synchronization
table_sync_items_received, table_sync_items_sent (counter)
Data items sent to/received from other nodes during resync.
Example Prometheus Alerts
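Prometheus alerting rules usually pair a condition with a `for` duration, so that an alert fires only when the condition has held continuously for some time. The sketch below illustrates that behaviour in plain Python rather than in Prometheus rule syntax; the class name, field names, and the 600-second hold time in the usage note are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Fires only after its condition has held for `hold_secs` without a gap."""

    name: str
    hold_secs: float
    since: Optional[float] = None  # time at which the condition became true

    def update(self, condition: bool, now: float) -> bool:
        """Record one evaluation and return True if the alert is firing."""
        if not condition:
            self.since = None  # any gap resets the waiting period
            return False
        if self.since is None:
            self.since = now
        return now - self.since >= self.hold_secs
```

For instance, `PendingAlert("resync_errors", hold_secs=600)` updated on every scrape with `block_resync_errored_blocks > 0` would ignore a brief spike and fire only if errored blocks persist for ten minutes.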
Best Practices
- Monitor critical metrics:
  - cluster_healthy and cluster_available
  - block_resync_errored_blocks
  - Disk space metrics
- Set up alerting for:
  - Node disconnections
  - Disk space < 10%
  - Persistent resync errors
  - High error rates
- Create dashboards for:
  - Cluster health overview
  - API performance (latency, throughput)
  - Resource usage (disk, memory)
  - RPC performance
- Track trends over time:
  - Request rates and patterns
  - Disk usage growth
  - Error rates
- Document your alerts and runbooks for common issues