Internal Architecture

This page explores how Garage manages data internally, covering the distributed systems concepts and design decisions that power its architecture.

Core concepts

Garage’s architecture is built on proven distributed systems research:

Dynamo ring

Consistent hashing ring for data distribution (paper)

CRDTs

Conflict-free Replicated Data Types for eventual consistency (paper)

Quorum consensus

Majority-based consistency without leader election

Gossip protocol

Decentralized cluster membership and failure detection

For more background, see this presentation (French).
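
To make the Dynamo ring idea concrete, here is a minimal consistent-hashing sketch. It is an illustration of the general technique, not Garage's actual layout code: the `Ring` type, the `ring_position` helper, the virtual-node count, and the use of the standard library's `DefaultHasher` are all assumptions made for this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hash any hashable value to a position on the ring (hypothetical helper).
fn ring_position<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical ring: maps ring positions to node names.
struct Ring {
    points: BTreeMap<u64, String>,
}

impl Ring {
    /// Place each node at `vnodes` virtual positions to even out the load.
    fn new(nodes: &[&str], vnodes: u32) -> Self {
        let mut points = BTreeMap::new();
        for node in nodes {
            for i in 0..vnodes {
                points.insert(ring_position(&(*node, i)), node.to_string());
            }
        }
        Ring { points }
    }

    /// Walk clockwise from the key's position and collect up to `n` distinct nodes.
    fn replicas(&self, key: &str, n: usize) -> Vec<String> {
        let start = ring_position(&key);
        let mut found: Vec<String> = Vec::new();
        let clockwise = self.points.range(start..).chain(self.points.range(..start));
        for (_, node) in clockwise {
            if found.len() == n {
                break;
            }
            if !found.contains(node) {
                found.push(node.clone());
            }
        }
        found
    }
}
```

Calling `Ring::new(&["a", "b", "c", "d"], 16).replicas("bucket/object", 3)` yields three distinct node names, and the same key always maps to the same nodes for a given ring.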

Request routing logic

Data retrieval requests to Garage endpoints (S3 API and websites) are resolved to an individual object in a bucket. Since objects are replicated to multiple nodes, Garage must ensure consistency before answering requests.

Using quorum to ensure consistency

Garage ensures consistency by attempting to establish a quorum with the data nodes responsible for the object. When a majority of data nodes have provided metadata on an object, Garage can answer the request.

Request arrives

A client makes a request for an object in a bucket

Query preferred nodes

Assuming 3 replicas (recommended), make requests to the two preferred nodes for object metadata

Try backup node

If one of the two initial requests fails, try the third node

Establish quorum

Check that metadata from at least 2 nodes match

Verify not deleted

Check that the object hasn’t been marked deleted

Return data

If object is small enough: Answer with inline data from metadata
Otherwise: Get data blocks from preferred nodes and answer with assembled object

Garage dynamically determines which nodes to query based on health, preference, and which nodes actually host the data.
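
The steps above can be sketched as a small function. This is a simplified model under stated assumptions, not Garage's implementation: `Meta`, `ReadResult`, and `read_with_quorum` are hypothetical names, node responses are passed in as a slice ordered by preference, and a `None` entry stands for a failed request.

```rust
/// Hypothetical object metadata as returned by one node.
#[derive(Clone, Debug, PartialEq)]
struct Meta {
    version: u64,
    deleted: bool,
}

/// Possible outcomes of the read path sketched here.
#[derive(Debug, PartialEq)]
enum ReadResult {
    Found(Meta),
    Deleted,
    NoQuorum,
}

/// `responses` is ordered by node preference; `None` models a failed request.
/// Nodes are consulted in order, so a later node is only needed when an
/// earlier one failed or disagreed.
fn read_with_quorum(responses: &[Option<Meta>], quorum: usize) -> ReadResult {
    let mut seen: Vec<Meta> = Vec::new();
    for response in responses {
        if let Some(meta) = response {
            seen.push(meta.clone());
            // Count how many answers so far carry exactly this metadata.
            let matching = seen.iter().filter(|m| *m == meta).count();
            if matching >= quorum {
                return if meta.deleted {
                    ReadResult::Deleted
                } else {
                    ReadResult::Found(meta.clone())
                };
            }
        }
    }
    ReadResult::NoQuorum
}
```

With 3 replicas and a quorum of 2, two matching answers from the preferred nodes are enough; if one of them fails, the third node's answer can complete the quorum.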

No primary concept

Garage has no concept of “primary” nodes. Any healthy node with the data can be used as long as a quorum is reached for the metadata. Benefits:

Better load distribution
No single point of failure
Faster responses (use closest/fastest node)

Node management

Node health monitoring

Garage maintains cluster health through active monitoring:

TCP session management: Keeps a TCP session open to each node
Periodic pinging: Regularly pings all nodes to verify connectivity
Failure detection: Marks nodes as failed if:
- Connection cannot be established
- Node fails to answer multiple pings
Routing decisions: Failed nodes are not used for quorum or internal requests
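
The failure-detection rule can be illustrated with a small state tracker. This is a sketch only: the `Peer` type and the threshold of 3 missed pings are assumptions, since the text above says only that a node must fail "multiple pings".

```rust
/// Hypothetical threshold: the source only says "multiple pings".
const MAX_MISSED_PINGS: u32 = 3;

#[derive(Debug, PartialEq)]
enum Health {
    Up,
    Failed,
}

/// Hypothetical per-peer state kept by the monitoring loop.
struct Peer {
    connected: bool,
    missed_pings: u32,
}

impl Peer {
    fn new() -> Self {
        Peer { connected: true, missed_pings: 0 }
    }

    /// Record the outcome of one ping; an answer resets the miss counter.
    fn record_ping(&mut self, answered: bool) {
        if answered {
            self.missed_pings = 0;
        } else {
            self.missed_pings += 1;
        }
    }

    /// A peer is failed if the connection is down or too many pings were missed.
    fn health(&self) -> Health {
        if !self.connected || self.missed_pings >= MAX_MISSED_PINGS {
            Health::Failed
        } else {
            Health::Up
        }
    }
}
```

A router built on this would simply skip any peer whose `health()` is `Health::Failed` when choosing nodes for a quorum.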

Node preference

Garage prioritizes which nodes to query according to specific criteria:

Self-preference

A node always prefers itself if it can answer the request

Same-zone preference

The node prioritizes nodes in the same zone

Latency optimization

Finally, nodes with the lowest latency are prioritized

This ordering keeps requests as local as possible: a node serves from itself first, then from its own zone, and only then from the lowest-latency remote nodes.
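
The three criteria can be expressed as a single sort key. The following is a minimal sketch, assuming a hypothetical `Candidate` record with a node id, a zone name, and a measured latency; none of these names come from Garage itself.

```rust
/// Hypothetical description of a node that could answer a request.
#[derive(Clone, Debug, PartialEq)]
struct Candidate {
    id: u32,
    zone: String,
    latency_ms: u32,
}

/// Order candidates by: self first, then same zone, then lowest latency.
/// In the tuple key, `false` sorts before `true`, so matches come first.
fn order_by_preference(mut nodes: Vec<Candidate>, self_id: u32, self_zone: &str) -> Vec<Candidate> {
    nodes.sort_by_key(|n| (n.id != self_id, n.zone != self_zone, n.latency_ms));
    nodes
}
```

For example, a node with id 4 in zone `lyon` would place itself first, then other `lyon` nodes by latency, then nodes in other zones by latency.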

Garbage collection

Garbage collection is critical for maintaining data integrity. A faulty garbage collection procedure caused critical bug #39, which led to significant improvements in PR #135.

Rationale: We must ensure Garage’s safety by preventing premature deletion of needed data.

Table entry garbage collection

The Entry trait for table entries defines an is_tombstone() function that returns true for deleted entries.

CRDT semantics and tombstones: CRDT semantics keep all tombstones by default because they’re necessary for reconciliation:

If node A has a tombstone superseding value x
And node B has value x
Node A must keep the tombstone to properly delete x at node B
Otherwise, value x would flow back from B to A (deleted item reappears)
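
The scenario above can be reproduced with a toy last-writer-wins entry. This is a sketch under assumptions: the `Entry` struct here is a simplified stand-in (not Garage's `Entry` trait), timestamps decide which version wins, and `merge_into` is a hypothetical one-way sync step.

```rust
/// Hypothetical last-writer-wins entry: a higher timestamp supersedes a lower one.
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    timestamp: u64,
    /// `None` marks a tombstone (deleted entry).
    value: Option<String>,
}

impl Entry {
    fn is_tombstone(&self) -> bool {
        self.value.is_none()
    }

    /// CRDT-style merge: keep whichever side has the newer timestamp.
    fn merge(&mut self, other: &Entry) {
        if other.timestamp > self.timestamp {
            *self = other.clone();
        }
    }
}

/// Push `remote`'s state for one key into `local`.
/// `None` means the node stores nothing at all for this key.
fn merge_into(local: &mut Option<Entry>, remote: &Option<Entry>) {
    if let Some(r) = remote {
        match local {
            Some(l) => l.merge(r),
            None => *local = Some(r.clone()),
        }
    }
}
```

If node A still holds the tombstone, syncing in both directions leaves both nodes with the tombstone. If A has already dropped it (its slot is `None`), the same sync copies value x from B back to A, which is the reappearance described above.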

Safe tombstone deletion: The garbage collector (implemented in table/gc.rs) can delete a tombstone only when the following condition holds:

All nodes responsible for storing this entry are aware of the tombstone’s existence, ensuring they cannot hold a superseded version.

Garage uses atomic database operations (compare-and-swap and transactions) to ensure only properly propagated tombstones are deleted.
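
As a rough illustration of these two safeguards, the sketch below deletes a tombstone only if every responsible node has acknowledged it and the stored bytes are still the tombstone that was observed earlier. It uses an in-memory `HashMap` in place of a real database, and `gc_tombstone` and its parameters are hypothetical names, not the contents of table/gc.rs.

```rust
use std::collections::{HashMap, HashSet};

/// Delete `key` only if (1) every responsible node acknowledged the tombstone
/// and (2) the stored bytes still equal the tombstone observed earlier.
/// Step (2) mimics a compare-and-swap: a concurrent write changes the bytes
/// and makes the deletion a no-op. Returns true if the entry was removed.
fn gc_tombstone(
    store: &mut HashMap<String, Vec<u8>>,
    key: &str,
    observed_tombstone: &[u8],
    responsible: &HashSet<u32>,
    acked: &HashSet<u32>,
) -> bool {
    if !responsible.is_subset(acked) {
        return false;
    }
    let unchanged = store
        .get(key)
        .map_or(false, |current| current.as_slice() == observed_tombstone);
    if unchanged {
        store.remove(key);
    }
    unchanged
}
```

In a real store the comparison and the removal would have to happen atomically; here the exclusive `&mut` borrow of the map plays that role.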

Data block garbage collection

Blocks in the data directory are reference-counted. Safety measures (introduced in PR #135):

Tombstone delay

24-hour delay before anything is garbage collected in a table

Block deletion delay

10-minute interval after RC reaches zero before blocks can be deleted

Why 10 minutes for blocks? This is a compromise between:

Safety: Prevents race conditions like bug #39
Practicality: Avoids disk space explosion during rebalancing (which would occur with a 24-hour delay)
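
The block deletion delay can be modeled as a small state machine. This is a sketch, assuming a hypothetical `BlockRc` type; only the 10-minute value comes from the text above.

```rust
use std::time::{Duration, Instant};

/// Delay taken from the text: 10 minutes after the reference count reaches zero.
const BLOCK_GC_DELAY: Duration = Duration::from_secs(10 * 60);

/// Hypothetical reference-count state for one data block.
#[derive(Debug, PartialEq)]
enum BlockRc {
    /// The block is referenced by this many objects.
    Present(u64),
    /// No references remain; the block may be deleted once `at` is reached.
    Deletable { at: Instant },
}

impl BlockRc {
    /// A new reference cancels any pending deletion.
    fn increment(&mut self) {
        *self = match *self {
            BlockRc::Present(n) => BlockRc::Present(n + 1),
            BlockRc::Deletable { .. } => BlockRc::Present(1),
        };
    }

    /// Dropping the last reference starts the deletion delay.
    fn decrement(&mut self, now: Instant) {
        if let BlockRc::Present(n) = *self {
            *self = if n > 1 {
                BlockRc::Present(n - 1)
            } else {
                BlockRc::Deletable { at: now + BLOCK_GC_DELAY }
            };
        }
    }

    /// True only when the count is zero and the delay has elapsed.
    fn can_delete(&self, now: Instant) -> bool {
        matches!(*self, BlockRc::Deletable { at } if now >= at)
    }
}
```

A block whose count drops to zero and is referenced again within the delay returns to `Present(1)` and is never deleted, which is the kind of race the delay is meant to absorb.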

Rebalancing considerations

When a partition is moving between nodes during rebalancing, the offload process takes time.

The challenge: During the offload interval, the GC doesn’t check with the offloading node before deleting tombstones. If that node hasn’t received the tombstone before offload completes, old data could reappear.

Current solution: The 24-hour delay works under the assumption that rebalances complete within 24 hours.

Future improvements: In distributed systems, time-based assumptions are generally considered bad practice (synchrony assumptions). To maximize Garage’s applicability, we’d like to:

Find a way to safely disable GC during data shuffling
Safely detect when shuffling has terminated
Resume GC after completion

This introduces protocol complexity and hasn’t been tackled yet.

Consistency model

Garage uses eventual consistency with quorum-based operations:

Writes: Succeed when acknowledged by a majority of replicas
Reads: Return data when a majority of replicas agree on object state
Conflicts: Resolved through CRDT merge semantics
Tombstones: Maintained until safe to delete

This model prioritizes availability and partition tolerance over strong consistency (AP in CAP theorem).
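
The reason majority quorums help is simple arithmetic: any two majorities of the same replica set share at least one node, so a majority read always contacts at least one replica that acknowledged the latest majority write. A minimal sketch, with hypothetical helper names:

```rust
/// Smallest number of replicas that forms a majority.
fn majority(replicas: usize) -> usize {
    replicas / 2 + 1
}

/// A read quorum and a write quorum are guaranteed to share a node
/// when their sizes add up to more than the number of replicas.
fn quorums_overlap(replicas: usize, read_quorum: usize, write_quorum: usize) -> bool {
    read_quorum + write_quorum > replicas
}
```

With the recommended 3 replicas, `majority(3)` is 2, and `quorums_overlap(3, 2, 2)` holds because 2 + 2 > 3.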
