- Yehuda Sadeh (Inktank)
- Greg Farnum
- Sage Weil
- Loic Dachary
- Christophe Courtaut firstname.lastname@example.org
- Florian Haas
- Daniele Stroppa (ZHAW)
This feature is slated for the Dumpling release and implementation is underway, but additional assistance (to accelerate the schedule, provide more functionality, and reduce schedule risk) is welcome.
- regions are large, distinct geographic areas. A region is made up of multiple zones.
- a particular bucket is created and replicated only within a single region
- user metadata is replicated across all regions
- zones are geographically separated sites, sufficiently independent that they are unlikely to be affected by a single disaster.
- a bucket can be replicated to multiple zones within that region
- each bucket has (at any given time) a designated master zone, which is the only zone from which that bucket can be written
- all other zones (backup zones) have read-only access to that bucket, though the master zone for a bucket can be changed at any time
- the master/backup designation applies to particular buckets. A zone that is a backup for one set of buckets can be master for others.
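The per-bucket master/backup relationships above can be sketched as a small data model. This is purely illustrative; the class and field names are invented and do not reflect actual RGW code or metadata formats.

```python
# Illustrative model of the bucket/zone relationships described above.
# All names here are assumptions, not actual RGW structures.
from dataclasses import dataclass, field

@dataclass
class Bucket:
    name: str
    master_zone: str                                 # only zone accepting writes
    replica_zones: set = field(default_factory=set)  # read-only copies

    def can_write(self, zone: str) -> bool:
        # writes are only accepted at the current master zone
        return zone == self.master_zone

    def promote(self, zone: str) -> None:
        # the master zone for a bucket can be changed at any time;
        # the old master becomes a read-only backup
        assert zone == self.master_zone or zone in self.replica_zones
        if zone != self.master_zone:
            self.replica_zones.add(self.master_zone)
            self.replica_zones.discard(zone)
            self.master_zone = zone
```

Note that master/backup status lives on the bucket, not the zone: one zone can be master for some buckets while serving as a backup for others.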
The basic replication model is:
- master zones maintain logs of both user-metadata and bucket data updates
- remote sites can use (new) RESTful APIs to get information about recent updates
- backup-zone replication agents will use these APIs to track changes in master zones, pull the updated information, and replay those same changes locally
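The pull-and-replay step above can be sketched as follows. The endpoint path, query parameters, and log-entry schema here are assumptions for illustration; the actual RESTful log APIs will differ.

```python
# Minimal sketch of a backup-zone agent's pull-and-replay step,
# assuming an invented log-entry format: {"marker", "object", "op", "data"}.
import json
import urllib.request

def fetch_updates(master_url, marker=""):
    # Poll the master zone's update log via the (new) RESTful API.
    # This endpoint path is hypothetical.
    url = f"{master_url}/admin/log?type=data&marker={marker}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def replay_batch(local_store, entries):
    """Apply master-zone log entries to the local backup store, in order,
    and return the marker of the last entry applied (the resume point)."""
    marker = None
    for e in entries:
        if e["op"] == "put":
            local_store[e["object"]] = e["data"]
        elif e["op"] == "delete":
            local_store.pop(e["object"], None)
        marker = e["marker"]
    return marker
```

An agent would loop: fetch entries after its last saved marker, replay them, persist the new marker, and sleep — the marker is what makes the pull idempotent and restartable after a link or site failure.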
This mechanism provides eventual consistency. Backup zones will eventually see all master-zone updates, but the delay between master-zone operations and backup-zone replay means that clients in the backup zones will sometimes see stale data. There are, however, many benefits to asynchronous, eventual-consistency, pull replication:
- it is highly robust in the face of link and site failures
- it does not force master-zone updates to wait for backup-zones to acknowledge (or catch up with) changes
- it can support arbitrary numbers of replicas
- it can support the creation of new mirrors at any time (long after the original data creation)
- it can be done very efficiently (compressing out multiple updates to the same object)
- while there is a replication delay, it can easily be tuned to be anywhere from seconds to years
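The "compressing out multiple updates" point above can be illustrated concretely: when a backup zone catches up after a delay, only the latest update per object needs to be transferred. The entry format below is invented for illustration.

```python
# Sketch of log compaction: collapse multiple updates to the same
# object into the most recent one. Entry format is an assumption.

def compact(entries):
    """Keep only the newest log entry per object, preserving log order."""
    latest = {}
    for e in entries:            # entries arrive oldest -> newest,
        latest[e["object"]] = e  # so later entries overwrite earlier ones
    return sorted(latest.values(), key=lambda e: e["seq"])

log = [
    {"seq": 1, "object": "a", "op": "put"},
    {"seq": 2, "object": "b", "op": "put"},
    {"seq": 3, "object": "a", "op": "put"},  # supersedes seq 1
]
compacted = compact(log)
# only seq 2 and seq 3 remain
```

The longer the replication interval, the more redundant updates fall out this way, which is why the delay can be tuned anywhere from seconds to years without transferring every intermediate version.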