
osd: tiering: object redirects

Summary

Create a RADOS redirect primitive and methods for making use of it.  A redirect should function analogously to a symlink, allowing an object to be moved to a different pool while remaining transparently accessible to clients that look in the old location.  This would be the underlying infrastructure to support tiering.

Owners

  • Sage Weil (Inktank)

Interested Parties

Current Status

 

Detailed Description

--- data types ---

terminology

origin: original object in original location

target: alternative location of object 

new fields for object_info_t:

 
enum redir_state;                ///< [origin, target]
object_locator_t redir_oloc;     ///< [origin] locator for target object
eversion_t redir_version;        ///< [origin, target] when this redirect was set to this target
u8 flags;                        ///< [origin]
object_locator_t owner_oloc;     ///< [target] locator for the origin
eversion_t owner_user_version;   ///< [target] user_version, not version!
 
where the origin states are:
 
 NONE
 REDIRECT    we are pointing to another object
 PROMOTING    we are copying the target object back to the origin location
 DEMOTING    we are copying the origin (primary) object to the target location
 CLEANUP   we have the object, but need to delete the demoted object
 DELETING   local object is logically non-existent, but we need to clean up target location.
 
flags are:
 PROMOTE_ON_READ
 PROMOTE_ON_WRITE
 
- we may want to make PROMOTE_ON_WRITE the only behavior for the initial implementation.
 
- the demoted (target) object has only 2 states:
 
 NONE
 TARGET      we are pointed to by the origin (primary)
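
A minimal sketch of how the fields, states, and flags above could sit together in one struct. The locator and version types are simplified stand-ins for Ceph's `object_locator_t` and `eversion_t`; the name `redirect_info_t`, the flag bit values, and the choice to share a single enum between origin and target states are assumptions made for illustration, not part of the design.

```cpp
#include <cstdint>
#include <string>

// Simplified stand-ins so the sketch compiles on its own (assumptions).
struct object_locator_t {
  int64_t pool = -1;
  std::string key;
};
struct eversion_t {
  uint32_t epoch = 0;
  uint64_t version = 0;
};

// Origin states plus the single target state, in one enum (assumption).
enum class redir_state_t : uint8_t {
  NONE,       // not involved in a redirect
  REDIRECT,   // origin: pointing to another object
  PROMOTING,  // origin: copying the target back to the origin location
  DEMOTING,   // origin: copying the object out to the target location
  CLEANUP,    // origin: have the object, target still needs deleting
  DELETING,   // origin: logically gone, target still needs deleting
  TARGET      // target: pointed to by an origin
};

// Flag bits; the numeric values are illustrative.
enum : uint8_t {
  REDIR_FLAG_PROMOTE_ON_READ  = 1 << 0,
  REDIR_FLAG_PROMOTE_ON_WRITE = 1 << 1,
};

// The new object_info_t fields, grouped into a hypothetical struct.
struct redirect_info_t {
  redir_state_t redir_state = redir_state_t::NONE;  // [origin, target]
  object_locator_t redir_oloc;      // [origin] locator for target object
  eversion_t redir_version;         // [origin, target] when redirect was set
  uint8_t flags = 0;                // [origin]
  object_locator_t owner_oloc;      // [target] locator for the origin
  eversion_t owner_user_version;    // [target] user_version, not version

  bool is_origin_redirect() const { return redir_state == redir_state_t::REDIRECT; }
  bool is_target() const { return redir_state == redir_state_t::TARGET; }
};
```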
 
- primary osd will handle object promote, demote operations (copying to/from alternate location)
  - use backend cluster interface to avoid deadlock from throttling ( loic : how can it deadlock from throttling ? sage: hmm, might not be a problem, as long as no recovery operations can block on the redirect state. )
 
- objecter can also do a SET_REDIRECT operation:
   - will erase local object and set redirect metadata
 
- return redirect metadata with GET_REDIRECT ( loic : without GET_REDIRECT it would transparently try again when receiving an EAGAIN, in the same way an http client would on a 302 ? sage: yeah this is like lstat()... we want to find out if we are a redirect origin or target )
 
--- osd behavior ---
 
on read (no flags):
 NONE, DEMOTING, CLEANUP: do the read
 REDIRECT: send EAGAIN with redirect metadata to client
 PROMOTING: block or forward. ( loic : what does "forward" mean in this context ? I would understand "block then do the read" )
 DELETING: enoent
 
on read (PROMOTE_ON_READ):
 NONE, CLEANUP: do the read
 DEMOTING: abort the demotion, move to CLEANUP, and do the read
 REDIRECT: move to PROMOTING, block then do the read
 PROMOTING: block then do the read
 DELETING: enoent
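
The two read tables above can be sketched as one decision function. The names `State`, `ReadAction`, `ReadDecision`, and `on_read` are hypothetical. For PROMOTING without flags the text says "block or forward"; this sketch picks "block", which is an assumption.

```cpp
// Origin-side states from the data types section.
enum class State { NONE, REDIRECT, PROMOTING, DEMOTING, CLEANUP, DELETING };

// How the OSD answers the read (illustrative names).
enum class ReadAction { DO_READ, EAGAIN_WITH_REDIRECT, BLOCK_THEN_READ, ENOENT_REPLY };

struct ReadDecision {
  State next;         // origin state after handling the op
  ReadAction action;  // how the read itself is answered
};

ReadDecision on_read(State s, bool promote_on_read) {
  switch (s) {
    case State::NONE:
    case State::CLEANUP:
      return {s, ReadAction::DO_READ};
    case State::DEMOTING:
      // With PROMOTE_ON_READ the demotion is aborted; either way the
      // local copy is still present, so the read is served locally.
      if (promote_on_read) return {State::CLEANUP, ReadAction::DO_READ};
      return {s, ReadAction::DO_READ};
    case State::REDIRECT:
      if (promote_on_read) return {State::PROMOTING, ReadAction::BLOCK_THEN_READ};
      return {s, ReadAction::EAGAIN_WITH_REDIRECT};
    case State::PROMOTING:
      // "block or forward" in the no-flags table; this sketch blocks.
      return {s, ReadAction::BLOCK_THEN_READ};
    case State::DELETING:
      return {s, ReadAction::ENOENT_REPLY};
  }
  return {s, ReadAction::ENOENT_REPLY};  // not reached for valid states
}
```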
 
on write (no flags):
 DEMOTING: block
 REDIRECT: forward
 PROMOTING: block
 DELETING: move to CLEANUP, proceed.
 
on write (PROMOTE_ON_WRITE):
 DEMOTING:
   move to CLEANUP
 REDIRECT:
   move to PROMOTING, block
 PROMOTING: block
 DELETING: move to CLEANUP, proceed.
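
The write tables can be sketched the same way. All names are hypothetical. The tables do not list NONE or CLEANUP for writes, and do not say what happens to the write itself after DEMOTING moves to CLEANUP; this sketch assumes the write proceeds in those cases because the data is local.

```cpp
// Origin-side states from the data types section.
enum class State { NONE, REDIRECT, PROMOTING, DEMOTING, CLEANUP, DELETING };

// How the OSD handles the write (illustrative names).
enum class WriteAction { PROCEED, BLOCK, FORWARD_TO_TARGET };

struct WriteDecision {
  State next;          // origin state after handling the op
  WriteAction action;  // what happens to the write
};

WriteDecision on_write(State s, bool promote_on_write) {
  switch (s) {
    case State::NONE:
    case State::CLEANUP:
      // Not listed in the tables; assumed to proceed on the local object.
      return {s, WriteAction::PROCEED};
    case State::DEMOTING:
      // Assumption: once the demotion is abandoned the write can proceed.
      if (promote_on_write) return {State::CLEANUP, WriteAction::PROCEED};
      return {s, WriteAction::BLOCK};
    case State::REDIRECT:
      if (promote_on_write) return {State::PROMOTING, WriteAction::BLOCK};
      return {s, WriteAction::FORWARD_TO_TARGET};
    case State::PROMOTING:
      return {s, WriteAction::BLOCK};
    case State::DELETING:
      // The write recreates the object; the old target still needs cleanup.
      return {State::CLEANUP, WriteAction::PROCEED};
  }
  return {s, WriteAction::BLOCK};  // not reached for valid states
}
```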
 
on delete:
 DEMOTING, REDIRECT, PROMOTING, CLEANUP: move to DELETING and queue target object for deletion (as with CLEANUP)
 DELETING: no change.
 
on any op:
 TARGET: verify the redir_version matches, or EAGAIN
 
- if we are following a redirect and the target does not exist, or its version does not match what the origin (primary) had, retry
 
- the CLEANUP and DELETING states mean the osd needs to remove the redirect and then transition to NONE or delete (respectively)
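
A small sketch of the target-side check described above: an op that arrives at a TARGET object must carry the redirect version the origin handed out, otherwise the client is told to retry. `Version` is a simplified stand-in for `eversion_t`, and `check_target_op` and its parameters are hypothetical.

```cpp
#include <cerrno>
#include <cstdint>

// Simplified stand-in for eversion_t (assumption).
struct Version {
  uint32_t epoch;
  uint64_t version;
};

inline bool operator==(const Version& a, const Version& b) {
  return a.epoch == b.epoch && a.version == b.version;
}

// Returns 0 if the op may proceed, -ENOENT if the object is missing,
// or -EAGAIN if the client followed a stale redirect.
int check_target_op(bool object_exists, bool is_target,
                    const Version& stored_redir_version,
                    const Version& client_redir_version) {
  if (!object_exists) return -ENOENT;
  if (is_target && !(stored_redir_version == client_redir_version)) {
    return -EAGAIN;  // redirect changed since the origin replied
  }
  return 0;
}
```

In both failure cases the client is expected to go back to the origin and start over.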
 
--- objecter behavior ---
 
- send op to normal location
- on EAGAIN with redirect metadata,
  - note redirect version
  - if this is a retry and version hasn't changed, return error to caller.
  - resend op to alternate location, *including* the primary's eversion_t
  - if we get an error (ENOENT on read), retry from the top
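
The client-side loop above can be sketched as follows, with the network sends abstracted into callbacks. Everything here is hypothetical scaffolding: `Redirect`, `Reply`, `submit_op`, the `max_rounds` bound, and the choice of `-EAGAIN` as the error returned when a retry sees an unchanged redirect version (the text only says "return error to caller").

```cpp
#include <cerrno>
#include <cstdint>
#include <functional>
#include <optional>

struct Redirect {
  int64_t pool;      // where the target lives
  uint64_t version;  // redirect version noted by the client
};

struct Reply {
  int result;                        // 0 or a negative errno
  std::optional<Redirect> redirect;  // set when result is -EAGAIN from an origin
};

using SendToOrigin = std::function<Reply()>;
using SendToTarget = std::function<Reply(const Redirect&)>;

int submit_op(const SendToOrigin& send_origin,
              const SendToTarget& send_target,
              int max_rounds = 8) {
  std::optional<uint64_t> last_version;
  for (int round = 0; round < max_rounds; ++round) {
    Reply r = send_origin();  // send op to normal location
    if (r.result != -EAGAIN || !r.redirect) return r.result;

    // Retry with an unchanged redirect version: give up.
    if (last_version && *last_version == r.redirect->version) return -EAGAIN;
    last_version = r.redirect->version;  // note redirect version

    // Resend to the alternate location, including the origin's version.
    Reply t = send_target(*r.redirect);
    if (t.result != -ENOENT && t.result != -EAGAIN) return t.result;
    // Target missing or stale: retry from the top.
  }
  return -EAGAIN;
}
```

If the origin is promoted between the first and second attempt, the second `send_origin` call returns a normal result and the loop ends there.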
 
--- pg log events ---
 
redir_demote_start -- we are now allowed to start copying to target pool.  move to DEMOTING
redir_demote_finish -- target is in place; delete local data and set redirect metadata. move to REDIRECT
redir_promote_cleanup -- did copy from target back to origin; still need to clean up old target.  move to CLEANUP
redir_cleanup_finish -- old target is cleaned up.  move to NONE
redir_delete_start -- can remove target, move to DELETING
remove (existing event) -- finished removing target, delete object.
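
Since each event above names the state it moves to, log replay can be sketched as a fold over the events. The enum and function names are hypothetical. Note that the list has no event for entering PROMOTING, so this sketch does not model that transition rather than inventing one.

```cpp
#include <optional>
#include <vector>

// Origin-side states from the data types section.
enum class State { NONE, REDIRECT, PROMOTING, DEMOTING, CLEANUP, DELETING };

// One enumerator per pg log event listed above (illustrative names).
enum class LogEvent {
  DEMOTE_START,     // redir_demote_start
  DEMOTE_FINISH,    // redir_demote_finish
  PROMOTE_CLEANUP,  // redir_promote_cleanup
  CLEANUP_FINISH,   // redir_cleanup_finish
  DELETE_START,     // redir_delete_start
  REMOVE            // existing remove event
};

// State after an event; nullopt means the object no longer exists.
std::optional<State> state_after(LogEvent e) {
  switch (e) {
    case LogEvent::DEMOTE_START:    return State::DEMOTING;
    case LogEvent::DEMOTE_FINISH:   return State::REDIRECT;
    case LogEvent::PROMOTE_CLEANUP: return State::CLEANUP;
    case LogEvent::CLEANUP_FINISH:  return State::NONE;
    case LogEvent::DELETE_START:    return State::DELETING;
    case LogEvent::REMOVE:          return std::nullopt;
  }
  return std::nullopt;  // not reached for valid events
}

// Replay a sequence of events for one object, starting from `s`.
std::optional<State> replay(std::optional<State> s,
                            const std::vector<LogEvent>& log) {
  for (LogEvent e : log) s = state_after(e);
  return s;
}
```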
 
--- common races ---
 
- read vs demote
  - if we hit primary while DEMOTING, we get the result
  - if we get EAGAIN, we read from the demoted copy
 
- read vs promote (or read vs demote+promote)
  - try primary
  - if REDIRECT:
    - EAGAIN, try alternate location
    - result, or ENOENT and back to primary (and block->success or success)
  - if PROMOTING, block, then success
 
--- in-memory osd state ---
 
For each PG, we maintain:
  • set<Demotion*> redir_demoting;   ///< all pending demotions
  • set<Promotion*> redir_promotion; ///< all pending promotions
  • set<Cleanup*> redir_cleanup;       ///< all pending cleanups/deletions

These structs will have a ref to the ObjectContext and will need to orchestrate the push/pull to do the promotion/demotion.  They will reuse all of the push/pull helpers used by recovery.

 

--- snapshots ---

We can start with a simple approach, and add more complex behavior from there.

  1. Force promote-on-write if a non-empty SnapContext is specified.  This ensures that all the snap metadata lives in the main pool and makes sense.  Similarly, we refuse to demote anything that is snapped.
  2. Allow snaps to be demoted.  For the primary pool, recovery needs to be adjusted so that the clone_range stuff falls back to a full copy when the snap is a redirect.  In the target pool, recovery needs to behave when we have a subset of the snapset... i.e. just the snapped object.  It may be simplest if it is not a snap at all: foo @12 -> foo_$version @nosnap with key foo.  And writes/cow never happen in the cold pool.

 

--- clonerange ---

If a source item for a clonerange is a redirect, block and promote.

 

Work items

Coding tasks

  1. osd: add object_info_t fields for redirects
  2. add redirect metadata to MOSDOp, MOSDOpReply.  
  3. add a feature bit.
  4. osd, objecter, librados, api tests: SET_REDIRECT, GET_REDIRECT operations
  5. osd: basic redirect logic: reply with EAGAIN on primary, verify or EAGAIN on target.  
  6. osd: EINVAL or similar if client lacks feature.
  7. objecter: handle EAGAIN redirects
  8. osd: pg log entries to indicate state changes (none -> demoting -> redirect -> promoting -> cleanup, deleting, etc.)
  9. osd: per-PG map of pending redirect states (demoting, promoting, cleanup, tombstone)
  10. osd: log replay to update pending redirect states
  11. osd: support deletion.  refactoring to support tombstones.
  12. osd: promote
  13. osd: demote
  14. osd: allow snap

Build / release tasks

  1. add promote/demote to RadosModel

Documentation tasks

  1. Task 1

 

Last Modified
18:56, 9 Dec 2013
