Posted to notifications@couchdb.apache.org by "Jay Doane (JIRA)" <ji...@apache.org> on 2016/07/27 04:52:20 UTC

[jira] [Created] (COUCHDB-3083) Include uuid in db_header epoch tuples

Jay Doane created COUCHDB-3083:
----------------------------------

             Summary: Include uuid in db_header epoch tuples
                 Key: COUCHDB-3083
                 URL: https://issues.apache.org/jira/browse/COUCHDB-3083
             Project: CouchDB
          Issue Type: Improvement
          Components: Database Core
            Reporter: Jay Doane


This proposal is one result of an email discussion I started, asking which cluster elasticity operations should transfer a shard unchanged, and which operations should alter the db_header uuids when the shard is transferred to a different node.

Ultimately Adam proposed this idea:

# Extend the epoch tuple from {node(), seq()} to {uuid(), node(), seq()}
# Change #db_header.uuid when a file is opened on a node() different from the one in the latest entry of #db_header.epochs (a rough sketch follows)
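
For illustration, here is a rough sketch of how an epochs list might look before and after this change. The node names, uuids, ordering, and variable names are illustrative only, not taken from the actual #db_header record:

{code}
%% Today each epoch entry records only the owning node and the update
%% sequence at which it took over (newest entry shown first).
Epochs = [{'node2@127.0.0.1', 5000},
          {'node1@127.0.0.1', 0}].

%% Proposed: each entry also carries the uuid that was active while that
%% node owned the file, i.e. {uuid(), node(), seq()}.
Epochs2 = [{<<"f34cf4d1e5e01e83f10bf70e7e9845d6">>, 'node2@127.0.0.1', 5000},
           {<<"0c6ecae5ff84d8d1dd51a536d69b4a7a">>, 'node1@127.0.0.1', 0}].
{code}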

The lightly edited email thread is included below:
{quote}
Jay:

I also would like to close the loop on risk associated with multiple
shard files which have the same UUID.  Note that if we ultimately
decide we *do* need to rewrite headers for now, code already exists
for that, and it doesn't seem particularly expensive in terms of
resource usage. Still, it would be even better if we didn't need to
worry about it at all, so I propose we lay the possible risks on the
table.
 
Bob:

Given that headers are byte aligned, that we can detect them by
inspecting a single byte, and that any two uuids are the same length,
it seemed to me it would be efficient to rewrite them during
transmission.
 
To the core question of whether we should or shouldn't, I think it
does depend on our intentions.
 
If we are simply moving a shard then we should not rewrite the
uuid. Not rewriting the uuid will allow replications to continue where
they left off (we have code for detecting when a shard has 'travelled'
in this manner). If we are copying a shard to make an additional copy
(e.g., if we were down to two copies and are bootstrapping a third
copy), then we _should_ rewrite the uuid. The three copies of a shard
range should have different uuids. By rewriting during the copy we are
doing semantically the same as letting a new shard (which would get a
random uuid at creation time) regrow via internal replication.
 
In the former case of a move, not rewriting the uuid is an
optimization. In the latter case of repopulation, rewriting the uuid
seems important (not least because it's surprising to have duplicate
unique ids). Seeing a shard with a uuid you 'knew' (from a packed seq
value) implies you know its full history (the order of updates). That
is not true for these two copies from the moment they each receive
updates after this copy event. Their history is, obviously, identical
up to that point. While the travel history of the two will be
different at that moment, my concern is that, over time, we might move
these copies around, perhaps many times, and how confusing would that be?

Adam:

The invariant that we promise today is that the {node, uuid, sequence}
tuple uniquely maps to a document+revid. In the world where UUIDs are
not "UU" it becomes possible to break that invariant (say, by copying
from db1 to db2 to recover db2, and then subsequently copying back
from db2 to db1 to recover db1 at a later date). Rewriting the UUID
during node-loss recovery transfers addresses that concern.
 
Assuming we do rewrite the UUID on the fly, do we lose out on a
significant internal replication optimization? My recollection is that
we have some fairly sophisticated code in mem3_sync now which allows
it to fast forward as much as possible using all of the previous epoch
information when e.g. we do a cluster expansion.
 
We really only need to change the UUID associated with new updates on
the file transferred to the recovered node, but of course that's not
an option in the current data structure. It's almost as if you'd want
to add the UUID to the epoch tuple. I don't mean to overcomplicate
matters here, but I am curious about what we lose out on when we do
the rewrite.

Bob:

We're agreeing. We do lose out if we rewrite the uuid for a simple
shard move (i.e., a rebalance).
 
To do this perfectly, we'd need to have a series of uuids and know the
sequence range they are valid for. If we had that, then the procedure
would be:
 
1) copy file verbatim
2) extend the uuid sequence by a) marking the current uuid as ending
   on seq N, where N is the current sequence; b) marking seq N+1 as
   under a new uuid.
 
something like
 
[{0, <<"0c6ecae5ff84d8d1dd51a536d69b4a7a">>}, {5000, <<"f34cf4d1e5e01e83f10bf70e7e9845d6">>}]
 
Though this is getting awfully cute, since we already use node+uuid
to detect a move. With the above, we could instead start a new uuid
epoch, to the same purpose. On the plus side, it further decouples us
from the Erlang node name: we're looking for a uuid match, not a
node+uuid match. But we could have done it this way before; did we
choose not to on purpose?

Adam:

I was thinking of it a bit differently:
Extend the epoch tuple from {node(), seq()} to {uuid(), node(), seq()}

Change #db_header.uuid when a file is opened on a node() different
than the latest entry in #db_header.epochs

This gives you a true UUID for every active file in the cluster while
also allowing you to accurately compute lineage for that file, even in
cases like the slightly pathological db1 -> db2 -> db1 path I
described earlier in this thread.

Bob:

Oh sure. That's much tidier.
{quote}
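
To make the proposal concrete, here is a minimal sketch of the open-time check described above. This is not the actual couch_db_updater code: the #db_header record is reduced to the relevant fields, maybe_update_uuid/2 is a hypothetical helper, and it assumes epochs are kept newest-first.

{code}
%% Minimal sketch only: simplified record, hypothetical helper name, and
%% an assumed newest-first ordering of epochs.
-record(db_header, {uuid, epochs = [], update_seq = 0}).

%% Called when a database file is opened on node `Node'.
maybe_update_uuid(#db_header{epochs = [{_Uuid, Node, _Seq} | _]} = Header, Node) ->
    %% Same node as the latest epoch: keep the current uuid, so a plain
    %% shard move remains a cheap, replication-friendly copy.
    Header;
maybe_update_uuid(#db_header{epochs = Epochs, update_seq = Seq} = Header, Node) ->
    %% Different node: mint a fresh uuid and start a new
    %% {uuid(), node(), seq()} epoch. Earlier sequences stay attributable
    %% to the uuid under which they were originally written.
    NewUuid = couch_uuids:random(),  %% any uuid generator works here
    Header#db_header{uuid = NewUuid,
                     epochs = [{NewUuid, Node, Seq} | Epochs]}.
{code}

With the uuid carried in each epoch, a peer can detect a 'travelled' shard by matching on uuid alone rather than node+uuid, which is the decoupling from the Erlang node name mentioned above.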


