Posted to dev@couchdb.apache.org by Nathan Vander Wilt <na...@calftrail.com> on 2015/03/24 22:06:44 UTC

Could CouchDB 2.0 fix actual read quorum?

Sorry, I have not been following CouchDB 2.0 roadmap but I was extending my fermata-couchdb plugin today and realized that perhaps the Apache release of BigCouch as CouchDB 2.0 might provide an opportunity to fix a serious issue I had using Cloudant's implementation.

See https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 for some additional background/explanation, but my understanding is that Cloudant for all practical purposes ignores the read durability parameter. So you can write with ?w=N to attempt some level of quorum, and get a 202 back if that quorum is unmet. _However_ when you ?r=N it really doesn't matter if only <N nodes are available…if even just a single available node has some version of the requested document you will get a successful response (!).
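
To make that concrete, here is a minimal sketch of the asymmetry against a hypothetical three-replica cluster (Python with the requests library; the URL is made up):

import requests

BASE = "http://db.example.com:5984/mydb"  # hypothetical database URL

# Write with ?w=2: if fewer than two replicas acknowledge, the cluster
# signals it by returning 202 instead of 201.
resp = requests.put(BASE + "/doc1", params={"w": "2"}, json={"value": 42})
print(resp.status_code)  # 201 if the write quorum was met, 202 if not

# Read with ?r=2: even if only a single replica is reachable, any copy of
# the document still comes back as a plain 200 -- the missed read quorum
# is invisible to the client.
resp = requests.get(BASE + "/doc1", params={"r": "2"})
print(resp.status_code)  # 200 either way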

So in practice, there's no way to actually use the quasi-Dynamo features to dynamically _choose_ between consistency or availability — when it comes time to read back a consistent result, BigCouch instead just always gives you availability* regardless of what a given request actually needs. (In my usage I ended up treating a 202 write as a 500, rather than proceeding with no way of ever knowing whether a write did NOT ACTUALLY conflict or just hadn't YET because $who_knows_how_many nodes were still down…)
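
A sketch of that workaround, under the same hypothetical setup (names invented):

import requests

def checked_write(url, doc, w=2):
    resp = requests.put(url, params={"w": str(w)}, json=doc)
    if resp.status_code == 202:
        # Quorum unmet: fail hard rather than proceed with no way of ever
        # verifying the write at the intended consistency level.
        raise RuntimeError("write quorum not met; treating as failure")
    resp.raise_for_status()
    return resp.json()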

IIRC, this was both confirmed and acknowledged as a serious bug by a Cloudant engineer (or support personnel at least) but could not be quickly fixed as it could introduce backwards-compatibility concerns. So…

Is CouchDB 2.0 already breaking backwards compatibility with BigCouch? If true, could this read durability issue now be fixed during the merge?

thanks,
-natevw





* DISCLAIMER: this statement has not been endorsed by actual uptime of *any* Couch fork…

Re: Could CouchDB 2.0 fix actual read quorum?

Posted by "Mutton, James" <jm...@akamai.com>.
Sorry, I had originally intended to suggest a 203 not a 202.  I agree, it’s a stretch to find a status code that matches the meaning exactly.  I think 203 Non-Authoritative is closest. http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.2.4
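
As a sketch of what that could look like from a client’s side (hypothetical, in Python with requests; no CouchDB release returns 203 today):

import requests

resp = requests.get("http://db.example.com:5984/mydb/doc1", params={"r": "2"})
if resp.status_code == 203:
    # Non-Authoritative Information: the document came back, but from fewer
    # than r replicas; the caller can retry, alert, or accept the weaker read.
    print("degraded read:", resp.json())
elif resp.ok:
    print("quorum read:", resp.json())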

</JamesM>

On Mar 25, 2015, at 5:03, Robert Newson <rn...@apache.org> wrote:

> Also noting that there's no status code in the standard that indicates for a GET what we mean by 202 for a write. 
> 
> Sent from my iPhone

Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Robert Newson <rn...@apache.org>.
Also noting that there's no status code in the standard that indicates for a GET what we mean by 202 for a write. 

Sent from my iPhone

Re: Could CouchDB 2.0 fix actual read quorum?

Posted by "Mutton, James" <jm...@akamai.com>.
I’ll have a look when possible, probably not until next week, but on the surface it sounds good.  Agreed that names need refinement and it appears to be missing one for the case of “not-enough N’s alive”.  Maybe add r_met:”insufficient” to the list to indicate not enough nodes were up to satisfy the requested R.  That also raises the question of what to do when R is both insufficient and something else.  I think that “insufficient” is probably the bigger issue in that case and would take precedence (as opposed to making r_met an array or bitmask or something awful).
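
A toy model of that precedence rule (plain Python, not CouchDB source; “insufficient” is the hypothetical value proposed above):

def combine_r_met(replies_seen, r, agreement):
    # agreement is one of the states from the 2655-r-met2 branch:
    # "consistent" | "divergent" | "disagreement"
    if replies_seen < r:
        # Too few live replicas answered at all; report that first rather
        # than turning r_met into an array or bitmask of conditions.
        return "insufficient"
    return agreement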

</JamesM>

On Apr 4, 2015, at 3:12, Robert Samuel Newson <rn...@apache.org> wrote:

> I’ve made branch 2655-r-met2 in fabric which will indicate the consistency of the response. I’ve kept the is_r_met and r_met names for now, but if this is the right direction we will want to change that.
> 
> When fabric returns r_met:"consistent" it means complete agreement among all R responses
> When fabric returns r_met:"divergent" it means we saw more than one distinct revision from the R responses but all divergent copies are ancestors (i.e, they’re missing an update rather being an alternate branch)
> When fabric returns r_met:"disagreement" we saw truly divergent responses. Fabric blocks for the repair, so the response is "healed", but nevertheless it indicates an issue like a recent partition not yet fully healed by anti-entropy.
> 
> Obviously these names are terrible and we’ll need to brainstorm on those, but let’s first establish if this is the right kind of metadata.
> 
> B.
> 
> 
>> On 4 Apr 2015, at 10:41, Robert Samuel Newson <rn...@apache.org> wrote:
>> 
>> 
>> Ok, most of those make sense to me (I think the last two, and particularly the last one, are confounded by the fact that couch will initiate read repair if it sees a lack of convergence, i.e., R to N* different revisions, and will perform the usual arbitrary-but-consistent winner algorithm right there).
>> 
>> So, what we want is not really r_met in the sense that fabric means it; which is the minimum number of responses to wait for before returning, regardless of whether they are the same revision or not.
>> 
>> It’s as you said, did we see at least R responses with the same revision? Would we want additional nuance like whether the responses were so inconsistent that we ran read repair? This would distinguish the case where there are simply fewer than R responses (for nodes down / slow / partitioned) that are returning the same revision versus the case where all R to N* responses return different revisions.
>> 
>> I’ll see how easy it is to return the first value while we ponder the other question.
>> 
>> * I say "R to N" to mean fabric will wait for at least R responses (or timeout) but up to N responses (or timeout) if the responses vary.
>> 
>> B.
>> 
>>> On 4 Apr 2015, at 02:08, Mutton, James <jm...@akamai.com> wrote:
>>> 
>>> * Report the number of r_met failed conditions to a statistical aggregator for alerting or trending on client-visible behavior.
>>> * Pause some operation for a time if possible, retry later.
>>> * Possibly re-resolve and use another cluster that is more healthy or less loaded
>>> * Indicate some hidden failure or bug in how shards got moved around/restored from down nodes
>>> 
>>> </JamesM>
>>> 
>>> On Apr 3, 2015, at 17:27, Robert Samuel Newson <rn...@apache.org> wrote:
>>> 
>>>> 
>>>> I’ve pushed an update to the fabric branch which accounts for when the r= value is higher than the number of replicas (so that it returns r_met:false)
>>>> 
>>>> Changing this so that r_met is true only if R matching revisions are seen doesn’t sound too difficult.
>>>> 
>>>> Where I struggle is seeing what a client can usefully do with this information. When you receive the r_met:false indication, however we end up conveying it, what will you do? Retry until r_met:true?
>>>> 
>>>> B.
>>>> 
>>>>> On 4 Apr 2015, at 00:55, Mutton, James <jm...@akamai.com> wrote:
>>>>> 
>>>>> Based on Paul’s description it sounds like we may need to decide 3 things to close this out:
>>>>> * What does satisfying R mean?
>>>>> * What is the appropriate scope of when R is applied?
>>>>> * How do we most appropriately convey the lack of R?
>>>>> 
>>>>> I’m basing my opinions of R on W.  W is satisfied when a write succeeds to W nodes.  For behavior to be consistent between R and W, R should be considered to be met when R “matching” results have been found, if we treat “matching” == “successful”.  I believe this to be a more-correct interpretation of R-W consistency than treating R-satisfied as “found-but-not-matching” since it matches the complete positive of W's “successfully-written”.  For scope, W acts for both current versions and historical revision updates (e.g. resolving conflicts).  W also functions in bulk operations so R should function in multi-key requests as well if it’s to be consistent.
>>>>> 
>>>>> The last question is how to appropriately convey lack of R.  I tested these branches to see that the _r_met was present, that worked.  I also made some quick modifications to return a 203 to see how some clients behaved.  Here are my test results: https://gist.github.com/jamutton/c823fdac328777e22646
>>>>> 
>>>>> I tested a few clients including an old version of couchdbkit and all worked while the server was returning a 203 and/or the meta-field.  A quick test with replication was mixed.  I did a replicate into a couchdb 1.6 machine and although I did see some errors, replication succeeded (the errors were related to checkpointing the target and my 1.6 could have been messed up).  All that to say that where I tested it, returning a 203 on R was accepted behavior by clients, just as returning a 202 on W.  By no means is that extensive but at least indicative.  So, I think both approaches, field and status-code, are possible for single key requests (more on that in a second) and whether it’s status or field, I favor at least having consistency with W.  We could also have consistency by converting W’s 202 to be an in-document meta field like _w_met and only present when ?is_w_met=true is present on the query string.  That feels more drastic.
>>>>> 
>>>>> So the last issue is for the bulk/multi-doc responses.  Here the entire approach of reads and writes diverges.  Writes are still individual doc-updates, whereas reads of multi-docs are basically a “view” even if it’s all_docs.  IMHO, views could be called out of scope for when R is applied.  It doesn’t even descend into doc_open to apply R unless “keys” are specified and normal views without include_docs would do the same IIRC.  This approach of calling all views out of scope because they could only even be in scope under certain circumstances, leaves the door open still for either a status-code or field (and again, if using a field it would be more consistent API behavior to switch W to behave the same).
>>>>> 
>>>>> Cheers,
>>>>> </JamesM>
>>>>> 
>>>>> On Apr 2, 2015, at 3:51, Robert Samuel Newson <rn...@apache.org> wrote:
>>>>> 
>>>>>> To move this along I have COUCHDB-2655 and three branches with a working solution;
>>>>>> 
>>>>>> https://git-wip-us.apache.org/repos/asf?p=couchdb-chttpd.git;h=b408ce5
>>>>>> https://git-wip-us.apache.org/repos/asf?p=couchdb-couch.git;h=7d811d3
>>>>>> https://git-wip-us.apache.org/repos/asf?p=couchdb-fabric.git;h=90e9691
>>>>>> 
>>>>>> All three branches are called 2655-r-met if you want to try this locally (and please do!)
>>>>>> 
>>>>>> Sample output;
>>>>>> 
>>>>>> curl -v 'foo:bar@localhost:15984/db1/doc1?is_r_met=true'
>>>>>> 
>>>>>> {"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":true}
>>>>>> 
>>>>>> By making it opt-in, I think we avoid all the collateral damage that Paul was concerned about.
>>>>>> 
>>>>>> B.
>>>>>> 
>>>>>> 
>>>>>>> On 2 Apr 2015, at 10:36, Robert Samuel Newson <rn...@apache.org> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Yeah, not a bad idea. An extra query arg (akin to open_revs=all, conflicts=true, etc) would avoid compatibility breaks and would clearly put the onus on those supplying it to tolerate the presence of the extra reserved field.
>>>>>>> 
>>>>>>> +1
>>>>>>> 
>>>>>>> 
>>>>>>>> On 2 Apr 2015, at 10:32, Benjamin Bastian <bb...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> What about adding an optional query parameter to indicate whether or not
>>>>>>>> Couch should include the _r_met flag in the document body/bodies
>>>>>>>> (defaulting to false)? That wouldn't break older clients and it'd work for
>>>>>>>> the bulk API as well. As far as the case where there are conflicts, it
>>>>>>>> seems like the most intuitive thing would be for the "r" in "_r_met" to
>>>>>>>> have the same semantic meaning as the "r" in "?r=" (i.e. "?r=" means "wait
>>>>>>>> for r copies of the same doc rev until a timeout" and "_r_met" would mean
>>>>>>>> "we got/didn't get r copies of the same doc rev within the timeout").
>>>>>>>> 
>>>>>>>> Just my two cents.
>>>>>>>> 
>>>>>>>> On Thu, Apr 2, 2015 at 1:22 AM, Robert Samuel Newson <rn...@apache.org>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Paul outlined his previous efforts to introduce this indication, and the
>>>>>>>>> problems he faced doing so. Can we come up with an acceptable mechanism?
>>>>>>>>> 
>>>>>>>>> A different status code will break a lot of users. While the http spec
>>>>>>>>> says you can treat any 2xx code as success, plenty of libraries, etc, only
>>>>>>>>> recognise 201 / 202 as successful write and 200 (and maybe 204, 206) for
>>>>>>>>> reads.
>>>>>>>>> 
>>>>>>>>> My preference is for a change that "can’t" break anyone, which I think
>>>>>>>>> only leaves an "X-CouchDB-R-Met: 2" response header, which isn’t the most
>>>>>>>>> pleasant thing.
>>>>>>>>> 
>>>>>>>>> Suggestions?
>>>>>>>>> 
>>>>>>>>> B.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 1 Apr 2015, at 06:55, Mutton, James <jm...@akamai.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> For at least my part of it, I agree with Adam. Bigcouch has made an
>>>>>>>>> effort to inform in the case of a failure to apply W. I've seen it lead to
>>>>>>>>> confusion when the same logic was not applied on R.
>>>>>>>>>> 
>>>>>>>>>> I also agree that W and R are not binding contracts. There's no
>>>>>>>>> agreement protocol to assure that W is met before being committed to disk.
>>>>>>>>> But they are exposed as a blocking parameter of the request, so
>>>>>>>>> notification being consistent appeared to me to be the best compromise (vs
>>>>>>>>> straight up removal).
>>>>>>>>>> 
>>>>>>>>>> </JamesM>
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Mar 31, 2015, at 13:15, Robert Newson <rn...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> If a way can be found that doesn't break things that can be sent in all
>>>>>>>>> or most cases, sure. It's what a user can really infer from that which I
>>>>>>>>> focused on. Not as much, I think, as users that want that info really want.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 31 Mar 2015, at 21:08, Adam Kocoloski <ko...@apache.org> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I hope we can all agree that CouchDB should inform the user when it is
>>>>>>>>> unable to satisfy the requested read "quorum".
>>>>>>>>>>>> 
>>>>>>>>>>>> Adam
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mar 31, 2015, at 3:20 PM, Paul Davis <pa...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Sounds like there's a bit of confusion here.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What Nathan is asking for is the ability to have Couch respond with
>>>>>>>>> some
>>>>>>>>>>>>> information on the actual number of replicas that responded to a read
>>>>>>>>>>>>> request. That way a user could tell that they issued an r=2 request
>>>>>>>>> when
>>>>>>>>>>>>> only r=1 was actually performed. Depending on your point of view in
>>>>>>>>> an MVCC
>>>>>>>>>>>>> world this is either a bug or a feature. :)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It was generally agreed upon that if we could return this information
>>>>>>>>> it
>>>>>>>>>>>>> would be beneficial. Although what happened when I started
>>>>>>>>> implementing
>>>>>>>>>>>>> this patch was that we are either only able to return it in a subset
>>>>>>>>> of
>>>>>>>>>>>>> cases where it happens, return it inconsistently between various
>>>>>>>>> responses,
>>>>>>>>>>>>> or break replication.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The three general methods for this would be to either include a new
>>>>>>>>>>>>> "_r_met" key in the doc body that would be a boolean indicating if the
>>>>>>>>>>>>> requested read quorum was actually met for the document. The second
>>>>>>>>> was to
>>>>>>>>>>>>> return a custom X-R-Met type header, and lastly was the status code as
>>>>>>>>>>>>> described.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The _r_met member was thought to be the best, but unfortunately that
>>>>>>>>> breaks
>>>>>>>>>>>>> replication with older clients because we throw an error rather than
>>>>>>>>> ignore
>>>>>>>>>>>>> any unknown underscore prefixed field name. Thus having something
>>>>>>>>> that was
>>>>>>>>>>>>> just dynamically injected into the document body was a non-starter.
>>>>>>>>>>>>> Unfortunately, if we don't inject into the document body then we limit
>>>>>>>>>>>>> ourselves to only the set of APIs where a single document is
>>>>>>>>> returned. This
>>>>>>>>>>>>> is due to both streaming semantics (we can't buffer an entire
>>>>>>>>> response in
>>>>>>>>>>>>> memory for large requests to _all_docs) as well as multi-doc
>>>>>>>>> responses (a
>>>>>>>>>>>>> single boolean doesn't say which document may have not had a properly
>>>>>>>>> met
>>>>>>>>>>>>> R).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On top of that, the other confusing part of meeting the read quorum
>>>>>>>>> is that
>>>>>>>>>>>>> given MVCC semantics it becomes a bit confusing on how you respond to
>>>>>>>>>>>>> documents with different revision histories. For instance, if we read
>>>>>>>>> two
>>>>>>>>>>>>> docs, we have technically made the r=2 requirement, but what should
>>>>>>>>> our
>>>>>>>>>>>>> response be if those two revisions are different (technically, in
>>>>>>>>> this case
>>>>>>>>>>>>> we wait for the third response, but the decision on what to return
>>>>>>>>> for the
>>>>>>>>>>>>> "r met" value is still unclear).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> While I think everyone is in agreement that it'd be nice to return
>>>>>>>>> some of
>>>>>>>>>>>>> the information about the copies read, I think its much less clear
>>>>>>>>> what and
>>>>>>>>>>>>> how it should be returned in the multitude of cases that we can
>>>>>>>>> specify a
>>>>>>>>>>>>> value for R.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> While that doesn't offer a concrete path forward, hopefully it
>>>>>>>>> clarifies
>>>>>>>>>>>>> some of the issues at hand.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Mar 31, 2015 at 1:47 PM, Robert Samuel Newson <
>>>>>>>>> rnewson@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It’s testament to my friendship with Mike that we can disagree on
>>>>>>>>> such
>>>>>>>>>>>>>> things and remain friends. I am sorry he misled you, though.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> CouchDB 2.0 (like Cloudant) does not have read or write quorums at
>>>>>>>>> all, at
>>>>>>>>>>>>>> least in the formal sense, the only one that matters, this is
>>>>>>>>> unfortunately
>>>>>>>>>>>>>> sloppy language in too many places to correct.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The r= and w= parameters control only how many of the n possible
>>>>>>>>> responses
>>>>>>>>>>>>>> are collected before returning an http response.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It’s not true that returning 202 in the situation where one write is
>>>>>>>>> made
>>>>>>>>>>>>>> but fewer than 'r' writes are made means we’ve chosen availability
>>>>>>>>> over
>>>>>>>>>>>>>> consistency since even if we returned a 500 or closed the connection
>>>>>>>>>>>>>> without responding, a subsequent GET could return the document (a
>>>>>>>>>>>>>> probability that increases over time as anti-entropy makes the
>>>>>>>>> missing
>>>>>>>>>>>>>> copies). A write attempt that returned a 409 could, likewise,
>>>>>>>>> introduce a
>>>>>>>>>>>>>> new edit branch into the document, which might then 'win', altering
>>>>>>>>> the
>>>>>>>>>>>>>> results of a subsequent GET.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The essential thing to remember is this: the ’n’ copies of your data
>>>>>>>>> are
>>>>>>>>>>>>>> completely independent when written/read by the clustered layer
>>>>>>>>> (fabric).
>>>>>>>>>>>>>> It is internal replication (anti-entropy) that converges those
>>>>>>>>> copies,
>>>>>>>>>>>>>> pair-wise, to the same eventual state. Fabric is converting the 3
>>>>>>>>>>>>>> independent results into a single result as best it can. Older
>>>>>>>>> versions did
>>>>>>>>>>>>>> not expose the 201 vs 202 distinction, calling both of them 201. I
>>>>>>>>> do agree
>>>>>>>>>>>>>> with you that there’s little value in the 202 distinction. About the
>>>>>>>>> only
>>>>>>>>>>>>>> thing you could do is investigate your cluster for connectivity
>>>>>>>>> issues or
>>>>>>>>>>>>>> overloading if you get a sustained period of 202’s, as it would be an
>>>>>>>>>>>>>> indicator that the system is partitioned.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In order to achieve your goals, CouchDB 2.0 would have to ensure
>>>>>>>>> that the
>>>>>>>>>>>>>> result of a write did not change after the fact. That is,
>>>>>>>>> anti-entropy
>>>>>>>>>>>>>> would need to be disabled, or somehow agree to roll forward or
>>>>>>>>> backward
>>>>>>>>>>>>>> based on the initial circumstances. In short, we’d have to introduce
>>>>>>>>> strong
>>>>>>>>>>>>>> consistency (paxos or raft or zab, say). While this would be a great
>>>>>>>>>>>>>> feature to add, it’s not currently present, and no amount of
>>>>>>>>> twiddling the
>>>>>>>>>>>>>> status codes will achieve it. We’d rather be honest about our
>>>>>>>>> position on
>>>>>>>>>>>>>> the CAP triangle.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> B.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 30 Mar 2015, at 22:37, Nathan Vander Wilt <
>>>>>>>>> nate-lists@calftrail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> A technical co-founder of Cloudant agreed that this was a bug when I
>>>>>>>>>>>>>> first hit it a few years ago. I found back the original thread here
>>>>>>>>> — this
>>>>>>>>>>>>>> is the discussion I was trying to recall in my OP:
>>>>>>>>>>>>>>> It sounds like perhaps there is a related issue tracked internally
>>>>>>>>> at
>>>>>>>>>>>>>> Cloudant as a result of that conversation.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> JamesM, thanks for your support here and tracking this down. 203
>>>>>>>>> seemed
>>>>>>>>>>>>>> like the best status code to "steal" for this to me too. Best wishes
>>>>>>>>> in
>>>>>>>>>>>>>> getting this fixed!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>> -natevw
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Mar 25, 2015, at 4:49 AM, Robert Newson <rn...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2.0 is explicitly an AP system, the behaviour you describe is not
>>>>>>>>>>>>>> classified as a bug.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Anti-entropy is the main reason that you cannot get strong
>>>>>>>>> consistency
>>>>>>>>>>>>>> from the system, it will transform "failed" writes (those that
>>>>>>>>> succeeded on
>>>>>>>>>>>>>> one node but fewer than R nodes) into success (N copies) as long as
>>>>>>>>> the
>>>>>>>>>>>>>> nodes have enough healthy uptime.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> True of cloudant and 2.0.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 24 Mar 2015, at 15:14, Mutton, James <jm...@akamai.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Funny you should mention it.  I drafted an email in early
>>>>>>>>> February to
>>>>>>>>>>>>>> queue up the same discussion whenever I could get involved again
>>>>>>>>> (which I
>>>>>>>>>>>>>> promptly forgot about).  What happens currently in 2.0 appears
>>>>>>>>> unchanged
>>>>>>>>>>>>>> from earlier versions.  When R is not satisfied in fabric,
>>>>>>>>>>>>>> fabric_doc_open:handle_message eventually responds with a {stop, …}
>>>>>>>>> but
>>>>>>>>>>>>>> leaves the acc-state as the original r_not_met which triggers a
>>>>>>>>> read_repair
>>>>>>>>>>>>>> from the response handler.  read_repair results in an {ok, …} with
>>>>>>>>> the only
>>>>>>>>>>>>>> doc available, because no other docs are in the list.  The final doc
>>>>>>>>>>>>>> returned to chttpd_db:couch_doc_open and thusly to
>>>>>>>>> chttpd_db:db_doc_req is
>>>>>>>>>>>>>> simply {ok, Doc}, which has now lost the fact that the answer was not
>>>>>>>>>>>>>> complete.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> This seems straightforward to fix by a change in
>>>>>>>>>>>>>> fabric_open_doc:handle_response and read_repair.  handle_response
>>>>>>>>> knows
>>>>>>>>>>>>>> whether it has R met and could pass that forward, or allow
>>>>>>>>> read-repair to
>>>>>>>>>>>>>> pass it forward if read_repair is able to satisfy acc.r.  I can’t
>>>>>>>>> speak for
>>>>>>>>>>>>>> community interest in the behavior of sending a 202, but it’s
>>>>>>>>> something I’d
>>>>>>>>>>>>>> definitely like for the same reasons you cite.  Plus it just seems
>>>>>>>>>>>>>> disconnected to do it on writes but not reads.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> </JamesM>
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <
>>>>>>>>>>>>>> nate-lists@calftrail.com> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Sorry, I have not been following CouchDB 2.0 roadmap but I was
>>>>>>>>>>>>>> extending my fermata-couchdb plugin today and realized that perhaps
>>>>>>>>> the
>>>>>>>>>>>>>> Apache release of BigCouch as CouchDB 2.0 might provide an
>>>>>>>>> opportunity to
>>>>>>>>>>>>>> fix a serious issue I had using Cloudant's implementation.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> See
>>>>>>>>>>>>>> https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518
>>>>>>>>> for
>>>>>>>>>>>>>> some additional background/explanation, but my understanding is that
>>>>>>>>>>>>>> Cloudant for all practical purposes ignores the read durability
>>>>>>>>> parameter.
>>>>>>>>>>>>>> So you can write with ?w=N to attempt some level of quorum, and get
>>>>>>>>> a 202
>>>>>>>>>>>>> back if that quorum is unmet. _However_ when you ?r=N it really
>>>>>>>>> doesn't
>>>>>>>>>>>>>> matter if only <N nodes are available…if even just a single
>>>>>>>>> available node
>>>>>>>>>>>>>> has some version of the requested document you will get a successful
>>>>>>>>>>>>>> response (!).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> So in practice, there's no way to actually use the quasi-Dynamo
>>>>>>>>>>>>>> features to dynamically _choose_ between consistency or availability
>>>>>>>>> — when
>>>>>>>>>>>>>> it comes time to read back a consistent result, BigCouch instead just
>>>>>>>>>>>>>> always gives you availability* regardless of what a given request
>>>>>>>>> actually
>>>>>>>>>>>>>> needs. (In my usage I ended up treating a 202 write as a 500, rather
>>>>>>>>> than
>>>>>>>>>>>>>> proceeding with no way of ever knowing whether a write did NOT
>>>>>>>>> ACTUALLY
>>>>>>>>>>>>>> conflict or just hadn't YET because $who_knows_how_many nodes were
>>>>>>>>> still
>>>>>>>>>>>>>> down…)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> IIRC, this was both confirmed and acknowledged as a serious bug
>>>>>>>>> by a
>>>>>>>>>>>>>> Cloudant engineer (or support personnel at least) but could not be
>>>>>>>>> quickly
>>>>>>>>>>>>>> fixed as it could introduce backwards-compatibility concerns. So…
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Is CouchDB 2.0 already breaking backwards compatibility with
>>>>>>>>>>>>>> BigCouch? If true, could this read durability issue now be fixed
>>>>>>>>> during the
>>>>>>>>>>>>>> merge?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>>> -natevw
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> * DISCLAIMER: this statement has not been endorsed by actual
>>>>>>>>> uptime
>>>>>>>>>>>>>> of *any* Couch fork…
>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Robert Samuel Newson <rn...@apache.org>.
I’ve made branch 2655-r-met2 in fabric which will indicate the consistency of the response. I’ve kept the is_r_met and r_met names for now, but if this is the right direction we will want to change that.

When fabric returns r_met:"consistent" it means complete agreement among all R responses
When fabric returns r_met:"divergent" it means we saw more than one distinct revision from the R responses but all divergent copies are ancestors (i.e, they’re missing an update rather being an alternate branch)
When fabric returns r_met:"disagreement" we saw truly divergent responses. Fabric blocks for the repair, so the response is "healed", but nevertheless it indicates an issue like a recent partition not yet fully healed by anti-entropy.

Obviously these names are terrible and we’ll need to brainstorm on those, but let’s first establish if this is the right kind of metadata.
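
For example, a client might act on these values roughly like so (a sketch that assumes the strings are surfaced through the same opt-in _r_met field as in the earlier 2655-r-met branch):

import requests

resp = requests.get("http://localhost:15984/db1/doc1",
                    params={"is_r_met": "true"}, auth=("foo", "bar"))
state = resp.json().get("_r_met")
if state == "divergent":
    print("stale ancestor copies seen; anti-entropy should converge them")
elif state == "disagreement":
    print("conflicting branches were repaired; check for a recent partition")
# "consistent" needs no action: all R responses agreed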

B.




Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Robert Samuel Newson <rn...@apache.org>.
Ok, most of those make sense to me (I think the last two, and particularly the last one, are confounded by the fact that couch will initiate read repair if it sees a lack of convergence, i.e., R to N* different revisions, and will perform the usual arbitrary-but-consistent winner algorithm right there).

So, what we want is not really r_met in the sense that fabric means it, which is the minimum number of responses to wait for before returning, regardless of whether they are the same revision or not.

It’s as you said: did we see at least R responses with the same revision? Would we want additional nuance, like whether the responses were so inconsistent that we ran read repair? That would distinguish the case where there are simply fewer than R responses returning the same revision (nodes down / slow / partitioned) from the case where all R to N* responses return different revisions.
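
As a standalone illustration of that distinction (a sketch only, not fabric’s actual accumulator logic; revision values here are plain terms):

-module(r_met_sketch).
-export([classify/2]).

%% {r_met, Rev}: at least R responses agree on one revision.
%% {r_not_met, too_few}: fewer than R responses arrived at all.
%% {r_not_met, divergent}: R or more responses arrived, but no
%% revision reached R agreeing copies.
classify(Revs, R) when length(Revs) < R ->
    {r_not_met, too_few};
classify(Revs, R) ->
    Counts = lists:foldl(
        fun(Rev, Acc) -> maps:update_with(Rev, fun(C) -> C + 1 end, 1, Acc) end,
        #{}, Revs),
    case [Rev || {Rev, C} <- maps:to_list(Counts), C >= R] of
        [Winner | _] -> {r_met, Winner};
        []           -> {r_not_met, divergent}
    end.

So classify([a, a, b], 2) gives {r_met, a}, while classify([a, b, c], 2) gives {r_not_met, divergent} even though three responses arrived.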

I’ll see how easy it is to return the first value while we ponder the other question.

* I say "R to N" to mean fabric will wait for at least R responses (or timeout) but up to N responses (or timeout) if the responses vary.

B.

> On 4 Apr 2015, at 02:08, Mutton, James <jm...@akamai.com> wrote:
> 
> * Report the number of r_met failures to a statistical aggregator for alerting or trending on client-visible behavior.
> * Pause some operation for a time if possible, and retry later.
> * Possibly re-resolve and use another cluster that is more healthy or less loaded.
> * Indicate some hidden failure or bug in how shards got moved around/restored from down nodes.
> 
> </JamesM>


Re: Could CouchDB 2.0 fix actual read quorum?

Posted by "Mutton, James" <jm...@akamai.com>.
* Report the number of r_met failures to a statistical aggregator for alerting or trending on client-visible behavior (these first two are sketched below).
* Pause some operation for a time if possible, and retry later.
* Possibly re-resolve and use another cluster that is more healthy or less loaded.
* Indicate some hidden failure or bug in how shards got moved around/restored from down nodes.
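
A rough sketch of that first pair in client code (illustrative only; fetch_doc/1 and emit_metric/1 are assumed stand-ins, not any real client API):

-module(r_met_client_sketch).
-export([get_with_retry/1]).

%% Hypothetical reaction to _r_met:false: count the event for a
%% metrics aggregator, pause briefly, and retry a bounded number of
%% times, falling back to the best copy seen.
get_with_retry(Key) ->
    get_with_retry(Key, 3).

get_with_retry(_Key, 0) ->
    {error, r_never_met};
get_with_retry(Key, Left) ->
    case fetch_doc(Key) of
        {ok, Doc, true} ->
            {ok, Doc};               % quorum reported met
        {ok, Doc, false} ->
            emit_metric(r_not_met),  % feed the aggregator
            timer:sleep(250),        % pause, then retry
            case get_with_retry(Key, Left - 1) of
                {error, r_never_met} -> {degraded, Doc}; % best copy seen
                Ok -> Ok
            end
    end.

%% Stubs so the sketch compiles; a real client would GET the doc with
%% ?is_r_met=true and parse _r_met out of the body.
fetch_doc(_Key) -> {ok, #{}, false}.
emit_metric(_Metric) -> ok.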

</JamesM>

On Apr 3, 2015, at 17:27, Robert Samuel Newson <rn...@apache.org> wrote:

> 
> I’ve pushed an update to the fabric branch which accounts for when the r= value is higher than the number of replicas (so that it returns r_met:false)
> 
> Changing this so that r_met is true only if R matching revisions are seen doesn’t sound too difficult.
> 
> Where I struggle is seeing what a client can usefully do with this information. When you receive the r_met:false indication, however we end up conveying it, what will you do? Retry until r_met:true?
> 
> B.


Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Robert Samuel Newson <rn...@apache.org>.
I’ve pushed an update to the fabric branch which accounts for when the r= value is higher than the number of replicas (so that it returns r_met:false).

Changing this so that r_met is true only if R matching revisions are seen doesn’t sound too difficult.
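
As a minimal sketch of that rule (illustrative only, not the actual fabric change):

%% r can never be met when it exceeds the replica count; otherwise it
%% is met once R responses agree on a single revision.
r_met(R, NumReplicas, _Matching) when R > NumReplicas -> false;
r_met(R, _NumReplicas, Matching) -> Matching >= R.

So r_met(4, 3, 3) is false however the reads go, and r_met(2, 3, 1) stays false until a second matching copy arrives.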

Where I struggle is seeing what a client can usefully do with this information. When you receive the r_met:false indication, however we end up conveying it, what will you do? Retry until r_met:true?

B.

> On 4 Apr 2015, at 00:55, Mutton, James <jm...@akamai.com> wrote:
> 
> Based on Paul’s description it sounds like we may need to decide 3 things to close this out:
> * What does satisfying R mean?
> * What is the appropriate scope of when R is applied?
> * How do we most appropriately convey the lack of R?
> 
> I’m basing my opinions of R on W.  W is satisfied when a write succeeds to W nodes.  For behavior to be consistent between R and W, R should be considered to be met when R “matching” results have been found, if we treat “matching” == “successful”.  I believe this to be a more correct interpretation of R-W consistency than treating R-satisfied as “found-but-not-matching”, since it matches the complete positive of W's “successfully-written”.  For scope, W acts for both current versions and historical revision updates (e.g. resolving conflicts).  W also functions in bulk operations, so R should function in multi-key requests as well if it’s to be consistent.
> 
> The last question is how to appropriately convey lack of R.  I tested these branches to see that the _r_met field was present; that worked.  I also made some quick modifications to return a 203 to see how some clients behaved.  Here are my test results: https://gist.github.com/jamutton/c823fdac328777e22646
> 
> I tested a few clients including an old version of couchdbkit and all worked while the server was returning a 203 and/or the meta-field.  A quick test with replication was mixed.  I did a replicate into a couchdb 1.6 machine and although I did see some errors, replication succeeded (the errors were related to checkpointing the target and my 1.6 could have been messed up).  All that to say that where I tested it, returning a 203 on R was accepted behavior by clients, just as returning a 202 on W.  By no means is that extensive, but at least indicative.  So, I think both approaches, field and status-code, are possible for single key requests (more on that in a second) and whether it’s status or field, I favor at least having consistency with W.  We could also have consistency by converting W’s 202 to be an in-document meta field like _w_met, only present when ?is_w_met=true is on the query string.  That feels more drastic.
> 
> So the last issue is the bulk/multi-doc responses.  Here the entire approach of reads and writes diverges.  Writes are still individual doc-updates, whereas reads of multi-docs are basically a “view” even if it’s all_docs.  IMHO, views could be called out of scope for when R is applied.  It doesn’t even descend into doc_open to apply R unless “keys” are specified, and normal views without include_docs would do the same IIRC.  Calling all views out of scope, because they could only ever be in scope under certain circumstances, still leaves the door open for either a status-code or a field (and again, if using a field it would be more consistent API behavior to switch W to behave the same).
> 
> Cheers,
> </JamesM>
> 
> On Apr 2, 2015, at 3:51, Robert Samuel Newson <rn...@apache.org> wrote:
> 
>> To move this along I have COUCHDB-2655 and three branches with a working solution;
>> 
>> https://git-wip-us.apache.org/repos/asf?p=couchdb-chttpd.git;h=b408ce5
>> https://git-wip-us.apache.org/repos/asf?p=couchdb-couch.git;h=7d811d3
>> https://git-wip-us.apache.org/repos/asf?p=couchdb-fabric.git;h=90e9691
>> 
>> All three branches are called 2655-r-met if you want to try this locally (and please do!)
>> 
>> Sample output;
>> 
>> curl -v 'foo:bar@localhost:15984/db1/doc1?is_r_met=true'
>> 
>> {"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":true}
>> 
>> By making it opt-in, I think we avoid all the collateral damage that Paul was concerned about.
>> 
>> B.
>> 
>> 
>>> On 2 Apr 2015, at 10:36, Robert Samuel Newson <rn...@apache.org> wrote:
>>> 
>>> 
>>> Yeah, not a bad idea. An extra query arg (akin to open_revs=all, conflicts=true, etc) would avoid compatibility breaks and would clearly put the onus on those supplying it to tolerate the presence of the extra reserved field.
>>> 
>>> +1
>>> 
>>> 
>>>> On 2 Apr 2015, at 10:32, Benjamin Bastian <bb...@apache.org> wrote:
>>>> 
>>>> What about adding an optional query parameter to indicate whether or not
>>>> Couch should include the _r_met flag in the document body/bodies
>>>> (defaulting to false)? That wouldn't break older clients and it'd work for
>>>> the bulk API as well. As far as the case where there are conflicts, it
>>>> seems like the most intuitive thing would be for the "r" in "_r_met" to
>>>> have the same semantic meaning as the "r" in "?r=" (i.e. "?r=" means "wait
>>>> for r copies of the same doc rev until a timeout" and "_r_met" would mean
>>>> "we got/didn't get r copies of the same doc rev within the timeout").
>>>> 
>>>> Just my two cents.
>>>> 
>>>> On Thu, Apr 2, 2015 at 1:22 AM, Robert Samuel Newson <rn...@apache.org>
>>>> wrote:
>>>> 
>>>>> 
>>>>> Paul outlined his previous efforts to introduce this indication, and the
>>>>> problems he faced doing so. Can we come up with an acceptable mechanism?
>>>>> 
>>>>> A different status code will break a lot of users. While the http spec
>>>>> says you can treat any 2xx code as success, plenty of libraries, etc, only
>>>>> recognise 201 / 202 as successful write and 200 (and maybe 204, 206) for
>>>>> reads.
>>>>> 
>>>>> My preference is for a change that "can’t" break anyone, which I think
>>>>> only leaves an "X-CouchDB-R-Met: 2" response header, which isn’t the most
>>>>> pleasant thing.
>>>>> 
>>>>> Suggestions?
>>>>> 
>>>>> B.
>>>>> 
>>>>> 
>>>>>> On 1 Apr 2015, at 06:55, Mutton, James <jm...@akamai.com> wrote:
>>>>>> 
>>>>>> For at least my part of it, I agree with Adam. Bigcouch has made an
>>>>> effort to inform in the case of a failure to apply W. I've seen it lead to
>>>>> confusion when the same logic was not applied on R.
>>>>>> 
>>>>>> I also agree that W and R are not binding contracts. There's no
>>>>> agreement protocol to assure that W is met before being committed to disk.
>>>>> But they are exposed as a blocking parameter of the request, so
>>>>> notification being consistent appeared to me to be the best compromise (vs
>>>>> straight up removal).
>>>>>> 
>>>>>> </JamesM>
>>>>>> 
>>>>>> 
>>>>>>> On Mar 31, 2015, at 13:15, Robert Newson <rn...@apache.org> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> If a way can be found that doesn't break things that can be sent in all
>>>>> or most cases, sure. It's what a user can really infer from that which I
>>>>> focused on. Not as much, I think, as users that want that info really want.
>>>>>>> 
>>>>>>> 
>>>>>>>> On 31 Mar 2015, at 21:08, Adam Kocoloski <ko...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> I hope we can all agree that CouchDB should inform the user when it is
>>>>> unable to satisfy the requested read "quorum".
>>>>>>>> 
>>>>>>>> Adam
>>>>>>>> 
>>>>>>>>> On Mar 31, 2015, at 3:20 PM, Paul Davis <pa...@gmail.com>
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Sounds like there's a bit of confusion here.
>>>>>>>>> 
>>>>>>>>> What Nathan is asking for is the ability to have Couch respond with
>>>>> some
>>>>>>>>> information on the actual number of replicas that responded to a read
>>>>>>>>> request. That way a user could tell that they issued an r=2 request
>>>>> when
>>>>>>>>> only r=1 was actually performed. Depending on your point of view in
>>>>> an MVCC
>>>>>>>>> world this is either a bug or a feature. :)
>>>>>>>>> 
>>>>>>>>> It was generally agreed upon that if we could return this information
>>>>> it
>>>>>>>>> would be beneficial. Although what happened when I started
>>>>> implementing
>>>>>>>>> this patch was that we are either only able to return it in a subset
>>>>> of
>>>>>>>>> cases where it happens, return it inconsistently between various
>>>>> responses,
>>>>>>>>> or break replication.
>>>>>>>>> 
>>>>>>>>> The three general methods for this would be to either include a new
>>>>>>>>> "_r_met" key in the doc body that would be a boolean indicating if the
>>>>>>>>> requested read quorum was actually met for the document. The second
>>>>> was to
>>>>>>>>> return a custom X-R-Met type header, and lastly was the status code as
>>>>>>>>> described.
>>>>>>>>> 
>>>>>>>>> The _r_met member was thought to be the best, but unfortunately that
>>>>> breaks
>>>>>>>>> replication with older clients because we throw an error rather than
>>>>> ignore
>>>>>>>>> any unknown underscore prefixed field name. Thus having something
>>>>> that was
>>>>>>>>> just dynamically injected into the document body was a non-starter.
>>>>>>>>> Unfortunately, if we don't inject into the document body then we limit
>>>>>>>>> ourselves to only the set of APIs where a single document is
>>>>> returned. This
>>>>>>>>> is due to both streaming semantics (we can't buffer an entire
>>>>> response in
>>>>>>>>> memory for large requests to _all_docs) as well as multi-doc
>>>>> responses (a
>>>>>>>>> single boolean doesn't say which document may have not had a properly
>>>>> met
>>>>>>>>> R).
>>>>>>>>> 
>>>>>>>>> On top of that, the other confusing part of meeting the read quorum
>>>>> is that
>>>>>>>>> given MVCC semantics it becomes a bit confusing on how you respond to
>>>>>>>>> documents with different revision histories. For instance, if we read
>>>>> two
>>>>>>>>> docs, we have technically made the r=2 requirement, but what should
>>>>> our
>>>>>>>>> response be if those two revisions are different (technically, in
>>>>> this case
>>>>>>>>> we wait for the third response, but the decision on what to return
>>>>> for the
>>>>>>>>> "r met" value is still unclear).
>>>>>>>>> 
>>>>>>>>> While I think everyone is in agreement that it'd be nice to return
>>>>> some of
>>>>>>>>> the information about the copies read, I think its much less clear
>>>>> what and
>>>>>>>>> how it should be returned in the multitude of cases that we can
>>>>> specify an
>>>>>>>>> value for R.
>>>>>>>>> 
>>>>>>>>> While that doesn't offer a concrete path forward, hopefully it
>>>>> clarifies
>>>>>>>>> some of the issues at hand.
>>>>>>>>> 
>>>>>>>>> On Tue, Mar 31, 2015 at 1:47 PM, Robert Samuel Newson <
>>>>> rnewson@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> It’s testament to my friendship with Mike that we can disagree on
>>>>> such
>>>>>>>>>> things and remain friends. I am sorry he misled you, though.


Re: Could CouchDB 2.0 fix actual read quorum?

Posted by "Mutton, James" <jm...@akamai.com>.
Based on Paul’s description it sounds like we may need to decide 3 things to close this out:
* What does satisfying R mean?
* What is the appropriate scope of when R is applied?
* How do we most appropriately convey the lack of R?

I’m basing my opinions of R on W.  W is satisfied when a write succeeds on W nodes.  For behavior to be consistent between R and W, R should be considered met when R “matching” results have been found, if we treat “matching” == “successful”.  I believe this to be a more correct interpretation of R-W consistency than treating R as satisfied by “found-but-not-matching”, since it matches the complete positive of W’s “successfully written”.  For scope, W acts on both current versions and historical revision updates (e.g. resolving conflicts).  W also functions in bulk operations, so R should function in multi-key requests as well if it’s to be consistent.
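
To make the “matching == successful” reading concrete, here is a minimal Erlang sketch of the check I have in mind (illustration only, not the actual fabric code; the function and variable names are made up):

%% RevReplies has one revision id per responding copy, e.g. [RevA, RevA, RevB].
r_met(RevReplies, R) ->
    Counts = [{Rev, length([X || X <- RevReplies, X =:= Rev])}
              || Rev <- lists:usort(RevReplies)],
    lists:any(fun({_Rev, Count}) -> Count >= R end, Counts).

%% r_met([a, a, b], 2) -> true   (two matching copies found)
%% r_met([a, b, c], 2) -> false  (three copies found, none matching)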

The last question is how to appropriately convey the lack of R.  I tested these branches to confirm that the _r_met field was present; that worked.  I also made some quick modifications to return a 203 to see how some clients behaved.  Here are my test results: https://gist.github.com/jamutton/c823fdac328777e22646

I tested a few clients, including an old version of couchdbkit, and all worked while the server was returning a 203 and/or the meta field.  A quick test with replication was mixed: I replicated into a CouchDB 1.6 machine and, although I did see some errors, replication succeeded (the errors were related to checkpointing the target, and my 1.6 could have been messed up).  All that to say: where I tested it, returning a 203 on R was accepted behavior by clients, just as returning a 202 on W is.  By no means is that extensive, but it is at least indicative.  So I think both approaches, field and status code, are possible for single-key requests (more on that in a second), and whether it’s a status or a field, I favor at least having consistency with W.  We could also get consistency by converting W’s 202 into an in-document meta field like _w_met, present only when ?is_w_met=true is on the query string.  That feels more drastic.
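
For illustration, the kind of exchange I was testing looks roughly like this (a sketch from my modified build, not released behavior; headers abbreviated):

GET /db1/doc1?r=2 HTTP/1.1
Host: localhost:15984

HTTP/1.1 203 Non-Authoritative Information
Content-Type: application/json

{"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":false}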

So the last issue is the bulk/multi-doc responses.  Here the approaches for reads and writes diverge entirely.  Writes are still individual doc updates, whereas multi-doc reads are basically a “view”, even if it’s _all_docs.  IMHO, views could be called out of scope for when R is applied: a view request doesn’t even descend into doc_open to apply R unless “keys” are specified, and normal views without include_docs would behave the same, IIRC.  Calling all views out of scope, because they could only ever be in scope under certain circumstances, still leaves the door open for either a status code or a field (and again, if using a field, it would be more consistent API behavior to switch W to behave the same way).

Cheers,
</JamesM>

On Apr 2, 2015, at 3:51, Robert Samuel Newson <rn...@apache.org> wrote:

> To move this along I have COUCHDB-2655 and three branches with a working solution:
> 
> https://git-wip-us.apache.org/repos/asf?p=couchdb-chttpd.git;h=b408ce5
> https://git-wip-us.apache.org/repos/asf?p=couchdb-couch.git;h=7d811d3
> https://git-wip-us.apache.org/repos/asf?p=couchdb-fabric.git;h=90e9691
> 
> All three branches are called 2655-r-met if you want to try this locally (and please do!)
> 
> Sample output:
> 
> curl -v 'foo:bar@localhost:15984/db1/doc1?is_r_met=true'
> 
> {"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":true}
> 
> By making it opt-in, I think we avoid all the collateral damage that Paul was concerned about.
> 
> B.
> 


Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Robert Samuel Newson <rn...@apache.org>.
To move this along I have COUCHDB-2655 and three branches with a working solution:

https://git-wip-us.apache.org/repos/asf?p=couchdb-chttpd.git;h=b408ce5
https://git-wip-us.apache.org/repos/asf?p=couchdb-couch.git;h=7d811d3
https://git-wip-us.apache.org/repos/asf?p=couchdb-fabric.git;h=90e9691

All three branches are called 2655-r-met if you want to try this locally (and please do!)
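
If you want to try them, something like the following should work (assuming the usual clone URLs for the git-wip-us mirrors; adjust to your checkout layout):

git clone https://git-wip-us.apache.org/repos/asf/couchdb-fabric.git
cd couchdb-fabric
git checkout 2655-r-met

(and the same for couchdb-chttpd and couchdb-couch)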

Sample output:

curl -v 'foo:bar@localhost:15984/db1/doc1?is_r_met=true'

{"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":true}

By making it opt-in, I think we avoid all the collateral damage that Paul was concerned about.

B.




Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Robert Samuel Newson <rn...@apache.org>.
Yeah, not a bad idea. An extra query arg (akin to open_revs=all, conflicts=true, etc) would avoid compatibility breaks and would clearly put the onus on those supplying it to tolerate the presence of the extra reserved field.

+1


> On 2 Apr 2015, at 10:32, Benjamin Bastian <bb...@apache.org> wrote:
> 
> What about adding an optional query parameter to indicate whether or not
> Couch should include the _r_met flag in the document body/bodies
> (defaulting to false)? That wouldn't break older clients and it'd work for
> the bulk API as well. As far as the case where there are conflicts, it
> seems like the most intuitive thing would be for the "r" in "_r_met" to
> have the same semantic meaning as the "r" in "?r=" (i.e. "?r=" means "wait
> for r copies of the same doc rev until a timeout" and "_r_met" would mean
> "we got/didn't get r copies of the same doc rev within the timeout").
> 
> Just my two cents.
> 


Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Benjamin Bastian <bb...@apache.org>.
What about adding an optional query parameter to indicate whether or not
Couch should include the _r_met flag in the document body/bodies
(defaulting to false)? That wouldn't break older clients, and it'd work for
the bulk API as well. As for the case where there are conflicts, it
seems like the most intuitive thing would be for the "r" in "_r_met" to
have the same semantic meaning as the "r" in "?r=" (i.e. "?r=" means "wait
for r copies of the same doc rev until a timeout", and "_r_met" means
"we got/didn't get r copies of the same doc rev within the timeout").

Just my two cents.

On Thu, Apr 2, 2015 at 1:22 AM, Robert Samuel Newson <rn...@apache.org>
wrote:

>
> Paul outlined his previous efforts to introduce this indication, and the
> problems he faced doing so. Can we come up with an acceptable mechanism?
>
> A different status code will break a lot of users. While the http spec
> says you can treat any 2xx code as success, plenty of libraries, etc, only
> recognise 201 / 202 as successful write and 200 (and maybe 204, 206) for
> reads.
>
> My preference is for a change that "can’t" break anyone, which I think
> only leaves an "X-CouchDB-R-Met: 2" response header, which isn’t the most
> pleasant thing.
>
> Suggestions?
>
> B.
>
>
> > On 1 Apr 2015, at 06:55, Mutton, James <jm...@akamai.com> wrote:
> >
> > For at least my part of it, I agree with Adam. Bigcouch has made an
> effort to inform in the case of a failure to apply W. I've seen it lead to
> confusion when the same logic was not applied on R.
> >
> > I also agree that W and R are not binding contracts. There's no
> agreement protocol to assure that W is met before being committed to disk.
> But they are exposed as a blocking parameter of the request, so
> notification being consistent appeared to me to be the best compromise (vs
> straight up removal).
> >
> > </JamesM>

Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Robert Samuel Newson <rn...@apache.org>.
Paul outlined his previous efforts to introduce this indication, and the problems he faced doing so. Can we come up with an acceptable mechanism?

A different status code will break a lot of users. While the HTTP spec says you can treat any 2xx code as success, plenty of libraries only recognise 201/202 as a successful write and 200 (and maybe 204 or 206) for reads.

My preference is for a change that "can’t" break anyone, which I think leaves only an "X-CouchDB-R-Met: 2" response header, though that isn’t the most pleasant thing.
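
For illustration, a minimal client-side sketch of how such a header could be consumed (Python; the X-CouchDB-R-Met name is only the proposal above, not a shipped API, and the URL is a placeholder):

    import requests

    def get_doc_with_r(base_url, doc_id, r=2):
        # Ask the cluster to collect r replies before answering.
        resp = requests.get("%s/%s" % (base_url, doc_id), params={"r": r})
        resp.raise_for_status()
        # Hypothetical header; absent on servers that don't implement it.
        r_met = resp.headers.get("X-CouchDB-R-Met")
        if r_met is not None and int(r_met) < r:
            # Fewer than r copies answered; the caller decides what to do.
            raise RuntimeError("read quorum not met: %s < %d" % (r_met, r))
        return resp.json()

    doc = get_doc_with_r("http://127.0.0.1:5984/db", "mydoc", r=2)

Since existing clients ignore response headers they don't recognise, a header like this should be safe to add, which is the appeal of the approach.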

Suggestions?

B.




Re: Could CouchDB 2.0 fix actual read quorum?

Posted by "Mutton, James" <jm...@akamai.com>.
For at least my part of it, I agree with Adam. BigCouch has made an effort to inform in the case of a failure to apply W; I've seen it lead to confusion when the same logic was not applied to R.

I also agree that W and R are not binding contracts. There's no agreement protocol to ensure that W is met before a write is committed to disk. But both are exposed as blocking parameters of the request, so consistent notification appeared to me to be the best compromise (vs. straight-up removal).
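
For illustration, a minimal sketch of the write side as it behaves today, where 201 means W was met and 202 means the write landed on fewer than W nodes (Python; the URL is a placeholder, and an update would also need to send the current _rev):

    import requests

    def put_doc_with_w(base_url, doc_id, body, w=2):
        resp = requests.put("%s/%s" % (base_url, doc_id),
                            json=body, params={"w": w})
        if resp.status_code == 201:
            return resp.json()  # w copies acknowledged the write
        if resp.status_code == 202:
            # Accepted, but fewer than w nodes confirmed it; internal
            # replication will eventually create the missing copies.
            raise RuntimeError("write accepted with degraded durability")
        resp.raise_for_status()

    put_doc_with_w("http://127.0.0.1:5984/db", "mydoc", {"value": 1}, w=2)

The complaint in this thread is that a read has no equivalent signal.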

</JamesM>



Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Robert Newson <rn...@apache.org>.
If a way can be found that doesn't break things and that can be sent in all or most cases, sure. What I focused on is what a user can really infer from that information; not as much, I think, as the users who want that info would like.



Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Adam Kocoloski <ko...@apache.org>.
I hope we can all agree that CouchDB should inform the user when it is unable to satisfy the requested read "quorum".

Adam



Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Paul Davis <pa...@gmail.com>.
Sounds like there's a bit of confusion here.

What Nathan is asking for is the ability to have Couch respond with some
information on the actual number of replicas that responded to a read
request. That way a user could tell that they issued an r=2 request when
only r=1 was actually performed. Depending on your point of view in an MVCC
world this is either a bug or a feature. :)

It was generally agreed upon that if we could return this information it
would be beneficial. What I found when I started implementing this patch,
though, was that we can either return it in only a subset of the cases
where it happens, return it inconsistently between the various responses,
or break replication.

The three general methods for this would be: include a new "_r_met" key in
the doc body, a boolean indicating whether the requested read quorum was
actually met for the document; return a custom X-R-Met style header; or use
the status code as described.

The _r_met member was thought to be the best, but unfortunately that breaks
replication with older clients because we throw an error rather than ignore
any unknown underscore-prefixed field name. Thus having something that was
just dynamically injected into the document body was a non-starter.
Unfortunately, if we don't inject into the document body then we limit
ourselves to only the set of APIs where a single document is returned. This
is due to both streaming semantics (we can't buffer an entire response in
memory for large requests to _all_docs) as well as multi-doc responses (a
single boolean doesn't say which document may not have had a properly met
R).

On top of that, the other tricky part of meeting the read quorum is that
given MVCC semantics it becomes unclear how you respond to documents with
different revision histories. For instance, if we read two docs, we have
technically met the r=2 requirement, but what should our response be if
those two revisions are different? (Technically, in this case we wait for
the third response, but the decision on what to return for the "r met"
value is still unclear.)

While I think everyone is in agreement that it'd be nice to return some of
the information about the copies read, I think it's much less clear what and
how it should be returned in the multitude of cases where we can specify a
value for R.

While that doesn't offer a concrete path forward, hopefully it clarifies
some of the issues at hand.

On Tue, Mar 31, 2015 at 1:47 PM, Robert Samuel Newson <rn...@apache.org>
wrote:

>
> It’s testament to my friendship with Mike that we can disagree on such
> things and remain friends. I am sorry he misled you, though.
>
> CouchDB 2.0 (like Cloudant) does not have read or write quorums at all, at
> least in the formal sense, the only one that matters, this is unfortunately
> sloppy language in too many places to correct.
>
> The r= and w= parameters control only how many of the n possible responses
> are collected before returning an http response.
>
> It’s not true that returning 202 in the situation where one write is made
> but fewer than 'r' writes are made means we’ve chosen availability over
> consistency since even if we returned a 500 or closed the connection
> without responding, a subsequent GET could return the document (a
> probability that increases over time as anti-entropy makes the missing
> copies). A write attempt that returned a 409 could, likewise, introduce a
> new edit branch into the document, which might then 'win', altering the
> results of a subsequent GET.
>
> The essential thing to remember is this: the ’n’ copies of your data are
> completely independent when written/read by the clustered layer (fabric).
> It is internal replication (anti-entropy) that converges those copies,
> pair-wise, to the same eventual state. Fabric is converting the 3
> independent results into a single result as best it can. Older versions did
> not expose the 201 vs 202 distinction, calling both of them 201. I do agree
> with you that there’s little value in the 202 distinction. About the only
> thing you could do is investigate your cluster for connectivity issues or
> overloading if you get a sustained period of 202’s, as it would be an
> indicator that the system is partitioned.
>
> In order to achieve your goals, CouchDB 2.0 would have to ensure that the
> result of a write did not change after the fact. That is, anti-entropy
> would need to be disabled, or somehow agree to roll forward or backward
> based on the initial circumstances. In short, we’d have to introduce strong
> consistency (paxos or raft or zab, say). While this would be a great
> feature to add, it’s not currently present, and no amount of twiddling the
> status codes will achieve it. We’d rather be honest about our position on
> the CAP triangle.
>
> B.
>
>
> > On 30 Mar 2015, at 22:37, Nathan Vander Wilt <na...@calftrail.com>
> wrote:
> >
> > A technical co-founder of Cloudant agreed that this was a bug when I
> first hit it a few years ago. I found the original thread here — this
> is the discussion I was trying to recall in my OP:
> https://twitter.com/mlmilleratmit/status/410911219428491265
> > It sounds like perhaps there is a related issue tracked internally at
> Cloudant as a result of that conversation.
> >
> > JamesM, thanks for your support here and tracking this down. 203 seemed
> like the best status code to "steal" for this to me too. Best wishes in
> getting this fixed!
> >
> > regards,
> > -natevw
> >
> >
> > On Mar 25, 2015, at 4:49 AM, Robert Newson <rn...@apache.org> wrote:
> >
> >> 2.0 is explicitly an AP system, the behaviour you describe is not
> classified as a bug.
> >>
> >> Anti-entropy is the main reason that you cannot get strong consistency
> from the system, it will transform "failed" writes (those that succeeded on
> one node but fewer than R nodes) into success (N copies) as long as the
> nodes have enough healthy uptime.
> >>
> >> True of cloudant and 2.0.
> >>
> >> Sent from my iPhone
> >>
> >>> On 24 Mar 2015, at 15:14, Mutton, James <jm...@akamai.com> wrote:
> >>>
> >>> Funny you should mention it.  I drafted an email in early February to
> queue up the same discussion whenever I could get involved again (which I
> promptly forgot about).  What happens currently in 2.0 appears unchanged
> from earlier versions.  When R is not satisfied in fabric,
> fabric_doc_open:handle_message eventually responds with a {stop, …}  but
> leaves the acc-state as the original r_not_met which triggers a read_repair
> from the response handler.  read_repair results in an {ok, …} with the only
> doc available, because no other docs are in the list.  The final doc
> returned to chttpd_db:couch_doc_open and thusly to chttpd_db:db_doc_req is
> simply {ok, Doc}, which has now lost the fact that the answer was not
> complete.
> >>>
> >>> This seems straightforward to fix by a change in
> fabric_doc_open:handle_response and read_repair.  handle_response knows
> whether it has R met and could pass that forward, or allow read-repair to
> pass it forward if read_repair is able to satisfy acc.r.  I can’t speak for
> community interest in the behavior of sending a 202, but it’s something I’d
> definitely like for the same reasons you cite.  Plus it just seems
> disconnected to do it on writes but not reads.
> >>>
> >>> Cheers,
> >>> </JamesM>
> >>>
> >>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <
> nate-lists@calftrail.com> wrote:
> >>>>
> >>>> Sorry, I have not been following CouchDB 2.0 roadmap but I was
> extending my fermata-couchdb plugin today and realized that perhaps the
> Apache release of BigCouch as CouchDB 2.0 might provide an opportunity to
> fix a serious issue I had using Cloudant's implementation.
> >>>>
> >>>> See
> https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 for
> some additional background/explanation, but my understanding is that
> Cloudant for all practical purposes ignores the read durability parameter.
> So you can write with ?w=N to attempt some level of quorum, and get a 202
> back if that quorum is unmet. _However_ when you ?r=N it really doesn't
> matter if only <N nodes are available…if even just a single available node
> has some version of the requested document you will get a successful
> response (!).
> >>>>
> >>>> So in practice, there's no way to actually use the quasi-Dynamo
> features to dynamically _choose_ between consistency or availability — when
> it comes time to read back a consistent result, BigCouch instead just
> always gives you availability* regardless of what a given request actually
> needs. (In my usage I ended up treating a 202 write as a 500, rather than
> proceeding with no way of ever knowing whether a write did NOT ACTUALLY
> conflict or just hadn't YET because $who_knows_how_many nodes were still
> down…)
> >>>>
> >>>> IIRC, this was both confirmed and acknowledged as a serious bug by a
> Cloudant engineer (or support personnel at least) but could not be quickly
> fixed as it could introduce backwards-compatibility concerns. So…
> >>>>
> >>>> Is CouchDB 2.0 already breaking backwards compatibility with
> BigCouch? If true, could this read durability issue now be fixed during the
> merge?
> >>>>
> >>>> thanks,
> >>>> -natevw
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> * DISCLAIMER: this statement has not been endorsed by actual uptime
> of *any* Couch fork…
> >>>
> >
>
>

Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Robert Samuel Newson <rn...@apache.org>.
It’s a testament to my friendship with Mike that we can disagree on such things and remain friends. I am sorry he misled you, though.

CouchDB 2.0 (like Cloudant) does not have read or write quorums at all, at least in the formal sense (the only one that matters); this is unfortunately sloppy language in too many places to correct.

The r= and w= parameters control only how many of the n possible responses are collected before returning an http response.
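
For example, a minimal client sketch (Python with the third-party requests
package; it assumes a local cluster at http://127.0.0.1:5984 and a database
named 'demo', with credentials omitted):

import requests

BASE = "http://127.0.0.1:5984/demo"

# w=2: respond once 2 of the n copies have acknowledged the write;
# 201 if that happened in time, 202 if the write landed on fewer copies.
resp = requests.put(BASE + "/doc-1", json={"v": 1}, params={"w": "2"})
print(resp.status_code)

# r=2: respond once 2 of the n copies have answered the read. As this
# thread discusses, you still get a 200 with *some* revision even when
# fewer than r copies were actually available.
resp = requests.get(BASE + "/doc-1", params={"r": "2"})
print(resp.status_code, resp.json().get("_rev"))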

It’s not true that returning 202 in the situation where one write is made but fewer than 'r' writes are made means we’ve chosen availability over consistency since even if we returned a 500 or closed the connection without responding, a subsequent GET could return the document (a probability that increases over time as anti-entropy creates the missing copies). A write attempt that returned a 409 could, likewise, introduce a new edit branch into the document, which might then 'win', altering the results of a subsequent GET.

The essential thing to remember is this: the ’n’ copies of your data are completely independent when written/read by the clustered layer (fabric). It is internal replication (anti-entropy) that converges those copies, pair-wise, to the same eventual state. Fabric is converting the 3 independent results into a single result as best it can. Older versions did not expose the 201 vs 202 distinction, calling both of them 201. I do agree with you that there’s little value in the 202 distinction. About the only thing you could do is investigate your cluster for connectivity issues or overloading if you get a sustained period of 202’s, as it would be an indicator that the system is partitioned.
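
As a sketch of acting on that from the client side (same assumptions as the
snippet above: Python requests, local cluster, database 'demo'):

from collections import deque
import requests

BASE = "http://127.0.0.1:5984/demo"
recent = deque(maxlen=50)   # sliding window of recent write outcomes

def write(doc_id, body):
    resp = requests.put(BASE + "/" + doc_id, json=body, params={"w": "2"})
    recent.append(resp.status_code == 202)
    # A sustained run of 202s suggests partition or overload rather than
    # one slow node; surface it operationally, not per-request.
    if len(recent) == recent.maxlen and sum(recent) / recent.maxlen > 0.5:
        print("WARN: sustained 202s; investigate connectivity/load")
    return resp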

In order to achieve your goals, CouchDB 2.0 would have to ensure that the result of a write did not change after the fact. That is, anti-entropy would need to be disabled, or somehow agree to roll forward or backward based on the initial circumstances. In short, we’d have to introduce strong consistency (paxos or raft or zab, say). While this would be a great feature to add, it’s not currently present, and no amount of twiddling the status codes will achieve it. We’d rather be honest about our position on the CAP triangle.

B.


> On 30 Mar 2015, at 22:37, Nathan Vander Wilt <na...@calftrail.com> wrote:
> 
> A technical co-founder of Cloudant agreed that this was a bug when I first hit it a few years ago. I found the original thread here — this is the discussion I was trying to recall in my OP: https://twitter.com/mlmilleratmit/status/410911219428491265
> It sounds like perhaps there is a related issue tracked internally at Cloudant as a result of that conversation.
> 
> JamesM, thanks for your support here and tracking this down. 203 seemed like the best status code to "steal" for this to me too. Best wishes in getting this fixed!
> 
> regards,
> -natevw
> 
> 
> On Mar 25, 2015, at 4:49 AM, Robert Newson <rn...@apache.org> wrote:
> 
>> 2.0 is explicitly an AP system, the behaviour you describe is not classified as a bug. 
>> 
>> Anti-entropy is the main reason that you cannot get strong consistency from the system, it will transform "failed" writes (those that succeeded on one node but fewer than R nodes) into success (N copies) as long as the nodes have enough healthy uptime. 
>> 
>> True of cloudant and 2.0. 
>> 
>> Sent from my iPhone
>> 
>>> On 24 Mar 2015, at 15:14, Mutton, James <jm...@akamai.com> wrote:
>>> 
>>> Funny you should mention it.  I drafted an email in early February to queue up the same discussion whenever I could get involved again (which I promptly forgot about).  What happens currently in 2.0 appears unchanged from earlier versions.  When R is not satisfied in fabric, fabric_doc_open:handle_message eventually responds with a {stop, …}  but leaves the acc-state as the original r_not_met which triggers a read_repair from the response handler.  read_repair results in an {ok, …} with the only doc available, because no other docs are in the list.  The final doc returned to chttpd_db:couch_doc_open and thusly to chttpd_db:db_doc_req is simply {ok, Doc}, which has now lost the fact that the answer was not complete.
>>> 
>>> This seems straightforward to fix by a change in fabric_doc_open:handle_response and read_repair.  handle_response knows whether it has R met and could pass that forward, or allow read-repair to pass it forward if read_repair is able to satisfy acc.r.  I can’t speak for community interest in the behavior of sending a 202, but it’s something I’d definitely like for the same reasons you cite.  Plus it just seems disconnected to do it on writes but not reads.
>>> 
>>> Cheers,
>>> </JamesM>
>>> 
>>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <na...@calftrail.com> wrote:
>>>> 
>>>> Sorry, I have not been following CouchDB 2.0 roadmap but I was extending my fermata-couchdb plugin today and realized that perhaps the Apache release of BigCouch as CouchDB 2.0 might provide an opportunity to fix a serious issue I had using Cloudant's implementation.
>>>> 
>>>> See https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 for some additional background/explanation, but my understanding is that Cloudant for all practical purposes ignores the read durability parameter. So you can write with ?w=N to attempt some level of quorum, and get a 202 back if that quorum is unmet. _However_ when you ?r=N it really doesn't matter if only <N nodes are available…if even just a single available node has some version of the requested document you will get a successful response (!).
>>>> 
>>>> So in practice, there's no way to actually use the quasi-Dynamo features to dynamically _choose_ between consistency or availability — when it comes time to read back a consistent result, BigCouch instead just always gives you availability* regardless of what a given request actually needs. (In my usage I ended up treating a 202 write as a 500, rather than proceeding with no way of ever knowing whether a write did NOT ACTUALLY conflict or just hadn't YET because $who_knows_how_many nodes were still down…)
>>>> 
>>>> IIRC, this was both confirmed and acknowledged as a serious bug by a Cloudant engineer (or support personnel at least) but could not be quickly fixed as it could introduce backwards-compatibility concerns. So…
>>>> 
>>>> Is CouchDB 2.0 already breaking backwards compatibility with BigCouch? If true, could this read durability issue now be fixed during the merge?
>>>> 
>>>> thanks,
>>>> -natevw
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> * DISCLAIMER: this statement has not been endorsed by actual uptime of *any* Couch fork…
>>> 
> 


Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Nathan Vander Wilt <na...@calftrail.com>.
A technical co-founder of Cloudant agreed that this was a bug when I first hit it a few years ago. I found the original thread here — this is the discussion I was trying to recall in my OP: https://twitter.com/mlmilleratmit/status/410911219428491265
It sounds like perhaps there is a related issue tracked internally at Cloudant as a result of that conversation.

JamesM, thanks for your support here and tracking this down. 203 seemed like the best status code to "steal" for this to me too. Best wishes in getting this fixed!

regards,
-natevw


On Mar 25, 2015, at 4:49 AM, Robert Newson <rn...@apache.org> wrote:

> 2.0 is explicitly an AP system, the behaviour you describe is not classified as a bug. 
> 
> Anti-entropy is the main reason that you cannot get strong consistency from the system, it will transform "failed" writes (those that succeeded on one node but fewer than R nodes) into success (N copies) as long as the nodes have enough healthy uptime. 
> 
> True of cloudant and 2.0. 
> 
> Sent from my iPhone
> 
>> On 24 Mar 2015, at 15:14, Mutton, James <jm...@akamai.com> wrote:
>> 
>> Funny you should mention it.  I drafted an email in early February to queue up the same discussion whenever I could get involved again (which I promptly forgot about).  What happens currently in 2.0 appears unchanged from earlier versions.  When R is not satisfied in fabric, fabric_doc_open:handle_message eventually responds with a {stop, …}  but leaves the acc-state as the original r_not_met which triggers a read_repair from the response handler.  read_repair results in an {ok, …} with the only doc available, because no other docs are in the list.  The final doc returned to chttpd_db:couch_doc_open and thusly to chttpd_db:db_doc_req is simply {ok, Doc}, which has now lost the fact that the answer was not complete.
>> 
>> This seems straightforward to fix by a change in fabric_doc_open:handle_response and read_repair.  handle_response knows whether it has R met and could pass that forward, or allow read-repair to pass it forward if read_repair is able to satisfy acc.r.  I can’t speak for community interest in the behavior of sending a 202, but it’s something I’d definitely like for the same reasons you cite.  Plus it just seems disconnected to do it on writes but not reads.
>> 
>> Cheers,
>> </JamesM>
>> 
>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <na...@calftrail.com> wrote:
>>> 
>>> Sorry, I have not been following CouchDB 2.0 roadmap but I was extending my fermata-couchdb plugin today and realized that perhaps the Apache release of BigCouch as CouchDB 2.0 might provide an opportunity to fix a serious issue I had using Cloudant's implementation.
>>> 
>>> See https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 for some additional background/explanation, but my understanding is that Cloudant for all practical purposes ignores the read durability parameter. So you can write with ?w=N to attempt some level of quorum, and get a 202 back if that quorum is unmet. _However_ when you ?r=N it really doesn't matter if only <N nodes are available…if even just a single available node has some version of the requested document you will get a successful response (!).
>>> 
>>> So in practice, there's no way to actually use the quasi-Dynamo features to dynamically _choose_ between consistency or availability — when it comes time to read back a consistent result, BigCouch instead just always gives you availability* regardless of what a given request actually needs. (In my usage I ended up treating a 202 write as a 500, rather than proceeding with no way of ever knowing whether a write did NOT ACTUALLY conflict or just hadn't YET because $who_knows_how_many nodes were still down…)
>>> 
>>> IIRC, this was both confirmed and acknowledged as a serious bug by a Cloudant engineer (or support personnel at least) but could not be quickly fixed as it could introduce backwards-compatibility concerns. So…
>>> 
>>> Is CouchDB 2.0 already breaking backwards compatibility with BigCouch? If true, could this read durability issue now be fixed during the merge?
>>> 
>>> thanks,
>>> -natevw
>>> 
>>> 
>>> 
>>> 
>>> 
>>> * DISCLAIMER: this statement has not been endorsed by actual uptime of *any* Couch fork…
>> 


Re: Could CouchDB 2.0 fix actual read quorum?

Posted by Robert Newson <rn...@apache.org>.
2.0 is explicitly an AP system, the behaviour you describe is not classified as a bug. 

Anti-entropy is the main reason that you cannot get strong consistency from the system, it will transform "failed" writes (those that succeeded on one node but fewer than R nodes) into success (N copies) as long as the nodes have enough healthy uptime. 

True of cloudant and 2.0. 
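
A toy illustration of that transformation (Python, purely illustrative; the
real internal replicator merges full revision trees, pairwise, between
nodes):

# One write landed on a single copy (a 202 from the client's view).
copies = [{"doc-1": "2-abc"}, {}, {}]   # n=3 shard copies

def anti_entropy_pass(copies):
    # Pairwise sync: any copy missing the edit receives it. Here the
    # single winning revision simply spreads to every copy.
    for src in copies:
        for dst in copies:
            for doc_id, rev in src.items():
                dst.setdefault(doc_id, rev)

anti_entropy_pass(copies)
print(copies)   # all three copies now hold doc-1: an N-copy "success"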

Sent from my iPhone

> On 24 Mar 2015, at 15:14, Mutton, James <jm...@akamai.com> wrote:
> 
> Funny you should mention it.  I drafted an email in early February to queue up the same discussion whenever I could get involved again (which I promptly forgot about).  What happens currently in 2.0 appears unchanged from earlier versions.  When R is not satisfied in fabric, fabric_doc_open:handle_message eventually responds with a {stop, …}  but leaves the acc-state as the original r_not_met which triggers a read_repair from the response handler.  read_repair results in an {ok, …} with the only doc available, because no other docs are in the list.  The final doc returned to chttpd_db:couch_doc_open and thusly to chttpd_db:db_doc_req is simply {ok, Doc}, which has now lost the fact that the answer was not complete.
> 
> This seems straightforward to fix by a change in fabric_doc_open:handle_response and read_repair.  handle_response knows whether it has R met and could pass that forward, or allow read-repair to pass it forward if read_repair is able to satisfy acc.r.  I can’t speak for community interest in the behavior of sending a 202, but it’s something I’d definitely like for the same reasons you cite.  Plus it just seems disconnected to do it on writes but not reads.
> 
> Cheers,
> </JamesM>
> 
>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <na...@calftrail.com> wrote:
>> 
>> Sorry, I have not been following CouchDB 2.0 roadmap but I was extending my fermata-couchdb plugin today and realized that perhaps the Apache release of BigCouch as CouchDB 2.0 might provide an opportunity to fix a serious issue I had using Cloudant's implementation.
>> 
>> See https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 for some additional background/explanation, but my understanding is that Cloudant for all practical purposes ignores the read durability parameter. So you can write with ?w=N to attempt some level of quorum, and get a 202 back if that quorum is unmet. _However_ when you ?r=N it really doesn't matter if only <N nodes are available…if even just a single available node has some version of the requested document you will get a successful response (!).
>> 
>> So in practice, there's no way to actually use the quasi-Dynamo features to dynamically _choose_ between consistency or availability — when it comes time to read back a consistent result, BigCouch instead just always gives you availability* regardless of what a given request actually needs. (In my usage I ended up treating a 202 write as a 500, rather than proceeding with no way of ever knowing whether a write did NOT ACTUALLY conflict or just hadn't YET because $who_knows_how_many nodes were still down…)
>> 
>> IIRC, this was both confirmed and acknowledged as a serious bug by a Cloudant engineer (or support personnel at least) but could not be quickly fixed as it could introduce backwards-compatibility concerns. So…
>> 
>> Is CouchDB 2.0 already breaking backwards compatibility with BigCouch? If true, could this read durability issue now be fixed during the merge?
>> 
>> thanks,
>> -natevw
>> 
>> 
>> 
>> 
>> 
>> * DISCLAIMER: this statement has not been endorsed by actual uptime of *any* Couch fork…
> 

Re: Could CouchDB 2.0 fix actual read quorum?

Posted by "Mutton, James" <jm...@akamai.com>.
Funny you should mention it.  I drafted an email in early February to queue up the same discussion whenever I could get involved again (which I promptly forgot about).  What happens currently in 2.0 appears unchanged from earlier versions.  When R is not satisfied in fabric, fabric_doc_open:handle_message eventually responds with a {stop, …}  but leaves the acc-state as the original r_not_met which triggers a read_repair from the response handler.  read_repair results in an {ok, …} with the only doc available, because no other docs are in the list.  The final doc returned to chttpd_db:couch_doc_open and thusly to chttpd_db:db_doc_req is simply {ok, Doc}, which has now lost the fact that the answer was not complete.

This seems straightforward to fix by a change in fabric_doc_open:handle_response and read_repair.  handle_response knows whether it has R met and could pass that forward, or allow read-repair to pass it forward if read_repair is able to satisfy acc.r.  I can’t speak for community interest in the behavior of sending a 202, but it’s something I’d definitely like for the same reasons you cite.  Plus it just seems disconnected to do it on writes but not reads.
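
A simplified model of the idea (Python, purely illustrative; the real code
is Erlang and these names are not the actual ones), showing "was R met?"
carried forward instead of being collapsed into a bare {ok, Doc}:

def handle_response(revs, r):
    """revs: revisions returned by the shard copies that answered."""
    agreeing = max(revs.count(rev) for rev in revs) if revs else 0
    winner = read_repair(revs)   # always yields *some* doc if any copy answered
    r_met = agreeing >= r
    # Return r_met alongside the doc so the HTTP layer can pick a status
    # code (e.g. 200 vs. the 203 suggested earlier in this thread).
    return winner, r_met

def read_repair(revs):
    # With a single live copy there is only one candidate to hand back.
    return max(set(revs), key=revs.count) if revs else None

print(handle_response(["2-abc"], r=2))   # ('2-abc', False): answer incomplete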

Cheers,
</JamesM>

On Mar 24, 2015, at 14:06, Nathan Vander Wilt <na...@calftrail.com> wrote:

> Sorry, I have not been following CouchDB 2.0 roadmap but I was extending my fermata-couchdb plugin today and realized that perhaps the Apache release of BigCouch as CouchDB 2.0 might provide an opportunity to fix a serious issue I had using Cloudant's implementation.
> 
> See https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 for some additional background/explanation, but my understanding is that Cloudant for all practical purposes ignores the read durability parameter. So you can write with ?w=N to attempt some level of quorum, and get a 202 back if that quorum is unmet. _However_ when you ?r=N it really doesn't matter if only <N nodes are available…if even just a single available node has some version of the requested document you will get a successful response (!).
> 
> So in practice, there's no way to actually use the quasi-Dynamo features to dynamically _choose_ between consistency or availability — when it comes time to read back a consistent result, BigCouch instead just always gives you availability* regardless of what a given request actually needs. (In my usage I ended up treating a 202 write as a 500, rather than proceeding with no way of ever knowing whether a write did NOT ACTUALLY conflict or just hadn't YET because $who_knows_how_many nodes were still down…)
> 
> IIRC, this was both confirmed and acknowledged as a serious bug by a Cloudant engineer (or support personnel at least) but could not be quickly fixed as it could introduce backwards-compatibility concerns. So…
> 
> Is CouchDB 2.0 already breaking backwards compatibility with BigCouch? If true, could this read durability issue now be fixed during the merge?
> 
> thanks,
> -natevw
> 
> 
> 
> 
> 
> * DISCLAIMER: this statement has not been endorsed by actual uptime of *any* Couch fork…