Posted to dev@couchdb.apache.org by Michael Fair <mi...@daclubhouse.net> on 2016/03/23 01:30:25 UTC

Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Greetings CouchDBers!

I've been modifying a BERT library to recreate the md5 calc of a RevisionID
in Java.

I haven't tackled attachments yet. However, with the awesome help of rnewson
on the IRC channel, I've succeeded in recreating the md5 for all the
documents I've tried so far. That includes docs with values of strings, big
and small integers, lists of big integers, lists of small integers, true,
false, null, and objects. The glaring exception is floats.

The {minor_version, 0} format used for floats (a 31-byte string
representation in %.20e format) depends on the host environment doing the
encoding and can't be reliably duplicated on other machines or in other
languages.

For instance, here are examples of encoding 3.14159 as a %.20e string on
this laptop:
erlang: 3.1415899999999999000e+00  (This is what term_to_binary is using)
python: 3.14158999999999988262e+00
java:   3.14159000000000000000e+00

These minor numerical differences unfortunately make the md5 computation
untenable.  And further, it seems that even different OTP versions and
different hardware will encode the {minor_version, 0} format slightly
differently on different Couch instances (A couple people on IRC shared
with me what their OTP produced).
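
A minimal Python sketch of where the digits come from, assuming a standard
CPython build. It only reproduces the "python" line above; the variable
names are illustrative.

```python
from decimal import Decimal

x = 3.14159

# The literal 3.14159 is stored as the nearest IEEE 754 double, which is not
# exactly 3.14159. Decimal(x) prints that stored value in full.
print(Decimal(x))

# C-style formatting with 20 digits after the decimal point, as in the
# "python" line above. A runtime that stops at fewer significant digits and
# pads with zeros produces a different string for the same double.
c_style = "%.20e" % x
print(c_style)

# Shortest string that still round-trips to the same double.
print(repr(x))
```

All three outputs describe the same 64-bit value; only the text rendering
differs.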


To make a long story short and spare folks reading the mind-numbing
details, without changing something, replicating the md5 for the revision
id of documents with floats just can't be done sanely.

As things are now, like I mentioned, even different installations of
CouchDB can disagree on the MD5 revision id for the document {"pi":3.14159}.


So where does this create an issue?

It shows up as a conflict document created during replication, when the two
servers have calculated different revision ids for the same document update.
That only happens with a multi-master update, where both sides were updated
before replicating (like separate laptops on separate planes each doing the
same thing).

If only one side or the other was updated, it doesn't cause a problem.
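
To illustrate why a few differing digits are enough, here is a small Python
sketch that hashes the three renderings quoted earlier in this message. It is
not the real revision id input, which is a whole encoded term; it only shows
that any byte difference changes the digest.

```python
import hashlib

# The three renderings of 3.14159 quoted earlier in this message.
renderings = {
    "erlang": "3.1415899999999999000e+00",
    "python": "3.14158999999999988262e+00",
    "java": "3.14159000000000000000e+00",
}

# Hash each rendering on its own. A real revision id hashes a whole encoded
# term, but a byte difference inside it changes the digest in the same way.
digests = {name: hashlib.md5(text.encode("ascii")).hexdigest()
           for name, text in renderings.items()}

for name, digest in digests.items():
    print(name, digest)

# Number of distinct digests among the three renderings.
print(len(set(digests.values())))
```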

My goal is enabling people to upload documents from multiple server
applications using JSON and Couch to handle the replication bits.

To give this heterogeneous environment the same multi-master intelligence
that Couch has, they need to be able to compute the same revision id that
Couch would compute; otherwise documents modified directly in couch could
create these kinds of multi-master type conflicts.


----

What to do (aside from simply do nothing)?

At the least I recommend changing the term_to_binary computation to use the
{minor_version, 1} option in the rev_id calculation.

This changes how floats are encoded to the 64-bit IEEE format.  It became
the standard way of encoding floats in OTP 17.0+ and is available as an
option all the way back to OTP 11.  As long as it's explicitly provided as
a requested option in the term_to_binary call, all currently deployed OTP
installations for Couch can do it.

Doing this normalizes the md5 calculation for floats regardless of the OTP
platform, and should make it feasible for third party applications to
replicate the encoding.
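
A rough Python sketch of the two float encodings, based on my reading of the
Erlang external term format (tag 99 for the text form, tag 70 for the IEEE
form). The function names are hypothetical, and a complete term_to_binary
result also carries a leading version byte (131) that is left out here.

```python
import struct

FLOAT_EXT = 99      # minor_version 0: tag byte, then a 31-byte text field
NEW_FLOAT_EXT = 70  # minor_version 1: tag byte, then 8 bytes of IEEE 754

def encode_float_v0(x: float) -> bytes:
    # Text form: "%.20e", NUL-padded to 31 bytes. The digits depend on the
    # formatting routine of whoever does the encoding.
    text = ("%.20e" % x).encode("ascii")
    return bytes([FLOAT_EXT]) + text.ljust(31, b"\x00")

def encode_float_v1(x: float) -> bytes:
    # Binary form: big-endian 64-bit double, the same bytes on any platform.
    return bytes([NEW_FLOAT_EXT]) + struct.pack(">d", x)

v0 = encode_float_v0(3.14159)
v1 = encode_float_v1(3.14159)
print(len(v0), v0)
print(len(v1), v1.hex())
```

The v0 output here is Python's rendering, so it is not expected to match
what an Erlang node writes; the v1 bytes involve no text formatting at all.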



I have some other ideas beyond that, but they would require changes to the
replication protocol to support.


----

For anyone interested I'd be happy to share the code I have.  It's still a
bit rough in the document construction part, but once constructed, getting
the binary encoding and revision id are each just a single call.


Thanks,
Mike

Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Posted by Michael Fair <mi...@daclubhouse.net>.
On Tue, Mar 22, 2016 at 7:52 PM, Adam Kocoloski <ko...@apache.org> wrote:

> Wow, does this mean that a CouchDB server running R16 and another running
> R17 will compute different revision IDs for the same document?


That would definitely happen, since that is a case of minor_version 1 versus 0.

I also believe it's possible even between identical Erlang releases older
than OTP 17.0 on different architectures, if their float rounding
implementations differ, though I haven't actually seen it happen yet (so far
all my comparisons have been against >OTP 17.0).

If anyone running < OTP 17.0 tries:

io:format("~.20e~n",[3.14159]).
3.1415899999999999000e+0

or

io:format("~.20e~n",[3.1415]).
3.1415000000000002000e+0


and gets anything other than those two answers, then the mismatch would
happen between my Couch and that machine as well.

Or to test in Couch directly here's:

{
   "_id": "pi",
   "pi": 3.14159
}
_rev: "1-fef08460e5939d65f40f93f75ffad893"

or

{
   "_id": "piSmall",
   "pi": 3.1415
}
_rev: "1-6fcc4c4758e81af8ffb5a7df5de39533"

I didn't pick those numbers for any special reason; they were just the first
and easiest float test docs that came to mind.

Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Posted by Adam Kocoloski <ko...@apache.org>.
Wow, does this mean that a CouchDB server running R16 and another running R17 will compute different revision IDs for the same document? We should certainly bump to minor_version=1 across the board; we did this for on-disk representations of document bodies quite a long time ago I think.

Adam

> On Mar 22, 2016, at 10:45 PM, Paul Davis <pa...@gmail.com> wrote:
> 
> +1 to adding the minor version option. Floats are hard. It's still not
> perfect but it at least should make most cases easier.
> 


Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Posted by Paul Davis <pa...@gmail.com>.
+1 to adding the minor version option. Floats are hard. It's still not
perfect but it at least should make most cases easier.


Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Posted by Tim Millwood <ti...@millwoodonline.co.uk>.
Don't think this is very relevant, but thought some people might be
interested.
This is how we generate a revision ID in Drupal to use with CouchDB.
https://github.com/dickolsson/drupal-multiversion/blob/8.x-1.x/src/MultiversionManager.php#L436-L457

On 23 March 2016 at 16:41, Jan Lehnardt <ja...@apache.org> wrote:

> Great sleuthing Michael!
>
> In addition to the recommendation to upgrade to {minor_version: 1}, which could
> be a good first step, how about going the extra mile to make _rev generation
> easier across platforms? This would benefit PouchDB and others.
>
> Best
> Jan

Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Posted by Michael Fair <mi...@daclubhouse.net>.
Absolutely!

1) I have a proposal for how conflicts are handled that I'm typing up. It's
straightforward and will remove the need for endpoints to compute the same
revision id; it will also enable third-party replication without either side
needing to do anything the same way (at the expense of CPU time during
replication when they can't do things the same way).


but 2) I'm happy to provide what I've got so far! :)

Here's a copy/paste of the documents I've been using to test so far:

[
{"_id":"empty1","_rev":"1-967a00dff5e02add41819138abb3284d"},
{"_id":"empty2","_rev":"2-7051cbe5c8faecd085a3fa619e6e6337"},

{"_id":"foobar1","_rev":"1-4c6114c65e295552ab1019e2b046b10e","foo":"bar"},
{"_id":"foobar2","_rev":"2-0e871ef78849b0c206091f1a7af6ec41","foo":"bar"},

{"_id":"one1","_rev":"1-bab35b848b1d09f61ebb92473ad6449b","one":1},
{"_id":"one2","_rev":"2-27d66b6bb53423560f2ad1e3e870665f","one":1},

{"_id":"piSmall","_rev":"1-6fcc4c4758e81af8ffb5a7df5de39533","pi":3.1415},
{"_id":"piSmall2","_rev":"2-8ee24fd4794c01d28d0f19b33b20330e","pi":3.1415},

{"_id":"pi","_rev":"1-747554e1e394661be9216cd91b7ccc73","pi":3.14159},
{"_id":"pi2","_rev":"2-3ee614fb985a37d4ae23d73147f17904","pi":3.14159},

{"_id":"complex1","_rev":"1-329f8ae16f88bd62a8bf3de302febb2e",
"foo":"bar","baz":null,"pi":3.14159,"quux":1234,
"boolean":true,"otherboolean":false,
"list1":[1,2,3],"list2":[1,2,300],
"obj":{"a":"b"}
},
{"_id":"complex2","_rev":"2-a31473a5dd9119facc5d23564bfc210d",
"foo":"bar","baz":null,"pi":3.14159,"quux":1234,
"boolean":true,"otherboolean":false,
"list1":[1,2,3],"list2":[1,2,300],
"obj":{"a":"b"}
}
]


Documents that are still missing: a dedicated null doc, shortListSmall,
shortListLarge, and obj docs.
There's also no coverage at all of a long list (i.e. >= 65536) of small
integers.  I only learned recently that the list of small integers case is
encoded differently.



One thing to note is that I used the _all_docs?include_docs=true URL to
dump these out of my local 1.6.1 couch and the _all_docs dump provided the
term_to_binary encoded version of the floats, while the front end web GUI
is giving me what you see above.  (I've changed the docs above to use the
float values as I originally saved them and updated the MD5s to be what
they should be under the new code.  Any couch install running OTP 17.0+
should already be producing the same above checksums.)

I'm assuming that the _all_docs thing has something to do with javascript
interpreting/correcting the value of the floats as pulled from the on-disk
representation while _all_docs is just sending out the on-disk
representation from erlang...



During replication I think this would technically break the pure JSON
definition of the float as "a sequence of digits" as it changes the
original sequence of digits provided in the uploaded document.  Case in
point, via _all_docs, I received 3.1415000000000002 instead of the original
3.1415 I uploaded. Not the same issue as this, but something worth noting.
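
A short Python sketch of that last point, assuming IEEE 754 doubles on both
ends: the two digit sequences are different text for the same 64-bit value.

```python
import json

uploaded = "3.1415"
returned = "3.1415000000000002"

# Both decimal strings parse to the same 64-bit double.
print(float(uploaded) == float(returned))

# 17 significant digits are always enough to round-trip a double, which is
# the form the longer string takes.
print("%.17g" % float(uploaded))

# Python's json module writes the shortest round-tripping form instead.
print(json.dumps(float(returned)))
```

So the numeric value survives, but a consumer that compares or hashes the
JSON text rather than the parsed number will see a change.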


I'd be happy to do a PR for the above tests if someone gave me a template
to work from for the first empty doc (it should just be copy/paste from
there).

Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Posted by Joan Touzet <wo...@apache.org>.
Hey Mike,

As mentioned on IRC I'd like to see some test cases in our suite
to help ensure we don't regress on this in the future. Specifically,
I think it'd be good to ensure consistent revs on a handful of 
indicative docs.

There's nothing saying we can't change how we do this in the future,
but we need to be mindful of the fact now that:

1) We are a reference implementation for rev calculation for other DBs,
and
2) We should be alerted to the fact that how we calculate revs for docs
   has changed due to some checkin or another

You mentioned on IRC you had a few documents that could help with this.
Even if you don't commit the test cases yourself, could you share your
current test suite with us somehow?

Thanks,
Joan


----- Original Message -----
> From: "Michael Fair" <mi...@daclubhouse.net>
> To: dev@couchdb.apache.org
> Sent: Saturday, March 26, 2016 7:49:21 PM
> Subject: Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)
> 
> Alright, merge is in!
> 
> Step 1) for 2.0+ revisions!
> 
> Thanks all!
> Mike

Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Posted by Michael Fair <mi...@daclubhouse.net>.
Alright, merge is in!

Step 1) for 2.0+ revisions!

Thanks all!
Mike
On Mar 24, 2016 12:16 AM, "Michael Fair" <mi...@daclubhouse.net> wrote:

> Ok pull request is away, I used the GitHub repository, apache/master as a
> base, and revId-minor_version_1-md5Calc as the branch in my fork.  It
> wasn't obvious to me how to get a Jira ticket and branch created and my git
> skills are minimal on a good day.
>
> There was a build error on R16B03-1; I tried to look at the logs but
> wasn't sure what I was looking at/for.
>
> The actual change is extremely minimal as it's just changing line 876 in
> src/couch_db.erl from this:
>             couch_crypto:hash(md5, term_to_binary([Deleted, OldStart,
>                 OldRev, Body, Atts2]))
>
> to this:
>             couch_crypto:hash(md5, term_to_binary([Deleted, OldStart,
>                 OldRev, Body, Atts2], [{minor_version, 1}]))
>
>  Any advice/help/guidance appreciated.
>
> Thanks,
> Mike
>

Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Posted by Michael Fair <mi...@daclubhouse.net>.
Ok pull request is away, I used the GitHub repository, apache/master as a
base, and revId-minor_version_1-md5Calc as the branch in my fork.  It
wasn't obvious to me how to get a Jira ticket and branch created and my git
skills are minimal on a good day.

There was a build error on R16B03-1; I tried to look at the logs but wasn't
sure what I was looking at/for.

The actual change is extremely minimal as it's just changing line 876 in
src/couch_db.erl from this:
            couch_crypto:hash(md5, term_to_binary([Deleted, OldStart,
                OldRev, Body, Atts2]))

to this:
            couch_crypto:hash(md5, term_to_binary([Deleted, OldStart,
                OldRev, Body, Atts2], [{minor_version, 1}]))
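
For anyone following along outside Erlang, a hedged Python sketch of the last
step only: turning already-encoded bytes into a "N-<hex>" revision string.
The function name and the placeholder bytes are hypothetical, and I'm
assuming the generation number is the previous revision's position plus one.

```python
import hashlib

def format_rev(generation: int, encoded_term: bytes) -> str:
    # MD5 over the encoded term, rendered as lowercase hex after the
    # generation number, matching the "N-<32 hex digits>" shape of _rev.
    return "%d-%s" % (generation, hashlib.md5(encoded_term).hexdigest())

# Placeholder bytes only; a real caller would pass the external term format
# encoding of [Deleted, OldStart, OldRev, Body, Atts2].
rev = format_rev(1, b"placeholder-encoded-term")
print(rev)
```

The hard part is producing byte-identical input to the hash, which is what
the {minor_version, 1} option helps with for floats.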

 Any advice/help/guidance appreciated.

Thanks,
Mike

Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Posted by Adam Kocoloski <ko...@apache.org>.
> On Mar 23, 2016, at 2:31 PM, Michael Fair <mi...@daclubhouse.net> wrote:
> 
> On Mar 23, 2016 9:41 AM, "Jan Lehnardt" <ja...@apache.org> wrote:
>> 
>> Great sleuthing Michael!
>> 
>> In addition to the recommendation to upgrade to {minor_version: 1}, which could
>> be a good first step,
> 
> So should I just go ahead and make a pull request for the change then?
> 
> It seems generally agreed that it's an issue and ought to be fixed, and
> this doesn't seriously change anything about the way it's done; it simply
> clarifies it and makes it more consistent across platforms.
> 
> I can confirm (using the erl command line) that with the change the md5
> sums would match between an OTP 15 series, 17 series, and the java library
> I have for the piDoc I posted (3.14159) where they didn't before.
> 
> I also suspect every language in question will have a way to create IEEE
> 64-bit floats (some obviously more easily than others, but the point is
> they can).

Go for it.

> 
>> how about going the extra mile to make _rev generation
>> easier across platforms?
> 
> I'm all for it!  And I'm up for it but at least with this step we have a
> solid base to work from.

I’m all for it as well, and I distinctly remember PouchDB wanting this back in the day. Let’s get the minor_version piece in place and then figure out a portable revision generation algorithm afterwards.

Adam

Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Posted by Michael Fair <mi...@daclubhouse.net>.
On Mar 23, 2016 9:41 AM, "Jan Lehnardt" <ja...@apache.org> wrote:
>
> Great sleuthing Michael!
>
> In addition to the recommendation to upgrade to {minor_version: 1}, which could
> be a good first step,

So should I just go ahead and make a pull request for the change then?

It seems generally agreed that it's an issue and ought to be fixed, and
this doesn't seriously change anything about the way it's done; it simply
clarifies it and makes it more consistent across platforms.

I can confirm (using the erl command line) that with the change the md5
sums would match between an OTP 15 series, 17 series, and the java library
I have for the piDoc I posted (3.14159) where they didn't before.

I also suspect every language in question will have a way to create IEEE
64-bit floats (some obviously more easily than others, but the point is
they can).

> how about going the extra mile to make _rev generation
> easier across platforms?

I'm all for it!  And I'm up for it but at least with this step we have a
solid base to work from.

Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Posted by Jan Lehnardt <ja...@apache.org>.
Great sleuthing Michael!

In addition to the recommendation to upgrade to {minor_version: 1}, which could
be a good first step, how about going the extra mile to make _rev generation
easier across platforms? This would benefit PouchDB and others.

Best
Jan
-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/