Posted to dev@couchdb.apache.org by Alex Besogonov <al...@gmail.com> on 2011/11/14 19:52:47 UTC

Why MD5 is used for hashes, also about non-deterministic IDs.

I'm looking at the CouchDB source code and I have several questions:

1) Why is MD5 used instead of a more secure hash? It is easy to
imagine a situation where a malicious user causes a hash collision
and thereby causes problems in replication.

2) The ID is not completely deterministic: it depends on the
compression_level and compressible_types settings for attachments.
Would it make sense to use the MD5 of the original uncompressed document?
And while you're at it, it probably makes sense to include the file size
in the Atts2 tuple.
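
To illustrate the second point, here is a minimal Python sketch. It uses zlib as a stand-in for CouchDB's attachment compression (an assumption made for the example; the attachment bytes and helper names are hypothetical). It shows that a digest taken over the compressed bytes changes with the compression level, while a digest over the uncompressed bytes does not.

```python
import hashlib
import zlib

# Hypothetical attachment body; any compressible payload would do.
attachment = b"hello couch " * 200


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Digest over the compressed form: depends on the compression level,
# because the compressed byte stream differs between levels.
digest_level_1 = md5_hex(zlib.compress(attachment, 1))
digest_level_9 = md5_hex(zlib.compress(attachment, 9))

# Digest over the original bytes: independent of any compression setting.
digest_plain = md5_hex(attachment)

print(digest_level_1 == digest_level_9)  # False
print(digest_plain)
```

Hashing the uncompressed bytes would make the result independent of per-node compression settings, at the cost of hashing more data.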

Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Jason Smith <jh...@iriscouch.com>.
On Tue, Nov 15, 2011 at 5:19 AM, Dustin Sallings <du...@spy.net> wrote:
>
> On Nov 14, 2011, at 8:41 PM, Alex Besogonov wrote:
>
>> Now, it might not sound too threatening, but this attack breaks the
>> main invariant of
>> CouchDB - database replicas won't ever be eventually consistent!
>>
>> Also, I'd like to use stronger hash just on general principles.
>
>
>        I'd prefer to get rid of this functionality altogether.  It's wrong even in cases where people aren't being malicious.
>
>        Example:
>
>        I have a document that represents how many things I've got.
>
>        On node A, I increment the number of things.  I go from 5 things to 6 things.
>
>        On node B, I increment the number of things.  I go from 5 things to 6 things.
>
>        Replication catches up, sees the same digest, and now I have six things -- but this is incorrect.  I have seven things (or at least a conflict).

Hi, Dustin. I think your data model is incomplete.

You seem to care where or when or under what circumstances a change
occurred. Yet you don't care enough to store that information in the
document. That is an odd approach. A timestamp, for example, would
trigger a conflict.

Also, no doubt your example is as simple as possible; but for
posterity, the idiomatic way to count things is to have one document
per thing and _count them up in a reduction. In that case, you would
have seven things and no conflicts.
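
A minimal Python sketch of that idiom, using plain dictionaries as toy databases (the document fields, helper names, and the one-way replicate function are assumptions for illustration, not CouchDB code):

```python
import uuid


def new_thing(node: str) -> dict:
    """Create one document per thing; the random _id keeps documents distinct."""
    return {"_id": uuid.uuid4().hex, "type": "thing", "created_on": node}


def replicate(source: dict, target: dict) -> None:
    """Copy documents the target is missing (toy one-way replication)."""
    for doc_id, doc in source.items():
        target.setdefault(doc_id, doc)


def count_things(db: dict) -> int:
    """Toy stand-in for a view with a _count reduce."""
    return sum(1 for doc in db.values() if doc["type"] == "thing")


# Both nodes start with the same five things.
shared = [new_thing("seed") for _ in range(5)]
node_a = {d["_id"]: d for d in shared}
node_b = {d["_id"]: d for d in shared}

# Each node independently adds one thing.
for db, name in ((node_a, "A"), (node_b, "B")):
    doc = new_thing(name)
    db[doc["_id"]] = doc

# Replicate in both directions.
replicate(node_a, node_b)
replicate(node_b, node_a)

print(count_things(node_a), count_things(node_b))  # 7 7
```

Because every increment is its own document, the two concurrent additions never touch the same _id, so there is nothing to conflict.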

-- 
Iris Couch

Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Alex Besogonov <al...@gmail.com>.
On Tue, Nov 15, 2011 at 12:19 AM, Dustin Sallings <du...@spy.net> wrote:
> On Nov 14, 2011, at 8:41 PM, Alex Besogonov wrote:
>> Now, it might not sound too threatening, but this attack breaks the
>> main invariant of
>> CouchDB - database replicas won't ever be eventually consistent!
>> Also, I'd like to use stronger hash just on general principles.
>        I'd prefer to get rid of this functionality altogether.  It's wrong even in cases where people aren't being malicious.
That too, though it's a judgment call which approach is better.

>        Example:
>        I have a document that represents how many things I've got.
>        On node A, I increment the number of things.  I go from 5 things to 6 things.
>        On node B, I increment the number of things.  I go from 5 things to 6 things.
>        Replication catches up, sees the same digest, and now I have six things -- but this is incorrect.  I have seven things (or at least a conflict).
>        This will happen with any hash, but not with a UUID.
A hash does allow ancestry to be traced unforgeably, though. Git uses this approach.

But IMHO it's better to either have a secure hash-based implementation,
with cryptographically strong hashes authenticating canonical
representations of documents, or throw it out altogether and use a
UUID-based implementation.

Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Dustin Sallings <du...@spy.net>.
On Nov 14, 2011, at 8:41 PM, Alex Besogonov wrote:

> Now, it might not sound too threatening, but this attack breaks the
> main invariant of
> CouchDB - database replicas won't ever be eventually consistent!
> 
> Also, I'd like to use stronger hash just on general principles.


	I'd prefer to get rid of this functionality altogether.  It's wrong even in cases where people aren't being malicious.

	Example:

	I have a document that represents how many things I've got.

	On node A, I increment the number of things.  I go from 5 things to 6 things.

	On node B, I increment the number of things.  I go from 5 things to 6 things.

	Replication catches up, sees the same digest, and now I have six things -- but this is incorrect.  I have seven things (or at least a conflict).

	This will happen with any hash, but not with a UUID.
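
	The scenario can be sketched in Python. This is a toy model, not CouchDB's actual revision algorithm: the rev functions, the starting revision "1-abc", and the document body are all assumptions made for illustration.

```python
import hashlib
import json
import uuid


def deterministic_rev(parent_rev: str, body: dict) -> str:
    """Toy content-derived revision: generation number plus MD5 of parent + body."""
    generation = int(parent_rev.split("-")[0]) + 1
    payload = json.dumps([parent_rev, body], sort_keys=True).encode("utf-8")
    return f"{generation}-{hashlib.md5(payload).hexdigest()}"


def random_rev(parent_rev: str) -> str:
    """Toy UUID-based revision: generation number plus a random token."""
    generation = int(parent_rev.split("-")[0]) + 1
    return f"{generation}-{uuid.uuid4().hex}"


parent = "1-abc"            # hypothetical shared starting revision (5 things)
edit_on_a = {"things": 6}   # node A increments 5 -> 6
edit_on_b = {"things": 6}   # node B independently increments 5 -> 6

# Same parent and same body give the same revision on both nodes,
# so replication sees nothing to reconcile.
print(deterministic_rev(parent, edit_on_a) == deterministic_rev(parent, edit_on_b))  # True

# Random tokens differ, so the two edits would surface as a conflict.
print(random_rev(parent) == random_rev(parent))  # False
```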

-- 
dustin sallings




Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Jason Smith <jh...@iriscouch.com>.
On Wed, Nov 16, 2011 at 4:23 AM, Randall Leeds <ra...@gmail.com> wrote:
> On Tue, Nov 15, 2011 at 01:43, Robert Newson <rn...@apache.org> wrote:
>> _rev values used to be UUIDs and became deterministic to improve
>> replication performance. I can see that there's a theoretical issue
>> where replication could be inhibited, though I question how practical
>> it is given the internal details of _rev calculation.
>>
>> Remember that the _rev value is derived from the contents of the
>> documents, all the bytes of all attachments and values from previous
>> revisions. Stock MD5 preimage attacks are of a much simpler form
>> (finding a Y such that MD5(Y)=X for some desired X). Also that you
>> would have to arrange for the same number of updates as well, since
>> the number at the front is incremented on each successful update.
>>
>
> Also remember that the contents would have to parse as JSON, so that
> restricts this search space even further. Then, if I understand Jason
> correctly, we're also talking about a situation where Couch B is
> insecure... it's allowing a malicious user to change documents. If
> these documents are anything more important than something affecting
> the user herself then what you have is a malicious administrator or an
> insecure deployment. I don't think MD5 is to blame here.

That is my understanding. I don't think MD5 is relevant. You could
modify couch B's source code to give you whatever _rev trees you want.
The trick is pushing that back to couch A.

> Does that sound like a reasonable assessment to you, Alex?
>
> Also, I'd love to hear about your C++ replicator as it develops.

Alex, a C++ replicator is super exciting! I don't want to change the
subject, but another replicator implementation would be brilliant!


-- 
Iris Couch

Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Paul Davis <pa...@gmail.com>.
On Sun, Nov 20, 2011 at 1:54 AM, Alex Besogonov
<al...@gmail.com> wrote:
> On Thu, Nov 17, 2011 at 5:57 PM, Randall Leeds <ra...@gmail.com> wrote:
>> On Wed, Nov 16, 2011 at 09:46, Alex Besogonov <al...@gmail.com> wrote:
>>> On Tue, Nov 15, 2011 at 4:23 PM, Randall Leeds <ra...@gmail.com> wrote:
> [skip]
>> I hope that settles the "why", reassures any
>> "oh-my-god-my-couch-is-vulnerable", and motivates the
>> "hey-lets-make-a-patch" if you still want the feature, with the
>> understanding that it's unlikely the project will specify this as a
>> necessary condition for general-purpose replication. If you have more
>> bullet-proof needs, dev that armor up and I'll review it, but I'd
>> advise making it a config option.
> Well, I'm currently trying to replicate CouchDB behavior exactly, but
> I'm slowly moving towards the 'hey, I'll just write a patch' stage.
> However, I like the idea of waiting for the SHA-3 competition to
> finish, since it gives me more time :)
>

Just a note, but remember that you should be able to use sha-8billion
in your replicator and it'd work fine with CouchDB (and if it doesn't,
it's a bug in CouchDB that we'd want to fix). The replicator/revision
mechanism is written so that once generated, the _rev tokens are
opaque and the only tests used are equality comparisons.
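
A minimal Python sketch of what "opaque, equality only" means in practice. The SHA-256 generator and the missing_revs helper are hypothetical names invented for this example; they are not CouchDB or replicator APIs.

```python
import hashlib
import json


def sha256_rev(generation: int, body: dict) -> str:
    """Hypothetical replicator-side rev generator using SHA-256 instead of MD5."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return f"{generation}-{hashlib.sha256(encoded).hexdigest()}"


def missing_revs(source_revs: list, target_revs: list) -> list:
    """Revisions the target lacks, decided purely by string equality."""
    known = set(target_revs)
    return [rev for rev in source_revs if rev not in known]


# One rev produced MD5-style, one produced by the hypothetical SHA-256 scheme.
md5_style = "1-" + hashlib.md5(b"first version").hexdigest()
sha_style = sha256_rev(2, {"things": 6})

source = [md5_style, sha_style]
target = [md5_style]

# The comparison never inspects how a token was generated.
print(missing_revs(source, target) == [sha_style])  # True
```

Since the tokens are only ever compared for equality, their length and the hash that produced them do not matter to the comparison.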

HTH,
Paul Davis


Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Alex Besogonov <al...@gmail.com>.
On Thu, Nov 17, 2011 at 5:57 PM, Randall Leeds <ra...@gmail.com> wrote:
> On Wed, Nov 16, 2011 at 09:46, Alex Besogonov <al...@gmail.com> wrote:
>> On Tue, Nov 15, 2011 at 4:23 PM, Randall Leeds <ra...@gmail.com> wrote:
[skip]
> I hope that settles the "why", reassures any
> "oh-my-god-my-couch-is-vulnerable", and motivates the
> "hey-lets-make-a-patch" if you still want the feature, with the
> understanding that it's unlikely the project will specify this as a
> necessary condition for general-purpose replication. If you have more
> bullet-proof needs, dev that armor up and I'll review it, but I'd
> advise making it a config option.
Well, I'm currently trying to replicate CouchDB behavior exactly, but
I'm slowly moving towards the 'hey, I'll just write a patch' stage.
However, I like the idea of waiting for the SHA-3 competition to
finish, since it gives me more time :)


Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Jan Lehnardt <ja...@apache.org>.
Thanks Randall :)

On Nov 17, 2011, at 23:57 , Randall Leeds wrote:

> On Wed, Nov 16, 2011 at 09:46, Alex Besogonov <al...@gmail.com> wrote:
>> On Tue, Nov 15, 2011 at 4:23 PM, Randall Leeds <ra...@gmail.com> wrote:
>>>> Remember that the _rev value is derived from the contents of the
>>>> documents, all the bytes of all attachments and values from previous
>>>> revisions. Stock MD5 preimage attacks are of a much simpler form
>>>> (finding a Y such that MD5(Y)=X for some desired X). Also that you
>>>> would have to arrange for the same number of updates as well, since
>>>> the number at the front is incremented on each successful update.
>>> Also remember that the contents would have to parse as JSON, so that
>>> restricts this search space even further.
>> Not really. Binary representation of JSON is used to calculate the hash.
>> 
>> So I can make a document like this:
>> ===
>> {
>>  "aa" : "xxxxxxxxxxxxxx.....[several thousands x's]"
>> }
>> ===
>> 
>> And use the large 'xxx...x' string as a scratch area for my attack. I don't
>> even need to bother with quoting issues because CouchDB is going to
>> unquote everything during JSON parsing. And there are no other hash
>> codes to work around (working around even two MD5s at the same time
>> is much harder).
>> 
>> That's about the best possible case for an attacker.
> 
> This "attack", though, is still pretty hard, and, I think, not an
> attack. The document _does_ have to take a trip through a JSON parser,
> pass as valid JSON, but create an MD5 sum, along with the metadata,
> that matches the revision id of the original document. All this needs
> to be done on a Couch that is trusted to perform unfiltered,
> bi-directional replication and allows the attacker to change documents
> that matter to other people.
> 
> The proper way to stop the "attack" is to not let users modify
> documents that will screw up things for other people. It's kind of
> like how a UNIX user is _welcome_ to trash their .bashrc and just
> because their home directory is mounted over NFS and now their .bashrc
> is trashed _everywhere_ doesn't mean they've really done any damage
> from anyone else's point of view. They didn't attack anything but
> themselves.
> 
> ----
> 
> However. It's worth noting that an attacker can just make up whatever
> revision identifiers they want to, without dealing with the MD5 stuff
> anyway!!! Passing ?new_edits=false allows an "attacker" to specify
> that a document has any revision they want, with whatever history of
> revisions they want.
> 
> curl -X PUT -H "Content-Type: application/json" \
>   "http://some.couch/somedb/document?new_edits=false" \
>   -d '{"_id":"document", "_rev":"5-anything",
>        "_revisions":{"start":5,"ids":["anything",
>        "everything","bogus","revids"]}}'
> 
> (Side note to devs: we may want to deterministically prune the leaves
> for duplicates after merging rev trees, or not, because, well, this is
> a crazy hand-crafted fake-out and caveat power-user.)
> 
> In fact, I just discovered yesterday that you can create unreachable
> conflicts this way, by giving them revision ids and histories that
> create two branches with identical leaves but different stems. If
> CouchDB did decide to enforce some crypto-verifiable constraints on
> revision ids, they could be checked to prevent this kind of
> mis-history. However, other implementations would be forced to follow
> the same scheme. I think the intention of making the revision ID
> opaque was to make it an implementation detail and specifically _not_
> a security or validation feature.
> 
> That said, I'm starting to come around to this idea. I'd be happy to
> see patches that enable a "strict revisions mode" for CouchDB. I don't
> feel like CouchDB has made any promises that are broken by using MD5,
> but additional promises could possibly be made if we took a git-like
> approach to revision crypto.
> 
> I hope that settles the "why", reassures any
> "oh-my-god-my-couch-is-vulnerable", and motivates the
> "hey-lets-make-a-patch" if you still want the feature, with the
> understanding that it's unlikely the project will specify this as a
> necessary condition for general-purpose replication. If you have more
> bullet-proof needs, dev that armor up and I'll review it, but I'd
> advise making it a config option.
> 
> -Randall
> 


Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Randall Leeds <ra...@gmail.com>.
On Wed, Nov 16, 2011 at 09:46, Alex Besogonov <al...@gmail.com> wrote:
> On Tue, Nov 15, 2011 at 4:23 PM, Randall Leeds <ra...@gmail.com> wrote:
>>> Remember that the _rev value is derived from the contents of the
>>> documents, all the bytes of all attachments and values from previous
>>> revisions. Stock MD5 preimage attacks are of a much simpler form
>>> (finding a Y such that MD5(Y)=X for some desired X). Also that you
>>> would have to arrange for the same number of updates as well, since
>>> the number at the front is incremented on each successful update.
>> Also remember that the contents would have to parse as JSON, so that
>> restricts this search space even further.
> Not really. Binary representation of JSON is used to calculate the hash.
>
> So I can make a document like this:
> ===
> {
>  "aa" : "xxxxxxxxxxxxxx.....[several thousands x's]"
> }
> ===
>
> And use the large 'xxx...x' string as a scratch area for my attack. I don't
> even need to bother with quoting issues because CouchDB is going to
> unquote everything during JSON parsing. And there are no other hash
> codes to work around (working around even two MD5s at the same time
> is much harder).
>
> That's about the best possible case for an attacker.

This "attack", though, is still pretty hard, and, I think, not an
attack. The document _does_ have to take a trip through a JSON parser,
pass as valid JSON, and still produce an MD5 sum, together with the
metadata, that matches the revision id of the original document. All this needs
to be done on a Couch that is trusted to perform unfiltered,
bi-directional replication and allows the attacker to change documents
that matter to other people.

The proper way to stop the "attack" is to not let users modify
documents that will screw up things for other people. It's kind of
like how a UNIX user is _welcome_ to trash their .bashrc and just
because their home directory is mounted over NFS and now their .bashrc
is trashed _everywhere_ doesn't mean they've really done any damage
from anyone else's point of view. They didn't attack anything but
themselves.

----

However. It's worth noting that an attacker can just make up whatever
revision identifiers they want to, without dealing with the MD5 stuff
anyway!!! Passing ?new_edits=false allows an "attacker" to specify
that a document has any revision they want, with whatever history of
revisions they want.

curl -X PUT -H "Content-Type: application/json" \
  "http://some.couch/somedb/document?new_edits=false" \
  -d '{"_id":"document", "_rev":"5-anything",
       "_revisions":{"start":5,"ids":["anything",
       "everything","bogus","revids"]}}'
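
The same request body can be assembled and checked with a strict JSON parser before sending it. This Python sketch only builds the URL and body and sends nothing; the host, database, and document names are the placeholders from the curl example, and the variable names are assumptions.

```python
import json
from urllib.parse import urlencode

# Placeholder host, database and document id from the curl example.
base_url = "http://some.couch/somedb/document"
url = f"{base_url}?{urlencode({'new_edits': 'false'})}"

# Hand-picked revision and history, as in the curl example.
body = json.dumps({
    "_id": "document",
    "_rev": "5-anything",
    "_revisions": {
        "start": 5,
        "ids": ["anything", "everything", "bogus", "revids"],
    },
})

# The serialized body round-trips through a strict JSON parser,
# which requires every object key (including "ids") to be quoted.
parsed = json.loads(body)

print(url)
print(parsed["_revisions"]["ids"][0])
```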

(Side note to devs: we may want to deterministically prune the leaves
for duplicates after merging rev trees, or not, because, well, this is
a crazy hand-crafted fake-out and caveat power-user.)

In fact, I just discovered yesterday that you can create unreachable
conflicts this way, by giving them revision ids and histories that
create two branches with identical leaves but different stems. If
CouchDB did decide to enforce some crypto-verifiable constraints on
revision ids, they could be checked to prevent this kind of
mis-history. However, other implementations would be forced to follow
the same scheme. I think the intention of making the revision ID
opaque was to make it an implementation detail and specifically _not_
a security or validation feature.

That said, I'm starting to come around to this idea. I'd be happy to
see patches that enable a "strict revisions mode" for CouchDB. I don't
feel like CouchDB has made any promises that are broken by using MD5,
but additional promises could possibly be made if we took a git-like
approach to revision crypto.

I hope that settles the "why", reassures any
"oh-my-god-my-couch-is-vulnerable", and motivates the
"hey-lets-make-a-patch" if you still want the feature, with the
understanding that it's unlikely the project will specify this as a
necessary condition for general-purpose replication. If you have more
bullet-proof needs, dev that armor up and I'll review it, but I'd
advise making it a config option.

-Randall


Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Alex Besogonov <al...@gmail.com>.
On Tue, Nov 15, 2011 at 4:23 PM, Randall Leeds <ra...@gmail.com> wrote:
>> Remember that the _rev value is derived from the contents of the
>> documents, all the bytes of all attachments and values from previous
>> revisions. Stock MD5 preimage attacks are of a much simpler form
>> (finding a Y such that MD5(Y)=X for some desired X). Also that you
>> would have to arrange for the same number of updates as well, since
>> the number at the front is incremented on each successful update.
> Also remember that the contents would have to parse as JSON, so that
> restricts this search space even further.
Not really. The binary representation of the JSON is used to calculate the hash.

So I can make a document like this:
===
{
  "aa" : "xxxxxxxxxxxxxx.....[several thousands x's]"
}
===

And use the large 'xxx...x' string as a scratch area for my attack. I don't
even need to bother with quoting issues because CouchDB is going to
unquote everything during JSON parsing. And there are no other hash
codes to work around (working around even two MD5s at the same time
is much harder).

That's about the best possible case for an attacker.
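
A small Python sketch of the quoting point only: arbitrary bytes can be carried inside a JSON string and come back unchanged after parsing. The byte values and the latin-1 mapping are assumptions chosen for the example; nothing here computes or attacks a hash.

```python
import json

# Arbitrary scratch bytes, including a quote, a backslash and control characters.
scratch_bytes = bytes([0x00, 0x22, 0x5C, 0x0A, 0x41, 0xFF])

# Map each byte to one character so it can live inside a JSON string.
scratch_text = scratch_bytes.decode("latin-1")

# Serialize a document with the scratch area in the "aa" field.
document_text = json.dumps({"aa": scratch_text})

# Quotes, backslashes and control characters are escaped in the text form...
print(document_text)

# ...and parsing restores the original characters exactly.
restored = json.loads(document_text)["aa"]
print(restored.encode("latin-1") == scratch_bytes)  # True
```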

> Then, if I understand Jason
> correctly, we're also talking about a situation where Couch B is
> insecure... it's allowing a malicious user to change documents. If
> these documents are anything more important than something affecting
> the user herself then what you have is a malicious administrator or an
> insecure deployment. I don't think MD5 is to blame here.
No, the issue here is the possibility of breaking synchronization.

> Does that sound like a reasonable assessment to you, Alex?
Almost.

> Also, I'd love to hear about your C++ replicator as it develops.
Sure, I'm developing a very small and fast embedded storage for mobile
devices and desktop apps. It'll be open source once I finish its core.


Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Randall Leeds <ra...@gmail.com>.
On Tue, Nov 15, 2011 at 01:43, Robert Newson <rn...@apache.org> wrote:
> _rev values used to be UUIDs and became deterministic to improve
> replication performance. I can see that there's a theoretical issue
> where replication could be inhibited, though I question how practical
> it is given the internal details of _rev calculation.
>
> Remember that the _rev value is derived from the contents of the
> documents, all the bytes of all attachments and values from previous
> revisions. Stock MD5 preimage attacks are of a much simpler form
> (finding a Y such that MD5(Y)=X for some desired X). Also that you
> would have to arrange for the same number of updates as well, since
> the number at the front is incremented on each successful update.
>

Also remember that the contents would have to parse as JSON, so that
restricts this search space even further. Then, if I understand Jason
correctly, we're also talking about a situation where Couch B is
insecure... it's allowing a malicious user to change documents. If
these documents are anything more important than something affecting
the user herself then what you have is a malicious administrator or an
insecure deployment. I don't think MD5 is to blame here.

Does that sound like a reasonable assessment to you, Alex?

Also, I'd love to hear about your C++ replicator as it develops.

-Randall


Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Robert Newson <rn...@apache.org>.
_rev values used to be UUIDs and became deterministic to improve
replication performance. I can see that there's a theoretical issue
where replication could be inhibited, though I question how practical
it is given the internal details of _rev calculation.

Remember that the _rev value is derived from the contents of the
documents, all the bytes of all attachments and values from previous
revisions. Stock MD5 preimage attacks are of a much simpler form
(finding a Y such that MD5(Y)=X for some desired X). Also that you
would have to arrange for the same number of updates as well, since
the number at the front is incremented on each successful update.
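
To make the shape of that derivation concrete, here is a minimal Python
sketch. It is not CouchDB's actual algorithm: the function name next_rev,
its parameters, and the use of canonical JSON in place of CouchDB's
internal encoding are all assumptions made for illustration. It only shows
that the token combines an incrementing number with a digest over the
body, the attachment digests, and the previous revision.

```python
import hashlib
import json


def next_rev(prev_rev, body, attachment_digests=(), deleted=False,
             algorithm="md5"):
    """Return an 'N-hash' revision token for one update (illustrative only)."""
    # The number at the front is incremented on each successful update.
    generation = int(prev_rev.split("-", 1)[0]) + 1 if prev_rev else 1
    # Assumption: canonical JSON stands in for the real internal encoding.
    payload = json.dumps(
        [deleted, prev_rev, body, sorted(attachment_digests)],
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    digest = hashlib.new(algorithm, payload).hexdigest()
    return "%d-%s" % (generation, digest)


first = next_rev(None, {"things": 5})
second = next_rev(first, {"things": 6})
print(first.split("-")[0], second.split("-")[0])  # prints: 1 2
```

Because the previous revision is part of the hashed payload, the same
body written on top of two different histories yields two different
tokens in this sketch. Passing algorithm="sha256" swaps the digest
without changing the structure.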

For switching from MD5 to SHA-1, I say no. If we switch, let's use
something contemporary like SHA-256. Better yet, let's wait for the
winner of the SHA-3 competition.

B.

On 15 November 2011 07:57, Jason Smith <jh...@iriscouch.com> wrote:
> On Tue, Nov 15, 2011 at 7:34 AM, Alex Besogonov
> <al...@gmail.com> wrote:
>>>> Now I make a change to 'Doc' at machine A. This creates a new revid
>>>> with new md5 hash.
>>>> A malicious software somehow learns about this update and creates
>>>> another document
>>>> on machine B, contriving it so to make the resulting hash to be the
>>>> same as on machine A.
>>> Before going any further, you must show why we care about the contents
>>> of machine B.
>>> Why would I log in to machine B if I do not trust B's owner? Why would
>>> I clone your Git repository if I do not know you?
>> The problem is, MD5 hash depends on _untrusted_ data that external
>> processes might put into the database.
>>
>> For example, imagine that machines A and B use CouchDB to store
>> certificates.
>
> I ask again.
>
> --
> Iris Couch
>

Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Jason Smith <jh...@iriscouch.com>.
On Tue, Nov 15, 2011 at 7:34 AM, Alex Besogonov
<al...@gmail.com> wrote:
>>> Now I make a change to 'Doc' at machine A. This creates a new revid
>>> with new md5 hash.
>>> A malicious software somehow learns about this update and creates
>>> another document
>>> on machine B, contriving it so to make the resulting hash to be the
>>> same as on machine A.
>> Before going any further, you must show why we care about the contents
>> of machine B.
>> Why would I log in to machine B if I do not trust B's owner? Why would
>> I clone your Git repository if I do not know you?
> The problem is, MD5 hash depends on _untrusted_ data that external
> processes might put into the database.
>
> For example, imagine that machines A and B use CouchDB to store
> certificates.

I ask again.

-- 
Iris Couch

Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Alex Besogonov <al...@gmail.com>.
>> Now I make a change to 'Doc' at machine A. This creates a new revid
>> with new md5 hash.
>> A malicious software somehow learns about this update and creates
>> another document
>> on machine B, contriving it so to make the resulting hash to be the
>> same as on machine A.
> Before going any further, you must show why we care about the contents
> of machine B.
> Why would I log in to machine B if I do not trust B's owner? Why would
> I clone your Git repository if I do not know you?
The problem is, MD5 hash depends on _untrusted_ data that external
processes might put into the database.

For example, imagine that machines A and B use CouchDB to store
certificates. On machine A, an administrator issues a certificate
revocation record for a certificate stored in 'Doc2'. On machine B,
malware issues a no-op update of 'Doc2' (using normal document
management functionality) that is contrived to have the same ID as the
certificate revocation record issued on machine A.

Such tampering would normally be noticed through conflicts in
replication, but in this case it would go unnoticed!

This is a somewhat contrived example, but for me the most crucial fact is
that external potentially untrusted data can force CouchDB to behave
incorrectly and violate its invariants.

It can be ignored as a minor issue, of course, but in this case the fix is
simple - just a switch from MD5 to something more secure like SHA-1.

> Finally, revision tokens might look like MD5, but they are not. They
> especially look like MD5 if you read the source code. But they are not
> MD5. They are opaque tokens. They do not serve a security function.
> Between trusted nodes, they indicate document changes.
I'm actually writing a connector between CouchDB and an external system, so
I'm reimplementing all the functionality required for the synchronization
protocol from scratch (in C++). Quite an interesting task.

Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Jason Smith <jh...@iriscouch.com>.
On Tue, Nov 15, 2011 at 4:41 AM, Alex Besogonov
<al...@gmail.com> wrote:
> Now I make a change to 'Doc' at machine A. This creates a new revid
> with new md5 hash.
> A malicious software somehow learns about this update and creates
> another document
> on machine B, contriving it so to make the resulting hash to be the
> same as on machine A.

Before going any further, you must show why we care about the contents
of machine B.

Why would I log in to machine B if I do not trust B's owner? Why would
I clone your Git repository if I do not know you?

Finally, revision tokens might look like MD5, but they are not. They
especially look like MD5 if you read the source code. But they are not
MD5. They are opaque tokens. They do not serve a security function.
Between trusted nodes, they indicate document changes.

rsync used MD4 because it was faster, and who cares? You have already
authenticated (SSH) and been authorized (permission bits).

-- 
Iris Couch

Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Alex Besogonov <al...@gmail.com>.
On Mon, Nov 14, 2011 at 5:48 PM, Randall Leeds <ra...@gmail.com> wrote:
>> I'm looking at CouchDB source code and I have several questions:
>> 1) Why MD5 is used instead of more secure hashes. It's very real to
>> imagine a situation where a malicious user can cause hash collision
>> and cause problems in replication.
> Can you explain a little bit more where you see this interacting with
> replication?
For example, imagine that I have two replicas on machine A and machine B with
document 'Doc' at the same initial state.

Now I make a change to 'Doc' at machine A. This creates a new revid
with new md5 hash.
Malicious software somehow learns about this update and creates
another document on machine B, contrived so that the resulting hash
is the same as on machine A.

So during replication machine B won't detect that a new version of the
document is present, and the changes from machine A won't be
replicated. An attack on MD5 that achieves this is quite possible today.

Now, it might not sound too threatening, but this attack breaks the
main invariant of CouchDB - database replicas won't ever be
eventually consistent!
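
The scenario above can be sketched in a few lines of Python. This is a
simplified model, not the real replicator: the helper revs_diff and the
revision strings are hypothetical. The one property it relies on is that
the comparison looks at revision IDs only and never at document bodies.

```python
def revs_diff(source_revs, target_revs):
    """Return {doc_id: [rev, ...]} for revisions the target does not have.

    Only revision IDs are compared; document bodies are never inspected.
    """
    missing = {}
    for doc_id, revs in source_revs.items():
        known = set(target_revs.get(doc_id, ()))
        wanted = [rev for rev in revs if rev not in known]
        if wanted:
            missing[doc_id] = wanted
    return missing


# Both machines start from the same revision of 'Doc'.
machine_a = {"Doc": ["1-aaa", "2-bbb"]}  # legitimate update on A
machine_b = {"Doc": ["1-aaa", "2-bbb"]}  # contrived update on B, same ID
print(revs_diff(machine_a, machine_b))  # prints: {}
```

An empty result means nothing is transferred, so in this model the two
machines keep different bodies under the same revision ID and no conflict
is recorded.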

Also, I'd like to use stronger hash just on general principles.

>> 2) ID is not completely deterministic - it depends on
>> compression_level and compressible_types settings for attachments.
>> Would it make sense to use MD5 of the original uncompressed document?
>> And while you're at it, it probably makes sense to include file size
>> in Atts2 tuple.
> Nothing in my mind requires that IDs be deterministic. It's useful for
> reducing conflicts when identical changes are replayed on different
> replicating couches, but it's not strictly required.
Yes, strictly deterministic IDs are not required, but it would be nice to have
a canonical form.
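
As a rough illustration of the canonical-form concern, the Python sketch
below hashes the same attachment bytes before and after gzip encoding at
different compression levels. The function name describe and the field
names raw_md5 and encoded_md5 are made up for this example, and it
assumes (as point 2 suggests) that the digest is taken over the stored,
encoded bytes.

```python
import gzip
import hashlib


def describe(data, compression_level):
    """Summarise one attachment as stored at a given gzip level."""
    # mtime=0 keeps the gzip header free of the current timestamp.
    encoded = gzip.compress(data, compresslevel=compression_level, mtime=0)
    return {
        "length": len(data),             # size of the original bytes
        "encoded_length": len(encoded),  # size of the stored bytes
        "raw_md5": hashlib.md5(data).hexdigest(),
        "encoded_md5": hashlib.md5(encoded).hexdigest(),
    }


page = b"<html><body>" + b"<p>hello</p>" * 200 + b"</body></html>"
fast = describe(page, 1)
best = describe(page, 9)
print(fast["raw_md5"] == best["raw_md5"])  # prints: True
```

The digest of the original bytes is the same whatever the setting, while
the encoded bytes (and so any digest over them) may differ between
compression levels or zlib builds.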

> With respect to uncompressed file size, sometimes that information is
> not available for attachments since they may have been sent over the
> wire in compressed form. We went over this conversation a few times
> when adding compression features and it was decided that uncompressing
> on the fly, server-side, just to get the uncompressed file size and
> hash was not worth it.
Does it really have that much overhead? Usually only fairly small text/html/css
files are compressed. But okay, maybe at least a tag with the compression level
and scheme could be attached?

> Attachment records do have att_len and disk_len (sometimes the same,
> depending on the encoding/compression during upload) properties and I
> believe this is exposed in the _attachments metadata on document
> requests.
I'm thinking about making it a part of the document revid.

Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Posted by Randall Leeds <ra...@gmail.com>.
On Mon, Nov 14, 2011 at 10:52, Alex Besogonov <al...@gmail.com> wrote:
> I'm looking at CouchDB source code and I have several questions:
>
> 1) Why MD5 is used instead of more secure hashes. It's very real to
> imagine a situation where a malicious user can cause hash collision
> and cause problems in replication.

Can you explain a little bit more where you see this interacting with
replication?

>
> 2) ID is not completely deterministic - it depends on
> compression_level and compressible_types settings for attachments.
> Would it make sense to use MD5 of the original uncompressed document?
> And while you're at it, it probably makes sense to include file size
> in Atts2 tuple.
>

Nothing in my mind requires that IDs be deterministic. It's useful for
reducing conflicts when identical changes are replayed on different
replicating couches, but it's not strictly required.

With respect to uncompressed file size, sometimes that information is
not available for attachments since they may have been sent over the
wire in compressed form. We went over this conversation a few times
when adding compression features and it was decided that uncompressing
on the fly, server-side, just to get the uncompressed file size and
hash was not worth it.

Attachment records do have att_len and disk_len (sometimes the same,
depending on the encoding/compression during upload) properties and I
believe this is exposed in the _attachments metadata on document
requests. I don't know exactly what's changed since what release, so
it may not be visible in released versions of CouchDB. Looking at the
code in master right now, I see "length", "encoded_length", and
"digest" included in the attachment metadata.

Thanks!
-Randall