Posted to dev@couchdb.apache.org by Dave Cottlehuber <dc...@jsonified.com> on 2014/02/20 10:37:01 UTC

Re: Duplicate fields in documents

> > On Feb 19, 2014, at 6:07 AM, Dave Cottlehuber wrote:
> >
> > > TL;DR the appropriately named ECMA 404 JSON spec [1] is broken or more
> > politely, insufficiently specific.
> >
> On Wed, Feb 19, 2014 at 8:30 PM, Jens Alfke wrote:
> >
> > This seems to fall into the category of "things so obvious that the people
> > who wrote the spec didn't realize they had to mention them." I.e. "You
> > can't have duplicate keys.”

When I was in France, I learned that anything not explicitly forbidden is
therefore permitted. And while in Germany, I learned that anything not
explicitly permitted is forbidden :-). Luckily in New Zealand, it only matters
if you get caught ;-).

On 20 February 2014 at 06:14:29, Mikael Söderström (vimpyboy@msn.com) wrote:
>
> If there are duplicate keys, it should absolutely fail.

I think returning an error on receiving duplicate keys would be a sensible
change to CouchDB, albeit a relatively minor breaking one. See below.

> > > JSON is typically based on a dictionary or hash map, and there's no
> > particular reason for that data structure to enforce uniqueness of keys.
> >
> > I disagree. Mathematically, a dictionary/map object is a function: it maps
> > from a set of keys to a set of values, with each key mapping to exactly one
> > value. (That's basically the definition of 'function'.) It's certainly
> > possible to create implementations that map a key to _multiple_ values, but
> > that's something different: it's a mapping from a key to a set. (For
> > example, it's not from string-->int, it's now from string-->set.) The
> > JSON spec does not include this kind of mapping -- an object value in JSON
> > can be a number, but not a set of numbers.

I was referring to the fact that hash tables usually have some form of collision
detection internally, when multiple keys map to the same hash bucket. There are
perfect hash functions & algorithms that avoid collisions, but that’s getting
a bit off track. And it’s a moot point, as we all agree that duplicate keys are not
what most people expect, including Crockford.

> > IMHO the reasonable thing for a JSON parser to do if it encounters a
> > duplicate key is to fail with a clear error. Failing that, the only other
> > reasonable option is to discard one or the other value (I don't have an
> > opinion which.) But keeping both is unreasonable.

If somebody wants to sort this out, I’d suggest implementing the fix (in C) in
Paul Davis’ jiffy library, which has been on the list of things to import
for a while.

I added https://github.com/davisp/jiffy/issues/54 and updated
https://issues.apache.org/jira/browse/COUCHDB-1294

jiffy:decode(<< "{\"foo\":\"bar\", \"foo\":\"bar\"}" >>) should return an error,
something like invalid_json_duplicate_name_in_object.
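
In the meantime, a rough sketch of a post-decode check in plain Erlang,
assuming jiffy’s current {[{Key, Value}]} proplist output (the function
names here are mine, not jiffy’s):

    %% Walk a decoded EJSON term; error out if any object repeats a name.
    check_dupes({Props}) when is_list(Props) ->
        Keys = [K || {K, _V} <- Props],
        case length(Keys) =:= length(lists:usort(Keys)) of
            false -> {error, invalid_json_duplicate_name_in_object};
            true  -> check_all([V || {_K, V} <- Props])
        end;
    check_dupes(List) when is_list(List) ->
        check_all(List);                    %% recurse into arrays
    check_dupes(_Scalar) ->
        ok.

    check_all([]) -> ok;
    check_all([H | T]) ->
        case check_dupes(H) of
            ok    -> check_all(T);
            Error -> Error
        end.

Doing the check during parsing, in the C parser itself, would obviously
be cheaper than a second pass like this.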

CC’d dev@.

A+
-- 
Dave Cottlehuber
Sent from my PDP11



Re: Duplicate fields in documents

Posted by Jens Alfke <je...@couchbase.com>.
On Feb 20, 2014, at 10:40 AM, Paul Davis <pa...@gmail.com> wrote:

> To me, the answer to the original email in this thread is "If you
> don't want duplicate keys in your JSON, don't put duplicate keys in
> your JSON". Given that this is the first time I've actually seen this
> as a user reported error I think that guidance has served users quite
> well.

The thing that bothers me about this scenario is that you end up with a document stored in the database, which can be interpreted differently by different clients.

For example, you PUT a doc whose JSON contains "… "color":"blue", "color":"green", …
Now a client written in language X reads this doc. Language X's JSON parser handles dups by keeping the first key, so this client sees the color as blue.
A client in language Y reads this doc; Y's JSON parser keeps the last key, so this client sees the color as green.
A client in language Z reads the doc but gets an invalid-JSON error because Z's parser is strict and rejects duplicate keys. This client fails somehow. Hopefully it was checking errors properly and doesn't just crash.
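
For a concrete illustration, here is roughly what CouchDB's own parser does today (a sketch assuming jiffy's current proplist output, not verified against trunk):

    1> {Props} = jiffy:decode(<<"{\"color\":\"blue\", \"color\":\"green\"}">>).
    {[{<<"color">>,<<"blue">>},{<<"color">>,<<"green">>}]}
    2> proplists:get_value(<<"color">>, Props).
    <<"blue">>

proplists:get_value/2 returns the first match, so an Erlang client here behaves like language X, while any last-key-wins parser sees green.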

So here you have a situation where CouchDB allows a document into a database that two clients can see in very different ways, while another client can't even read the document. That does not seem like a happy place at _all_. Especially if you consider validation issues — what if the database has a validation function that rejects green documents? The behavior now depends on which of duplicate values the JS implementation inside CouchDB sees. (Is it still SpiderMonkey or has it switched to V8? Do they have different behavior?) And a client with different rules might see the document with the invalid color.

I am not a licensed security expert (though I've dated one), but to me this strongly smells like a security hole in data validation. I'm pretty certain someone could find a way to build an exploit out of this, inserting a doc that's allowed by the database's validator but which reveals entirely different invalid contents to a client app with a different JSON parser.

—Jens

Re: Duplicate fields in documents

Posted by Paul Davis <pa...@gmail.com>.
So this is a wonderful example of the JSON "spec" being rather
underspecified. Most of the time, for most languages, this is not an
issue. Every parser I know of uses some version of a language-native
hash map to represent JSON objects. Erlang currently has no such
native mapping (I'll discuss maps in 17.0-rc1 below). The two closest
structures commonly used in Erlang are proplists and the dict module.
For whatever reason, proplists were chosen as *the* representation
for JSON in Erlang (disregarding some controversy around the
specifics of the proplist representation).
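
Concretely, a sketch of that representation (jiffy's pre-maps EJSON
format, where an object is a proplist wrapped in a one-tuple):

    %% {"a": 1, "b": [true, null]} decodes to:
    {[{<<"a">>, 1},
      {<<"b">>, [true, null]}]}

    %% and a duplicate name is simply a second proplist entry:
    %% {"a": 1, "a": 2} decodes to:
    {[{<<"a">>, 1}, {<<"a">>, 2}]}

So duplicates survive decoding untouched.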

The historical interpretation of this spec is interesting as well.
For instance, Jens mentioned that order preservation is non-standard,
which is kinda true but not really. Key ordering is actually even
worse in terms of general implementation consensus. JavaScript
iterates over keys in the order they were defined. Python has a
deterministic-yet-undefined ordering. Ruby used to have a completely
random ordering (i.e., iterating twice over the same keys could yield
different orderings; I think this has changed though).

These are all details that aren't in the spec but instead depend on
the language's native representation of a JSON object. Given that
Erlang uses a proplist, following the semantics of a proplist
shouldn't be a surprise. Yes, duplicate keys are different from most
implementations, but they're neither prohibited nor given a clear
"correct" answer, especially considering the RFC's further intent
that a JSON text conforming to the grammar shouldn't be rejected
(something that Jiffy willfully violates in some circumstances).

To me, the answer to the original email in this thread is "If you
don't want duplicate keys in your JSON, don't put duplicate keys in
your JSON". Given that this is the first time I've actually seen this
as a user-reported error, I think that guidance has served users
quite well. I'd also be interested to hear which serializer generated
the duplicate keys.

Given all of that, Erlang 17.0-rc1 has a native hash-map-like data
structure available, which I would wager will most likely become the
standard representation for JSON in Erlang. Once Jiffy returns values
as maps, the duplicate-key behavior will match most implementations
by keeping only the last value for each duplicate key. Making that
change is going to be interesting and will present a series of
challenges (i.e., rewriting large swaths of CouchDB internals).
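
For a sketch of those semantics: maps:from_list/1 keeps the right-most
value when a key repeats, so a maps-based decode would be last-wins
(assuming Jiffy builds its maps that way):

    1> maps:from_list([{<<"foo">>, <<"bar">>}, {<<"foo">>, <<"baz">>}]).
    #{<<"foo">> => <<"baz">>}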

Given all that, I don't believe it's an issue. It is a bit weird when
people first encounter the behavior, but since it's proven not to be
an issue over the last five years, I'm not too inclined to change it
given the downsides of adding the check.

On Thu, Feb 20, 2014 at 3:37 AM, Dave Cottlehuber <dc...@jsonified.com> wrote:
>> > On Feb 19, 2014, at 6:07 AM, Dave Cottlehuber wrote:
>> >
>> > > TL;DR the appropriately named ECMA 404 JSON spec [1] is broken or more
>> > politely, insufficiently specific.
>> >
>> On Wed, Feb 19, 2014 at 8:30 PM, Jens Alfke wrote:
>> >
>> > This seems to fall into the category of "things so obvious that the people
>> > who wrote the spec didn't realize they had to mention them." I.e. "You
>> > can't have duplicate keys."
>
> When I was in France, I learned that anything not explicitly forbidden is
> therefore permitted. And while in Germany, I learned that anything not
> explicitly permitted is forbidden :-). Luckily in New Zealand, it only matters
> if you get caught ;-).
>
> On 20 February 2014 at 06:14:29, Mikael Söderström (vimpyboy@msn.com) wrote:
>>
>> If there are duplicate keys, it should absolutely fail.
>
> I think returning an error on receiving duplicate keys would be a sensible
> change to CouchDB, albeit a relatively minor breaking one. See below.
>
>> > > JSON is typically based on a dictionary or hash map, and there's no
>> > particular reason for that data structure to enforce uniqueness of keys.
>> >
>> > I disagree. Mathematically, a dictionary/map object is a function: it maps
>> > from a set of keys to a set of values, with each key mapping to exactly one
>> > value. (That's basically the definition of 'function'.) It's certainly
>> > possible to create implementations that map a key to _multiple_ values, but
>> > that's something different: it's a mapping from a key to a set. (For
>> > example, it's not from string-->int, it's now from string-->set.) The
>> > JSON spec does not include this kind of mapping -- an object value in JSON
>> > can be a number, but not a set of numbers.
>
> I was referring to the fact that hash tables usually have some form of collision
> detection internally, when multiple keys map to the same hash bucket. There are
> perfect hash functions & algorithms that avoid collisions, but that's getting
> a bit off track. And it's a moot point, as we all agree that duplicate keys are not
> what most people expect, including Crockford.
>
>> > IMHO the reasonable thing for a JSON parser to do if it encounters a
>> > duplicate key is to fail with a clear error. Failing that, the only other
>> > reasonable option is to discard one or the other value (I don't have an
>> > opinion which.) But keeping both is unreasonable.
>
> If somebody wants to sort this out, I'd suggest implementing the fix (in C) in
> Paul Davis' jiffy library, which has been on the list of things to import
> for a while.
>
> I added https://github.com/davisp/jiffy/issues/54 and updated
> https://issues.apache.org/jira/browse/COUCHDB-1294
>
> jiffy:decode(<< "{\"foo\":\"bar\", \"foo\":\"bar\"}" >>) should return an error,
> something like invalid_json_duplicate_name_in_object.
>
> CC'd dev@.
>
> A+
> --
> Dave Cottlehuber
> Sent from my PDP11
>
>
