Posted to dev@couchdb.apache.org by "Fedor Indutny (JIRA)" <ji...@apache.org> on 2011/02/03 17:06:29 UTC

[jira] Created: (COUCHDB-1057) Wrong JSON parser behavior on escaped unicode characters

Wrong JSON parser behavior on escaped unicode characters
--------------------------------------------------------

                 Key: COUCHDB-1057
                 URL: https://issues.apache.org/jira/browse/COUCHDB-1057
             Project: CouchDB
          Issue Type: Bug
          Components: Database Core
    Affects Versions: 1.0
         Environment: Ubuntu 10.10
                      Doesn't matter
            Reporter: Fedor Indutny


Try to save the following document to CouchDB:
{ "_id" : "json-test", "test": "\u0080-\uffff"}

Then PUT it to the database:
curl -X PUT -d @1.json --basic --user admin:admin -H "Content-Type: application/json" http://couchdb:5984/tadagraph/json-test

You'll get this error:
{"error":"bad_request","reason":"invalid UTF-8 JSON"}

jsonlint ( http://www.jsonlint.com/ ) says that it is valid JSON.
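
As a cross-check of the jsonlint result, the same body can be fed to another parser. This is a minimal sketch using Python's standard json module, which is not the parser CouchDB uses; it only shows that a lenient parser accepts the document and which code points the two escapes decode to.

```python
import json

# The document body from the report, as raw JSON text (escapes left unexpanded).
raw = r'{ "_id" : "json-test", "test": "\u0080-\uffff"}'

doc = json.loads(raw)
value = doc["test"]

# The escapes decode to three code points: U+0080, a hyphen, and U+FFFF.
code_points = [f"U+{ord(ch):04X}" for ch in value]
print(code_points)
```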

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (COUCHDB-1057) Wrong JSON parser behavior on escaped unicode characters

Posted by "Fedor Indutny (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990177#comment-12990177 ] 

Fedor Indutny commented on COUCHDB-1057:
----------------------------------------

http://www.ietf.org/rfc/rfc4627.txt

2.5.  Strings

...
 Any character may be escaped.  If the character is in the Basic
   Multilingual Plane (U+0000 through U+FFFF), then it may be
   represented as a six-character sequence: a reverse solidus, followed
   by the lowercase letter u, followed by four hexadecimal digits that
   encode the character's code point.  The hexadecimal letters A though
   F can be upper or lowercase.  So, for example, a string containing
   only a single reverse solidus character may be represented as
   "\u005C".
...

It looks like the whole range U+0000 through U+FFFF is declared valid for JSON in the RFC.
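
The RFC's own example can be checked with any parser that follows it. A small sketch with Python's standard json module (picked here only for illustration), showing that the six-character escape decodes to one backslash and that the hex letters are case-insensitive:

```python
import json

# "\u005C" is the RFC's example: six characters of JSON text, one decoded character.
upper = json.loads(r'"\u005C"')
lower = json.loads(r'"\u005c"')

assert upper == "\\"      # a single reverse solidus
assert lower == upper     # hex letters may be upper or lowercase
assert len(upper) == 1
```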



        

[jira] Closed: (COUCHDB-1057) Wrong JSON parser behavior on escaped unicode characters

Posted by "Paul Joseph Davis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Joseph Davis closed COUCHDB-1057.
--------------------------------------

    Resolution: Won't Fix

But Wikipedia says it's not:

http://en.wikipedia.org/wiki/UTF-8

Specifically, \uFFFF is an invalid code point, and I reject Crockford's crazy delusional world that says all strings in every language should be implemented as unsigned 16-bit integers.



        

[jira] Commented: (COUCHDB-1057) Wrong JSON parser behavior on escaped unicode characters

Posted by "Paul Joseph Davis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990205#comment-12990205 ] 

Paul Joseph Davis commented on COUCHDB-1057:
--------------------------------------------

Yeah, that's the part of the spec that uses a broken assumption of 16-bit integers for representing string data. Also of interest: we reject invalid surrogate pairs as well, which the spec doesn't even mention.
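
To illustrate what detecting an invalid surrogate pair might look like, here is a minimal sketch in Python. The helper name has_lone_surrogate is hypothetical and this is not CouchDB's code; it relies on Python's json module combining a well-formed high/low pair into one code point and leaving an unpaired surrogate in the decoded string.

```python
import json

def has_lone_surrogate(text: str) -> bool:
    """Return True if the decoded string contains an unpaired surrogate."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

paired = json.loads(r'"\ud83d\ude00"')   # high + low surrogate: one code point
lone = json.loads(r'"\ud83d"')           # high surrogate with no partner

print(len(paired), has_lone_surrogate(paired))
print(len(lone), has_lone_surrogate(lone))
```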



        

[jira] Commented: (COUCHDB-1057) Wrong JSON parser behavior on escaped unicode characters

Posted by "Paul Joseph Davis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990250#comment-12990250 ] 

Paul Joseph Davis commented on COUCHDB-1057:
--------------------------------------------

Also, I realized I should probably give more background on this instead of just getting irritated with that spec again.

The underlying issue is that CouchDB stores all of its JSON strings as UTF-8, which means that every code point we recognize in the input is required to be representable as UTF-8. As you see in the JSON spec, there wasn't much foresight into what constitutes a valid Unicode code point. This means that the JSON spec allows for things that aren't representable as UTF-8 via Unicode escapes.
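
The constraint can be pictured as a check run on each decoded string before it is stored. The following Python is an illustration under stated assumptions, not CouchDB's implementation: the helper storable_as_utf8 and its rule set are hypothetical. Python's own UTF-8 codec refuses unpaired surrogates but does encode noncharacters such as U+FFFF, so a validator that also rejects U+FFFF, as the error in this report suggests, has to test for it separately.

```python
def storable_as_utf8(text: str, reject_noncharacters: bool = True) -> bool:
    """Hypothetical strict check on an already-decoded JSON string."""
    try:
        # Python's codec raises for unpaired surrogates (U+D800..U+DFFF).
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    if reject_noncharacters:
        # U+FFFE and U+FFFF are Unicode noncharacters; Python encodes them,
        # so a stricter validator has to look for them explicitly.
        return not any(ord(ch) in (0xFFFE, 0xFFFF) for ch in text)
    return True

print(storable_as_utf8("\u0080"))
print(storable_as_utf8("\uffff"))
print(storable_as_utf8("\uffff", reject_noncharacters=False))
print(storable_as_utf8("\ud800"))
```

With the strict flag on, only the first call passes; turning the flag off lets U+FFFF through, which is the kind of relaxation discussed in this thread.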

When I asked about the issue on the es5-discuss list, I was actually told that JSON requires strings to be stored as 16-bit integers (which is why I'm so fond of repeating that). Yeah, I was actually told that JSON supposedly requires a specific string implementation. Seeing as how JSON is widely characterized as a ubiquitous exchange format, I promptly rejected that assertion and haven't been overly motivated to relax our enforcement of valid Unicode code points.

If someone wants to write a patch that carries invalid escapes through the system, I'd probably be OK with that, though I think we tried once and it gummed up something somewhere else.

