Posted to dev@couchdb.apache.org by "mark (JIRA)" <ji...@apache.org> on 2009/04/25 19:00:30 UTC

[jira] Created: (COUCHDB-333) Json handling of UTF8 strings not in accordance with rfc4627

Json handling of UTF8 strings not in accordance with rfc4627
------------------------------------------------------------

                 Key: COUCHDB-333
                 URL: https://issues.apache.org/jira/browse/COUCHDB-333
             Project: CouchDB
          Issue Type: Bug
          Components: Database Core
    Affects Versions: 0.9
         Environment: couchdb 0.9.0  spidermonkey 0.7.0 erlang R12B3
            Reporter: mark


Handling of some Unicode values escaped in the JSON \uXXXX format fails with an "invalid_json" error.

curl -X PUT -d '{"revisions":[],"_id":"U_1d11e","codepoint":"3441","definition":"\uD834\uDD1E G clef character"}' http://localhost:5984/mydb/U_1d11e

yields

{"error":"invalid_json","reason":"{\"revisions\":[],\"_id\":\"U_1d11e\",\"codepoint\":\"3441\",\"definition\":\"\\uD834\\uDD1E G clef character\"}"}

However, the RFC states:
   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

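The arithmetic behind the RFC's example can be sketched as follows. This is only an illustration in Python, not CouchDB or mochiweb code, and the function names are made up for the example.

```python
def to_surrogate_pair(codepoint):
    """Split a code point outside the BMP into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = codepoint - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def json_escape(codepoint):
    """Render the twelve-character JSON escape for a non-BMP code point."""
    high, low = to_surrogate_pair(codepoint)
    return "\\u%04X\\u%04X" % (high, low)


print(json_escape(0x1D11E))  # prints \uD834\uDD1E
```

A decoder that meets a high surrogate escape has to read the following \uXXXX escape as well and combine the two, which is what from_surrogate_pair does.
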
Furthermore, CouchDB accepts escaped strings of the form \uXXXXXXXX, which the JSON RFC does not mention as acceptable:

curl -X PUT -d '{"revisions":[],"_id":"U_1d11e","codepoint":"3441","definition":"\u0001D11E G clef character"}' http://localhost:5984/mydb/U_1d11e
Yields:
{"ok":true,"id":"U_1d11e","rev":"1-1270273433"}
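
For comparison, here is how one other parser reads the two strings. This uses Python's standard json module purely as a reference point; it says nothing about what CouchDB stores. In the RFC 4627 grammar a \u escape takes exactly four hex digits, so a parser following it would read \u0001D11E as U+0001 followed by the literal text "D11E" rather than as a single eight-digit escape.

```python
import json

# Surrogate pair form from the RFC: decodes to the single character U+1D11E.
pair = json.loads('"\\uD834\\uDD1E"')
assert pair == "\U0001D11E"
assert len(pair) == 1

# Eight-digit form: \u0001 is consumed as one escape, "D11E" stays literal.
long_form = json.loads('"\\u0001D11E"')
assert long_form == "\x01D11E"
assert len(long_form) == 5

print(repr(pair), repr(long_form))
```

If the document accepted above comes back with a control character followed by "D11E", the second request was parsed this way and did not store the G clef.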


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (COUCHDB-333) Json handling of UTF8 strings not in accordance with rfc4627

Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Kocoloski updated COUCHDB-333:
-----------------------------------

    Attachment: utf16-surrogate-pairs.diff

Here's the patch I submitted upstream.



[jira] Commented: (COUCHDB-333) Json handling of UTF8 strings not in accordance with rfc4627

Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716640#action_12716640 ] 

Adam Kocoloski commented on COUCHDB-333:
----------------------------------------

I submitted a patch to the upstream mochiweb issue in addition to the one that was already there.  This problem actually breaks replication, see

https://issues.apache.org/jira/browse/COUCHDB-327



[jira] Commented: (COUCHDB-333) Json handling of UTF8 strings not in accordance with rfc4627

Posted by "Damien Katz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716720#action_12716720 ] 

Damien Katz commented on COUCHDB-333:
-------------------------------------

I think we should add this to our mochiweb src until it's patched upstream. I'd even like to see it in 0.9.1, since this problem can break replication.



[jira] Resolved: (COUCHDB-333) Json handling of UTF8 strings not in accordance with rfc4627

Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Kocoloski resolved COUCHDB-333.
------------------------------------

    Resolution: Fixed

Patch applied to trunk (r 782643) and 0.9.x (r 782645).



[jira] Commented: (COUCHDB-333) Json handling of UTF8 strings not in accordance with rfc4627

Posted by "mark (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705889#action_12705889 ] 

mark commented on COUCHDB-333:
------------------------------

Issue has been reported upstream to mochiweb.

http://code.google.com/p/mochiweb/issues/detail?id=35
