Posted to dev@couchdb.apache.org by dmi <lo...@yandex.ru> on 2009/04/14 00:34:10 UTC

unicode output representation

Hello All!

CouchDB now uses a modified version of mochijson2 for JSON output.
The standard behavior of this library is to accept Unicode in all forms (unicode, utf8, \uXXXX) via decode/1,
but when Unicode is emitted via encode/1 to the client app, all non-ASCII characters are converted to \uXXXX form.

This is done for maximal compatibility. But I suspect that modern software that may want to interact with CouchDB will have no problems with raw UTF-8.
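
The two output forms can be illustrated with a minimal sketch. It uses Python's standard json module as a stand-in for mochijson2 (an assumption made for illustration only; CouchDB itself is Erlang), where ensure_ascii plays roughly the role of the escaping behavior described above:

```python
import json

doc = {"greeting": "Привет"}

# Default behavior: every non-ASCII character becomes a \uXXXX escape,
# so the output is pure ASCII.
escaped = json.dumps(doc)

# With ensure_ascii=False the characters are left as-is; the text is then
# encoded to UTF-8 bytes for the wire.
raw = json.dumps(doc, ensure_ascii=False).encode("utf-8")

# A conforming decoder accepts both forms and yields the same value.
assert json.loads(escaped) == json.loads(raw.decode("utf-8")) == doc
```

Either form decodes to the same document, so the choice only affects what clients see on the wire.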

A recent version of mochiweb (r99) introduces an optional capability for mochijson2 to emit raw UTF-8.
The proposed usage is:

Encoder = mochijson2:encoder([{utf8, true}]),
JSON = Encoder(json())

I have tested this patch (in reduced form) against CouchDB and it seems to work.

I think that bringing this option to CouchDB would be a good improvement for developers of international software.

-- 
WBR,
Dmi.

Re: unicode output representation

Posted by Justin Cormack <ju...@specialbusservice.com>.
On 13 Apr 2009, at 23:39, Chris Anderson wrote:

> On Mon, Apr 13, 2009 at 3:34 PM, dmi <lo...@yandex.ru> wrote:
>> Hello All!
>>
>> CouchDB now uses a modified version of mochijson2 for JSON output.
>> The standard behavior of this library is to accept Unicode in all
>> forms (unicode, utf8, \uXXXX) via decode/1,
>> but when Unicode is emitted via encode/1 to the client app, all
>> non-ASCII characters are converted to \uXXXX form.
>>
>> This is done for maximal compatibility. But I suspect that modern
>> software that may want to interact with CouchDB will have no
>> problems with raw UTF-8.
>>
>> A recent version of mochiweb (r99) introduces an optional capability
>> for mochijson2 to emit raw UTF-8.
>> The proposed usage is:
>>
>> Encoder = mochijson2:encoder([{utf8, true}]),
>> JSON = Encoder(json())
>>
>> I have tested this patch (in reduced form) against CouchDB and it
>> seems to work.
>>
>> I think that bringing this option to CouchDB would be a good
>> improvement for developers of international software.
>>
>
> Thanks for digging in here.
>
> To avoid incompatibility with old software, we may want to either:
>
> - make this a request-time option
> - switch intelligently on some HTTP request header
>
> Any thoughts on how best to do this? Should utf8 be the default, or  
> \uXXXX?
>
> Once we have these questions answered, if you put a patch in JIRA[1]
> it's likely to be accepted.
>
> [1] http://issues.apache.org/jira/browse/COUCHDB
>

In my experience real Unicode is better than \u escapes (and shorter!). The
JSON spec (http://json.org) specifically says that you *must* accept
any Unicode character other than the \ escaped ones, and I was very
surprised to find that a lot of JSON tools produce the \u versions by
default.

Because it is part of the spec, I don't see any problem in just changing
it.
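
To put a number on "shorter", here is a small sketch (Python's json module again, used purely as an illustrative stand-in; encoded_sizes is a hypothetical helper name) that compares the byte length of the escaped and raw encodings of the same string:

```python
import json

def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (escaped_size, raw_utf8_size) in bytes for a JSON string."""
    escaped = json.dumps(text).encode("ascii")
    raw = json.dumps(text, ensure_ascii=False).encode("utf-8")
    return len(escaped), len(raw)

# Six Cyrillic letters: each is 6 bytes as \uXXXX but 2 bytes in UTF-8,
# plus 2 bytes for the surrounding quotes in both cases.
print(encoded_sizes("Привет"))  # (38, 14)

# Pure ASCII text is the same size either way.
print(encoded_sizes("abc"))     # (5, 5)
```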

>
> -- 
> Chris Anderson
> http://jchrisa.net
> http://couch.io


Re: unicode output representation

Posted by Chris Anderson <jc...@apache.org>.
On Wed, Apr 15, 2009 at 7:34 AM, Fred Bowen <fr...@gmail.com> wrote:
> +1 default output utf8
> +1 switchable output to ascii via Accept:
>
> On Wed, Apr 15, 2009 at 7:53 AM, Dirkjan Ochtman <di...@ochtman.nl> wrote:
>> On Tue, Apr 14, 2009 at 00:39, Chris Anderson <jc...@apache.org> wrote:
>>> Any thoughts on how best to do this? Should utf8 be the default, or \uXXXX?
>>
>> Since RFC 4627 says JSON SHALL be encoded in Unicode, and the default
>> as specified in the RFC is UTF-8, I think utf8 as the default is
>> the better option here. Maybe it could be switchable based on Accept:
>> application/json;charset=ascii in the request.
>

You've convinced me. If this becomes a patch and a Jira ticket (and no
one has well-reasoned objections) we can probably get it into CouchDB
0.10.0.

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: unicode output representation

Posted by Fred Bowen <fr...@gmail.com>.
+1 default output utf8
+1 switchable output to ascii via Accept:

On Wed, Apr 15, 2009 at 7:53 AM, Dirkjan Ochtman <di...@ochtman.nl> wrote:
> On Tue, Apr 14, 2009 at 00:39, Chris Anderson <jc...@apache.org> wrote:
>> Any thoughts on how best to do this? Should utf8 be the default, or \uXXXX?
>
> Since RFC 4627 says JSON SHALL be encoded in Unicode, and the default
> as specified in the RFC is UTF-8, I think utf8 as the default is
> the better option here. Maybe it could be switchable based on Accept:
> application/json;charset=ascii in the request.

Re: unicode output representation

Posted by Dirkjan Ochtman <di...@ochtman.nl>.
On Tue, Apr 14, 2009 at 00:39, Chris Anderson <jc...@apache.org> wrote:
> Any thoughts on how best to do this? Should utf8 be the default, or \uXXXX?

Since RFC 4627 says JSON SHALL be encoded in Unicode, and the default
as specified in the RFC is UTF-8, I think utf8 as the default is
the better option here. Maybe it could be switchable based on Accept:
application/json;charset=ascii in the request.
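
A rough sketch of what such a switch might look like, written in Python rather than CouchDB's Erlang and assuming the charset=ascii parameter proposed above. The helper names wants_ascii and encode_body are hypothetical, and the header parsing is deliberately simplistic (no quoted parameter values, no q-values):

```python
import json

def wants_ascii(accept_header: str) -> bool:
    """True if an application/json media range carries charset=ascii."""
    for media_range in accept_header.split(","):
        parts = [p.strip().lower() for p in media_range.split(";")]
        if parts[0] != "application/json":
            continue
        if "charset=ascii" in (p.replace(" ", "") for p in parts[1:]):
            return True
    return False

def encode_body(value, accept_header: str = "") -> bytes:
    """Emit raw UTF-8 by default, \\uXXXX escapes when ASCII is requested."""
    ascii_only = wants_ascii(accept_header)
    text = json.dumps(value, ensure_ascii=ascii_only)
    return text.encode("ascii" if ascii_only else "utf-8")
```

With this shape, clients that send no special header get UTF-8, and older clients can opt back into the escaped form per request.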

Cheers,

Dirkjan

Re: unicode output representation

Posted by Chris Anderson <jc...@apache.org>.
On Mon, Apr 13, 2009 at 3:34 PM, dmi <lo...@yandex.ru> wrote:
> Hello All!
>
> CouchDB now uses a modified version of mochijson2 for JSON output.
> The standard behavior of this library is to accept Unicode in all forms (unicode, utf8, \uXXXX) via decode/1,
> but when Unicode is emitted via encode/1 to the client app, all non-ASCII characters are converted to \uXXXX form.
>
> This is done for maximal compatibility. But I suspect that modern software that may want to interact with CouchDB will have no problems with raw UTF-8.
>
> A recent version of mochiweb (r99) introduces an optional capability for mochijson2 to emit raw UTF-8.
> The proposed usage is:
>
> Encoder = mochijson2:encoder([{utf8, true}]),
> JSON = Encoder(json())
>
> I have tested this patch (in reduced form) against CouchDB and it seems to work.
>
> I think that bringing this option to CouchDB would be a good improvement for developers of international software.
>

Thanks for digging in here.

To avoid incompatibility with old software, we may want to either:

- make this a request-time option
- switch intelligently on some HTTP request header

Any thoughts on how best to do this? Should utf8 be the default, or \uXXXX?
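
For the request-time option, one possible shape is a query parameter. The sketch below is in Python for illustration only; the parameter name utf8 and the function encode_for_request are hypothetical and not an existing CouchDB option:

```python
import json
from urllib.parse import parse_qs, urlsplit

def encode_for_request(value, url: str, default_utf8: bool = True) -> bytes:
    """Pick raw UTF-8 or \\uXXXX output from a hypothetical ?utf8= parameter."""
    query = parse_qs(urlsplit(url).query)
    flag = query.get("utf8", [None])[-1]
    if flag is None:
        use_utf8 = default_utf8          # no parameter: fall back to the default
    else:
        use_utf8 = flag.lower() == "true"
    text = json.dumps(value, ensure_ascii=not use_utf8)
    return text.encode("utf-8")
```

The default_utf8 argument makes the open question explicit: whichever default is chosen, the other form stays reachable per request.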

Once we have these questions answered, if you put a patch in JIRA[1]
it's likely to be accepted.

[1] http://issues.apache.org/jira/browse/COUCHDB


-- 
Chris Anderson
http://jchrisa.net
http://couch.io