Posted to user@couchdb.apache.org by N/A N/A <pr...@yahoo.com> on 2011/04/28 11:24:43 UTC

CouchDB View Unicode Document

Hello,

I have a bunch of documents in my DB and I am using views so I can get more
useful data for my application.
The documents can contain non-ASCII characters. When I perform a GET on some of
those documents, the strings are returned in \uFFFF format.
What am I doing wrong? I know CouchDB has Unicode support. Am I missing
something?

regards,

Re: CouchDB View Unicode Document

Posted by Noah Diewald <no...@gmail.com>.
On Thu, Apr 28, 2011 at 5:19 PM, Paul Davis <pa...@gmail.com> wrote:
> On Thu, Apr 28, 2011 at 5:57 PM, Noah Diewald <no...@gmail.com> wrote:
>>> Can someone paste some actual input/output pairs so I have a clue
>>> what's going on?
>>>
>>> Theoretically \uFFFF isn't a valid escape sequence, last I checked
>>> (don't get me started on RFC 4627 idiocy).
>>>
>>> The JSON encoder will by default escape anything that is not printable
>>> ASCII. The few special-cased characters mentioned in the JSON spec are
>>> backslash escaped (\t, \n, \", etc.) while all other characters are
>>> escaped as \uHHHH sequences.
>>
>> What you're describing is what I'm seeing. I don't think it is a bug,
>> just something I don't like because it isn't taking advantage of the
>> benefits of Unicode. I'd rather see the characters themselves instead
>> of \uHHHH sequences. For instance, I get "\u00e9" for "é". I gather the
>> JSON spec says that any character can be escaped, but characters in the
>> basic multilingual plane don't need to be when the string is UTF-8. I
>> feel that the benefit of UTF-8 is supposed to be that escaping these
>> characters isn't necessary and that they appear in an easily
>> human-readable form. From what you said above, I'm not experiencing
>> anything unexpected, but I can supply some input and output if needed.
>>
>> --
>> Noah Diewald
>> noah.diewald.me
>> noahsarchive.net
>>
>
> You are exactly correct. I think the general motivation for escaping
> UTF-8 is to make it easier for the JSON to pass through broken
> implementations that don't pay attention to possible UTF-8 in string
> data. It's possible to make that sort of thing configurable, but that
> would entail quite a bit of consideration on a couple of different
> fronts.
>

Yes, that makes sense. We do not live in a perfect world. It would be
cool if sending "Accept-Charset: utf-8" altered the behavior and let
the characters through unescaped, but I can see how this wouldn't be a
high priority since the current behavior is simple and works for
everyone.
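
In the meantime, a client can get readable output on its own by parsing
the response and re-serializing it without ASCII escaping. A minimal
sketch with Python's json module (the document body here is made up;
any JSON library with an equivalent option works):

    import json

    raw = '{"_id": "mydoc", "name": "caf\\u00e9"}'  # escaped body, as returned today

    doc = json.loads(raw)
    print(doc["name"])                          # café
    print(json.dumps(doc, ensure_ascii=False))  # {"_id": "mydoc", "name": "café"}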

-- 
Noah Diewald
noah.diewald.me
noahsarchive.net

Re: CouchDB View Unicode Document

Posted by Paul Davis <pa...@gmail.com>.
On Thu, Apr 28, 2011 at 5:57 PM, Noah Diewald <no...@gmail.com> wrote:
>> Can someone paste some actual input/output pairs so I have a clue
>> what's going on?
>>
>> Theoretically \uFFFF isn't a valid escape sequence, last I checked
>> (don't get me started on RFC 4627 idiocy).
>>
>> The JSON encoder will by default escape anything that is not printable
>> ASCII. The few special-cased characters mentioned in the JSON spec are
>> backslash escaped (\t, \n, \", etc.) while all other characters are
>> escaped as \uHHHH sequences.
>
> What you're describing is what I'm seeing. I don't think it is a bug,
> just something I don't like because it isn't taking advantage of the
> benefits of Unicode. I'd rather see the characters themselves instead
> of \uHHHH sequences. For instance, I get "\u00e9" for "é". I gather the
> JSON spec says that any character can be escaped, but characters in the
> basic multilingual plane don't need to be when the string is UTF-8. I
> feel that the benefit of UTF-8 is supposed to be that escaping these
> characters isn't necessary and that they appear in an easily
> human-readable form. From what you said above, I'm not experiencing
> anything unexpected, but I can supply some input and output if needed.
>
> --
> Noah Diewald
> noah.diewald.me
> noahsarchive.net
>

You are exactly correct. I think the general motivation for escaping
UTF-8 is to make it easier for the JSON to pass through broken
implementations that don't pay attention to possible UTF-8 in string
data. It's possible to make that sort of thing configurable, but that
would entail quite a bit of consideration on a couple of different
fronts.

Re: CouchDB View Unicode Document

Posted by Noah Diewald <no...@gmail.com>.
> Can someone paste some actual input/output pairs so I have a clue
> what's going on?
>
> Theoretically \uFFFF isn't a valid escape sequence, last I checked
> (don't get me started on RFC 4627 idiocy).
>
> The JSON encoder will by default escape anything that is not printable
> ASCII. The few special-cased characters mentioned in the JSON spec are
> backslash escaped (\t, \n, \", etc.) while all other characters are
> escaped as \uHHHH sequences.

What you're describing is what I'm seeing. I don't think it is a bug,
just something I don't like because it isn't taking advantage of the
benefits of Unicode. I'd rather see the characters themselves instead
of \uHHHH sequences. For instance, I get "\u00e9" for "é". I gather the
JSON spec says that any character can be escaped, but characters in the
basic multilingual plane don't need to be when the string is UTF-8. I
feel that the benefit of UTF-8 is supposed to be that escaping these
characters isn't necessary and that they appear in an easily
human-readable form. From what you said above, I'm not experiencing
anything unexpected, but I can supply some input and output if needed.
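
In Python terms, the difference I am talking about is just the
encoder's escaping option (the json module here is only a stand-in for
CouchDB's encoder, not the real thing):

    import json

    print(json.dumps("é"))                      # "\u00e9"  (the escaped form I get back)
    print(json.dumps("é", ensure_ascii=False))  # "é"       (the readable form I would prefer)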

-- 
Noah Diewald
noah.diewald.me
noahsarchive.net

Re: CouchDB View Unicode Document

Posted by Paul Davis <pa...@gmail.com>.
On Thu, Apr 28, 2011 at 3:30 PM, Noah Diewald <no...@gmail.com> wrote:
> From what I understand, JavaScript shouldn't need characters in the
> basic multilingual plane to be escaped, so it is strange that strings
> containing characters in that range come back escaped. I think there is
> something wrong with that. I mean, why not escape every character, if
> you need to decode the JSON just to read the strings anyway?
>

Can someone paste some actual input/output pairs so I have a clue
what's going on?

Theoretically \uFFFF isn't a valid escape sequence, last I checked
(don't get me started on RFC 4627 idiocy).

The JSON encoder will by default escape anything that is not printable
ASCII. The few special-cased characters mentioned in the JSON spec are
backslash escaped (\t, \n, \", etc.) while all other characters are
escaped as \uHHHH sequences.
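
As a rough input/output illustration, Python's json module escapes the
same way by default (it is only a stand-in here, not CouchDB's actual
encoder):

    import json

    print(json.dumps('tab:\t quote:" e-acute:é'))
    # "tab:\t quote:\" e-acute:\u00e9"
    #
    # Special-cased characters get backslash escapes; everything outside
    # printable ASCII becomes a \uHHHH sequence.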

Re: CouchDB View Unicode Document

Posted by Noah Diewald <no...@gmail.com>.
From what I understand, JavaScript shouldn't need characters in the
basic multilingual plane to be escaped, so it is strange that strings
containing characters in that range come back escaped. I think there is
something wrong with that. I mean, why not escape every character, if
you need to decode the JSON just to read the strings anyway?
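
Both encodings are valid JSON for the same string, which is easy to
check with, say, Python's json module (picked here only as an example
of a conforming parser):

    import json

    # An escaped and an unescaped encoding of the same one-character string.
    assert json.loads('"\\u00e9"') == json.loads('"é"') == "é"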

On Thu, Apr 28, 2011 at 5:49 AM, Nils Breunese <N....@vpro.nl> wrote:
> N/A N/A wrote:
>
>> I have a bunch of documents in my DB and I am using views so I can get more
>> useful data for my application.
>> The documents can contain non-ASCII characters. When I perform a GET on some
>> of those documents, the strings are returned in \uFFFF format.
>> What am I doing wrong?
>
> Nothing AFAIK.
>
>> I know CouchDB has Unicode support. Am I missing something?
>
> Maybe the fact that nothing is wrong here? Parsing that JSON should just yield the original data you put into your database, and at least for our applications that's working just fine.
>
> Nils.
> ------------------------------------------------------------------------
>  VPRO   www.vpro.nl
> ------------------------------------------------------------------------
>



-- 
Noah Diewald
noah.diewald.me
noahsarchive.net

Re: CouchDB View Unicode Document

Posted by Nils Breunese <N....@vpro.nl>.
N/A N/A wrote:

> I have a bunch of documents in my DB and I am using views so I can get more
> useful data for my application.
> The documents can contain non-ASCII characters. When I perform a GET on some
> of those documents, the strings are returned in \uFFFF format.
> What am I doing wrong?

Nothing AFAIK.

> I know CouchDB has Unicode support. Am I missing something?

Maybe the fact that nothing is wrong here? Parsing that JSON should just yield the original data you put into your database, and at least for our applications that's working just fine.
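
For example, with Python's json module (any conforming JSON parser
behaves the same way; the value here is made up):

    import json

    escaped = '"caf\\u00e9"'    # a JSON string the way CouchDB returns it
    print(json.loads(escaped))  # café  (parsing yields the original text)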

Nils.
------------------------------------------------------------------------
 VPRO   www.vpro.nl
------------------------------------------------------------------------

Re: CouchDB View Unicode Document

Posted by N/A N/A <pr...@yahoo.com>.
Hi,

I think the problem is in my TV

regards,



________________________________
From: N/A N/A <pr...@yahoo.com>
To: user@couchdb.apache.org
Sent: Thu, April 28, 2011 12:24:43 PM
Subject: CouchDB View Unicode Document

Hello,

I have a bunch of documents in my DB and I am using views so I can get more
useful data for my application.
The documents can contain non-ASCII characters. When I perform a GET on some of
those documents, the strings are returned in \uFFFF format.
What am I doing wrong? I know CouchDB has Unicode support. Am I missing
something?

regards,