Posted to user@couchdb.apache.org by MK <mk...@cognitivedissonance.ca> on 2011/06/08 18:32:00 UTC

when will utf8 handling be fixed?

Is there any intention to fix couch's handling of "unusual" unicode
characters?  One of the "unusual" characters is the right single quote
(226,128,153) which is a valid utf8 character and also not very
"unusual" IMO.

I have an interface which allows users to add and edit text in a db
document (again, not very unusual) and this one came up because of
someone cutting and pasting some text from a source which used the
right single quote as an apostrophe (which is just plain common -- in
fact they are used in the online "Definitive Guide").

So I am having to maintain a switch statement which filters out these
characters and replaces them with html entities before they get sent
to couch, which is okay in my case since the documents are just being
used as html pages anyway.
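
Roughly, the filter is something like this (a sketch, not my exact code --
the function name is just for illustration):

function escapeUnusual(str) {
    // walk the string and swap "unusual" characters for html entities
    var rv = "";
    for (var i = 0; i < str.length; i++) {
        var c = str.charAt(i);
        switch (c) {
            case "\u2019": rv += "&rsquo;"; break;  // right single quote
            case "\u2018": rv += "&lsquo;"; break;  // left single quote
            case "\u201c": rv += "&ldquo;"; break;  // left double quote
            case "\u201d": rv += "&rdquo;"; break;  // right double quote
            default: rv += c;
        }
    }
    return rv;
}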

But it's an awkward and unnecessary solution: individual
developers should not have to be dealing with this; proper utf8
handling should be hard-coded into couch.  For one thing, it means that
anyone worried about such "unusual" possibilities cannot use
couchapp or couch directly -- data has to be filtered first server side.
Although spidermonkey handles utf8 fine, depending on client side
filtering is not always an alternative. 

Sincerely, MK

-- 
"Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
"The angel of history[...]is turned toward the past." (Walter Benjamin)


Re: when will utf8 handling be fixed?

Posted by Dave Cottlehuber <da...@muse.net.nz>.
Thanks Jim,

nice tip which I was not aware of!

A+
Dave

On 9 June 2011 07:28, Jim Klo <ji...@sri.com> wrote:
> One problem that often bites me - someone forgets to include the UTF-8
> charset in the Content-Type header.  Missing that can often mangle the
> handling of high byte characters.
> When setting your Content-Type with curl this is often done something like:
> curl -H "Content-Type: application/json; charset=utf-8" ....
> Jim Klo
> Senior Software Engineer
> Center for Software Engineering
> SRI International
>
>
>
> On Jun 8, 2011, at 9:35 AM, Paul Davis wrote:
>
>> On Wed, Jun 8, 2011 at 12:32 PM, MK <mk...@cognitivedissonance.ca> wrote:
>>> Is there any intention to fix couch's handling of "unusual" unicode
>>> characters?  One of the "unusual" characters is the right single quote
>>> (226,128,153) which is a valid utf8 character and also not very
>>> "unusual" IMO.
>>>
>>> I have an interface which allows users to add and edit text in a db
>>> document (again, not very unusual) and this one came up because of
>>> someone cutting and pasting some text from a source which used the
>>> right single quote as an apostrophe (which is just plain common -- in
>>> fact they are used in the online "Definitive Guide").
>>>
>>> So I am having to maintain a switch statement which filters out these
>>> characters and replaces them with html entities before they get sent
>>> to couch, which is okay in my case since the documents are just being
>>> used as html pages anyway.
>>>
>>> But it's an awkward and unnecessary solution: individual
>>> developers should not have to be dealing with this; proper utf8
>>> handling should be hard-coded into couch.  For one thing, it means that
>>> anyone worried about such "unusual" possibilities cannot use
>>> couchapp or couch directly -- data has to be filtered first server side.
>>> Although spidermonkey handles utf8 fine, depending on client side
>>> filtering is not always an alternative.
>>>
>>> Sincerely, MK
>>>
>>> --
>>> "Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
>>> "The angel of history[...]is turned toward the past." (Walter Benjamin)
>>
>> What version of CouchDB are you using and what does an actual request
>> look like?
>>
>> A recent check on trunk shows both decoders handle your case fine:
>>
>> 1> mochijson2:decode(<<"\"", 226,128,153, "\"">>).
>> <<226,128,153>>
>> 2> ejson:decode(<<"\"", 226,128,153, "\"">>).
>> <<226,128,153>>
>> 3>

Re: when will utf8 handling be fixed?

Posted by MK <mk...@cognitivedissonance.ca>.
On Wed, 08 Jun 2011 12:28:03 -0700
Jim Klo <ji...@sri.com> wrote:

> One problem that often bites me - someone forgets to include the
> UTF-8 charset in the Content-Type header.  Missing that can often
> mangle the handling of high byte characters.
> 
> When setting your Content-Type with curl this is often done something
> like:
> 
> curl -H "Content-Type: application/json; charset=utf-8" .... 

As mentioned in my other follow-up, I could not replicate the problem
via curl -- even without Content-type set, couch was okay with
multi-byte characters.

BUT: even with the content-type set correctly in the node request, couch
was not okay (wtf?).

Watching the two transfers in wireshark, I do not see a difference.
But couch's stdout dump is missing the last few bytes. 

So I am completely stumped.

-- 
"Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
"The angel of history[...]is turned toward the past." (Walter Benjamin)


Re: when will utf8 handling be fixed?

Posted by Jim Klo <ji...@sri.com>.
One problem that often bites me - someone forgets to include the UTF-8 charset in the Content-Type header.  Missing that can often mangle the handling of high byte characters.

When setting your Content-Type with curl this is often done something like:

curl -H "Content-Type: application/json; charset=utf-8" .... 

Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International




On Jun 8, 2011, at 9:35 AM, Paul Davis wrote:

> On Wed, Jun 8, 2011 at 12:32 PM, MK <mk...@cognitivedissonance.ca> wrote:
>> Is there any intention to fix couch's handling of "unusual" unicode
>> characters?  One of the "unusual" characters is the right single quote
>> (226,128,153) which is a valid utf8 character and also not very
>> "unusual" IMO.
>> 
>> I have an interface which allows users to add and edit text in a db
>> document (again, not very unusual) and this one came up because of
>> someone cutting and pasting some text from a source which used the
>> right single quote as an apostrophe (which is just plain common -- in
>> fact they are used in the online "Definitive Guide").
>> 
>> So I am having to maintain a switch statement which filters out these
>> characters and replaces them with html entities before they get sent
>> to couch, which is okay in my case since the documents are just being
>> used as html pages anyway.
>> 
>> But it's an awkward and unnecessary solution: individual
>> developers should not have to be dealing with this; proper utf8
>> handling should be hard-coded into couch.  For one thing, it means that
>> anyone worried about such "unusual" possibilities cannot use
>> couchapp or couch directly -- data has to be filtered first server side.
>> Although spidermonkey handles utf8 fine, depending on client side
>> filtering is not always an alternative.
>> 
>> Sincerely, MK
>> 
>> --
>> "Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
>> "The angel of history[...]is turned toward the past." (Walter Benjamin)
>> 
>> 
> 
> What version of CouchDB are you using and what does an actual request look like?
> 
> A recent check on trunk shows both decoders handle your case fine:
> 
> 1> mochijson2:decode(<<"\"", 226,128,153, "\"">>).
> <<226,128,153>>
> 2> ejson:decode(<<"\"", 226,128,153, "\"">>).
> <<226,128,153>>
> 3>


Re: when will utf8 handling be fixed?

Posted by Mark Hahn <ma...@boutiquing.com>.
I had the exact same problem, right down to staring at missing bytes in
wireshark.  I solved it with a function that gives the actual byte count.
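
In node, something like Buffer.byteLength does the trick -- it counts UTF-8
bytes rather than characters:

// the right single quote is one character but three bytes in UTF-8
var s = "it\u2019s";
console.log(s.length);                       // 4 characters
console.log(Buffer.byteLength(s, "utf8"));   // 6 bytes -- what Content-Length needs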

On Wed, Jun 8, 2011 at 3:28 PM, MK <mk...@cognitivedissonance.ca> wrote:

> On Wed, 8 Jun 2011 16:26:52 -0400
> "Mark J. Reed" <ma...@gmail.com> wrote:
>
> > The content-length is bytes.  Sounds like your client is sending a
> > character count instead.
>
> Palm->face.
>
> This crossed my mind, but then I had a bit of an ADHD moment when
> going over the code:
>
>        options.headers = {
>                "Content-length": data.length,
>
> ...and now cannot remember why I felt that necessary in the first
> place.  Hopefully the world will remind me ASAP and I can get all
> indignant on the mail list again.
>
> Best wishes, MK
>
> --
> "Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
> "The angel of history[...]is turned toward the past." (Walter Benjamin)
>
>


-- 
Mark Hahn
Website Manager
mark@boutiquing.com
949-229-1012

Re: when will utf8 handling be fixed?

Posted by MK <mk...@cognitivedissonance.ca>.
On Wed, 8 Jun 2011 16:26:52 -0400
"Mark J. Reed" <ma...@gmail.com> wrote:

> The content-length is bytes.  Sounds like your client is sending a
> character count instead.

Palm->face.

This crossed my mind, but then I had a bit of an ADHD moment when
going over the code:

	options.headers = { 
		"Content-length": data.length,

...and now cannot remember why I felt that necessary in the first
place.  Hopefully the world will remind me ASAP and I can get all
indignant on the mail list again. 
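
For the record, the fix is presumably just to send the byte count there
instead, something like:

options.headers = {
    // byte length of the utf8 body, not the js character count
    "Content-length": Buffer.byteLength(data, "utf8"),
    // ...other headers as before
};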

Best wishes, MK

-- 
"Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
"The angel of history[...]is turned toward the past." (Walter Benjamin)


Re: when will utf8 handling be fixed?

Posted by "Mark J. Reed" <ma...@gmail.com>.
The content-length is bytes.  Sounds like your client is sending a
character count instead.
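
That would also explain why swapping \u2019 for &rsquo; made the problem go
away: the entity is plain ASCII, so the character count happens to equal the
byte count. For example:

console.log("\u2019".length);                        // 1 character
console.log(Buffer.byteLength("\u2019", "utf8"));    // 3 bytes -- Content-Length ends up 2 short
console.log("&rsquo;".length);                       // 7 characters
console.log(Buffer.byteLength("&rsquo;", "utf8"));   // 7 bytes -- counts match, nothing truncated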

On Wednesday, June 8, 2011, MK <mk...@cognitivedissonance.ca> wrote:
> On Wed, 8 Jun 2011 12:35:57 -0400
> Paul Davis <pa...@gmail.com> wrote:
>> On Wed, Jun 8, 2011 at 12:32 PM, MK <mk...@cognitivedissonance.ca> wrote:
>> > Is there any intention to fix couch's handling of "unusual" unicode
>> > characters?  One of the "unusual" characters is the right single
>> > quote (226,128,153) which is a valid utf8 character and also not
>> > very "unusual" IMO.
>
>> What version of CouchDB are you using and what does an actual request
>> look like?
>
> 1.0.2 built a few weeks ago.
>
> I tried to replicate this simply, using a curl PUT and a copy of the
> request dumped from node; that works okay.  I.e., yep, couch deals with
> the multi-byte character, and it is in the stdout csv decimal dump.
>
> So I took the csv decimal dump from couch in debug mode, turned it back
> into bytes, and diff'd it with the request.
>
> The difference: the last couple of bytes, such as the closing }, are
> missing from the couch csv dump, which makes the json invalid.  Otherwise it
> is identical to the curl request, which goes through.
>
> Watching the transfer in wireshark, however, I can see that couch does
> receive those last few bytes, so *it was not truncated by me or node*.
>
> Go figure.
>
>> A recent check on trunk shows both decoders handle your case fine:
>
> I have no idea what decoders you are referring to.   Anyway, for
> posterity, here's the issue:
>
> - Client sends utf8 data to node.
> - Node passes data on to couch via http (Content-type is
> application/x-www-form-urlencoded, identical to that used by curl).
> - Couch rejects data with multi-byte character, csv decimal dump is
> missing bytes that were in the transmission.
>
> But even to me this sounds dubious, considering an identical request
> from curl is fine...all I can say is that what makes a difference is a
> switch with this in node:
>
> case "\u2019": rv += "&rsquo;";
>
> That's the last thing I do before the PUT.  If I leave the multi-byte
> in, there's an issue.
>
> MK
>
> --
> "Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
> "The angel of history[...]is turned toward the past." (Walter Benjamin)
>
>

-- 
Mark J. Reed <ma...@gmail.com>

Re: when will utf8 handling be fixed?

Posted by MK <mk...@cognitivedissonance.ca>.
On Wed, 8 Jun 2011 12:35:57 -0400
Paul Davis <pa...@gmail.com> wrote:
> On Wed, Jun 8, 2011 at 12:32 PM, MK <mk...@cognitivedissonance.ca> wrote:
> > Is there any intention to fix couch's handling of "unusual" unicode
> > characters?  One of the "unusual" characters is the right single
> > quote (226,128,153) which is a valid utf8 character and also not
> > very "unusual" IMO.

> What version of CouchDB are you using and what does an actual request
> look like?

1.0.2 built a few weeks ago.   

I tried to replicate this simply, using a curl PUT and a copy of the
request dumped from node; that works okay.  I.e., yep, couch deals with
the multi-byte character, and it is in the stdout csv decimal dump.

So I took the csv decimal dump from couch in debug mode, turned it back
into bytes, and diff'd it with the request.

The difference: the last couple of bytes, such as the closing }, are
missing from the couch csv dump, which makes the json invalid.  Otherwise it
is identical to the curl request, which goes through.

Watching the transfer in wireshark, however, I can see that couch does
receive those last few bytes, so *it was not truncated by me or node*.

Go figure.

> A recent check on trunk shows both decoders handle your case fine:

I have no idea what decoders you are referring to.   Anyway, for
posterity, here's the issue:

- Client sends utf8 data to node.
- Node passes data on to couch via http (Content-type is
application/x-www-form-urlencoded, identical to that used by curl).
- Couch rejects data with multi-byte character, csv decimal dump is
missing bytes that were in the transmission.

But even to me this sounds dubious, considering an identical request
from curl is fine...all I can say is that what makes a difference is a
switch with this in node:

case "\u2019": rv += "&rsquo;"; 

That's the last thing I do before the PUT.  If I leave the multi-byte
in, there's an issue.

MK

-- 
"Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
"The angel of history[...]is turned toward the past." (Walter Benjamin)


Re: when will utf8 handling be fixed?

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Jun 8, 2011 at 12:32 PM, MK <mk...@cognitivedissonance.ca> wrote:
> Is there any intention to fix couch's handling of "unusual" unicode
> characters?  One of the "unusual" characters is the right single quote
> (226,128,153) which is a valid utf8 character and also not very
> "unusual" IMO.
>
> I have an interface which allows users to add and edit text in a db
> document (again, not very unusual) and this one came up because of
> someone cutting and pasting some text from a source which used the
> right single quote as an apostrophe (which is just plain common -- in
> fact they are used in the online "Definitive Guide").
>
> So I am having to maintain a switch statement which filters out these
> characters and replaces them with html entities before they get sent
> to couch, which is okay in my case since the documents are just being
> used as html pages anyway.
>
> But it's an awkward and unnecessary solution: individual
> developers should not have to be dealing with this; proper utf8
> handling should be hard-coded into couch.  For one thing, it means that
> anyone worried about such "unusual" possibilities cannot use
> couchapp or couch directly -- data has to be filtered first server side.
> Although spidermonkey handles utf8 fine, depending on client side
> filtering is not always an alternative.
>
> Sincerely, MK
>
> --
> "Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
> "The angel of history[...]is turned toward the past." (Walter Benjamin)
>
>

What version of CouchDB are you using and what does an actual request look like?

A recent check on trunk shows both decoders handle your case fine:

1> mochijson2:decode(<<"\"", 226,128,153, "\"">>).
<<226,128,153>>
2> ejson:decode(<<"\"", 226,128,153, "\"">>).
<<226,128,153>>
3>