Posted to user@abdera.apache.org by Brian Moseley <bc...@osafoundation.org> on 2007/09/08 03:31:53 UTC

more fun with character encodings

i'm running into an issue similar to the one discussed earlier this week
with regard to problem data.

as was mentioned earlier, it turns out that the os x native character
encoding is MacRoman. well, it appears that even though both my mysql
database and my jdbc connection are configured to use utf8, at some
point the data taken from the db and inserted into an atom feed is
turning up in MacRoman, even though the ResponseContext's content type
is set to "application/atom+xml; charset=UTF-8".

from my re-reading of the various recent threads and my examining of
the code in the 0.3.0 branch, it seems like the value i set for an
entry's title (for instance) should be converted into utf8 while the
entry is being serialized. but it's clearly not. when i look at the
feed as it's fetched from my server by curl, in Terminal.app, the
non-ascii character in the entry title is rendered using what i like
to call the "wtf" glyph rather than the one that represents the actual
character in question. and when i run the feed through the
validome.org validator, it complains about this character being an
invalid utf8 character.
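
For what it's worth, both symptoms (the replacement glyph in the terminal and the validator's "invalid utf8" complaint) are what one would expect if a MacRoman-encoded byte ended up in a stream labelled UTF-8. The sketch below only illustrates that mechanism and is not taken from the server code: the class name `WtfGlyphDemo` and the sample character ë are made up for illustration, and 0x91 is simply the byte MacRoman uses for ë.

```java
import java.nio.charset.StandardCharsets;

class WtfGlyphDemo {

  // Decodes raw bytes the way a client that trusts "charset=UTF-8" would.
  static String readAsUtf8(byte[] wireBytes) {
    return new String(wireBytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] utf8 = {(byte) 0xC3, (byte) 0xAB};  // ë encoded as UTF-8
    byte[] macRoman = {(byte) 0x91};           // ë encoded as MacRoman

    // Print code points rather than glyphs so the terminal's own
    // encoding cannot interfere with what we see.
    System.out.printf("UTF-8 bytes    -> U+%04X%n", (int) readAsUtf8(utf8).charAt(0));
    System.out.printf("MacRoman bytes -> U+%04X%n", (int) readAsUtf8(macRoman).charAt(0));
  }
}
```

The first line reports U+00EB (ë). The second reports U+FFFD, the Unicode replacement character, because a lone 0x91 is not a valid UTF-8 sequence.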

when i run the server and database on linux and get a non-ascii
character into the database, viewing the corresponding entry document
in Terminal.app shows me the expected character, not the wtf one.

i've run through all of my code looking for places where we might be
instantiating a Reader without specifying an encoding, but i can't
find any. i'm using the 0.3.0-incubating jars that i deployed earlier
today into the people.apache.org/m2-incubating-repository which
contain the recent default encoding fixes. so i'm at a loss as to what
could be going on. i feel like i'm missing something basic with regard
to character encodings. any pointers?

for reference, here's a url for the entry document as served by os x.
notice that the final characters of the title and summary are both the
wtf character.

http://bcm.osafoundation.org:8080/chandler/atom/item/a05e2870-5cce-11dc-f4b0-84f152603f14?ticket=fnwrt8htw1

and here is what happens when i plug that url into validome's atom validator:

http://www.validome.org/rss-atom/validate?lang=en&url=http://bcm.osafoundation.org:8080/chandler/atom/item/a05e2870-5cce-11dc-f4b0-84f152603f14%3fticket=fnwrt8htw1&version=atom_1_0

thanks!

Re: more fun with character encodings

Posted by James M Snell <ja...@gmail.com>.
Character encodings are evil.  The conversion to UTF-8 is likely
happening; it's just not being done correctly.  Can you provide a bit
more context, e.g. the code that is actually setting the title value?
Have you tried writing the string value out to a UTF-8 writer without
Abdera being in the mix?  E.g. what do you get when you do something like:

    Writer w = new OutputStreamWriter(System.out,"UTF-8");
    w.write(titlevalue);
    w.flush();
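
Since the terminal applies its own encoding to whatever lands on System.out, a variant of the same check that prints the bytes in hex may be easier to interpret. This is only a sketch: the class and method names (`Utf8HexDump`, `utf8Hex`) are made up for illustration, and the in-memory buffer stands in for System.out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8HexDump {

  // Writes the value through a UTF-8 writer into memory and returns
  // the resulting bytes as space-separated hex pairs.
  static String utf8Hex(String value) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
      w.write(value);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : buf.toByteArray()) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Substitute the real title value here; "\u00eb" (ë) is a stand-in.
    System.out.println(utf8Hex("\u00eb"));
  }
}
```

An intact ë comes out as "c3 ab". If the string was already damaged before it reached the writer, the dump shows something else, for example "ef bf bd" when the character has become U+FFFD.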

- James

Re: more fun with character encodings

Posted by Brian Moseley <bc...@osafoundation.org>.
On 9/8/07, James M Snell <ja...@gmail.com> wrote:
> I just wanted to follow up with this.  If you think the title value is
> being converted to MacRoman at some point, try forcing it back to UTF-8
> just prior to setting the value in the abdera objects, e.g.

ok, i'll give this a try when i get a chance to cycle back to the bug.

interestingly enough, when i use this code to base64 encode the
string, after the client decodes it, the characters display properly.
you'd think i'd need to be using MacRoman to do the first byte
conversion...

            byte[] bytes = out.toString().getBytes("UTF-8");
            String value = new String(Base64.encodeBase64(bytes), "UTF-8");
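
One possible reading of why no MacRoman step is needed: a Java String holds Unicode text in memory, and getBytes("UTF-8") encodes from that text regardless of the platform default. If the round trip succeeds, the String itself was presumably intact, which would point at a later stage for the damage. A minimal sketch of the round trip, assuming Java 8 or later and swapping in the JDK's own `java.util.Base64` for the `Base64.encodeBase64` call above; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64RoundTrip {

  // Server side: encode the text as UTF-8 bytes, then as ASCII-only Base64.
  static String encode(String text) {
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    return Base64.getEncoder().encodeToString(utf8);
  }

  // Client side: undo Base64, then interpret the bytes as UTF-8.
  static String decode(String base64) {
    byte[] utf8 = Base64.getDecoder().decode(base64);
    return new String(utf8, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String original = "\u00eb";  // ë, a stand-in for the real value
    String wire = encode(original);
    System.out.println(wire);
    System.out.println(decode(wire).equals(original));
  }
}
```

Because the Base64 text is pure ASCII, it passes through any later default-charset conversion unchanged, which may be why this path works where the plain title does not.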

Re: more fun with character encodings

Posted by James M Snell <ja...@gmail.com>.
I just wanted to follow up with this.  If you think the title value is
being converted to MacRoman at some point, try forcing it back to UTF-8
just prior to setting the value in the abdera objects, e.g.

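import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;

public class Misc {

  public static void main(String... args) throws Exception {

    // escaped so the result does not depend on the source file's encoding
    String t = "\u00eb";   // ë

    // simulate the corruption: UTF-8 bytes misread as MacRoman
    String s = convert(t, "UTF-8","MacRoman");

    Writer w = new OutputStreamWriter(System.out,"UTF-8");

    w.write(s);     // wrong
    w.write("\n");
    // undo it: re-encode as MacRoman, then decode those bytes as UTF-8
    w.write(convert(s,"MacRoman","UTF-8"));   // correct

    w.flush();

  }

  // encodes the string to bytes using 'from', then decodes those bytes as 'to'
  public static String convert(
    String string,
    String from,
    String to)
      throws UnsupportedEncodingException {
    return new String(string.getBytes(from),to);
  }
}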

If you're converting from the default charset to UTF-8, and you're not
sure what the default charset is, use
java.nio.charset.Charset.defaultCharset().name() to get the name of the
default charset at runtime.
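
A small sketch of that check, which also compares what a no-argument getBytes() produces against explicit UTF-8. The class and method names are made up for illustration. `file.encoding` is the standard system property the JVM consults at startup, and whether overriding it (e.g. -Dfile.encoding=UTF-8) helps in this particular setup is an assumption to verify.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultCharsetCheck {

  // True when encoding with the platform default yields the same bytes
  // as encoding with UTF-8 for the given sample text.
  static boolean defaultEncodesLikeUtf8(String sample) {
    return Arrays.equals(sample.getBytes(), sample.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    System.out.println("default charset: " + Charset.defaultCharset().name());
    System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    // Use a non-ASCII sample; ASCII encodes identically in most charsets.
    System.out.println("matches UTF-8:   " + defaultEncodesLikeUtf8("\u00eb"));
  }
}
```

If the last line prints false, any Reader, Writer, or getBytes() call that omits the charset will produce non-UTF-8 bytes on that machine.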

If this doesn't work for you, then we definitely still have a problem :-)

- James
