Posted to dev@shindig.apache.org by Brian Eaton <be...@google.com> on 2008/02/01 21:41:53 UTC

RemoteContentFetcher and i18n

The current fetchJson implementation uses "new
String(results.getByteArray())" to convert the response bytes to a
string for inclusion in the JSON reply to the gadget.  The behavior of
new String(byte[]) is undefined "when the given bytes are not valid in
the default charset".

The default charset could be anything, and the returned bytes from the
remote server could also be anything.  This is likely to cause
problems (data corruption) for gadgets fetching data from non-English
web sites.

I'll open up a JIRA issue for this, but I wanted to see whether anyone
had proposals for a solution.  The fix will probably involve using
CharsetDecoder, so we at least have well-defined behavior.  How we
pick the CharsetDecoder to use is an open question.  What to do when
the CharsetDecoding fails is another issue.  I'm tempted to put in a
quick fix that specifies UTF-8 for the character set.  That will
prevent anyone from depending on the current undefined behavior while
we work out what should happen.

Cheers,
Brian
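
[For context, a minimal sketch -- not Shindig code -- of the difference between the current platform-dependent decode and the proposed UTF-8 quick fix. Class and method names are illustrative only:]

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Platform-default decoding, as fetchJson does today: the result depends
    // on the JVM's default charset, so the same response bytes can decode
    // differently on different machines.
    static String platformDecode(byte[] bytes) {
        return new String(bytes);
    }

    // Explicit UTF-8 decoding, the proposed quick fix: well-defined on every
    // platform. Invalid UTF-8 sequences become U+FFFD replacement characters
    // instead of charset-dependent garbage.
    static String utf8Decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}; // UTF-8 for U+20AC
        System.out.println(utf8Decode(euro)); // the euro sign on any platform
    }
}
```

[StandardCharsets is Java 7+; on older JDKs the equivalent is new String(bytes, "UTF-8") with a catch for UnsupportedEncodingException.]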

Re: RemoteContentFetcher and i18n

Posted by John Hjelmstad <fa...@google.com>.
The quick fix of specifying UTF-8 sounds good to me. As you say, it's better
than unspecified behavior.

--John

On Fri, Feb 1, 2008 at 12:41 PM, Brian Eaton <be...@google.com> wrote:

> The current fetchJson implementation uses "new
> String(results.getByteArray())" to convert the response bytes to a
> string for inclusion in the JSON reply to the gadget.  The behavior of
> new String(byte[]) is undefined "when the given bytes are not valid in
> the default charset".
>
> The default charset could be anything, and the returned bytes from the
> remote server could also be anything.  This is likely to cause
> problems (data corruption) for gadgets fetching data from non-English
> web sites.
>
> I'll open up a JIRA issue for this, but I wanted to see whether anyone
> had proposals for a solution.  The fix will probably involve using
> CharsetDecoder, so we at least have well-defined behavior.  How we
> pick the CharsetDecoder to use is an open question.  What to do when
> the CharsetDecoding fails is another issue.  I'm tempted to put in a
> quick fix that specifies UTF-8 for the character set.  That will
> prevent anyone from depending on the current undefined behavior while
> we work out what should happen.
>
> Cheers,
> Brian
>

Re: RemoteContentFetcher and i18n

Posted by Kevin Brown <et...@google.com>.
On Feb 1, 2008 2:34 PM, John Panzer <jp...@google.com> wrote:

>
> +1, but this context is about converting to JSON, right?  So you can't
> just push bytes through.


Yes, but it veered into a RemoteContentFetcher discussion. For the JSON
proxy it makes sense, but certainly not for all http content retrieval.

> Well behaved origins should declare their source charset encoding
> (though with text/XML it can admittedly get Byzantine).  Ones that
> don't should get 'best effort' at most.


This is why it's tricky, and if we can't detect what encoding it is we
should just fail the request.

Re: RemoteContentFetcher and i18n

Posted by John Panzer <jp...@google.com>.

-John

On Feb 1, 2008, at 12:59 PM, "Kevin Brown" <et...@google.com> wrote:

> On Feb 1, 2008 12:41 PM, Brian Eaton <be...@google.com> wrote:
>
>> The current fetchJson implementation uses "new
>> String(results.getByteArray())" to convert the response bytes to a
>> string for inclusion in the JSON reply to the gadget.  The behavior of
>> new String(byte[]) is undefined "when the given bytes are not valid in
>> the default charset".
>>
>> The default charset could be anything, and the returned bytes from the
>> remote server could also be anything.  This is likely to cause
>> problems (data corruption) for gadgets fetching data from non-English
>> web sites.
>
>
> The default charset is almost always utf-8 in practice (unless you've
> done something particularly bizarre, like modifying system properties),

On some OS/JDK combos, this can be picked up from environment variables
(ack.)

> but you're right that the back end could be anything. Honestly, the real
> answer here is that this should *NOT* be a string at all -- it should be
> a sequence of bytes. RemoteContentFetcher should not care about encoding.
> What if I'm using this to fetch non-text data, such as an image file, for
> the open proxy?

+1, but this context is about converting to JSON, right?  So you can't  
just push bytes through.

>
>
> For text data (such as what you would fetch from gadgets.io.makeRequest),
> it should always be utf-8. This does mean that we need to do encoding
> detection / conversion in here. It has nothing to do with "non-English"
> web sites, but rather websites which use regional character encodings
> (ISO-8859-1 probably being the most problematic since it "looks like"
> ASCII or UTF8 until you start using diacritics; BIG5 is another likely
> problem for Chinese language sites).
>
>> I'll open up a JIRA issue for this, but I wanted to see whether anyone
>> had proposals for a solution.  The fix will probably involve using
>> CharsetDecoder, so we at least have well-defined behavior.  How we
>> pick the CharsetDecoder to use is an open question.  What to do when
>> the CharsetDecoding fails is another issue.  I'm tempted to put in a
>> quick fix that specifies UTF-8 for the character set.  That will
>> prevent anyone from depending on the current undefined behavior while
>> we work out what should happen.
>
>
> If it can't be converted to utf8, or we can't detect the encoding, we
> simply fail the request. This is consistent with the behavior on igoogle
> today.
+1.
Well behaved origins should declare their source charset encoding (though
with text/XML it can admittedly get Byzantine).  Ones that don't should
get 'best effort' at most.
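
[To illustrate the point about the default charset above, a one-line sketch; the
printed name varies with the deployment, which is exactly the problem:]

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // The JVM default charset is taken from the "file.encoding" system
        // property, which on some OS/JDK combinations is seeded from
        // environment variables such as LANG/LC_ALL -- so it can differ
        // between a developer workstation and a production server.
        System.out.println(Charset.defaultCharset().name());
    }
}
```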

Re: RemoteContentFetcher and i18n

Posted by Kevin Brown <et...@google.com>.
On Feb 1, 2008 12:41 PM, Brian Eaton <be...@google.com> wrote:

> The current fetchJson implementation uses "new
> String(results.getByteArray())" to convert the response bytes to a
> string for inclusion in the JSON reply to the gadget.  The behavior of
> new String(byte[]) is undefined "when the given bytes are not valid in
> the default charset".
>
> The default charset could be anything, and the returned bytes from the
> remote server could also be anything.  This is likely to cause
> problems (data corruption) for gadgets fetching data from non-English
> web sites.


The default charset is almost always utf-8 in practice (unless you've done
something particularly bizarre, like modifying system properties), but
you're right that the back end could be anything. Honestly, the real answer
here is that this should *NOT* be a string at all -- it should be a sequence
of bytes. RemoteContentFetcher should not care about encoding. What if I'm
using this to fetch non-text data, such as an image file, for the open
proxy?

For text data (such as what you would fetch from gadgets.io.makeRequest), it
should always be utf-8. This does mean that we need to do encoding detection
/ conversion in here. It has nothing to do with "non-English" web sites, but
rather websites which use regional character encodings (ISO-8859-1 probably
being the most problematic since it "looks like" ASCII or UTF8 until you
start using diacritics; BIG5 is another likely problem for Chinese language
sites).
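
[A rough sketch of what charset detection from the HTTP headers could look like;
the class, method, and regex here are hypothetical, not existing Shindig code,
and headers are only the first signal -- HTML meta tags and XML declarations
would need the same treatment:]

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Rough match for the charset parameter of a Content-Type header value.
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in a Content-Type header value, falling
    // back to UTF-8 when the origin declares none (the 'best effort' case).
    static String declaredCharset(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET_PARAM.matcher(contentType);
            if (m.find()) {
                return m.group(1).toUpperCase(Locale.US);
            }
        }
        return "UTF-8";
    }
}
```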

> I'll open up a JIRA issue for this, but I wanted to see whether anyone
> had proposals for a solution.  The fix will probably involve using
> CharsetDecoder, so we at least have well-defined behavior.  How we
> pick the CharsetDecoder to use is an open question.  What to do when
> the CharsetDecoding fails is another issue.  I'm tempted to put in a
> quick fix that specifies UTF-8 for the character set.  That will
> prevent anyone from depending on the current undefined behavior while
> we work out what should happen.


If it can't be converted to utf8, or we can't detect the encoding, we simply
fail the request. This is consistent with the behavior on igoogle today.
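
[A sketch of the fail-the-request behavior described above, using CharsetDecoder
as Brian suggested; names are illustrative, not actual Shindig code:]

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecoder {
    // Unlike new String(bytes, charset), which silently substitutes U+FFFD
    // for bad input, a CharsetDecoder configured with REPORT throws
    // CharacterCodingException, so the caller can fail the whole request
    // instead of handing the gadget corrupted text.
    static String decodeOrFail(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }
}
```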