Posted to issues@shindig.apache.org by John Hjelmstad <fa...@google.com> on 2010/07/21 22:28:02 UTC

Re: [jira] Created: (SHINDIG-1395) MutableContent causing lossy content encoding

AFAIK this is a feature, not a bug. It standardizes output as UTF8, which
should be able to represent any character data.

On Wed, Jul 21, 2010 at 12:36 PM, Gagandeep Singh (JIRA) <ji...@apache.org> wrote:

> MutableContent causing lossy content encoding
> ---------------------------------------------
>
>                 Key: SHINDIG-1395
>                 URL: https://issues.apache.org/jira/browse/SHINDIG-1395
>             Project: Shindig
>          Issue Type: Bug
>          Components: Java
>            Reporter: Gagandeep Singh
>            Assignee: Gagandeep Singh
>            Priority: Critical
>
>
> MutableContent.getRawContentBytes and MutableContent.getContent are buggy
> because they serialize the Document into a utf8 string disregarding the
> original encoding of the page that is known to the HttpResponse object.
>
> Here is how it goes wrong for accel servlet:
>
> AccelServlet.doFetch ->
> DefaultResponseRewriterRegistry.rewriteHttpResponse ->
> HttpResponseBuilder.create ->
> new HttpResponse ->
> HttpResponseBuilder.getResponse ->
> MutableContent.getRawContentBytes()
>
> NOTE: This could also be a problem with gadgets. Need to verify.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

Re: [jira] Created: (SHINDIG-1395) MutableContent causing lossy content encoding

Posted by Gagandeep singh <ga...@gmail.com>.

The problem is, headers still have the original encoding. For example, say
the site example.org had the encoding GBK.
First we copy over the response headers, so our HttpResponse also has
Content-Type: text/html; charset=GBK

Now, we send this response for rewriting, during which the DOM is parsed and
then serialized. Once it is serialized, its byte representation is UTF-8,
but the original Content-Type header (charset=GBK) is still present.
So the poor browser decodes UTF-8 bytes as GBK and ends up showing garbled
characters.
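
To make the mismatch concrete, here is a minimal standalone sketch. It is not
Shindig code: the class CharsetMismatchDemo and the helpers decodeAs and
withCharset are hypothetical names made up for illustration, and it assumes
the JDK in use ships the GBK charset (standard JDK builds do). It only shows
that UTF-8 bytes read under a GBK header come out wrong, and that updating
the charset in the header to match the serialized bytes is one possible way
to keep the two consistent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Hypothetical helper: decode the body the way a client trusting the
    // Content-Type charset would.
    static String decodeAs(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    // Hypothetical helper: keep the media type, replace any parameters with
    // the given charset.
    static String withCharset(String contentType, String charsetName) {
        String mediaType = contentType.split(";", 2)[0].trim();
        return mediaType + "; charset=" + charsetName;
    }

    public static void main(String[] args) {
        String page = "\u4e2d\u6587"; // two CJK characters
        String header = "text/html; charset=GBK"; // copied from the origin

        // After parse + serialize, the bytes are UTF-8 regardless of header.
        byte[] rewritten = page.getBytes(StandardCharsets.UTF_8);

        // Client trusts the stale header: the text no longer matches.
        boolean intactWithStaleHeader = decodeAs(rewritten, "GBK").equals(page);
        System.out.println("stale header intact: " + intactWithStaleHeader);

        // Header updated to match the serialized bytes: the text survives.
        String fixedHeader = withCharset(header, "UTF-8");
        boolean intactWithFixedHeader = decodeAs(rewritten, "UTF-8").equals(page);
        System.out.println(fixedHeader + " intact: " + intactWithFixedHeader);
    }
}
```

Whether the right fix is to rewrite the header or to serialize back into the
original encoding is a separate design question; the sketch only demonstrates
why the two must agree.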

On Thu, Jul 22, 2010 at 1:58 AM, John Hjelmstad <fa...@google.com> wrote:

> AFAIK this is a feature, not a bug. It standardizes output as UTF8, which
> should be able to represent any character data.
>
> On Wed, Jul 21, 2010 at 12:36 PM, Gagandeep Singh (JIRA) <jira@apache.org> wrote:
>
> > MutableContent causing lossy content encoding
> > ---------------------------------------------
> >
> >                 Key: SHINDIG-1395
> >                 URL: https://issues.apache.org/jira/browse/SHINDIG-1395
> >             Project: Shindig
> >          Issue Type: Bug
> >          Components: Java
> >            Reporter: Gagandeep Singh
> >            Assignee: Gagandeep Singh
> >            Priority: Critical
> >
> >
> > MutableContent.getRawContentBytes and MutableContent.getContent are buggy
> > because they serialize the Document into a utf8 string disregarding the
> > original encoding of the page that is known to the HttpResponse object.
> >
> > Here is how it goes wrong for accel servlet:
> >
> > AccelServlet.doFetch ->
> > DefaultResponseRewriterRegistry.rewriteHttpResponse ->
> > HttpResponseBuilder.create ->
> > new HttpResponse ->
> > HttpResponseBuilder.getResponse ->
> > MutableContent.getRawContentBytes()
> >
> > NOTE: This could also be a problem with gadgets. Need to verify.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>
