Posted to issues@shindig.apache.org by "Gagandeep Singh (JIRA)" <ji...@apache.org> on 2010/07/21 21:36:52 UTC

[jira] Created: (SHINDIG-1395) MutableContent causing lossy content encoding

MutableContent causing lossy content encoding
---------------------------------------------

                 Key: SHINDIG-1395
                 URL: https://issues.apache.org/jira/browse/SHINDIG-1395
             Project: Shindig
          Issue Type: Bug
          Components: Java
            Reporter: Gagandeep Singh
            Assignee: Gagandeep Singh
            Priority: Critical


MutableContent.getRawContentBytes and MutableContent.getContent are buggy because they serialize the Document into a UTF-8 string, disregarding the original encoding of the page, which is known to the HttpResponse object.

Here is how it goes wrong for the accel servlet:

AccelServlet.doFetch ->
DefaultResponseRewriterRegistry.rewriteHttpResponse ->
HttpResponseBuilder.create ->
new HttpResponse ->
HttpResponseBuilder.getResponse ->
MutableContent.getRawContentBytes()

NOTE: This could also be a problem with gadgets. Need to verify.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Created: (SHINDIG-1395) MutableContent causing lossy content encoding

Posted by Gagandeep singh <ga...@gmail.com>.
The problem is that the headers still carry the original encoding. For
example, say the site example.org is encoded in GBK.
First we copy over the response headers, so our HttpResponse also has
Content-Type: text/html; charset=GBK

Now we send this response for rewriting, during which the DOM is parsed and
then serialized. Once it is serialized, its byte representation is in
UTF-8, but the original Content-Type header is still present.
So the poor browser decodes the UTF-8 bytes as GBK and ends up showing
garbled characters.
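
To make the mismatch concrete, here is a minimal standalone sketch. It is not
Shindig code: the class and method names are made up for illustration, and it
assumes the JDK in use ships the GBK charset (standard JDK builds do, minimal
runtimes may not). It encodes a string as UTF-8, the way the serialized DOM
ends up, and then decodes it using the charset a stale header would announce.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Shindig.
class CharsetMismatchDemo {

    // Decodes the body the way a client would: using the charset named
    // in the Content-Type header, whatever the bytes really are.
    static String decodeAs(byte[] body, String headerCharset) {
        return new String(body, Charset.forName(headerCharset));
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters

        // After rewriting, the serialized body is UTF-8 ...
        byte[] rewrittenBody = original.getBytes(StandardCharsets.UTF_8);

        // ... but the copied header still says charset=GBK.
        String withStaleHeader = decodeAs(rewrittenBody, "GBK");
        String withSyncedHeader = decodeAs(rewrittenBody, "UTF-8");

        System.out.println("stale GBK header round-trips:  "
            + original.equals(withStaleHeader));
        System.out.println("synced UTF-8 header round-trips: "
            + original.equals(withSyncedHeader));
    }
}
```

The bytes themselves are fine in both cases. Only the label is wrong, which is
why fixing the header (or keeping the original encoding) is enough.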

On Thu, Jul 22, 2010 at 1:58 AM, John Hjelmstad <fa...@google.com> wrote:

> AFAIK this is a feature, not a bug. It standardizes output as UTF8, which
> should be able to represent any character data.

Re: [jira] Created: (SHINDIG-1395) MutableContent causing lossy content encoding

Posted by John Hjelmstad <fa...@google.com>.
AFAIK this is a feature, not a bug. It standardizes output as UTF8, which
should be able to represent any character data.


[jira] Commented: (SHINDIG-1395) MutableContent causing lossy content encoding

Posted by "John Hjelmstad (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SHINDIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890881#action_12890881 ] 

John Hjelmstad commented on SHINDIG-1395:
-----------------------------------------

AFAIK this is a feature, not a bug. It standardizes output as UTF8, which
should be able to represent any character data.






[jira] Commented: (SHINDIG-1395) MutableContent causing lossy content encoding

Posted by "Gagandeep Singh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SHINDIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890885#action_12890885 ] 

Gagandeep Singh commented on SHINDIG-1395:
------------------------------------------

The problem is that the headers still carry the original encoding. For
example, say the site example.org is encoded in GBK.
First we copy over the response headers, so our HttpResponse also has
Content-Type: text/html; charset=GBK

Now we send this response for rewriting, during which the DOM is parsed and
then serialized. Once it is serialized, its byte representation is in
UTF-8, but the original Content-Type header is still present.
So the poor browser decodes the UTF-8 bytes as GBK and ends up showing
garbled characters.






[jira] Commented: (SHINDIG-1395) MutableContent causing lossy content encoding

Posted by "Gagandeep Singh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SHINDIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890895#action_12890895 ] 

Gagandeep Singh commented on SHINDIG-1395:
------------------------------------------

Please don't worry about reviewing it now. I discussed this with Anupama but
left to watch the movie Inception, so I could not finish this patch earlier
and get her to look at it. Let us review it locally first. I will send the
change out to dev@ once we are sure.
After chatting with Ziv, it seems this approach would be the smallest change;
we can then look at other calls which might have this bug and fix them
later.

Thanks
Gagan






[jira] Updated: (SHINDIG-1395) MutableContent causing lossy content encoding

Posted by "Gagandeep Singh (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SHINDIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gagandeep Singh updated SHINDIG-1395:
-------------------------------------

    Comment: was deleted

(was: Please don't worry about reviewing it now. I discussed this with Anupama but
left to watch the movie Inception, so I could not finish this patch earlier
and get her to look at it. Let us review it locally first. I will send the
change out to dev@ once we are sure.
After chatting with Ziv, it seems this approach would be the smallest change;
we can then look at other calls which might have this bug and fix them
later.

Thanks
Gagan


)



[jira] Resolved: (SHINDIG-1395) MutableContent causing lossy content encoding

Posted by "John Hjelmstad (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SHINDIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hjelmstad resolved SHINDIG-1395.
-------------------------------------

    Resolution: Fixed

Committed http://codereview.appspot.com/1903045/show, resolving this issue. Thanks!



[jira] Commented: (SHINDIG-1395) MutableContent causing lossy content encoding

Posted by "John Hjelmstad (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SHINDIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890890#action_12890890 ] 

John Hjelmstad commented on SHINDIG-1395:
-----------------------------------------

OK, reading the CL on this issue (http://codereview.appspot.com/1881043/show), I understand the problem description: content encoding is lost when HttpResponse objects are repeatedly generated w/o being mutated or read, since the bytes are passed through w/o the encoding. Reviewing the CL now.



[jira] Commented: (SHINDIG-1395) MutableContent causing lossy content encoding

Posted by "John Hjelmstad (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SHINDIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892838#action_12892838 ] 

John Hjelmstad commented on SHINDIG-1395:
-----------------------------------------

After reconsidering the patch and Gagan's comment, the problem has become clear in my mind: indeed, Content-Type gets out of sync w/ the encoding of the data. This happens on write rather than read, which was the confusion.

With this in mind, the reviewed patch is a great start. I've created a slightly augmented version of this that seeks to balance the competing needs here:
A. Sync Content-Type header w/ encoding when changed.
B. Set encoding explicitly to UTF8 when interacting w/ Strings, which are converted to UTF8 in all cases.
C. Ability to have no Content-Type charset specified (default behavior for things like images)

http://codereview.appspot.com/1903045/show
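
For readers without access to the CL, the three needs above can be sketched
roughly as follows. This is an illustration only and not the committed patch:
the class, method names, and regex are assumptions made up for this example,
and the real HttpResponseBuilder API may look quite different.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Shindig.
class ContentTypeSync {

    // Matches an existing "; charset=..." parameter, quoted or not.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
        ";\\s*charset\\s*=\\s*\"?[^;\"\\s]*\"?", Pattern.CASE_INSENSITIVE);

    // (A) Rewrites the header so its charset matches the given encoding.
    // (C) A null encoding means "no charset": any existing charset
    //     parameter is dropped and none is added (e.g. for images).
    static String withCharset(String contentType, Charset encoding) {
        if (contentType == null) {
            return null;
        }
        String base = CHARSET_PARAM.matcher(contentType).replaceAll("").trim();
        return encoding == null ? base : base + "; charset=" + encoding.name();
    }

    // (B) String content is always turned into UTF-8 bytes, so the
    //     encoding that goes with it is explicitly UTF-8.
    static byte[] encodeString(String body) {
        return body.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String stale = "text/html; charset=GBK";
        byte[] body = encodeString("<html>...</html>");
        System.out.println(withCharset(stale, StandardCharsets.UTF_8));
        System.out.println(withCharset("image/png", null));
        System.out.println(body.length + " bytes of UTF-8 content");
    }
}
```

The point of routing every change through one helper is that the bytes and the
header can no longer drift apart on write.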



[jira] Commented: (SHINDIG-1395) MutableContent causing lossy content encoding

Posted by "John Hjelmstad (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SHINDIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890891#action_12890891 ] 

John Hjelmstad commented on SHINDIG-1395:
-----------------------------------------

Note: this isn't an issue for gadgets since gadgets use the GadgetRewriterRegistry, and this problem exists in the HttpResponse creation chain.
