Posted to users@shindig.apache.org by Justin Wyllie <ju...@hotmail.co.uk> on 2010/07/19 14:39:26 UTC

Encoding problem with metadata service









Hi
This may be a very silly question but...
I have two gadgets, one whose XML declaration specifies UTF-8 and the other iso-8859-1. Both have the ModulePrefs title set to Iñtërnâtiônàlizætiøn. Both files are correctly saved, as UTF-8 and iso-8859-1 respectively.
For the UTF-8 gadget, the metadata service returns the title with all the non-ASCII characters encoded as Unicode escape sequences, e.g. \u00f8.
For the iso-8859-1 gadget, the metadata service returns Itrntinliztin, that is, the title with the non-ASCII characters simply stripped out.
Does this mean that Shindig / the Gadgets spec simply does not support character encodings other than UTF-8? The Google docs say "Gadgets are specified in XML. The first line is the standard way to start an XML file.", giving a UTF-8 example. But this does not answer the question.
Thanks
Justin

 		 	   		  

Re: Bug in PHP Shindig: non UTF-8 gadgets lose all non asci characters

Posted by Bastian Hofmann <ba...@googlemail.com>.
Hi,

I tested this and could reproduce the behavior. I created a patch which
fixes this issue as well as another issue, which occurs when the
charset is not sent via an HTTP header but is only present in the
XML declaration.

Please review this at http://codereview.appspot.com/1952044/
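
For readers following along, the second issue (charset given only in the XML declaration) can be illustrated with a minimal sketch. This is not the code from the patch; detectCharset() and its fallback order are assumptions made up for illustration, and it expects PHP 7.1 or later.

```php
<?php
// Hypothetical helper, not taken from the Shindig patch: choose the source
// charset of a fetched gadget spec.
function detectCharset(?string $contentTypeHeader, string $xml, string $default = 'UTF-8'): string
{
    // 1. Prefer an explicit charset parameter in the Content-Type header.
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)"?/i', $contentTypeHeader, $m)) {
        return $m[1];
    }
    // 2. Otherwise fall back to the encoding named in the XML declaration.
    if (preg_match('/^\s*<\?xml[^>]*?encoding\s*=\s*([\'"])([\w.:-]+)\1/i', $xml, $m)) {
        return $m[2];
    }
    // 3. With neither hint present, assume the default (UTF-8 here).
    return $default;
}

$spec = '<?xml version="1.0" encoding="iso-8859-1"?><Module/>';
echo detectCharset('text/xml; charset=ISO-8859-1', '<Module/>'), "\n"; // from the header
echo detectCharset('text/xml', $spec), "\n";                           // from the declaration
echo detectCharset(null, '<Module/>'), "\n";                           // default
```

The returned name could then be passed as the source encoding to mb_convert_encoding(); whether the header or the declaration should win when they disagree is a design choice this sketch does not settle.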

Cheers

Bastian

2010/8/18 Paul Lindner <pl...@linkedin.com>:
> I'm not sure if you read this.  Do you think it's a correct assessment?
> ---------- Forwarded message ----------
> From: Justin Wyllie <ju...@hotmail.co.uk>
> Date: Tue, Jul 20, 2010 at 4:18 AM
> Subject: Bug in PHP Shindig: non UTF-8 gadgets lose all non asci characters
> To: dev@shindig.apache.org
>
>
>
> The original problem, which I posted to the users list, was that gadgets with
> non-UTF-8 encodings (I used iso-8859-1 to test) were losing all non-ASCII
> characters in both the title (metadata call) and content (gadget rendering
> call).
> Details of the problem and solution are as follows:
>
> In BasicRemoteContentFetcher this line:
>     $content = mb_convert_encoding($content, 'UTF-8', $charset);
> converts the fetched XML string to UTF-8 from whatever encoding it was in.
> ($charset is the source encoding)
> But the XML declaration line is not touched. So, after this we may have a
> gadget like this:
> <?xml version="1.0" encoding="iso-8859-1"?>
> <Module>
>   <ModulePrefs title="IñtërnâtiônàlizætiønX" />
>   <Content type="html">
>     <![CDATA[          ]]>
>   </Content>
> </Module>
> which is UTF-8 encoded but with an iso-8859-1 encoding attribute.
> Later in the call (metadata request or gadget rendering) in
> GadgetSpecParser->parse() we load the XML content into an XML DOM object. At
> this point the error occurs - naturally as the UTF-8 content is flagged as
> being in iso-8859-1.
> My fix is as follows:
> In BasicRemoteContentFetcher->parseResult replace:
> $content = mb_convert_encoding($content, 'UTF-8', $charset);
> with
>   $content = mb_convert_encoding($content, 'UTF-8', $charset);
>   $pattern = 'encoding=\s*([\'"])' . $charset . '\s*\1';
>   $content = mb_ereg_replace($pattern, 'encoding="UTF-8"', $content, "i");
> Now the XML is UTF-8 encoded and has the correct UTF-8 encoding attribute.
> Justin
>
> --
> Paul Lindner -- plindner@linkedin.com -- linkedin.com/in/plindner
>

Bug in PHP Shindig: non UTF-8 gadgets lose all non asci characters

Posted by Justin Wyllie <ju...@hotmail.co.uk>.
The original problem, which I posted to the users list, was that gadgets with non-UTF-8 encodings (I used iso-8859-1 to test) were losing all non-ASCII characters in both the title (metadata call) and content (gadget rendering call).
Details of the problem and solution are as follows:

In BasicRemoteContentFetcher this line:
     $content = mb_convert_encoding($content, 'UTF-8', $charset);
converts the fetched XML string to UTF-8 from whatever encoding it was in. ($charset is the source encoding)
But the XML declaration line is not touched. So, after this we may have a gadget like this:
<?xml version="1.0" encoding="iso-8859-1"?>
<Module>
  <ModulePrefs title="IñtërnâtiônàlizætiønX" />
  <Content type="html">
    <![CDATA[          ]]>
  </Content>
</Module>
which is UTF-8 encoded but with an iso-8859-1 encoding attribute.
Later in the call (metadata request or gadget rendering) in GadgetSpecParser->parse() we load the XML content into an XML DOM object. At this point the error occurs - naturally as the UTF-8 content is flagged as being in iso-8859-1.
My fix is as follows:
In BasicRemoteContentFetcher->parseResult replace:
$content = mb_convert_encoding($content, 'UTF-8', $charset);
with 
  $content = mb_convert_encoding($content, 'UTF-8', $charset);
  $pattern = 'encoding=\s*([\'"])' . $charset . '\s*\1';
  $content = mb_ereg_replace($pattern, 'encoding="UTF-8"', $content, "i");
Now the XML is UTF-8 encoded and has the correct UTF-8 encoding attribute.
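
The fix described above can be tried in isolation with a small standalone sketch. It is not the Shindig code itself: convertSpecToUtf8() is a hypothetical wrapper, preg_replace() with preg_quote() is swapped in for mb_ereg_replace() so the result does not depend on the mbregex encoding setting, and the mbstring and dom extensions are assumed to be available.

```php
<?php
// Hypothetical wrapper, not the actual BasicRemoteContentFetcher code:
// convert a fetched gadget spec to UTF-8 and make the declaration agree.
function convertSpecToUtf8(string $content, string $charset): string
{
    // Re-encode the bytes from the source charset to UTF-8.
    $content = mb_convert_encoding($content, 'UTF-8', $charset);
    // Rewrite encoding="<charset>" (either quote style, any letter case)
    // so the declaration matches the new byte encoding.
    $pattern = '/encoding\s*=\s*([\'"])' . preg_quote($charset, '/') . '\1/i';
    return preg_replace($pattern, 'encoding="UTF-8"', $content, 1);
}

// An iso-8859-1 encoded spec; the \xNN escapes are the Latin-1 bytes of
// the accented letters in the test title.
$latin1 = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>"
        . "<Module><ModulePrefs title=\"I\xF1t\xEBrn\xE2ti\xF4n\xE0liz\xE6ti\xF8n\" /></Module>";

$utf8 = convertSpecToUtf8($latin1, 'iso-8859-1');

// Parse the converted spec and read the title back as UTF-8.
$doc = new DOMDocument();
$doc->loadXML($utf8);
$title = $doc->getElementsByTagName('ModulePrefs')->item(0)->getAttribute('title');
echo $title, "\n";
```

With the declaration rewritten, the parser is told the truth about the bytes it receives, so the title should come back with its accented characters intact.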
Justin






 		 	   		  

Re: Encoding problem with metadata service

Posted by Paul Lindner <pl...@linkedin.com>.
This might be an issue with the metadata service, which is a simple servlet
that does not do much for character encoding.

For something more modern have a look at GadgetsHandler.java and the
metadata.get call.


On Mon, Jul 19, 2010 at 5:39 AM, Justin Wyllie
<ju...@hotmail.co.uk> wrote:

> Hi
> This may be a very silly question but...
> I have two gadgets, one whose XML declaration specifies UTF-8 and the other
> iso-8859-1. Both have the ModulePrefs title set to Iñtërnâtiônàlizætiøn.
> Both files are correctly saved, as UTF-8 and iso-8859-1 respectively.
> For the UTF-8 gadget, the metadata service returns the title with all the
> non-ASCII characters encoded as Unicode escape sequences, e.g. \u00f8.
> For the iso-8859-1 gadget, the metadata service returns Itrntinliztin,
> that is, the title with the non-ASCII characters simply stripped out.
> Does this mean that Shindig / the Gadgets spec simply does not support
> character encodings other than UTF-8? The Google docs say "Gadgets are
> specified in XML. The first line is the standard way to start an XML file.",
> giving a UTF-8 example. But this does not answer the question.
> Thanks
> Justin



-- 
Paul Lindner -- plindner@linkedin.com -- linkedin.com/in/plindner