Posted to dev@pivot.apache.org by Niclas Hedhman <ni...@hedhman.org> on 2010/11/15 01:31:18 UTC

Fwd: Character '€'

(can't post to user@ :-( )

On Sat, Oct 30, 2010 at 8:08 PM, Greg Brown <gk...@mac.com> wrote:
> It's because BXMLSerializer assumes that BXML files are encoded in UTF-8. There is currently no way to specify an alternate encoding.

I would categorize this as a bug. XML deserializers must respect the
encoding, otherwise we end up with a mess ;-) especially when we have
mixed namespaces and multiple intermixed consumers...

Doesn't the XML deserializer you use just work correctly if you pass
an InputStream instead of a Reader??

Cheers
--
Niclas Hedhman, Software Developer
http://www.qi4j.org - New Energy for Java

I  live here; http://tinyurl.com/2qq9er
I  work here; http://tinyurl.com/2ymelc
I relax here; http://tinyurl.com/2cgsug




Re: Character '€'

Posted by Niclas Hedhman <ni...@hedhman.org>.
On Mon, Nov 15, 2010 at 9:03 AM, Greg Brown <gk...@mac.com> wrote:
> The problem is that, even if the PI specifies UTF-8 for example, the file itself may be saved with a different encoding (so they may not match).

That is not a problem, but a bug with the user... Don't try to
outsmart the stupidity of others; it is a losing battle. ;-)


Re: Character '€'

Posted by Greg Brown <gk...@mac.com>.
Either way, you're probably right that entering it as a bug makes sense. That way we can track it and investigate further.
G

On Nov 14, 2010, at 8:03 PM, Greg Brown wrote:

> The problem is that, even if the PI specifies UTF-8 for example, the file itself may be saved with a different encoding (so they may not match).
> 
> On Nov 14, 2010, at 8:00 PM, Niclas Hedhman wrote:
> 
>> On Mon, Nov 15, 2010 at 8:52 AM, Greg Brown <gk...@mac.com> wrote:
>>>> Doesn't the XML deserializer you use just work correctly if you pass
>>>> an InputStream instead of a Reader??
>>> 
>>> 
>>> Actually, I think a Reader would work but we don't currently expose that API. We use javax.xml.stream.XMLInputFactory#createXMLStreamReader() to process the XML, which takes an InputStream as an argument. What we should probably do is allow the caller to specify the character set to read (there is another version of createXMLStreamReader() that takes both an InputStream and an encoding name as a String).
>> 
>> That is incorrect. The XML specification says that the <?xml?> processing
>> instruction is in (IIRC) ASCII and that it names the encoding of the
>> rest of the document, such as <?xml version="1.0" encoding="UTF-8"?>,
>> and compliant parsers should understand this. So, for instance, if
>> the document is in UTF-16, the <?xml?> PI is NOT, and a regular text
>> editor would have problems handling that. For UTF-8, ISO-8859-X
>> and others, the encoding coincides with ASCII, so this is not so obvious.
>> 
>> Cheers
> 


Re: Character '€'

Posted by Greg Brown <gk...@mac.com>.
The problem is that, even if the PI specifies UTF-8 for example, the file itself may be saved with a different encoding (so they may not match).

On Nov 14, 2010, at 8:00 PM, Niclas Hedhman wrote:

> On Mon, Nov 15, 2010 at 8:52 AM, Greg Brown <gk...@mac.com> wrote:
>>> Doesn't the XML deserializer you use just work correctly if you pass
>>> an InputStream instead of a Reader??
>> 
>> 
>> Actually, I think a Reader would work but we don't currently expose that API. We use javax.xml.stream.XMLInputFactory#createXMLStreamReader() to process the XML, which takes an InputStream as an argument. What we should probably do is allow the caller to specify the character set to read (there is another version of createXMLStreamReader() that takes both an InputStream and an encoding name as a String).
> 
> That is incorrect. The XML specification says that the <?xml?> processing
> instruction is in (IIRC) ASCII and that it names the encoding of the
> rest of the document, such as <?xml version="1.0" encoding="UTF-8"?>,
> and compliant parsers should understand this. So, for instance, if
> the document is in UTF-16, the <?xml?> PI is NOT, and a regular text
> editor would have problems handling that. For UTF-8, ISO-8859-X
> and others, the encoding coincides with ASCII, so this is not so obvious.
> 
> Cheers


Re: Character '€'

Posted by Niclas Hedhman <ni...@hedhman.org>.
On Mon, Nov 15, 2010 at 8:52 AM, Greg Brown <gk...@mac.com> wrote:
>> Doesn't the XML deserializer you use just work correctly if you pass
>> an InputStream instead of a Reader??
>
>
> Actually, I think a Reader would work but we don't currently expose that API. We use javax.xml.stream.XMLInputFactory#createXMLStreamReader() to process the XML, which takes an InputStream as an argument. What we should probably do is allow the caller to specify the character set to read (there is another version of createXMLStreamReader() that takes both an InputStream and an encoding name as a String).

That is incorrect. The XML specification says that the <?xml?> processing
instruction is in (IIRC) ASCII and that it names the encoding of the
rest of the document, such as <?xml version="1.0" encoding="UTF-8"?>,
and compliant parsers should understand this. So, for instance, if
the document is in UTF-16, the <?xml?> PI is NOT, and a regular text
editor would have problems handling that. For UTF-8, ISO-8859-X
and others, the encoding coincides with ASCII, so this is not so obvious.
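
As a rough illustration of the point above (this is not Pivot's code): the
sketch below hands a StAX parser raw bytes through an InputStream, without
naming an encoding, and lets the parser take the encoding from the
declaration at the top of the document. The class and method names
(DeclaredEncodingDemo, encode, readText) and the <price> element are made
up for the example. ISO-8859-1 cannot represent '€', so 'é' stands in for
it in that case.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of Pivot or BXMLSerializer.
final class DeclaredEncodingDemo {

    /** Builds a one-element document that declares the charset it is actually encoded in. */
    static byte[] encode(String text, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
                + "<price>" + text + "</price>";
        return xml.getBytes(charset);
    }

    /** Parses raw bytes and returns the root element's text; no encoding is passed in. */
    static String readText(byte[] document) {
        try (InputStream in = new ByteArrayInputStream(document)) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamReader.START_ELEMENT) {
                    return reader.getElementText();
                }
            }
            throw new IllegalStateException("no root element found");
        } catch (XMLStreamException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same parser call for both documents; only the bytes and the declaration differ.
        System.out.println(readText(encode("\u20ac 10", StandardCharsets.UTF_8)));
        System.out.println(readText(encode("caf\u00e9", StandardCharsets.ISO_8859_1)));
    }
}
```

The caller never tells the parser which charset to use; the bytes and the
declaration agree, so the text comes back intact in both cases.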

Cheers

Re: Character '€'

Posted by Greg Brown <gk...@mac.com>.
>> It's because BXMLSerializer assumes that BXML files are encoded in UTF-8. There is currently no way to specify an alternate encoding.
> 
> I would categorize this as a bug. XML deserializers must respect the
> encoding, otherwise we end up with a mess ;-) especially when we have
> mixed namespaces and multiple intermixed consumers...
> 
> Doesn't the XML deserializer you use just work correctly if you pass
> an InputStream instead of a Reader??


Actually, I think a Reader would work but we don't currently expose that API. We use javax.xml.stream.XMLInputFactory#createXMLStreamReader() to process the XML, which takes an InputStream as an argument. What we should probably do is allow the caller to specify the character set to read (there is another version of createXMLStreamReader() that takes both an InputStream and an encoding name as a String).

G
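
A minimal sketch of the caller-specified encoding idea, assuming nothing
about how BXMLSerializer would actually expose it. The standard StAX
overload is createXMLStreamReader(InputStream, String), which takes the
encoding as a name. ExplicitEncodingDemo, readText and the <name> element
are hypothetical names used only for illustration. ISO-8859-1 cannot
represent '€', so 'é' is used as the non-ASCII character here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of Pivot or BXMLSerializer.
final class ExplicitEncodingDemo {

    /** Parses the bytes using the encoding name supplied by the caller and returns the root text. */
    static String readText(byte[] document, String encoding) {
        try {
            InputStream in = new ByteArrayInputStream(document);
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in, encoding);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamReader.START_ELEMENT) {
                    return reader.getElementText();
                }
            }
            throw new IllegalStateException("no root element found");
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A document saved as ISO-8859-1 with no XML declaration at all.
        byte[] document = "<name>caf\u00e9</name>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readText(document, "ISO-8859-1"));
    }
}
```

With no declaration in the bytes, the parser has only the supplied name to
go on. Passing the stream alone would fall back to auto-detection, which
would most likely treat the input as UTF-8 and reject the lone 0xE9 byte.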