Posted to j-users@xerces.apache.org by Gary Gregory <ga...@gmail.com> on 2012/08/13 19:49:46 UTC

Any plans to support UTF-32 BOM?

Hi All:

Any plans to support UTF-32 BOM?

Currently, if I parse a UTF-32 document, I get a "content not expected in
prolog" error.

Thank you,
Gary

-- 
E-Mail: garydgregory@gmail.com | ggregory@apache.org
JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
Spring Batch in Action: http://bit.ly/bqpbCK
Blog: http://garygregory.wordpress.com
Home: http://garygregory.com/
Tweet! http://twitter.com/GaryGregory

Re: Any plans to support UTF-32 BOM?

Posted by Gary Gregory <ga...@gmail.com>.

Thank you for all the info Michael, very helpful.

Gary

On Mon, Aug 13, 2012 at 4:51 PM, Michael Glavassevich
<mr...@ca.ibm.com> wrote:

> Hi Gary,
>
> Gary Gregory <ga...@gmail.com> wrote on 13/08/2012 02:27:33 PM:
>
>
> > Hi Michael,
> >
> > I've not caught one in the savannah either! Nor have I had a customer
> > request for it; either that, or the request did not make it through
> > our sales engineers, professional services, or tech support all the way
> > to me.
> >
> > Our products are XML and buzzword compliant and I am checking my Ps
> > and Qs. So, at this point, the question is rather academic, as you mention.
>
> XML parsers are only required to support UTF-8 and UTF-16. Support for any
> other encodings is icing on the cake.
>
> > I am aware of the inefficiencies involved, but our customers can
> > decide how efficient they want to be for themselves; sometimes they
> > have no control over the format of the documents they have to
> > process with our software. For those who can control the format, I
> > do not know if someone has tried UTF-32, watched it blow up, and then
> > switched to something else.
> >
> > Now, out of curiosity, I do notice an
> > org.apache.xerces.impl.io.UCSReader class in Xerces which is used
> > from a couple of places.
> >
> > Is that not hooked up in all the right spots?
>
> It is, but if presented with a UTF-32 BOM, Xerces won't hit the code path
> where the UCSReader would be used, since its encoding auto-detector doesn't
> recognize UTF-32 BOM byte sequences. It's probably just defaulting to UTF-8
> (since it has no better guess) and then bombing out.
>
> Even if Xerces did support UTF-32, the UCSReader might not be the right
> reader to use anyway. A compliant UTF-32 reader might require more error
> checking (e.g. to reject values that are not valid characters, such as the
> code points in the surrogate range that UTF-16 uses to encode supplementary
> characters).

Re: Any plans to support UTF-32 BOM?

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Gary,

Gary Gregory <ga...@gmail.com> wrote on 13/08/2012 02:27:33 PM:

> Hi Michael,
> 
> I've not caught one in the savannah either! Nor have I had a customer
> request for it; either that, or the request did not make it through
> our sales engineers, professional services, or tech support all the way
> to me.
>
> Our products are XML and buzzword compliant and I am checking my Ps
> and Qs. So, at this point, the question is rather academic, as you mention.

XML parsers are only required to support UTF-8 and UTF-16. Support for any 
other encodings is icing on the cake.

> I am aware of the inefficiencies involved, but our customers can
> decide how efficient they want to be for themselves; sometimes they
> have no control over the format of the documents they have to
> process with our software. For those who can control the format, I
> do not know if someone has tried UTF-32, watched it blow up, and then
> switched to something else.
>
> Now, out of curiosity, I do notice an
> org.apache.xerces.impl.io.UCSReader class in Xerces which is used
> from a couple of places.
> 
> Is that not hooked up in all the right spots?

It is, but if presented with a UTF-32 BOM, Xerces won't hit the code path
where the UCSReader would be used, since its encoding auto-detector doesn't
recognize UTF-32 BOM byte sequences. It's probably just defaulting to
UTF-8 (since it has no better guess) and then bombing out.
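
To illustrate the detection gap described above, here is a minimal sketch of
BOM sniffing that also recognizes the two UTF-32 byte order marks. It is not
Xerces code: the class name BomSniffer and the method detect are hypothetical,
and the byte sequences are the standard Unicode BOM values. The one subtlety
it tries to show is ordering: the UTF-32LE BOM begins with the same two bytes
as the UTF-16LE BOM, so the four-byte checks have to come first.

```java
public class BomSniffer {

    /**
     * Guesses an encoding name from a leading byte order mark.
     * Returns null when no known BOM is present (hypothetical helper,
     * not part of Xerces).
     */
    public static String detect(byte[] b) {
        // UTF-32 is tested before UTF-16 because FF FE 00 00 (UTF-32LE)
        // starts with FF FE (UTF-16LE).
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data with the given unsigned values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf32be = {0x00, 0x00, (byte) 0xFE, (byte) 0xFF, 0x00, 0x00, 0x00, '<'};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, '<', 0x00};
        System.out.println(detect(utf32be)); // UTF-32BE
        System.out.println(detect(utf16le)); // UTF-16LE
    }
}
```

A detector that only knows the UTF-8 and UTF-16 marks would report the first
sample as having no BOM (or misread the little-endian variant as UTF-16LE),
which is consistent with the fallback behaviour guessed at above.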

Even if Xerces did support UTF-32, the UCSReader might not be the right
reader to use anyway. A compliant UTF-32 reader might require more error
checking (e.g. to reject values that are not valid characters, such as the
code points in the surrogate range that UTF-16 uses to encode supplementary
characters).
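
As a rough sketch of the extra checking mentioned above, the following
decodes big-endian UTF-32 and rejects the inputs a strict decoder would be
expected to refuse: truncated input, values above U+10FFFF, and values in the
surrogate range U+D800 to U+DFFF. The class StrictUtf32 and its method
decodeBE are hypothetical names used for illustration only; this is not the
UCSReader and makes no claim about how Xerces would implement it.

```java
public class StrictUtf32 {

    /**
     * Decodes big-endian UTF-32 bytes (without a BOM) into a String.
     * Throws IllegalArgumentException on ill-formed input.
     */
    public static String decodeBE(byte[] bytes) {
        if (bytes.length % 4 != 0) {
            throw new IllegalArgumentException("truncated UTF-32 input");
        }
        StringBuilder out = new StringBuilder(bytes.length / 4);
        for (int i = 0; i < bytes.length; i += 4) {
            // Assemble one 32-bit code unit from four bytes.
            int cp = ((bytes[i] & 0xFF) << 24)
                   | ((bytes[i + 1] & 0xFF) << 16)
                   | ((bytes[i + 2] & 0xFF) << 8)
                   | (bytes[i + 3] & 0xFF);
            // Values outside the Unicode code space (negative means the
            // top bit was set).
            if (cp < 0 || cp > 0x10FFFF) {
                throw new IllegalArgumentException("value out of range at byte " + i);
            }
            // Surrogates are reserved for UTF-16 and are not valid in UTF-32.
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                throw new IllegalArgumentException("surrogate code point at byte " + i);
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] ok = {0x00, 0x00, 0x00, '<', 0x00, 0x00, 0x00, 'a'};
        System.out.println(decodeBE(ok)); // <a
    }
}
```

A real XML reader would additionally apply the XML character rules (for
example, most control characters are not allowed), which this sketch leaves
out.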

> Gary

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Re: Any plans to support UTF-32 BOM?

Posted by Gary Gregory <ga...@gmail.com>.

Hi Michael,

I've not caught one in the savannah either! Nor have I had a customer request
for it; either that, or the request did not make it through our sales
engineers, professional services, or tech support all the way to me.

Our products are XML and buzzword compliant and I am checking my Ps and Qs.
So, at this point, the question is rather academic, as you mention.

I am aware of the inefficiencies involved, but our customers can decide how
efficient they want to be for themselves; sometimes they have no control
over the format of the documents they have to process with our software.
For those who can control the format, I do not know if someone has tried
UTF-32, watched it blow up, and then switched to something else.

Now, out of curiosity, I do notice an org.apache.xerces.impl.io.UCSReader
class in Xerces which is used from a couple of places.

Is that not hooked up in all the right spots?

Gary

On Mon, Aug 13, 2012 at 2:07 PM, Michael Glavassevich
<mr...@ca.ibm.com> wrote:

> Hi Gary,
>
> There haven't been any plans for UTF-32 support. It seems you're the first
> [1] (and only) one who has asked about it on the project lists.
>
> Is this just an academic question or do you have an actual need for it?
>
> I must say I've never seen a UTF-32 encoded document in the wild. In my
> opinion it's a very inefficient encoding. It always uses 32 bits to
> represent a character when the largest Unicode code point only requires
> 21 bits. UTF-8 and UTF-16 only ever use that much space for supplementary
> characters (i.e. code points greater than U+FFFF).
>
> Thanks.
>
> [1] http://xerces-j.markmail.org/search/?q=UTF-32
>
> Michael Glavassevich
> XML Technologies and WAS Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org

Re: Any plans to support UTF-32 BOM?

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Gary,

There haven't been any plans for UTF-32 support. It seems you're the first 
[1] (and only) one who has asked about it on the project lists.

Is this just an academic question or do you have an actual need for it?

I must say I've never seen a UTF-32 encoded document in the wild. In my
opinion it's a very inefficient encoding. It always uses 32 bits to
represent a character when the largest Unicode code point only requires
21 bits. UTF-8 and UTF-16 only ever use that much space for supplementary
characters (i.e. code points greater than U+FFFF).
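
The size argument can be checked with a few lines of Java. This is only an
illustrative sketch: the class EncodingSizes and the method sizes are
hypothetical names, and the UTF-32 size is computed as four bytes per code
point rather than by calling a UTF-32 encoder, so it does not depend on that
charset being available in the runtime.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {

    /** Returns the encoded size in bytes as {UTF-8, UTF-16, UTF-32}, without BOMs. */
    public static int[] sizes(String s) {
        return new int[] {
            s.getBytes(StandardCharsets.UTF_8).length,
            s.getBytes(StandardCharsets.UTF_16BE).length,
            // UTF-32 always spends 4 bytes on every code point.
            4 * s.codePointCount(0, s.length())
        };
    }

    public static void main(String[] args) {
        int[] ascii = sizes("<doc/>");
        System.out.println(ascii[0] + " " + ascii[1] + " " + ascii[2]); // 6 12 24

        // U+1F600 is a supplementary character (above U+FFFF).
        int[] supp = sizes(new String(Character.toChars(0x1F600)));
        System.out.println(supp[0] + " " + supp[1] + " " + supp[2]); // 4 4 4

        // Number of bits needed for the largest code point, U+10FFFF.
        System.out.println(32 - Integer.numberOfLeadingZeros(0x10FFFF)); // 21
    }
}
```

For markup-heavy ASCII content the UTF-32 form is four times the size of the
UTF-8 form, and the three encodings only converge for supplementary
characters.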

Thanks.

[1] http://xerces-j.markmail.org/search/?q=UTF-32

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Gary Gregory <ga...@gmail.com> wrote on 13/08/2012 01:49:46 PM:

> Hi All:
> 
> Any plans to support UTF-32 BOM?
> 
> Currently, if I parse a UTF-32 document, I get a "content not expected
> in prolog" error.
> 
> Thank you,
> Gary