Posted to dev@commons.apache.org by Gary Gregory <ga...@gmail.com> on 2012/08/10 19:44:27 UTC
[IO] BOMInputStream bug?
Hi All:
Does anyone have expertise with BOMInputStream?
I know that some XML parsers (like the one shipped with the Oracle JRE) do
not detect UTF-32 BOMs (UTF-8 and UTF-16 BOMs are OK) but using
BOMInputStream is supposed to fix the issue.
These tests I added and @Ignore'd fail:
- org.apache.commons.io.input.BOMInputStreamTest.testReadXmlWithBOMUtf32Be()
- org.apache.commons.io.input.BOMInputStreamTest.testReadXmlWithBOMUtf32Le()
More basic tests do work:
- org.apache.commons.io.input.BOMInputStreamTest.testReadWithBOMUtf32Be()
- org.apache.commons.io.input.BOMInputStreamTest.testReadWithBOMUtf32Le()
When I look at the Oracle JRE (which uses a copy of Xerces) I see code to
deal with UCS-4, which is a precursor to UTF-32, just as UCS-2 is a subset of
UTF-16, but as the tests show, Xerces fails to parse a UTF-32 document.
Any thoughts?
Thank you,
Gary
--
E-Mail: garydgregory@gmail.com | ggregory@apache.org
JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
Spring Batch in Action: <http://s.apache.org/HOq>http://bit.ly/bqpbCK
Blog: http://garygregory.wordpress.com
Home: http://garygregory.com/
Tweet! http://twitter.com/GaryGregory
Re: [IO] BOMInputStream bug?
Posted by Niall Pemberton <ni...@gmail.com>.
On Fri, Aug 10, 2012 at 11:58 PM, Gary Gregory <ga...@gmail.com> wrote:
> On Fri, Aug 10, 2012 at 4:27 PM, Niall Pemberton
> <ni...@gmail.com> wrote:
>
>> Hi Gary,
>>
>> I enabled the tests and ran them. I'm a bit confused about what the
>> issue is, because the lines that use BOMInputStream to *skip* the
>> UTF-32 BOM do not fail for me:
>>
>> parseXml(new BOMInputStream(createUtf32BeDataStream(data, true), ByteOrderMark.UTF_32BE));
>> parseXml(new BOMInputStream(createUtf32LeDataStream(data, true), ByteOrderMark.UTF_32LE));
>>
>> whereas the lines after those that do not use any Commons IO components
>> fail:
>>
>> parseXml(createUtf32BeDataStream(data, true));
>> parseXml(createUtf32LeDataStream(data, true));
>>
>> So this just means that the XML parser doesn't deal with a UTF-32 BOM.
>>
>> Really, though, BOMInputStream doesn't provide anything that
>> helps parse the XML properly. It has two purposes: 1) BOM detection
>> and 2) BOM removal/skipping.
>>
>> What we do have in Commons is XMLInputStream. This uses various
>> techniques to detect the encoding, including using BOMInputStream to try
>> BOM detection, and then uses that encoding with a Reader to process
>> the bytes properly.
>>
>
> OK, thank you Niall. My initial experiment with XmlStreamReader works; I'll
> continue in this direction at work.
Yes, sorry, I meant XmlStreamReader!
Niall
> Gary
>
>>
>> Niall
>>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org
Re: [IO] BOMInputStream bug?
Posted by Gary Gregory <ga...@gmail.com>.
On Fri, Aug 10, 2012 at 4:27 PM, Niall Pemberton
<ni...@gmail.com> wrote:
> Hi Gary,
>
> I enabled the tests and ran them. I'm a bit confused about what the
> issue is, because the lines that use BOMInputStream to *skip* the
> UTF-32 BOM do not fail for me:
>
> parseXml(new BOMInputStream(createUtf32BeDataStream(data, true), ByteOrderMark.UTF_32BE));
> parseXml(new BOMInputStream(createUtf32LeDataStream(data, true), ByteOrderMark.UTF_32LE));
>
> whereas the lines after those that do not use any Commons IO components
> fail:
>
> parseXml(createUtf32BeDataStream(data, true));
> parseXml(createUtf32LeDataStream(data, true));
>
> So this just means that the XML parser doesn't deal with a UTF-32 BOM.
>
> Really, though, BOMInputStream doesn't provide anything that
> helps parse the XML properly. It has two purposes: 1) BOM detection
> and 2) BOM removal/skipping.
>
> What we do have in Commons is XMLInputStream. This uses various
> techniques to detect the encoding, including using BOMInputStream to try
> BOM detection, and then uses that encoding with a Reader to process
> the bytes properly.
>
OK, thank you Niall. My initial experiment with XmlStreamReader works; I'll
continue in this direction at work.
Gary
>
> Niall
>
Re: [IO] BOMInputStream bug?
Posted by Gary Gregory <ga...@gmail.com>.
On Fri, Aug 10, 2012 at 4:27 PM, Niall Pemberton
<ni...@gmail.com> wrote:
> Hi Gary,
>
> I enabled the tests and ran them. I'm a bit confused about what the
> issue is, because the lines that use BOMInputStream to *skip* the
> UTF-32 BOM do not fail for me:
>
> parseXml(new BOMInputStream(createUtf32BeDataStream(data, true), ByteOrderMark.UTF_32BE));
> parseXml(new BOMInputStream(createUtf32LeDataStream(data, true), ByteOrderMark.UTF_32LE));
>
> whereas the lines after those that do not use any Commons IO components
> fail:
>
> parseXml(createUtf32BeDataStream(data, true));
> parseXml(createUtf32LeDataStream(data, true));
>
> So this just means that the XML parser doesn't deal with a UTF-32 BOM.
>
> Really, though, BOMInputStream doesn't provide anything that
> helps parse the XML properly. It has two purposes: 1) BOM detection
> and 2) BOM removal/skipping.
>
> What we do have in Commons is XMLInputStream. This uses various
> techniques to detect the encoding, including using BOMInputStream to try
> BOM detection, and then uses that encoding with a Reader to process
> the bytes properly.
>
Do you mean XmlStreamReader?
Gary
>
> Niall
>
Re: [IO] BOMInputStream bug?
Posted by Niall Pemberton <ni...@gmail.com>.
On Fri, Aug 10, 2012 at 6:44 PM, Gary Gregory <ga...@gmail.com> wrote:
> Hi All:
>
> Does anyone have expertise with BOMInputStream?
>
> I know that some XML parsers (like the one shipped with the Oracle JRE) do
> not detect UTF-32 BOMs (UTF-8 and UTF-16 BOMs are OK) but using
> BOMInputStream is supposed to fix the issue.
>
> These tests I added and @Ignore'd fail:
>
> - org.apache.commons.io.input.BOMInputStreamTest.testReadXmlWithBOMUtf32Be()
> - org.apache.commons.io.input.BOMInputStreamTest.testReadXmlWithBOMUtf32Le()
>
> More basic tests do work:
>
> - org.apache.commons.io.input.BOMInputStreamTest.testReadWithBOMUtf32Be()
> - org.apache.commons.io.input.BOMInputStreamTest.testReadWithBOMUtf32Le()
>
> When I look at the Oracle JRE (which uses a copy of Xerces) I see code to
> deal with UCS-4, which is a precursor to UTF-32, just as UCS-2 is a subset of
> UTF-16, but as the tests show, Xerces fails to parse a UTF-32 document.
>
> Any thoughts?
Hi Gary,
I enabled the tests and ran them. I'm a bit confused about what the
issue is, because the lines that use BOMInputStream to *skip* the
UTF-32 BOM do not fail for me:
parseXml(new BOMInputStream(createUtf32BeDataStream(data, true), ByteOrderMark.UTF_32BE));
parseXml(new BOMInputStream(createUtf32LeDataStream(data, true), ByteOrderMark.UTF_32LE));
whereas the lines after those that do not use any Commons IO components fail:
parseXml(createUtf32BeDataStream(data, true));
parseXml(createUtf32LeDataStream(data, true));
So this just means that the XML parser doesn't deal with a UTF-32 BOM.
Really, though, BOMInputStream doesn't provide anything that
helps parse the XML properly. It has two purposes: 1) BOM detection
and 2) BOM removal/skipping.
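As a rough illustration of those two purposes, here is a minimal stdlib-only Java sketch of BOM detection and BOM skipping on a byte array. It is not the Commons IO implementation and does not use BOMInputStream; the class and method names (BomSketch, detect, strip) are made up for this example.

```java
import java.util.Arrays;

public class BomSketch {
    // Known BOM byte patterns. The 4-byte UTF-32 marks come first so that
    // UTF-32LE (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
    private static final String[] NAMES = {"UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"};
    private static final int[][] BOMS = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFE, 0xFF},
        {0xFF, 0xFE},
    };

    /** Returns the index into BOMS of the mark the data starts with, or -1. */
    private static int match(byte[] data) {
        for (int i = 0; i < BOMS.length; i++) {
            int[] bom = BOMS[i];
            if (data.length < bom.length) {
                continue;
            }
            boolean same = true;
            for (int j = 0; j < bom.length; j++) {
                if ((data[j] & 0xFF) != bom[j]) {
                    same = false;
                    break;
                }
            }
            if (same) {
                return i;
            }
        }
        return -1;
    }

    /** Detection: the charset name implied by the BOM, or null if there is none. */
    public static String detect(byte[] data) {
        int i = match(data);
        return i < 0 ? null : NAMES[i];
    }

    /** Skipping: a copy of the data without its leading BOM (a plain copy if none). */
    public static byte[] strip(byte[] data) {
        int i = match(data);
        int skip = i < 0 ? 0 : BOMS[i].length;
        return Arrays.copyOfRange(data, skip, data.length);
    }

    public static void main(String[] args) {
        // UTF-32BE BOM followed by the letter 'A' encoded as UTF-32BE.
        byte[] utf32be = {0x00, 0x00, (byte) 0xFE, (byte) 0xFF, 0x00, 0x00, 0x00, 0x41};
        System.out.println(detect(utf32be));
        System.out.println(strip(utf32be).length);
    }
}
```

Note that neither step decodes anything: the bytes after the BOM are still UTF-32, so a parser that cannot read UTF-32 needs more than BOM removal.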
What we do have in Commons is XMLInputStream. This uses various
techniques to detect the encoding, including using BOMInputStream to try
BOM detection, and then uses that encoding with a Reader to process
the bytes properly.
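The approach described here (detect the encoding from the BOM, then decode through a Reader so the parser sees characters rather than bytes) can be sketched with the JDK alone. This is only an illustration of the idea, not the Commons IO class: Utf32XmlSketch and its methods are hypothetical names, the UTF-8 fallback is an assumption of the sketch, and it relies on the JDK providing the UTF-32BE and UTF-32LE charsets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Utf32XmlSketch {
    /** Returns "UTF-32BE" or "UTF-32LE" if the data starts with that BOM, else null. */
    public static String utf32Charset(byte[] d) {
        if (d.length < 4) {
            return null;
        }
        int b0 = d[0] & 0xFF, b1 = d[1] & 0xFF, b2 = d[2] & 0xFF, b3 = d[3] & 0xFF;
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) {
            return "UTF-32BE";
        }
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) {
            return "UTF-32LE";
        }
        return null;
    }

    /** Parses the XML and returns the text content of its root element. */
    public static String rootText(byte[] data) {
        String name = utf32Charset(data);
        int skip = (name == null) ? 0 : 4;
        // Assumption for this sketch only: anything without a UTF-32 BOM is UTF-8.
        Charset charset = (name == null) ? StandardCharsets.UTF_8 : Charset.forName(name);
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data, skip, data.length - skip), charset)) {
            // The parser receives characters, so it never has to guess the byte encoding.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Test-data helper: encodes xml in the named charset and prepends the given BOM bytes. */
    public static byte[] encodeWithBom(String xml, String charsetName, int... bom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int b : bom) {
            out.write(b);
        }
        byte[] body = xml.getBytes(Charset.forName(charsetName));
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] be = encodeWithBom("<root>hello</root>", "UTF-32BE", 0x00, 0x00, 0xFE, 0xFF);
        byte[] le = encodeWithBom("<root>hello</root>", "UTF-32LE", 0xFF, 0xFE, 0x00, 0x00);
        System.out.println(rootText(be));
        System.out.println(rootText(le));
    }
}
```

The sample documents deliberately omit an XML declaration, so the sketch says nothing about how a given parser treats an encoding pseudo-attribute when it is handed a character stream.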
Niall
Re: [IO] BOMInputStream bug?
Posted by Gary Gregory <ga...@gmail.com>.
The bottom line in all of this is that no Xerces version seems to
handle UTF-32, even though I see a UCSReader class in there.
Gary
Re: [IO] BOMInputStream bug?
Posted by sebb <se...@gmail.com>.
On 10 August 2012 18:44, Gary Gregory <ga...@gmail.com> wrote:
> Hi All:
>
> Does anyone have expertise with BOMInputStream?
>
> I know that some XML parsers (like the one shipped with the Oracle JRE) do
> not detect UTF-32 BOMs (UTF-8 and UTF-16 BOMs are OK) but using
> BOMInputStream is supposed to fix the issue.
>
> These tests I added and @Ignore'd fail:
>
> - org.apache.commons.io.input.BOMInputStreamTest.testReadXmlWithBOMUtf32Be()
> - org.apache.commons.io.input.BOMInputStreamTest.testReadXmlWithBOMUtf32Le()
>
> More basic tests do work:
>
> - org.apache.commons.io.input.BOMInputStreamTest.testReadWithBOMUtf32Be()
> - org.apache.commons.io.input.BOMInputStreamTest.testReadWithBOMUtf32Le()
>
> When I look at the Oracle JRE (which uses a copy of Xerces) I see code to
OT to this thread, but note that the Oracle version of Xerces was
forked from Apache Xerces a long time ago, and is very different from
the current Xerces code.
It's also in a different package name space: com.sun.org.apache.xerces.*
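One way to see which of the two a given program is actually using is to print the class name of the JAXP factory it gets. A small sketch, with ParserImplSketch and factoryClassName as made-up names; the result depends on the JDK and on what is on the classpath, so no particular output is claimed here.

```java
import javax.xml.parsers.DocumentBuilderFactory;

public class ParserImplSketch {
    /** Fully qualified class name of the JAXP factory implementation in use. */
    public static String factoryClassName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // A name under com.sun.org.apache.xerces suggests the JDK's bundled fork;
        // a name under org.apache.xerces suggests Apache Xerces on the classpath.
        System.out.println(factoryClassName());
    }
}
```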