Posted to c-dev@xerces.apache.org by Jim Cole <jc...@tecomac.com> on 2000/10/17 07:20:20 UTC

XMLReader bug?

Hi.

If the following has already come up on the list, my apologies. I poked 
around the archives a bit and didn't see any reference to the problem I
have encountered. Also, the problem affects every version I have so far
checked (1.2, 1.3, and the XML4C 3.3.0 package), so perhaps it has been
around for a while. Or maybe I am just being dense and missing something
important ;)

Anyway... The following relates to XMLReader.cpp and the use of UTF-8
encoding (at least). In xcodeMoreChars() a check is made to see whether
the raw buffer needs to be refreshed (line 1585). This condition is
tested by evaluating (fRawBufIndex == fRawBytesAvail). However, it would
seem that this test is insufficient. In particular, it is possible that
the transcoder has seen the beginning of a multi-byte character, but is
working with a buffer that does not contain the final byte(s). In this
case, fRawBufIndex is less than fRawBytesAvail, but the transcoder is
unable to obtain additional characters until the raw buffer is refreshed.
The result is that the reader falsely claims that it is at the end of
the file and an EndedWithTagsOnStack error is generated by the scanner.
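To make the failure mode concrete, here is a minimal sketch of the situation described above. It is not the Xerces code: utf8SeqLen, consumeWholeChars, and needsRefillOriginal are hypothetical stand-ins, and only the 1 to 4 byte UTF-8 forms are modelled. The point is only that a transcoder which consumes whole characters can stop short of the end of the raw buffer, so an equality test on the index never fires.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Length of the UTF-8 sequence introduced by lead byte b, or 0 if b is a
// continuation byte or not a valid lead byte (sketch: 1 to 4 byte forms only).
std::size_t utf8SeqLen(std::uint8_t b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

// Hypothetical stand-in for the transcoder: consumes only complete
// characters starting at index and returns the new index. It stops at an
// invalid lead byte or at a sequence whose final bytes are not in the buffer.
std::size_t consumeWholeChars(const std::vector<std::uint8_t>& raw,
                              std::size_t index) {
    while (index < raw.size()) {
        const std::size_t len = utf8SeqLen(raw[index]);
        if (len == 0 || index + len > raw.size()) break;
        index += len;
    }
    return index;
}

// The refresh test as quoted in the message: refill only when every raw
// byte has been consumed.
bool needsRefillOriginal(std::size_t rawBufIndex, std::size_t rawBytesAvail) {
    return rawBufIndex == rawBytesAvail;
}
```

With a raw buffer holding 'a' followed by the first two bytes of a three-byte character (0xE2 0x82), consumeWholeChars stops at index 1 while three bytes are available. The equality test then reports that no refill is needed, and a second call to consumeWholeChars makes no progress, which matches the false end-of-file described above.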

The above was encountered while using the SAXCount sample, so I am certain
that no problems are being introduced by my code ;) As a hack, I changed

if (fRawBufIndex == fRawBytesAvail)

to 

if (fRawBufIndex + 5 >= fRawBytesAvail)

This seemed to fix the problem (for UTF-8); however, I don't yet know the
code well enough to determine whether I might be introducing new and more
subtle problems.
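For comparison, the two conditions can be written as small predicates. This is only a sketch with hypothetical function names; the parameters mirror fRawBufIndex and fRawBytesAvail from the message. The message does not say why 5 was chosen. One plausible reading, and it is an assumption, is that the UTF-8 definition current in 2000 allowed sequences of up to 6 bytes, so an incomplete character could leave at most 5 bytes unconsumed.

```cpp
#include <cstddef>

// Original test: refill only when the raw buffer is fully consumed.
bool needsRefillOriginal(std::size_t rawBufIndex, std::size_t rawBytesAvail) {
    return rawBufIndex == rawBytesAvail;
}

// Hacked test from the message: refill as soon as 5 or fewer unconsumed
// bytes remain, which covers a partial multi-byte character at the end.
bool needsRefillHacked(std::size_t rawBufIndex, std::size_t rawBytesAvail) {
    return rawBufIndex + 5 >= rawBytesAvail;
}
```

With one byte consumed out of three available, the original test returns false and the hacked test returns true. The hacked test also fires whenever a few complete characters are left, so any refill routine behind it has to keep the unconsumed bytes rather than discard them.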

Is this really a bug? Is it a known bug? Is there a fix that is cleaner
than the hack above? Any insight would be appreciated.

Jim Cole


Re: XMLReader bug?

Posted by Dean Roddey <dr...@charmedquark.com>.
I thought that this kind of bug had been stomped, since it had come up
before and I think some patches were applied. But someone else also reported
something similar (in the 1.3 code) with large UTF-8 files (which go through
enough buffers to try out many permutations, I guess). So you are probably
correct.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"It takes two buttocks to make friction"
    - African Proverb


Re: XMLReader bug?

Posted by Andy Heninger <an...@jtcsv.com>.
You indeed found a real bug in XMLReader.cpp.

I've made what is essentially your fix, but increased the lower
bound on the number of characters left before trying to read more,
to 100 bytes. I didn't see how the bigger limit could hurt anything,
and it could make things slightly more efficient by avoiding
transcoding and processing buffers of just a few characters.
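A threshold-based refill of this kind might look roughly like the sketch below. This is not the committed change: RawBuffer, needsRefill, refill, kRefillThreshold, and the in-memory source are all hypothetical, and "fewer than 100 bytes left" is an assumed reading of the limit. What the sketch does show is the part that matters for correctness: unconsumed bytes are moved to the front of the buffer before new bytes are appended, so a split character is rejoined.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical threshold mirroring the 100-byte figure mentioned above.
constexpr std::size_t kRefillThreshold = 100;

struct RawBuffer {
    std::vector<std::uint8_t> bytes;  // raw, untranscoded bytes
    std::size_t index = 0;            // next byte to hand to the transcoder
};

// Assumed form of the test: refill once fewer than the threshold remain.
bool needsRefill(const RawBuffer& buf) {
    return buf.bytes.size() - buf.index < kRefillThreshold;
}

// Moves the unconsumed bytes to the front, then appends bytes from source
// (starting at sourcePos) until the buffer holds at most capacity bytes.
// Returns the number of new bytes read.
std::size_t refill(RawBuffer& buf, const std::vector<std::uint8_t>& source,
                   std::size_t& sourcePos, std::size_t capacity) {
    buf.bytes.erase(buf.bytes.begin(),
                    buf.bytes.begin() + static_cast<std::ptrdiff_t>(buf.index));
    buf.index = 0;
    const std::size_t room =
        capacity > buf.bytes.size() ? capacity - buf.bytes.size() : 0;
    const std::size_t n = std::min(room, source.size() - sourcePos);
    const auto first = source.begin() + static_cast<std::ptrdiff_t>(sourcePos);
    buf.bytes.insert(buf.bytes.end(), first,
                     first + static_cast<std::ptrdiff_t>(n));
    sourcePos += n;
    return n;
}
```

If the buffer ends with 0xE2 0x82 and the source supplies 0xAC next, the buffer after refill starts with the complete three-byte sequence and the transcoder can proceed.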

The change is in last night's build.

I think there's another bug lurking in this area, which would
occur if a file ends in the middle of a multi-byte character.
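One way to picture that remaining case is a check that runs once the source has no more bytes. The sketch below is an illustration only, with hypothetical names (EndState, classifyTail) and the same simplified 1 to 4 byte UTF-8 model; it is not taken from XMLReader.cpp. It separates a clean end of input from an input that stops partway through a character.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class EndState { CleanEnd, TruncatedChar, MoreData };

// Length of the UTF-8 sequence introduced by lead byte b, or 0 if b is not
// a valid lead byte (sketch: 1 to 4 byte forms only).
std::size_t utf8SeqLen(std::uint8_t b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

// Classifies what is left at position index in the raw buffer.
// sourceExhausted says whether the underlying stream can supply more bytes.
EndState classifyTail(const std::vector<std::uint8_t>& raw, std::size_t index,
                      bool sourceExhausted) {
    if (index == raw.size()) {
        return sourceExhausted ? EndState::CleanEnd : EndState::MoreData;
    }
    const std::size_t len = utf8SeqLen(raw[index]);
    // An invalid lead byte (len == 0) is left for the transcoder to report.
    const bool incomplete = len != 0 && index + len > raw.size();
    return (incomplete && sourceExhausted) ? EndState::TruncatedChar
                                           : EndState::MoreData;
}
```

A reader that sees TruncatedChar could report a malformed-input error instead of either waiting for bytes that will never arrive or reporting an ordinary end of file.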

Andy Heninger
IBM XML Technology Group, Cupertino, CA
heninger@us.ibm.com

