Posted to c-dev@xerces.apache.org by ML...@uk.ibm.com on 2001/02/26 17:22:28 UTC
Offsets reported by the scanner
Hi all,
I'm using the XMLScanner::getSrcOffset to track where I am in the XML
document that I'm parsing (using a memory input source). I've hit a few
problems caused by a BOM at the start of the input, and I'd like to hear
what the group has to say about it.
The first case is simple: We parse a message like the following (in
UTF-16, with a BOM on the front):
<?xml version="1.0"?><Element/>
In that case, the offsets reported by the scanner are 42 (in the XMLDecl
callback) and 62 (in the start element callback). They match the byte
lengths of the text alone (21 and 31 UTF-16 characters at 2 bytes each),
so the 2-byte BOM has silently been left out of the reporting.
That's OK - I just have to check for a BOM in my memory buffer, and if
there is one, I move my start point along to compensate.
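As a rough sketch of that compensation (the helper names utf16BomLength
and toBufferOffset are made up for illustration and are not part of
Xerces; it also assumes the buffer is already known to be UTF-16, since
FF FE is also how a UTF-32 little-endian BOM begins):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not a Xerces API: returns the length in bytes of a
// UTF-16 byte order mark at the start of the buffer, or 0 if there is none.
// Assumes the caller already knows the buffer holds UTF-16 data.
std::size_t utf16BomLength(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    if (size < 2)
        return 0;
    const bool littleEndian = (data[0] == 0xFF && data[1] == 0xFE);
    const bool bigEndian    = (data[0] == 0xFE && data[1] == 0xFF);
    return (littleEndian || bigEndian) ? 2 : 0;
}

// Hypothetical helper: maps an offset that excludes the BOM onto a
// position in the raw memory buffer by adding the BOM length back in.
std::size_t toBufferOffset(std::size_t scannerOffset,
                           const unsigned char* data, std::size_t size)
{
    return scannerOffset + utf16BomLength(data, size);
}
```

This only works if the scanner leaves the BOM out of every offset it
reports, which is the assumption the next case breaks.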
The problem comes when we try an even simpler test (again in UTF-16, with a
BOM):
<Element/>
This time the offset reported during the element callback is 22, i.e. the
2-byte BOM has been included in the calculation.
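To make the arithmetic explicit, here is a small sketch that recomputes
the three numbers. It is only a check on the byte counts, not scanner
code; the function and constant names are invented for this example, and
it assumes every character in the test documents takes one 2-byte UTF-16
code unit (true for the ASCII text used here).

```cpp
#include <cstddef>
#include <string>

// Sizes assumed for this example: ASCII-only text encoded as UTF-16.
constexpr std::size_t kUtf16UnitBytes = 2;  // bytes per code unit
constexpr std::size_t kUtf16BomBytes  = 2;  // bytes in a UTF-16 BOM

// Byte length of ASCII text once encoded as UTF-16, BOM not included.
std::size_t utf16ByteLength(const std::string& asciiText)
{
    return asciiText.size() * kUtf16UnitBytes;
}

// Offset at the end of the given text, optionally counting the BOM that
// precedes it in the buffer.
std::size_t endOffset(const std::string& asciiText, bool countBom)
{
    return utf16ByteLength(asciiText) + (countBom ? kUtf16BomBytes : 0);
}
```

With the BOM left out, the XML decl ends at 42 and the document with the
decl ends at 62, matching the first case. For <Element/> on its own the
text is only 20 bytes, so 22 can only come from counting the BOM as well.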
I think the inconsistency comes from the XMLReader::doInitDecode()
method. If that call finds a BOM, it skips over it. However, if it then
fails to find an XML decl, it backs out that skip.
I've put a hack into my Xerces DLL so that doInitDecode() always skips over
a BOM, even if it does not find an XML decl, which gets me to a consistent
state and allows me to continue with my development. However, I don't want
to alter the Xerces code any more than I have to!
Would someone who knows more about the design of the scanner & reader
please make a call as to which offsets should be reported, and then I'll
either help with the code, or use their changes.
Comments?
Matt Lovett