Posted to c-dev@xerces.apache.org by ML...@uk.ibm.com on 2001/02/26 17:22:28 UTC

Offsets reported by the scanner


Hi all,

I'm using the XMLScanner::getSrcOffset to track where I am in the XML
document that I'm parsing (using a memory input source).  I've hit a few
problems caused by a BOM at the start of the input, and I'd like to hear
what the group has to say about it.

The first case is simple:  We parse a message like the following (in
UTF-16, with a BOM on the front):

<?xml version="1.0"?><Element/>

In that case, the offsets reported by the scanner are 42 (in the XMLDecl
callback) and 62 (in the start element callback).  They match up with what
you might expect - but the BOM has silently been ignored in the reporting.
That's OK - I just have to check for a BOM in my memory buffer, and if
there is then I move my start point along to compensate.
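
A minimal sketch of that compensation, assuming the buffer is known to be
UTF-16 and the reported offsets are byte counts. The helper names
utf16BomLength and toBufferOffset are hypothetical and are not part of
Xerces:

```cpp
#include <cstddef>

// Hypothetical helper, not a Xerces API: returns the number of bytes taken
// by a UTF-16 byte order mark at the start of the buffer, or 0 if none.
// Assumes the input is UTF-16; a UTF-32LE BOM (FF FE 00 00) would also
// match the little-endian test and is not distinguished here.
std::size_t utf16BomLength(const unsigned char* buf, std::size_t len)
{
    if (len < 2)
        return 0;
    const bool bigEndian    = (buf[0] == 0xFE && buf[1] == 0xFF);
    const bool littleEndian = (buf[0] == 0xFF && buf[1] == 0xFE);
    return (bigEndian || littleEndian) ? 2 : 0;
}

// Converts an offset that was reported without the BOM into an index into
// the raw memory buffer, by moving the start point past the BOM if present.
std::size_t toBufferOffset(std::size_t reported,
                           const unsigned char* buf, std::size_t len)
{
    return reported + utf16BomLength(buf, len);
}
```

This only helps while the scanner leaves the BOM out of its count in every
case.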

The problem comes when we try an even simpler test (again in UTF-16, with a
BOM):

<Element/>

This time the offset reported during the element callback is 22, i.e. the
BOM has been included in the calculation.
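
The arithmetic behind those figures can be checked with a small sketch. It
assumes the offsets are byte counts up to the end of the construct just
scanned, with 2 bytes per UTF-16 code unit and a 2-byte BOM; the function
name endOffset is hypothetical:

```cpp
#include <cstddef>
#include <string>

// Size of a UTF-16 BOM in bytes.
const std::size_t kBomBytes = 2;

// Hypothetical model of the reported offset: the number of bytes consumed
// once the given text has been scanned, optionally counting the BOM.
// Every character in these examples is one UTF-16 code unit (2 bytes).
std::size_t endOffset(const std::u16string& consumed, bool countBom)
{
    return (countBom ? kBomBytes : 0) + consumed.size() * 2;
}
```

With the BOM left out, the XMLDecl (21 characters) ends at 42 and the
element that follows it (10 more characters) ends at 62. With the BOM
counted, the lone element ends at 2 + 20 = 22; a count consistent with the
first case would have been 20.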

I think the inconsistency comes from the XMLReader::doInitDecode() method.
If that call finds a BOM, it skips over it.  However, if it then fails to
find an XML decl, it backs out that skip.

I've put a hack into my Xerces DLL so that doInitDecode() always skips over
a BOM, even if it does not find an XML decl, which gets me to a consistent
state and allows me to continue with my development.  However, I don't want
to alter the Xerces code any more than I have to!

Would someone who knows more about the design of the scanner & reader
please make a call as to which offsets should be reported (with or without
the BOM), and then I'll either help with the code, or use their changes.

Comments?

Matt Lovett