Posted to j-users@xerces.apache.org by John Byrne <jo...@propylon.com> on 2008/04/15 17:34:17 UTC

character buffer

Hi,

I notice that when the SAX parser encounters a character reference 
(in the form &xyz;) it puts the decoded character into a separate 
buffer, and this buffer is the one referenced in the "characters" 
method. My problem is that I need to know the exact character offset in 
the document when a fragment of text is found, and this behavior makes 
that impossible.

For example, if the document contains "&#x201d;" at offset 103, then the 
"characters" method will tell me there is a double-quote character at 
position zero in the buffer, because it is in a separate buffer of its 
own. I need to know that it was found at position 103!

Any suggestions as to how I could do this?

Thanks in advance!
-John



---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org


Re: character buffer

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi John,

The offset and length parameters passed to the characters() method tell you
the range of characters in the array which are being reported in the event.
Nothing more.
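
To illustrate the point about offset and length, here is a minimal sketch
using only the JAXP/SAX classes bundled with the JDK. The class name
CharactersRangeDemo and its helper methods are made up for this example.
It copies exactly the reported range out of the array on each characters()
call. How the text is split across calls is up to the parser, so the
sketch relies only on the concatenation of the chunks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Xerces or SAX.
public class CharactersRangeDemo {

    /** Returns one string per characters() event, copying only ch[start .. start+length). */
    public static List<String> chunks(String xml) throws Exception {
        final List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // start and length index into ch, not into the source document.
                out.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    /** Joins all chunks; this is independent of how the parser splits the text. */
    public static String text(String xml) throws Exception {
        return String.join("", chunks(xml));
    }

    public static void main(String[] args) throws Exception {
        // The number and boundaries of the chunks printed here may vary by parser.
        for (String chunk : chunks("<a>He said &#x201d;hi&#x201d;</a>")) {
            System.out.println("chunk of length " + chunk.length());
        }
    }
}
```

The array passed to characters() may be reused or replaced between calls,
which is why the handler copies the range instead of keeping a reference.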

You cannot reliably use these offsets to determine the position in the
document. If you need to know location information of SAX events you should
use the Locator [1]. If you want the character offset rather than
line/column numbers you'll need to dive into the XNI [2] layer to get that
since it's not available through the SAX API.
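
As a rough sketch of the Locator approach, the handler below stores the
Locator passed to setDocumentLocator() and records the line and column at
each characters() event. The class name LocatorDemo and its members are
invented for this example, and it uses only the SAX classes shipped with
the JDK. Note that SAX describes the reported position as an approximation,
normally the position just after the text of the event. The XNI character
offset is not shown because it needs Xerces-specific classes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Xerces or SAX.
public class LocatorDemo {

    /** Records {line, column} as reported by the Locator at each characters() event. */
    static class PositionHandler extends DefaultHandler {
        private Locator locator;
        final List<int[]> positions = new ArrayList<>();

        @Override
        public void setDocumentLocator(Locator locator) {
            // Called by the parser before startDocument(), if it supplies a locator.
            this.locator = locator;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (locator != null) {
                // Line/column only; SAX offers no character offset here.
                positions.add(new int[] {
                    locator.getLineNumber(), locator.getColumnNumber()
                });
            }
        }
    }

    public static List<int[]> positions(String xml) throws Exception {
        PositionHandler handler = new PositionHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.positions;
    }

    public static void main(String[] args) throws Exception {
        for (int[] p : positions("<a>He said &#x201d;hi&#x201d;</a>")) {
            System.out.println("line " + p[0] + ", column " + p[1]);
        }
    }
}
```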

Thanks.

[1]
http://xerces.apache.org/xerces2-j/javadocs/api/org/xml/sax/Locator.html
[2]
http://xerces.apache.org/xerces2-j/javadocs/xni/org/apache/xerces/xni/XMLLocator.html#getCharacterOffset()

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org


