Posted to j-users@xerces.apache.org by Meh-Lit Kim <me...@yahoo.com> on 2003/10/03 18:41:16 UTC

QUESTION: char content chunking in ContentHandler.characters()

Hi,
 
Is there any guarantee that the org.xml.sax.ContentHandler.characters() callback
will not break a whitespace-separated 'word' into different chunks ?
 
e.g., given the following XML fragment :
 
    <NumberList>
     111 222 333
     444 555 666
    </NumberList>
 
The possible values for the string corresponding to the input param
'ch[start] ... ch[start+length-1]' in the callback method
 
      org.xml.sax.ContentHandler.characters( char[] ch,  int start, int length )
 
may be something like :
     "111 222 333"
     "444 555 666"
 
but will NEVER be something like :
      "111 22"
      "2 333"
      "444 5"
      "55 666"
 
Thanks,
/Meh



Re: QUESTION: char content chunking in ContentHandler.characters()

Posted by Andy Clark <an...@apache.org>.
Meh-Lit Kim wrote:
> Is there any guarantee that the org.xml.sax.ContentHandler.characters() callback
> will not break a whitespace-separated 'word' into different chunks ?

Michael is absolutely right.

In case you're interested, though, Xerces will split
the callbacks at the following:

   1) a newline -- because newline chars have to be
      normalized and it's more efficient to break up
      the callbacks than to copy buffers; and
   2) the end of an internal buffer

So, in general, you can never rely on all of the data
you need to be passed in a single characters callback.
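
If you want to see where a particular parser splits the text, a small
recording handler is enough. The following is only a sketch: the class
name ChunkRecorder and the record() helper are made up for illustration,
and it uses whatever JAXP parser is on the classpath. The number and
boundaries of the chunks depend on the parser and its version, so the
only thing you can rely on is that the chunks, joined in order, give
back the full text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records every chunk handed to characters().
class ChunkRecorder extends DefaultHandler {
    private final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the slice; the parser may reuse the ch[] buffer afterwards.
        chunks.add(new String(ch, start, length));
    }

    // Parses the given XML string and returns the chunks in arrival order.
    static List<String> record(String xml) throws Exception {
        ChunkRecorder recorder = new ChunkRecorder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        return recorder.chunks;
    }
}
```

Printing each entry of ChunkRecorder.record(...) for the NumberList
fragment shows how the parser in use happened to divide the text on
that run.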

-- 
Andy Clark * andyc@apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: QUESTION: char content chunking in ContentHandler.characters()

Posted by Michael Glavassevich <mr...@apache.org>.
Hi Meh-Lit,

No. SAX parsers are free to split character data [1] into as many
chunks as they please, and they can split the text at whatever boundaries
they want. To handle this properly, your handler needs to
accumulate the text passed in each call until you receive a callback
that isn't characters().
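
A minimal sketch of that accumulation, assuming the NumberList element
from the question holds only whitespace-separated integers and no child
elements. The class name NumberListHandler and the parse() helper are
made up for illustration; only the SAX and JAXP calls are real API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers text across characters() calls and
// only tokenizes it once the enclosing element has ended.
class NumberListHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<Integer> numbers = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        text.setLength(0); // a non-characters callback: start a fresh buffer
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called any number of times for one run of text.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Assumes a non-namespace-aware parser, so qName is the tag name.
        if ("NumberList".equals(qName)) {
            for (String token : text.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    numbers.add(Integer.valueOf(token));
                }
            }
        }
        text.setLength(0);
    }

    // Parses the XML string and returns the integers found in NumberList.
    static List<Integer> parse(String xml) throws Exception {
        NumberListHandler handler = new NumberListHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.numbers;
    }
}
```

Because the tokens are split only in endElement(), it does not matter
whether the parser delivered "111 22" and "2 333" or the whole line at
once.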

[1]
http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html#characters(char[],%20int,%20int)


On Fri, 3 Oct 2003, Meh-Lit Kim wrote:

> Hi,
>
> Is there any guarantee that the org.xml.sax.ContentHandler.characters() callback
> will not break a whitespace-separated 'word' into different chunks ?
>
> e.g., given the following XML fragment :
>
>     <NumberList>
>      111 222 333
>      444 555 666
>     </NumberList>
>
> The possible values for the string corresponding to the input param
> 'ch[start] ... ch[start+length-1]' in the callback method
>
>       org.xml.sax.ContentHandler.characters( char[] ch,  int start, int length )
>
> may be something like :
>      "111 222 333"
>      "444 555 666"
>
> but will NEVER be something like :
>       "111 22"
>       "2 333"
>       "444 5"
>       "55 666"
>
> Thanks,
> /Meh

-- 
--------------------
Michael Glavassevich
mrglavas@apache.org
