Posted to j-users@xerces.apache.org by Marco Testa <ma...@funambol.com> on 2009/04/08 10:01:41 UTC

parsing an xml document chunk by chunk

Hi,
I have to parse an XML document that is received in many chunks, and 
unfortunately I have to parse it chunk by chunk rather than at the end, 
once I've received all the pieces.
I was thinking of a SAX parser, since I have to push data to the parser 
as I receive it.
I was also thinking of writing the chunks to an OutputStream as I 
receive them, and piping that OutputStream to an InputStream that is 
passed to the parser.
But I think there is no way to let the parser read from the InputStream 
in the same thread.
So I would have to create a thread for every document being received, 
but since the program may receive many different documents at the same 
time, and the chunks may arrive with long delays, I would end up with 
many threads that are mostly idle while waiting for chunks.
Is there a way to bypass the piped input and output streams and call 
the parser directly on a single chunk when it is received?
Does a non-blocking parser exist that does not wait if the input stream 
is not ready?
In other words, is there a way to call parse in the same thread for just 
one piece of an XML document, and call it repeatedly until the document 
is completely received?
thank you very much,
marco



---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org


Re: parsing an xml document chunk by chunk

Posted by Marco Testa <ma...@funambol.com>.
Hi Jeff


Jeff Greif wrote:
> I think you have two problems here, a network handling problem 
Actually, the XML pieces are received from the network, but my module 
does not handle the sockets directly.
It is called many times by another module, and on each call it receives 
a byte array as a parameter, containing a piece of the XML document.
> and an
> XML parsing problem.  
yes, this one is the one that concerns me more
> The options you describe are by no means the
> only ones.  
nice to hear
> In particular, you don't need a thread per request;
> however, you do need an XML parser exclusively handling each single
> document if you are using Xerces.
>
> In the limit where your machine is heavily loaded processing chunks
> from many requests, it becomes less important whether you process the
> chunks when they're all accumulated or in more piecemeal fashion.  So
> if you're expecting heavy loads, you could do the easier thing.
>   
The problem with processing the document only once it is completely 
received is that the document may be very large, and I want to avoid 
holding it all in memory.
> To use fewer threads, you can receive the document chunks using one of
> the networking patterns like a Reactor, which handles many socket
> connections at once, backed by, for example, a message queue and
> associated parser for the chunks of each document.  The associated
> parser can be reading from a stream wrapped around the message queue.
> The chunks accumulate in the message queue until one of a few worker
> threads gets around to processing those that have accumulated since
> the last time the queue was accessed.  The worker threads handle all
> the parsing for however many documents are active at any one time, in
> round-robin or some priority-based fashion.
>   
but the parsing is performed only on complete documents, right?
> I believe that the Java nio classes are designed to provide the
> Reactor, or similar patterns.  You can Google for Reactor Pattern or
> look in the "Pattern-Oriented Software Architecture" volume 2.  One of
> the authors of that book is Douglas Schmidt, who has a large online
> collection of papers on subjects in this field.
>   
yes, very interesting book
> It's possible to put a message queue on the far side of the parser to
> hold the generated SAX events.  This might make it possible for the
> application processing the received documents to be less complicated.
> But the freedom to provide your own event-handlers to the parser might
> remove the need for such queues.  For example, if the XML docs were
> simple ones, suitable for incremental action before they were
> completely received, such as a linear sequence of instructions of some
> kind, the endElement event could have an application-specific handler
> that processed one instruction if the name of the element demarcated
> the completion of an instruction.
>   
Yes, this is basically my case: the document is relatively simple, with 
few elements, one of which is potentially very large and thus may be 
split into many pieces.
> Jeff
>   
thank you very much,
mt


-- 
Marco Testa

Funambol :: Open Source Mobile 'We' for the Mass Market :: http://www.funambol.com
Funambol :: Cross-Platform Mobile Open Source :: https://www.forge.funambol.org/





Re: parsing an xml document chunk by chunk

Posted by Jeff Greif <jg...@alumni.princeton.edu>.
I think you have two problems here, a network handling problem and an
XML parsing problem.  The options you describe are by no means the
only ones.  In particular, you don't need a thread per request;
however, you do need an XML parser exclusively handling each single
document if you are using Xerces.

In the limit where your machine is heavily loaded processing chunks
from many requests, it becomes less important whether you process the
chunks when they're all accumulated or in more piecemeal fashion.  So
if you're expecting heavy loads, you could do the easier thing.

To use fewer threads, you can receive the document chunks using one of
the networking patterns like a Reactor, which handles many socket
connections at once, backed by, for example, a message queue and
associated parser for the chunks of each document.  The associated
parser can be reading from a stream wrapped around the message queue.
The chunks accumulate in the message queue until one of a few worker
threads gets around to processing those that have accumulated since
the last time the queue was accessed.  The worker threads handle all
the parsing for however many documents are active at any one time, in
round-robin or some priority-based fashion.
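
A minimal sketch of the "stream wrapped around the message queue" idea, assuming one queue per document and byte[] chunks. The class name QueueInputStream and its methods offerChunk and end are hypothetical names used only for illustration; they are not part of Xerces or the JDK. Note that read() still blocks the calling thread while the queue is empty, so this reduces the bookkeeping but does not by itself make the parser non-blocking.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical class: an InputStream that drains byte[] chunks from a queue.
class QueueInputStream extends InputStream {
    // Sentinel marking end of document; compared by identity, never by content.
    private static final byte[] END = new byte[0];

    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    private byte[] current = new byte[0];
    private int pos = 0;
    private boolean finished = false;

    // Called by the receiving side for each chunk (empty chunks are ignored).
    void offerChunk(byte[] chunk) {
        if (chunk.length > 0) {
            queue.add(chunk.clone());
        }
    }

    // Called by the receiving side once the last chunk has been queued.
    void end() {
        queue.add(END);
    }

    // Makes sure 'current' has unread bytes; returns false at end of stream.
    private boolean fill() throws IOException {
        while (pos >= current.length) {
            if (finished) {
                return false;
            }
            try {
                byte[] next = queue.take(); // blocks while the queue is empty
                if (next == END) {
                    finished = true;
                    return false;
                }
                current = next;
                pos = 0;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for a chunk", e);
            }
        }
        return true;
    }

    @Override
    public int read() throws IOException {
        return fill() ? (current[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        int n = Math.min(len, current.length - pos);
        System.arraycopy(current, pos, b, off, n);
        pos += n;
        return n;
    }

    public static void main(String[] args) throws IOException {
        QueueInputStream in = new QueueInputStream();
        in.offerChunk("<doc>".getBytes(StandardCharsets.UTF_8));
        in.offerChunk("</doc>".getBytes(StandardCharsets.UTF_8));
        in.end();
        StringBuilder sb = new StringBuilder();
        for (int b = in.read(); b != -1; b = in.read()) {
            sb.append((char) b);
        }
        System.out.println(sb); // prints <doc></doc>
    }
}
```

An instance of this stream can be handed to SAXParser.parse(InputStream, DefaultHandler) in place of a socket or file stream; a worker thread then parses whatever chunks have accumulated.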

I believe that the Java nio classes are designed to provide the
Reactor, or similar patterns.  You can Google for Reactor Pattern or
look in the "Pattern-Oriented Software Architecture" volume 2.  One of
the authors of that book is Douglas Schmidt, who has a large online
collection of papers on subjects in this field.

It's possible to put a message queue on the far side of the parser to
hold the generated SAX events.  This might make it possible for the
application processing the received documents to be less complicated.
But the freedom to provide your own event-handlers to the parser might
remove the need for such queues.  For example, if the XML docs were
simple ones, suitable for incremental action before they were
completely received, such as a linear sequence of instructions of some
kind, the endElement event could have an application-specific handler
that processed one instruction if the name of the element demarcated
the completion of an instruction.
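
As a rough sketch of such a handler, assuming each instruction is wrapped in an element named "instruction" (the element name, the class InstructionHandler, and the "executed" list are all hypothetical, chosen only for illustration): the handler acts as soon as the closing tag arrives, without waiting for the rest of the document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: processes one instruction each time its element closes.
class InstructionHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Stand-in for real processing: records each instruction as it completes.
    final List<String> executed = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("instruction".equals(qName)) {
            text.setLength(0); // start collecting text for a new instruction
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than assign.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("instruction".equals(qName)) {
            executed.add(text.toString().trim()); // act on the finished instruction
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<script><instruction>open</instruction>"
                   + "<instruction>close</instruction></script>";
        InstructionHandler handler = new InstructionHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println(handler.executed); // prints [open, close]
    }
}
```

If a single element can be very large, buffering its text as above would defeat the purpose; in that case the work could probably be done inside characters() itself, on each piece of text as it is reported.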

Jeff




Re: parsing an xml document chunk by chunk

Posted by Marco Testa <ma...@funambol.com>.
Hi,

keshlam@us.ibm.com wrote:
> My solution would be to tell the parser to read from an in-memory 
> stream acting as a FIFO buffer, and run it in its own thread; then 
> push data into that stream from the communications thread as it 
> becomes available.
Yes, this was also my first idea and, if I don't find any other 
solution, I'll probably go for it.
I'm wondering if there is a way to invoke a parser on a piece of an XML 
document rather than the entire document, each time a new piece is 
available.
I think somewhere in the parser code there must be a loop containing a 
read on the InputStream that reads a char or a buffer of chars and does 
something with it. So in a way my aim is to invoke that procedure 
directly once I have a chunk of chars.
In other words, the read() should be non-blocking and the parser should 
keep its state between one call and the next.
>
> Of course the hard thing is going to be carrying this handshaking 
> through to the application consuming the data, if it isn't driven 
> completely by the SAX stream; you may need to design an in-memory 
> document model that can be built incrementally and knows how to wait 
> for more parsing if the requested subtree hasn't yet arrived. We did 
> something along those lines for incremental DTM construction in Xalan, 
> though there it was a matter of the model wanting to control the 
> parser rather than vice versa.
Interesting. Can you tell me a little more about how the parser should 
work? I'm not sure I've understood.
thank you very much
mt

>
> ______________________________________
> "... Three things see no end: A loop with exit code done wrong,
> A semaphore untested, And the change that comes along. ..."
>  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish 
> (http://www.ovff.org/pegasus/songs/threes-rev-11.html) 




Re: parsing an xml document chunk by chunk

Posted by ke...@us.ibm.com.
My solution would be to tell the parser to read from an in-memory stream 
acting as a FIFO buffer, and run it in its own thread; then push data into 
that stream from the communications thread as it becomes available. 
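
One way this might look, as a sketch only: the JDK's PipedOutputStream and PipedInputStream serve as the FIFO buffer, and a standard JAXP SAX parser runs in its own thread. The class PipedChunkParser and its push and finish methods are hypothetical names, and the handler here merely records element names as they close.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: one instance (and one parser thread) per incoming document.
class PipedChunkParser {
    private final PipedOutputStream sink = new PipedOutputStream();
    private final Thread parserThread;
    private final List<String> endedElements =
        Collections.synchronizedList(new ArrayList<>());
    private volatile Exception failure;

    PipedChunkParser() throws IOException {
        PipedInputStream source = new PipedInputStream(sink, 8192);
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void endElement(String uri, String localName, String qName) {
                endedElements.add(qName); // stand-in for real processing
            }
        };
        parserThread = new Thread(() -> {
            try {
                // Blocks inside read() whenever no chunk is available yet.
                SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
            } catch (Exception e) {
                failure = e;
                try {
                    source.close(); // make later push() calls fail fast
                } catch (IOException ignored) {
                    // nothing more to do
                }
            }
        });
        parserThread.start();
    }

    // Called from the communications thread for each received chunk.
    void push(byte[] chunk) throws IOException {
        sink.write(chunk);
        sink.flush(); // wakes the reader instead of leaving it to poll
    }

    // Signals end of document and waits for the parser to finish.
    List<String> finish() throws Exception {
        sink.close();
        parserThread.join();
        if (failure != null) {
            throw failure;
        }
        return endedElements;
    }

    public static void main(String[] args) throws Exception {
        PipedChunkParser parser = new PipedChunkParser();
        parser.push("<doc><a>1</a>".getBytes(StandardCharsets.UTF_8));
        parser.push("<b>2</b></doc>".getBytes(StandardCharsets.UTF_8));
        System.out.println(parser.finish()); // prints [a, b, doc]
    }
}
```

One caveat with the piped classes: PipedInputStream tracks the last thread that wrote to it and can report a broken pipe if that thread has exited, so this works most simply when the same long-lived thread pushes every chunk of a given document.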

Of course the hard thing is going to be carrying this handshaking through 
to the application consuming the data, if it isn't driven completely by 
the SAX stream; you may need to design an in-memory document model that 
can be built incrementally and knows how to wait for more parsing if the 
requested subtree hasn't yet arrived. We did something along those lines 
for incremental DTM construction in Xalan, though there it was a matter of 
the model wanting to control the parser rather than vice versa.

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
(http://www.ovff.org/pegasus/songs/threes-rev-11.html)