Posted to j-users@xerces.apache.org by Aleksander Slominski <as...@cs.indiana.edu> on 2006/02/21 18:35:46 UTC

Re: How to read multiple XML from socket: cannot change the protocol (Re: How to handle continuous stream of XML)

Polk, John R. wrote:

> Aleksander,
>  
> I found a post similar to a problem I am having with your name 
> associated with it.  I was wondering if you found a clean solution to 
> your issue.  I am writing a client application that is trying to parse 
> multiple XML responses over the same socket connection.  I have a 
> problem parsing the second response because it starts with 
> "<?xml...".  I was hoping to be able to cleanly reset the parser 
> between response messages, but I have not succeeded.  Do you have any 
> suggestions? 

hi John,

(i CCed j-users@xerces.apache.org as the same question was raised - [1] 
below)

it looks to me that the key thing to notice is that your input has two 
layers - first there are high-order markers (<?xml version="1.0" 
encoding="UTF-8"?>) that separate the actual XML documents - and this 
higher-level layer is *not* XML, so it should be processed without an XML 
parser (it is good that <?xml is reserved and cannot otherwise appear 
except inside a CDATA section or, strictly, a comment - to be 100% 
correct you would need to scan for <![CDATA[ too, as CDATA can contain 
anything: http://www.w3.org/TR/REC-xml/#sec-cdata-sect)

if you have control over the format (it seems you do not?) i would suggest 
using something similar to HTTP chunked encoding: write the size of 
every output chunk before writing it and mark one chunk as the last in the 
chain (size=-1) so you know when a document is finished, and so on - this 
would lead to much more efficient streaming (and allows you to send 
metadata about files if you put it beside the chunk sizes) as no string 
pattern matching is needed and there is no worry about CDATA, because the 
layering is completely transparent (using <?xml ...?> as a marker is not 
very good without XML-independent markers ...)

for the particular enveloping scheme you have (using <?xml ... as the 
marker): a simple solution i would use is to create a composite reader 
(or input stream) that is buffered. the first thing it does is set a 
mark (start internal buffering); then it scans its input for the next 
'<?xml version="1.0" encoding="UTF-8"?>\n<root' (as you know it is the 
marker for the 2nd doc), creates a new reader that only allows reading 
the input up to the beginning of the 2nd document or EOF (so it is the 
content of the 1st doc), and passes that reader to the xml parser. then 
the process is repeated: a new mark is set and the input is scanned for 
the next <?xml...?>, which is the beginning of the 3rd document (or EOF).

that is it - it should be fairly efficient, as the key to IO performance 
is to read data in chunks and avoid copying. here not much is done, but 
a more advanced version could work as a streaming pipeline and hook 
into MultiXmlCompositeReader.read(...) to scan for the end-of-document 
marker (or EOF) there, so that only chunks need to be buffered and not 
the whole document (possibly over multiple read()s, as one read() may 
return only part of the marker). then there is one level of buffering in 
MultiXmlCompositeReader and another in the xml parser, but that is hard 
to avoid.


in pseudo code:
    InputStream in = ...;  // e.g. the socket's input stream
    MultiXmlCompositeReader mr = new MultiXmlCompositeReader(
        new InputStreamReader(in, "UTF-8"));
    Reader r;
    while ((r = mr.nextDocumentReader()) != null) {
        xmlparser.setInput(r);
        xmlparser.parse() ...
    }

you should still add CDATA scanning to make it completely correct.
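A minimal sketch of what the MultiXmlCompositeReader in the pseudo code might look like. Only the class name and nextDocumentReader() come from the message; the fields and helper methods are hypothetical. It assumes every document starts with a declaration written as "<?xml " (with a space), it buffers one whole document at a time rather than streaming it, and it does not yet skip CDATA sections or comments.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

/** Hypothetical sketch: splits a character stream at each "<?xml " declaration. */
class MultiXmlCompositeReader {
    // Assumption: every document starts with an XML declaration written as "<?xml ".
    private static final String MARKER = "<?xml ";
    private final Reader in;
    private final StringBuilder pending = new StringBuilder(); // read but not yet handed out
    private final char[] buf = new char[4096];
    private boolean eof = false;

    MultiXmlCompositeReader(Reader in) {
        this.in = in;
    }

    /** Returns a reader over the next document, or null once the input is exhausted. */
    Reader nextDocumentReader() throws IOException {
        int from = 1; // index 0 holds this document's own marker
        while (true) {
            trimLeadingWhitespace();
            int next = pending.indexOf(MARKER, from);
            if (next >= 0) {
                return cut(next); // everything before the next declaration is one document
            }
            if (eof) {
                return pending.length() == 0 ? null : cut(pending.length());
            }
            // A marker may straddle two reads, so back up by its length before rescanning.
            from = Math.max(1, pending.length() - MARKER.length() + 1);
            int n = in.read(buf);
            if (n < 0) {
                eof = true;
            } else {
                pending.append(buf, 0, n);
            }
        }
    }

    private void trimLeadingWhitespace() {
        int i = 0;
        while (i < pending.length() && Character.isWhitespace(pending.charAt(i))) {
            i++;
        }
        pending.delete(0, i);
    }

    private Reader cut(int end) {
        String doc = pending.substring(0, end);
        pending.delete(0, end);
        return new StringReader(doc);
    }

    public static void main(String[] args) throws Exception {
        String decl = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
        Reader socketLike = new StringReader(decl + "\n<root/>\n" + decl + "\n<root2/>\n");
        MultiXmlCompositeReader mr = new MultiXmlCompositeReader(socketLike);
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader r;
        while ((r = mr.nextDocumentReader()) != null) {
            System.out.println(parser.parse(new InputSource(r)).getDocumentElement().getNodeName());
        }
    }
}
```

On a live socket this version returns a document only after the next declaration (or EOF) has arrived, so the last document of a burst stays pending until more input shows up.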

however, if you are concerned about correctness and want to handle 
everything in one xml parser stream, i would instead write 
MultiXmlCompositeReader to transform the stream as follows:

ORIG:
   <?xml version="1.0" encoding="UTF-8"?>
   <root/>
   <?xml version="1.0" encoding="UTF-8"?>
   <root2/>

TRANSFORMED
   <super-root>
   <root/>
   <root2/>
   </super-root>

i.e. add a wrapper XML element (<super-root>) and remove every <?xml...?> 
when you see '<?xml...?>\n<root'

this can also be done as a streaming reader/filter with careful coding 
(especially if the XML content is signed and you want to make sure that 
CDATA content containing <?xml ...?> is not modified ...)
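The ORIG to TRANSFORMED rewrite can be sketched as below. This is a simplified whole-string version, not the streaming filter described in the message: the class name SuperRootWrapper and the regular expression are assumptions, and it would also strip a "<?xml ...?>" that happens to sit inside a CDATA section or comment, which is exactly the case the message warns about.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: turns a sequence of XML documents into one wrapped document. */
class SuperRootWrapper {
    /** Removes every XML declaration and wraps the remainder in <super-root>. */
    static String wrap(String multiDoc) {
        // "<?xml" followed by whitespace, so <?xml-stylesheet ...?> is left alone.
        String body = multiDoc.replaceAll("<\\?xml\\s[^?]*\\?>", "");
        return "<super-root>" + body + "</super-root>";
    }

    public static void main(String[] args) throws Exception {
        String decl = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
        String orig = decl + "\n<root/>\n" + decl + "\n<root2/>\n";

        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = parser.parse(new InputSource(new StringReader(wrap(orig))));

        // Each element child of <super-root> is the root of one original document.
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                System.out.println(children.item(i).getNodeName());
            }
        }
    }
}
```

Documents that carry a DOCTYPE would need extra handling, since a DOCTYPE declaration is not allowed inside the wrapper element.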

HTH

alek

[1] Massimo Valla wrote:

> Hi Michael.
> Thank you for your reply. I definitely agree on your point. The 
> protocol is awful. But, unfortunately I cannot change the server side 
> nor the protocol. I could assume that each document ends when the root 
> tag is closed. So your example could be parsed and received as two 
> documents:
>  
> 1st doc:
>    <?xml version="1.0" encoding="UTF-8"?>
>    <root/>
> 2nd doc
>    <?xml version="1.0" encoding="UTF-8"?>
>    <root2/>
> leaving out the comment as not belonging to either of the two docs.
>  
> The problem is that with Xerces as soon as I receive the first end tag 
> SAX notification, the parser has already buffered part of the other 
> XML message, so starting another parse command on the inputstream will 
> not work.
>  
> How can I set up a solution similar to FAQ-11 (of Xerces1) in Xerces2?
> More generally, how can I write a client with Xerces that is able to 
> parse multiple XML documents coming from the socket?
>  
> (I have also tried other parsers: they allow char-by-char parsing and 
> they would not close the inputstream after a parse error, so I would 
> be fine using them. But I would very much prefer to stay with Xerces 
> as it is the parser used in Java 1.5...)
>  
> Thanks a lot,
> Massimo
>  
> On 2/12/06, Michael Glavassevich <mrglavas@ca.ibm.com 
> <ma...@ca.ibm.com>> wrote:
> > Hi Massimo,
> >
> > The KeepSocketOpen sample works because the server socket tells the
> > client how many bytes there are in the document. If the server has no
> > protocol for communicating the boundaries between XML documents, how
> > can you tell where one begins and another ends?
> >
> > Consider if your client receives this from the socket:
> >
> > <root/>
> > <!-- comment -->
> > <root2/>
> >
> > How would you know whether you've received two documents or one not
> > well-formed document containing multiple root elements? And if this is
> > processed as two documents does the comment belong to the first or the
> > second? Only the sender could know that.
> >
> > Thanks.
> >
> > Michael Glavassevich
> > XML Parser Development
> > IBM Toronto Lab
> > E-mail: mrglavas@ca.ibm.com <ma...@ca.ibm.com>
> > E-mail: mrglavas@apache.org <ma...@apache.org>
> >
> > Massimo Valla < massimo.valla@gmail.com 
> <ma...@gmail.com>> wrote on 02/04/2006 09:54:25 PM:
> >
> > > Hi,
> > > I am trying to read multiple XML files from a socket using JAXP 1.3
> > > / Xerces-J 2.7.1.
> > >
> > > Unfortunately the KeepSocketOpen example in Xerces2 Socket Sample (
> > > http://xerces.apache.org/xerces2-j/samples-socket.html) does not
> > > work for me, because I have no control over the other side of the
> > > socket.
> > >
> > > Also FAQ-11 of Xerces1 (
> > > http://xerces.apache.org/xerces-j/faq-write.html#faq-11) does not
> > > help anymore, because the StreamingCharFactory class used there to
> > > prevent buffering cannot be used in Xerces2 (cannot compile the class).
> > >
> > > I have been trying to find a solution to this for a while now, but I
> > > could not come to an end.
> > >
> > > Can anybody provide a simple example on how to read multiple XML
> > > docs from a socket InputStream?
> > >
> > > Thanks a lot,
> > > Massimo
> >
>  



-- 
The best way to predict the future is to invent it - Alan Kay





---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org


RE: How to read multiple XML from socket: cannot change the protocol (Re: How to handle continuous stream of XML)

Posted by Don Bate <do...@iadfw.net>.
Sorry if I'm entering the discussion late, but we solved this problem 
by extending FilterInputStream and FilterOutputStream. We use a 
special character (ASCII EOM for our stuff) to mark the end of an XML 
document (and precede it with an ESC character if it occurs in the 
document, and of course the ESC is also ESCed). We then override the 
read and write methods to check/insert the characters in the streams. 
We also override the close methods so that they don't actually close 
the underlying streams. We then pass this stream to the parser.

I've attached the java code for these classes.

The transmitting side that uses this class looks something like this:

Socket s = ...;
OutputStream out = s.getOutputStream ();
BufferedOutputStream bos = new BufferedOutputStream (out);
while (...more...) {
     XMLMessage.XMLOutputStream xos = new XMLMessage.XMLOutputStream (bos);
     writeDocument (xos);
     // assuming that writeDocument closes the xos when it's done.
}

And, the reading/parsing side that uses this class looks something like this:

Socket s = ...;
InputStream in = s.getInputStream ();
BufferedInputStream bis = new BufferedInputStream (in);
while (...more...) {
     XMLMessage.XMLInputStream xis = new XMLMessage.XMLInputStream (bis);
     ... possibly wrap xis in an InputStreamReader and hand it off to the parser
}

Of course you'll want to catch the appropriate exceptions, etc.

We've found that this technique is efficient, handles ill-formed xml 
documents correctly as they pertain to subsequent documents, and is 
easy to use.
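The attached classes are not reproduced in this text, so the following is only a guess at their shape, not Don's actual code. The class names follow the usage shown above; everything else is an assumption, in particular the byte values (0x19 for the end-of-message marker, 0x1B for ESC) and the choice to let close() on the input side skip to the end of the current message.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical reconstruction; not the attachment from the original post. */
class XMLMessage {
    static final int EOM = 0x19; // assumed end-of-message byte
    static final int ESC = 0x1B; // ASCII ESC, marks the next byte as literal

    static class XMLOutputStream extends FilterOutputStream {
        private boolean closed = false;
        XMLOutputStream(OutputStream out) { super(out); }

        // FilterOutputStream's bulk write() calls this method once per byte.
        @Override public void write(int b) throws IOException {
            int v = b & 0xFF;
            if (v == EOM || v == ESC) out.write(ESC);
            out.write(v);
        }

        /** Ends the message but leaves the underlying stream open. */
        @Override public void close() throws IOException {
            if (closed) return;
            closed = true;
            out.write(EOM);
            out.flush();
        }
    }

    static class XMLInputStream extends FilterInputStream {
        private boolean done = false;
        XMLInputStream(InputStream in) { super(in); }

        @Override public int read() throws IOException {
            if (done) return -1;
            int b = in.read();
            if (b == ESC) return in.read(); // escaped byte is passed through unchanged
            if (b == EOM || b < 0) { done = true; return -1; }
            return b;
        }

        @Override public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            int first = read();
            if (first < 0) return -1;
            buf[off] = (byte) first;
            int n = 1;
            // Continue only while bytes are already available, to avoid stalling on a socket.
            while (n < len && in.available() > 0) {
                int b = read();
                if (b < 0) break;
                buf[off + n++] = (byte) b;
            }
            return n;
        }

        @Override public boolean markSupported() { return false; }

        /** Skips the rest of the current message; the underlying stream stays open. */
        @Override public void close() throws IOException {
            while (read() >= 0) { /* drain up to the end-of-message byte */ }
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        for (String doc : new String[] {"<a/>", "<b/>"}) {
            try (XMLOutputStream xos = new XMLOutputStream(wire)) {
                xos.write(doc.getBytes(StandardCharsets.UTF_8));
            }
        }
        InputStream bis = new ByteArrayInputStream(wire.toByteArray());
        for (int i = 0; i < 2; i++) {
            XMLInputStream xis = new XMLInputStream(bis);
            ByteArrayOutputStream doc = new ByteArrayOutputStream();
            for (int b; (b = xis.read()) >= 0; ) doc.write(b);
            System.out.println(new String(doc.toByteArray(), StandardCharsets.UTF_8));
        }
    }
}
```

Because the input wrapper reports end-of-stream at the marker, a parser reading through it cannot consume bytes that belong to the next message, and draining in close() lets a failed parse resynchronise at the next message boundary.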

Hope this helps,
Don Bate

At 10:20 AM +0000 2/23/06, Mike Skells wrote:
>Hi,
>I believe that you can have PI, comments, whitespace etc after the root
>element, is that significant for you ?
>
>----
>I have the same problem in one of our applications. We looked at the format
>that HTTP uses and 'borrowed' the ideas from there. Due to the volume of
>data that we handle, we could not afford the overhead of scanning each
>byte/char for a specific marker.
>
>What we do is insert a Content-Length:<xx><cr> marker that gives the
>length of the content (which in our case is usually a document) in bytes,
>after encoding.
>
>This has a drawback in that the document needs to be prepared before being
>sent, and therefore buffered, but for this application that is not an issue:
>the documents that we handle are very small (a few hundred bytes),
>but we have to manage 100-500 per second.
>
>You could insert a marker to indicate a continuation, if your documents are
>large, and you cannot afford the buffering
>
>E.g.
>More:4000
><4000 bytes>
>More:4000
><4000 bytes>
>More:4000
><4000 bytes>
>Complete:407
><407 bytes that make up the rest of the document>
>
>It works well for us. May not suit you
>
>--------------
>
>If you truly cannot change the protocol, and the documents that are sent
>have a <?xml ... header, then you could use that as the marker, but this does
>mean that you would not know that a document is complete until the next
>document is started, so you would always be one behind.
>
>Just my 2c
>
>Mike
>
>
>>  -----Original Message-----
>>  From: Joseph Kesselman [mailto:keshlam@us.ibm.com]
>>  Sent: 21 February 2006 18:21
>>  To: j-users@xerces.apache.org
>>  Cc: j-users@xerces.apache.org; Polk, John R.
>>  Subject: Re: How to read multiple XML from socket: cannot
>>  change the protocol (Re: How to handle continuous stream of XML)
>>
>>  Note too that a well-formed XML document can only have one
>>  top-level element -- everything after that is normally
>>  discarded -- so that too could be used as a clue for dividing
>>  a multiple-document stream.
>>
>>  Or you could invent some new marker between documents, and
>>  have your input-stream filter use that to break up the docs.
>>
>>  Or you could just pack all the XML files into a zipfile, send
>>  that, and have your receiving tool unpack that into separate
>>  files. This would have the advantage of not having to
>>  (slightly) break people's expectations about whether what
>>  they're getting back from the server is one document or
>>  several... and might actually improve performance, especially
>>  on larger documents; XML compresses wonderfully.
>>
>>  Whichever approach you use, note that this isn't really an
>>  XML problem; it's a stream management problem. The XML parser
>>  expects to see a stream that presents only a single XML
>>  document, so breaking up the stream into multiple docs has to
>>  happen before it reaches the parser.
>>
>>
>>  "Ooof! There's a wasp in the room!"
>>  "Get out! Quick! Before it gets to the tiger...!" -- Monty
>>  Python, _Matching_Tie_And_Handkerchief_
>>
>>  ______________________________________
>>  Joe Kesselman -- Beware of Blueshift!
>>  "The world changed profoundly and unpredictably the day Tim
>>  Berners Lee got bitten by a radioactive spider." -- Rafe
>>  Culpin, in r.m.filk
>>


-- 
Don Bate               | Specializing in Consulting and Mentoring in
Bate Consulting, Inc   | Object-Oriented Technologies,
                        | Software Architecture, and Software Process
(972) 618-0208 voice
(972) 618-0216 fax
donbate@iadfw.net

Re: How to read multiple XML from socket: cannot change the protocol (Re: How to handle continuous stream of XML)

Posted by Aleksander Slominski <as...@cs.indiana.edu>.
Mike Skells wrote:

>I believe that you can have PI, comments, whitespace etc after the root
>element, is that significant for you ?
>
>----
>I have the same problem in one of our applications. We looked at the format
>that HTTP uses and 'borrowed' the ideas from there. Due to the volume of
>data that we handle, we could not afford the overhead of scanning each
>byte/char for a specific marker.
>
>What we do is insert a Content-Length:<xx><cr> marker that gives the
>length of the content (which in our case is usually a document) in bytes,
>after encoding.
>
>This has a drawback in that the document needs to be prepared before being
>sent, and therefore buffered, but for this application that is not an issue:
>the documents that we handle are very small (a few hundred bytes),
>but we have to manage 100-500 per second.
>
>You could insert a marker to indicate a continuation, if your documents are
>large, and you cannot afford the buffering
>
>E.g.
>More:4000
><4000 bytes>
>More:4000
><4000 bytes>
>More:4000
><4000 bytes>
>Complete:407
><407 bytes that make up the rest of the document>
>
>It works well for us. May not suit you
>  
>
hi,

you could also use HTTP 1.1 chunked encoding for this purpose, as it 
does exactly what you described and _more_: it allows headers and 
trailing headers (good for metadata)

>If you truly cannot change the protocol, and the documents that are sent
>have a <?xml ... header, then you could use that as the marker, but this does
>mean that you would not know that a document is complete until the next
>document is started, so you would always be one behind.
>  
>
AFAICS that should not be a problem (?) as you never know that a document 
is complete until you read its last byte from the input, in *whatever* 
format the input is encoded ... so streaming can still be done 
and there is no need to buffer the whole input even when <?xml... markers 
are used.

best,

alek



-- 
The best way to predict the future is to invent it - Alan Kay




RE: How to read multiple XML from socket: cannot change the protocol (Re: How to handle continuous stream of XML)

Posted by Joseph Kesselman <ke...@us.ibm.com>.
>I believe that you can have PI, comments, whitespace etc after the root
>element, is that significant for you ?

They can exist in the file. They aren't supposed to be significant to the
parser. Obviously, if present, they're a problem for dividing up a stream
into multiple documents, which brings us back to the suggestion of putting
an explicit mark between the docs and having your front-end stream
filtering recognize and deal with it.

______________________________________
Joe Kesselman -- Beware of Blueshift!
"The world changed profoundly and unpredictably the day Tim Berners Lee
got bitten by a radioactive spider." -- Rafe Culpin, in r.m.filk




RE: How to read multiple XML from socket: cannot change the protocol (Re: How to handle continuous stream of XML)

Posted by Mike Skells <mi...@validsoft.com>.
Hi,
I believe that you can have PI, comments, whitespace etc after the root
element, is that significant for you ?

----
I have the same problem in one of our applications. We looked at the format
that HTTP uses and 'borrowed' the ideas from there. Due to the volume of
data that we handle, we could not afford the overhead of scanning each
byte/char for a specific marker.

What we do is insert a Content-Length:<xx><cr> marker that gives the
length of the content (which in our case is usually a document) in bytes,
after encoding.

This has a drawback in that the document needs to be prepared before being
sent, and therefore buffered, but for this application that is not an issue:
the documents that we handle are very small (a few hundred bytes),
but we have to manage 100-500 per second.

You could insert a marker to indicate a continuation, if your documents are
large, and you cannot afford the buffering

E.g.
More:4000
<4000 bytes>
More:4000
<4000 bytes>
More:4000
<4000 bytes>
Complete:407
<407 bytes that make up the rest of the document>

It works well for us. May not suit you
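A minimal sketch of the More/Complete framing above, for illustration only. The class and method names are hypothetical, and the header terminator is assumed to be a single '\n' (the Content-Length variant above uses <cr>). The reader collects one whole document in memory; a streaming reader would hand each chunk on as it arrives.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of length-prefixed framing with "More:" and "Complete:" headers. */
class ChunkedFraming {
    /** Writes doc as chunks of at most chunkSize (> 0) bytes; the last one is "Complete". */
    static void writeDocument(OutputStream out, byte[] doc, int chunkSize) throws IOException {
        int off = 0;
        do {
            int n = Math.min(chunkSize, doc.length - off);
            boolean last = off + n == doc.length;
            String header = (last ? "Complete:" : "More:") + n + "\n";
            out.write(header.getBytes(StandardCharsets.US_ASCII));
            out.write(doc, off, n);
            off += n;
        } while (off < doc.length);
        out.flush();
    }

    /** Reads one framed document, or returns null at a clean end of stream. */
    static byte[] readDocument(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in); // unbuffered; used only for readFully
        ByteArrayOutputStream doc = new ByteArrayOutputStream();
        boolean started = false;
        while (true) {
            String header = readLine(in);
            if (header == null) {
                if (!started) return null;
                throw new EOFException("stream ended inside a document");
            }
            started = true;
            int colon = header.indexOf(':');
            String kind = colon < 0 ? "" : header.substring(0, colon);
            if (!kind.equals("More") && !kind.equals("Complete")) {
                throw new IOException("unexpected header: " + header);
            }
            byte[] chunk = new byte[Integer.parseInt(header.substring(colon + 1).trim())];
            data.readFully(chunk); // exactly the announced number of bytes, no scanning
            doc.write(chunk, 0, chunk.length);
            if (kind.equals("Complete")) return doc.toByteArray();
        }
    }

    /** Reads bytes up to '\n'; returns null if the stream ends before any byte is read. */
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) >= 0) {
            if (b == '\n') return sb.toString();
            sb.append((char) b);
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        writeDocument(wire, "<root/>".getBytes(StandardCharsets.UTF_8), 4000);
        writeDocument(wire, "<root2/>".getBytes(StandardCharsets.UTF_8), 4000);
        InputStream in = new ByteArrayInputStream(wire.toByteArray());
        byte[] doc;
        while ((doc = readDocument(in)) != null) {
            System.out.println(new String(doc, StandardCharsets.UTF_8));
        }
    }
}
```

Each returned byte array can be wrapped in a ByteArrayInputStream and handed to the parser, so the parser never sees bytes that belong to the next document.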

--------------

If you truly cannot change the protocol, and the documents that are sent
have a <?xml ... header, then you could use that as the marker, but this does
mean that you would not know that a document is complete until the next
document is started, so you would always be one behind.

Just my 2c

Mike







Re: How to read multiple XML from socket: cannot change the protocol (Re: How to handle continuous stream of XML)

Posted by Joseph Kesselman <ke...@us.ibm.com>.
Note too that a well-formed XML document can only have one top-level
element -- everything after that is normally discarded -- so that too could
be used as a clue for dividing a multiple-document stream.

Or you could invent some new marker between documents, and have your
input-stream filter use that to break up the docs.

Or you could just pack all the XML files into a zipfile, send that, and
have your receiving tool unpack that into separate files. This would have
the advantage of not having to (slightly) break people's expectations about
whether what they're getting back from the server is one document or
several... and might actually improve performance, especially on larger
documents; XML compresses wonderfully.
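The zipfile idea can be sketched with the standard java.util.zip streams. This is an illustration only: the class name ZipBundle and the entry names are made up, and it applies only when the sending side can be changed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch: ships several XML documents as entries of one zip stream. */
class ZipBundle {
    /** Sender side: one zip entry per document. */
    static byte[] pack(List<String> docs) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            int i = 0;
            for (String doc : docs) {
                zip.putNextEntry(new ZipEntry("doc" + (i++) + ".xml")); // made-up entry names
                zip.write(doc.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Receiver side: reads each entry fully, so every document can be parsed on its own. */
    static List<String> unpack(InputStream in) throws IOException {
        List<String> docs = new ArrayList<>();
        ZipInputStream zip = new ZipInputStream(in);
        byte[] buf = new byte[4096];
        while (zip.getNextEntry() != null) {
            ByteArrayOutputStream one = new ByteArrayOutputStream();
            int n;
            while ((n = zip.read(buf)) != -1) { // -1 marks the end of the current entry
                one.write(buf, 0, n);
            }
            docs.add(new String(one.toByteArray(), StandardCharsets.UTF_8));
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = pack(Arrays.asList("<root/>", "<root2/>"));
        for (String doc : unpack(new ByteArrayInputStream(wire))) {
            System.out.println(doc);
        }
    }
}
```

The entry boundaries play the role of the document separator, so the XML parser only ever receives one document at a time.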

Whichever approach you use, note that this isn't really an XML problem;
it's a stream management problem. The XML parser expects to see a stream
that presents only a single XML document, so breaking up the stream into
multiple docs has to happen before it reaches the parser.


"Ooof! There's a wasp in the room!"
"Get out! Quick! Before it gets to the tiger...!" -- Monty Python,
_Matching_Tie_And_Handkerchief_

______________________________________
Joe Kesselman -- Beware of Blueshift!
"The world changed profoundly and unpredictably the day Tim Berners Lee
got bitten by a radioactive spider." -- Rafe Culpin, in r.m.filk

