Posted to c-dev@xerces.apache.org by Matt Nemenman <ma...@inktomi.com> on 2003/10/02 02:19:11 UTC

Parsing a stream of XML documents

Hello Everyone!

I am trying to write an application that has to parse a sequence of XML
documents (thousands of them) from a file or stream. Every document in
the sequence should be well-formed XML, but they are not necessarily
all in the same encoding. The stream will look something like this:

:BEGIN EXAMPLE: 

<?xml version="1.0" encoding="utf-8"?>
<document id="1">
  ... content ...
</document>

<?xml version="1.0" encoding="iso-8859-1"?>
<document id="2">
  ... content ...
</document>

...

:END EXAMPLE:

The problem is that if there is a well-formedness error in any of the
documents, I don't want to discard the whole stream, since there may be
thousands of good, well-formed XML documents in it. I want to discard
just the one document, then recover and continue parsing with the next
one.

Does anyone have any suggestions on how to do it "the right way"?

I was thinking of deriving my own InputSource class, that will be
similar to LocalFileInputSource, but will keep reusing the same
BinFileInputStream object for every makeStream() call. Then supply this
InputSource to SAX2XMLReader::parse(), reset SAX2XMLReader after the doc
is complete, and call parse() again and again ...

This should work fine (I haven't tried it yet, though) if all documents
in the stream are well-formed. If not, the parser will die halfway
through the document. At that point I will have to recover by searching
for the closing </document> tag, so that I can start parsing the next
document right after it. But in order to do that I need to know what
encoding the malformed document was in. Is there any way to get access
to that info?
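
A rough sketch of one possible recovery strategy, using only the C++
standard library and no Xerces calls. Instead of asking the parser for
the encoding, it reads the encoding name straight from the XML
declaration in the raw bytes, and it resynchronises on the next
"<?xml" declaration rather than on </document>. The function names
sniffDeclaredEncoding and findNextDeclaration are made up for this
illustration. It assumes that every document starts with an XML
declaration, as in the example stream, and that all encodings in use
are ASCII-compatible (utf-8, iso-8859-1 and the like); it would not
work for UTF-16, and a literal "<?xml " inside a comment or CDATA
section would fool it.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Offset of the next "<?xml" followed by whitespace at or after `from`,
// or std::string::npos. The whitespace check skips processing
// instructions such as "<?xml-stylesheet".
std::size_t findNextDeclaration(const std::string& bytes, std::size_t from)
{
    const std::string decl = "<?xml";
    for (std::size_t pos = bytes.find(decl, from);
         pos != std::string::npos;
         pos = bytes.find(decl, pos + 1)) {
        const std::size_t after = pos + decl.size();
        if (after < bytes.size() &&
            std::isspace(static_cast<unsigned char>(bytes[after])))
            return pos;
    }
    return std::string::npos;
}

// Value of the encoding pseudo-attribute of an XML declaration located
// at the very start of `bytes`, or "" if none is declared.
// Assumption: the declaration is written in ASCII-compatible bytes.
std::string sniffDeclaredEncoding(const std::string& bytes)
{
    if (findNextDeclaration(bytes, 0) != 0)
        return "";
    const std::size_t end = bytes.find("?>");
    if (end == std::string::npos)
        return "";
    const std::string keyword = "encoding";
    const std::size_t key = bytes.find(keyword, 5);
    if (key == std::string::npos || key > end)
        return "";
    // Skip the '=' and any whitespace around it.
    std::size_t pos = key + keyword.size();
    while (pos < end &&
           (bytes[pos] == '=' ||
            std::isspace(static_cast<unsigned char>(bytes[pos]))))
        ++pos;
    if (pos >= end || (bytes[pos] != '"' && bytes[pos] != '\''))
        return "";
    const std::size_t close = bytes.find(bytes[pos], pos + 1);
    if (close == std::string::npos || close > end)
        return "";
    return bytes.substr(pos + 1, close - pos - 1);
}
```

After a failed parse, calling findNextDeclaration with an offset just
past the start of the bad document gives a candidate restart point, and
sniffDeclaredEncoding on the bytes from there reports what the next
document claims to be encoded in.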

I can see other problems with this approach too (e.g. what if the
well-formedness error comes even before the opening <document> tag?),
and therefore I am wondering whether I am on the right path at all.

Any advice on this is really appreciated. Thanks a lot,

	-- Matt


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org


RE: Parsing a stream of XML documents

Posted by Gary Hughes <ge...@itga.com.au>.
Matt,

I have tackled a similar problem, although I didn't have to worry about
different encodings. The first problem is that the parser only stops when
the input source returns EOF. This is because a document is not delimited
by the start/end of the root element: there can be multiple processing
instructions before and after the root element. I got around that by
creating my own input stream class. I intercepted the end element events,
and when I hit the end element for the root element I instructed the input
source to return EOF.

There is another issue, though. When you tell the input stream to return
EOF, the parser will correctly end the parse of the current document, but
when you reset the parser it throws away any extra data it had already
obtained from the input stream, and this could be part of the next
document. The only way I could fix this was to only ever return 1 byte at
a time from the input stream, and the whole thing turned out to be too
slow.

I wanted to do this so that I could have client/server applications
streaming XML to each other over sockets. I have iostream socket classes
and could have parsed the XML directly from the socket, which would have
been very clean and simple, but alas I have not been able to find a good
solution. So I simply serialised the XML into another message format: I
read these messages, extract the XML section and parse it. It is a bit of
double handling, but it's still quick and works well.
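
As an illustration of the kind of wrapping described above, here is a
small sketch of a length-prefixed framing. This is an assumed format
chosen for the example, not the actual message format: each message is
a 4-byte big-endian length followed by the XML bytes. The function
names frameMessage and extractMessage are likewise made up.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Wraps one XML document as: 4-byte big-endian length + payload.
std::string frameMessage(const std::string& xml)
{
    const std::uint32_t n = static_cast<std::uint32_t>(xml.size());
    std::string out;
    out.push_back(static_cast<char>((n >> 24) & 0xFF));
    out.push_back(static_cast<char>((n >> 16) & 0xFF));
    out.push_back(static_cast<char>((n >> 8) & 0xFF));
    out.push_back(static_cast<char>(n & 0xFF));
    out += xml;
    return out;
}

// Tries to take one complete message off the front of `buffer`.
// On success the payload is stored in `xml`, the frame is removed from
// `buffer`, and true is returned. If the buffer does not yet hold a
// complete frame, nothing is changed and false is returned.
bool extractMessage(std::string& buffer, std::string& xml)
{
    if (buffer.size() < 4)
        return false;
    std::uint32_t n = 0;
    for (int i = 0; i < 4; ++i)
        n = (n << 8) | static_cast<unsigned char>(buffer[i]);
    if (buffer.size() - 4 < n)
        return false;
    xml = buffer.substr(4, n);
    buffer.erase(0, 4 + static_cast<std::size_t>(n));
    return true;
}
```

Each extracted payload can then be parsed on its own as an in-memory
document, so the parser never sees the bytes of the next message, and a
payload that is not well-formed costs only that one message.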

Gary.
