
Posted to general@xml.apache.org by Paul Libbrecht <pa...@activemath.org> on 2002/08/26 21:39:26 UTC

Re: Progressive parsing

First, thanks to Neil for his answer (on the Xerces-j-user list), which I
can no longer find to quote properly.
Here is an attempt at a solution that seems to satisfy my needs, but I
would welcome comments on the quality of the approach.

To re-parse (or parse) a single element (and its child content), it
seems sufficient to have the following information: the URL of the
document, the byte positions (start and end) of the whole element, and
the byte positions of the start tags of all its ancestor elements
(so that the corresponding closing tags can be fed as well).
This could easily be piped through a stream that reads only the
necessary bytes and skips the rest, thereby feeding the parser only
what it needs.

Here's an example:
<a>    <b><c>blop</c></b>     <b id="b1"><c>blip</c></b>      </a>

To reparse only the content of the b element with id "b1", I can then
feed the parser:
			<a><b id="b1"><c>blip</c></b></a>
thus avoiding the presumably enormous content of the first b element.
(Note: this says nothing about what the parser is feeding; I am
thinking of JDOM, but anything that consumes SAX events would do.)
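
A minimal Java sketch of the feeding step, assuming the byte offsets are
already known (how to obtain them is the open question of this thread).
The class FragmentFeeder, its methods, and the in-memory byte array are
illustrative assumptions, not part of any parser API; with a real file
the slices would come from seeking or skipping in the underlying stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: feeds a parser only the ancestor start tags,
// the target element, and the closing tags for the ancestors.
class FragmentFeeder {

    /** Returns a stream over bytes [start, end) of the document. */
    static InputStream slice(byte[] doc, int start, int end) {
        return new ByteArrayInputStream(doc, start, end - start);
    }

    /**
     * Parses the element stored at [elemStart, elemEnd), wrapped in the
     * ancestor start tags (each given as {start, end}, outermost first)
     * and followed by the closing tags supplied by the caller.
     * Returns the element names in the order startElement reported them.
     */
    static List<String> parseFragment(byte[] doc, int[][] ancestorTags,
                                      int elemStart, int elemEnd,
                                      String closingTags) {
        List<InputStream> parts = new ArrayList<>();
        for (int[] tag : ancestorTags) {
            parts.add(slice(doc, tag[0], tag[1]));
        }
        parts.add(slice(doc, elemStart, elemEnd));
        parts.add(new ByteArrayInputStream(
                closingTags.getBytes(StandardCharsets.US_ASCII)));

        // The parser sees one continuous stream made of the pieces.
        InputStream in = new SequenceInputStream(Collections.enumeration(parts));
        List<String> seen = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in,
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String local,
                                                 String qName, Attributes atts) {
                            seen.add(qName);
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```

For the example document, the ancestor list holds the single range of
<a>, and the closing tags string is "</a>"; the first b element is never
handed to the parser.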

I see at least two applications of this:

- an XML source editor that has, say, a tree view could reparse much
less and so be much more responsive (try jEdit's excellent xml-mode:
the parsing step is heavy!).

- to make a poor man's (read-only) database of XML content, it would be
sufficient to build an index of the elements that have an id; the
indexed elements would then be fed to the parser in response to a query.

But is this good XML practice?
I am clearly losing the ability to apply full validation (that is, I
could only revalidate the element's content; can a schema be used with
a different root element, as a DTD can? What about RELAX NG schemas?)

Finally... to Xerces makers/users: how do I get the byte position of an
element that the SAX parser has just handed to me?

Thanks.

Paul

On Thursday, July 25, 2002, at 02:58, Paul Libbrecht wrote:
> Although this request is only about parsing, I think it is
> general enough to be posted to this list.
>
> Here's a simple problem: one of our applications reads a series of XML
> documents, all using the same DTD declarations. If I understand
> correctly, at least with the SAX or JAXP interfaces, the parser will
> read the DTD(s) completely every time.
> This looks like a real waste of resources. Do some parsers, preferably
> in a standard way, offer a means to avoid this and reuse the same
> parsed DTD every time?
>
>
> A related case is the building of an XML editor in which you offer the
> user the ability to edit the source code: what you would like is for
> the internal XML representation to be updated quickly (ideally all
> the time). For this, however, we would need the parser to be able to
> parse only, say, the biggest element containing the changes.
> And for this, some more information would have to be kept, at least
> something similar to a stack of namespaces for each location.
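
On the DTD question in the quoted message: the sketch below is not
parsed-DTD reuse but a simpler, swapped-in technique that works with
plain SAX. It caches the raw bytes of each external DTD in an
org.xml.sax.EntityResolver, so the DTD is fetched once even though the
parser still re-reads it from memory for every document. The class
CachingDtdResolver, its loader function, and the parse helper are
illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical resolver: fetches each DTD once and serves it from memory.
class CachingDtdResolver implements EntityResolver {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> loader;

    /** loader maps a system identifier to the DTD bytes (for example by opening the URL). */
    CachingDtdResolver(Function<String, byte[]> loader) {
        this.loader = loader;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Only the first request for a given systemId reaches the loader.
        byte[] bytes = cache.computeIfAbsent(systemId, loader);
        InputSource source = new InputSource(new ByteArrayInputStream(bytes));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    /** Parses one document with this resolver installed. */
    void parse(String xml) {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(this);
            reader.setContentHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

This saves the I/O only; avoiding the repeated DTD parse itself needs
parser-specific support.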

---------------------------------------------------------------------
In case of troubles, e-mail:     webmaster@xml.apache.org
To unsubscribe, e-mail:          general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org

Re: Progressive parsing

Posted by Aleksander Slominski <as...@cs.indiana.edu>.

hi,

a typical decoder (such as UTF8), which takes input as a byte stream and
converts it into a char reader, is _not_ bi-directional. so even if i know
that i now have char 'x', the decoder will not tell me the current byte
position(s) of this character.

you could try to overcome this by keeping a reference to the original
byte stream and reading the position directly from this stream. however,
this will not work either, as any decent decoder and parser will run
over buffered input (and will try to "fix" unbuffered input into buffered
input) or will even do its own buffering, as Xerces2 does and, i think, also
the SUN built-in UTF8 decoder (AFAIR it made it very hard to write a Jabber
client, as the UTF8 decoder tried to read too much and blocked ...). in such
cases buffering makes reading the byte stream position pretty much useless,
as the logical position does not necessarily correspond to the physical byte
stream position (the byte stream will typically be ahead of the character
currently available from the decoder).
moreover, even if you manage to turn off buffering, it will degrade the
performance of the parser a lot - so you may read physical positions but
overall performance will be really bad ...
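
the read-ahead effect can be observed with a small Java sketch. the
class ReadAheadDemo and its CountingInputStream are illustrative names
(assumptions, not library classes); the sketch wraps the source in a
byte counter, reads a single char through java.io.InputStreamReader, and
reports how many bytes the decoder has already pulled from the source.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadAheadDemo {

    /** Counts the bytes pulled from the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }
    }

    /** Reads exactly one char and returns the bytes consumed from the source. */
    static long bytesConsumedForOneChar(byte[] data) {
        CountingInputStream counting =
                new CountingInputStream(new ByteArrayInputStream(data));
        try (Reader reader = new InputStreamReader(counting, StandardCharsets.UTF_8)) {
            reader.read(); // one logical character
            return counting.count; // physical position, usually far ahead
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

with a few kilobytes of ASCII input the count is much larger than one,
which is why the position of the raw stream says little about the
character the parser is currently looking at.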

all of this is true when trying to find a general solution to the problem.
however, the situation is not that bad for simple encodings such as
ISO-8859-1, as you can always get the original byte stream position because
there is a one-to-one correspondence. the only difficulty is with UTF8 - so
if you are willing to send input as ISO-8859-1 or UTF16 or any other encoding
with a fixed number of bytes per character you should be fine (and can easily
convert a logical position to a physical byte stream position ...)
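
a small Java sketch of that conversion, assuming the parser reports
offsets in Java chars (UTF-16 code units) and that the text is
well-formed. the class OffsetMath and its method names are illustrative
assumptions. for the fixed-width cases the conversion is a
multiplication; for UTF8 the characters before the offset have to be
walked.

```java
class OffsetMath {

    /** ISO-8859-1: one byte per char. */
    static long latin1ByteOffset(int charCount) {
        return charCount;
    }

    /** UTF-16: two bytes per Java char, plus two if a byte-order mark is present. */
    static long utf16ByteOffset(int charCount, boolean hasBom) {
        return 2L * charCount + (hasBom ? 2 : 0);
    }

    /**
     * UTF-8: bytes needed for the first charCount chars of text.
     * Assumes charCount does not split a surrogate pair.
     */
    static long utf8ByteOffset(CharSequence text, int charCount) {
        long bytes = 0;
        for (int i = 0; i < charCount; i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                bytes += 1; // ASCII
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c) && i + 1 < charCount
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                bytes += 4; // one supplementary code point, two chars
                i++;
            } else {
                bytes += 3;
            }
        }
        return bytes;
    }
}
```

the UTF8 variant costs a pass over the text, but it can be kept as a
running total while the document is decoded once.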

finally, as the source code of Xerces2 is available, you can always work on
changing the code to suit your needs (but it will take some time to get a
good solution, especially if you care about performance, and that seems to
be the case ...)

hope it helps,

alek



Re: Progressive parsing

Posted by Paul Libbrecht <pa...@activemath.org>.

Well,

Having a position according to an encoding is honestly, simply... bad.

One of the target applications was to be a client of such an
indexed database over HTTP/1.1. That protocol has a way to request
only a series of byte ranges of a file. But that can only be expressed
in bytes, of course.
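
A minimal Java sketch of such a request, using the standard
java.net.HttpURLConnection. The class RangeRequest and its methods are
illustrative assumptions, and the ranges are taken as half-open
[start, end) pairs, whereas the HTTP Range header uses inclusive
last-byte positions.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class RangeRequest {

    /** Builds a Range header value from half-open {start, end} pairs. */
    static String rangeHeader(long[][] ranges) {
        StringBuilder sb = new StringBuilder("bytes=");
        for (int i = 0; i < ranges.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            // HTTP byte ranges are inclusive, so the last byte is end - 1.
            sb.append(ranges[i][0]).append('-').append(ranges[i][1] - 1);
        }
        return sb.toString();
    }

    /**
     * Opens a connection that asks only for the given ranges.
     * A server may ignore the header and answer 200 with the full body;
     * a 206 answer to several ranges arrives as multipart/byteranges,
     * which the caller still has to split.
     */
    static HttpURLConnection open(URL url, long[][] ranges) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Range", rangeHeader(ranges));
        return conn;
    }
}
```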

When doing it with files, one expects to use, say, the
InputStream.skip() method, which is, hopefully, efficiently implemented
and moves the cursor in the underlying file-reading routines.
Skipping x characters through a decoder is simply a killer: the decoder
has to run through all the characters. For example, in UTF-8, skipping
a non-ASCII character means skipping two to four bytes, whereas
skipping an ASCII character means skipping one byte.
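
One caveat when relying on InputStream.skip(): it is allowed to skip
fewer bytes than requested, so a byte offset is only reached reliably by
looping. A minimal Java sketch; the class Skips and its methods are
illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class Skips {

    /** Skips exactly n bytes or throws EOFException if the stream ends first. */
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() may return 0 without being at end of stream,
            // so probe with a single read before giving up.
            if (in.read() < 0) {
                throw new EOFException(remaining + " bytes left to skip");
            }
            remaining--;
        }
    }

    /** Returns the byte found at the given offset of data, reached by skipping. */
    static int byteAt(byte[] data, long offset) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            skipFully(in, offset);
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```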

So... I really meant: "Can I get the byte position?"
Currently, the only way is to build the index using a
"load-in-memory-then-rewrite-to-file" approach... I can live with this,
but I would have expected "fine parsers" to provide more.

Paul

On Tuesday, August 27, 2002, at 04:42, Aleksander Slominski wrote:
>> Finally... to xerces makers/users: how do I get the byte position of an
>> element declaration I've just been handed to by the sax parser ?
>
> this is more complex as parser works on UTF-16 characters (char)
> so obtaining position of original stream if it was not UTF-16 is very 
> difficult. however i think that for your cases it is enough to get 
> position of start/end element in character stream. ability to obtain 
> position is not currently part of xerces2 but you can take a look on my 
> patch that adds to XMLLocator function getCurrentEntityAbsoluteOffset() 
> that can be used to get current position of parser. together with 
> changes to XMLDocumentFragmentScannerImpl it is possible to get 
> start/end position of every XML event in XNI. for details see:
>
> http://www.extreme.indiana.edu/xgws/xsoap/xpp/download/PullParser2/lib/xerces2_patched/


Re: Progressive parsing

Posted by Paul Libbrecht <pa...@activemath.org>.

On Tuesday, August 27, 2002, at 04:42, Aleksander Slominski wrote:
> however that may not work if parser is validating and for example there
> are explicit rules for children of <a> (like <a> must have two <b> 
> children).

Well, that was the point of my probably too hasty sub-paragraph...
If you provide the validating parser with only the content of
<b id="b1">, I think it is OK.
It would be OK with a DTD-validating parser.
Would it be with a schema-validating parser?

Paul



>> <a>    <b><c>blop</c></b>     <b id="b1"><c>blip</c></b>      </a>
>>
>> To reparse only the content of b of id "b1" I can then feed to the
>> parser:
>>                         <a><b id="b1"><c>blip</c></b></a>

> http://www.extreme.indiana.edu/xgws/xsoap/xpp/download/PullParser2/lib/xerces2_patched/

Looks good, I'll have a look.



Re: Progressive parsing

Posted by Aleksander Slominski <as...@cs.indiana.edu>.

Paul Libbrecht wrote:

> Here's an example:
> <a>    <b><c>blop</c></b>     <b id="b1"><c>blip</c></b>      </a>
>
> To reparse only the content of b of id "b1" I can then feed to the
> parser:
>                         <a><b id="b1"><c>blip</c></b></a>
> thus avoiding the presumably enormous content of the first b element.
> (Note: this says nothing about what the parser is feeding; I am
> thinking of JDOM, but anything that consumes SAX events would do.)

> I see at least two applications of this:
>
> - an xml source editor that has, say, a tree-view, could reparse much
> less thereby being much more responsive (try jEdit's excellent xml-mode,
> the parsing step is heavy!).
>
> - to make poor-man's (read-only) database of xml-content, it would be
> sufficient to build an index of the elements with an id which would then
> be fed responding to a query
>
> But is this good XML practice?
> I am clearly losing the ability to apply full validation (that is, I
> could only revalidate the element's content; can a schema be used with
> a different root element, as a DTD can? What about RELAX NG schemas?)

hi,

however, that may not work if the parser is validating and, for example,
there are explicit rules for the children of <a> (like <a> must have two
<b> children).

> Finally... to xerces makers/users: how do I get the byte position of an
> element declaration I've just been handed to by the sax parser ?

this is more complex, as the parser works on UTF-16 characters (char),
so obtaining the position in the original stream if it was not UTF-16 is
very difficult. however, i think that for your case it is enough to get the
position of the start/end of an element in the character stream. the ability
to obtain the position is not currently part of xerces2, but you can take a
look at my patch that adds to XMLLocator the function
getCurrentEntityAbsoluteOffset(), which can be used to get the current
position of the parser. together with changes to
XMLDocumentFragmentScannerImpl, it makes it possible to get the start/end
position of every XML event in XNI. for details see:

http://www.extreme.indiana.edu/xgws/xsoap/xpp/download/PullParser2/lib/xerces2_patched/
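
for comparison, a sketch of what unpatched, standard SAX offers: the
org.xml.sax.Locator passed to setDocumentLocator() reports a line and
column per event, not a byte or character offset, and the SAX
documentation describes these values as approximations meant for
diagnostics. the class LocatorDemo and its locate method are
illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class LocatorDemo {

    /** Returns one "name@line:column" entry per start tag. */
    static List<String> locate(String xml) {
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator l) {
                locator = l; // supplied by the parser before startDocument
            }

            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                // A parser is not required to supply a locator at all.
                int line = locator == null ? -1 : locator.getLineNumber();
                int col = locator == null ? -1 : locator.getColumnNumber();
                out.add(qName + "@" + line + ":" + col);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out;
    }
}
```

turning line and column into a byte position still needs a second pass
over the source with knowledge of its encoding, which is the gap the
patch above addresses.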

thanks,

alek
