Posted to java-dev@axis.apache.org by Aleksander Slominski <as...@cs.indiana.edu> on 2001/10/22 02:10:43 UTC

Re: [Xerces 2] accessing and controlling entity parsing in XNI

Andy Clark wrote:

> Currently, we only provide the location via the XMLLocator
> passed to the startDocument/startDTD methods in the handlers.
> Can you use this between callbacks in order to determine the
> boundaries of the markup or content returned?

> Please note however, that the locations reported by the
> locator object are the row and column numbers of the position
> in the *transcoded* stream immediately following the last
> scanned markup or content. So this information does not
> reflect the actual position in the original stream because
> of various issues like character encoding, etc.

i know this; that is why i would like the XNI XMLLocator to be extended to
report the position in the input stream, so that i can precisely access parts
of the markup or content. it will require keeping a per-entity position counted
from the beginning of the entity, but that should be the only change and it
should not be difficult?

> > i would like to expose to application fCurrentEntity.position and allow to
> > control peekChar() and load() behavior (load is now private final function
> > ...).
>
> Why do you want to control the entity scanner? Other people
> (e.g. Xalan folks) have also asked about being able to control
> the input buffer in the parser. So it would be useful to know
> why you want this feature.

i am working on a SOAP processor. in this particular case messages are small
and will possibly be forwarded to other SOAP processors, but there may be some
modifications such as removing part of the markup (e.g. a SOAP header). the best
way to do it is to keep buffering the input stream, note the positions of all
skipped markup, and then forward the XML content with slight modifications such
as skipping or inserting some content (if necessary).

this can only be done if the parsing layer gives me this precise positioning
information...
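
[Editorial sketch] To make the forwarding idea above concrete, here is a minimal
Java sketch, assuming the parser could report char offsets for the markup to be
dropped. The class and method names (SkipForwarder, Range, forward) are
hypothetical and not part of Xerces2 or XNI, and the offsets in main are found
with indexOf only as a stand-in for parser-reported positions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: forwards buffered XML while omitting recorded ranges. */
final class SkipForwarder {
    /** Half-open range [start, end) of char offsets into the buffered input. */
    static final class Range {
        final int start;
        final int end;
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    /** Copies buf, leaving out the given sorted, non-overlapping ranges. */
    static String forward(CharSequence buf, List<Range> skipped) {
        StringBuilder out = new StringBuilder(buf.length());
        int pos = 0;
        for (Range r : skipped) {
            out.append(buf, pos, r.start); // content before the skipped markup
            pos = r.end;                   // jump over the skipped markup
        }
        out.append(buf, pos, buf.length()); // tail after the last skipped range
        return out.toString();
    }

    public static void main(String[] args) {
        String msg = "<Envelope><Header><h/></Header><Body>hi</Body></Envelope>";
        // Assumption: in a real processor these offsets would come from the parser.
        int start = msg.indexOf("<Header>");
        int end = msg.indexOf("</Header>") + "</Header>".length();
        List<Range> skipped = new ArrayList<>();
        skipped.add(new Range(start, end));
        System.out.println(forward(msg, skipped)); // <Envelope><Body>hi</Body></Envelope>
    }
}
```

The untouched parts of the message are copied verbatim and never re-serialized,
which is why exact offsets matter for this approach.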

in Xml Pull Parser 2 i now have an X2 driver that uses the Xerces2 XNI pull
parsing API as an alternative to my default tokenizer/parser implementation,
and this positioning is the only missing feature that prevents me from using
Xerces2 for efficient SOAP pull parsing :-)

> > finally i would like to be able to pinpoint input buffer so it is always
> > growing but never shrunk with System.arraycopy() - it is very useful if i
> > want to keep in memory representation of unparsed XML in memory that can
> > be used similarly to DOM as persistent representation of XML doc  ( to
> > reconstruct DOM *when* it is needed...).
>
> This is a much more difficult request and I'll explain why.
>
> The scanner is implemented to be as efficient as possible.
> So it re-uses the underlying character buffer over and over
> again. We've been asked to add a feature to orphan the
> character array instead of re-using it so that people can
> keep a reference to the character array passed to the
> characters() method and know that the data won't be
> changed later. This could very easily be done.
>
> However, growing the underlying character array is much
> more difficult. Do you want the array contain the decoded
> but non-normalized contents of the document? Or do you
> want the array to contain the "flattened" contents of
> the document, with all entities inlined, etc? And once
> you grow the array, then all of the array references
> and position information that you've collected during
> the parse is incorrect.
>
> So I would advise not to go down that path.

instead of pinpointing i can pass my own reader that will keep the content of
the incoming input in a growable buffer. however, i will still need to install
some kind of entity manager so that i can wrap all entity inputs with my own
buffering reader (though that is not critical, as the SOAP spec disallows DTD
declarations, so there are no external entities to worry about ...).
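
[Editorial sketch] A reader of the kind described above could look roughly like
this in Java. It is only an illustration under stated assumptions: the names
RecordingReader, recorded and recordAll are hypothetical, nothing here is
Xerces2 API, and how such a reader would be handed to the parser for each
entity is left open.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical wrapper: remembers every char pulled from the wrapped reader. */
final class RecordingReader extends FilterReader {
    private final StringBuilder seen = new StringBuilder(); // grows, never shrinks

    RecordingReader(Reader in) { super(in); }

    @Override public int read() throws IOException {
        int c = super.read();
        if (c >= 0) seen.append((char) c);
        return c;
    }

    @Override public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        if (n > 0) seen.append(cbuf, off, n);
        return n;
    }

    @Override public long skip(long n) throws IOException {
        if (n < 0) throw new IllegalArgumentException("skip value is negative");
        // Read instead of skipping so that skipped chars are recorded too.
        char[] tmp = new char[(int) Math.min(n, 1024)];
        long done = 0;
        while (done < n) {
            int r = read(tmp, 0, (int) Math.min(tmp.length, n - done));
            if (r < 0) break;
            done += r;
        }
        return done;
    }

    @Override public boolean markSupported() { return false; }

    /** Deliberately a no-op so the consumer cannot close the underlying input. */
    @Override public void close() { }

    /** Everything delivered to the consumer so far. */
    CharSequence recorded() { return seen; }

    /** Demo helper: drains the input in chunks and returns what was recorded. */
    static String recordAll(String input, int chunkSize) {
        RecordingReader r = new RecordingReader(new StringReader(input));
        char[] chunk = new char[chunkSize];
        try {
            while (r.read(chunk, 0, chunk.length) != -1) { /* consumer work */ }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return r.recorded().toString();
    }
}
```

Note that such a reader only preserves the input; the amount it has recorded
says nothing about where the scanner currently is within that input.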

how difficult would it be to do? i would be happy to do it (it all actually
looks not that complex) but i have no experience with the X2 codebase...

thanks,

alek



Re: [Xerces 2] accessing and controlling entity parsing in XNI

Posted by Aleksander Slominski <as...@cs.indiana.edu>.
Andy Clark wrote:

> Aleksander Slominski wrote:
> > content - it will require to keep per entity position since
> > beginning but it should be the only change and it should not
> > be difficult?
>
> It *seems* easy but it's not.
>
> The only reliable way of doing this is to write custom
> readers for every conceivable character encoding so that
> you can keep track of byte vs. char location in the XML
> document stream.

hi,

i was actually thinking about keeping the position in the UTF-16 input reader, i.e. fCurrentEntity.reader,
and just exposing fCurrentEntity.position, and that i think is much, much easier....

i agree completely that trying to do it by keeping the position in the original input stream (with all
possible encodings) would require a lot of work and could even prevent efficient buffering....

so in a nutshell, a simple solution could be adding one method to XMLLocator:

    /** Returns the parser position, counted from the beginning of the entity input. */
    public int getCurrentEntityAbsoluteOffset();

and implementing it as part of XMLEntityManager so that more precise positioning is available. the effect
on parser performance is absolutely minimal - just one add operation in load(...).
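
[Editorial sketch] The "one add operation in load()" idea can be illustrated with
a simplified, hypothetical stand-in for a scanned entity. This is not the
Xerces2 implementation; the class OffsetTrackingEntity and the baseOffset field
are assumptions made for illustration, and only the method name
getCurrentEntityAbsoluteOffset comes from the proposal above. The offset counts
decoded (UTF-16) chars, not bytes of the original stream.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical, simplified stand-in for a scanned entity; not the Xerces2 class. */
final class OffsetTrackingEntity {
    private final Reader reader;
    private final char[] ch;   // buffer that is re-used for every load
    private int position;      // index of the next char to scan within ch
    private int count;         // number of valid chars in ch
    private int baseOffset;    // chars consumed by all previous buffer loads

    OffsetTrackingEntity(Reader reader, int bufferSize) {
        this.reader = reader;
        this.ch = new char[bufferSize];
    }

    /** Refills the buffer once it is exhausted; returns false at end of input. */
    private boolean load() throws IOException {
        baseOffset += count;   // the one extra add per load
        position = 0;
        count = 0;
        int n = reader.read(ch, 0, ch.length);
        if (n < 0) return false;
        count = n;
        return true;
    }

    /** Returns the next char, or -1 at end of input. */
    int next() throws IOException {
        while (position == count) {
            if (!load()) return -1;
        }
        return ch[position++];
    }

    /** Char offset of the next unscanned char, counted from the start of the entity. */
    int getCurrentEntityAbsoluteOffset() {
        return baseOffset + position;
    }

    /** Demo helper: offset right after the first occurrence of target, or -1 if absent. */
    static int offsetAfter(String xml, char target, int bufferSize) {
        OffsetTrackingEntity e = new OffsetTrackingEntity(new StringReader(xml), bufferSize);
        try {
            int c;
            while ((c = e.next()) != -1) {
                if (c == target) return e.getCurrentEntityAbsoluteOffset();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return -1;
    }
}
```

For example, offsetAfter("<a><b/></a>", 'b', 4) returns 5 even though the
4-char buffer has been refilled once by then, because the offset is the sum of
the chars consumed by earlier loads and the position in the current buffer.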


> > positioning is the only missing feature that prevents me from
> > using Xerces2 for efficient SOAP pull parsing :-)
>
> I'm having trouble buying that conclusion. :)

i am just saying that SOAP processing is a special kind of XML parsing that sometimes requires special
features, but it would be good to leverage the existing, well-tested infrastructure even in
those special circumstances ....

> > instead of pinpointing i can pass my own reader that will keep
> > content of incoming input in a growable buffer. however i still
>
> Unless your reader only returns one char at a time, this
> is not going to work because the parser reads the input in
> chunks. Therefore, the location your reader reports will
> be past the actual point where the scanner is looking at
> markup. And even if your reader limited chunking calls to
> a single char at a time, this is grossly inefficient.

i agree. i would use the reader only to preserve the input, not to do positioning (and maybe also
to prevent Xerces from closing my input stream ...).

> If you're relying on this to provide the performance you
> need, then I would suggest attacking the performance from
> another angle.

i am interested in doing efficient dispatching/routing, and that requires extracting partial information
from the XML; since it is XML, i still need an XML parser. so the idea is simply to pull parse only as
much content as needed and then dispatch the rest of it.

thanks,

alek



---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org


Re: [Xerces 2] accessing and controlling entity parsing in XNI

Posted by Andy Clark <an...@apache.org>.
Aleksander Slominski wrote:
> content - it will require to keep per entity position since 
> beginning but it should be the only change and it should not 
> be difficult?

It *seems* easy but it's not.

The only reliable way of doing this is to write custom
readers for every conceivable character encoding so that
you can keep track of byte vs. char location in the XML
document stream.
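
[Editorial sketch] The byte vs. char distinction can be seen in a small Java
example. It assumes UTF-8 input and a hypothetical helper, utf8ByteOffset, that
re-encodes the already decoded prefix; it illustrates why the two offsets
diverge and is not a technique a parser would use per character.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration: char offsets and byte offsets diverge. */
final class ByteVsCharOffset {
    /**
     * Byte offset in the UTF-8 encoding that corresponds to a char offset in
     * the decoded text. Assumes charOffset does not split a surrogate pair.
     */
    static int utf8ByteOffset(String text, int charOffset) {
        return text.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+00E9 takes 2 bytes and U+20AC takes 3 bytes in UTF-8.
        String xml = "<n>\u00e9\u20ac</n>";
        int charOffset = xml.indexOf("</n>");
        System.out.println("char offset: " + charOffset);                      // 5
        System.out.println("byte offset: " + utf8ByteOffset(xml, charOffset)); // 8
    }
}
```

The mapping depends on the encoding and on every character seen so far, so a
locator that only knows the decoded char position cannot recover the byte
position without encoding-specific bookkeeping.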

> positioning is the only missing feature that prevents me from 
> using Xerces2 for efficient SOAP pull parsing :-)

I'm having trouble buying that conclusion. :)

> instead of pinpointing i can pass my own reader that will keep 
> content of incoming input in a growable buffer. however i still 

Unless your reader only returns one char at a time, this 
is not going to work because the parser reads the input in 
chunks. Therefore, the location your reader reports will 
be past the actual point where the scanner is looking at 
markup. And even if your reader limited chunking calls to
a single char at a time, this is grossly inefficient.
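
[Editorial sketch] The chunking problem can be demonstrated with a small Java
example. The classes CountingReader and ChunkOvershootDemo are hypothetical,
and the loop only imitates a scanner that reads in chunks; it is not the
Xerces2 scanner.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical reader that counts how many chars it has handed out. */
final class CountingReader extends FilterReader {
    private long delivered;

    CountingReader(Reader in) { super(in); }

    @Override public int read() throws IOException {
        int c = super.read();
        if (c >= 0) delivered++;
        return c;
    }

    @Override public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        if (n > 0) delivered += n;
        return n;
    }

    long delivered() { return delivered; }
}

final class ChunkOvershootDemo {
    /**
     * Imitates a scanner that reads in chunks and stops right after the first
     * '>'. Returns {chars scanned, chars the reader has delivered}.
     */
    static long[] scanToFirstTagEnd(String xml, int chunkSize) {
        CountingReader source = new CountingReader(new StringReader(xml));
        char[] chunk = new char[chunkSize];
        long scanned = 0;
        try {
            int n;
            while ((n = source.read(chunk, 0, chunk.length)) > 0) {
                for (int i = 0; i < n; i++) {
                    scanned++;
                    if (chunk[i] == '>') {
                        return new long[] { scanned, source.delivered() };
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] { scanned, source.delivered() };
    }

    public static void main(String[] args) {
        long[] r = scanToFirstTagEnd("<a><b>text</b></a>", 16);
        System.out.println("scanned=" + r[0] + " delivered=" + r[1]); // scanned=3 delivered=16
    }
}
```

With a chunk size of 1 the two numbers agree (3 and 3), but at the cost of one
read call per character.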

If you're relying on this to provide the performance you
need, then I would suggest attacking the performance from
another angle.

> how difficult would it be to do? i would be happy to do it 
> (all actually looks not that complex) but i have no 
> experience with X2 codebase...

You're welcome to try but I would not recommend it. And
that has nothing to do with the Xerces2 codebase -- I'm
speaking from years of experience in writing XML parsers.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org
