Posted to j-users@xerces.apache.org by Andy Clark <an...@apache.org> on 2003/10/06 06:48:56 UTC

[Discuss] Pull Parsing, JSR-173, and Xerces

With the recent public review of JSR-173, the Streaming
API for XML (StAX), I've been hoping for more of a
discussion among Apache users and developers regarding
this API and pull parsing in general. But there seems to
have been an amazing amount of apathy in this regard. So
I would like to kickstart the discussion.

I am concerned that the API, as it stands, will not
adequately meet the needs of XML developers. Moreover,
I have concerns about implementing it efficiently in
the Xerces parser. But I'll let others comment on the
technical (de)merits of the API because I want to take
this opportunity to discuss what I would like to see
in a pull parser design.

There are two camps of thought in JSR-173: one that
wants a single interface iterator model and another
that wants discrete event objects to represent the
various parts of the document. The first is designed
with small footprint in mind while the second is more
OO and allows apps to conveniently save document
content.

To appease both camps, JSR-173 includes both approaches
in the API. This is wrong. Users would be better served
by a single, simpler, more integrated approach.

I favor the event approach with the fundamental change
that the event objects returned are singletons owned
by the parser. If the application wants the information
stored within the object, the app must copy the info out
of the singleton and save it.

This approach would appease those developers concerned
with memory (e.g. people targeting J2ME) while providing
a straightforward OO model for everyone else. The counter
argument is that users of the API would be confused about
who owns the memory and try to keep references to objects
whose content is transient. But I disagree.

While it may cause some people trouble the first time
they sit down to write an app, they quickly learn the
paradigm and move on. As we all know, DOM has "live" node
lists. That's the model. You may trip over it the first
time but then you learn it and move on.

And providing a clone method allows applications to keep
references to event objects if they choose. So this would
be a way to provide that functionality as well while
maintaining a single, integrated model which I think is
paramount.
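
[Editorial sketch] The ownership rule described above can be
illustrated with a small, self-contained example. Every name in
it (TextEvent, ReusingPullParser, SingletonEventDemo) is
hypothetical and belongs to neither JSR-173 nor Xerces; it only
shows what "singleton owned by the parser, clone to keep" could
look like in practice.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event type; not part of JSR-173 or Xerces. */
final class TextEvent implements Cloneable {
    private String text;

    String getText() { return text; }

    // Package-private: only the parser overwrites the shared instance.
    void reset(String newText) { this.text = newText; }

    /** Returns an independent snapshot that the application may keep. */
    @Override
    public TextEvent clone() {
        try {
            return (TextEvent) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // unreachable: the class is Cloneable
        }
    }
}

/** Hypothetical pull source that reuses one event object for every item. */
final class ReusingPullParser {
    private final String[] chunks;
    private final TextEvent shared = new TextEvent(); // owned by the parser
    private int index;

    ReusingPullParser(String... chunks) { this.chunks = chunks; }

    /** Returns the shared event, overwritten on each call, or null at the end. */
    TextEvent next() {
        if (index >= chunks.length) return null;
        shared.reset(chunks[index++]);
        return shared;
    }
}

final class SingletonEventDemo {
    /** Keeps raw references: every kept entry aliases the same object. */
    static List<String> keepReferences(String... chunks) {
        return collect(new ReusingPullParser(chunks), false);
    }

    /** Keeps clones: every kept entry is an independent snapshot. */
    static List<String> keepClones(String... chunks) {
        return collect(new ReusingPullParser(chunks), true);
    }

    private static List<String> collect(ReusingPullParser parser, boolean copy) {
        List<TextEvent> kept = new ArrayList<>();
        for (TextEvent e = parser.next(); e != null; e = parser.next()) {
            kept.add(copy ? e.clone() : e);
        }
        // Read the text only after parsing has finished.
        List<String> texts = new ArrayList<>();
        for (TextEvent e : kept) texts.add(e.getText());
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(keepReferences("a", "b", "c")); // [c, c, c]
        System.out.println(keepClones("a", "b", "c"));     // [a, b, c]
    }
}
```

The first call shows the trap the counter-argument worries about;
the second shows the clone-based escape hatch.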

I'll provide more details as the discussion develops but
now I'd like to see what other people think. If you need
to catch up with what I'm talking about, you can check
out the following URLs regarding JSR-173:

   http://www.jcp.org/en/jsr/detail?id=173
   http://jcp.org/aboutJava/communityprocess/first/jsr173/index.html

One last thing: even though I'm cc'ing xerces-j-user, I
would like to keep this discussion on the xerces-j-dev
list. So if you'd like to contribute your two cents and
you're not already subscribed to the xerces-j-dev list,
do that now.

-- 
Andy Clark * andyc@apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org

Re: [Discuss] Pull Parsing, JSR-173, and Xerces

Posted by Andy Clark <an...@apache.org>.

Jeremy Carroll wrote:
> I am afraid I do not know anything about JSR-173, but I am a user of the 
> Xerces pull parser, and I like the very simple parse-some API offered.

The event parser that you developed is an extreme and
custom case. While that can always be implemented where
needed, the average user would get more benefit from a
generalized API.

I like the push model (and the SAX implementation) very
much but at the same time I understand the annoyance of
having to split my logic across multiple callback methods
and the need to buffer text content. So an efficient
model that provides me a convenient way to navigate and
access document data is extremely useful. And that's
what I had hoped would come out of the JSR-173 work...
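
[Editorial sketch] For readers who have not hit this annoyance,
here is a minimal example of the split-callback pattern using
the standard SAX classes shipped with the JDK. The class name
TitleCollector and the "title" element are made up for the
illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the text of every <title> element (hypothetical vocabulary). */
final class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder buffer; // non-null only while inside <title>

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("title".equals(qName)) buffer = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one run of text in several chunks, so the
        // application has to buffer it across calls.
        if (buffer != null) buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(buffer.toString());
            buffer = null;
        }
    }

    /** Parses the given XML text and returns the collected titles. */
    static List<String> collect(String xml) {
        TitleCollector handler = new TitleCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.titles;
    }
}
```

The logic for a single element is spread over three methods plus
a field that carries state between them, which is the cost a pull
API is meant to remove.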

-- 
Andy Clark * andyc@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

Re: [Discuss] Pull Parsing, JSR-173, and Xerces

Posted by Jeremy Carroll <jj...@hplb.hpl.hp.com>.

I am afraid I do not know anything about JSR-173, but I am a user of the 
Xerces pull parser, and I like the very simple parse-some API offered.

I offer this brief system description for my RDF/XML parser (which is the 
one used by the W3C RDF Validator).

My code consists of two parsers (an XML parser and an RDF parser) that 
conceptually act as coroutines. I initially coded them as two threads - the 
XML parser based around the SAX interface. The relevant SAX events gave rise 
to a sequence of events (defined by me) which were the input to the second 
parser.

The original implementation used two threads for the two coroutines.

Using the Xerces pull parser, I have inverted the XML parser, and made it a 
subroutine to the RDF parser. This needed minimal code changes, and leaves 
a clear conceptual coroutine design, all running in a single thread.

Advantages of the Xerces design are:
- the events being pulled are defined by the user rather than the standard

Disadvantages of the Xerces design are:
- the user has to manage the event buffer

A particular issue in my implementation is that I turn attribute value 
pairs into events, and they are placed in the buffer subject to order 
constraints defined by me - in particular the rdf: attributes come before 
other attributes. Clearly this would not be appropriate for most users.

A second issue is error handling - I turn all errors into error events and 
place them in the event buffer. This means they are reported at the 
appropriate point in the second round of processing.

Hope this helps.

My ideal is that a pull parsing standard should be close to the current 
Xerces code.

Jeremy



Re: [Discuss] Pull Parsing, JSR-173, and Xerces

Posted by Andy Clark <an...@apache.org>.

Elliotte Rusty Harold wrote:
> In my opinion, JSR-173/StAX shouldn't be implemented in the Xerces 
> parser. It's a very different API that needs its own parsers, tuned 
> precisely for its model. Xerces is quite bloated as it is. I would 
> prefer not to have to carry around StAX classes I don't want or need just 
> to use Xerces's excellent SAX parser (and vice versa). I would prefer 
> any Apache StAX implementation to be a separate project.

It's interesting that you mention this point because
I have put some thought into how to implement this
API efficiently in Xerces. Since XNI is based on a
push model, the conversion from push to pull at the
boundary induces buffering and degrades performance.

To overcome such a deficiency, the parser should
really be implemented as pull internally. However,
then we lose the modularity and configurability of
XNI. And trying to get that same level of modularity
from a purely pull parser greatly complicates the
implementation. Which is precisely why the Java pull
parsers I've reviewed are extreme subsets of full
XML parsers, without even DTD validation.

The only problem I see with your suggestion is what
it means for the application developer. Having a
separate implementation means different feature sets,
possibly different behavior, and different bugs to
deal with.

-- 
Andy Clark * andyc@apache.org


Re: [Discuss] Pull Parsing, JSR-173, and Xerces

Posted by Elliotte Rusty Harold <el...@metalab.unc.edu>.

At 9:48 PM -0700 10/5/03, Andy Clark wrote:

>I am concerned that the API, as it stands, will not
>adequately meet the needs of XML developers. Moreover,
>I have concerns about implementing it efficiently in
>the Xerces parser.

In my opinion, JSR-173/StAX shouldn't be implemented in the Xerces 
parser. It's a very different API that needs its own parsers, tuned 
precisely for its model. Xerces is quite bloated as it is. I would 
prefer not to have to carry around StAX classes I don't want or need 
just to use Xerces's excellent SAX parser (and vice versa). I would 
prefer any Apache StAX implementation to be a separate project.

-- 

   Elliotte Rusty Harold
   elharo@metalab.unc.edu
   Processing XML with Java (Addison-Wesley, 2002)
   http://www.cafeconleche.org/books/xmljava
   http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA


Re: [Discuss] Pull Parsing, JSR-173, and Xerces

Posted by Andy Clark <an...@apache.org>.

Bob Foster wrote:
> head with a benchmark or two. Are there any realistic prototypes of the 
> JSR proposal vs. your suggestion? I'm sure what you suggest is faster; 
> I'm not sure how much.

Unfortunately, I don't have the source of the JSR-173
reference implementation. Even if I did, comparing two
different parsers with a different API doesn't prove
much. But we really don't need an implementation in
order to make some comparisons.

The cursor API in the JSR can be used as a baseline
because it doesn't create any objects and therefore
doesn't have the overhead of the event API. But if the
event API were changed to return singletons, it would
be the same as the cursor API -- it would use as little
memory as the cursor API but would have the added
benefit of being layered nicely.
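
[Editorial sketch] The two styles being compared can be seen
side by side in the javax.xml.stream API as it later shipped in
the JDK. The public-review draft discussed in this thread may
differ in details, so treat this as an illustration rather than
as the draft's exact API. The class and method names StaxStyles,
namesWithCursor and namesWithEvents are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

final class StaxStyles {
    /** Cursor style: one reader object whose accessors reflect the current item. */
    static List<String> namesWithCursor(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    /** Event style: each step hands back a discrete XMLEvent object. */
    static List<String> namesWithEvents(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    names.add(event.asStartElement().getName().getLocalPart());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<a><b/><c>t</c></a>";
        System.out.println(namesWithCursor(xml)); // [a, b, c]
        System.out.println(namesWithEvents(xml)); // [a, b, c]
    }
}
```

Both loops do the same job. The difference is whether the data
for each item lives in the reader itself or in a separate object
that is returned to the caller.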

> One glaring problem does jump out, though. Section 4.5.1, 
> XMLInputFactory, describes yet another non-threadsafe parser 
> configuration. It seems insane that an application can't control its own 
> class loading in a threadsafe way. This one is even worse than the SAX 
> API - _none_ of the ways to configure what gets loaded is threadsafe, 
> nor is there any way for an application to disable any of them. Is it 
> too late to fix this?

What do you mean when you say that it's even worse
than SAX?  Are you actually referring to the fact
that parsers are not re-entrant? That doesn't present
a problem for me because it makes no sense for a
single parser instance to parse two documents at the
same time.

P.S. Sorry about the reply-to confusion.

-- 
Andy Clark * andyc@apache.org


Re: [Discuss] Pull Parsing, JSR-173, and Xerces

Posted by Bob Foster <bo...@objfac.com>.

Andy Clark wrote:
> There are two camps of thought in JSR-173: one that
> wants a single interface iterator model and another
> that wants discrete event objects to represent the
> various parts of the document. The first is designed
> with small footprint in mind while the second is more
> OO and allows apps to conveniently save document
> content.
> 
> To appease both camps, JSR-173 includes both approaches
> in the API. This is wrong. Users would be better served
> by a single, simpler, more integrated approach.
> 
> I favor the event approach with the fundamental change
> that the event objects returned are singletons owned
> by the parser. If the application wants the information
> stored within the object, the app must copy the info out
> of the singleton and save it.

After a very cursory reading, I'd be inclined to agree. But I'm a sucker 
for designed-in performance arguments and I recognize that others are 
unconvinced by hypothetical performance; you have to whack 'em over the 
head with a benchmark or two. Are there any realistic prototypes of the 
JSR proposal vs. your suggestion? I'm sure what you suggest is faster; 
I'm not sure how much.

One glaring problem does jump out, though. Section 4.5.1, 
XMLInputFactory, describes yet another non-threadsafe parser 
configuration. It seems insane that an application can't control its own 
class loading in a threadsafe way. This one is even worse than the SAX 
API - _none_ of the ways to configure what gets loaded is threadsafe, 
nor is there any way for an application to disable any of them. Is it 
too late to fix this?

Bob Foster
http://www.xmlbuddy.com/


Re: [Discuss] Pull Parsing, JSR-173, and Xerces

Posted by Andy Clark <an...@apache.org>.

David M Williams wrote:
> You say you favor an "event" approach, but I thought the lack of events 
> was the very definition of a "pull parser" and event driven approaches 
> were the "push parsers"?

The word "event" is overloaded a lot when talking
about XML programming. Personally, I have used the
term in the past in reference to push based APIs
like SAX and XNI. But when I talk about events, I
am really talking about the fact that the document
information is reported as a series of discrete
items. So whether it's pushed at you or you pull
it, they're both events in my mind.

To clarify, I meant to say that I prefer the event
*object* approach. This means that the information
items are encapsulated within discrete objects
instead of requiring the application to call a set
of methods based on the item type.

> I haven't worked with or even studied the API ... but the spec itself 
> seems to have a higher than usual number of sections that say they are 
> "optional". That always makes me think some people in the inner circle 
> want it, no one can work out the details in time for agreement, so it's 
> left optional, and those that implement/gain acceptance first then 
> support that part of the standard de facto, without proper public review. 

Another problem with all these optional components
is that some parsers will implement them and some
won't. In the end, users will have to rely on a
specific parser implementation and lose the ability
to swap parsers at will.

> My own interest in this API is in the parts that allow another parser to be 
> written on the "output" of the "pull" operation. So, things like 'skip' 
> and 'backup' are important. I did see that 'skip' would be supported, 
> but never heard about 'backup' (there was a section that "random access" 
> was part of this spec, which I think is ok).

The only way to truly make a "skip" operation fast,
instead of being just a convenience, is to implement
it at the parser level and not from the output. By
the time it hits an external filter all the work has
been done already so there's very little to gain but
convenience.
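
[Editorial sketch] To make the point concrete, here is what a
"skip" written on top of a pull API looks like, using the
javax.xml.stream cursor API as later shipped in the JDK (the
draft under discussion may differ). SkipFilter, skipElement and
namesSkipping are hypothetical names. Note that the helper still
pulls every item of the skipped subtree through the parser; it
saves the caller code, not the parser work.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class SkipFilter {
    /**
     * Call while positioned on a START_ELEMENT. Consumes items up to and
     * including the matching END_ELEMENT and returns how many were consumed.
     */
    static int skipElement(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        int consumed = 0;
        while (depth > 0) {
            int type = reader.next(); // the parser still scans this content
            consumed++;
            if (type == XMLStreamConstants.START_ELEMENT) depth++;
            else if (type == XMLStreamConstants.END_ELEMENT) depth--;
        }
        return consumed;
    }

    /** Lists element names, ignoring any subtree rooted at skipName. */
    static List<String> namesSkipping(String xml, String skipName) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) continue;
                if (skipName.equals(reader.getLocalName())) {
                    skipElement(reader);
                } else {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<a><skip><x/><y>t</y></skip><b/></a>";
        System.out.println(namesSkipping(xml, "skip")); // [a, b]
    }
}
```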

-- 
Andy Clark * andyc@apache.org


Re: [Discuss] Pull Parsing, JSR-173, and Xerces

Posted by David M Williams <da...@us.ibm.com>.

Andy, I'll publicly express my ignorance about this important topic, and 
make a few comments anyway.

You say you favor an "event" approach, but I thought the lack of events 
was the very definition of a "pull parser" and event driven approaches 
were the "push parsers"?

I guess I have a fear about events that (try to) represent a basically 
linear process: my concern is that they make multithreaded apps harder to 
write, though I have no projects to prove it.

I haven't worked with or even studied the API ... but the spec itself 
seems to have a higher than usual number of sections that say they are 
"optional". That always makes me think some people in the inner circle 
want it, no one can work out the details in time for agreement, so it's 
left optional, and those that implement/gain acceptance first then support 
that part of the standard de facto, without proper public review. (just my 
intuition, no data that this is the case here)

My own interest in this API is in the parts that allow another parser to be 
written on the "output" of the "pull" operation. So, things like 'skip' 
and 'backup' are important. I did see that 'skip' would be supported, but 
never heard about 'backup' (there was a section that "random access" was 
part of this spec, which I think is ok). 

Hope these comments spur further discussion. 

Thanks for the education,

David







