Posted to j-dev@xerces.apache.org by Arnaud Le Hors <le...@us.ibm.com> on 2001/03/13 03:56:59 UTC

[Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Hi all,

The processing model is the order in which the different operations are
performed. For XML, this is the order in which you perform operations
such as document scanning, namespace binding, validation, xincluding,
etc...

There is no such thing as a standard processing model of any kind today.
The main reason is that there is just not one processing model that
would be universal enough that everybody could agree on. Based on your
application you may want a model that is different from someone else's.
For instance, you may want to process xincludes before validating, while
someone else may want to do it after.

The concept of a pipeline made of components that we're introducing in
Xerces2 should definitely help provide people with a framework flexible
enough that they can implement the processing model that fits their
needs without having to write too much code. As a matter of fact, the
way you set up your pipeline is what defines your processing model.
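
The XInclude ordering example above can be sketched in a few lines of
Java. This is only an illustration, not Xerces2 code: the Stage
interface, the toy XINCLUDE and VALIDATE stages, and the string standing
in for the document are all assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names are taken from the Xerces2 code base.
class PipelineOrderSketch {

    /** One pipeline stage: takes the document text and returns the (possibly changed) text. */
    interface Stage {
        String process(String document, List<String> log);
    }

    /** Stand-in for XInclude processing: expands a placeholder element. */
    static final Stage XINCLUDE = (doc, log) -> {
        log.add("xinclude");
        return doc.replace("<include/>", "<p/>");
    };

    /** Stand-in for validation: the toy "grammar" requires a <p/> element. */
    static final Stage VALIDATE = (doc, log) -> {
        log.add(doc.contains("<p/>") ? "validate:ok" : "validate:error");
        return doc;
    };

    /** Runs the stages in list order; that order is the processing model. */
    static List<String> run(List<Stage> pipeline, String document) {
        List<String> log = new ArrayList<>();
        String current = document;
        for (Stage stage : pipeline) {
            current = stage.process(current, log);
        }
        return log;
    }

    public static void main(String[] args) {
        String doc = "<doc><include/></doc>";
        System.out.println(run(Arrays.asList(XINCLUDE, VALIDATE), doc));
        System.out.println(run(Arrays.asList(VALIDATE, XINCLUDE), doc));
    }
}
```

The same two stages give a different validation outcome depending only
on how the list is ordered, which is the sense in which the pipeline
setup defines the processing model.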

When we first implemented Xerces2 we started with a base XMLParser that
set the pipeline. From this class we derived subclasses such as
SAXParser and DOMParser that basically added the extra layer on the
output to perform the appropriate massage of the information coming out
of the pipeline.

The problem with this architecture is that to change the pipeline one
had to change the base class and therefore duplicate all its subclasses.
As a remedy I turned the class hierarchy around, implementing the SAX and
DOM stuff in a set of abstract classes that work independently of the
actual pipeline. From this layer I then implemented a set of concrete
classes that only add the pipeline setup.
This was clearly better because one could then change the pipeline by
simply creating a new concrete class without having to duplicate all the
SAX and/or DOM specific code.
However, the code to set up the pipeline, although small, still ended up
being largely duplicated. Talking with Andy, we came up with the idea of
a ParserConfiguration object.

The idea is to move the pipeline setup code from the parser class
hierarchy to a different class and get the parser to use such an object
to get its pipeline. I experimented with this and I'm quite happy with
it. The big gain is that I can now write a single class that sets the
pipeline in a certain way and simply use an object of that type with
whatever parser I want (SAX or DOM).

There are still difficulties to deal with though. The main problem is to
decide where to draw the line between the parser and its configuration
object. I think only experience will help us here.

I gave it a shot and implemented a ParserConfiguration abstract class
with a subclass StandardParserConfiguration that creates the pipeline
we're used to. As a proof of concept I was also able to write a
MiniParserConfiguration that doesn't have any validator or grammar
support (this doesn't give you a compliant parser, but that's beside the
point). It works nicely. I can instantiate a SAXParser or DOMParser
with my MiniParserConfiguration and get the expected results.
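
To make the split between parser and configuration concrete, here is a
minimal Java sketch. The class names echo the ones mentioned above, but
the members, the FrontEnd class, and the stage names are hypothetical
and are not the actual Xerces2 code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class names echo the message above, but every member
// and every stage name is invented for illustration, not taken from Xerces2.
class ParserConfigurationSketch {

    /** Owns the pipeline setup, independently of any SAX or DOM front end. */
    abstract static class ParserConfiguration {
        abstract List<String> createPipeline();
    }

    /** Assumed "usual" pipeline, including a validator. */
    static class StandardParserConfiguration extends ParserConfiguration {
        @Override
        List<String> createPipeline() {
            List<String> stages = new ArrayList<>();
            stages.add("scanner");
            stages.add("validator");
            return stages;
        }
    }

    /** Stripped-down pipeline without validator or grammar support. */
    static class MiniParserConfiguration extends ParserConfiguration {
        @Override
        List<String> createPipeline() {
            List<String> stages = new ArrayList<>();
            stages.add("scanner");
            return stages;
        }
    }

    /** A front end only adapts pipeline output to an API; it never builds the pipeline. */
    static class FrontEnd {
        private final String api;
        private final ParserConfiguration configuration;

        FrontEnd(String api, ParserConfiguration configuration) {
            this.api = api;
            this.configuration = configuration;
        }

        String describe() {
            return String.join(" -> ", configuration.createPipeline()) + " -> " + api;
        }
    }

    public static void main(String[] args) {
        ParserConfiguration mini = new MiniParserConfiguration();
        System.out.println(new FrontEnd("SAX", mini).describe());
        System.out.println(new FrontEnd("DOM", mini).describe());
        System.out.println(new FrontEnd("SAX", new StandardParserConfiguration()).describe());
    }
}
```

One configuration class is written once and handed to either front end,
which is the gain described above.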

The code still needs to be polished, but I think that's the way to go.
Any comments?
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Andy Clark <an...@apache.org>.

I fully take it back -- it looks like the discussion is starting
to pick up again! :) It's just like when you wash your car, you
*know* it's going to rain...

But I'll have to follow up on the discussion tomorrow. Gotta
catch the train home...

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Andy Clark <an...@apache.org>.

Ted Leung wrote:
> Aw, c'mon.  At least give the rest of us with day jobs 24 hours before you
> say that feedback is slow in coming....

I was actually referring to the other discussions that I've
started. I realize that people are doing this on their
off-time but I want to get somewhere and I don't like sitting
on my hands.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Ted Leung <tw...@sauria.com>.

Aw, c'mon.  At least give the rest of us with day jobs 24 hours before you
say that feedback is slow in coming....

Ted
----- Original Message -----
From: "Andy Clark" <an...@apache.org>
To: <xe...@xml.apache.org>
Sent: Tuesday, March 13, 2001 1:02 PM
Subject: Re: [Xerces2] Processing Models, Pipeline, and Parser
configurations [Long]


> You call that "[Long]"? Ha! ;)
>
> The feedback on design decisions is rather slow in coming so I
> propose that we start implementing this stuff in the parser as
> soon as possible. I helped in designing the parser configuration
> stuff so I don't see any problems with it. ;) Let's go for it!
>
> --
> Andy Clark * IBM, TRL - Japan * andyc@apache.org
>



Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Andy Clark <an...@apache.org>.

You call that "[Long]"? Ha! ;)

The feedback on design decisions is rather slow in coming so I
propose that we start implementing this stuff in the parser as
soon as possible. I helped in designing the parser configuration
stuff so I don't see any problems with it. ;) Let's go for it!

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Arnaud Le Hors <le...@us.ibm.com>.

Andy Clark wrote:
> 
> Ted Leung wrote:
> > We *may* get to the point where people want to do a lot of pipeline
> > rearranging.  How feasible would it be to connect up pipeline stages
> > via an XML configuration file?
> 
> Perfectly feasible as far as I can tell. It's just a matter
> of writing a ParserConfiguration object that is constructed
> with a systemId or InputSource in order to parse some kind
> of XML configuration file. SMOP.

This is clearly the next step for me. Once our parser is written to work
with a given ParserConfiguration we can use any way we want to create
these ParserConfiguration objects. Doing so based on an XML file seems
completely natural to me.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group


Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Andy Clark <an...@apache.org>.

Ted Leung wrote:
> We *may* get to the point where people want to do a lot of pipeline
> rearranging.  How feasible would it be to connect up pipeline stages 
> via an XML configuration file?

Perfectly feasible as far as I can tell. It's just a matter
of writing a ParserConfiguration object that is constructed
with a systemId or InputSource in order to parse some kind
of XML configuration file. SMOP.
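
A rough Java sketch of what such an object might look like follows. The
<pipeline>/<stage> vocabulary, the class name, and the example stage
class names are all invented for illustration; only the JAXP and SAX
calls (DocumentBuilderFactory, InputSource) are real APIs. The sketch
stops at reading the stage order and does not instantiate anything.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: the XML vocabulary and this class are assumptions.
class XmlDrivenConfigurationSketch {

    private final List<String> stageClassNames = new ArrayList<>();

    /** Reads the stage order from an XML configuration document. */
    XmlDrivenConfigurationSketch(InputSource source) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(source);
        // getElementsByTagName returns matches in document order,
        // so the order of <stage> elements is the pipeline order.
        NodeList stages = doc.getElementsByTagName("stage");
        for (int i = 0; i < stages.getLength(); i++) {
            Element stage = (Element) stages.item(i);
            stageClassNames.add(stage.getAttribute("class"));
        }
    }

    List<String> getStageClassNames() {
        return stageClassNames;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<pipeline>"
                + "<stage class='example.Scanner'/>"
                + "<stage class='example.Validator'/>"
                + "</pipeline>";
        XmlDrivenConfigurationSketch config =
                new XmlDrivenConfigurationSketch(new InputSource(new StringReader(xml)));
        System.out.println(config.getStageClassNames());
    }
}
```

Note that the sketch leans on an already working XML parser to read the
configuration file, so a real implementation would need a fixed
bootstrap configuration for that step.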

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Andy Clark <an...@apache.org>.

Ian Roberts wrote:
> Going to 1.2 would prevent Xerces working on the Microsoft JVM, 
> which could be a problem for some people.

I'm glad that you are stating your need to remain on the Java
1.1.x API for the parser. Many people have said that there's
no one left using 1.1.x and we shouldn't worry about backwards
compatibility. But as long as we have users who need the
parser to work under 1.1.x, then we should accommodate them.

Besides, there's no real reason to rely on Java 1.2 API. For
example: the Java collection classes. Because performance is
a key issue, it's faster to implement custom data structures.

Having said that, though, I believe that some Java 1.2+ code
has been inserted into the code recently. I can't recall
what at the moment but I can find out just by recompiling
under 1.1.x. So this should be fixed.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Ian Roberts <ir...@decisionsoft.com>.

On Wed, 14 Mar 2001, Andy Clark wrote:

> Is backwards compatibility necessary? Unlike the situation last
> year, perhaps now is a good time to break with the past which
> means allowing the use of Java 1.2 API like collections (even 
> though they aren't really needed), and not implementing SAX1.
> The latter could be done through the SAX2 adapter helper class.

Going to 1.2 would prevent Xerces working on the Microsoft JVM, which
could be a problem for some people.

Ian

-- 
Ian Roberts, Software Engineer        DecisionSoft Ltd.
tel: +44-1865-203192                  http://www.decisionsoft.com



Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Arnaud Le Hors <le...@us.ibm.com>.

Andy Clark wrote:
> 
> Arnaud Le Hors wrote:
> > What you're seeing is the result of several evolutions so it may not be
> > optimal. We need to keep SAXParser and DOMParser backwards compatible
> > with Xerces1, which makes things a little heavier than they would
> > otherwise be. But in any case I'm definitely open to any specific suggestion
> > for improvement. This is still very much a prototype.
> 
> Is backwards compatibility necessary?

Yes, very much so.

> Unlike the situation last
> year, perhaps now is a good time to break with the past which
> means allowing the use of Java 1.2 API like collections (even
> though they aren't really needed), and not implementing SAX1.
> The latter could be done through the SAX2 adapter helper class.

Unless you can show me a clear gain in doing anything like that I'm
strongly against it.
Backwards compatibility is paramount.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group


Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Andy Clark <an...@apache.org>.

Arnaud Le Hors wrote:
> What you're seeing is the result of several evolutions so it may not be
> optimal. We need to keep SAXParser and DOMParser backwards compatible
> with Xerces1, which makes things a little heavier than they would
> otherwise be. But in any case I'm definitely open to any specific suggestion
> for improvement. This is still very much a prototype.

Is backwards compatibility necessary? Unlike the situation last
year, perhaps now is a good time to break with the past which
means allowing the use of Java 1.2 API like collections (even 
though they aren't really needed), and not implementing SAX1.
The latter could be done through the SAX2 adapter helper class.
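
For reference, the SAX2 helper that adapts an XMLReader to the SAX1
Parser interface is presumably org.xml.sax.helpers.XMLReaderAdapter. A
minimal sketch of serving SAX1 clients that way, assuming the JDK's
default SAX2 reader stands in for the Xerces2 one (the class and method
names of the sketch itself are hypothetical):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderAdapter;

// Sketch only: the SAX1 types used here are deprecated but still ship with the JDK.
class Sax1AdapterSketch {

    /** Parses the text through the SAX1 Parser API and returns the element names seen. */
    static String elementNames(String xml) throws Exception {
        // Any SAX2 XMLReader will do; the JDK default is an assumption of this sketch.
        XMLReader sax2 = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // XMLReaderAdapter exposes a SAX2 XMLReader as a SAX1 Parser.
        Parser sax1 = new XMLReaderAdapter(sax2);
        final StringBuilder names = new StringBuilder();
        sax1.setDocumentHandler(new HandlerBase() {
            @Override
            public void startElement(String name, AttributeList attributes) {
                names.append(name).append(' ');
            }
        });
        sax1.parse(new InputSource(new StringReader(xml)));
        return names.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<doc><a/><b/></doc>"));
    }
}
```

With this approach the parser itself would only implement SAX2, and SAX1
callers would wrap it at the point of use.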

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Arnaud Le Hors <le...@us.ibm.com>.

Ted Leung wrote:
> 
> Arnaud,
> 
> I definitely agree that pipeline configuration ought to be
> centralized/refactored.
> I glanced through the code and it seems good to me.  The hierarchy is a
> little deep for my liking, but okay.

What you're seeing is the result of several evolutions so it may not be
optimal. We need to keep SAXParser and DOMParser backwards compatible
with Xerces1, which makes things a little heavier than they would
otherwise be. But in any case I'm definitely open to any specific suggestion
for improvement. This is still very much a prototype.

> I assume that if I wanted to build the much maligned JDOM
> parser, that I could start from AbstractXMLDocumentParser?

Yes.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group


Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Ted Leung <tw...@sauria.com>.

Arnaud,

I definitely agree that pipeline configuration ought to be
centralized/refactored. I glanced through the code and it seems good
to me. The hierarchy is a little deep for my liking, but okay. I assume
that if I wanted to build the much maligned JDOM parser, that I could
start from AbstractXMLDocumentParser?

We *may* get to the point where people want to do a lot of pipeline
rearranging. How feasible would it be to connect up pipeline stages via
an XML configuration file?

Ted



Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Andy Clark <an...@apache.org>.

Arnaud Le Hors wrote:
> classes that one may not like. So there is a conflict between sharing
> code and keeping the base classes as free of dependencies as possible.
> We need to find the right trade-off. I believe only experience will
> help.

We can always make the abstract class just rely on factory methods
for the various pieces of the standard pipeline. This would handle
the case where the user only wants to replace a piece of the
pipeline (or leave out the validator).

There is another popular case where we need to be able to insert
another component into the pipeline. For example, XInclude. How
do we do that in a generic way?

> I assume this would imply a one-to-one mapping between name and function
> of the component.
> But what if I have a component that actually implements more than one
> operation and I register under different names?

Do you have an example or do you just want to keep it as open
as possible?

> I think the configuration should let each component deal with its own
> set of features. That would allow both to make the configuration code

If the features/properties are unique to that component. What
about features like "http://xml.org/sax/features/validation"
that are needed by multiple components? In this case, the
configuration must manage them. And anytime that a feature
that was managed by a single component is needed by multiple
components, then the management of that feature must move up
into the configuration.

So this argues for keeping all features/properties in the
configuration. But perhaps we can design a system where each
component tells the configuration about all of the features
and properties that it's looking for. In this way, the
configuration can dynamically update its list of allowed
features/properties.
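
One way to picture that last idea is the following Java sketch. The
Component interface, its recognizedFeatures method, and the
Configuration class are hypothetical, and an unchecked
IllegalArgumentException stands in for SAXNotRecognizedException to keep
the example short.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: the interface and method names are invented, not the Xerces2 API.
class FeatureRegistrySketch {

    /** A component announces the feature identifiers it wants to be told about. */
    interface Component {
        List<String> recognizedFeatures();
    }

    /** The configuration owns all feature state and learns the allowed set from its components. */
    static class Configuration {
        private final Set<String> allowed = new HashSet<>();
        private final Set<String> enabled = new HashSet<>();

        void addComponent(Component component) {
            allowed.addAll(component.recognizedFeatures());
        }

        void setFeature(String id, boolean state) {
            if (!allowed.contains(id)) {
                throw new IllegalArgumentException("Feature not recognized: " + id);
            }
            if (state) {
                enabled.add(id);
            } else {
                enabled.remove(id);
            }
        }

        boolean getFeature(String id) {
            return enabled.contains(id);
        }
    }

    public static void main(String[] args) {
        String validation = "http://xml.org/sax/features/validation";
        Component scanner = () -> Arrays.asList(validation);
        Component validator = () -> Arrays.asList(validation);

        Configuration configuration = new Configuration();
        configuration.addComponent(scanner);
        configuration.addComponent(validator);
        configuration.setFeature(validation, true);
        System.out.println(configuration.getFeature(validation));
    }
}
```

A feature needed by several components is stored once in the
configuration, and the set of allowed identifiers grows as components
are added.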

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Ted Leung <tw...@sauria.com>.

----- Original Message -----
From: "Arnaud Le Hors" <le...@us.ibm.com>
To: <xe...@xml.apache.org>
Sent: Tuesday, March 13, 2001 11:11 AM
Subject: Re: [Xerces2] Processing Models, Pipeline, and Parser
configurations [Long]


> Andy Clark wrote:
> > ...
> >   * Duplication of implementation across parser configurations.
> >
> > Even though I extend the BaseParserConfiguration object to get
> > a lot of the implementation for free, I'm finding that I have
> > to code a lot more than I want to. So I'm cut-and-pasting a
> > lot of stuff from StandardParserConfiguration.
> > ...
>
> There is an inherent problem here that I just don't know how to deal
> with. Basically, sharing code requires moving it as high as possible
> in the class hierarchy. However, this creates dependencies in the base
> classes that one may not like. So there is a conflict between sharing
> code and keeping the base classes as free of dependencies as possible.
> We need to find the right trade-off. I believe only experience will
> help.
>

There's another option, which is to put some of the shared code into a
class and delegate to an instance of it.  You're not using inheritance, but
if you're going to take a method call hit (which you will to call the
superclass method) then there isn't really any reason why you have to be
in the same object.
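
A minimal Java sketch of that delegation option, with hypothetical names
throughout (FeatureStore and both configuration classes are invented for
illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: all names are invented for illustration.
class DelegationSketch {

    /** Shared feature bookkeeping, kept in a helper instead of a base class. */
    static class FeatureStore {
        private final Map<String, Boolean> states = new HashMap<>();

        void set(String id, boolean state) {
            states.put(id, state);
        }

        boolean get(String id) {
            return states.getOrDefault(id, Boolean.FALSE);
        }
    }

    interface Configuration {
        void setFeature(String id, boolean state);
        boolean getFeature(String id);
    }

    /** Forwards to its own FeatureStore; no shared superclass is involved. */
    static class StandardConfiguration implements Configuration {
        private final FeatureStore features = new FeatureStore();

        @Override
        public void setFeature(String id, boolean state) {
            features.set(id, state);
        }

        @Override
        public boolean getFeature(String id) {
            return features.get(id);
        }
    }

    /** An unrelated configuration reuses the same helper in the same way. */
    static class MiniConfiguration implements Configuration {
        private final FeatureStore features = new FeatureStore();

        @Override
        public void setFeature(String id, boolean state) {
            features.set(id, state);
        }

        @Override
        public boolean getFeature(String id) {
            return features.get(id);
        }
    }

    public static void main(String[] args) {
        Configuration mini = new MiniConfiguration();
        mini.setFeature("validation", true);
        System.out.println(mini.getFeature("validation"));
    }
}
```

The forwarding methods are boilerplate, but the base classes stay free
of the dependencies that the shared code would otherwise drag in.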

Just a thought...

Ted



Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Arnaud Le Hors <le...@us.ibm.com>.

Andy Clark wrote:
> ...
>   * Duplication of implementation across parser configurations.
> 
> Even though I extend the BaseParserConfiguration object to get
> a lot of the implementation for free, I'm finding that I have
> to code a lot more than I want to. So I'm cut-and-pasting a
> lot of stuff from StandardParserConfiguration.
> ...

There is an inherent problem here that I just don't know how to deal
with. Basically, sharing code requires moving it as high as possible
in the class hierarchy. However, this creates dependencies in the base
classes that one may not like. So there is a conflict between sharing
code and keeping the base classes as free of dependencies as possible.
We need to find the right trade-off. I believe only experience will
help.

> [As a side question: Should the XMLComponent interface include
> a method that returns its component name?]

I assume this would imply a one-to-one mapping between name and function
of the component.
But what if I have a component that actually implements more than one
operation and I register under different names?

>   * Components that request features and properties that the
>     configuration doesn't know about, will cause the parser
>     to barf and die.
> 
> ...

I think the configuration should let each component deal with its own
set of features. That would both make the configuration code simpler
and avoid duplicating this code in every configuration class
that uses the same components.
This doesn't solve the problem for properties though. There I think
we're stuck. It's inherently tied to the configuration. On the other
hand it's typically less code.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group


Re: [Xerces2] Processing Models, Pipeline, and Parser configurations [Long]

Posted by Andy Clark <an...@apache.org>.

Arnaud Le Hors wrote:
> Any comments?

Yes. The concept is great (I know because I helped with it ;) but
the execution needs a little work. I'm actually trying to create
a parser configuration and it's too much work -- and not
self-documenting enough.

I'm running into multiple problems, all of which can be fixed to make
custom configurations simpler to implement. Here's a partial
list of what I've found:

  * The parser configuration changes aren't on the correct
    CVS branch. Instead of being on the announced branch,
    "parserConfig", the changes are spread across that branch
    and the "xerces_j_2" branch.

I already fixed this problem.

  * Unsure what I need to override and implement in my parser
    configuration.

This is simple -- just write more documentation. At this stage
in design, though, I don't think that we should waste a lot of
time documenting something that may change drastically. Just
adding another documentation todo item. [Anybody yearning to
document some of this stuff we're talking about?]

  * Duplication of implementation across parser configurations.

Even though I extend the BaseParserConfiguration object to get
a lot of the implementation for free, I'm finding that I have
to code a lot more than I want to. So I'm cut-and-pasting a
lot of stuff from StandardParserConfiguration.

Here's my scenario: I'm trying to replace the entire validation
engine in the parser. (This work is related to my post earlier
where I said I want to re-implement the validation engine.) 
This is the same scenario if I were trying to insert a stage 
into the document pipeline such as XInclude. So I think that
this will come up a lot.

Here's what I want: I want to extend the abstract configuration
and just implement the abstract (or no-op'd?) method
createValidator to return my validator (of type XMLDocumentFilter).
If we no-op'd this method, then the default behavior would be
to exclude the validator if it returned a null pointer.
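
A small Java sketch of that factory-method arrangement follows. Only the
createValidator name comes from the text above; DocumentFilter here is a
simplified stand-in for XMLDocumentFilter, and the other class names and
stage names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: only the createValidator name comes from the message above.
class FactoryMethodSketch {

    /** Simplified stand-in for a document filter stage. */
    interface DocumentFilter {
        String name();
    }

    static class BaseConfiguration {

        /** No-op default: returning null means "no validator in the pipeline". */
        protected DocumentFilter createValidator() {
            return null;
        }

        /** Assembles the pipeline, skipping any stage whose factory returned null. */
        final List<String> buildPipeline() {
            List<String> stages = new ArrayList<>();
            stages.add("scanner");
            DocumentFilter validator = createValidator();
            if (validator != null) {
                stages.add(validator.name());
            }
            stages.add("handler");
            return stages;
        }
    }

    /** A custom configuration overrides just the one factory method. */
    static class CustomConfiguration extends BaseConfiguration {
        @Override
        protected DocumentFilter createValidator() {
            return () -> "my-validator";
        }
    }

    public static void main(String[] args) {
        System.out.println(new BaseConfiguration().buildPipeline());
        System.out.println(new CustomConfiguration().buildPipeline());
    }
}
```

Replacing the validation engine then costs one overridden method rather
than a copy of the whole pipeline setup.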

[As a side question: Should the XMLComponent interface include
a method that returns its component name?]

  * Components that request features and properties that the
    configuration doesn't know about, will cause the parser
    to barf and die.

Part of the fix for this problem is to change the components to
be more flexible about the features and property settings that
they need from the component manager in order to do their job.
In other words, if the requested feature or property is not
absolutely critical, use a default value instead of allowing
the SAXNotRecognizedException to be thrown up the stack and
out of the parser. Some features and properties *will* be
critical and *should* signal a fatal error. For example, the
SymbolTable is usually required by most components so that the
entire system uses the same String references (for performance).
However, the load-external-dtd feature is not all that critical
and can be ignored if that particular parser configuration
doesn't know about it.
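
As a sketch of that more tolerant lookup in Java: the ComponentManager
class and the featureOrDefault helper are hypothetical, and the short
feature names are stand-ins for the full feature URIs. Only
SAXNotRecognizedException is a real SAX class.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.SAXNotRecognizedException;

// Hypothetical sketch: ComponentManager and featureOrDefault are invented names.
class LenientLookupSketch {

    /** Stand-in for the component manager that a configuration provides. */
    static class ComponentManager {
        private final Map<String, Boolean> features = new HashMap<>();

        void define(String id, boolean state) {
            features.put(id, state);
        }

        /** Strict lookup: unknown features are reported to the caller. */
        boolean getFeature(String id) throws SAXNotRecognizedException {
            Boolean state = features.get(id);
            if (state == null) {
                throw new SAXNotRecognizedException(id);
            }
            return state;
        }
    }

    /** Lenient lookup for non-critical features: falls back to a default value. */
    static boolean featureOrDefault(ComponentManager manager, String id, boolean defaultState) {
        try {
            return manager.getFeature(id);
        } catch (SAXNotRecognizedException e) {
            return defaultState;
        }
    }

    public static void main(String[] args) {
        ComponentManager manager = new ComponentManager();
        manager.define("validation", true);
        // Known feature: the configured state wins.
        System.out.println(featureOrDefault(manager, "validation", false));
        // Unknown, non-critical feature: the component's default is used.
        System.out.println(featureOrDefault(manager, "load-external-dtd", true));
    }
}
```

A component would keep calling the strict getFeature for anything it
cannot work without, so that a missing critical setting still surfaces
as an error.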

Perhaps a better fix, though, is to figure out a way for
parser components to communicate what features and properties
are important to the parser configuration object. I don't
have a solution to this problem, yet, so I'm just throwing
out ideas at this point.

Thoughts?

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org
