Posted to general@xml.apache.org by Scott Boag/CAM/Lotus <Sc...@lotus.com> on 2000/07/11 23:42:10 UTC

Re: parser-next-gen goals, plan, and requirements

First, I would rather see a list of requirements than a list of goals.
The goals below are simply mom and apple pie, in my opinion.  The devil's
in the details.

Xalan XSLT Processor Requirements (or requests) on the Parser (my
opinions):

1) SAX2, of course.
2) Read-only, memory conservative, high performance DOM subset.  In some
ways, this is optional, since the alternative is that the XSLT processor
implements its own DOM, as it does today.  But it would be neat and simpler
if only one DOM implementation needed to exist.
  2a) Document-order indexes or API as a DOM extension.  I know of few or
no conformant XSLT processors that can do without this.
  2b) [optional] isWhite() method as a DOM extension (simply telling
whether or not the text contains non-whitespace), for performance reasons.
  2c) Some sort of weak reference, where nodes could be released if not
referenced, and then rebuilt if requested.  For performance and memory
footprint.
3) parse-next function, with added control over buffer size.
4) Some sort of way to tell if a SAX char buffer is going to be
overwritten, so data doesn't have to be copied until this occurs.
5) Serialization support, as is currently in Assaf's classes.
6) Schema data-type support, which will be needed for XSLT2, and Xalan 2.0
extensions.
7) We should talk about whether XPath should be part of the core XML
services, rather than part of the XSLT processor.
8) Small core footprint for standalone, compiled stylesheet capability, for
use on small devices.  This would need to include the Serializer.  I'm not
sure if this should really be a separate micro-parser?
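
To make requirements 2b and 4 concrete, here is a minimal Java sketch on
top of plain SAX2. It is only an illustration under assumptions: the
isWhite() helper and the TextCollector class are hypothetical names, not
existing DOM, Xerces, or Xalan APIs. It shows a whitespace test that reads
the parser's buffer in place, and the defensive copy a handler has to make
today because SAX gives no signal about when the buffer will be reused.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class TextCollector extends DefaultHandler {
    final List<String> chunks = new ArrayList<>();

    // Hypothetical helper (not a DOM or Xerces method): true if the range
    // contains only XML whitespace characters.
    static boolean isWhite(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
                return false;
            }
        }
        return true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The whitespace test inspects the parser's buffer without copying.
        if (isWhite(ch, start, length)) {
            return;
        }
        // SAX does not promise that ch stays valid after this callback,
        // so any text that must outlive it is copied here.
        chunks.add(new String(ch, start, length));
    }

    // Parses the given XML text and returns the non-whitespace text chunks.
    static List<String> collect(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.chunks;
    }

    public static void main(String[] args) {
        String xml = "<a> <b>hello</b>\n<c>world</c></a>";
        System.out.println(String.join("", collect(xml)));
    }
}
```

A parser that could say "this buffer is about to be overwritten" would let
the copy in characters() be deferred or skipped, which is the point of
requirement 4.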


> GOALS:
>
>     * Simple to read, maintainable code. Above all, this is the primary
>       goal for any openly developed project as without the ability to
>       read the code, it's impossible for people to contribute and get
>       involved.

+1.

>     * Smallest possible size. This means small distribution size (JAR
>       file) and small memory footprint.

+0.  I'm not sure this is compatible with the first goal.  Also, I would
rather have performance and *scalable* memory footprint prioritized over
jar file size.  However, the Xalan project does need this...

Also, I would like to see packaging options to address the jar file size.
I suspect Xerces today could be packaged to a much smaller footprint, if
only the base features were used.

As I said above, perhaps a separate code-base for a micro parser would be a
better option, with support for an XML subset.

>     * Modular. It should be possible to build a parser as a set of Jar
>       files so that a smaller parser can be assembled which fits the need
>       of a particular implementation. For example, in TV sets do you
>       really need validation?

+0, or +1, depending on how you read this.  You may not need validation,
but you may indeed need schema processing for data types, entity refs, etc.

>    * Cleanly Optimized. This means optimized in a way that is compatible
>      with modern virtual machines such as HotSpot. Optimizations that
>      work well with JDK 1.1 style VMs can actually impact performance
>      under more modern VMs. Optimizations that interfere with
>      readability, modularity, or size will be shunned.

-0 or +1, depending on how you read this.  Is it, or is it not, a
requirement to have good performance with JDK 1.1, or even backwards
compatibility?  If not, then I think, sure, let's optimize in a way that is
cleanly compatible with "modern" VMs.

>   * First, factor out utility classes from both the Xerces and Crimson
>       source bases. There is a lot of good work on things like the Xerces
>       decoders which are faster than the JDK's. This is actually the
>       start of an Apache wide common utility set (something that I'd like
>       to see in the future as AUC -- Apache Utility Classes). We've
>       talked about this before in other Apache projects, and there's a
>       lot of good code that we can start it off with here.

Big +1.  I would like to see this done independent of any next-gen work,
for availability to Xalan 2.0 and other projects, sooner, rather than
later.

>     * Determine what the modular API looks like. What are the various
>       pieces that can be factored out? How can we get to a point where
>       it's easy to package a parser that doesn't include DOM or a
>       particular validator? There's some work started on a branch, but it
>       hasn't been touched in a month or so. This might serve as a
>       starting point.

+1.

>    * Refactor out a base parser. Once we see how those APIs should look
>      (or at least get a start, they don't have to be perfect :) we start
>      at the bottom and look at the code of the existing parsers to come
>      up with a basic non-validating parser that can rip through XML.

-1.  I think there is enough knowledge at this point to first put together
a pretty complete design, with a clear understanding of how schema
processing should work with the base parser (maybe they shouldn't -- but I
would argue that point).  Hard problem, in my opinion, and more design
rather than less would result in a better idea of what a base parser should
be.

>    * Set SAX on top of this base parser. Of course.

+1. However, I think there is likely clear evidence that it may benefit
certain high-performance applications to have a much tighter binding to the
parser than SAX supports.  A particular problem is the way that SAX2 treats
character data, and the fact that it's an event-only API, rather than
having by-request characteristics (i.e. parse-next type functionality, so
you can run an incremental parse/transform without having to run an extra
thread).
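
To illustrate the "extra thread" point, here is a rough Java sketch of
what it takes today to get parse-next behaviour out of an event-only SAX
parser. The class PullOverSax, its parseNext() method, and the event
strings are hypothetical names made up for this example; only the SAX and
JAXP calls are real. The parser runs on a background thread and hands
events over through a bounded queue, so the consumer can ask for one event
at a time.

```java
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class PullOverSax {
    private static final String END = "#end";   // end-of-document marker
    private final BlockingQueue<String> events = new ArrayBlockingQueue<>(16);

    PullOverSax(final String xml) {
        // The SAX parser pushes events, so it needs its own thread.
        Thread producer = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(
                        new InputSource(new StringReader(xml)),
                        new DefaultHandler() {
                            @Override
                            public void startElement(String uri, String local,
                                    String qName, Attributes atts) {
                                put("start:" + qName);
                            }

                            @Override
                            public void endElement(String uri, String local,
                                    String qName) {
                                put("end:" + qName);
                            }
                        });
            } catch (Exception e) {
                // A fuller version would hand the failure to the consumer.
            } finally {
                put(END);
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    private void put(String event) {
        try {
            events.put(event);   // blocks while the queue is full
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns the next event, or null at end of document or on interrupt.
    // Calling it again after null would block; this is only a sketch.
    String parseNext() {
        try {
            String event = events.take();
            return END.equals(event) ? null : event;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        PullOverSax parser = new PullOverSax("<a><b/></a>");
        for (String e = parser.parseNext(); e != null; e = parser.parseNext()) {
            System.out.println(e);
        }
    }
}
```

A parser with a native parse-next function would make the thread and the
queue unnecessary, which is the tighter binding being asked for here.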

>    * Look at pluggable validation.

+1, but validation is not the same thing as basic schema and DTD
processing.  Data-types, entity refs, default attributes, etc., tend to be
required.

>    * Factor in tree based producers. We'd like to see DOM and JDOM up
>      front.

-1 on JDOM for the core.  Just my opinion.  I don't like it, I think it
misleads developers about the XML data model, and I would rather not see
Apache support it.

>    * Stability. By this point, we should have something that is starting
>      to work well. Stability will be a driving goal then.

Sure.

-scott

Re: parser-next-gen goals, plan, and requirements

Posted by Jason Hunter <jh...@acm.org>.

Arnaud Le Hors wrote:
> 
> I actually simply don't understand the requirement about JDOM. DOM is an
> API, we need to provide classes that implement the API. This is true for
> JDOM. It's not an API. It's a set of classes that include a builder that
> works on SAX. So as long as we support SAX, which definitely is a
> requirement, we're all set on that front. Let's leave the debate of
> whether JDOM is a good thing or not outside of this project.

I'd like to see Spinnaker/XRI/whatevercodename come equipped with a
powerful and pluggable architecture that allowed for better JDOM
implementations than what the simple SAXBuilder provides.  We have plans
for a deferred implementation (done using subclasses) but this requires
closer interaction with the parser.  A new well-designed and
understandable parser sounds wonderful.

-jh-

Re: parser-next-gen goals, plan, and requirements

Posted by James Duncan Davidson <ja...@eng.sun.com>.

on 7/11/00 4:35 PM, Arnaud Le Hors at lehors@us.ibm.com wrote:

> I don't understand what you're disagreeing with. All of my statements
> are true:
> 
> 1) DOM is only made of interfaces and can only be used if we provide
> classes for it
> 2) JDOM is a set of classes with a builder that works on top of SAX (and
> DOM btw)
> 3) given that we'll support SAX, JDOM can exist

Logical progression. What I thought you were saying with that statement up
there, and I wasn't alone since Brett read it the same way, was that you
didn't understand why we should do any JDOM work in XRI/NG/Whatever.

If we're agreed that something like a SAX++ is the core and everything else
sits on top, then there's nothing to disagree about. :) What a change. :)

.duncan

Re: parser-next-gen goals, plan, and requirements

Posted by Arnaud Le Hors <le...@us.ibm.com>.

James Duncan Davidson wrote:
> 
> on 7/11/00 3:53 PM, Arnaud Le Hors at lehors@us.ibm.com wrote:
> 
> > I actually simply don't understand the requirement about JDOM. DOM is an
> > API, we need to provide classes that implement the API. This is true for
> > JDOM. It's not an API. It's a set of classes that include a builder that
> > works on SAX. So as long as we support SAX, which definitely is a
> > requirement, we're all set on that front. Let's leave the debate of
> > whether JDOM is a good thing or not outside of this project.
> 
> I Disagree.

I don't understand what you're disagreeing with. All of my statements
are true:

1) DOM is only made of interfaces and can only be used if we provide
classes for it
2) JDOM is a set of classes with a builder that works on top of SAX (and
DOM btw)
3) given that we'll support SAX, JDOM can exist
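
As a rough illustration of points 2 and 3, here is a small Java sketch of
a tree of concrete classes built purely from SAX events. It is not JDOM's
actual SAXBuilder or Element; SaxTreeBuilder and TreeNode are hypothetical
names used only to show that a SAX-capable parser is all such a builder
needs.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTreeBuilder extends DefaultHandler {
    // Hypothetical concrete node class standing in for a JDOM-style tree.
    static final class TreeNode {
        final String name;
        final List<TreeNode> children = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        TreeNode(String name) {
            this.name = name;
        }
    }

    // Stack of elements that have been opened but not yet closed.
    private final Deque<TreeNode> open = new ArrayDeque<>();
    private TreeNode root;

    @Override
    public void startElement(String uri, String local, String qName,
            Attributes atts) {
        TreeNode node = new TreeNode(qName);
        if (open.isEmpty()) {
            root = node;
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data only arrives inside an open element.
        open.peek().text.append(ch, start, length);
    }

    // Parses the XML text and returns the root of the resulting tree.
    static TreeNode build(String xml) {
        SaxTreeBuilder builder = new SaxTreeBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), builder);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return builder.root;
    }

    public static void main(String[] args) {
        TreeNode root = build("<a><b>hi</b><c/></a>");
        System.out.println(root.name + " has " + root.children.size()
                + " children");
    }
}
```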

> JDOM is an important up and coming API that already has
> established a large and rapidly growing groundswell of support in the
> developer community.

Fine with me!

> You personally don't have to do the work to provide JDOM support

But this is just SAX!!

> -- as long
> as the core architecture is pluggable and modular, then you can work on DOM
> and Brett and co can work on JDOM and everybody wins.

I certainly hope so.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Technology Group

Re: parser-next-gen goals, plan, and requirements

Posted by Octav Chipara <oc...@cse.unl.edu>.


> 
> 
> Octav Chipara wrote:
> > 
> > > Octav Chipara wrote:
> > > >
> > > > 1) True! ... But my only problem is that when you are using various
> > > > subsets of DOM, do you still comply with the W3C recommendation? I
> > > > believe that we should actually develop a new set of interfaces rather
> > > > than using a subset of DOM! If we are starting from scratch we might get
> > > > better results than trying to subset DOM! Many of us try to build
> > > > applications which understand XML for embedded systems and I see great
> > > > value in having a DOM-like implementation for such systems. Do you believe
> > > > using a subset is good enough?
> > >
> > > The Xerces DOM implementation currently supports almost all of the DOM
> > > Level 2. This is the Core + several optional modules, such as mutation
> > > events, traversal, etc... Every optional module makes the whole thing
> > > bigger in memory and slower. To start with, we can have an
> > > implementation with just the Core. But we can do much more. This
> > > includes things like being readonly, and/or not providing fast random
> > > access a la getChildNodes().item(i). We currently have a cache for that
> > > which costs us in memory. There are many things like that we can do, all
> > > within the scope of a compliant implementation. And as we work on this,
> > > if there is anything in the DOM that gets in the middle we can always
> > > bring it up to W3C and look for a solution there.
> > 
> > This is exactly my point and I agree with your attitude ... But I do not
> > believe that W3C has to solve our problems. Unfortunately :-). My worries
> > are that even the core implementation is too big, that's why I would
> > propose if someone wants to take a look into other possible solutions
> > except DOM and JDOM. When W3C made the recommendation for DOM I do not
> > believe that they had in mind that DOM would be used for embedded systems.
> > My point is that SAX would be the solution for such small systems but it
> > does not have the necessary processing power that I would like... and I
> > was not able to cut down the DOM size to be reasonable for an embedded
> > system. :-(. That's why I was trying to propose to move away from DOM and
> > make something innovative...
> 
> Just out of curiosity, you never explained why JDOM didn't work for you
> (not trying to start a war - just asking the question). I can see how
> the memory required for DOM caused you problems. What about JDOM did? 
> 
> -Brett

   Brett,

   Most of the code that I have written is in C++, so ...

 - Octav
 
> 
> > 
> > >
> > > > 2) The second part regards a higher level tool which I would love to have.
> > > > Something to map the structure of a XML document directly on a structure
> > > > defined in a programming language ...
> > > > ...
> > > > Do you understand what I'm trying to say?
> > >
> > > Sure. There are already several tools that do that. I don't have any
> > > pointers handy though, but I'm sure someone else on this list has some.
> > > --
> > > Arnaud  Le Hors - IBM Cupertino, XML Technology Group
> > >
> > > ---------------------------------------------------------------------
> > > In case of troubles, e-mail:     webmaster@xml.apache.org
> > > To unsubscribe, e-mail:          general-unsubscribe@xml.apache.org
> > > For additional commands, e-mail: general-help@xml.apache.org
> > >
> 
> -- 
> Brett McLaughlin, Enhydra Strategist
> Lutris Technologies, Inc. 
> 1200 Pacific Avenue, Suite 300 
> Santa Cruz, CA 95060 USA 
> http://www.lutris.com
> http://www.enhydra.org

Re: parser-next-gen goals, plan, and requirements

Posted by Brett McLaughlin <br...@lutris.com>.


Octav Chipara wrote:
> 
> > Octav Chipara wrote:
> > >
> > > 1) True! ... But my only problem is that when you are using various
> > > subsets of DOM, do you still comply with the W3C recommendation? I
> > > believe that we should actually develop a new set of interfaces rather
> > > than using a subset of DOM! If we are starting from scratch we might get
> > > better results than trying to subset DOM! Many of us try to build
> > > applications which understand XML for embedded systems and I see great
> > > value in having a DOM-like implementation for such systems. Do you believe
> > > using a subset is good enough?
> >
> > The Xerces DOM implementation currently supports almost all of the DOM
> > Level 2. This is the Core + several optional modules, such as mutation
> > events, traversal, etc... Every optional module makes the whole thing
> > bigger in memory and slower. To start with, we can have an
> > implementation with just the Core. But we can do much more. This
> > includes things like being readonly, and/or not providing fast random
> > access a la getChildNodes().item(i). We currently have a cache for that
> > which costs us in memory. There are many things like that we can do, all
> > within the scope of a compliant implementation. And as we work on this,
> > if there is anything in the DOM that gets in the middle we can always
> > bring it up to W3C and look for a solution there.
> 
> This is exactly my point and I agree with your attitude ... But I do not
> believe that W3C has to solve our problems. Unfortunately :-). My worries
> are that even the core implementation is too big, that's why I would
> propose if someone wants to take a look into other possible solutions
> except DOM and JDOM. When W3C made the recommendation for DOM I do not
> believe that they had in mind that DOM would be used for embedded systems.
> My point is that SAX would be the solution for such small systems but it
> does not have the necessary processing power that I would like... and I
> was not able to cut down the DOM size to be reasonable for an embedded
> system. :-(. That's why I was trying to propose to move away from DOM and
> make something innovative...

Just out of curiosity, you never explained why JDOM didn't work for you
(not trying to start a war - just asking the question). I can see how
the memory required for DOM caused you problems. What about JDOM did? 

-Brett

> 
> >
> > > 2) The second part regards a higher level tool which I would love to have.
> > > Something to map the structure of a XML document directly on a structure
> > > defined in a programming language ...
> > > ...
> > > Do you understand what I'm trying to say?
> >
> > Sure. There are already several tools that do that. I don't have any
> > pointers handy though, but I'm sure someone else on this list has some.
> > --
> > Arnaud  Le Hors - IBM Cupertino, XML Technology Group

-- 
Brett McLaughlin, Enhydra Strategist
Lutris Technologies, Inc. 
1200 Pacific Avenue, Suite 300 
Santa Cruz, CA 95060 USA 
http://www.lutris.com
http://www.enhydra.org

Re: parser-next-gen goals, plan, and requirements

Posted by Arnaud Le Hors <le...@us.ibm.com>.

Eric Hodges wrote:
> 
> > The DOM doesn't deal with parsing at all for now, so I don't understand
> > what this is about.
> 
> But there's the problem.  It makes them happy and it annoys me.  There isn't
> a single solution that fits us both.

If by "them" you mean W3C, you're wrong again. It doesn't make anyone
happy, it's just the current state of affairs. Loading and saving is one
of the requirements of DOM Level 3 [1]

[1] 

> No, but it means some people shouldn't use XML. 

I completely agree with you. But I'm not too worried about that, this is
human nature. People need to push new things all the way to their limits
to test them out. Once this is done, they usually opt for a more balanced
approach.

> And some people shouldn't use DOM.

I completely agree here too! For one thing, if you don't need a standard
API use your own.
On the other hand trying to establish a new standard API just because
you prefer writing "el = new Element(name)" rather than "el =
doc.createElement(name)" is pointless to me. The benefits of using an
industry standard API are far greater than the trouble. But, this is
only my opinion, others are free to think otherwise!
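
For readers who have not used the factory style, here is a short Java
sketch of the doc.createElement(name) form. It assumes a JAXP
implementation is available to obtain an empty Document; the class name
CreateElementDemo and the element name are made up for the example. Nodes
are created through the Document that owns them, which is what lets the
implementation classes stay hidden behind the DOM interfaces.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class CreateElementDemo {
    // Creates an empty document and gives it a root element.
    static Document newDoc(String rootName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            // The document acts as the factory for its own nodes.
            Element el = doc.createElement(rootName);
            doc.appendChild(el);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDoc("greeting");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```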

> I understand why it is the way it is.  I don't think it should be thrown
> away, but I don't think it's needed in all the places it's used.

I agree.

> So if someone asks for a DOM for embedded systems, will they consider that?

Why not? I can't give any guarantee obviously. But the W3C process
allows anyone to make requests and/or submissions. If enough members
think the idea is interesting it would definitely be considered.

> How about a DOM for Java only? Or a Smalltalk DOM?

As far as I know W3C never got such a request. I doubt members would
want to spend resources on these.

> I don't think they can add new standards as fast as I can ask for them.

I don't think so either, even though W3C is known in the industry to be a
place where things happen very fast (even too fast sometimes ;-).
-- 
Arnaud  Le Hors - IBM Cupertino, XML Technology Group

Re: parser-next-gen goals, plan, and requirements

Posted by Eric Hodges <ha...@swbell.net>.

----- Original Message -----
From: "Arnaud Le Hors" <le...@us.ibm.com>
To: <ge...@xml.apache.org>
Sent: Wednesday, July 12, 2000 4:54 PM
Subject: Re: parser-next-gen goals, plan, and requirements


> Eric Hodges wrote:
> >
> > So why is the DOM API so bloated and ugly?  Why doesn't it use Java
> > collections?
>
> Because one of the requirements of the DOM was to be language
> independent. Most companies don't just use one language. They use a
> variety of tools and components that are in different languages. There
> are a lot of benefits for them to have a single API across the board. It
> reduces education costs, support costs, etc...
>
> >  Why are there so many non-obvious steps required just to parse
> > a document?
>
> The DOM doesn't deal with parsing at all for now, so I don't understand
> what this is about.

But there's the problem.  It makes them happy and it annoys me.  There isn't
a single solution that fits us both.

>
> > I already know why because you told me.
>
> So why do you ask??
>
> > W3C's job was to make several
> > different implementations happy with one API.  The result is an API that
> > doesn't make anyone happy.
>
> That's most often what happens with standards. XML 1.0 itself doesn't
> satisfy everybody for that matter. Does it mean we should throw it
> away and start over? I don't think so. Some have tried.

No, but it means some people shouldn't use XML.  And some people shouldn't
use DOM.

>
> The important point is that it's better to have a standard than no
> standard at all.

Or even better, have several interoperable standards tailored to specific
needs.

>
> I know it's hard to accept/understand that. I have an academic
> background; for years I worked as a research engineer where I had the
> luxury of having the time to develop my applications entirely from
> scratch. No constraints at all. I could aim for the best and the
> cleanest of all. Working at the X Consortium and W3C after that, I have
> had to learn the industrial reality. It was quite a shock at first.
> Especially at W3C where I started by working on HTML, by far the most
> controversial place at the time. There I learned that the priorities in
> the industry are very different from what I might dream of. But I also
> learned that there are good reasons for that.
>
> Sorry if this sounds boring. I'm just trying to say that, while I
> understand why people don't like the DOM, they should understand why
> things are the way they are. And why I think throwing it away and
> starting over isn't necessarily the best thing to do.

I understand why it is the way it is.  I don't think it should be thrown
away, but I don't think it's needed in all the places it's used.

>
> > Isn't it too late?  Are they going to throw away the current API?
>
> No, they (we) aren't going to throw away the current API. I don't think
> there is any chance nor any point in doing this (given what I just
> explained). But the WG will certainly consider specific issues and see
> if/how they can be addressed. A read-only DOM is on the list of issues
> to be addressed for DOM Level 3 for instance.

So if someone asks for a DOM for embedded systems, will they consider that?
How about a DOM for Java only?  Or a Smalltalk DOM?

I don't think they can add new standards as fast as I can ask for them.

Re: parser-next-gen goals, plan, and requirements

Posted by Arnaud Le Hors <le...@us.ibm.com>.

Eric Hodges wrote:
> 
> So why is the DOM API so bloated and ugly?  Why doesn't it use Java
> collections?

Because one of the requirements of the DOM was to be language
independent. Most companies don't just use one language. They use a
variety of tools and components that are in different languages. There
are a lot of benefits for them to have a single API across the board. It
reduces education costs, support costs, etc...
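
For Java users who do want collections, a thin adapter over the
language-neutral interfaces is one option. The following is only a sketch:
NodeListView is a hypothetical class, not part of DOM or Xerces, and it
assumes a JAXP parser is available for the demo. It presents an
org.w3c.dom.NodeList as a read-only java.util.List.

```java
import java.io.StringReader;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NodeListView extends AbstractList<Node> {
    private final NodeList nodes;

    NodeListView(NodeList nodes) {
        this.nodes = nodes;
    }

    @Override
    public Node get(int index) {
        // NodeList.item() returns null for an invalid index.
        Node n = nodes.item(index);
        if (n == null) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return n;
    }

    @Override
    public int size() {
        return nodes.getLength();
    }

    // Parses the XML text and lists the names of the root's child nodes.
    static List<String> childNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            NodeList kids = doc.getDocumentElement().getChildNodes();
            for (Node n : new NodeListView(kids)) {
                names.add(n.getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(childNames("<a><b/><c/></a>"));
    }
}
```

The adapter costs one small object per list and leaves the underlying DOM
implementation untouched, so the same tree remains usable from code
written against the standard API.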

>  Why are there so many non-obvious steps required just to parse
> a document?

The DOM doesn't deal with parsing at all for now, so I don't understand
what this is about.

> I already know why because you told me.

So why do you ask??

> W3C's job was to make several
> different implementations happy with one API.  The result is an API that
> doesn't make anyone happy.

That's most often what happens with standards. XML 1.0 itself doesn't
satisfy everybody for that matter. Does it mean we should throw it
away and start over? I don't think so. Some have tried.

The important point is that it's better to have a standard than no
standard at all.

I know it's hard to accept/understand that. I have an academic
background; for years I worked as a research engineer where I had the
luxury of having the time to develop my applications entirely from
scratch. No constraints at all. I could aim for the best and the
cleanest of all. Working at the X Consortium and W3C after that, I have
had to learn the industrial reality. It was quite a shock at first.
Especially at W3C where I started by working on HTML, by far the most
controversial place at the time. There I learned that the priorities in
the industry are very different from what I might dream of. But I also
learned that there are good reasons for that.

Sorry if this sounds boring. I'm just trying to say that, while I
understand why people don't like the DOM, they should understand why
things are the way they are. And why I think throwing it away and
starting over isn't necessarily the best thing to do.

> Isn't it too late?  Are they going to throw away the current API?

No, they (we) aren't going to throw away the current API. I don't think
there is any chance nor any point in doing this (given what I just
explained). But the WG will certainly consider specific issues and see
if/how they can be addressed. A read-only DOM is on the list of issues
to be addressed for DOM Level 3 for instance.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Technology Group

RE: parser-next-gen goals, plan, and requirements

Posted by Eric Hodges <ha...@swbell.net>.


> -----Original Message-----
> From: Arnaud Le Hors [mailto:lehors@us.ibm.com]
> Sent: Wednesday, July 12, 2000 2:19 PM
> To: general@xml.apache.org
> Subject: Re: parser-next-gen goals, plan, and requirements
>
>
> Octav Chipara wrote:
> >
> > But I do not
> > believe that W3C has to solve our problems. Unfortunately :-).
>
> You're wrong there. W3C is an industrial consortium, the goal of which
> is to provide the industry with the standards it needs to let the Web
> grow. Both IBM and Sun are active members of W3C, as a matter of fact
> both James and I along with Andy Heninger are members of the DOM WG
> itself, so we can easily bring up any issue we found with the DOM.

So why is the DOM API so bloated and ugly?  Why doesn't it use Java
collections?  Why are there so many non-obvious steps required just to parse
a document?

I already know why because you told me.  W3C's job was to make several
different implementations happy with one API.  The result is an API that
doesn't make anyone happy.

>
> > My worries
> > are that even the core implementation is too big, that's why I would
> > propose if someone wants to take a look into other possible solutions
> > except DOM and JDOM. When W3C made the recommendation for DOM I do not
> > believe that they had in mind that DOM would be used for
> > embedded systems.
>
> This is true. Having been involved in the DOM Activity since its
> beginning I can tell you that the persons involved in it at first (this
> is several years ago!) only represented browser vendors, authoring tool
> vendors, server vendors, and users. But W3C now counts as members many
> handheld device vendors and their requirements are taken into account
> just like any other.

Isn't it too late?  Are they going to throw away the current API?

Re: parser-next-gen goals, plan, and requirements

Posted by James Duncan Davidson <ja...@eng.sun.com>.

on 7/12/00 1:05 PM, Octav Chipara at ochipara@cse.unl.edu wrote:

> 
> 
> On Wed, 12 Jul 2000, Arnaud Le Hors wrote:
> 
>> Octav Chipara wrote:
>>> 
>>> But I do not
>>> believe that W3C has to solve our problems. Unfortunately :-).
>> 
>> You're wrong there. W3C is an industrial consortium, the goal of which
>> is to provide the industry with the standards it needs to let the Web
>> grow. Both IBM and Sun are active members of W3C, as a matter of fact
>> both James and I along with Andy Heninger are members of the DOM WG
>> itself, so we can easily bring up any issue we found with the DOM.
> 
> 
> It is nice to know that W3C has its ears everywhere. Maybe it is my mistake
> for not being able to imagine such an organization being very flexible and
> able to take fast steps towards the achievement of a goal... Sorry ...

True enough... But any change to the DOM has to respect its requirements and
goals... Doing something like a lightweight single language based tree model
that doesn't do the things needed for editing and such (which are some of
the bigger parts of DOM) is a much different beast. I wouldn't compromise
the place where we are at with browsers and such to go after this other
problem domain. Hence my support of JDOM for the Java version of this design
space. In addition to my support of W3C DOM in its problem space.

.duncan

Re: parser-next-gen goals, plan, and requirements

Posted by Octav Chipara <oc...@cse.unl.edu>.


On Wed, 12 Jul 2000, Arnaud Le Hors wrote:

> Octav Chipara wrote:
> > 
> > But I do not
> > believe that W3C has to solve our problems. Unfortunately :-).
> 
> You're wrong there. W3C is an industrial consortium, the goal of which
> is to provide the industry with the standards it needs to let the Web
> grow. Both IBM and Sun are active members of W3C, as a matter of fact
> both James and I along with Andy Heninger are members of the DOM WG
> itself, so we can easily bring up any issue we found with the DOM.


It is nice to know that W3C has its ears everywhere. Maybe it is my mistake
for not being able to imagine such an organization being very flexible and
able to take fast steps towards the achievement of a goal... Sorry ...

> 
> > My worries
> > are that even the core implementation is too big, that's why I would
> > propose if someone wants to take a look into other possible solutions
> > except DOM and JDOM. When W3C made the recommendation for DOM I do not
> > believe that they had in mind that DOM would be used for embedded systems.
> 
> This is true. Having been involved in the DOM Activity since its
> beginning I can tell you that the persons involved in it at first (this
> is several years ago!) only represented browser vendors, authoring tool
> vendors, server vendors, and users. But W3C now counts as members many
> handheld device vendors and their requirements are taken into account
> just like any other.
> 
> > My point is that SAX would be the solution for such small systems but it
> > does not have the necessary processing power that I would like... and I
> > was not able to cut down the DOM size to be reasonable for an embedded
> > system. :-(. That's why I was trying to propose to move away from DOM and
> > make something innovative...
> 
> The DOM Core barely contains what's in an XML document (as defined per
> the XML Infoset), I'm not sure what could be removed from it if you want
> any structure at all (as opposed to SAX which provides none). Could you
> expand on this a little?

  I guess what could be cut would be the Document interface. I guess for
embedded systems you would not need interfaces for creating Comments
and, more or less, processing instructions. Document fragments are
very nice to have but not always necessary. Moreover, at a very low level a
document fragment can be seen as a Node List. I would even dump
create CDATA because at a low level I would never use it. I prefer to have
a reference to my data and grab it using HTTP directly ... And maybe
the Entity interface, I would not use it. And some of the methods of these
interfaces could be omitted without causing too much trouble. What do you
think?
 
 -Octav
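
To give the idea of a cut-down tree some shape, here is a purely
hypothetical Java sketch. MiniTree, its Node interface, and its Element
class are invented for illustration and are not a proposal anyone in this
thread has made in this form. The application-facing view is read-only and
offers only names, text, and forward links (first child, next sibling),
with no comments, fragments, CDATA sections, or entity nodes.

```java
final class MiniTree {
    // Hypothetical read-only view: names, text and forward links only.
    interface Node {
        String name();
        String text();
        Node firstChild();
        Node nextSibling();
    }

    static final class Element implements Node {
        private final String name;
        private final String text;
        private Element first;
        private Element last;
        private Element next;

        Element(String name, String text) {
            this.name = name;
            this.text = text;
        }

        // Build-time only; a parser would call this, not the application.
        Element append(Element child) {
            if (first == null) {
                first = child;
            } else {
                last.next = child;
            }
            last = child;
            return this;
        }

        @Override
        public String name() {
            return name;
        }

        @Override
        public String text() {
            return text;
        }

        @Override
        public Node firstChild() {
            return first;
        }

        @Override
        public Node nextSibling() {
            return next;
        }
    }

    // Walks the sibling chain; there is no indexed access to children.
    static int countChildren(Node parent) {
        int n = 0;
        for (Node c = parent.firstChild(); c != null; c = c.nextSibling()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Element root = new Element("a", "")
                .append(new Element("b", "hi"))
                .append(new Element("c", ""));
        System.out.println(countChildren(root));
    }
}
```

Each element carries three references and two strings, so the per-node
cost is easy to reason about on a small device. Whether that is enough
structure for real applications is exactly the open question.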

> -- 
> Arnaud  Le Hors - IBM Cupertino, XML Technology Group

Re: parser-next-gen goals, plan, and requirements

Posted by Arnaud Le Hors <le...@us.ibm.com>.

Octav Chipara wrote:
> 
> But I do not
> believe that W3C has to solve our problems. Unfortunately :-).

You're wrong there. W3C is an industrial consortium, the goal of which
is to provide the industry with the standards it needs to let the Web
grow. Both IBM and Sun are active members of W3C. As a matter of fact,
James, Andy Heninger, and I are all members of the DOM WG
itself, so we can easily bring up any issue we find with the DOM.

> My worries
> are that even the core implementation is too big, that's why I would
> propose if someone wants to take a look into other possible solutions
> except DOM and JDOM. When W3C made the recommendation for DOM I do not
> believe that they had in mind that DOM would be used for embedded systems.

This is true. Having been involved in the DOM Activity since its
beginning I can tell you that the persons involved in it at first (this
is several years ago!) only represented browser vendors, authoring tool
vendors, server vendors, and users. But W3C now counts as members many
handheld device vendors and their requirements are taken into account
just like any other.

> My point is that SAX would be the solution for such small systems but it
> does not have the necessary processing power that I would like... and I
> was not able to cut down the DOM size to be reasonable for an embedded
> system. :-(. That's why I was trying to propose to move away from DOM and
> make something innovative...

The DOM Core barely contains what's in an XML document (as defined per
the XML Infoset), I'm not sure what could be removed from it if you want
any structure at all (as opposed to SAX which provides none). Could you
expand on this a little?
-- 
Arnaud  Le Hors - IBM Cupertino, XML Technology Group

Re: parser-next-gen goals, plan, and requirements

Posted by James Duncan Davidson <ja...@eng.sun.com>.

on 7/14/00 8:42 AM, Octav Chipara at ochipara@cse.unl.edu wrote:

> The bad thing about JDOM is that it is done in Java. All the
> programs that I write are in C/C++ because of the memory constraints ... So
> that's why I do not use JDOM!

Fair enough -- sounds like you've got the right motivation to write a CDOM
thing... Something tailored just for C programming with structs and all. :)

.duncan

Re: parser-next-gen goals, plan, and requirements

Posted by Octav Chipara <oc...@cse.unl.edu>.


On Thu, 13 Jul 2000, James Duncan Davidson wrote:

> on 7/12/00 11:01 AM, Octav Chipara at ochipara@cse.unl.edu wrote:
> 
> > When W3C made the recommendation for DOM I do not
> > believe that they had in mind that DOM would be used for embedded systems.
> 
> It's quite clear from its name, imho "Document Object Model" :) Given that,
> and its use in editors and browsers, it does what it does and you probably
> would have a hard time coming up with something *too* much better given the
> problem domain (except for that little bit about learning from the past
> lessons. :)
> 
> > system. :-(. That's why I was trying to propose to move away from DOM and
> > make something innovative...
> 
> That's what JDOM is, no? That was why I put the time into reviewing it with
> Jas and Brett. :)


	The bad thing about JDOM is that it is done in Java. All the
programs that I write are in C/C++ because of the memory constraints ... So
that's why I do not use JDOM!

> 
> .duncan
> 
> 

Re: parser-next-gen goals, plan, and requirements

Posted by James Duncan Davidson <ja...@eng.sun.com>.

on 7/12/00 11:01 AM, Octav Chipara at ochipara@cse.unl.edu wrote:

> When W3C made the recommendation for DOM I do not
> believe that they had in mind that DOM would be used for embedded systems.

It's quite clear from its name, imho "Document Object Model" :) Given that,
and its use in editors and browsers, it does what it does and you probably
would have a hard time coming up with something *too* much better given the
problem domain (except for that little bit about learning from the past
lessons. :)

> system. :-(. That's why I was trying to propose to move away from DOM and
> make something innovative...

That's what JDOM is, no? That was why I put the time into reviewing it with
Jas and Brett. :)

.duncan

Re: parser-next-gen goals, plan, and requirements

Posted by Octav Chipara <oc...@cse.unl.edu>.

> Octav Chipara wrote:
> > 
> > 1) True! ... But my only problem is that when you are using various
> > subsets of DOM, are you still having a compliant W3C recommendation? I
> > believe that we should actually develop a new set of interfaces rather
> > than using a subset of DOM! If we are starting from scratch we might get
> > better results than trying to subset DOM! Many of us try to build
> > applications which understand XML for embedded systems and I see great
> > value in having a DOM-like implementation for such systems. Do you believe
> > using a subset is good enough?
> 
> The Xerces DOM implementation currently supports almost all of the DOM
> Level 2. This is the Core + several optional modules, such as mutation
> events, traversal, etc... Every optional module makes the whole thing
> bigger in memory and slower. To start with, we can have an
> implementation with just the Core. But we can do much more. This
> includes things like being readonly, and/or not providing fast random
> access a la getChildNodes().item(i). We currently have a cache for that
> which costs us in memory. There are many things like that we can do, all
> within the scope of a compliant implementation. And as we work on this,
> if there is anything in the DOM that gets in the middle we can always
> bring it up to W3C and look for a solution there.


This is exactly my point and I agree with your attitude ... But I do not
believe that W3C has to solve our problems. Unfortunately :-). My worries
are that even the core implementation is too big, that's why I would
propose if someone wants to take a look into other possible solutions
except DOM and JDOM. When W3C made the recommendation for DOM I do not
believe that they had in mind that DOM would be used for embedded systems.
My point is that SAX would be the solution for such small systems but it
does not have the necessary processing power that I would like... and I
was not able to cut down the DOM size to be reasonable for an embedded
system. :-(. That's why I was trying to propose to move away from DOM and
make something innovative...

> 
> > 2) The second part regards a higher level tool which I would love to have.
> > Something to map the structure of a XML document directly on a structure
> > defined in a programming language ...
> > ...
> > Do you understand what I'm trying to say?
> 
> Sure. There are already several tools that do that. I don't have any
> pointers handy though, but I'm sure someone else on this list has some.
> -- 
> Arnaud  Le Hors - IBM Cupertino, XML Technology Group
> 

Re: parser-next-gen goals, plan, and requirements

Posted by Arnaud Le Hors <le...@us.ibm.com>.

Octav Chipara wrote:
> 
> 1) True! ... But my only problem is that when you are using various
> subsets of DOM, are you still having a compliant W3C recommendation? I
> believe that we should actually develop a new set of interfaces rather
> than using a subset of DOM! If we are starting from scratch we might get
> better results than trying to subset DOM! Many of us try to build
> applications which understand XML for embedded systems and I see great
> value in having a DOM-like implementation for such systems. Do you believe
> using a subset is good enough?

The Xerces DOM implementation currently supports almost all of the DOM
Level 2. This is the Core + several optional modules, such as mutation
events, traversal, etc... Every optional module makes the whole thing
bigger in memory and slower. To start with, we can have an
implementation with just the Core. But we can do much more. This
includes things like being readonly, and/or not providing fast random
access a la getChildNodes().item(i). We currently have a cache for that
which costs us in memory. There are many things like that we can do, all
within the scope of a compliant implementation. And as we work on this,
if there is anything in the DOM that gets in the middle we can always
bring it up to W3C and look for a solution there.
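
As a rough illustration of the trade-off above (a sketch, not Xerces code): a client that walks children through getFirstChild()/getNextSibling() never needs indexed access, so an implementation would be free to skip the item(i) cache. The class name SiblingWalk and its helpers are hypothetical; the DOM and JAXP calls are the standard ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: visits children by following sibling links only,
// so it never depends on fast random access via getChildNodes().item(i).
final class SiblingWalk {

    // Returns the names of the element children of 'parent' in document order.
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getNodeName());
            }
        }
        return names;
    }

    // Parses a string with whatever DOM implementation JAXP finds.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

Code written this way runs unchanged against any compliant DOM, whether or not that DOM keeps an index cache.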

> 2) The second part regards a higher level tool which I would love to have.
> Something to map the structure of a XML document directly on a structure
> defined in a programming language ...
> ...
> Do you understand what I'm trying to say?

Sure. There are already several tools that do that. I don't have any
pointers handy though, but I'm sure someone else on this list has some.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Technology Group

Re: parser-next-gen goals, plan, and requirements

Posted by Edwin Goei <Ed...@eng.sun.com>.

> 2) The second part regards a higher level tool which I would love to have.
> Something to map the structure of a XML document directly on a structure
> defined in a programming language ... For instance,
>
> struct book{
> char *Title;
> char *ISBN;
> ...
> }
>
> should be easily mapped on the XML Document like ...
>
> <book>
> <Title> Stuff </Title>
> <ISBN> other stuff </ISBN>
> </book>
>
> This should happen both when you are creating XML documents and when you
> are writing XML documents. Do you understand what I'm trying to say?

I've heard of a project in Sun working on something like this called XML
Data Binding.  See
http://java.sun.com/aboutJava/communityprocess/jsr/jsr_031_xmld.html.  I
know there are other similar efforts going on elsewhere, but can't give you
a pointer.

Re: parser-next-gen goals, plan, and requirements

Posted by Octav Chipara <oc...@cse.unl.edu>.

> Octav Chipara wrote:
> > 
> > OK ... It is true that some of us want JDOM, but I would propose to try
> > to build a new tree structure. IMHO, neither JDOM nor DOM are the
> > best possible solutions. I would like something that would have small
> > footprint and that would be easily mapped on structures that could be
> > defined in some programming language!
> > 
> > What do you think?
> 
> I don't understand the second part of your last sentence, after "and".
> But the DOM only defines interfaces. Nothing prevents anyone from
> designing a minimal DOM implementation that has a small footprint. I'm
> actually already experimenting with something like that. In the long run
> I see us having several DOM implementations that fulfill different
> requirements and live together. DOM being a standard API, one will be
> able to choose the implementation that best fits his/her needs and be
> able to change at any time if need be.
> -- 

1) True! ... But my only problem is that when you are using various
subsets of DOM, are you still having a compliant W3C recommendation? I
believe that we should actually develop a new set of interfaces rather
than using a subset of DOM! If we are starting from scratch we might get
better results than trying to subset DOM! Many of us try to build
applications which understand XML for embedded systems and I see great
value in having a DOM-like implementation for such systems. Do you believe
using a subset is good enough?

2) The second part regards a higher level tool which I would love to have.
Something to map the structure of a XML document directly on a structure
defined in a programming language ... For instance,

struct book{
char *Title;
char *ISBN;
...
}

should be easily mapped on the XML Document like ...

<book>
	<Title> Stuff </Title>
	<ISBN> other stuff </ISBN>
</book>

This should happen both when you are creating XML documents and when you
are writing XML documents. Do you understand what I'm trying to say?
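
A minimal sketch of the mapping described above, in Java rather than C, assuming the two-field <book> layout from the example. The Book class and its fromXml/toXml methods are hypothetical names for illustration; a real data-binding tool would generate such code from a schema rather than have it written by hand.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical hand-written binding between a <book> element and an object.
final class Book {
    final String title;
    final String isbn;

    Book(String title, String isbn) {
        this.title = title;
        this.isbn = isbn;
    }

    // Reading: <book><Title>..</Title><ISBN>..</ISBN></book> -> Book.
    static Book fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            return new Book(text(root, "Title"), text(root, "ISBN"));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a usable <book> document", e);
        }
    }

    // Text of the first descendant element with the given tag, trimmed.
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    // Writing: Book -> XML text, escaping markup characters in the values.
    String toXml() {
        return "<book><Title>" + escape(title) + "</Title><ISBN>"
                + escape(isbn) + "</ISBN></book>";
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

The round trip Book.fromXml(book.toXml()) is the property such a tool would be expected to preserve in both directions.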

> Arnaud  Le Hors - IBM Cupertino, XML Technology Group
> 

Re: parser-next-gen goals, plan, and requirements

Posted by Arnaud Le Hors <le...@us.ibm.com>.

Octav Chipara wrote:
> 
> OK ... It is true that some of us want JDOM, but I would propose to try
> to build a new tree structure. IMHO, neither JDOM nor DOM are the
> best possible solutions. I would like something that would have small
> footprint and that would be easily mapped on structures that could be
> defined in some programming language!
> 
> What do you think?

I don't understand the second part of your last sentence, after "and".
But the DOM only defines interfaces. Nothing prevents anyone from
designing a minimal DOM implementation that has a small footprint. I'm
actually already experimenting with something like that. In the long run
I see us having several DOM implementations that fulfill different
requirements and live together. DOM being a standard API, one will be
able to choose the implementation that best fits his/her needs and be
able to change at any time if need be.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Technology Group

Re: parser-next-gen goals, plan, and requirements

Posted by Octav Chipara <oc...@cse.unl.edu>.


HI!

> on 7/11/00 3:53 PM, Arnaud Le Hors at lehors@us.ibm.com wrote:
> 
> > I actually simply don't understand the requirement about JDOM. DOM is an
> > API, we need to provide classes that implement the API. This is true for
> > JDOM. It's not an API. It's a set of classes that include a builder that
> > works on SAX. So as long as we support SAX, which definitely is a
> > requirement, we're all set on that front. Let's leave the debate of
> > whether JDOM is a good thing or not outside of this project.
> 
> I disagree. JDOM is an important up-and-coming API that already has
> established a large and rapidly growing groundswell of support in the
> developer community.
> 
> You personally don't have to do the work to provide JDOM support -- as long
> as the core architecture is pluggable and modular, then you can work on DOM
> and Brett and co can work on JDOM and everybody wins.
> 
> Wouldn't this be a sign of a development community getting along? :)
> 
> .duncan
> 

OK ... It is true that some of us want JDOM, but I would propose to try
to build a new tree structure. IMHO, neither JDOM nor DOM are the
best possible solutions. I would like something that would have small
footprint and that would be easily mapped on structures that could be
defined in some programming language!

What do you think?

Octav

Re: parser-next-gen goals, plan, and requirements

Posted by James Duncan Davidson <ja...@eng.sun.com>.

on 7/11/00 3:53 PM, Arnaud Le Hors at lehors@us.ibm.com wrote:

> I actually simply don't understand the requirement about JDOM. DOM is an
> API, we need to provide classes that implement the API. This is true for
> JDOM. It's not an API. It's a set of classes that include a builder that
> works on SAX. So as long as we support SAX, which definitely is a
> requirement, we're all set on that front. Let's leave the debate of
> whether JDOM is a good thing or not outside of this project.

I disagree. JDOM is an important up-and-coming API that already has
established a large and rapidly growing groundswell of support in the
developer community.

You personally don't have to do the work to provide JDOM support -- as long
as the core architecture is pluggable and modular, then you can work on DOM
and Brett and co can work on JDOM and everybody wins.

Wouldn't this be a sign of a development community getting along? :)

.duncan

Re: parser-next-gen goals, plan, and requirements

Posted by Arnaud Le Hors <le...@us.ibm.com>.

I actually simply don't understand the requirement about JDOM. DOM is an
API, we need to provide classes that implement the API. This is true for
JDOM. It's not an API. It's a set of classes that include a builder that
works on SAX. So as long as we support SAX, which definitely is a
requirement, we're all set on that front. Let's leave the debate of
whether JDOM is a good thing or not outside of this project.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Technology Group

Re: parser-next-gen goals, plan, and requirements

Posted by James Duncan Davidson <ja...@eng.sun.com>.

on 7/11/00 3:28 PM, Brett McLaughlin at brett.mclaughlin@lutris.com wrote:

> Ask James what he got asked over and over at JavaOne, often the first
> questions. 

It was amazing.... I tried not to roll my eyes after a while -- not because
I didn't think it was an appropriate question, more of an "oh my, let's see,
what was that response I gave a few hours ago -- here we go, cp
/dev/memory/jdom-answer /dev/mouth". At least that was what was going on
inside my head :)

> I will be honest, though - if JDOM isn't supported at all, I can promise
> that we will pull large numbers of folks away - I have a 2nd edition of
> Java and XML that will sell lots (as the first one promotes JDOM and
> Xerces, I would have hoped to have people at least give credit there for
> my making attempts to encourage interaction), and JDOM has a strong
> following. Why make us choose between another, JDOM-supportable parser,
> and one that is not, esp. if you can use it or not use it as a module?

This last sentence is the important one. If DOM, DOM-ReadOnly, DOM-Deferred,
JDOM, and FOOTREEAPI are all equal status .jars, then it really shouldn't
matter -- as long as people are willing to put the time and effort into
building the tree producer, then by all means, let's encourage it.

.duncan

Re: parser-next-gen goals, plan, and requirements

Posted by Brett McLaughlin <br...@lutris.com>.

>>    * Factor in tree based producers. We'd like to see DOM and JDOM up
>>      front.
>
>-1 on JDOM for the core.  Just my opinion.  I don't like it, I think it
>misleads developers about the XML data model, and I would rather not see
>Apache support it.

I, as expected, think this is ridiculous. Not because it is true or
false, but because we sent you a version of JDOM before anyone else ever
saw it - pre-beta. And we have never gotten one comment from you, or one
mail on our mailing lists (are you subscribed? It doesn't look like it!),
saying what these problems are. I think that you could certainly help
solve or better understand, and educate us, on what you see those
problems are. This is incredibly close-minded, though - this would be
like me saying Xalan is not really a good idea, and (as I have not)
never having gotten involved in the mailing lists, and never having
posted suggestions to fix it.

In my mind, it's a -1 without a reason. I would be more than happy to
see you hop on jdom-interest and let us know what things you see
problems with. Let us know what version you have used (have you used it?
Beta 3? 4? CVS? tried the samples?), and help us correct the problems.
The bottom line, and James can attest to this, is that it has a
/substantial/ following. Ask James what he got asked over and over at
JavaOne, often the first questions. There are sessions on it at many of
the major XML conferences coming up. And if we do things right, it is
simply a module you can personally ignore. 

I will be honest, though - if JDOM isn't supported at all, I can promise
that we will pull large numbers of folks away - I have a 2nd edition of
Java and XML that will sell lots (as the first one promotes JDOM and
Xerces, I would have hoped to have people at least give credit there for
my making attempts to encourage interaction), and JDOM has a strong
following. Why make us choose between another, JDOM-supportable parser,
and one that is not, esp. if you can use it or not use it as a module?

Confused...

-Brett

Re: parser-next-gen goals, plan, and requirements

Posted by "Randall J. Parr" <RP...@TemporalArts.COM>.


James Duncan Davidson wrote:

> >> * Look at pluggable validation.
> >
> > +1, but validation is not the same thing as basic schema and DTD
> > processing.  Data-types, entity refs, default attributes, etc., tend to be
> > required.
>
> Actually, I'd like to see if that's not the case. I'd really like to see
> basic schema and DTD processing moved out of the core path so that a
> non-validating parser can be put together.
>
> This really cuts down on the critical path when used in a server case where
> validation is quite frequently turned off. Same for TV sets or PDAs. There
> is a whole category of apps for which validation is
> not a need after development.
>

I also believe many applications could benefit from leveraging the core parsing
ability to perform tasks such as extracting an element from within a database
column holding a description which contains XML tagging but which is not a proper
XML document.

R.Parr
Temporal Arts

Re: parser-next-gen goals, plan, and requirements

Posted by Arved Sandstrom <Ar...@chebucto.ns.ca>.

At 02:42 PM 7/13/00 +0200, Stefano Mazzocchi wrote:
>Edwin Goei wrote:
>> 
>> "Tim Bray" <tb...@textuality.com> writes:
>> 
>> > Wild-eyed suggestion: why not look into adopting James Clark's XT?  It's
>> > a pretty #!%@^#@ good parser IMHO.  Also it's from neither IBM or Sun :)
>> 
>> Yup, I think it's worth looking at the internals of XP also (you probably
>> meant XP), I just haven't done it yet.  Maybe other people who have can
>> comment.  It looks like an older parser which hasn't been updated.  There's
>> also the parser that you wrote, Lark, which you might want to comment on.
>> 
>> The other parser I have looked at is Aelfred2 which is a current SAX2 based
>> parser.  I think it's easier to understand than the current Xerces code, but
>> it also does less.  But maybe we're still in the requirements stage right
>> now and these are more implementation details.
>
>+1 on making 
>
> Xerces2 = Xerces1 ideas + Crimson ideas + XP ideas
>
>I'm sure James Clark has _lots_ of design decisions to share on parsers.

If we are trying to capture ideas, let's look at _all_ the parsers, where 
the source is available and licenses permit.

We're not just in Java land here; we've got Xerces-C and Xerces-Perl and 
they can't be shuffled off.

There's stuff that microparsers like nanoxml can contribute to the 
discussion. Python XML parsers (and there is more than one) are quite good. 
If we are talking James Clark, let's not forget expat; this is a very good 
parser and represents the core of the Perl XML processing family.

Correct me if I'm wrong but we are still capturing requirements. Maybe 90% 
of the folks here plan to actually write Java as a result, but before we get 
to detailed design I wouldn't expect to see a single-minded focus on Java.

I think there is potential here for making this project best-of-breed when 
it comes to showing that open-source can do process. I think it would be 
awesome if "Xerces2" generated docs for conops, requirements, and design 
that allow folks down the road to step in and write a parser in Eiffel, for 
example (this is my plug for Eiffel, unabashedly... :-)), or in Python, or 
whatever.

Just my CAN $0.02 worth.

Arved Sandstrom

Senior Developer
e-plicity.com (www.e-plicity.com)
Halifax, Nova Scotia
"B2B Wireless in Canada's Ocean Playground"

Re: parser-next-gen goals, plan, and requirements

Posted by Stefano Mazzocchi <st...@apache.org>.

Edwin Goei wrote:
> 
> "Tim Bray" <tb...@textuality.com> writes:
> 
> > Wild-eyed suggestion: why not look into adopting James Clark's XT?  It's
> > a pretty #!%@^#@ good parser IMHO.  Also it's from neither IBM or Sun :)
> 
> Yup, I think it's worth looking at the internals of XP also (you probably
> meant XP), I just haven't done it yet.  Maybe other people who have can
> comment.  It looks like an older parser which hasn't been updated.  There's
> also the parser that you wrote, Lark, which you might want to comment on.
> 
> The other parser I have looked at is Aelfred2 which is a current SAX2 based
> parser.  I think it's easier to understand than the current Xerces code, but
> it also does less.  But maybe we're still in the requirements stage right
> now and these are more implementation details.

+1 on making 

 Xerces2 = Xerces1 ideas + Crimson ideas + XP ideas

I'm sure James Clark has _lots_ of design decisions to share on parsers.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------

Re: parser-next-gen goals, plan, and requirements

Posted by Edwin Goei <Ed...@eng.sun.com>.

"Tim Bray" <tb...@textuality.com> writes:

> Wild-eyed suggestion: why not look into adopting James Clark's XT?  It's
> a pretty #!%@^#@ good parser IMHO.  Also it's from neither IBM or Sun :)

Yup, I think it's worth looking at the internals of XP also (you probably
meant XP), I just haven't done it yet.  Maybe other people who have can
comment.  It looks like an older parser which hasn't been updated.  There's
also the parser that you wrote, Lark, which you might want to comment on.

The other parser I have looked at is Aelfred2 which is a current SAX2 based
parser.  I think it's easier to understand than the current Xerces code, but
it also does less.  But maybe we're still in the requirements stage right
now and these are more implementation details.

-Edwin

Re: parser-next-gen goals, plan, and requirements

Posted by Tim Bray <tb...@textuality.com>.

At 05:49 PM 11/07/00 -0700, Arnaud Le Hors wrote:
>At the minimum we need to have the same as Xerces 1. These are:
>
>Validating XML 1.0
>Namespaces
>SAX2
>DOM Level 2
>XML Schemas
>
>In addition, I guess it's a given that we all want:
>
>Modularity, meaning that one should be able to have a jar containing the
>bare minimum XML parser for instance.

I think this is really important.  There is going to be some proportion of 
the time N, where the parser is just pulling out elements and attributes,
not validating or XPathing or DOMbuilding.  Nobody knows what N is, but my
guess is that it's going to be surprisingly high, like "most of the time".  This 
kind of parsing needs to be fast and it needs to have a light memory
footprint.

Question: if you build a low-level parser that 

(a) implements SAX2, and
(b) if asked, parses the DTD and stuffs it into a reasonable Java data
    structure,

can you build all the other pieces that Arnaud lists on top of that and 
have acceptable efficiency?  

I don't know how representative I am, but for me, validation (at either
DTD or schema level) is mostly for debugging; at runtime the validation
logic tends to be hardwired and app-specific.  Thus I'd be willing to
trade quite a lot of validation performance for fast SAX2 events and
a light memory footprint. 

My intuition says that as regards building the tree and all that follows
from it, making that go through SAX2 shouldn't be a performance hit... or
has anyone had other experiences?
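
To make the layering concrete, here is a small sketch of a tree producer that sits entirely on top of SAX2 events. It is an illustration under assumptions, not a proposed design: SaxTreeBuilder is a hypothetical name, the parser is whatever JAXP supplies, and namespaces, comments, and processing instructions are ignored.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical tree producer: consumes SAX2 events, emits a DOM tree.
final class SaxTreeBuilder extends DefaultHandler {
    private final Document doc;
    private Node current; // node that receives the next child

    SaxTreeBuilder(Document doc) {
        this.doc = doc;
        this.current = doc;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Element e = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(e);
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may reuse 'ch' after this call, so the text is copied.
        current.appendChild(doc.createTextNode(new String(ch, start, length)));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        current = current.getParentNode();
    }

    // Runs a SAX2 parse and returns the tree built from its events.
    static Document build(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            XMLReader reader = SAXParserFactory.newInstance()
                    .newSAXParser().getXMLReader();
            reader.setContentHandler(new SaxTreeBuilder(doc));
            reader.parse(new InputSource(new StringReader(xml)));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

The per-event cost here is one method call plus the node allocation, and the allocation would be needed by any tree builder, which is the intuition behind expecting little overhead from the SAX2 layer.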

Wild-eyed suggestion: why not look into adopting James Clark's XT?  It's
a pretty #!%@^#@ good parser IMHO.  Also it's from neither IBM or Sun :)
 -Tim

Re: parser-next-gen goals, plan, and requirements

Posted by "N. Sean Timm" <st...@mailgo.com>.

"Brett McLaughlin" <br...@lutris.com> wrote:
> Is SAX 1 support worth including? I'm -0, but not sure. Since we are
> talking size, you could simply implement SAX2, and include ParserAdapter
> and call it SAX 1.0. Any opinions? Not a huge deal, but as long as we're
> laying out requirements...
I'd be -1 on this.  This is next generation.  This is not your father's
parser.  :)
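
For anyone weighing the size cost of keeping SAX 1 around, a sketch of the adapter approach follows. One detail worth noting: in org.xml.sax.helpers, ParserAdapter wraps a SAX1 Parser as a SAX2 XMLReader, while XMLReaderAdapter goes the other way and presents a SAX2 XMLReader as a SAX1 Parser, so the latter is the one used here. Sax1Facade is a hypothetical name, and the SAX1 types are deprecated in current JDKs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderAdapter;

// Hypothetical facade: a SAX2-only parser exposed through the SAX1 API.
final class Sax1Facade {

    // Collects element names using a SAX1 DocumentHandler callback.
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        try {
            XMLReader sax2 = SAXParserFactory.newInstance()
                    .newSAXParser().getXMLReader();
            Parser sax1 = new XMLReaderAdapter(sax2); // SAX2 -> SAX1 view
            sax1.setDocumentHandler(new HandlerBase() {
                @Override
                public void startElement(String name, AttributeList atts) {
                    names.add(name);
                }
            });
            sax1.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return names;
    }
}
```

Since the adapter is a single helper class, SAX 1 clients could be served this way without the parser core knowing about SAX 1 at all.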

> > Also, performance should be the best-of-breed ACROSS ALL JIT's (not just
> > Hotspot).
>
> This is fair - we do need to make a decision on 1.1 JVM's. Personally, I
> don't think we need support - in other words, I'm for WeakReferences and
> Collections, because they (a) make things easier to understand and (b)
> could possibly really help with performance and memory. I agree with
> James that Xerces 1.x is great for 1.1.
>
I agree.  Performance should be best-of-breed across all *Java2* JIT's.  1.1
shouldn't be supported.  Once again, this is next generation.
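
The WeakReference point is one concrete reason to require Java 2, since java.lang.ref does not exist on 1.1 JVMs. Below is a minimal sketch of the idea, assuming nodes can be rebuilt on demand from some key; WeakNodeCache and its rebuild function are hypothetical, and the lambda-style Function type is from later Java versions.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache: values may be collected when nothing else references
// them, and are rebuilt the next time they are requested.
final class WeakNodeCache<K, V> {
    private final Map<K, WeakReference<V>> refs = new HashMap<>();
    private final Function<K, V> rebuild;
    private int builds = 0;

    WeakNodeCache(Function<K, V> rebuild) {
        this.rebuild = rebuild;
    }

    V get(K key) {
        WeakReference<V> ref = refs.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            // Never built, or reclaimed by the garbage collector: rebuild.
            value = rebuild.apply(key);
            builds++;
            refs.put(key, new WeakReference<>(value));
        }
        return value;
    }

    // Number of times a value had to be (re)built.
    int builds() {
        return builds;
    }
}
```

While a caller holds a strong reference to a returned value, repeated get calls return the same object; once released, the memory can be reclaimed at the collector's discretion.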

- Sean T.

Re: parser-next-gen goals, plan, and requirements

Posted by James Duncan Davidson <ja...@eng.sun.com>.

on 7/11/00 6:22 PM, Arnaud Le Hors at lehors@us.ibm.com wrote:

> I'm all for it, but this will require installing some new software on
> the apache server. Last time we tried that with Bugzilla the site was
> cracked, so I believe only Brian can install anything like that now.
> What chances do we have that this will happen? Does anybody know?

Right this second, it'd be pretty hard.. Locus was just changed out for a
new machine and some other problems are being solved. In a bit hopefully
we'll have an area for servlets staked out that can be used for something
like this (assuming that it's servlet based).

.duncan

Re: parser-next-gen goals, plan, and requirements

Posted by Arnaud Le Hors <le...@us.ibm.com>.

Joe Polastre wrote:
> 
> From: "Brett McLaughlin" <br...@lutris.com>
> > Oh, how I'd love to volunteer here... Just too freakin' busy. I wonder
> > if there is any tool at collab.net or somewhere similar that does
> > requirements tracking? (Jason, are you listening? Anything here?) It
> > would be great for newcomers to be able to see these, and for us to be
> > able to "check them off" as we meet them.
> 
> there's a project task manager (which could double as a requirements thing)
> at sourceforge... plus some other cool options.

I'm all for it, but this will require installing some new software on
the apache server. Last time we tried that with Bugzilla the site was
cracked, so I believe only Brian can install anything like that now.
What chances do we have that this will happen? Does anybody know?
-- 
Arnaud  Le Hors - IBM Cupertino, XML Technology Group

Re: parser-next-gen goals, plan, and requirements

Posted by Joe Polastre <jp...@apache.org>.

From: "Brett McLaughlin" <br...@lutris.com>
> Oh, how I'd love to volunteer here... Just too freakin' busy. I wonder
> if there is any tool at collab.net or somewhere similar that does
> requirements tracking? (Jason, are you listening? Anything here?) It
> would be great for newcomers to be able to see these, and for us to be
> able to "check them off" as we meet them.

there's a project task manager (which could double as a requirements thing)
at sourceforge... plus some other cool options.

-Joe

Re: parser-next-gen goals, plan, and requirements

Posted by Stefano Mazzocchi <st...@apache.org>.

Brett McLaughlin wrote:
> 
> Arnaud Le Hors wrote:
> >
> > About the requirements. I've tried no less than 3 times to send a
> > message out but it seems to always go into a black hole. (James, are you
> > hiding somewhere in there deleting my messages as they go through? Just
> > kidding! :-)
> > Here I go again:
> >
> > At the minimum we need to have the same as Xerces 1. These are:
> >
> > Validating XML 1.0
> > Namespaces
> > SAX2
> > DOM Level 2
> > XML Schemas
> 
> Is SAX 1 support worth including? I'm -0, but not sure. Since we are
> talking size, you could simply implement SAX2, and include XMLReaderAdapter
> and call it SAX 1.0. Any opinions? Not a huge deal, but as long as we're
> laying out requirements...

-1 on SAX1
 
> >
> > In addition, I guess it's a given that we all want:
> >
> > Modularity, meaning that one should be able to have a jar containing the
> > bare minimum XML parser for instance.
> 
> I think it's worthwhile to drill into what this means. What exactly is
> "the bare minimum XML parser?" Does that mean it just parses XML? Does
> it output events (SAX)? What kind of APIs should we expose?
> 
> >
> > Also, performance should be the best-of-breed ACROSS ALL JIT's (not just
> > Hotspot).
> 
> This is fair - we do need to make a decision on 1.1 JVM's. Personally, I
> don't think we need support - in other words, I'm for WeakReferences and
> Collections, because they (a) make things easier to understand and (b)
> could possibly really help with performance and memory. I agree with
> James that Xerces 1.x is great for 1.1.

-1 on supporting 1.1
+1 on supporting all 1.2+ JVM (not just hotspot)
 
> >
> > What else?
> >
> > Who's keeping track of the requirements that we come up with? We should
> > make this as open as possible at this point, have someone make a
> > compilation, and have a discussion on what we agree on.
> 
> It might be nice, at a minimum, to have a web page at Xerces with an
> ongoing list. It can be rough and ugly, but at least we can all wake up
> and see that condensed instead of missing something reading through 100
> mails in the wee hours...
> 
> >
> > (Hint: this is an opportunity for someone new to volunteer. Please,
> > don't make me do it, it would end up being labeled as "IBM's
> > requirements" ;-)
> 
> Oh, how I'd love to volunteer here... Just too freakin' busy. I wonder
> if there is any tool at collab.net or somewhere similar that does
> requirements tracking? (Jason, are you listening? Anything here?) It
> would be great for newcomers to be able to see these, and for us to be
> able to "check them off" as we meet them.

Asking for volunteers is pointless: just do something that barely
compiles, create the itch and people will jump to scratch it. This is how
it works when the community is perceived as open.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------

Re: parser-next-gen goals, plan, and requirements

Posted by Brett McLaughlin <br...@lutris.com>.

Arnaud Le Hors wrote:
> 
> About the requirements. I've tried no less than 3 times to send a
> message out but it seems to always go into a black hole. (James, are you
> hiding somewhere in there deleting my messages as they go through? Just
> kidding! :-)
> Here I go again:
> 
> At the minimum we need to have the same as Xerces 1. These are:
> 
> Validating XML 1.0
> Namespaces
> SAX2
> DOM Level 2
> XML Schemas

Is SAX 1 support worth including? I'm -0, but not sure. Since we are
talking size, you could simply implement SAX2, and include XMLReaderAdapter
and call it SAX 1.0. Any opinions? Not a huge deal, but as long as we're
laying out requirements...
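
As a rough sketch of what "SAX2 plus an adapter" could look like: in org.xml.sax.helpers, XMLReaderAdapter is the class that presents a SAX2 XMLReader as a SAX1 Parser (ParserAdapter goes the other way). The class name Sax1Bridge and the method elementNames are made up for illustration, and the SAX2 reader here is simply whatever JAXP hands back, not a next-gen parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderAdapter;

// Hypothetical helper: the SAX1 types are deprecated but still part of the API.
@SuppressWarnings("deprecation")
public class Sax1Bridge {
    /** Parses xml through a SAX1 facade and returns element names in document order. */
    public static List<String> elementNames(String xml) throws Exception {
        // Any SAX2 reader will do; this one is the JAXP default.
        XMLReader sax2 = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // XMLReaderAdapter exposes the SAX2 reader through the SAX1 Parser interface.
        Parser sax1 = new XMLReaderAdapter(sax2);
        List<String> names = new ArrayList<>();
        sax1.setDocumentHandler(new HandlerBase() {
            @Override
            public void startElement(String name, AttributeList attrs) {
                names.add(name);
            }
        });
        sax1.parse(new InputSource(new StringReader(xml)));
        return names;
    }
}
```

The only SAX1-specific code a parser jar would need to carry in this arrangement is the adapter class itself.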

> 
> In addition, I guess it's a given that we all want:
> 
> Modularity, meaning that one should be able to have a jar containing the
> bare minimum XML parser for instance.

I think it's worthwhile to drill into what this means. What exactly is
"the bare minimum XML parser?" Does that mean it just parses XML? Does
it output events (SAX)? What kind of APIs should we expose?

> 
> Also, performance should be the best-of-breed ACROSS ALL JIT's (not just
> Hotspot).

This is fair - we do need to make a decision on 1.1 JVM's. Personally, I
don't think we need support - in other words, I'm for WeakReferences and
Collections, because they (a) make things easier to understand and (b)
could possibly really help with performance and memory. I agree with
James that Xerces 1.x is great for 1.1.
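
To illustrate why WeakReference (java.lang.ref, available from JDK 1.2) matters for the "release nodes and rebuild on request" idea raised earlier in the thread, here is a minimal sketch. The class Rebuildable and its methods are hypothetical names, not part of any Xerces or Xalan API, and the Supplier type is a much later JDK addition used only to keep the example short.

```java
import java.lang.ref.WeakReference;
import java.util.function.Supplier;

/** Hypothetical holder: keeps a value weakly and rebuilds it on demand. */
public class Rebuildable<T> {
    private final Supplier<T> builder;      // knows how to (re)create the value
    private WeakReference<T> ref = new WeakReference<>(null);
    private int builds;                     // how many times the value was built

    public Rebuildable(Supplier<T> builder) {
        this.builder = builder;
    }

    public synchronized T get() {
        T value = ref.get();
        if (value == null) {                // never built, or reclaimed by the GC
            value = builder.get();
            ref = new WeakReference<>(value);
            builds++;
        }
        return value;                       // the caller's strong reference keeps it alive
    }

    /** Simulates the collector reclaiming the value; useful for experiments. */
    public synchronized void release() {
        ref.clear();
    }

    public synchronized int buildCount() {
        return builds;
    }
}
```

A DOM built this way would trade rebuild time for memory: nodes nobody holds can disappear, and the cost is paid again only if they are asked for.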

> 
> What else?
> 
> Who's keeping track of the requirements that we come up with? We should
> make this as open as possible at this point, have someone make a
> compilation, and have a discussion on what we agree on.

It might be nice, at a minimum, to have a web page at Xerces with an
ongoing list. It can be rough and ugly, but at least we can all wake up
and see that condensed instead of missing something reading through 100
mails in the wee hours...

> 
> (Hint: this is an opportunity for someone new to volunteer. Please,
> don't make me do it, it would end up being labeled as "IBM's
> requirements" ;-)

Oh, how I'd love to volunteer here... Just too freakin' busy. I wonder
if there is any tool at collab.net or somewhere similar that does
requirements tracking? (Jason, are you listening? Anything here?) It
would be great for newcomers to be able to see these, and for us to be
able to "check them off" as we meet them.

-Brett

> --
> Arnaud  Le Hors - IBM Cupertino, XML Technology Group
> 
> ---------------------------------------------------------------------
> In case of troubles, e-mail:     webmaster@xml.apache.org
> To unsubscribe, e-mail:          general-unsubscribe@xml.apache.org
> For additional commands, e-mail: general-help@xml.apache.org

-- 
Brett McLaughlin, Enhydra Strategist
Lutris Technologies, Inc. 
1200 Pacific Avenue, Suite 300 
Santa Cruz, CA 95060 USA 
http://www.lutris.com
http://www.enhydra.org

Re: parser-next-gen goals, plan, and requirements

Posted by James Duncan Davidson <ja...@eng.sun.com>.

on 7/11/00 7:12 PM, James Duncan Davidson at james.davidson@eng.sun.com
wrote:

> I'll be happy to pull together a list tomorrow if nobody else has done it
> before then.

Er. I shouldn't have volunteered so quickly -- turns out my sister is
graduating (hurray!) and I'm in North Carolina to celebrate, so I don't have
time to be volunteering quite this way till I get back on Monday.. :) And it
looks like somebody from neither Sun nor IBM is doing so -- which is *real*
goodness.

.duncan

Re: parser-next-gen goals, plan, and requirements

Posted by James Duncan Davidson <ja...@eng.sun.com>.

on 7/11/00 5:49 PM, Arnaud Le Hors at lehors@us.ibm.com wrote:

> About the requirements. I've tried no less than 3 times to send a
> message out but it seems to always go into a black hole. (James, are you
> hiding somewhere in there deleting my messages as they go through? Just
> kidding! :-)

Yep.... I hacked locus' qmail and got a bot on.. :)

> Who's keeping track of the requirements that we come up with? We should
> make this as open as possible at this point, have someone make a
> compilation, and have a discussion on what we agree on.

I'll be happy to pull together a list tomorrow if nobody else has done it
before then.

.duncan

Re: parser-next-gen goals, plan, and requirements

Posted by Brett McLaughlin <br...@lutris.com>.

Stefano Mazzocchi wrote:
> 
> Donald Ball wrote:
> >
> > I'd like to see XPath and XInclude support built in to Xerces as
> > modules.
> 
> +1 for XInclude and XBase (they go together), as well as a way to
> extract XLink information from the file, if found.
> 
> -1 for XPath, it should belong to Xalan2

I'm actually in disagreement here - I know in JDOM, people want to be
able to look up nodes/elements/attributes using XPath. Additionally,
more and more APIs are using XPath outside of XSL/T. To include it as
part of Xalan forces users to get an entire jar to do that.
Additionally, based on Scott's reaction ;-), he wouldn't add in support
for XPath on JDOM. That's another reason. Still, I think users that know
XPath would love to do:

Element specific = root.getChild("foo[@bar='1']");

Not having to carry around XSL/T weight to do this is a major advantage,
and prepares us for the APIs that use XPath outside of just
transformations.
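
For a concrete feel of XPath lookup without any XSLT machinery, here is a small sketch that uses the javax.xml.xpath package, which arrived in the JDK well after this thread and is used purely as an illustration. The class XPathLookup and the method selectElement are invented names. The predicate form foo[@bar='1'] is what selects the element itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not a JDOM or Xerces API.
public class XPathLookup {
    /** Returns the first element matching the expression, relative to the root element, or null. */
    public static Element selectElement(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Evaluate with the root element as context node, much as root.getChild(...) would.
        return (Element) xpath.evaluate(expression, doc.getDocumentElement(), XPathConstants.NODE);
    }
}
```

Calling selectElement with the expression foo[@bar='1'] returns the matching child of the root, or null when nothing matches.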

-Brett

> 
> --
> Stefano Mazzocchi      One must still have chaos in oneself to be
>                           able to give birth to a dancing star.
> <st...@apache.org>                             Friedrich Nietzsche
> --------------------------------------------------------------------
>  Missed us in Orlando? Make it up with ApacheCON Europe in London!
> ------------------------- http://ApacheCon.Com ---------------------

Re: parser-next-gen goals, plan, and requirements

Posted by Donald Ball <ba...@webslingerZ.com>.

On Wed, 12 Jul 2000, Stefano Mazzocchi wrote:

> Donald Ball wrote:
> > 
> > I'd like to see XPath and XInclude support built in to Xerces as
> > modules.
> 
> +1 for XInclude and XBase (they go together), as well as a way to
> extract XLink information from the file, if found.
> 
> -1 for XPath, it should belong to Xalan2

but if one is doing an xinclude on a subset of a document with an
xpath-based xpointer, wouldn't it make more sense for the parser to ignore
the parts of the document that fall outside of the bounds of the
xpath? stupid example - given an xpath expression:

/root/*[1]

the parser could stop parsing after reading the first element child of the
root element.
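
The early-exit idea can be approximated today on top of plain SAX, as a sketch: a handler hard-wired to /root/*[1] aborts the parse with a sentinel exception once that element closes. The names FirstChildOnly, Handler and Done are hypothetical, and a real implementation would derive the stopping condition from the XPath rather than hard-code it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class FirstChildOnly {
    /** Sentinel used to abort the parse once the wanted subtree is complete. */
    public static final class Done extends SAXException {
        Done() { super("first child of root fully read"); }
    }

    /** Tracks depth and stops when the first child of the root element closes. */
    public static final class Handler extends DefaultHandler {
        public int depth = 0;
        public int elementsSeen = 0;
        public String firstChild;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            depth++;
            elementsSeen++;
            if (depth == 2 && firstChild == null) {
                firstChild = qName;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (depth == 2) {
                throw new Done(); // nothing after this point can match /root/*[1]
            }
            depth--;
        }
    }

    /** Parses until /root/*[1] is complete and returns the handler with what it saw. */
    public static Handler parse(String xml) throws Exception {
        Handler h = new Handler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), h);
        } catch (Done stop) {
            // expected: we bailed out early on purpose
        }
        return h;
    }
}
```

One consequence worth noting: the rest of the document is never checked for well-formedness, which may or may not be acceptable for an XInclude processor.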

- donald

Re: parser-next-gen goals, plan, and requirements

Posted by Stefano Mazzocchi <st...@apache.org>.

Donald Ball wrote:
> 
> I'd like to see XPath and XInclude support built in to Xerces as
> modules.

+1 for XInclude and XBase (they go together), as well as a way to
extract XLink information from the file, if found.

-1 for XPath, it should belong to Xalan2

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------

Re: parser-next-gen goals, plan, and requirements

Posted by Donald Ball <ba...@webslingerZ.com>.

I'd like to see XPath and XInclude support built in to Xerces as
modules.

- donald

Re: parser-next-gen goals, plan, and requirements

Posted by Arnaud Le Hors <le...@us.ibm.com>.

About the requirements. I've tried no less than 3 times to send a
message out but it seems to always go into a black hole. (James, are you
hiding somewhere in there deleting my messages as they go through? Just
kidding! :-)
Here I go again:

At the minimum we need to have the same as Xerces 1. These are:

Validating XML 1.0
Namespaces
SAX2
DOM Level 2
XML Schemas

In addition, I guess it's a given that we all want:

Modularity, meaning that one should be able to have a jar containing the
bare minimum XML parser for instance.

Also, performance should be the best-of-breed ACROSS ALL JIT's (not just
Hotspot).

What else?

Who's keeping track of the requirements that we come up with? We should
make this as open as possible at this point, have someone make a
compilation, and have a discussion on what we agree on.

(Hint: this is an opportunity for someone new to volunteer. Please,
don't make me do it, it would end up being labeled as "IBM's
requirements" ;-)
-- 
Arnaud  Le Hors - IBM Cupertino, XML Technology Group

Re: parser-next-gen goals, plan, and requirements

Posted by Costin Manolache <co...@eng.sun.com>.

> > 2) Read-only, memory conservative, high performance DOM subset.  In some
> > ways, this is optional, since the alternative is that the XSLT processor
> > implement its own DOM, as it does today.  But it would be neat and simpler
> > if only one DOM implementation needed to exist.
>
> +1 -- note that this could be an "optional" DOM shipped as an external .jar
> file. In fact, I'd like to see as a requirement the ability to build into a
> set of jars that reflects the modules so that it's clear how to assemble a
> stripped down parser for whatever use.

I think it's a good idea to have multiple DOM modules, but if it is possible
to implement whatever extensions xalan requires into default DOM
( or in the internal APIs)   we should do it.

It seems the xalan-2 DOM is very clean and can easily be moved into
spinnaker as a module.


> > 3) parse-next function, with added control over buffer size.
>
> Explain more. Would this be the ability to feed in an input source that says
> "grab 16K at a time from the underlying stream and feed it into the parser"?
> This puts a requirement on the parser to be able to parse in increments,
> and a requirement on all the providers to higher level services to provide
> data to their consumers without having the full picture.

I guess it's a good point - you should be able to parse the document in
an iterative way:

parseNext(ParseState)  will parse the next element or char chunk.

SAX is a great API, but this kind of API may be much better as
an internal API.
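
A by-request loop of this kind might look like the sketch below. It uses the javax.xml.stream (StAX) API, which was added to the JDK years after this discussion and stands in here for the hypothetical parseNext(ParseState); the class PullDemo and the method trace are illustrative names only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullDemo {
    /** Pulls events one at a time and returns a readable trace of them. */
    public static List<String> trace(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent character events so each text run is reported once.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> events = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                // The caller decides when to advance; nothing is pushed at it.
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        events.add("start:" + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (!reader.isWhiteSpace()) {
                            events.add("text:" + reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        events.add("end:" + reader.getLocalName());
                        break;
                    default:
                        break; // comments, processing instructions, end of document
                }
            }
        } finally {
            reader.close();
        }
        return events;
    }
}
```

Because the consumer drives the loop, an incremental transform can interleave its own work between calls without a second thread.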

One very interesting extension of this would be to do something
like parseAtOffset( int off), which will read the next element
starting with a certain file offset. This combined with a cache
may save us from storing very large documents in memory.

We should explore this !


> > 4) Some sort of way to tell if a SAX char buffer is going to be
> > overwritten, so data doesn't have to be copied until this occurs.

We have a similar problem in tomcat ( attempting to avoid
copy ), one way to resolve that would be to expose the
buffer via the internal API.

I think buffering and caching are vital to achieve performance
( by design , no pre-optimization here :-), and we should have full
control over that. Assuming a (pool of) 4k buffers is used to
read, you should be able to pin the buffer or be notified when
the buffer changes.

( it may sound complex, but it would be great to have - maybe
as a goal, not a requirement )
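
One way to picture the "be notified when the buffer changes" option is the sketch below, where a consumer holds a view into a shared buffer and the text is copied only at the moment the buffer is about to be overwritten. ReusableBuffer, Slice, fill and slice are all hypothetical names, and a single buffer stands in for the pool.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reusable char buffer that warns slice holders before it is overwritten. */
public class ReusableBuffer {
    private final char[] data;
    private int length;
    private final List<Slice> liveSlices = new ArrayList<>();

    public ReusableBuffer(int capacity) {
        data = new char[capacity];
    }

    /** Overwrites the buffer, first letting outstanding slices save their text. */
    public void fill(String text) {
        for (Slice s : liveSlices) {
            s.detach();
        }
        liveSlices.clear();
        length = Math.min(text.length(), data.length);
        text.getChars(0, length, data, 0);
    }

    /** Returns a view of data[start, start+len) without copying anything yet. */
    public Slice slice(int start, int len) {
        if (start < 0 || len < 0 || start + len > length) {
            throw new IndexOutOfBoundsException("slice outside filled region");
        }
        Slice s = new Slice(start, len);
        liveSlices.add(s);
        return s;
    }

    /** A view that copies its characters only when the buffer is about to change. */
    public final class Slice {
        private final int start;
        private final int len;
        private String copy; // null while the view still points into the shared buffer

        private Slice(int start, int len) {
            this.start = start;
            this.len = len;
        }

        private void detach() {
            copy = new String(data, start, len);
        }

        public boolean isCopied() {
            return copy != null;
        }

        @Override
        public String toString() {
            return copy != null ? copy : new String(data, start, len);
        }
    }
}
```

The sketch keeps every slice alive until the next fill, so a real design would also need a way for consumers to drop a slice they no longer care about.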

> > Big +1.  I would like to see this done independent of any next-gen work,
> > for availability to Xalan 2.0 and other projects, sooner, rather than
> > later.
>
> Ok. Should I propose an apache-auc module to the joint jakarta/xml efforts
> to collect these sorts of things? We've talked about it on the jakarta lists
> and said resoundingly "YES" but didn't know how others feel. I think that if
> there's a loud "YES" here, that we can make headway.
>
> And most of my interest really lies at the AUC type level

That would be great - there is a lot of great code in all apache projects,
not only xerces, but also tomcat ( thread pools, logger, etc),  and so on,
and it will be really great if we can reuse code from one project to another.

StringTable ( or StringPool ) will provide a great benefit in tomcat for
example, assuming we could clean up the interfaces a bit.

This may also be a great way to keep alive some of the ( great IMHO )
1.1 optimizations that are now part of xerces, for example as a set of
1.1 modules. We will need a good set of interfaces, but it will have many
benefits : we may end up writing modules optimized for various
configurations ( low memory, embedded, jits), and do that without
adding any complexity to the project that uses them.


( another good example is a common Resource/Messages/whatever
module for I18N, and a common logger ).


Costin

Re: parser-next-gen goals, plan, and requirements

Posted by James Duncan Davidson <ja...@eng.sun.com>.

on 7/11/00 2:42 PM, Scott Boag/CAM/Lotus at Scott_Boag@lotus.com wrote:

> First, I would rather see a list of requirements first, rather than goals.
> The goals below are simply mom and apple pie, in my opinion.  The devil's
> in the details.

He he he... 

> 1) SAX2, of course.

+1

> 2) Read-only, memory conservative, high performance DOM subset.  In some
> ways, this is optional, since the alternative is that the XSLT processor
> implement its own DOM, as it does today.  But it would be neat and simpler
> if only one DOM implementation needed to exist.

+1 -- note that this could be an "optional" DOM shipped as an external .jar
file. In fact, I'd like to see as a requirement the ability to build into a
set of jars that reflects the modules so that it's clear how to assemble a
stripped down parser for whatever use.

> 2a) Document-order indexes or API as a DOM extension.  I know of few or
> no conformant XSLT processors that can do without this.
> 2b) [optional] isWhite() method as a DOM extensions (pure telling of
> whether or not the text contains non-whitespace), for performance reasons.
> 2c) Some sort of weak reference, where nodes could be released if not
> referenced, and then rebuilt if requested.  For performance and memory
> footprint.

Ok, these are all requirements on the DOM module. Which one, Read-Only,
Read-Write, or both?

> 3) parse-next function, with added control over buffer size.

Explain more. Would this be the ability to feed in an input source that says
"grab 16K at a time from the underlying stream and feed it into the parser"?
This puts a requirement on the parser to be able to parse in increments,
and a requirement on all the providers to higher level services to provide
data to their consumers without having the full picture.

> 4) Some sort of way to tell if a SAX char buffer is going to be
> overwritten, so data doesn't have to be copied until this occurs.

Once again, explain more.. I think that basic programming tenets say that
if I hand a buffer to a consumer, whether via the SAX provider, or any other
provider, I'm not going to muck with it until it's released.

> 5) Serialization support, as is currently in Assaf's classes.

I've intentionally left out serialization as a discussion point to date as
it's not parser, it's part of the larger toolset. In my world view, it seems
that the serialization (or better called output or externalization in my
mind since serialization carries a specific meaning in the Java sense) sits
on the other side of the producers from the parser in the diagram that I
threw out.

> 6) Schema data-type support, which will be needed for XSLT2, and Xalan 2.0
> extensions.

Right -- Pluggable validators should include Schema, DTD, and possibly
Relax if somebody wants to take a crack at it.

> 7) We should talk about whether XPath should be part of the core XML
> services, rather than part of the XSLT processor.

Yes we should. My initial thoughts are no, but...

> 8) Small core footprint for standalone, compiled stylesheet capability, for
> use on small devices.  This would need to include the Serializer.  I'm not
> sure if this should really be a separate micro-parser?

Compiled stylesheets are something that would be different than a parser in
my mind -- wouldn't this be something that sits at the Xalan level?
(Hopefully with those folks at Sun helping out <ducking>:).

> +0.  I'm not sure this is compatible with the first goal.  Also, I would
> rather have performance and *scaleable* memory footprint prioritized over
> jar file size.  However, the Xalan project does need this...

Ok -- fair enough. I'd also prioritize modularization over this if we take it
as a good thing to be able to build out into a set of jars where a small
non-validating, SAX-only parser could be intuitively and quickly thrown together
for a particular application (without a specialized build or diving into the
code). This would satisfy quite a bit of my needs as far as jar size.

>> * Modular. It should be possible to build a parser as a set of Jar files
>> so that a smaller parser can be assembled which fits the need of a
>> particular implementation. For example, in TV sets do you really need
>> validation?
> 
> +0, or +1, depending on how you read this.  You may not need validation,
> but you may indeed need schema processing for data types, entity refs, etc.

Right. What I'm thinking is a build target that produces:

    parser-core.jar
    validator-dtd.jar
    validator-schema.jar
    producer-sax.jar
    producer-domrw.jar
    producer-domro.jar
    producer-jdom.jar

Then the person that needs a non validating SAX parser grabs parser-core and
producer-sax and goes on leaving the other parts behind.

>> * Cleanly Optimized. This means optimized in a way that is compatible
>> with modern virtual machines such as HotSpot. Optimizations that work
>> well with JDK 1.1 style VMs can actually impact performance under
>> more modern VMs. Optimizations that interfere with readability,
>> modularity, or size will be shunned.
> 
> -0 or +1, depending on how you read this.  Is it, or is it not, a
> requirement to have good performance with JDK 1.1, or even backwards
> compatibility?  If not, then I think, sure, let's optimize in a way that is
> cleanly compatible with "modern" VMs.

I think that we have a perfectly good parser answer for JDK 1.1 in the form
of Xerces 1.0.x -- I would actually make our target goal 1.2/1.3, with 1.1
compatibility (possibly, I'd actually really like to push forward so that
collections and whatever else can be used).

What this means is that instead of having to carry a specialized Hashtable
(as the one in 1.1 sucked), you use Hashtable and know that on 1.2/1.3 the
Hashtable impl is much better, and you're happy that it works fine on
1.1, even if it's not as performant.

> Big +1.  I would like to see this done independent of any next-gen work,
> for availability to Xalan 2.0 and other projects, sooner, rather than
> later.

Ok. Should I propose an apache-auc module to the joint jakarta/xml efforts
to collect these sorts of things? We've talked about it on the jakarta lists
and said resoundingly "YES" but didn't know how others feel. I think that if
there's a loud "YES" here, that we can make headway.

And most of my interest really lies at the AUC type level

>> * Refactor out a base parser. Once we see how those APIs should look (or
>> at least get a start, they don't have to be perfect :) we start at
>> the bottom and look at the code of the existing parsers to come up
>> with a basic non-validating parser that can rip through XML.
> 
> -1.  I think there is enough knowledge at this point to first put together
> a pretty complete design, with a clear understanding of how schema
> processing should work with the base parser (maybe they shouldn't -- but I
> would argue that point).  Hard problem, in my opinion, and more design
> rather than less would result in a better idea of what a base parser should
> be.

Ok.. I think that the API discussion that has started (the one with the
diagram and no APIs :) is a start on that. Once that gets to a certain
point, then I think that we should get some code rolling though.

>> * Set SAX on top of this base parser. Of course.
> 
> +1. However, I think there is likely clear evidence that it may benefit
> certain high-performance applications to have a much tighter binding to the
> parser than SAX supports.  A particular problem is the way that SAX2 treats
> character data, and the fact that it's an event-only API, rather than
> having by-request characteristics (i.e. parse-next type functionality, so
> you can run an incremental parse/transform without having to run an extra
> thread).

I know that there are others with other opinions on this, I'll defer to
them.

>> * Look at pluggable validation.
> 
> +1, but validation is not the same thing as basic schema and DTD
> processing.  Data-types, entity refs, default attributes, etc., tend to be
> required.

Actually, I'd like to see if that's not the case. I'd really like to see
basic schema and DTD processing moved out of the core path so that a
non-validating parser can be put together.

This really cuts down on the critical path when used in a server case where
validation is quite frequently turned off. Same for TV sets or PDAs. There
is a whole category of apps for which validation is not needed after
development.
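
To make the distinction in this exchange concrete, the sketch below parses with validation switched off and still sees an attribute default declared in the internal DTD subset, which is the kind of "basic DTD processing" a non-validating parser is expected to do. The class DefaultsWithoutValidation and the method rootAttribute are made-up names, and the parser is the stock JAXP one rather than anything proposed here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DefaultsWithoutValidation {
    /** Parses with validation off and returns the named attribute as reported on the root element. */
    public static String rootAttribute(String xml, String name) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false); // already the default, spelled out for clarity
        final String[] value = new String[1];
        final boolean[] seenRoot = new boolean[1];
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (!seenRoot[0]) {
                    seenRoot[0] = true;
                    // Includes defaults supplied from the DTD, not just literal attributes.
                    value[0] = atts.getValue(name);
                }
            }
        });
        return value[0];
    }
}
```

Moving DTD handling fully out of the core path would therefore change what applications observe for documents like this, which is presumably why the point is contested.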

> -1 on JDOM for the core.  Just my opinion.  I don't like it, I think it
> misleads developers about the XML data model, and I would rather not see
> Apache support it.

Producers that sit on top aren't part of the core parser in the currently
circulating diagram... It should be made available as part of a full build
of XRI or whatever we call this thing. Even if we use a SAX++ for the core
internal representation, the SAX producer that produces SAX only events
should be a pluggable thing that sits on top of the core parser.

.duncan