Posted to j-dev@xerces.apache.org by Andy Clark <an...@apache.org> on 2000/07/18 06:09:44 UTC

Xerces Redesign

I tried responding to the initial threads while I was away in
Japan for a week but I had a hard time connecting and the message
that I finally sent didn't make it through! Oh well... 

Anyway, I've included my old response below just to enter it
into the public record. Most of it is irrelevant now as we've
already worked through the hurt feelings and gotten to the
important thing -- making a better Xerces parser.

I would highly recommend everyone look at the skeleton of code
that was checked in under the Xerces 2 branch. We had a few 
discussions about what the internal interfaces for the parser 
should look like to provide ease of understanding and modularity. 
And if you read my old message you'll see that this would have
gone public once we had the skeleton in place but we were
derailed by Schema implementation issues.

Edwin already noticed the SymbolTable that is part of this code
as a way of removing the integers throughout the parser's design
and using String objects instead. A side effect of removing the
integers, however, is that all of the character data in the file
must be transcoded. In other words, we have to assume that the
incoming bytes are always converted to Unicode characters,
even if we throw this data away without ever looking at it. But
avoiding that conversion was a performance feature that made
Xerces more complicated, and I think that the gain in code
simplification is worth it.

Also, there is a way to abstract the grammar definitions in order
to provide grammar caching. This area needs some work, however,
because of the complexity that Schema adds to validation. There's
a lot more flexibility in Schema and the "compiled" forms of the
grammars have to be able to handle the union of features. I would
really appreciate it if someone helped look at that to see
where it's deficient.

I have a list of requirements that I'll post tomorrow once I have
checked them against the current stated requirements. The one
thing that I know I don't like in the current requirements is not
being able to run on 1.1.x VMs. I can understand dropping support 
for SAX 1 because SAX 2 is a superset and users should update
their code to SAX 2, anyway. However, requiring users to switch
VMs is a *much* harder sell. Many customers just cannot afford
to upgrade. So I think that this requirement needs to be re-opened.
But I'll post more thoughts tomorrow.

For convenience, I've attached an HTML version of the skeleton
checked in under the Xerces 2 branch. It may be a lot easier to
see what we're talking about and make comparisons with the
initial Spinnaker checkin.

--- BEGIN OLD MESSAGE ---

I wish that I could have been involved with this discussion earlier
but I've been away in the Land of the Rising Sun. It's been a little
tough to get connected. (I finally got online today to find about
175 messages in the Xerces mailing lists alone! Luckily a bunch of
them were cross-posted duplicates.)

But now to the meat of the matter...

First, I would like to explain why the Spinnaker announcement upset 
me. Since the start of the Xerces project, a small group of people,
including myself, have worked extremely hard on the project. We have
done everything: implemented new features, fixed bugs, written docs,
and produced builds and releases. Aside from the generous donations
of Arkin's serializers, the WML DOM implementation, and miscellaneous 
patch submissions, no consistent contributions to the project have
come from anyone outside the original team. I think that this is a 
point that is too often overlooked by the public.

Open source software development is based on meritocracy. And this 
is the primary reason why I was upset about the announcement. It 
could have been *anyone* and I would have had the same reaction. The
announcement read to me as if a non-contributor were trying to
co-opt the project. (These comments aren't supposed to be flame
bait. I'm just stating my initial feelings on reading the post.)

I would like nothing more than to make the development of Xerces a
true community where everyone gets involved but this hasn't
happened, yet. Perhaps this series of threads will change that
trend. Until I see the commit messages, though, I won't hold my
breath. :) Somebody should update the web site to show people how 
to subscribe to the xml-contrib mailing list.

Next, I would like to talk about the redesign discussions occurring
on the mailing lists. Yes, we want to redesign the parser! The code
checked in under the Xerces 2 branch is just a skeleton of our 
thoughts for a redesign. However, this design was put on hold to 
work on implementing Schema. If we hadn't become pre-occupied with 
Schema, the design discussion was going to move to the mailing list. 
I guess now's the time to do that, eh?

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: Xerces Redesign: REQUIREMENTS

Posted by Andy Clark <an...@apache.org>.
James Duncan Davidson wrote:
> There's no need to generate code at runtime to do this. You can do 
> all sorts of things at runtime with properties and classnames and 
> class.forNames to make things happen.

Using dynamic features too much would make it harder to port to
C++. There are obvious differences between the languages that
will influence the code but I'd like to keep them as close as
possible.

> Generating custom classes is problematic. Then you have versioning 
> problems, issues if you have two parsers in your classpath, and 
> generally feels like something that's statically compiled and not 
> designed to run in a dynamically loaded dynamically linked system.

Yes, custom classes are a problem. Anybody have any ideas?

> I guess you haven't read my full comments. :) I didn't say that 
> we had to provide everything as separate zip/tgz files.. I said
> that the parser should be able to be built into pieces -- much
> different. What we ship as an official distro is orthogonal to
> the targets that we have in our build file and the internal 
> structure of the set of jar files in that distro.

Got it and I agree. Sounds like you are the build person. ;)

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: Xerces Redesign: REQUIREMENTS

Posted by James Duncan Davidson <du...@x180.com>.
on 7/21/00 1:57 PM, Andy Clark at andyc@apache.org wrote:

> James Duncan Davidson wrote:
>> I don't see why they have to -- if you think of a build of modules
>> as a set of build targets, you should be able to build what you want
>> out of the source tree. Just want to build the main parser + sax --
>> "./build sax"
> 
> Okay, so how do we do this in the build file?

Using syntax not tied to make or ant :)

    main depends on parser, sax, rdom, ddom, rwdom
    dist depends on main

That's what dependencies are for.
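
To make the idea concrete without tying it to make or ant, here is a
minimal Java sketch of resolving such target dependencies. The class
BuildTargets and its methods are hypothetical names invented for this
illustration, not part of any Xerces build; targets that are never
declared (parser, sax, and so on) are simply treated as leaves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: records which targets each build target depends on. */
final class BuildTargets {
    private final Map<String, List<String>> dependsOn = new LinkedHashMap<>();

    /** Declares a target and the targets that must be built before it. */
    BuildTargets target(String name, String... dependencies) {
        dependsOn.put(name, Arrays.asList(dependencies));
        return this;
    }

    /** Returns the goal and everything it needs, dependencies first. */
    List<String> buildOrder(String goal) {
        List<String> order = new ArrayList<>();
        visit(goal, new HashSet<>(), order);
        return order;
    }

    // Depth-first walk; cycles are silently cut short rather than reported.
    private void visit(String name, Set<String> seen, List<String> order) {
        if (!seen.add(name)) {
            return;
        }
        for (String dependency : dependsOn.getOrDefault(name, Collections.emptyList())) {
            visit(dependency, seen, order);
        }
        order.add(name);
    }
}
```

With "main" declared as depending on parser, sax, rdom, ddom and rwdom,
and "dist" on main, asking for buildOrder("sax") yields only the sax
target, while buildOrder("dist") lists every module before main and dist.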

> Depending on the target, do we generate a custom parser instance class that
> gets compiled? This is a possibility. And I actually prefer it because I'd
> like to have the output always be "DOMParser" even if the user doesn't want
> validation but does want a DOM parser.

There's no need to generate code at runtime to do this. You can do all sorts
of things at runtime with properties and classnames and class.forNames to
make things happen.

Generating custom classes is problematic. Then you have versioning problems,
issues if you have two parsers in your classpath, and generally feels like
something that's statically compiled and not designed to run in a
dynamically loaded dynamically linked system.
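
Choosing an implementation at runtime from a property and a class name
could look roughly like the sketch below. Everything named here
(DocumentParser, the two implementations, ParserLoader, and the property
key "example.parser.class") is hypothetical and stands in for whatever
the real parser classes would be; only Class.forName and the reflection
calls are standard Java.

```java
import java.util.Properties;

/** Hypothetical parser contract, used only for this illustration. */
interface DocumentParser {
    String describe();
}

final class NonValidatingParser implements DocumentParser {
    public String describe() { return "non-validating"; }
}

final class ValidatingParser implements DocumentParser {
    public String describe() { return "validating"; }
}

final class ParserLoader {
    /** Hypothetical property key naming the implementation class. */
    static final String PARSER_CLASS_KEY = "example.parser.class";

    /** Instantiates the configured class, or the non-validating default. */
    static DocumentParser load(Properties config) {
        String className = config.getProperty(
                PARSER_CLASS_KEY, NonValidatingParser.class.getName());
        try {
            // No code is generated: the class is looked up by name at runtime.
            Class<?> type = Class.forName(className);
            return (DocumentParser) type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load parser " + className, e);
        }
    }
}
```

The caller always sees the same DocumentParser type; which class sits
behind it is decided by configuration rather than by a generated class.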

> I guess you haven't fielded the hordes of complaints from people
> that have had to download ever-increasing ZIP/TGZ files. The
> build scripts are separate. People still want to download less
> in order to get what they want.

I guess you haven't read my full comments. :) I didn't say that we had to
provide everything as separate zip/tgz files.. I said that the parser should
be able to be built into pieces -- much different. What we ship as an
official distro is orthogonal to the targets that we have in our build file
and the internal structure of the set of jar files in that distro.

.duncan


Re: Xerces Redesign: REQUIREMENTS

Posted by Andy Clark <an...@apache.org>.
Ed Staub wrote:
> I had thought that this was settled earlier, to the effect that we'd put out
> two deployables:
>         - a "kitchen sink" jar as at present
>         - a set of "module" jars which together contain the same files as the
> "kitchen sink".
> 
> Is this ok with everyone?

As long as they are in separate downloadable ZIP/TGZs. Which is
why I was heading towards just downloading the separate modules.
Perhaps somewhere in between is a happy medium.

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

RE: Xerces Redesign: REQUIREMENTS

Posted by Ed Staub <es...@mediaone.net>.
Andy Clark wrote:

> James Duncan Davidson wrote:
>> Here I disagree with you. Coordinating the code across all these
>> spaces would be more of a pain than it's worth. The code should
>> be in separate *packages* -- and build into separate jars imho.
>> But it should be possible to "./build all" and get the whole shebang.

> I guess you haven't fielded the hordes of complaints from people
> that have had to download ever-increasing ZIP/TGZ files. The
> build scripts are separate. People still want to download less
> in order to get what they want.

It's clear that there are two communities here with differing requirements.
My own usage tends toward Andy's; I have to edit classpaths in script files
too often to want a lot of extra files.

I had thought that this was settled earlier, to the effect that we'd put out
two deployables:
	- a "kitchen sink" jar as at present
	- a set of "module" jars which together contain the same files as the
"kitchen sink".

Is this ok with everyone?

--------------

<tangent>By the way, the mail archives will hopefully be brought up to date
today, according to Dirk-Willem van Gulik.  There was a hardware
problem.</tangent>

-Ed Staub


-----Original Message-----
From: Andy Clark [mailto:andyc@apache.org]
Sent: Friday, July 21, 2000 4:58 PM
To: xerces-j-dev@xml.apache.org
Subject: Re: Xerces Redesign: REQUIREMENTS


James Duncan Davidson wrote:
> I don't see why they have to -- if you think of a build of modules
> as a set of build targets, you should be able to build what you want
> out of the source tree. Just want to build the main parser + sax --
> "./build sax"

Okay, so how do we do this in the build file? Depending on the
target, do we generate a custom parser instance class that gets
compiled? This is a possibility. And I actually prefer it
because I'd like to have the output always be "DOMParser" even
if the user doesn't want validation but does want a DOM parser.

The design I posted separates all of the basic functionality
into a series of base classes. Then DOMParser and SAXParser
become very simple wrappers on top of the basic document
parsing class.

> Here I disagree with you. Coordinating the code across all these
> spaces would be more of a pain than it's worth. The code should
> be in separate *packages* -- and build into separate jars imho.
> But it should be possible to "./build all" and get the whole shebang.

I guess you haven't fielded the hordes of complaints from people
that have had to download ever-increasing ZIP/TGZ files. The
build scripts are separate. People still want to download less
in order to get what they want.

--
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org


Re: Xerces Redesign: REQUIREMENTS

Posted by Andy Clark <an...@apache.org>.
James Duncan Davidson wrote:
> I don't see why they have to -- if you think of a build of modules
> as a set of build targets, you should be able to build what you want 
> out of the source tree. Just want to build the main parser + sax -- 
> "./build sax"

Okay, so how do we do this in the build file? Depending on the
target, do we generate a custom parser instance class that gets
compiled? This is a possibility. And I actually prefer it
because I'd like to have the output always be "DOMParser" even
if the user doesn't want validation but does want a DOM parser.

The design I posted separates all of the basic functionality
into a series of base classes. Then DOMParser and SAXParser
become very simple wrappers on top of the basic document
parsing class.
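
The layering described here, a shared base class with thin wrappers on
top, might be sketched as follows. This is only an illustration: the
class names are hypothetical (they are deliberately not DOMParser and
SAXParser), and the "scanner" is a toy that recognizes bare start and
end tags, nothing like a real XML scanner.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical base class: owns the scanning, exposes hooks to wrappers. */
abstract class BasicDocumentParser {
    /** Toy scanner: handles only <name> and </name> tags. */
    final void parse(String text) {
        int pos = 0;
        while ((pos = text.indexOf('<', pos)) >= 0) {
            int end = text.indexOf('>', pos);
            if (end < 0) {
                throw new IllegalArgumentException("unterminated tag");
            }
            String tag = text.substring(pos + 1, end);
            if (tag.startsWith("/")) {
                endElement(tag.substring(1));
            } else {
                startElement(tag);
            }
            pos = end + 1;
        }
    }

    protected abstract void startElement(String name);
    protected abstract void endElement(String name);
}

/** Event-style wrapper: just reports what the base class found. */
final class EventStyleParser extends BasicDocumentParser {
    final List<String> events = new ArrayList<>();
    @Override protected void startElement(String name) { events.add("start:" + name); }
    @Override protected void endElement(String name) { events.add("end:" + name); }
}

/** Tree-style wrapper: builds a small tree from the same callbacks. */
final class TreeStyleParser extends BasicDocumentParser {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    private final Deque<Node> open = new ArrayDeque<>();
    Node root;

    @Override protected void startElement(String name) {
        Node node = new Node(name);
        if (open.isEmpty()) {
            root = node;
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    @Override protected void endElement(String name) { open.pop(); }
}
```

Both wrappers stay small because all of the scanning lives in the base
class; each one only decides what to do with the callbacks.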

> Here I disagree with you. Coordinating the code across all these 
> spaces would be more of a pain than it's worth. The code should 
> be in separate *packages* -- and build into separate jars imho.
> But it should be possible to "./build all" and get the whole shebang.

I guess you haven't fielded the hordes of complaints from people
that have had to download ever-increasing ZIP/TGZ files. The
build scripts are separate. People still want to download less
in order to get what they want.

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

RE: Xerces Redesign: REQUIREMENTS

Posted by Paulo Gaspar <pa...@krankikom.de>.
I think Duncan's views on packaging to be so obviously right, simple and
flexible, that I am having trouble understanding why someone would oppose
them.

The only complexity they have is the housekeeping of the different builds.
Is that much?

It has no adverse impact on architecture - just enforces a good one:
 - Modules should be - obviously - MODULAR (as opposed to promiscuous);
 - Module interdependencies must be WELL DEFINED and simplified by design.

If the above conditions are respected, defining several build
configurations should be (almost) something like a trivial administrative
task.


Have fun,
Paulo Gaspar


> -----Original Message-----
> From: James Duncan Davidson [mailto:james.davidson@eng.sun.com]
> Sent: Friday, July 21, 2000 09:31
>
> on 7/18/00 7:46 PM, Andy Clark at andyc@apache.org wrote:
>
> > I don't see why this needs to be folded into the Xerces build.
>
> I don't see why they have to -- if you think of a build of
> modules as a set
> of build targets, you should be able to build what you want out of the
> source tree. Just want to build the main parser + sax -- "./build sax"
>
> In fact, I could see a Xerces-Light dist that just had a few
> things in it, a
> Xerces-Full dist that had *everything* in it -- and whatever other dists
> people thought valuable. But mostly in the range between light
> and full, I'd
> like app programmers to choose their poison.
>
> > In fact, I'd like to move away from "everything *and* the
> > kitchen sink" approach where all donations get rolled into the
> > main code release.
>
> Here I disagree with you. Coordinating the code across all these spaces
> would be more of a pain than it's worth. The code should be in separate
> *packages* -- and build into separate jars imho. But it should be possible
> to "./build all" and get the whole shebang.
>
> .duncan


Re: Xerces Redesign: REQUIREMENTS

Posted by James Duncan Davidson <ja...@eng.sun.com>.
on 7/18/00 7:46 PM, Andy Clark at andyc@apache.org wrote:

> I don't see why this needs to be folded into the Xerces build.

I don't see why they have to -- if you think of a build of modules as a set
of build targets, you should be able to build what you want out of the
source tree. Just want to build the main parser + sax -- "./build sax"

In fact, I could see a Xerces-Light dist that just had a few things in it, a
Xerces-Full dist that had *everything* in it -- and whatever other dists
people thought valuable. But mostly in the range between light and full, I'd
like app programmers to choose their poison.

> In fact, I'd like to move away from "everything *and* the
> kitchen sink" approach where all donations get rolled into the
> main code release.

Here I disagree with you. Coordinating the code across all these spaces
would be more of a pain than it's worth. The code should be in separate
*packages* -- and build into separate jars imho. But it should be possible
to "./build all" and get the whole shebang.

.duncan


Re: Xerces Redesign: REQUIREMENTS

Posted by Andy Clark <an...@apache.org>.
Brett McLaughlin wrote:
> There's also been agreement (or at least lack of disagreement) that,
> given the ability to do it in a modular fashion, JDOM will be supported.

I don't see why this needs to be folded into the Xerces build.
In fact, I'd like to move away from "everything *and* the
kitchen sink" approach where all donations get rolled into the
main code release. I want to see separate modules. Whether 
these separate modules are hosted on the XML Apache site is
another question.

> I'm curious as to if you intended this list to be ordered (in priority)?
> There are big issues as to whether performance or simplicity is more
> important.

No priority should be given to my list.

> Would it be acceptable to support SAX 1.0 purely through the
> ParserAdapter class that SAX 2.0 comes with? That would allow us to
> support it, albeit at a little slower and less "native" route.

Perhaps. These are all issues that we need to work through
and vote on.
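
For reference, adapter-based SAX 1.0 support might look like the sketch
below. One caveat on naming: in the SAX 2.0 helpers, ParserAdapter wraps
a SAX1 Parser so it can be used as a SAX2 XMLReader, while
XMLReaderAdapter goes the other way and presents a SAX2 XMLReader through
the SAX1 Parser interface. Serving SAX 1.0 clients from a SAX2-native
parser would therefore use XMLReaderAdapter, which is what this sketch
assumes. The class Sax1OverSax2 is a made-up name, and the XMLReader is
obtained through JAXP purely for convenience.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderAdapter;

/** Hypothetical demo: a SAX1 DocumentHandler driven by a SAX2 XMLReader. */
@SuppressWarnings("deprecation") // the SAX1 types are deprecated but still shipped
final class Sax1OverSax2 {
    /** Collects element names using only SAX 1.0 callbacks. */
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLReader sax2Reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // The adapter implements the SAX1 Parser interface on top of SAX2.
            XMLReaderAdapter sax1Parser = new XMLReaderAdapter(sax2Reader);
            sax1Parser.setDocumentHandler(new HandlerBase() {
                @Override
                public void startElement(String name, AttributeList attributes) {
                    names.add(name);
                }
            });
            sax1Parser.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return names;
    }
}
```

The extra layer of event translation is where the "a little slower and
less native" cost mentioned above would come from.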

> that his list was the "definitive" list, as it is a lot larger, includes
> things like XLink, XPointer, XPath, etc. that far supersede Andy's.

There is a lot of overlap -- I just wanted to post the
requirements we were using to start our re-design discussion.

> Without looking too deeply, Andy, have you started using ints, interned
> Strings, or a RecyclableString type of construct? This is one of the

Strings all the way.

In the design that I posted, there is a SymbolTable class
that manages symbols to perform the "intern" of various
strings found in documents. The actual hashing is performed
by a SymbolHasher interface, which 1) allows the hashing
function to be modified, and 2) doesn't require you to
create a string in order to call intern().
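
As a rough illustration of that idea (not the code checked in under the
Xerces 2 branch), a symbol table with a pluggable hasher could be
sketched like this in present-day Java. The method names addSymbol and
hash, and their signatures, are assumptions made for this example; the
thread does not show the actual interfaces.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hashing strategy; the signature is assumed for this sketch. */
interface SymbolHasher {
    int hash(char[] buffer, int offset, int length);
}

/** Sketch of a symbol table: one canonical String per distinct name. */
final class SymbolTable {
    private final List<List<String>> buckets = new ArrayList<>();
    private final SymbolHasher hasher;

    SymbolTable(int bucketCount, SymbolHasher hasher) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
        this.hasher = hasher;
    }

    /**
     * Looks up characters straight from a scanner buffer. A String is
     * allocated only the first time a given name is seen.
     */
    String addSymbol(char[] buffer, int offset, int length) {
        int index = (hasher.hash(buffer, offset, length) & 0x7FFFFFFF) % buckets.size();
        List<String> bucket = buckets.get(index);
        for (String symbol : bucket) {
            if (matches(symbol, buffer, offset, length)) {
                return symbol; // already known: reuse the canonical instance
            }
        }
        String symbol = new String(buffer, offset, length);
        bucket.add(symbol);
        return symbol;
    }

    private static boolean matches(String symbol, char[] buffer, int offset, int length) {
        if (symbol.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (symbol.charAt(i) != buffer[offset + i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because every occurrence of a name maps to the same String instance, the
rest of the parser can compare names by reference, which is what the
integer handles were providing, while keeping ordinary Strings in the
API.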

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: Xerces Redesign: REQUIREMENTS

Posted by James Duncan Davidson <ja...@eng.sun.com>.
on 7/18/00 5:07 PM, Brett McLaughlin at brett.mclaughlin@lutris.com wrote:

> It seems that the requirements that Andy keeps mentioning need to be
> merged with the online version that Ed is keeping. My understanding is
> that his list was the "definitive" list, as it is a lot larger, includes
> things like XLink, XPointer, XPath, etc. that far supersede Andy's.
> Where they are disjoint, we need to discuss and rectify. At that point,
> I'd hope we can use that one list, as keeping two lists is really
> confusing ;-)

+1

We need not create more work than necessary.. I kicked off with a short list
-- Ted and Ed have been keeping score since then and I think that's a good
thing imho since there's so much paranoia about com vs com here. :)


.duncan


Re: Xerces Redesign: REQUIREMENTS

Posted by James Duncan Davidson <ja...@eng.sun.com>.
on 7/18/00 7:15 PM, Ed Staub at estaub@mediaone.net wrote:
 
> I plan to do the next revision this weekend, for posting on Monday when
> Ted gets back and can check it in and post to the website.

Righto. :)

.duncan


RE: Xerces Redesign: REQUIREMENTS

Posted by Ed Staub <es...@mediaone.net>.
Brett McLaughlin wrote:
>
> It seems that the requirements that Andy keeps mentioning need to be
> merged with the online version that Ed is keeping. My understanding is
> that his list was the "definitive" list, as it is a lot larger, includes
> things like XLink, XPointer, XPath, etc. that far supersede Andy's.

I (and Ted L.) plan to integrate new mail into the requirements list
on a regular basis.  I expect to include a change summary at each revision.

I plan to do the next revision this weekend, for posting on Monday when
Ted gets back and can check it in and post to the website.

-Ed Staub


Re: Xerces Redesign: REQUIREMENTS

Posted by Brett McLaughlin <br...@lutris.com>.

Andy Clark wrote:
> 
> Requirements for any new design should center on customer
> requirements. Luckily, we're all customers of Xerces so we
> already have a good idea regarding what is important. But I
> also don't want to neglect other people that are building
> commercial products as well as the server-oriented folks
> where performance is paramount.
> 
> First, I would list the following basic requirements:
> 
>   Standards Compliance
>     XML 1.0
>     Namespaces 1.0
>     DOM Level 1, Level 2
>     SAX 1.0, 2.0

There's also been agreement (or at least lack of disagreement) that,
given the ability to do it in a modular fashion, JDOM will be supported.

>     XML Schema
>   Performance
>   Simplicity
>   Extensibility
>   Maintainability

I'm curious as to if you intended this list to be ordered (in priority)?
There are big issues as to whether performance or simplicity is more
important.


> 
> I think we can pretty much agree on all of them with perhaps
> SAX 1.0 being an exception. I would like to see Xerces

Would it be acceptable to support SAX 1.0 purely through the
ParserAdapter class that SAX 2.0 comes with? That would allow us to
support it, albeit at a little slower and less "native" route.

> support it while others feel everyone should move up to
> using SAX 2.0. This is an issue we can discuss, though.
> 
> Also, the design should accommodate the following features
> and allow new features to be added with ease:
> 
>   Core features and properties
>   Error handling
>   Grammar access
>   Grammar caching
> 
> This set of requirements was taken into consideration when
> making the skeleton that is checked in under the Xerces 2
> branch. See my previous posting for the overall design.
> 
> Has anyone had a chance to look over it, yet? It might be
> easier to read if you detach the HTML and the CSS so that
> you can see the highlighting.

It seems that the requirements that Andy keeps mentioning need to be
merged with the online version that Ed is keeping. My understanding is
that his list was the "definitive" list, as it is a lot larger, includes
things like XLink, XPointer, XPath, etc. that far supersede Andy's.
Where they are disjoint, we need to discuss and rectify. At that point,
I'd hope we can use that one list, as keeping two lists is really
confusing ;-)

Without looking too deeply, Andy, have you started using ints, interned
Strings, or a RecyclableString type of construct? This is one of the
big, core issues, and I'm curious as to if you went with ints since
Xerces 1 did, or if you even got that far. The impression I am getting
(and, btw, agree with) is that interned Strings are the way people want
to go - it is the best compromise of performance and clarity. Also,
there was talk of using an interface that can be implemented
differently, although that seemed to be something that could cause
cross-talk (I use X impl, you use Y impl, we get confused).

Thanks for your thoughts...

-Brett

> 
> --
> Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org
> 

-- 
Brett McLaughlin, Enhydra Strategist
Lutris Technologies, Inc. 
1200 Pacific Avenue, Suite 300 
Santa Cruz, CA 95060 USA 
http://www.lutris.com
http://www.enhydra.org

Re: Xerces Redesign: REQUIREMENTS

Posted by Andy Clark <an...@apache.org>.
Requirements for any new design should center on customer 
requirements. Luckily, we're all customers of Xerces so we
already have a good idea regarding what is important. But I
also don't want to neglect other people that are building
commercial products as well as the server-oriented folks
where performance is paramount.

First, I would list the following basic requirements:

  Standards Compliance 
    XML 1.0 
    Namespaces 1.0 
    DOM Level 1, Level 2 
    SAX 1.0, 2.0 
    XML Schema 
  Performance 
  Simplicity 
  Extensibility 
  Maintainability 

I think we can pretty much agree on all of them with perhaps
SAX 1.0 being an exception. I would like to see Xerces 
support it while others feel everyone should move up to
using SAX 2.0. This is an issue we can discuss, though.

Also, the design should accommodate the following features
and allow new features to be added with ease:

  Core features and properties 
  Error handling 
  Grammar access 
  Grammar caching 

This set of requirements was taken into consideration when
making the skeleton that is checked in under the Xerces 2
branch. See my previous posting for the overall design.

Has anyone had a chance to look over it, yet? It might be
easier to read if you detach the HTML and the CSS so that
you can see the highlighting.

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: Xerces Redesign

Posted by Andy Clark <an...@apache.org>.
Ed Staub wrote:
> So "reopening" shouldn't be an issue!!!  Everything is open!!!

Good. Because I missed out on a lot of the discussion being
out of town last week.

> Tangentially: is the DTD you used of your own design?
> Stefano was working on something similar for Cocoon; see
> http://xml.apache.org/cocoon/javadoc.html.

Yeah, it is. I whipped up something really quick to get the
design spec'd out. Eventually it would be cool to write a
stylesheet to convert it to Cocoon's format, XMI, or
any other useful format.

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

RE: Xerces Redesign

Posted by Ed Staub <es...@mediaone.net>.
Andy Clark wrote:

> The one thing that I know I don't like in the current requirements is not
> being able to run on 1.1.x VMs. I can understand dropping support
> for SAX 1 because SAX 2 is a superset and users should update
> their code to SAX 2, anyway. However, requiring users to switch
> VMs is a *much* harder sell. Many customers just cannot afford
> to upgrade. So I think that this requirement needs to be re-opened.

Yikes!  I collated the requirements list to keep track of the
_current_ set of requirements expressed on the mailing list.

Many of these conflict with each other, or are so vague as to be
impossible to determine whether they have been met.

So "reopening" shouldn't be an issue!!!  Everything is open!!!
I'd hate to think that I shut down the process by taking a
snapshot of it.

---

Tangentially: is the DTD you used of your own design?
Stefano was working on something similar for Cocoon; see
http://xml.apache.org/cocoon/javadoc.html.

-Ed Staub

-----Original Message-----
From: Andy Clark [mailto:andyc@apache.org]
Sent: Tuesday, July 18, 2000 12:10 AM
To: Xerces-J
Subject: Xerces Redesign

...