Posted to j-dev@xerces.apache.org by Ted Leung <tw...@sauria.com> on 2001/08/03 07:01:30 UTC
Re: [Xerces-2]: schema parsing design; discussion starter [long]

Neil & gang,

Sorry for the delay in getting you feedback.  As usual, I'm always
busier than I expected to be.

I have a few questions and comments:

1. How will DTD validation fit into this scheme?
2. Do you have some ideas for implementing PSVI support along with this
design?
3. If we are thinking we may change internal representations for the schema
documents, does it make sense to have a very thin abstraction layer between
the DOM trees and the rest of the system?
4. Has any profiling of the current validator been done?

Ted

----- Original Message -----
From: <ne...@ca.ibm.com>
To: <xe...@xml.apache.org>
Sent: Tuesday, July 31, 2001 9:35 AM
Subject: [Xerces-2]: schema parsing design; discussion starter [long]


> Hi folks,
>
> Folks may have noticed that the number of commits from us Torontonians
> has gone down, particularly over the past two weeks.  We
> haven't gone away; we've just been thinking long and hard about
> how to integrate schema support into Xerces2.  As those of you
> who have looked at Xerces1's schema implementation will know, it's
> neither pretty nor efficient.  It's basically a bunch of hacks
> built on a jury-rigged foundation.  We're really hoping to do
> things right in Xerces2, and that's why we've been slow to start
> putting things down.
>
> In this message I hope to provide a bird's-eye--or maybe spy
> satellite-camera :-)--view of the kind of thing we're thinking of.
>
> We've got the beginning of a skeleton outline based on these
> ideas, but there are lots of details to fill in.  Hopefully this
> post will stimulate discussion, and by the middle of next week
> we're hoping we can integrate the outline into Xerces2 (which
> should have had a beta release by then).  Once we've got a
> sturdier skeleton in place, we should all be in a position to
> volunteer to fill in portions of the outline.
>
> I should note that this only covers the process of converting
> schema documents into internal grammar representations.  We'll
> post about the organization of the grammars, GrammarPool,
> validation etc.  later.
>
> To the design:
>
> Schemas can of course be composed of several schema documents;
> one schema can import, include or redefine others.  Therefore, a
> schema document-centric way of parsing schemas really doesn't
> seem to make sense; what seems to be needed at the heart of
> things is a class whose function is to co-ordinate the
> construction of one (or more) schema grammars from a set of
> schema documents.  We propose to call this a SchemaHandler.  In a
> nutshell, its job is to collect all the documents that we need to
> parse, and farm out the parsing to objects which know how to do
> it.
>
> In more detail, there should be three phases to this process.
> First, given a schema document, the SchemaHandler needs to find
> all schema documents that it <include>s, <redefine>s or
> <import>s.  Then it needs to do the same thing for all these
> schema documents recursively.  The result of this
> will be a set of DOM trees, one for each schema document.
>
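
A minimal sketch of the first phase described above, not the actual
Xerces2 code.  The class name SchemaDocumentCollector and the in-memory
"sources" map (schemaLocation -> document text) are assumptions made for
illustration; a real implementation would resolve locations against the
including document and go through an entity resolver.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: gathers one DOM tree per reachable schema document.
class SchemaDocumentCollector {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // Assumption: stands in for real entity resolution (location -> XML text).
    private final Map<String, String> sources;
    private final Map<String, Document> trees = new LinkedHashMap<>();

    SchemaDocumentCollector(Map<String, String> sources) {
        this.sources = sources;
    }

    // Parses the document at 'location' and, recursively, everything it
    // <include>s, <import>s or <redefine>s.  Returns all trees seen so far.
    Map<String, Document> collect(String location) {
        if (trees.containsKey(location)) {
            return trees; // already visited; this also breaks include cycles
        }
        Document doc = parse(location);
        trees.put(location, doc);
        Element root = doc.getDocumentElement();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE
                    || !XSD_NS.equals(n.getNamespaceURI())) {
                continue;
            }
            String kind = n.getLocalName();
            if (kind.equals("include") || kind.equals("import")
                    || kind.equals("redefine")) {
                // getAttribute returns "" when the attribute is absent.
                String ref = ((Element) n).getAttribute("schemaLocation");
                if (!ref.isEmpty()) {
                    collect(ref);
                }
            }
        }
        return trees;
    }

    private Document parse(String location) {
        String text = sources.get(location);
        if (text == null) {
            throw new IllegalArgumentException("no source for " + location);
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(text)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse " + location, e);
        }
    }
}
```

The sketch does not record how the documents relate to one another
(include vs. import vs. redefine), which the proposal says the
SchemaHandler would also keep track of.
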
> As something of an aside, I should note that we've looked in a fair
> bit of detail at Xalan's DTM (Document Table Model) in hopes that it
> might be a lighter, more memory-efficient substitute for the DOM in
> our schema implementation.  We're certainly very much open to the
> idea that Xerces should, at some point, acquire the ability to
> produce a DTM from an XML document.  But, since it seems critically
> important that Xerces2 get schema-support as quickly as possible, we
> had to conclude that adding DTM support now would be a bad idea.
> This is largely a matter of time:  Not only would we have to
> implement a document table etc.--as well as reintroducing the concept
> of a StringPool into Xerces2, something that everyone's been trying
> to avoid--but we would have had to implement direct support for
> XPath as well, since this is at the core of the DTM as the Xalan
> community has defined it.  So we thought that, once Xerces2 has
> schema support, we could look at adding a DTM facility to the parser
> generally and then think about switching the schema parsing component
> to use it.
>
> Now back to the main design:
> Because each schema document has certain properties that hold
> throughout it (namespace bindings, values for elementFormDefault,
> blockDefault etc.), we're planning to wrap these DOM trees that
> we produce in the first phase of schema parsing in an
> object called an XMLSchemaDocument.  Our SchemaHandler will
> maintain a list of available XMLSchemaDocuments and will also
> keep a record of the relationships between them.
>
> In the second phase of processing, the SchemaHandler will go
> through all the children of the roots of all these DOM trees.
> The purpose of this operation is to identify all the named global
> components we have access to.  The schema spec defines various
> symbol spaces for components, and the SchemaHandler will maintain
> a table for each symbol space.  Each entry of the table will be
> identified with a QName (the localpart of the global component
> with the targetNamespace of the schema it came from); the values
> of the table entries will be references to the DOM node
> corresponding to the declaration.  This should save us a great
> deal of time in look-ups.  I should also note that we think
> references to redefined components can be handled in this phase
> as well.
>
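
One way the second phase could look, as a rough sketch only.  The class
GlobalComponentIndex and its method names are made up for this example.
It keeps one table per symbol space, keyed by QName, with the declaring
DOM element as the value.  Simple and complex type definitions share one
table because the schema spec puts them in the same symbol space.
Components nested inside <redefine> are not handled here.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical index of named global components, one table per symbol space.
class GlobalComponentIndex {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // symbol space name -> (component QName -> declaring DOM element)
    private final Map<String, Map<QName, Element>> tables = new HashMap<>();

    // Maps a top-level element's local name to its symbol space, or null
    // if that element does not declare a named global component.
    static String symbolSpaceOf(String localName) {
        switch (localName) {
            case "simpleType":
            case "complexType":
                return "type"; // the two kinds of type share one space
            case "element":
            case "attribute":
            case "group":
            case "attributeGroup":
            case "notation":
                return localName;
            default:
                return null;
        }
    }

    // Walks the children of the schema root and records each named component.
    void index(Document schemaDoc) {
        Element root = schemaDoc.getDocumentElement();
        String tns = root.getAttribute("targetNamespace"); // "" when absent
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE
                    || !XSD_NS.equals(n.getNamespaceURI())) {
                continue;
            }
            String space = symbolSpaceOf(n.getLocalName());
            Element decl = (Element) n;
            String name = decl.getAttribute("name");
            if (space == null || name.isEmpty()) {
                continue;
            }
            QName key = new QName(tns, name);
            Element previous = tables
                    .computeIfAbsent(space, k -> new HashMap<>())
                    .putIfAbsent(key, decl);
            if (previous != null) {
                throw new IllegalStateException("duplicate " + space + " " + key);
            }
        }
    }

    // Returns the declaring element, or null if nothing is registered.
    Element lookup(String space, QName name) {
        Map<QName, Element> table = tables.get(space);
        return table == null ? null : table.get(name);
    }

    // Convenience for the example: namespace-aware parse of a string.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse schema text", e);
        }
    }
}
```
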
> Once all our global declarations are identified, we'll begin
> parsing (traversing) them, starting with the first declaration
> from the first schema document we were asked to parse.  We
> propose to define a set of Traverser classes, more or less
> corresponding to each kind of schema component.  So we propose
> to have an ElementTraverser class, a SimpleTypeTraverser, etc.
> The SchemaHandler will call each of these traversers as
> appropriate when encountering a given DOM node.  Once a node has
> been parsed, we'll use one of the DOM node's flags to indicate
> that it has been parsed, so that we can easily skip over it if we
> encounter it later.
>
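
The dispatch described above might be sketched as follows.  Traverser and
TraverserDispatch are illustrative names, not classes from the Xerces2
tree.  The standard DOM has no per-node "parsed" flag, so this sketch
assumes an identity-based set as a stand-in for the node flag mentioned
in the proposal.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;
import org.w3c.dom.Element;

// Hypothetical contract: one implementation per kind of schema component.
interface Traverser {
    void traverse(Element decl);
}

// Hypothetical dispatcher: picks a traverser by the declaration's local
// name and makes sure each DOM node is traversed at most once.
class TraverserDispatch {
    private final Map<String, Traverser> byKind = new HashMap<>();

    // Stand-in for the "parsed" flag on the DOM node (compares by identity).
    private final Set<Element> parsed =
            Collections.newSetFromMap(new IdentityHashMap<>());

    void register(String localName, Traverser traverser) {
        byKind.put(localName, traverser);
    }

    // Returns true if the node was traversed now, false if it was skipped
    // because it had already been parsed.
    boolean traverseOnce(Element decl) {
        if (parsed.contains(decl)) {
            return false;
        }
        Traverser traverser = byKind.get(decl.getLocalName());
        if (traverser == null) {
            throw new IllegalArgumentException(
                    "no traverser for " + decl.getLocalName());
        }
        traverser.traverse(decl);
        parsed.add(decl); // mark only after a successful traversal
        return true;
    }
}
```

Usage would be along the lines of registering an ElementTraverser under
"element" and a SimpleTypeTraverser under "simpleType", then calling
traverseOnce for each global declaration in turn.
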
> When a traverser encounters a reference to a component, it will ask the
> SchemaHandler to get the required information.
> If the component has been parsed, the SchemaHandler will look up
> the information in the grammar.  (Here I should point out that we
> intend SchemaGrammar objects to have a one-one correspondence
> with targetNamespaces.  That is, if a schema is encountered that
> <import>s another schema, we'll end up producing two different
> grammars.)  If the information is not in the grammar, the
> SchemaHandler will locate the DOM node containing the relevant
> declaration, determine if components of the schema currently
> being parsed are allowed to access this component, and call the
> relevant traverser for that component to provide the information.
>
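
A sketch of the lookup-then-traverse behaviour just described, under
several assumptions.  LazyResolver is a made-up name; D stands for
whatever holds a declaration (a DOM node in the proposal) and C for the
parsed component.  The grammar is reduced to a plain table, only a single
symbol space is modelled, and the check on whether the referring schema
may access the component is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import javax.xml.namespace.QName;

// Hypothetical resolver: returns a component from the grammar if it has
// already been parsed, otherwise traverses its declaration on demand.
class LazyResolver<D, C> {
    // One grammar (here just a component table) per target namespace.
    private final Map<String, Map<QName, C>> grammars = new HashMap<>();

    // Declarations found in the second phase, keyed by component QName.
    private final Map<QName, D> declarations;

    // Stand-in for "the relevant traverser for that component".
    private final Function<D, C> traverser;

    LazyResolver(Map<QName, D> declarations, Function<D, C> traverser) {
        this.declarations = declarations;
        this.traverser = traverser;
    }

    C resolve(QName ref) {
        Map<QName, C> grammar = grammars.computeIfAbsent(
                ref.getNamespaceURI(), ns -> new HashMap<>());
        C known = grammar.get(ref);
        if (known != null) {
            return known; // already parsed: answer from the grammar
        }
        D decl = declarations.get(ref);
        if (decl == null) {
            throw new IllegalArgumentException("unresolved reference " + ref);
        }
        C built = traverser.apply(decl); // parse the declaration on demand
        grammar.put(ref, built);
        return built;
    }

    // Number of grammars created so far (one per namespace referenced).
    int grammarCount() {
        return grammars.size();
    }
}
```

Circular references between components would recurse without end in this
form; a real implementation would need to register a placeholder before
traversing.
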
> This approach should allow us to localize knowledge about how to
> parse a given kind of schema component in a specific object.  We
> envision needing a series of helper classes to handle common
> things like DOM traversal operations, and perhaps to hold
> information that multiple traversers will need to access.  We're also
> trying to structure these classes so that object creation is
> minimized; e.g., we expect that one instance of a SchemaHandler object
> should be able to be used by a particular instance of a parser,
> however many schema documents it needs to parse.
>
> But, since the interaction between schema components can be very
> complex, we certainly have many details yet to work out.
> Nonetheless, I'm hoping that this outline won't confuse anyone,
> and will get some discussion going on how to do things right, now
> that we have all this experience in implementing this large and
> complex specification.  At all events, we want to avoid saddling
> Xerces2 with a schema implementation as inefficient and
> unmaintainable as that which Xerces1 ended up with.
>
> Cheers,
> Neil
>
>
> Neil Graham
> XML Parser Development
> IBM Toronto Lab
> Phone:  416-448-3519, T/L 778-3519
> E-mail:  neilg@ca.ibm.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
>

