Posted to j-dev@xerces.apache.org by ne...@ca.ibm.com on 2001/03/09 23:28:32 UTC

Re: [Xerces2] A Tale of Two Validation Engines (LONG AGAIN!)


Hi Andy,

Even while I've been submerged beneath <redefine> (not to mention all the
stuff on general@xml.apache.org...) for the last week, I've really been
enjoying this discussion.  I wish I had more profound things to add--and
maybe I will when I review this stuff again on Monday--but for what it's
worth here are two things that occur to me:

1.  You state down below that, as is done in Xerces 1, schema element and
group sequences can be handled by expanding them; that is

     <element name="a" minOccurs="2" maxOccurs="3"/>

becomes a sequence of two "a"'s and an optional "a", as a similar
construction would have to be specified in a DTD.

The trouble with this model is that a fully compliant schema processor is
supposed to be able to handle minOccurs with values of at least 2^30!  Not
realistic in the real world perhaps, but if our ambition is to be 100%
compliant we'll need a better way of handling this case (or we'll all have
to start using supercomputers.)  To my mind, this can probably be handled
by adding a counter in the DFA, then decrementing it whenever the
corresponding state is entered.  Probably not hard to implement, but it
might break the superset idea that you propound here.
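Very roughly, here's what I have in mind (all names here are hypothetical
sketches of mine, not actual Xerces code): instead of expanding the particle
into repeated DFA states, the looping state carries its bounds and a count,
so the space cost is constant no matter how large minOccurs gets.

```java
// Hypothetical sketch of a counted DFA state; illustrative only.
public class CountedState {
    private final int minOccurs;
    private final int maxOccurs; // -1 means "unbounded"
    private int count;

    public CountedState(int minOccurs, int maxOccurs) {
        this.minOccurs = minOccurs;
        this.maxOccurs = maxOccurs;
    }

    /** Called on each transition into this state; false = too many. */
    public boolean enter() {
        count++;
        return maxOccurs == -1 || count <= maxOccurs;
    }

    /** Called when leaving the loop; false = too few. */
    public boolean canExit() {
        return count >= minOccurs;
    }

    public static void main(String[] args) {
        // minOccurs="2" maxOccurs="3" from the example above
        CountedState a = new CountedState(2, 3);
        a.enter();                        // first <a>
        System.out.println(a.canExit());  // too few occurrences yet
        a.enter();                        // second <a>
        System.out.println(a.canExit());  // minOccurs now satisfied
    }
}
```

A counter like this costs the same for minOccurs="2" as for
minOccurs="1073741824", which is the whole point.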

2.  Returning to the inbound vs. outbound validation question:  Since only
the parent element of an <any> child need worry about wildcards, I wonder
whether it would be possible to avoid all those method callbacks you're
worried about for most elements?  That is, to slightly redesign the current
approach such that, for most element types, we do what we're doing now, but
when we encounter an element which may have a wildcard child, then we do a
lot more bookkeeping and method-calling?  This would have the advantage of
cutting down on the recoding, and would give the user the option of taking
the performance hit or not (by keeping <any>'s out of their schema), but of
course wouldn't help with the error-order problem you've already cited.

Sorry these thoughts are so vague, but they're the best I can muster on a
Friday afternoon.

Cheers,
Neil

Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  416-448-3519, T/L 778-3519
E-mail:  neilg@ca.ibm.com



Andy Clark <an...@apache.org> on 03/09/2001 09:54:07 PM

Please respond to xerces-j-dev@xml.apache.org

To:   xerces-j-dev@xml.apache.org
cc:
Subject:  Re: [Xerces2] A Tale of Two Validation Engines (LONG AGAIN!)


In our last episode, I discussed the current validation
engine in some detail. This time I would like to lay
out what I think the new validation infrastructure could look
like. First, we still have the same parser pipeline configuration
as before. Second, we still have a concept of a grammar pool
(but this time it will actually be usable as a pool!).
But now let's look at the grammars and the process of validating
content models in the new framework.

Grammar Objects

Currently, we have a basic grammar object which represents a
"compiled" form of a grammar. It contains the information about
element content models, attributes, etc. What it does *not*
contain is any data model structure that maps back to the
original grammar format, namely DTD or Schema. Compiling a
grammar is a lossy process; you can't go back to the original
form.

This grammar object is then subclassed for each type of grammar
format that we support. Currently, this is DTD and Schema. So
we have a DTDGrammar class and a SchemaGrammar class. Each one
of these classes knows how to compile its particular document
grammar representation into the form needed by the validator
component in order to do its work.

The DTDGrammar class implements the XNI interfaces for DTDs
and compiles the grammar based on the callbacks received. The
SchemaGrammar traverses a DOM tree that is constructed from
parsing the Schema grammar in order to do the same thing. The
use of the DOM for Schema is a dependency some people want to
live without. But if you don't use the DOM, you end up
creating a DOM-like structure of the Schema grammar document.

Problems

Even though this is a nice separation, there are problems
with this setup. One problem is that the grammar object
doesn't have enough information to handle all of the needs
of the validator. So the validator has to know what kind of
grammar it's dealing with, cast to the appropriate grammar
class, and call proprietary methods on that grammar in
order to get its job done. If you need an example of where
this comes up, think about substitution groups and xsi:type
in the Schema language.

This problem can be solved, perhaps, by determining the set
of questions that the validator needs to ask and then making
sure that the grammar object stores this information. I
think this solution is fine, but it will make the grammar
object more complicated and lose some of the abstraction
that we previously had. But this is the nature of the beast
and we have to be able to tame it.

Datatype Validator

For validation of attribute values and elements of simple
type, we define a datatype validator interface. Multiple
implementations of this interface will provide all of the
various datatypes (e.g. NMTOKEN, decimal, etc.). There is
some complication with ENTITY/ENTITIES, ID/IDREF/IDREFS,
and NOTATION, but I think we have a handle on how to support
those in a generalized way, as alluded to in the API below.
[If people are interested, I can spell that out as well.
Eventually it *should* be written down anyway from a
documentation standpoint.]

The datatype validator objects are created by a datatype
factory at the request of the grammar objects that are
compiling their grammars. The factory isn't absolutely
required but it does allow people to swap in customized
datatype validators which I can see a need for. And the
factory will also help with the separation of DTD and
Schema datatype validators so we can produce separate
parser configurations based on need.

Here's what it looks like:

                  +----------+  +---------+
                  | Grammar  |--| Grammar | *
                  | Compiler |  +---------+
                  +----------+
                       |
                       v
  +-------------------------------------------+
  |         Datatype Validator Factory        |
  +-------------------------------------------+
        |             |                 |
  +-----------+ +-----------+     +-----------+
  |  NMTOKEN  | |  Decimal  | ... |  Custom   |
  | Validator | | Validator |     | Validator |
  +-----------+ +-----------+     +-----------+

  * The grammar itself could be the grammar compiler as well.
    I just separated it on the diagram for clarity.
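A minimal sketch of this factory arrangement (all class names and the
simplified NMTOKEN check below are my own illustrations, not the actual
Xerces2 API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative factory; grammar compilers request validators by type
// name, and users can register customized implementations.
public class DatatypeValidatorFactory {
    private final Map<String, DatatypeValidator> validators = new HashMap<>();

    public DatatypeValidator getValidator(String typeName) {
        return validators.get(typeName);
    }

    /** Hook for swapping in customized datatype validators. */
    public void register(String typeName, DatatypeValidator v) {
        validators.put(typeName, v);
    }

    public static void main(String[] args) throws Exception {
        DatatypeValidatorFactory factory = new DatatypeValidatorFactory();
        factory.register("NMTOKEN", new NmtokenValidator());
        factory.getValidator("NMTOKEN").validate("token1", null);
        System.out.println("ok");
    }
}

interface DatatypeValidator {
    void validate(String data, Object state) throws Exception;
}

// Deliberately simplified NMTOKEN check, for illustration only.
class NmtokenValidator implements DatatypeValidator {
    public void validate(String data, Object state) throws Exception {
        if (data.length() == 0 || data.indexOf(' ') >= 0)
            throw new Exception("not an NMTOKEN: \"" + data + "\"");
    }
}
```

Separate DTD and Schema factories could then register only the datatypes
their parser configuration needs.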

So the factory creates datatype validator objects for use
by the validator. They are stored in the grammar object for
attributes and elements of simple type. Here's what the
datatype validator API looks like:

  interface DatatypeValidator

    void validate(String data, Object state)

There are other methods but this is the primary one that does
the actual validation. The state parameter is not needed for
most datatypes but is needed for the ENTITY/ENTITIES, ID/
IDREF/IDREFS, and NOTATION types because there is external
state that must be maintained. It's best for all concerned if
the validator component "owns" these state objects because
that allows the datatype validator objects to be stateless
and shared among all of the grammars.
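Here's a sketch of what "stateless validator, external state" might look
like for ID (the class name and the use of a plain Set as the state object
are my assumptions for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// One shared, stateless instance; the validator component owns the
// per-document ID table and passes it in as the state parameter.
public class IDDatatypeValidator {
    public void validate(String data, Object state) throws Exception {
        @SuppressWarnings("unchecked")
        Set<String> declaredIds = (Set<String>) state;
        if (!declaredIds.add(data))
            throw new Exception("duplicate ID: " + data);
    }

    public static void main(String[] args) throws Exception {
        IDDatatypeValidator shared = new IDDatatypeValidator();
        Set<String> docState = new HashSet<>(); // owned by the validator component
        shared.validate("id1", docState);
        shared.validate("id2", docState);
        System.out.println("ok: " + docState.size() + " IDs recorded");
    }
}
```

Because the instance itself holds no document state, one IDDatatypeValidator
can be shared among all of the grammars in the pool.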

For layering purposes, please note that the set of Schema
datatypes is a superset of the DTD datatypes.

Content Model Validator

Besides validating datatypes for attributes and elements of
simple type, the major function of a validating parser is
to validate the structure of the document. The document's
grammar declares content models that describe what elements
can appear in the document and in what way. A content model
validator ensures that the document instance follows these
rules.

In Xerces 1.x, the ANY and EMPTY content models are handled
by the validator itself and specific content model validators
are used for the other types of models. Currently, these are
Mixed, Simple, and DFA. Mixed is for content models such as
(#PCDATA) and (#PCDATA|foo)*. Simple is for extremely simple
children content models like (a), (a)*, (a,b), etc. More
complex children content models are handled by a full-blown
DFA.

The current interface for the content model validator is the
following (some methods have been left out):

  interface ContentModelValidator

    int validateContent(QName[] children,
                        int offset, int length)

In Xerces2, however, we could make a complete set of content
model validators so that all content models are handled in a
straightforward way. And we would also be able to create
parser configurations that don't require us to bundle
everything together. For example, DTD content models could be
broken down into the following class hierarchy:

  interface ContentModelValidator
    class EmptyContentModel
    class AnyContentModel
    class MixedContentModel
    class ChildrenContentModel

And the Schema content model validators would consist
of the DTD content model validators plus the following:

  interface ContentModelValidator
    class SimpleContentModel
    class MixedContentModel2
    class ChildrenContentModel2

Schema has a SimpleContentModel in order to handle element
content of simple type. The mixed and children content models
are repeated because 1) Schema allows mixed content models to
be ordered; and 2) both mixed and children need to be able to
handle wildcards. [The repetition of elements and groups in
the children content model in Schema could be handled by just
expanding the model into an equivalent DTD model where the
occurrence counts are only '?', '*', and '+'.]
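A toy version of that expansion (a hypothetical helper of mine, not Xerces
code), which also makes the cost visible: the expanded model grows linearly
with maxOccurs, so huge bounds are a problem for this approach.

```java
public class OccurrenceExpander {
    /** Rewrites a counted particle into a model using only ?, *, and +.
        A maxOccurs of -1 means "unbounded". */
    static String expand(String name, int minOccurs, int maxOccurs) {
        StringBuilder model = new StringBuilder("(");
        for (int i = 0; i < minOccurs; i++) {   // required occurrences
            if (i > 0) model.append(",");
            model.append(name);
        }
        if (maxOccurs == -1) {                  // unbounded tail
            if (minOccurs > 0) model.append(",");
            model.append(name).append("*");
        } else {                                // optional occurrences
            for (int i = minOccurs; i < maxOccurs; i++) {
                if (model.length() > 1) model.append(",");
                model.append(name).append("?");
            }
        }
        return model.append(")").toString();
    }

    public static void main(String[] args) {
        // the example from earlier in the thread
        System.out.println(expand("a", 2, 3)); // (a,a,a?)
    }
}
```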

Depending on how much code is actually involved here, though,
we don't really need to layer the content model validators
for DTDs and Schemas. We could just take the superset of all
content model validation features and roll them into the base
content model validator implementations.

But the interface would change from validating
on-the-way-out to validating on-the-way-in. To do this, the
content model validator needs to be informed of child elements
and text content appearing in the parent element scope. As
I said in the previous massive posting, this strikes me as
being extremely similar to a document handler. Therefore,
the interface would end up looking very similar to the
document handler interface. For example:

  interface ContentModelValidator

    void startElementScope()
     void startElement(QName element, XMLAttributes attrs)
     void endElement(QName element)
     void characters(XMLString text)
    void endElementScope()

The names were chosen in order to overlap the methods contained
in the document handler interface. But I'm thinking that this
isn't really necessary and, actually, is troublesome. For
example, since the startElement call is just used for state
transitions, then we don't really need to pass the attributes.

Next, there has to be a way to communicate errors from invalid
transitions. These can occur from the startElement, characters,
and endElementScope callbacks. Should errors be communicated
via a return code or by throwing a validation exception?

Throwing validation exceptions seems satisfactory, and it
also allows us to keep the method names intact. The fewer
the differences in API method names, the easier the system
will be to understand and maintain. However, we still have a
problem because we need some mechanism to specify how the
validator should process a certain child element's contents.

Remember that <any processContents='{strict|lax|skip}'/>
thing? Well, here's where it's important. The content model
validator transitions its state based on the element passed
to the startElement call and needs to return how the validator
should process that element's contents.
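Putting those two pieces together, the interface might look something like
this (everything here is an illustrative sketch of mine, with a toy
implementation for the children model (a, b); none of it is the real
Xerces2 interface):

```java
public class StreamingDemo {
    public static void main(String[] args) throws InvalidContentException {
        SequenceABValidator cm = new SequenceABValidator();
        cm.startElementScope();
        ProcessContents mode = cm.startElement("a"); // validated on the way in
        cm.endElement("a");
        cm.startElement("b");
        cm.endElement("b");
        cm.endElementScope();
        System.out.println("valid; process <a> contents: " + mode);
    }
}

class InvalidContentException extends Exception {
    InvalidContentException(String msg) { super(msg); }
}

/** How the validator should process a child element's contents. */
enum ProcessContents { STRICT, LAX, SKIP }

interface StreamingContentModelValidator {
    void startElementScope();
    ProcessContents startElement(String element) throws InvalidContentException;
    void endElement(String element);
    void characters(String text) throws InvalidContentException;
    void endElementScope() throws InvalidContentException;
}

/** Toy on-the-way-in validator for the children model (a, b). */
class SequenceABValidator implements StreamingContentModelValidator {
    private int state; // 0 = expect <a>, 1 = expect <b>, 2 = done

    public void startElementScope() { state = 0; }

    public ProcessContents startElement(String element)
            throws InvalidContentException {
        if (state == 0 && element.equals("a")) state = 1;
        else if (state == 1 && element.equals("b")) state = 2;
        else throw new InvalidContentException("unexpected <" + element + ">");
        return ProcessContents.STRICT; // no wildcard children in this toy model
    }

    public void endElement(String element) { }

    public void characters(String text) throws InvalidContentException {
        if (text.trim().length() > 0)
            throw new InvalidContentException("text not allowed in (a, b)");
    }

    public void endElementScope() throws InvalidContentException {
        if (state != 2)
            throw new InvalidContentException("content incomplete");
    }
}
```

Errors surface as exceptions at the exact event that caused them, and the
startElement return value carries the processContents answer back to the
validator.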

Open Questions

There are a variety of questions still left to be resolved.
For example:

  1) Compiling grammars is lossy. Should it be? or should
     we design a grammar that is richer in content so
     that we can ask questions like "what can go here?"
     and "what is the base type of X?".
  2) How does this map to the final product of the ongoing
     DOM Level 3 Content Model work?
  3) What about substitution groups and xsi:type? The content
     model cannot know all of the various types that can
     validate as if they were the type it was expecting. How
     is this to be handled?

Conclusion

Validation with DTDs is straightforward and relatively easy.
However, supporting the validation features found in Schema
greatly complicates the process. So we have a big challenge
to design and implement a validation engine that can validate
documents using these (and possibly other) grammars while
doing so as efficiently as possible.

Sorry about bombarding you with another long post but I
wanted to describe in more detail the problems we face as
well as document the current implementation and my ideas
for the redesign. What are people's thoughts?

--
Andy Clark * IBM, TRL - Japan * andyc@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org


Re: [Xerces2] A Tale of Two Validation Engines (LONG AGAIN!)

Posted by Andy Clark <an...@apache.org>.
neilg@ca.ibm.com wrote:
> 1.  You state down below that, as is done in Xerces 1, schema element and
> group sequences can be handled by expanding them; that is
> 
>      <element name="a" minOccurs="2" maxOccurs="3"/>
> 
> becomes a sequence of two "a"'s and an optional "a", as a similar
> construction would have to be specified in a DTD.
> 
> The trouble with this model is that a fully compliant schema processor is
> supposed to be able to handle minOccurs with values of at least 2^30!  Not

Yes, this *should* be fixed so that it can handle an arbitrary
size. Just like people would *like* the parser to be able to
handle arbitrary sizes of the <all> model. Perhaps we should
just come out with a Xerces-T for Turing Machines... ;)

> 2.  Returning to the inbound vs. outbound validation question:  Since only
> the parent element of an <any> child need worry about wildcards, I wonder
> whether it would be possible to avoid all those method callbacks you're
> worried about for most elements?  That is, to slightly redesign the current
> approach such that, for most element types, we do what we're doing now, but
> when we encounter an element which may have a wildcard child, then we do a
> lot more bookkeeping and method-calling?  This would have the advantage of

I completely agree with you -- we *can* do this. It's just a
matter of complicating the code. There is always the question 
of maintenance cost. It would be great if we could have a 
single algorithm that was both correct and fast.

More to the point, each element could contain a bit flag that
tells whether the model contains any occurrences of <any>. On
the startElement call, we would check the bit flag (a cheap
operation) to determine how we handle the validation of that
element scope: on-the-way-in or on-the-way-out. Is this along
the lines of what you were thinking?
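Concretely, the check might look something like this (a hypothetical
sketch; the class and flag names are mine, not actual Xerces code):

```java
public class WildcardFlagDemo {
    public static void main(String[] args) {
        ElementDecl plain = new ElementDecl();
        ElementDecl withAny = new ElementDecl();
        withAny.flags |= ElementDecl.CONTAINS_WILDCARD;

        // Cheap per-element dispatch at startElement time:
        System.out.println(mode(plain));
        System.out.println(mode(withAny));
    }

    static String mode(ElementDecl decl) {
        return decl.containsWildcard() ? "on-the-way-in" : "on-the-way-out";
    }
}

/** Illustrative declaration object; not an actual Xerces class. */
class ElementDecl {
    static final int CONTAINS_WILDCARD = 0x01; // content model contains <any>
    int flags;

    boolean containsWildcard() {
        return (flags & CONTAINS_WILDCARD) != 0;
    }
}
```

The flag would be computed once at grammar compile time, so the per-element
cost at validation time is a single bit test.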

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org
