Posted to j-dev@xerces.apache.org by Ted Leung <tw...@sauria.com> on 2001/03/06 01:24:41 UTC

Re: [Xerces2] A Tale of Two Validation Engines (LONG!)

That was a whopper...

I want to help on this.  I'll have a little more time to respond later.

I'm still not clear that we have to go from on-the-way-out to
on-the-way-in.  Isn't there a clever data structure way out of
this?  I'm pretty concerned about the perf implications of this.  If
they turn out to be true, then someone needs to raise this as an
objection before the CR goes to Recommendation.

Ted

----- Original Message ----- 
From: "Andy Clark" <an...@apache.org>
To: <xe...@xml.apache.org>
Sent: Monday, March 05, 2001 5:37 PM
Subject: [Xerces2] A Tale of Two Validation Engines (LONG!)


> "It was the best of validation engines, it was the worst of
> validation engines..."
> 
> This post is in regards to the current validation engine and 
> a proposed re-design of said engine. I think that a redesign 
> is needed for the following reasons (there may be more):
> 
>   * Cannot support all Schema content model validation;
>     (Currently this is the most important. Otherwise we'll
>      never be able to claim 100% Schema compliance. I'll
>      explain why in the section titled "The Bad News".)
>   * Cannot factor validation support in order to build
>     layered parser configurations based on need;
>   * Cannot support other grammar languages.
> 
> But *not* for the following reason (there may be more):
> 
>   * Performance.
> 
> Current
> 
> The current validation engine is an "on-the-way-out"
> validator. In other words, it gathers up a list of children
> of each element and validates the lot of them when it sees
> the end of the enclosing element. Got it? No? Okay, well
> consider the following example:
> 
>      <!-- File: "a.dtd" -->
>      <!ELEMENT a (b)>  <!-- error is intentional -->
>      <!ELEMENT b (c)>
>      <!ELEMENT c (#PCDATA)>
>      <!ELEMENT d EMPTY>
>      <!ATTLIST d e ID #REQUIRED>
> 
>      <!-- File: "a.xml" -->
>  [0] <!DOCTYPE a SYSTEM "a.dtd">
>  [1] <a>
>  [2]  <b>
>  [3]   <c>Foo</c>
>  [4]  </b>
>  [5]  <d e='123'/>
>  [6] </a>
> 
> For each element, we keep a list of its children. For
> example, the children of <a> are { <b>, <d> }; the children 
> of <b> are { <c> }; <c> has the children { "Foo" }; and <d>
> has the empty child list { }. When we see the end of an 
> element, we pass the list of its children to the content 
> model validator for that element.
> 
> The Good News
> 
> This makes for a very efficient implementation because
> we only have to call a single method to validate each
> element and its content. The fewer the method calls, the
> faster the parser validates documents. 
> 
> Also, we're able to eliminate excessive object creation 
> because we can just keep a stack of all of the elements
> that we've seen up to the current element depth. In XPath
> terms, think of the union of the ancestor:: and 
> preceding:: axes. I will explain the nitty-gritty of
> this for those interested; for the others, skip to "The
> Bad News".
> 
> I realize that this message is long but it's a good 
> idea to go into detail how it currently works so that 
> it can be used as implementation documentation later. 
> If someone were nice enough to do that... ;)
> 
> There are two data structures: one stack keeps all of the
> elements and text content seen so far; and another stack
> provides indexes into the first to keep track of each 
> element's children. For example, imagine that we are at 
> the end element markup for <c> on line [3].
> 
>    Content            Element
>     Stack              Stack
>   +-------+   +----------------------+
> 0 | <a> * |   | <a> start: 1         | 0
>   +-------+   |     end:   2         |
> 1 | <b>   |   |     model: (b)       |
>   +-------+   +----------------------+
> 2 | <c>   |   | <b> start: 2         | 1
>   +-------+   |     end:   3         |
> 3 | "Foo" |   |     model: (c)       |
>   +-------+   +----------------------+
> m |  ...  |   | <c> start: 3         | 2
>               |     end:   4         |
>               |     model: (#PCDATA) |
>               +----------------------+
>               |         ...          | n
> 
>  * <a> doesn't need to be included because XML documents
>    are trees, not hedges, but it simplifies the algorithm
>    -- no special case to handle the root element any 
>    differently than other elements.
> 
> To validate the <c> element's content, we examine the offsets
> at the top of the stack and call the validator. Here's some
> made up code to illustrate the point:
> 
>   int offset = elements[top].start;
>   int length = elements[top].end - elements[top].start;
> 
>   elements[top].model.validate(content, offset, length);
> 
> Once validated, we just "pop" the top off of the element
> stack and continue. We don't actually have to get rid of
> the object in the stack, just decrement the top. This is
> simple, saves memory, and lets us re-use objects instead
> of paying their construction cost again. But it's not all
> apple pie and ice cream...
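The two-stack bookkeeping above can be sketched in Java. This is a hedged illustration, not the Xerces 1.x internals: the class name, the toy comma-separated "model" format, and the validate logic are all mine; only the offset arithmetic mirrors the description.

```java
// Sketch of the on-the-way-out scheme: one content stack of everything seen,
// plus per-element start offsets, validated in a single call at each end tag.
import java.util.ArrayList;
import java.util.List;

public class OutwardValidator {
    // content stack: every node seen so far, in document order
    private final List<String> content = new ArrayList<>();
    // element stack: start offset into content + content model per open element
    private final int[] starts = new int[64];
    private final String[] models = new String[64];
    private int top = -1;

    public void startElement(String name, String model) {
        content.add(name);
        top++;
        starts[top] = content.size(); // children begin after the element itself
        models[top] = model;
    }

    /** Validate the top element's children in one shot (on the way OUT). */
    public boolean endElement() {
        int offset = starts[top];
        List<String> children = content.subList(offset, content.size());
        boolean ok = validate(models[top], children);
        children.clear(); // "pop" the children; slots above top get re-used
        top--;
        return ok;
    }

    // Toy content model: a comma-separated list of required children, in order.
    private static boolean validate(String model, List<String> children) {
        List<String> expected =
            model.isEmpty() ? List.of() : List.of(model.split(","));
        return expected.equals(children);
    }

    public static void main(String[] args) {
        // The a.xml example: <a> with model (b) illegally contains <b> and <d>.
        OutwardValidator v = new OutwardValidator();
        v.startElement("a", "b");
        v.startElement("b", "c");
        v.startElement("c", "");
        if (!v.endElement()) throw new AssertionError("c should be valid");
        if (!v.endElement()) throw new AssertionError("b should be valid");
        v.startElement("d", "");
        if (!v.endElement()) throw new AssertionError("d should be valid");
        if (v.endElement()) throw new AssertionError("a must be invalid");
        System.out.println("ok");
    }
}
```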
> 
> The Bad News
> 
> There are a few bad things about this validation scheme.
> First, because attributes for a start element are
> validated *before* the presence of that element is
> validated for its parent element (are you following?),
> it's possible to have errors reported out of order. 
> 
> Again, getting into the details, imagine that we are on
> line [5] in the example above. The validator will check
> the content of the attribute for the <d> element and
> report an error that "123" is not a valid ID.
> 
> Now imagine that we are at the close element for <a> on 
> line [6]. The stack information looks like this:
> 
>   Content            Element
>    Stack              Stack
>   +-----+   +----------------+
> 0 | <a> |   | <a> start: 1   | 0
>   +-----+   |     end:   3   |
> 1 | <b> |   |     model: (b) |
>   +-----+   +----------------+
> 2 | <d> |   |      ...       | n
>   +-----+
> m | ... |
> 
> Upon validating *this* content, an error will be reported
> that element <d> is not allowed. So looking at the errors
> for validating this content we see the following:
> 
>   [Error] a.xml:6:14: Attribute value "123" of type ID must be a name.
>   [Error] a.xml:7:6: The content of element type "a" must match "(b)".
> 
> Now wouldn't it be nicer and more intuitive to report the
> errors in order? In other words, report that element <d>
> causes an invalid transition in the content model for <a>
> *before* we report that the attribute @e is invalid?
> There are good reasons for doing it both ways.
> 
> But that's not the only problem with this validation
> scheme. By validating on-the-way-out, we can't support
> some XML Schema features. For example, we can only support
> <any processContents='strict'/>. Any other value for 
> @processContents (such as 'lax' or 'skip') simply cannot 
> be supported in the current codebase. If we want to be
> 100% Schema compliant, then this has to be fixed. But
> how? I'm glad you asked...
> 
> Future
> 
> To fix the shortcomings in the current validation engine,
> we have to redesign it so that we validate "on-the-way-in".
> The on-the-way-in method would not store the children of
> each element to validate *afterwards* but would transition
> between states in the content model *as we validate*.
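To make the contrast concrete, here is a minimal sketch of transitioning as children appear. The fixed-sequence model and all names are illustrative only; real content models are DFAs over regular expressions:

```java
// Sketch of on-the-way-in validation: each open element owns a small,
// state-ful model and every child causes a transition at the moment it
// appears, so errors surface immediately and in document order.
public class InwardModel {
    private final String[] sequence; // required children, in order
    private int state = 0;           // index of the next expected child
    private boolean valid = true;

    public InwardModel(String... sequence) {
        this.sequence = sequence;
    }

    /** One call per child, at the moment it is seen (not at the end tag). */
    public boolean childElement(String name) {
        if (!valid || state >= sequence.length
                || !sequence[state].equals(name)) {
            valid = false; // invalid transition, reportable right now
        } else {
            state++;
        }
        return valid;
    }

    /** One final call at the end tag: did we reach a valid end state? */
    public boolean endOfContent() {
        return valid && state == sequence.length;
    }

    public static void main(String[] args) {
        InwardModel a = new InwardModel("b"); // content model (b)
        if (!a.childElement("b")) throw new AssertionError();
        if (a.childElement("d")) throw new AssertionError(); // <d> rejected NOW
        if (a.endOfContent()) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Note how this forces the per-child method calls and per-element state discussed next.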
> 
> This approach is no harder to implement but has serious
> performance considerations:
> 
>   1) Instead of calling a single method at the end of
>      an element's content, we now have to call a *lot*
>      more methods: initialize model, one call per
>      element and text content for the element, and one
>      final call to ensure that we are at a valid end 
>      state; and
>   2) Create a separate content model validator object
>      for each element. In the current validation scheme,
>      content validation is state-less so the same object
>      can be used multiple times in complete safety. But
>      the new scheme is now state-ful which means that
>      content model validator objects can *not* be shared.
> 
> But at least the construction of each content model
> validator object can count as the initialization method
> call required by the first performance consideration.
> Not much savings, though, considering the number of methods
> that you're gonna end up calling to validate a document.
> 
> Following this train of thought, what would this new
> system look like? Well, I have some ideas that I'll
> present immediately.
> 
> The Design
> 
> [NOTE: This design is proposed for Xerces2, not the old
> codebase in Xerces 1.x.]
> 
> Moving from on-the-way-out to on-the-way-in validation
> requires some API changes. Instead of thinking of content
> validation as a one-shot method call -- it's valid or it's
> not -- we have to start thinking of state transitions. 
> These transitions are caused by the appearance of the 
> child elements and text of each parent element. 
> 
> This strikes me as extremely similar to a document handler. 
> So the design would be that the content model validators 
> would be little document handlers; the validator would then 
> propagate the element and character calls to the currently 
> active content model in order to transition between states.
> The API could look something like this:
> 
>   interface ContentModelValidator
> 
>     startElementScope()
>      startElement(QName,XMLAttributes?)
>       characters(XMLString)
>      endElement(QName)
>     endElementScope()
> 
> As I said earlier, the first method isn't necessary if the
> initialization code is done at construction time. However,
> if we *do* include this method, then the content model 
> validators can be cached and re-used, deferring object
> destruction until we're done parsing the document.
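One possible Java rendering of the proposed interface, with a trivial EMPTY model as an example implementation. The method names follow the proposal; the boolean returns are my placeholder for the unresolved error-reporting question (see the open issues), not a settled choice:

```java
// Hypothetical shape for the proposed per-element validator API.
public class ContentModels {
    public interface ContentModelValidator {
        /** (Re)initialize state so a cached instance can be re-used. */
        void startElementScope();
        boolean startElement(String qname);
        boolean characters(String text);
        void endElement(String qname);
        /** Final check: did the content end in a valid state? */
        boolean endElementScope();
    }

    /** EMPTY content model: any child element or text is an error. */
    public static class EmptyContentModel implements ContentModelValidator {
        private boolean valid;
        public void startElementScope() { valid = true; }
        public boolean startElement(String qname) { return valid = false; }
        public boolean characters(String text) { return valid = false; }
        public void endElement(String qname) { }
        public boolean endElementScope() { return valid; }
    }

    public static void main(String[] args) {
        EmptyContentModel empty = new EmptyContentModel();
        empty.startElementScope();
        if (!empty.endElementScope()) throw new AssertionError("<d/> is valid");
        empty.startElementScope(); // cached instance, re-initialized
        empty.startElement("c");
        if (empty.endElementScope()) throw new AssertionError("child not allowed");
        System.out.println("ok");
    }
}
```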
> 
> There are some open issues, however:
> 
>   1) How are the validation errors communicated out of
>      the content model validator object? Return code?
>      Exception type? I'm currently leaning towards
>      exceptions so that the method prototypes can be
>      shared with the document handler interface. BUT...
>      there's a problem, which I'll get to in point 3).
>   2) Do we pass the attributes as well? I don't think
>      we need to. But if we *do*, then the methods for
>      the validator overlap the document handler method
>      names. Keeping things the same improves developer
>      productivity because he/she doesn't have to learn
>      a special API. And there *might* be some wacky
>      grammar in the future that takes advantage of
>      this information in some way.
>   3) What about <any processContents='...'/>? That was
>      one of the key motivations for proposing a redesign
>      in the first place! There MUST be some communication
>      between the validator and the content model to say
>      "Yes, this element is a valid transition *BUT* you 
>      [the validator] should process its contents as {strict, 
>      lax, skip}". This sort of implies a return code from
>      startElement but this breaks the API from the document 
>      handler. Does anyone have ideas?
>   4) What about substitution groups and xsi:type? How does
>      this affect the validation engine and the content
>      models?
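On open issue 3, one purely hypothetical resolution is to let startElement return a processing instruction for the subtree instead of a plain success flag. The enum, names, and toy matching logic below are all mine, not a settled XNI design; the point is only that the return value can carry the wildcard's processContents:

```java
// Hypothetical sketch: startElement reports HOW the validator should process
// the child's subtree, which breaks symmetry with the document handler API
// (exactly the tension described in open issue 3).
public class WildcardSketch {
    public enum Process { STRICT, LAX, SKIP, INVALID }

    // Toy model: a declared child "tn:bar" is validated strictly; any element
    // from another namespace matches a hypothetical
    // <any namespace='##other' processContents='skip'/>; anything else is an
    // invalid transition.
    public static Process startElement(String qname) {
        if (qname.equals("tn:bar")) return Process.STRICT;
        if (!qname.startsWith("tn:") && qname.contains(":")) return Process.SKIP;
        return Process.INVALID;
    }

    public static void main(String[] args) {
        if (startElement("tn:bar") != Process.STRICT) throw new AssertionError();
        if (startElement("xhtml:p") != Process.SKIP) throw new AssertionError();
        if (startElement("tn:oops") != Process.INVALID) throw new AssertionError();
        System.out.println("ok");
    }
}
```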
> 
> For DTD and Schema content model validation, there is a
> hierarchy of validator objects. The following shows what 
> this breakdown could be:
> 
>   interface ContentModelValidator
>     class EmptyContentModel
>     class AnyContentModel
>     class MixedContentModel
>       class MixedContentModel2 (only for Schema)
>     class ChildrenContentModel
>       class ChildrenContentModel2 (only for Schema)
>     class SimpleContentModel (only for Schema)
> 
> Most of the validator objects can be shared between DTD
> and Schema validation but there are a few that are Schema
> specific. For example, the mixed content model in Schema
> can be order dependent and also needs to be able to
> handle that magical <any/> construct. Note, however, that
> we *could* just implement a set of content models that is
> the union of all of the features needed by both DTDs and
> Schema. It probably doesn't make that much of a difference
> -- perhaps just minor performance for DTD-only validation.
> 
> The Big Picture
> 
> The big picture isn't all that complicated, though, and
> looks just like what we have already designed for Xerces2
> at the macro level. The only difference is within the
> validator component. For reference, here's what a typical
> parser configuration looks like:
> 
>         +---------+    +-----------+    +--------+
>  XML -> | Scanner | -> | Validator | -> | Parser | -> API
>         +---------+    |           |    |        |
>              |         |           |    |        |
>              v         |           |    |        |
>         +---------+    |           |    |        |
>         |   DTD   | -> |           | -> |        |
>         | Scanner |    |           |    |        |
>         +---------+    +-----------+    +--------+
> 
> The interfaces between the components are XNI, of course.
> 
> Also, in Xerces2, we'll have the ability to cache grammars
> and re-use them for validation without having to parse the
> DTDs and Schema grammars over and over again. So the
> validator component interacts with the grammar pool, as
> illustrated:
> 
>      +---------------------------------+
>   -> |            Validator            | ->
>      +---------------------------------+
>                       |
>      +---------------------------------+
>      |             Grammar             |
>      |              Pool               |
>      +---------------------------------+
>           |           |           |
>      +---------+ +---------+ +---------+
>      | Grammar | | Grammar | | Grammar |
>      |  (DTD)  | |(Schema) | | (???)   |
>      +---------+ +---------+ +---------+
> 
> But the similarity ends once you get into describing how
> the validation of content models is performed within the
> validator. And those details could fill another message
> of this length. So I won't go into any detail at the
> moment. I'd rather get some discussion going, first.
> 
> Conclusion
> 
> I do think that we need to redesign the validation engine.
> But I know that there is a serious cost in performance in
> the redesign. So the questions that remain are:
> 
>   1) Does the redesign need to happen at all?
>   2) Is there another way to support the needed features
>      without incurring the huge performance cost?
>   3) Is there some hybrid algorithm so that we can do
>      validation the "fast" way if there's no chance of
>      an <any/> occurring and "slow" if there is? (This
>      assumes, of course, that we give up on the error
>      ordering problem.)
>   4) How is this all going to impact implementation and
>      maintainability?
> 
> I've noticed that the discussion on XNI API design has dropped
> down to nearly nothing. But I'd like to get some discussion
> started on this topic as well. And even more so, I'd like to
> get people to volunteer to help design this beast and then
> actually implement it in Xerces2.
> 
> So...
> 
> Did anyone find this posting useful? And who wants to help?
> 
> -- 
> Andy Clark * IBM, TRL - Japan * andyc@apache.org
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
> 


Re: [Xerces2] A Tale of Two Validation Engines (LONG!)

Posted by Andy Clark <an...@apache.org>.
Brad O'Hearne wrote:
> Secondly, though I have finished reading, I still need to ponder it a bit
> before giving any detailed suggestions.  However, I do fall a bit on Ted's
> side by saying that I believe there has got to be a data structure solution
> to the expected performance problems you are anticipating.  I'll give it

You would think so... and if you or Ted figure out a way to do
it efficiently and still keep it generalized, I'd be more than
happy to hear it! :) But the only way I currently see to get
the information we need *and* make it perform just as fast as
before (i.e. get rid of the extra method calls) is to inline
the content model validation operations directly into the
validator. But then we're hard-wiring specific validation
behaviors which are limiting in nature and a potential loss
of maintainability.

> don't have more to say at the moment, but I wanted to let you know that your
> email didn't fall on deaf ears. :)

Cool! Hope to be hearing more from you soon. And I'll start
working on some more information.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org

RE: [Xerces2] A Tale of Two Validation Engines (LONG!)

Posted by Brad O'Hearne <ca...@megapathdsl.net>.
Andy,

I was just able to read through your original post and this thread.  First,
thanks for taking the time to thoroughly explain the issues at hand.
Secondly, though I have finished reading, I still need to ponder it a bit
before giving any detailed suggestions.  However, I do fall a bit on Ted's
side by saying that I believe there has got to be a data structure solution
to the expected performance problems you are anticipating.  I'll give it
some thought and see if I can't contribute an idea or two.  I'm sorry I
don't have more to say at the moment, but I wanted to let you know that your
email didn't fall on deaf ears. :)

BradO


Re: [Xerces2] A Tale of Two Validation Engines (LONG!)

Posted by Andy Clark <an...@apache.org>.
Ted Leung wrote:
> It seems to me that an on-the-way-out system has more information
> that it can use to do the processing.  There's no need for lookahead
> because you've seen everything.  That's a general statement.  The

Actually, it has a lot less information on-the-way-out than
on-the-way-in. And the reason is that, on the way in, we only
collected the elements that were seen without regard to whether
they were *valid* in that context. We use this for validating
the parent element on-the-way-out, as you know.

Even though we have seen everything, we didn't know what
transitions took us to that point for each element. This is
lost information; it's only known within the validate method 
of the content model validator at the time of validation. 
However, we need to know that information *before* we go 
down into that element's descendants because it affects how 
we apply validation at the lower levels.

This is the kind of problem where half a dozen times I think
that I'm over-complicating the issue and that we don't need
to switch to an on-the-way-in validation scheme. However, the
thing that always brings me back is the following question:

  If we need to ask for the transition information as we 
  see child elements so that we can determine what the
  validation processing should be for that subtree, then 
  why not perform the validation of that element at the 
  same time?

So, time and time again I arrive at the same conclusion 
which is to switch to an on-the-way-in validation scheme.

> But since the validity check doesn't occur until the </foo> element
> is hit, don't you already know what the transition was?  The problem

Take the following example:

  <foo>
   <baz/>         <!-- validate <baz> -->
   <xhtml:p>
    <xhtml:blah/>  <!-- validate <xhtml:blah> -->
   </xhtml:p>     <!-- validate <xhtml:p> -->
  </foo>         <!-- validate <foo> -->

Based on the content model defined in my earlier posting,
this should validate without errors (assuming that the
xhtml namespace is already properly bound, of course).
However, we don't know that the <xhtml:p>, which matches
the <any namespace='##other' .../>, is the <any> where
the processContents has been declared to be "skip". So
if no XHTML grammar can be found OR having seen the
<xhtml:blah> element, an error would occur.

This all hinges, of course, on defaulting to "strict"
content processing. If we assume "strict", we're wrong; if
we don't, then we run into the opposite situation and we're
wrong again.

Like I said before: ultimately, we need to decide if it's
worth the performance impact to support this feature.
Obviously there is pressure to be 100% Schema compliant.
As soon as we don't implement this feature, people will
ask for it.

> of which processContents to apply could be solved by doing more
> book keeping when building the data structures (putting elements
> on the stack) before the validity check happens, couldn't it?  Am
> I missing something here?

That extra bookkeeping implies a method call to figure
out how we should process the contents of the element we
have just seen. If we have to make a method call anyway,
let's just do the validation at the same time.

> > I haven't been very successful in providing performance
> > critical feedback to the Schema WG. I hope you fare better.
> 
> I wasn't tremendously successful when I was in your shoes.  Now,
> where were my s-exprs?

One of the reasons why it's hard to change this and other
features in XML Schema is because, like a lot of other
specs being developed right now, it's based on the XML
InfoSet. This is a nice foundation to build on but causes 
implementation problems for processors that deal with XML
in a serial fashion. But let's not get me started... :)

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org

Re: [Xerces2] A Tale of Two Validation Engines (LONG!)

Posted by Ted Leung <tw...@sauria.com>.
----- Original Message ----- 
From: "Andy Clark" <an...@apache.org>
To: <xe...@xml.apache.org>
Sent: Tuesday, March 06, 2001 1:49 PM
Subject: Re: [Xerces2] A Tale of Two Validation Engines (LONG!)


> Ted Leung wrote:
> > I'm still not clear that we have to go from on-the-way-out to
> > on-the-way-in.  Isn't there a clever data structure way out of
> > this?

It seems to me that an on-the-way-out system has more information
that it can use to do the processing.  There's no need for lookahead
because you've seen everything.  That's a general statement.  The
current system may be deficient in that regard, but I don't necessarily
believe that it's impossible to fix.   More knowledge can only be good,
and you should have more knowledge in a properly implemented 
on-the-way-out system.

> The problem arises from the use of "processContents" on an
> <any> element in an XML Schema content model. Consider the 
> following:
> 
>   <!-- xmlns:tn='ThisTargetNamespace' -->
>   <element name='foo'>
>    <complexType>
>     <choice>
>      <sequence>
>       <element ref='tn:bar'/>
>       <any namespace='##other' processContents='strict'/>
>      </sequence>
>      <sequence>
>       <element ref='tn:baz'/>
>       <any namespace='##other' processContents='skip'/>
>      </sequence>
>     </choice>
>    </complexType>
>   </element>
> 
> I think currently we have some semi-smart code from Eric Ye
> that will check the possible children of the declared element 
> to see if <any> is used. And if so, then the process contents 
> is applied. [Eric: Can you confirm this?]
> 
> The problem is that we need to know the transition in order
> to determine what processContents should be applied. (e.g.
> "strict" vs. "skip" in the example above.) And this feature
> in Schema is the primary motivation for switching from an
> on-the-way-out scheme to an on-the-way-in scheme.

But since the validity check doesn't occur until the </foo> element
is hit, don't you already know what the transition was?  The problem
of which processContents to apply could be solved by doing more
book keeping when building the data structures (putting elements
on the stack) before the validity check happens, couldn't it?  Am
I missing something here?

> > I'm pretty concerned about the perf implications of this.  If
> > they turn out to be true, then someone needs to raise this as an
> > objection before the CR goes to Recommendation.
> 
> I haven't been very successful in providing performance
> critical feedback to the Schema WG. I hope you fare better.

I wasn't tremendously successful when I was in your shoes.  Now,
where were my s-exprs?

> -- 
> Andy Clark * IBM, TRL - Japan * andyc@apache.org
> 
> 


Re: [Xerces2] A Tale of Two Validation Engines (LONG!)

Posted by Andy Clark <an...@apache.org>.
Ted Leung wrote:
> I'm still not clear that we have to go from on-the-way-out to
> on-the-way-in.  Isn't there a clever data structure way out of
> this?

The problem arises from the use of "processContents" on an
<any> element in an XML Schema content model. Consider the 
following:

  <!-- xmlns:tn='ThisTargetNamespace' -->
  <element name='foo'>
   <complexType>
    <choice>
     <sequence>
      <element ref='tn:bar'/>
      <any namespace='##other' processContents='strict'/>
     </sequence>
     <sequence>
      <element ref='tn:baz'/>
      <any namespace='##other' processContents='skip'/>
     </sequence>
    </choice>
   </complexType>
  </element>

I think currently we have some semi-smart code from Eric Ye
that will check the possible children of the declared element 
to see if <any> is used. And if so, then the process contents 
is applied. [Eric: Can you confirm this?]

The problem is that we need to know the transition in order
to determine what processContents should be applied. (e.g.
"strict" vs. "skip" in the example above.) And this feature
in Schema is the primary motivation for switching from an
on-the-way-out scheme to an on-the-way-in scheme.
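Concretely (a toy of mine, not Eric Ye's code): in the schema above, the wildcard's processContents depends on which <sequence> branch of the <choice> the DFA took, i.e. on the transition made by the *previous* child, not on the wildcard match itself:

```java
// Illustrative only: which processContents applies to foo's second child is
// determined by the transition taken for foo's FIRST child.
public class FooModel {
    /** processContents for foo's second child, given foo's first child. */
    public static String processContentsAfter(String firstChild) {
        switch (firstChild) {
            case "tn:bar": return "strict"; // first branch of the <choice>
            case "tn:baz": return "skip";   // second branch
            default:       return "invalid transition";
        }
    }

    public static void main(String[] args) {
        if (!processContentsAfter("tn:bar").equals("strict"))
            throw new AssertionError();
        if (!processContentsAfter("tn:baz").equals("skip"))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```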

> I'm pretty concerned about the perf implications of this.  If
> they turn out to be true, then someone needs to raise this as an
objection before the CR goes to Recommendation.

I haven't been very successful in providing performance
critical feedback to the Schema WG. I hope you fare better.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org