You are viewing a plain text version of this content. The canonical link for it is here.
Posted to j-dev@xerces.apache.org by Andy Clark <an...@apache.org> on 2001/04/07 03:10:08 UTC
[Schema] Identity Constraints Implementation Details (LONG)

The identity constraint support in Xerces 1.x needs to be
updated to be in line with the subset defined in the PR and
also complete the remaining features that aren't implemented. 

When I first implemented this support there was no subset so 
I defined one with even more restrictive features than what 
is now defined. Now the design (which I thought was rather 
nice and clean ;)  needs a fundamental change in order to 
implement the updates.

I thought that I would document my current implementation in 
this posting and also try to explain what changes need to be 
made. That way someone else can pick it up if they feel so 
inclined. :)

Current Design and Implementation

I started implementing this support in Xerces2 with the
intention of back-porting it to Xerces 1.x. I thought 
that it would be easier to start there but I discovered,
much to my chagrin, that the validation engine in Xerces2 
doesn't even support Schema, yet. So I switched gears and 
implemented it Xerces 1.x.

Regardless of how the implementation actually turned out,
I intended to create an interface to represent a document
fragment. Since identity constraint selectors and fields
match within a subset (or fragment) of the instance
document, this seemed like a natural fit. Then the code
that would actually match the XPath subset in the parser
would implement this interface. 

However, in Xerces 1.x I just coded the support directly 
into the code without first defining a document fragment.
But it's the same behavior regardless. So, for the 
purposes of discussion, think of the XPath matchers as
document fragment handlers. Therefore, they have methods
like start/endElement, characters, etc.

I will now detail the basics of the design and how it
all fits together. 

* overview

First, the code for this work lives in the org.apache.
xerces.validators.schema.identity package. This is the 
home of all of the classes that represent the different 
kinds of identity constraints in Schema as well as the 
XPath code needed for matching. Also, the XMLElementDecl
structure added the identity constraint information and
the XMLValidator class needed to be updated in order to 
use these classes. But first I'll talk about the classes 
for the XPath implementation and identity constraints.

The XPath implementation is parses XPath expressions 
and builds up a token array of XPath symbols. The token 
array is then used to build up an internal model of the 
XPath. This internal model is then used directly by a 
class called XPathMatcher. The XPathMatcher, as I 
alluded to earlier, is essentially a document fragment 
handler that matches the XPath in a serial manner. When 
the matcher succeeds, the protected method matched is 
called. This is to allow subclasses to detect when a 
match occurs and do something with the value that is 
matched.

I "stole" the XPath parsing code from Glenn Marcy 
(thanks, Glenn!) and adapted it for use for identity 
constraints. Some of the changes I made were: combine 
all of the separate XPath implementation classes into 
a single class with inner classes; updated the code 
to work from a series of bytes representing the XPath 
expression to a Java String; and I added hooks into 
the code so that I could subclass the XPath parser and 
have control over what parts of the XPath language are
allowed to be parsed -- this is for the XPath subset 
that I chose.

Now that I've given an overview of the XPath code, I
can now explain by the identity constraint classes.
There are three parts to the identity constraint 
support: 1) representing the identity constraint 
information, 2) interacting with the validator so 
that the selectors and fields can be matched; and 
3) how the unique, key, and key reference field 
values are stored and verified by the validator. I'll
take each in turn.

* representing the identity constraints

The identity constraints are represented by a number
of classes called Unique, Key, and KeyRef. All of
these classes derive from a common base class called
IdentityConstraint. Each identity constaint has a
selector and one or more fields. The selectors and
fields are represented by the Selector and Field
classes, respectively, and are the connection between
the identity constraints and the validator.

If you look at the Field and Selector classes, you'll
notice that they define inner classes that are subsets 
of both the XPath (in order to further subset the 
general XPath) and an XPathMatcher to do something
useful when the selector or field matches. The "useful"
thing I'm talking about is in regards to how the
matchers communicate with the validator in order to
store the appropriate field value information.

* interacting with the validator

When an element is parsed that has associated identity
constraints, the validator is required to communicate
the document information within that element's scope
to the selector and then to the fields when/if the
selector matches. I call this "activating" an XPath
matcher within an element scope.

The Selector class has a createMatcher method that
takes a FieldActivator as an argument. The validator
acts as the field activator because it is responsible
for activating the XPath matchers and then managing
the field values that comprise the unique, key, and
key references.

The Field class also has a createMatcher method but
this method takes a ValueStore as an argument. The
value store interface is the communication mechanism
between a field's XPath matcher and the validator to
keep track of the field values when found.

All of this may seem a little complicated but it is
required to keep the system as modular as possible.
And once you jump into the code it'll be easier to
see the necessary interactions. I'll provide an
example once I have described how the XMLValidator
handles identity constraints.

Okay, now to talk about the validator! :)

* the validator

Now that the framework is in place, the validator has
to do everything to actually make this stuff work! In
short, the validator activates identity constraints;
communicates document fragment information to the
selector and field matchers; store the field values
for unique, key, and key references; and verifies the
identity constraints (e.g. verifying that all key
references actually reference valid keys).

I'll start by detailing the flow of code and then
follow up with more details about how the data values
are stored and verified. Here is some pseudo-code for
how identity constraints are supported:

  // activate identity constraints
  for each start element
    if element has identity constraints
      activate selectors for identity constraints
      push selector xpath matcher onto stack      

  // notifying xpath matchers
  for each start element, characters, end element
    for each active xpath matcher
      call appropriate method

  // matching selector
  if selector matched
    selector requests FieldActivator to activate fields
    field xpath matchers pushed onto stack

  // matching field
  if field matched
    field requests ValueStore to store field value

The stack referred to in the pseudo-code is actually
implemented as an inner class called XPathMatcherStack
on the validator. Like other types of validation info
that require keeping a stack of the element scopes, the
identity constraints XPath matchers are also stack
based because they are only active within a certain
element scope. Once the element scope is popped, all
of the selector and/or field XPath matchers that were
active within that scope are de-activated.

Take the following example where the <elem> element
defines a Unique identity constraint with the following
selector and fields:

  selector: foo
  field[0]: @bar
  field[1]: baz

  <!-- document fragment -->
  <elem>
   <foo bar='42'>
    <baz>Hello, World</baz>
   </foo>
  </elem>

Here's what happens (roughly) within the validator:

  XMLValidator#startElement("elem", {})
    // element <foo> has Unique identity constraint
    Selector selector = Unique#getSelector()
    FieldActivator activator = XMLValidator#this;
    XPathMatcher matcher0 = selector.createMatcher(activator)
    push matcher0 on stack
    matcher0.startElement("elem", {})
  XMLValidator#startElement("foo", {@bar="42"})
    matcher0.startElement("foo", {@bar="42"})
      // selector matches
      FieldActivator#activateField(@bar)
        ValueStore store = UniqueValueStore
        XPathMatcher matcher1 = Field#createMatcher(store)
        return matcher1
      push matcher1 on stack
      matcher1.startElement("foo", {@bar="42"})
        // matcher matches
        ValueStore#addValue(@bar, "42")
      FieldActivator#activateField(<baz>)
        ValueStore store = UniqueValueStore
        XPathMatcher matcher2 = Field#createMatcher(store)
        return matcher2
      push matcher2 on stack
      matcher2.startElement("foo", {@bar="42"})
  XMLValidator#startElement("baz", {})
    matcher0.startElement("baz", {})
    matcher1.startElement("baz", {})
    matcher2.startElement("baz", {})
  XMLValidator#characters("Hello, World")
    matcher0.characters("Hello, World")
    matcher1.characters("Hello, World")
    matcher2.characters("Hello, World")
      // matcher2 matches, buffers content until end element
  XMLValidator#endElement("baz")
    matcher0.endElement("baz")
    matcher1.endElement("baz")
    matcher2.endElement("baz")
      UniqueValueStore#addValue(<baz>, "Hello, World")
      add value set to list of unique values
  XMLValidator#endElement("foo")
    matcher0.endElement("foo")
    matcher1.endElement("foo")
    matcher2.endElement("foo")
    pop matcher1
    pop matcher2
  XMLValidator#endElement("elem")
    matcher0.endElement("elem")
    pop matcher0

[Note: The code isn't exact. I did some editing to make it 
more understandable. Did I succeed? ;) Please look at the 
real code for exact details.]

As you can see, the interaction can get quite complicated.
And if you're quick enough, you may be asking yourself
why the validator notifies *all* of the matchers even
though the selector isn't interested in the scope "below"
its match. Well, the answer is simplicity. The code stays
simple when there are no special cases. And besides, there
is code in the XPath matcher that will no-op much of the
work required once it's matched. We could always write
more code in the validator to avoid the extra method
calls but that's just added complexity.

Unlike the ID/IDREF mechanism in DTDs, Schema identity
constraints are relative to specific element scopes.
Therefore, the validator must keep a hashtable with the
field values associated to each identity constraint of
each unique element type. There is an inner class on
the validator called ValueStoreCache which keeps this
information.

The first time that an identity constraint comes into
play, a new value store object is created (whose type 
depends on the type of the identity constraint) and is
put into the ValueStoreCache associated to the element.
Each time after the first, the value store object is
re-used. The UniqueValueStore and KeyValueStore only
have to ensure that no field value set is duplicated.
The KeyRefValueStore, however, must have access to the
associated KeyValueStore so that it can verify that
the references are valid at the end of the document
parsing.

* limitations

The current implementation has certain limitations: it 
doesn't support predicates or the union operator for
selectors.

Differences in Schema Proposed XPath Subset

For a complete definition of the XPath subset designed and
approved by the Schema WG for use with identity constraints,
refer to the following link:

  http://www.w3.org/TR/2001/PR-xmlschema-1-20010330/#coss-identity-constraint

Predicates used to be allowed on the last child step of the
path. Now they appear to be completely removed. Since the
current implementation doesn't support them anyway, this
isn't a problem.

The inclusion of the descendant (//) axis is particularly
troublesome. As is the use of wildcards for child and
attribute axes. (And the union operator for that matter.)
While the descendant axes presents unique problems for an 
efficient implementation, they both share the common 
problem that they introduce ambiguity about which elements/
attributes will actually match in the instance document.

This isn't too terribly troublesome from an implementation
standpoint but it does mean that the matching and identity
constraint validation framework needs to be redone.

<opinion type='personal'>
It doesn't make much sense that a set of fields that make
up a unique or key could have different datatypes based on
what actually gets matched in the instance document. But
this is exactly the situation that ambiguity introduces. 

When XPath 2 introduces datatype information into the path 
then the subset defined for Schema could be updated to 
resolve this problem. But by then the damage is done and 
they won't be able to break documents exploiting this 
"problem".
</opinion>

Changes Needed

The inclusion of the descendant axis requires the most
changes to the implementation. Because of the ambiguity
introduced, the following changes need to be made:

1) The implementation can no longer verify that the field 
datatypes for keys and key references are the same at the 
time of identity constraint declaration.

2) The value stores need to change to be able to store
the datatype for each field value. Since the matched
fields may end up having different types, the comparison
operation to check for duplicates has to change. Now the
values are not duplicates if any of the types for the
fields are different.

This is further complicated by the fact that there needs
to be a way to tell if a type is a derived type of the
other. So some other information is required to be
communicated to the value store's addValue method. But
this could be as simple as the validator keeping track
of the current element decl (which it already does) and
then the value store using that to look up the needed
information. Not impossible but annoying.

3) The new XPath subset needs to be implemented and the
remaining features implemented (like union). But for 
now perhaps it's enough to implement the changes over a
series of releases.

* * *

I may have left out some information but my brain is
now fried...

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org