Posted to j-dev@xerces.apache.org by Eric Ye <er...@locus.apache.org> on 2000/07/12 20:00:29 UTC

Re: XRI requirements - for those who are interested in XPath and Schema Identity constraints

>
> In the XPath Item, I would like to start a discussion about
> a requirement that we have to support Schema identity-constraint
> definition components. According to the Schema Spec:
> "An identity-constraint definition is an association between a name and one
> of several varieties of identity-constraint related to uniqueness and
> reference. All the varieties use [XPath] expressions to pick out sets of
> information items relative to particular target element information items
> which are unique, or a key, or a valid reference, within a specified scope.
> An element information item is only schema-valid with respect to an element
> declaration with identity-constraint definitions if those definitions are
> all satisfied for all the descendants of that element information item
> which they pick out."
>
>
> Most XPath implementations are based on having a DOM or DOM-like
> representation of an XML document; unfortunately this
> is a serious memory and performance constraint.
>
> If a Schema declares a unique, key, or keyref, it means that
> we may have an XPath expression associated with that declaration,
> which may imply that we need to cache the whole document before
> identity validation is possible.
>
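
For concreteness, the quoted spec text refers to declarations of roughly this shape - a hypothetical schema fragment (element and field names are illustrative only), where the selector picks out the constrained "rows" and each field supplies one part of the key value:

```xml
<xs:element name="table" type="tableType">
  <xs:key name="rowKey">
    <!-- selector: which descendants are constrained -->
    <xs:selector xpath="row"/>
    <!-- field: the key value within each selected element -->
    <xs:field xpath="@id"/>
  </xs:key>
</xs:element>
```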

However, the typical use of key and keyref won't require caching the whole
document, because the context element of these keys would most likely be
associated with one row in a database table. The validator only needs to
cache a very small tree every time it encounters "a row in the database
table" while reading the XML instance document, and can then discard it. (I
know there are implications here if we don't do garbage collection, but
caching of some kind seems inevitable.) Even if the whole document is just
one record in a database table, the whole tree won't be big enough to be a
performance burden. Optimizations can also be done to minimize the
footprint of the tree and make tree traversal very fast.
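
Across rows, this scheme reduces to retaining only the extracted key values, never the row subtrees themselves. A minimal sketch in Java (all names hypothetical; field extraction from the cached subtree is assumed to happen elsewhere):

```java
import java.util.HashMap;
import java.util.Map;

// Per-constraint uniqueness table: each completed "row" subtree reports its
// already-extracted key value, then the subtree itself can be discarded.
public class RowKeyValidator {
    // key value -> index of the row where it was first seen
    private final Map<String, Integer> seen = new HashMap<>();

    /**
     * Called once per completed row with its extracted field value.
     * Returns false on a uniqueness violation.
     */
    public boolean reportRow(int rowIndex, String keyValue) {
        Integer first = seen.putIfAbsent(keyValue, rowIndex);
        return first == null;   // null => key was new, row is valid
    }
}
```

Memory grows with the number of distinct keys, not with the size of any row's subtree, which is the point of the argument above.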

>
> One solution to this problem is to support a mini pluggable XPath
> API, or to have a smart XPath version that is able to support streaming
> parsing rather than caching the whole XML document in memory.
> The mini pluggable XPath would support forward axes (an axis that only
> ever contains the context node, or nodes that are after the context node
> in document order, is a forward axis) but not reverse axes.
>

Even with only forward axes and streaming parsing, the validator still needs
to cache the path (or part of it?) from the context element node to the
matched (or selected) element or attribute; it is rather like traversing a
branch of the tree and, if it matches, throwing it away. The really nasty
thing about the Schema identity constraints is that they allow multiple
fields (every field needs a path of its own), and a context element could
typically have one key and one keyref defined, so the validator could end up
caching multiple branches of the same tree, some of them overlapping.
Whereas, with ONE cached tree for the context element, all keys, keyrefs,
and their fields can be calculated using that one tree. In addition, the
solution given above would only provide a subset of what the Schema spec
specifies; another downside is that it probably won't be a straightforward
implementation.
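
The "one cached tree" alternative can be illustrated with the standard JAXP XPath API: once the context element's subtree is cached as one small DOM, every field path of every key and keyref on that element can be evaluated against the same tree, with no per-field path tracking during parsing. A sketch, assuming the row subtree fits comfortably in memory (class and method names are hypothetical):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class OneTreeManyFields {

    /** Evaluate several relative field paths against one cached context tree. */
    public static String[] fields(String contextXml, String... fieldPaths) {
        try {
            // Build the small DOM for the context element's subtree once.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(contextXml)));
            XPath xp = XPathFactory.newInstance().newXPath();
            // Each field path is evaluated against the same cached tree.
            String[] values = new String[fieldPaths.length];
            for (int i = 0; i < fieldPaths.length; i++)
                values[i] = xp.evaluate(fieldPaths[i], doc.getDocumentElement());
            return values;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

For example, `fields("<row><id>7</id><name>x</name></row>", "id", "name")` resolves both field paths from one tree.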


> So the architecture should allow a pluggable XPath component through
> an interface.
>

Agreed.

>
> Any suggestions?
>
>
> Jeffrey Rodriguez
> IBM Silicon Valley
>

Jeff, I know my office is almost right across from yours :-), but I just
wanted to throw in my 2 cents on the mailing list.
Opinions from anyone else on the mailing list would be very welcome.

_____


Eric Ye * IBM, JTC - Silicon Valley * ericye@locus.apache.org





Re: XRI requirements - for those who are interested in XPath and Schema Identity constraints

Posted by co...@eng.sun.com.
> > Most XPath implementations are based on having a DOM or DOM-like
> > representation of an XML document; unfortunately this
> > is a serious memory and performance constraint.
> >
> > If a Schema declares a unique, key, or keyref, it means that
> > we may have an XPath expression associated with that declaration,
> > which may imply that we need to cache the whole document before
> > identity validation is possible.
> >
> 
> However, the typical use of key and keyref won't require caching the whole
> document, because the context element of these keys would most likely be
> associated with one row in a database table. The validator only needs to
> cache a very small tree every time it encounters "a row in the database
> table" while reading the XML instance document, and can then discard it. (I
> know there are implications here if we don't do garbage collection, but
> caching of some kind seems inevitable.) Even if the whole document is just
> one record in a database table, the whole tree won't be big enough to be a
> performance burden. Optimizations can also be done to minimize the
> footprint of the tree and make tree traversal very fast.

I don't know the full problem, but it seems the question is whether we can
avoid storing the whole tree for validation.

Of course, it would be great to be able to do that, but it seems _very_
hard, and I just hope it's even possible (to implement Schema without
storing the tree).

I really don't see how you can implement "unique" constraints without
storing all the keys (somewhere), and for big documents you just can't
do it in memory. It's a database problem, and I don't think there is a
simple solution to it.

Maybe a solution (not a simple one) would be (assuming we are allowed
to!) to create an index (on disk) and do a minimal amount of database
processing. Of course, such a complex solution is not justified just for
this, but it would be beneficial for many other problems.

If the document is small, we can just keep everything in memory. If it's
bigger than a threshold (like the available memory :-) we create a few temp
files (using a gdbm format, for example) and store all the keys we need.
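
This threshold idea could be sketched as follows (Java, all names hypothetical; a real implementation would use a sorted or dbm-style on-disk index rather than the linear scan of spill files shown here):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Keys live in memory until a limit is hit, then the current batch is
// spilled to a temp file; later additions consult both the in-memory set
// and the spill files.
public class SpillingKeySet {
    private final int limit;
    private final Set<String> inMemory = new HashSet<>();
    private final List<Path> spills = new ArrayList<>();

    public SpillingKeySet(int limit) { this.limit = limit; }

    /** Returns true if the key was new (no uniqueness violation). */
    public boolean add(String key) {
        try {
            for (Path p : spills)                  // check earlier batches
                if (Files.readAllLines(p).contains(key)) return false;
            if (!inMemory.add(key)) return false;  // duplicate in this batch
            if (inMemory.size() >= limit) {        // spill the full batch
                Path p = Files.createTempFile("keys", ".txt");
                p.toFile().deleteOnExit();
                Files.write(p, inMemory);          // one key per line
                spills.add(p);
                inMemory.clear();
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The in-memory footprint is bounded by `limit` regardless of document size, at the cost of disk I/O once spilling starts.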

One possible use of this would be in XSL/XPath - if you have an XPath that
is used many times, you could just build and use the index.


Costin
(I know it doesn't sound too good, BTW - take it as brainstorming.)
(Sooner or later people will have to accept that you can't store a big
document in memory - at least if it's bigger than the memory, and XML
documents can be that big.)

> 
> >
> > One solution to this problem is to support a mini pluggable XPath
> > API, or to have a smart XPath version that is able to support streaming
> > parsing rather than caching the whole XML document in memory.
> > The mini pluggable XPath would support forward axes (an axis that only
> > ever contains the context node, or nodes that are after the context node
> > in document order, is a forward axis) but not reverse axes.
> >
> 
> Even with only forward axes and streaming parsing, the validator still needs
> to cache the path (or part of it?) from the context element node to the
> matched (or selected) element or attribute; it is rather like traversing a
> branch of the tree and, if it matches, throwing it away. The really nasty
> thing about the Schema identity constraints is that they allow multiple
> fields (every field needs a path of its own), and a context element could
> typically have one key and one keyref defined, so the validator could end up
> caching multiple branches of the same tree, some of them overlapping.
> Whereas, with ONE cached tree for the context element, all keys, keyrefs,
> and their fields can be calculated using that one tree. In addition, the
> solution given above would only provide a subset of what the Schema spec
> specifies; another downside is that it probably won't be a straightforward
> implementation.
> 
> 
> > So the architecture should allow a pluggable XPath component through
> > an interface.
> >
> 
> Agreed.
> 
> >
> > Any suggestions?
> >
> >
> > Jeffrey Rodriguez
> > IBM Silicon Valley
> >
> 
> Jeff, I know my office is almost right across from yours :-), but I just
> wanted to throw in my 2 cents on the mailing list.
> Opinions from anyone else on the mailing list would be very welcome.
> 
> _____
> 
> 
> Eric Ye * IBM, JTC - Silicon Valley * ericye@locus.apache.org
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
>