Posted to xindice-users@xml.apache.org by Tom Bradford <br...@dbxmlgroup.com> on 2002/01/15 22:28:43 UTC

Future of Xindice

Last Friday, I formally resigned from my position as Chief Architect of 
the dbXML Group, and so I am now a free agent.  I am about to take a job 
with a company in the Bay Area, and will be relocating there shortly 
after that.  This new position may or may not afford me the ability to 
continue working on Xindice with the amount of attention I devote to it 
now, so we need to start taking steps in order to make sure that the 
project continues to evolve if the situation is such that I can't do a 
lot of the coding any more.

There are a few things that need to be addressed in future revisions 
of Xindice.  I'll run through them very quickly, and then I'd like to 
hear people's feedback.

Wire Protocol changes
-------------------------------
These have been widely mentioned, but we need to start moving away from 
CORBA and supporting a more flexible wire protocol system with Xindice.  
I'd propose to use my Labrador framework to provide this functionality, 
as I've already experimented with it, and it works rather well.

Schema support
-----------------------
We need to support schemas in an abstracted fashion.  If we can 
architect a content model API that would allow the system to validate 
and operate against a content model without needing to know that the 
content model is based on XML Schemas or Relax NG, that would be ideal.
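
To make the idea concrete, here is a minimal sketch of what such a content model API could look like. Everything in it is an assumption for illustration: ContentModel, RequiredAttributesModel, ValidatingCollection and XmlText are hypothetical names, not Xindice classes, and the toy model merely stands in for an XML Schema or Relax NG backed implementation. The point is only that the storage side talks to the interface and never to a schema language.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical abstraction; none of these types exist in Xindice.
interface ContentModel {
    // Returns a list of problems; an empty list means the document is valid.
    List<String> validate(Document doc);
}

// Toy model standing in for an XML Schema or Relax NG backed implementation.
// It only checks the root element name and a set of required root attributes.
final class RequiredAttributesModel implements ContentModel {
    private final String rootName;
    private final String[] requiredAttributes;

    RequiredAttributesModel(String rootName, String... requiredAttributes) {
        this.rootName = rootName;
        this.requiredAttributes = requiredAttributes.clone();
    }

    @Override
    public List<String> validate(Document doc) {
        List<String> problems = new ArrayList<>();
        Element root = doc.getDocumentElement();
        if (!rootName.equals(root.getTagName())) {
            problems.add("expected root <" + rootName + ">, found <" + root.getTagName() + ">");
        }
        for (String name : requiredAttributes) {
            if (!root.hasAttribute(name)) {
                problems.add("missing required attribute '" + name + "'");
            }
        }
        return problems;
    }
}

// The storage side depends only on ContentModel, never on a schema language.
final class ValidatingCollection {
    private final ContentModel model;
    private final List<Document> stored = new ArrayList<>();

    ValidatingCollection(ContentModel model) {
        this.model = model;
    }

    // Stores the document only if the content model reports no problems.
    boolean insert(Document doc) {
        if (!model.validate(doc).isEmpty()) {
            return false;
        }
        stored.add(doc);
        return true;
    }

    int size() {
        return stored.size();
    }
}

// Small parsing helper so the sketch can be exercised without checked exceptions.
final class XmlText {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```

Swapping in a different schema language would then mean supplying another ContentModel implementation, with no change to ValidatingCollection.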

Context-sensitive indexing
------------------------------------
XML Schemas introduces the idea of contextually dependent typing.  What 
this means is that for any particular schema, that schema may use the 
same element name in more than one scope, and assign to that element 
name a completely different primitive type for each scope.  So in one 
scope, it may be an int, while in another it may be a string, or even a 
complex structure.

Xindice's indexing system was originally designed when DTDs were the only 
standard way of representing an XML schema, and in DTDs, an element name 
is globally unique.  So we need to rearchitect the indexing system to 
support the ability for attaching a particular index to a schema 
context.  I have some vague ideas of how to do this, but I'd like to get 
a user's perspective on how you'd like to see this made available.
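
As one possible starting point for that discussion, here is a rough sketch of an index registry keyed by schema context instead of by element name. It is only an illustration under stated assumptions: ContextIndexes and its methods are hypothetical, and an absolute element path such as /order/id is used as a stand-in for a real schema context.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: one index per schema context, not per element name.
final class ContextIndexes {
    enum ValueType { INT, STRING }

    // Declared type for each context path (a stand-in for a schema context).
    private final Map<String, ValueType> declared = new HashMap<>();
    // Per-context sorted map from typed value to the keys of matching documents.
    private final Map<String, TreeMap<Object, List<String>>> entries = new HashMap<>();

    void declare(String contextPath, ValueType type) {
        declared.put(contextPath, type);
        entries.put(contextPath, new TreeMap<>());
    }

    // Converts the raw text according to the type declared for this context.
    // Values found in contexts without a declared index are ignored.
    void add(String contextPath, String rawValue, String documentKey) {
        ValueType type = declared.get(contextPath);
        if (type == null) {
            return;
        }
        Object typed = (type == ValueType.INT)
                ? (Object) Integer.valueOf(rawValue.trim())
                : rawValue;
        entries.get(contextPath)
                .computeIfAbsent(typed, k -> new ArrayList<>())
                .add(documentKey);
    }

    // The caller must pass a value of the type declared for the context.
    List<String> lookup(String contextPath, Object typedValue) {
        TreeMap<Object, List<String>> index = entries.get(contextPath);
        if (index == null) {
            return Collections.emptyList();
        }
        List<String> hits = index.get(typedValue);
        return hits == null ? Collections.emptyList() : hits;
    }
}
```

With /order/id declared as INT and /customer/id declared as STRING, the text "0042" is indexed as the integer 42 in the first context and as the literal string "0042" in the second, even though both elements are named id.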


Large Documents and Document Versioning
------------------------------------------------------------
Xindice needs to be capable of supporting massive documents in a 
scalable fashion and with acceptable performance.  Currently, the 
document representation architecture is based on a tokenized, lazy DOM 
where the bytestream images that feed the DOM are stored and retrieved 
in a paged filing system.  Every document is treated as an atomic unit.  
This has some serious limitations when it comes to massive documents.

In order to support very large documents, the tokenization system needs 
to be replaced and geared more toward the simplified representation of 
document structure rather than an equal balance of structure and 
content.  Also, the Filer interfaces need to support the notion of 
streaming, and even more importantly, the ability to support random 
access streaming.
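
As a rough illustration of the random access streaming idea, the sketch below shows a filer interface that can open a stream over a byte range of a stored record. StreamingFiler and InMemoryStreamingFiler are hypothetical names, not the existing Filer interfaces, and the in-memory map only stands in for the paged file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface; it is not part of Xindice.
interface StreamingFiler {
    void write(String key, byte[] value);

    // Opens a stream over at most 'length' bytes starting at 'offset',
    // so a caller can read part of a large record without loading all of it.
    InputStream openStream(String key, long offset, int length);
}

// In-memory stand-in for a paged file, just enough to exercise the interface.
final class InMemoryStreamingFiler implements StreamingFiler {
    private final Map<String, byte[]> records = new HashMap<>();

    @Override
    public void write(String key, byte[] value) {
        records.put(key, value.clone());
    }

    @Override
    public InputStream openStream(String key, long offset, int length) {
        byte[] data = records.get(key);
        if (data == null) {
            throw new IllegalArgumentException("no record for key " + key);
        }
        if (offset < 0 || offset > data.length || length < 0) {
            throw new IndexOutOfBoundsException("bad range for key " + key);
        }
        // Clamp the requested length to what is actually stored.
        int available = (int) Math.min((long) length, data.length - offset);
        return new ByteArrayInputStream(data, (int) offset, available);
    }

    // Convenience helper: drains a stream into a UTF-8 string.
    static String readText(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            for (int n; (n = in.read(chunk)) != -1; ) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real implementation would map the byte range onto pages and fetch only the pages that the range touches.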

Also, the tokenization system needs to support versioning in one way or 
another.  For small documents, complete document revision links are 
permissible, but for massive documents, there's no way that versioning 
of that nature is acceptable.  So, the tokenization system needs to 
understand the notion of versioned linking.

The DTSM stuff that I started working on will help with the massive 
document problem, but we'd need to introduce the versioning concept into 
the specification as well.


Paged Files and BTrees
---------------------------------
Nodes that are stored by Paged files are currently materialized in their 
entirety, even if all of their content isn't needed.  Originally, it was 
written like this because I wanted to nail down functionality.  In a 
language like C++ or C, this is not an issue because you point a struct 
pointer to an offset into your buffer, and voila, you're done, but in 
Java, it requires a lot of conversion.  For Java, it may improve 
performance quite a bit if node portions (such as BTree node pointer and 
value lists) were materialized only on demand rather than as a whole.   
Obviously, this would require some research to determine if my guess is 
true or not.
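
For what it's worth, the on-demand idea could look roughly like the sketch below, which reads individual pointers straight out of the page image with java.nio.ByteBuffer and only decodes the full list when asked. LazyBranchNode and the page layout (an int count followed by long pointers) are assumptions made for illustration; they are not the actual Xindice BTree format, and whether this is faster in practice would still need measuring.

```java
import java.nio.ByteBuffer;

// Hypothetical page layout, not Xindice's actual on-disk format:
//   [int count][long ptr0][long ptr1]...[long ptrN-1]
final class LazyBranchNode {
    private static final int HEADER_BYTES = 4;
    private static final int POINTER_BYTES = 8;

    private final ByteBuffer page; // raw page image as read from the paged file
    private long[] pointers;       // decoded only when first requested

    LazyBranchNode(ByteBuffer page) {
        this.page = page;
    }

    // Absolute read from the page; nothing is copied or decoded up front.
    int count() {
        return page.getInt(0);
    }

    // Reads a single pointer without materializing the rest of the list.
    long pointerAt(int i) {
        if (i < 0 || i >= count()) {
            throw new IndexOutOfBoundsException("pointer " + i);
        }
        return page.getLong(HEADER_BYTES + POINTER_BYTES * i);
    }

    // Full materialization, performed at most once and only on demand.
    long[] pointers() {
        if (pointers == null) {
            long[] decoded = new long[count()];
            for (int i = 0; i < decoded.length; i++) {
                decoded[i] = page.getLong(HEADER_BYTES + POINTER_BYTES * i);
            }
            pointers = decoded;
        }
        return pointers;
    }

    boolean isMaterialized() {
        return pointers != null;
    }

    // Builds a page image in the layout above, for demonstration purposes.
    static ByteBuffer encode(long... ptrs) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + POINTER_BYTES * ptrs.length);
        buf.putInt(ptrs.length);
        for (long p : ptrs) {
            buf.putLong(p);
        }
        buf.flip();
        return buf;
    }
}
```

A search that needs only one child pointer calls pointerAt() and never pays for decoding the whole node.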

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database) - http://xml.apache.org
Creator - Project Labrador (Web Services Framework) - 
http://notdotnet.org


Re: Future of Xindice

Posted by Jeff Greif <jg...@alumni.princeton.edu>.
Tom,
Hope your services are not lost to the Xindice project forever.

Regarding support for schemas, it would be helpful to enumerate what aspects
of schemas need to be supported.  I've supplied a partial list, but perhaps
others could add more?

 - validation of update on existing document (validation of input docs
should probably occur outside Xindice)
 - supplying default values of attributes
 - indexing based on schema (including indexing on combinations of elements
and attributes) and including the context-sensitive indexing mentioned
 - joins when they are implemented
 - detection of queries which will always fail on valid docs for the schema
(is this a frill?  what if the collection is not homogeneous?)

Jeff


Re: Future of Xindice

Posted by Mathias Neumueller <Ma...@cis.strath.ac.uk>.
Tom Bradford wrote:
> 
> 
> Large Documents and Document Versioning
> ------------------------------------------------------------
> Xindice needs to be capable of supporting massive documents in a
> scalable fashion and with acceptable performance.  Currently, the
> document representation architecture is based on a tokenized, lazy DOM
> where the bytestream images that feed the DOM are stored and retrieved
> in a paged filing system.  Every document is treated as an atomic unit.
> This has some serious limitations when it comes to massive documents.
> 
> In order to support very large documents, the tokenization system needs
> to be replaced and geared more toward the simplified representation of
> document structure rather than an equal balance of structure and
> content.  Also, the Filer interfaces need to support the notion of
> streaming, and even more importantly, the ability to support random
> access streaming.

As I have mentioned in a private discussion earlier, I am working on
compressed in-memory representations of XML data, especially large,
data-centric documents. One of the side effects of my current design is
a split of content and structure, but it is quite immature so far. I was
thinking that integrating it with Xindice would be a nice thing, but
I currently have neither the time nor the skills required to do it. However,
I'm happy to hear about any ideas and would contribute (parts of) my design
if that's any help.

Mathias

Re: Future of Xindice

Posted by Murray Altheim <mu...@sun.com>.
Joel Rosi-Schwartz wrote:
> 
> Murray Altheim wrote:
> 
> > I'm likely to be tackling something akin to this in the next few months,
> > trying to hook up javacvs (the netbeans.org version, not the sourceforge
> > one which is under GPL) to Xindice. I don't have much of a need for large
> > document support, but the approach I'd take would be perhaps useful in
> > that regard. Basically, content would be checked into javacvs prior to being
> > stored in Xindice, hence most revision control issues are handled outside
> > of the database. I would not be attempting node-based revision control
> > support (ie., as Tom said above, support within the tokenization system),
> > which would be very valuable but outside the scope of effort I'm willing
> > to take on. If someone is willing to do the node-based RCS within Xindice,
> > I'm quite happy to step aside.
> 
> If you tackle this, will it be under the Xindice, NetBeans, Sun or private
> banner?  Personally, I would like to see this be available as either part of or
> an extension to Xindice, and would be willing to participate in such an effort.
> I assume that the NetBeans javacvs is Open Source and there are no license
> issues here.

I don't have the ability either as a Sun employee or an individual to
publish code on the Apache web site. This takes a rather concerted
effort, as I'm sure Tom Bradford will agree. I've authored an API called
"XNode" that wraps DOM nodes with a metadata wrapper (similar to a 
simplified SOAP) that is part of an upcoming Sun web services release,
and has been sent on to Tom Bradford for inclusion in Xindice (if he 
feels it's appropriate). As a Sun employee it seems I'm able to
contribute bug fixes and small bits of code like an API, but lawyers 
get involved if I attempt to go beyond that. It's actually quite 
difficult to publish code as a Sun employee, unless that code is part
of a sanctioned project. I expect this is true of most large companies,
and don't think Sun is unusual in this regard.

Netbeans is a Sun-sponsored initiative that provides the Forte code base
in open source. It has its own open source license that is similar to 
(and based upon) the Mozilla license. Sun is trying to establish a
relationship to its open source similar to the one Netscape has with Mozilla.
But in terms of publishing under Netbeans, I'd have the same 
restrictions with Netbeans, ie., anything substantial would have to 
be supported fully as a project within Sun.

I won't be an employee of Sun much longer, as I'm leaving soon to begin
a Ph.D. program at the Knowledge Media Institute in Milton Keynes, UK. 
I will then be able to begin producing code that I can publish any way 
I like. I'll either be doing that as part of something like sourceforge,
or on my own web site(s).

I'll be including an implementation of the XNode API and a node-based
datastore (using Xindice of course) as part of a larger package that
will be part of my Ph.D. project. This should show up online this Spring
under an Apache license under the project name "Ceryle."

Murray

...........................................................................
Murray Altheim                         <mailto:murray.altheim@sun.com>
XML Technology Center, Java and XML Software
Sun Microsystems, Inc., MS MPK17-102, 1601 Willow Rd., Menlo Park, CA 94025

            Corporations do not have human rights, despite the 
          altogether too-human opinions of the US Supreme Court.

Re: Future of Xindice

Posted by Joel Rosi-Schwartz <jo...@btconnect.com>.

Murray Altheim wrote:

> I'm likely to be tackling something akin to this in the next few months,
> trying to hook up javacvs (the netbeans.org version, not the sourceforge
> one which is under GPL) to Xindice. I don't have much of a need for large
> document support, but the approach I'd take would be perhaps useful in
> that regard. Basically, content would be checked into javacvs prior to being
> stored in Xindice, hence most revision control issues are handled outside
> of the database. I would not be attempting node-based revision control
> support (ie., as Tom said above, support within the tokenization system),
> which would be very valuable but outside the scope of effort I'm willing
> to take on. If someone is willing to do the node-based RCS within Xindice,
> I'm quite happy to step aside.

If you tackle this, will it be under the Xindice, NetBeans, Sun or private
banner?  Personally, I would like to see this be available as either part of or
an extension to Xindice, and would be willing to participate in such an effort.
I assume that the NetBeans javacvs is Open Source and there are no license
issues here.

Joel

Re: Future of Xindice

Posted by Dare Obasanjo <kp...@yahoo.com>.
----- Original Message -----
From: "Murray Altheim" <mu...@sun.com>
To: <xi...@xml.apache.org>
Cc: <xi...@xml.apache.org>
Sent: Tuesday, January 15, 2002 3:36 PM
Subject: Re: Future of Xindice


> Tom Bradford wrote:
> [...]

> > Schema support
> > -----------------------
> > We need to support schemas in an abstracted fashion.  If we can
> > architect a content model API that would allow the system to validate
> > and operate against a content model without needing to know that the
> > content model is based on XML Schemas or Relax NG, that would be ideal.
>
> Why in Xindice? There are several places where validation can occur:
> 1. upon storing in the database; 2. following an XUpdate; and 3. upon
> retrieving content from the database. In all three cases, the DOM nodes
> to be validated are already available to the developer outside of
> Xindice and can be validated using existing validation tools and techniques.
>
> Not that I'm going to fight the issue, but I'm rather against including
> support for schema validation within the Xindice, as this is an application-
> level issue (as I've described in previous messages). There are many
> different types of schema validation, and different validation needs, eg.,
> different levels of strictness or different content validation at various
> places within a processing regimen. Validation is a complicated issue that
> doesn't have a one-size-fits-all type of solution.
>
> There are a plethora of validation options out there and I don't see that
> one API could serve the variety of schema languages, structure and content
> validation needs that would be within a reasonable scope of effort. You'd
> be tackling the same issues that the W3C Schema WG tackled, with the
> "data heads" and "document heads" needs on the table.

I don't think it is impossible for one API to support multiple schema
validation models if designed abstractly enough. Secondly, I don't see why it
is unreasonable for users to be able to constrain the contents of the
database at the document or collection level; after all, we've been doing it for
years with relational and OO databases. This is especially so since there are some
validation actions (e.g. validating identity constraints) that are best left
at the database level instead of making the application developer do the work
if they don't want to.



--
THINGS TO DO IF I BECOME AN EVIL OVERLORD #123
If I decide to hold a contest of skill open to the general public,
contestants will be required to remove their hooded cloaks and
shave their beards before entering.



Re: Validation Issues

Posted by Joel Rosi-Schwartz <jo...@btconnect.com>.

Murray Altheim wrote:

> [I think it's only right that we sub-thread this conversation.]
>
> "Timothy M. Dean" wrote:
> >
> > > -----Original Message-----
> > > From: altheim@mehitabel.eng.sun.com
> > > "Timothy M. Dean" wrote:
> > > >
> > > Perhaps I'm not understanding what you've explained, but it
> > > seems that you're confusing client and server. Xindice is not
> > > a client, it's a database server.
> >
> > No, I fully understand that Xindice is a server - There's no confusion
> > there.
> >
> > > A Xindice system would include client software written by you,
> > > and I would hope you'd have control over both the
> > > installation of the server and how those clients are
> > > configured.
> >
> > True, but what I don't have is the guarantee that everyone who writes
> > applications against this Xindice DB is going to follow the rules that I
> > am expecting. Assume that I am working on an application for the ABC
> > division of some company, and that my application needs to write/read
> > data to a data store. Now assume that another developer in the XYZ
> > division of the same company also is working on an application that
> > needs access to *the same data store*. This is a scenario that many of
> > the companies I work with have encountered.

This is one of the reasons why application servers are an important part of an
enterprise architecture. In your place I would be exploring the viability of
placing JBoss in between the client applications and the database.  You would
then have the ability to "validate" at more levels than merely the schema. You
get a shot at applying business rules where they are required, authentication
and authorization can be accommodated, and it is much easier to address
concurrency issues, to name just a few of the advantages.
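
To illustrate the shape of such a middle tier, here is a small sketch in plain Java. It does not use JBoss or any Xindice API; WriteRequest, WriteCheck, Checks and WriteGateway are hypothetical names, and the in-memory list merely stands in for the database. The idea is that every application writes through one gateway, which runs an ordered chain of checks (authorization, well-formedness, and whatever business rules are needed) before anything is stored.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// A write attempt from some client application (hypothetical type).
final class WriteRequest {
    final String application;
    final String xml;

    WriteRequest(String application, String xml) {
        this.application = application;
        this.xml = xml;
    }
}

// One rule in the chain (hypothetical type).
interface WriteCheck {
    // Returns null when the request passes, otherwise a human-readable reason.
    String reject(WriteRequest request);
}

// A couple of example checks; real business rules would be added the same way.
final class Checks {
    static WriteCheck allowedApplications(String... names) {
        Set<String> allowed = new HashSet<>(Arrays.asList(names));
        return request -> allowed.contains(request.application)
                ? null
                : "application '" + request.application + "' may not write";
    }

    static WriteCheck wellFormedXml() {
        return request -> {
            try {
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new InputSource(new StringReader(request.xml)));
                return null;
            } catch (Exception e) {
                return "document is not well-formed: " + e.getMessage();
            }
        };
    }
}

// Single entry point for writes; the list stands in for the database.
final class WriteGateway {
    private final List<WriteCheck> checks = new ArrayList<>();
    private final List<String> accepted = new ArrayList<>();

    WriteGateway add(WriteCheck check) {
        checks.add(check);
        return this;
    }

    // Runs every check in order; stores the document only if all of them pass.
    String submit(WriteRequest request) {
        for (WriteCheck check : checks) {
            String reason = check.reject(request);
            if (reason != null) {
                return reason;
            }
        }
        accepted.add(request.xml);
        return null;
    }

    int acceptedCount() {
        return accepted.size();
    }
}
```

This only helps if the database itself is not reachable except through the gateway, which is a deployment decision and not something the code can enforce.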

> So what you're saying is that anyone can write anything anytime? Couldn't
> such a system be implemented where the client software used to access
> the database (and the list of people who could) is restricted? If so,
> and if you can restrict access to clients you write, this shouldn't be
> such an issue.
>
> If OTOH anyone *can* write anything they want anytime, you must operate
> in Defensive Mode. You can't assume that the data is valid (or even
> uses the same XML Namespaces as your data). These sound so much like
> what I've called "management issues" that it hardly seems like a
> technical solution is the reasonable solution. I often find designs
> that attempt to solve management problems with technical solutions run
> awry of reality. Such systems are usually fragile. Look at MS Outlook.
>
> > If we assume that my application is implemented to perform all of the
> > necessary validation before storing documents, then I can be confident
> > that my Xindice data store is not "corrupted" for my purposes. My
> > project completes development/testing and is deployed to its user base.
> > Now consider when the other developer in division XYZ completes their
> > application and puts it online, again accessing the same enterprise data
> > that my app is accessing. How do I know that this other application is
> > following the same validation rules that I have depended on?
>
> You rely on agreements amongst the players within your enterprise to
> follow a reasonable set of constraints. If you can't get that agreement
> no technical solution is going to really help.
>
> > Unless I
> > spend a lot of extra effort in *every application* to make sure that
> > everything I read is valid (rather than only validating when I update),
> > I could easily find that my application stops working because some other
> > application has stored data which I consider to be corrupt.
>
> A terrible scenario, indeed.
>
> > You may call this a management issue, but the companies I've worked with
> > on this kind of problem have wanted the ability to define some sort of
> > contract that can be enforced at the database level. Some way to say
> > that "DB will only accept data that conforms to certain rules, so you
> > can be guaranteed that these rules have been met for any data retrieved
> > from the DB". With relational databases they have this ability: They can
> > define the schema of the DB to contain certain tables, each with columns
> > of a certain type, etc. Without some way of enforcing this rule at an
> > enterprise-wide level (rather than on an application-by-application
> > level), I fear that most companies I work with would never consider
> > using Xindice as a solution to their needs.
>
> Well, I do think this boils down to a management issue, especially now
> that you describe it more fully. Xindice is not a solution to your problem
> any more than any other database solution is, as if there are *no*
> controls on data access/manipulation, you're sunk before the boat leaves
> the harbor. Relational databases have the *ability* to do this, but if
> the guy in division XYZ decides to ignore those rules, they're no
> different than Xindice.
>
> What might work as a technical solution would be to intercept write
> events to Xindice and validate the incoming content against a schema.
> I wouldn't consider this as "within" Xindice because you really want
> to validate the content prior to its insertion, not after it's in
> the database and has potentially corrupted it.
>
> > > ID uniqueness is a fairly easy constraint to fulfill. You'd merely
> > > create an Xindice index for IDs (using the XPathService) and
> > > provide that within your application as a HashSet. Incoming
> > > IDs would be checked against the HashSet and if you didn't
> > > return a null, you'd throw a duplicate ID exception. This would
> > > occur prior to the document being corrupted (ie., becoming
> > > invalid) by the insertion of new content, which I'd imagine is
> > > preferable to corrupting it and *then* having to fix the problem.
> > > Another simple means of doing this is to have Xindice return the
> > > existing document as a DOM Document node using Xindice's
> > > getContentAsDOM() method, and then perform the check against
> > > that using Xerces' DOM method getElementById(). This would
> > > allow you to bypass creating your own ID hash table. Which
> > > direction to take would depend upon the application's
> > > requirements, how big the documents are, how often they're
> > > changed, performance configurations, etc.
> >
> > I'll look into this solution: I haven't worked with indexes in Xindice
> > yet, so I can't form an educated opinion on this approach yet. However,
> > it does seem like a bit of a pain when all I want to do is say "Make
> > sure that documents stored in this collection are valid against schema
> > X"... Going through and creating indexes to help me enforce this
> > constraint seems to be a bit of a hack, but I'll look into it.
>
> Except that what you're talking about is two documents: the existing
> document and the one after the change is applied. When do you want to
> validate? And why validate the entire document *after* a potentially
> corrupting change if all you need is to ascertain if there's an ID
> conflict? Seems like a lot of extra work. The getElementById() method
> is the easiest and doesn't require you to even create your own hash.
> I don't think you'd need to resort to an index unless it buys you
> anything -- here I don't see that it does.
>
> As to validating an entire collection, I guess if you have no ability
> to regulate the content going into the database (ie., someone else
> might simply ignore the server-side validation features) you need to
> be able to validate each and every record. This could be an enormously
> time-consuming operation if done on each data access, seems extremely
> heavy-handed, and the content of one record could be changing even as
> you validate the rest. Putting validation features inside Xindice or
> keeping them as part of the server or client application makes no
> difference here. You're still validating 17,000 records against a schema.
>
> > Any recommendations of where I can find more information about using
> > indexes in Xindice? I haven't found any examples or description in the
> > docs that I've read thus far.
>
> I've not found any extant documentation and have been just hacking at
> the code without any. This is probably the biggest hole in the docs.
>
> > > > I would be interested in this kind of project as well - My first
> > > > requirements would be an easy way to handle the scenarios listed
> > > > above. Any ideas you have on how to proceed would be appreciated.
> > >
> > > Let me know if the suggestions I've made make any sense in
> > > your scenario.
> >
> > I still can't figure out how you envision this kind of validation to
> > work, so I'll withhold judgement until I understand your recommended
> > approach a little better.
>
> I guess where I'm confused is that I don't see putting validation
> features inside of Xindice makes any difference in solving the essential
> problems you enumerate. *Where* the validation features live is immaterial,
> as it seems like the solution to this management problem is to design
> a system where they live *somewhere*. Since you've already got these
> features in Xerces (and therefore in both any server and client code
> for Xindice) I'd continue to suggest that they're not necessary in
> XIndice itself -- I don't see that putting them in Xindice would solve
> any problem. Since you've not been assured that all clients followed
> the rules, you'd still have that need to validate those 17,000 records.
>
> [I have a funny feeling that this conversation would be a lot shorter
> over coffee...]
>
> Murray
>
> ...........................................................................
> Murray Altheim, Staff Engineer          <mailto:murray.altheim@sun.com>
> Java and XML Software
> Sun Microsystems, 1601 Willow Rd., MS UMPK17-102, Menlo Park, CA 94025
>
>        Ernst Martin comments in 1949, "A certain degree of noise in
>        writing is required for confidence. Without such noise, the
>        writer would not know whether the type was actually printing
>        or not, so he would lose control."

Validation Issues

Posted by Murray Altheim <mu...@sun.com>.
[I think it's only right that we sub-thread this conversation.]

"Timothy M. Dean" wrote:
> 
> > -----Original Message-----
> > From: altheim@mehitabel.eng.sun.com
> > "Timothy M. Dean" wrote:
> > >
> > Perhaps I'm not understanding what you've explained, but it
> > seems that you're confusing client and server. Xindice is not
> > a client, it's a database server.
> 
> No, I fully understand that Xindice is a server - There's no confusion
> there.
> 
> > A Xindice system would include client software written by you,
> > and I would hope you'd have control over both the
> > installation of the server and how those clients are
> > configured.
> 
> True, but what I don't have is the guarantee that everyone who writes
> applications against this Xindice DB is going to follow the rules that I
> am expecting. Assume that I am working on an application for the ABC
> division of some company, and that my application needs to write/read
> data to a data store. Now assume that another developer in the XYZ
> division of the same company also is working on an application that
> needs access to *the same data store*. This is a scenario that many of
> the companies I work with have encountered.

So what you're saying is that anyone can write anything anytime? Couldn't
such a system be implemented where the client software used to access
the database (and the list of people who could) is restricted? If so,
and if you can restrict access to clients you write, this shouldn't be
such an issue. 

If OTOH anyone *can* write anything they want anytime, you must operate
in Defensive Mode. You can't assume that the data is valid (or even
uses the same XML Namespaces as your data). These sound so much like
what I've called "management issues" that it hardly seems like a 
technical solution is the reasonable solution. I often find designs 
that attempt to solve management problems with technical solutions run
awry of reality. Such systems are usually fragile. Look at MS Outlook.
 
> If we assume that my application is implemented to perform all of the
> necessary validation before storing documents, then I can be confident
> that my Xindice data store is not "corrupted" for my purposes. My
> project completes development/testing and is deployed to its user base.
> Now consider when the other developer in division XYZ completes their
> application and puts it online, again accessing the same enterprise data
> that my app is accessing. How do I know that this other application is
> following the same validation rules that I have depended on?

You rely on agreements amongst the players within your enterprise to
follow a reasonable set of constraints. If you can't get that agreement
no technical solution is going to really help.

> Unless I
> spend a lot of extra effort in *every application* to make sure that
> everything I read is valid (rather than only validating when I update),
> I could easily find that my application stops working because some other
> application has stored data which I consider to be corrupt.

A terrible scenario, indeed.
 
> You may call this a management issue, but the companies I've worked with
> on this kind of problem have wanted the ability to define some sort of
> contract that can be enforced at the database level. Some way to say
> that "DB will only accept data that conforms to certain rules, so you
> can be guaranteed that these rules have been met for any data retrieved
> from the DB". With relational databases they have this ability: They can
> define the schema of the DB to contain certain tables, each with columns
> of a certain type, etc. Without some way of enforcing this rule at an
> enterprise-wide level (rather than on an application-by-application
> level), I fear that most companies I work with would never consider
> using Xindice as a solution to their needs.

Well, I do think this boils down to a management issue, especially now
that you describe it more fully. Xindice is not a solution to your problem
any more than any other database solution is, as if there are *no* 
controls on data access/manipulation, you're sunk before the boat leaves
the harbor. Relational databases have the *ability* to do this, but if
the guy in division XYZ decides to ignore those rules, they're no 
different than Xindice. 

What might work as a technical solution would be to intercept write
events to Xindice and validate the incoming content against a schema.
I wouldn't consider this as "within" Xindice because you really want
to validate the content prior to its insertion, not after it's in 
the database and has potentially corrupted it.
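
A minimal sketch of that kind of write interception, for illustration only.
It assumes a JAXP release that includes the javax.xml.validation package
(newer than what shipped at the time of this thread). DocumentSink and
ValidatingSink are hypothetical names standing in for whatever component
actually performs the write; nothing here is Xindice API.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Hypothetical stand-in for the component that writes to the database. */
interface DocumentSink {
    void store(String key, String xml);
}

/** Validates each incoming document against a schema before passing it on. */
final class ValidatingSink implements DocumentSink {
    private final Schema schema;
    private final DocumentSink delegate;

    ValidatingSink(String xsd, DocumentSink delegate) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            this.schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema could not be compiled", e);
        }
        this.delegate = delegate;
    }

    @Override
    public void store(String key, String xml) {
        try {
            // With no ErrorHandler set, the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalArgumentException("rejected " + key + ": " + e.getMessage(), e);
        }
        // Only reached when the content is valid, so nothing invalid is written.
        delegate.store(key, xml);
    }
}
```

Every writer would have to go through such a wrapper for the guarantee to
hold, which is the access-control question again: the check only helps if
nobody can reach the underlying store directly.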

> > ID uniqueness is a fairly easy constraint to fulfill. You'd merely
> > create an Xindice index for IDs (using the XPathService) and
> > provide that within your application as a HashSet. Incoming
> > IDs would be checked against the HashSet and if you didn't
> > return a null, you'd throw a
> > duplicate ID exception. This would occur prior to the
> > document being corrupted (ie., becoming invalid) by the
> > insertion of new content, which I'd imagine is preferable to
> > corrupting it and *then* having to fix the problem. Another
> > simple means of doing this is to have Xindice return the
> > existing document as a DOM Document node using Xindice's
> > getContentAsDOM() method, and then perform the check against
> > that using Xerces' DOM method getElementById(). This would
> > allow you to bypass
> > creating your own ID hash table. Which direction to take
> > would depend upon the application's requirements, how big the
> > documents are, how
> > often they're changed, performance configurations, etc.
> 
> I'll look into this solution: I haven't worked with indexes in Xindice
> yet, so I can't form an educated opinion on this approach yet. However,
> it does seem like a bit of a pain when all I want to do is say "Make
> sure that documents stored in this collection are valid against schema
> X"... Going through and creating indexes to help me enforce this
> constraint seems to be a bit of a hack, but I'll look into it.

Except that what you're talking about is two documents: the existing 
document and the one after the change is applied. When do you want to
validate? And why validate the entire document *after* a potentially
corrupting change if all you need is to ascertain if there's an ID
conflict? Seems like a lot of extra work. The getElementById() method
is the easiest and doesn't require you to even create your own hash.
I don't think you'd need to resort to an index unless it buys you
anything -- here I don't see that it does.

As to validating an entire collection, I guess if you have no ability
to regulate the content going into the database (i.e., someone else 
might simply ignore the server-side validation features) you need to 
be able to validate each and every record. This could be an enormously
time-consuming operation if done on each data access, seems extremely
heavy-handed, and the content of one record could be changing even as
you validate the rest. Putting validation features inside Xindice or 
keeping them as part of the server or client application makes no 
difference here. You're still validating 17,000 records against a schema. 

> Any recommendations of where I can find more information about using
> indexes in Xindice? I haven't found any examples or description in the
> docs that I've read thus far.

I've not found any extant documentation and have been just hacking at
the code without any. This is probably the biggest hole in the docs.
 
> > > I would be interested in this kind of project as well - My first
> > > requirements would be an easy way to handle the scenarios listed
> > > above. Any ideas you have on how to proceed would be appreciated.
> >
> > Let me know if the suggestions I've made make any sense in
> > your scenario.
> 
> I still can't figure out how you envision this kind of validation to
> work, so I'll withhold judgement until I understand your recommended
> approach a little better.

I guess where I'm confused is that I don't see putting validation 
features inside of Xindice makes any difference in solving the essential
problems you enumerate. *Where* the validation features live is immaterial,
as it seems like the solution to this management problem is to design
a system where they live *somewhere*. Since you've already got these 
features in Xerces (and therefore in both any server and client code
for Xindice) I'd continue to suggest that they're not necessary in
Xindice itself -- I don't see that putting them in Xindice would solve 
any problem. Since you've not been assured that all clients followed 
the rules, you'd still have that need to validate those 17,000 records.

[I have a funny feeling that this conversation would be a lot shorter
over coffee...]

Murray

...........................................................................
Murray Altheim, Staff Engineer          <mailto:murray.altheim@sun.com>
Java and XML Software
Sun Microsystems, 1601 Willow Rd., MS UMPK17-102, Menlo Park, CA 94025

       Ernst Martin comments in 1949, "A certain degree of noise in 
       writing is required for confidence. Without such noise, the 
       writer would not know whether the type was actually printing 
       or not, so he would lose control."

RE: Future of Xindice

Posted by "Timothy M. Dean" <td...@visi.com>.

> -----Original Message-----
> From: altheim@mehitabel.eng.sun.com 
> "Timothy M. Dean" wrote:
> > 
> Perhaps I'm not understanding what you've explained, but it 
> seems that you're confusing client and server. Xindice is not 
> a client, it's a database server.

No, I fully understand that Xindice is a server - There's no confusion
there.

> A Xindice system would include client software written by you, 
> and I would hope you'd have control over both the 
> installation of the server and how those clients are 
> configured.

True, but what I don't have is the guarantee that everyone who writes
applications against this Xindice DB is going to follow the rules that I
am expecting. Assume that I am working on an application for the ABC
division of some company, and that my application needs to write/read
data to a data store. Now assume that another developer in the XYZ
division of the same company also is working on an application that
needs access to *the same data store*. This is a scenario that many of
the companies I work with have encountered.

If we assume that my application is implemented to perform all of the
necessary validation before storing documents, then I can be confident
that my Xindice data store is not "corrupted" for my purposes. My
project completes development/testing and is deployed to its user base.
Now consider when the other developer in division XYZ completes their
application and puts it online, again accessing the same enterprise data
that my app is accessing. How do I know that this other application is
following the same validation rules that I have depended on? Unless I
spend a lot of extra effort in *every application* to make sure that
everything I read is valid (rather than only validating when I update),
I could easily find that my application stops working because some other
application has stored data which I consider to be corrupt.

You may call this a management issue, but the companies I've worked with
on this kind of problem have wanted the ability to define some sort of
contract that can be enforced at the database level. Some way to say
that "DB will only accept data that conforms to certain rules, so you
can be guaranteed that these rules have been met for any data retrieved
from the DB". With relational databases they have this ability: They can
define the schema of the DB to contain certain tables, each with columns
of a certain type, etc. Without some way of enforcing this rule at an
enterprise-wide level (rather than on an application-by-application
level), I fear that most companies I work with would never consider
using Xindice as a solution to their needs.

> 
> ID uniqueness is a fairly easy constraint to fulfill. You'd merely 
> create an Xindice index for IDs (using the XPathService) and 
> provide that within your application as a HashSet. Incoming 
> IDs would be checked against the HashSet and if you didn't 
> return a null, you'd throw a 
> duplicate ID exception. This would occur prior to the 
> document being corrupted (ie., becoming invalid) by the 
> insertion of new content, which I'd imagine is preferable to 
> corrupting it and *then* having to fix the problem. Another 
> simple means of doing this is to have Xindice return the 
> existing document as a DOM Document node using Xindice's
> getContentAsDOM() method, and then perform the check against 
> that using Xerces' DOM method getElementById(). This would 
> allow you to bypass 
> creating your own ID hash table. Which direction to take 
> would depend upon the application's requirements, how big the 
> documents are, how 
> often they're changed, performance configurations, etc.
> 

I'll look into this solution: I haven't worked with indexes in Xindice
yet, so I can't form an educated opinion on this approach yet. However,
it does seem like a bit of a pain when all I want to do is say "Make
sure that documents stored in this collection are valid against schema
X"... Going through and creating indexes to help me enforce this
constraint seems to be a bit of a hack, but I'll look into it.

Any recommendations of where I can find more information about using
indexes in Xindice? I haven't found any examples or description in the
docs that I've read thus far.

> > I would be interested in this kind of project as well - My first 
> > requirements would be an easy way to handle the scenarios listed 
> > above. Any ideas you have on how to proceed would be appreciated.
> 
> Let me know if the suggestions I've made make any sense in 
> your scenario. 

I still can't figure out how you envision this kind of validation to
work, so I'll withhold judgement until I understand your recommended
approach a little better.

- Tim


Re: Future of Xindice

Posted by Murray Altheim <mu...@sun.com>.
"Timothy M. Dean" wrote:
> 
> Murray,
> 
> Thanks for the more detailed response. Below are a couple of follow-up
> questions:
> 
> > -----Original Message-----
> > From: Murray.Altheim@eng.sun.com
> >
> > My point (which
> > I'm guessing was not expressed very clearly) is that any
> > Xindice-based application *must* have an XML parser
> > available, and Xindice
> > is distributed with Xerces 2, which provides support for DTDs
> > and XML Schema. If you need stronger content validation,
> > Xerces provides that with its XML Schema support.
> 
> Yes, I've worked with standalone Xerces applications and have
> implemented apps that use strong content validation based on XML
> schemas. What I'm not understanding is how I can architect my systems so
> that I can be sure that all applications sharing a particular set of
> data via Xindice can consistently enforce the validation rules I need.
> 
> I could easily implement code that validates a document before storing
> it into my DB. I have concerns about enforcing a rule that says "all
> applications should ensure that they only store documents that are valid
> against the Schema X". Many clients I work with require this kind of
> enforcement, and if there's not an easy way to do it within Xindice I
> feel that Xindice would be ruled out as a valid solution for these
> clients.

Perhaps I'm not understanding what you've explained, but it seems
that you're confusing client and server. Xindice is not a client,
it's a database server. How that database is provided, who has 
access, and what specific access controls, client software, and
security are management issues, not technical ones. A Xindice system
would include client software written by you, and I would hope you'd
have control over both the installation of the server and how those
clients are configured. If this isn't the case I'm not clear what
Xindice's role would be technically, since such an unregulated 
system's problems wouldn't be solved by validation.

> > At most stages in the process an XML processor is *required*
> > to handle the XML content moving in and out of Xindice. All
> > one needs to do to provide stronger validation support during
> > these processes is to establish those parsers in validation
> > mode, and provide the schemas necessary to validate the
> > content.
> 
> How then would you suggest handling the following scenario (which I'm
> currently hacking around because I can't enforce validation within
> Xindice). I've got an XML document stored in Xindice. The structure of a
> particular element in my schema looks something like this:
> 
>     <element name="AddressList" type="ab:AddressListType">
>         <unique name="AddressUnique">
>             <selector xpath="*"/>
>             <field xpath="@id"/>
>         </unique>
>     </element>
> 
> Basically, this is used to represent a list of "Address" elements, where
> the "id" attribute of each address in the list is unique within the
> scope of the list.
> 
> Now consider this - My application wants to add a new Address element to
> the document. I want to use an XUpdate query to perform the insertion of
> a new element. My application creates the appropriate XUpdate query and
> submits it. I want to make sure that the new element is only stored if
> its "id" attribute is unique within the list.
> 
> How can I enforce this restriction of uniqueness? The new element is
> perfectly valid as far as I can tell by looking at the element on its
> own. The restriction only comes in when I try to place the new element
> into a previously stored document. Right now, I'm being forced to read
> in the entire list of Address elements into my application, add the new
> element to this list within my application to check for uniqueness, and
> then either rewrite the entire list or perform the insert of only
> the new element once I've performed my validation manually. It would be
> *very* nice for me if I could simply attempt an insert of the new
> element directly using Xindice, and expect validation that I've enabled
> for the collection (or document) to handle this scenario.
> 
> Is there another way I can approach this that would make my life easier?

ID uniqueness is a fairly easy constraint to fulfill. You'd merely 
create an Xindice index for IDs (using the XPathService) and provide
that within your application as a HashSet. Incoming IDs would be checked
against the HashSet and if you didn't return a null, you'd throw a 
duplicate ID exception. This would occur prior to the document being
corrupted (i.e., becoming invalid) by the insertion of new content,
which I'd imagine is preferable to corrupting it and *then* having to
fix the problem. Another simple means of doing this is to have Xindice
return the existing document as a DOM Document node using Xindice's
getContentAsDOM() method, and then perform the check against that using
Xerces' DOM method getElementById(). This would allow you to bypass 
creating your own ID hash table. Which direction to take would depend
upon the application's requirements, how big the documents are, how 
often they're changed, performance configurations, etc.
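
A rough sketch of the HashSet check, under some assumptions: the ids are
read from an already-retrieved copy of the list through plain JAXP DOM
calls rather than from a Xindice index (no index API is shown in this
thread), and AddressIdCheck is a made-up class name. Note that
HashSet.add() reports a duplicate by returning false; the "non-null means
duplicate" test would apply to a HashMap.put() instead.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class AddressIdCheck {
    /** Collects the "id" attribute of every Address element in the stored list. */
    static Set<String> existingIds(String addressListXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(addressListXml)));
            NodeList addresses = doc.getElementsByTagName("Address");
            Set<String> ids = new HashSet<>();
            for (int i = 0; i < addresses.getLength(); i++) {
                ids.add(((Element) addresses.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("stored document could not be parsed", e);
        }
    }

    /** Throws if the incoming id is already present; otherwise records it. */
    static void checkUnique(Set<String> ids, String incomingId) {
        // Set.add returns false when the value was already in the set.
        if (!ids.add(incomingId)) {
            throw new IllegalStateException("duplicate Address id: " + incomingId);
        }
    }
}
```

The application would call checkUnique() before submitting the XUpdate
query, so a duplicate is refused before the stored document is touched.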

> > You don't need to include validation features in Xindice
> > itself because the packages required to support Xindice
> > already provide those features, and any application built
> > upon Xindice *by necessity* must parse and process XML
> > content. All XML content going into Xindice must at minimum
> > be well-formed XML -- that's structural validation at its
> > most basic. If further structural or content validation is
> > needed, set the parser
> > factories to produce validating parsers, and then provide the
> > schemas.
> > To put these features into Xindice itself would be redundant and
> > unnecessary. Xerces is already doing it.
> 
> All I am asking for is a way to tell Xindice that it should enable the
> validation (provided by Xerces or whatever other implementation it
> chooses) at a level that is extremely inconvenient to get at in some
> applications. Because Xerces is included with Xindice, it seems
> reasonable to ask for Xindice to make use of Xerces in a way that would be
> a great help to some of us...

Well, as I said, you need to be using Xerces (or a suitable 
replacement) in order to process the content going into Xindice,
so I don't see that it's an extra burden to you as a developer 
to validate the content you're manipulating, and to do it at
any time using any schema. If you're passing around a DOM Document
or even an element, you can pass it to a DOMParser to check it 
out.
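
For the in-memory case, a sketch along these lines may help. It assumes a
JAXP release with javax.xml.validation, which can validate a DOM tree
directly through a DOMSource. The schema is an illustrative, self-contained
stand-in for the AddressList declaration quoted above (the real
ab:AddressListType is not shown in this thread), and DomCheck is a
hypothetical helper name.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DomCheck {
    /** Illustrative schema: Address ids must be unique within AddressList. */
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='AddressList'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='Address' minOccurs='0' maxOccurs='unbounded'>"
      + "<xs:complexType>"
      + "<xs:attribute name='id' type='xs:string' use='required'/>"
      + "</xs:complexType>"
      + "</xs:element>"
      + "</xs:sequence></xs:complexType>"
      + "<xs:unique name='AddressUnique'>"
      + "<xs:selector xpath='*'/><xs:field xpath='@id'/>"
      + "</xs:unique>"
      + "</xs:element>"
      + "</xs:schema>";

    static Schema compile(String xsd) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema could not be compiled", e);
        }
    }

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // schema validation needs namespace-aware nodes
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("document could not be parsed", e);
        }
    }

    /** Returns a copy of the stored document with one more Address appended. */
    static Document withAddress(Document stored, String id) {
        Document copy = (Document) stored.cloneNode(true);
        // The NS-aware creation methods keep local names set for the validator.
        Element address = copy.createElementNS(null, "Address");
        address.setAttributeNS(null, "id", id);
        copy.getDocumentElement().appendChild(address);
        return copy;
    }

    /** True when the DOM tree validates against the schema. */
    static boolean isValid(Schema schema, Document doc) {
        try {
            schema.newValidator().validate(new DOMSource(doc));
            return true;
        } catch (SAXException e) {
            return false; // first validation error, e.g. a duplicate unique value
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Validating the candidate copy first, and writing only when isValid()
returns true, keeps the stored document untouched when the new id clashes.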

> > Now, one thing I can think might be quite valuable would be
> > to create some utility classes/methods that could be used
> > generally within Xindice to provide either DOM Document- or
> > Node-level validation at any stage within the process of
> > managing content. I'd be happy to even contribute to such an
> > effort. If this is of general interest, writing up a set of
> > requirements would be a good start.
> 
> I would be interested in this kind of project as well - My first
> requirements would be an easy way to handle the scenarios listed above.
> Any ideas you have on how to proceed would be appreciated.

Let me know if the suggestions I've made make any sense in your
scenario. 

Murray

...........................................................................
Murray Altheim, Staff Engineer          <mailto:murray.altheim@sun.com>
Java and XML Software
Sun Microsystems, 1601 Willow Rd., MS UMPK17-102, Menlo Park, CA 94025

       Ernst Martin comments in 1949, "A certain degree of noise in 
       writing is required for confidence. Without such noise, the 
       writer would not know whether the type was actually printing 
       or not, so he would lose control."

RE: Future of Xindice

Posted by "Timothy M. Dean" <td...@visi.com>.
Murray,

Thanks for the more detailed response. Below are a couple of follow-up
questions:

> -----Original Message-----
> From: Murray.Altheim@eng.sun.com 
>
> My point (which 
> I'm guessing was not expressed very clearly) is that any 
> Xindice-based application *must* have an XML parser 
> available, and Xindice 
> is distributed with Xerces 2, which provides support for DTDs 
> and XML Schema. If you need stronger content validation, 
> Xerces provides that with its XML Schema support.

Yes, I've worked with standalone Xerces applications and have
implemented apps that use strong content validation based on XML
schemas. What I'm not understanding is how I can architect my systems so
that I can be sure that all applications sharing a particular set of
data via Xindice can consistently enforce the validation rules I need.

I could easily implement code that validates a document before storing
it into my DB. I have concerns about enforcing a rule that says "all
applications should ensure that they only store documents that are valid
against the Schema X". Many clients I work with require this kind of
enforcement, and if there's not an easy way to do it within Xindice I
feel that Xindice would be ruled out as a valid solution for these
clients.

> 
> At most stages in the process an XML processor is *required* 
> to handle the XML content moving in and out of Xindice. All 
> one needs to do to provide stronger validation support during 
> these processes is to establish those parsers in validation 
> mode, and provide the schemas necessary to validate the 
> content. 

How then would you suggest handling the following scenario (which I'm
currently hacking around because I can't enforce validation within
Xindice). I've got an XML document stored in Xindice. The structure of a
particular element in my schema looks something like this:

    <element name="AddressList" type="ab:AddressListType">
        <unique name="AddressUnique">
            <selector xpath="*"/>
            <field xpath="@id"/>
        </unique>
    </element>

Basically, this is used to represent a list of "Address" elements, where
the "id" attribute of each address in the list is unique within the
scope of the list.

Now consider this - My application wants to add a new Address element to
the document. I want to use an XUpdate query to perform the insertion of
a new element. My application creates the appropriate XUpdate query and
submits it. I want to make sure that the new element is only stored if
its "id" attribute is unique within the list.

How can I enforce this restriction of uniqueness? The new element is
perfectly valid as far as I can tell by looking at the element on its
own. The restriction only comes in when I try to place the new element
into a previously stored document. Right now, I'm being forced to read
in the entire list of Address elements into my application, add the new
element to this list within my application to check for uniqueness, and
then either rewrite the entire list or perform the insert of only
the new element once I've performed my validation manually. It would be
*very* nice for me if I could simply attempt an insert of the new
element directly using Xindice, and expect validation that I've enabled
for the collection (or document) to handle this scenario.

Is there another way I can approach this that would make my life easier?


> 
> You don't need to include validation features in Xindice 
> itself because the packages required to support Xindice 
> already provide those features, and any application built 
> upon Xindice *by necessity* must parse and process XML 
> content. All XML content going into Xindice must at minimum 
> be well-formed XML -- that's structural validation at its 
> most basic. If further structural or content validation is 
> needed, set the parser 
> factories to produce validating parsers, and then provide the 
> schemas. 
> To put these features into Xindice itself would be redundant and 
> unnecessary. Xerces is already doing it.

All I am asking for is a way to tell Xindice that it should enable the
validation (provided by Xerces or whatever other implementation it
chooses) at a level that is extremely inconvenient to get at in some
applications. Because Xerces is included with Xindice, it seems
reasonable to ask for Xindice to make use of Xerces in a way that would be
a great help to some of us...

> 
> Now, one thing I can think might be quite valuable would be 
> to create some utility classes/methods that could be used 
> generally within Xindice to provide either DOM Document- or 
> Node-level validation at any stage within the process of 
> managing content. I'd be happy to even contribute to such an 
> effort. If this is of general interest, writing up a set of 
> requirements would be a good start.
> 

I would be interested in this kind of project as well - My first
requirements would be an easy way to handle the scenarios listed above.
Any ideas you have on how to proceed would be appreciated.

Thanks,

- Tim


RE: Future of Xindice

Posted by Mike Mortensen <mm...@appsware.com>.
-----Original Message-----
From: Murray.Altheim@eng.sun.com [mailto:Murray.Altheim@eng.sun.com]On Behalf Of Murray Altheim
Sent: Wednesday, January 16, 2002 3:37 AM
To: xindice-dev@xml.apache.org
Subject: Re: Future of Xindice

In response to Mike Mortensen and Timothy M. Dean, I'll try to reiterate
that I'm not against validation features being available in applications
that use Xindice as a data store. I'm against those features being
embedded in Xindice itself. I guess where there seems to be some confusion
is what "embedded in Xindice" means exactly.

First, let's differentiate structural vs. content validation.

   Structural Validation:  validates that the XML markup of a document is
     (a) well-formed, and optionally (if a schema is available and the 
      parser is set in validation mode), that 
     (b) the markup structure is valid according to the schema constraints.

   Content Validation: validates that the element and attribute content
     conforms to the constraints expressed in a schema (or perhaps written
     into the application itself).

I certainly understand and agree with your points, and myself have experience
similar to yours. Content validation is appropriate and often critical to
database applications. Structural validation is a rather unique need in
XML databases, since a relational or object database can't be corrupted by
incoming data (unless there's something strange in the database design).
My point (which I'm guessing was not expressed very clearly) is that any
Xindice-based application *must* have an XML parser available, and Xindice
is distributed with Xerces 2, which provides support for DTDs and XML
Schema. If you need stronger content validation, Xerces provides that with
its XML Schema support.

At most stages in the process an XML processor is *required* to handle the
XML content moving in and out of Xindice. All one needs to do to provide
stronger validation support during these processes is to establish those
parsers in validation mode, and provide the schemas necessary to validate
the content. As I mentioned in my previous message, you can even "pipeline
validate", which is to validate the content travelling between components
by validating the SAX events themselves. O'Reilly will soon be publishing
a SAX book by David Brownell (initial author of Sun's XML parser) that
describes this type of functionality.

You don't need to include validation features in Xindice itself because
the packages required to support Xindice already provide those features,
and any application built upon Xindice *by necessity* must parse and
process XML content. All XML content going into Xindice must at minimum
be well-formed XML -- that's structural validation at its most basic. If
further structural or content validation is needed, set the parser
factories to produce validating parsers, and then provide the schemas.
To put these features into Xindice itself would be redundant and
unnecessary. Xerces is already doing it.


I now more clearly understand where you're going.  However, I still have problems with the central assertion.

Before addressing the central issue, I wanted to clear up another point.  It is possible to break the referential integrity in a relational database (even with only in-bound data).  Take the example of two tables, Department and Employee:

Department
=========
ID
Name

Employee
=========
ID
DepartmentID
FirstName
LastName


Let's sprinkle a little data into our structure for better illustration.

Department
3514 Finance
3515 Research

Employee
19845   3515   Lorraine    Jacobson
19846   3514   Stephen     Reed
19847   3515   Alexander   Morris
19848   3515   Phillip     Gutierrez


Without the appropriate constraints (a foreign key, and NOT NULL on DepartmentID), it would be possible to insert new records into the Employee table with a missing or invalid reference to the Department table (as shown here with the new employee, Anton Azzameen):

Employee
19845   3515   Lorraine    Jacobson
19846   3514   Stephen     Reed
19847   3515   Alexander   Morris
19848   3515   Phillip     Gutierrez
19849   NULL   Anton       Azzameen

Now let's see this same example in XML

<organization>
	<employee>
		<department>Research</department>
		<firstName>Lorraine</firstName>
		<lastName>Jacobson</lastName>
	</employee>
	<employee>
		<department> Finance </department>
		<firstName> Stephen </firstName>
		<lastName> Reed </lastName>
	</employee>
	<employee>
		<department>Research</department>
		<firstName> Alexander </firstName>
		<lastName> Morris </lastName>
	</employee>
	<employee>
		<department>Research</department>
		<firstName> Phillip </firstName>
		<lastName> Gutierrez </lastName>
	</employee>
	<employee>
		<firstName> Anton </firstName>
		<lastName>Azzameen </lastName>
	</employee>
</organization>

The XML version of the example is well-formed but would be invalid (where the department is a required element of employee).  The relational version is likewise permissible (except where there exists a foreign key constraint).
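
A minimal sketch of the point above, assuming a hypothetical DTD in which department is a required child of employee. The class name, the DTD, and the helper method are illustrative only and are not part of Xindice. It uses the standard JAXP DocumentBuilderFactory in validating mode and collects validity errors rather than stopping at the first one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class EmployeeDtdCheck {
    // Hypothetical DTD (an assumption): every employee must carry a department.
    static final String DOCTYPE =
        "<!DOCTYPE organization [\n"
      + "  <!ELEMENT organization (employee*)>\n"
      + "  <!ELEMENT employee (department, firstName, lastName)>\n"
      + "  <!ELEMENT department (#PCDATA)>\n"
      + "  <!ELEMENT firstName (#PCDATA)>\n"
      + "  <!ELEMENT lastName (#PCDATA)>\n"
      + "]>\n";

    // Parses DOCTYPE + body with a validating parser and returns the validity errors.
    static List<String> validationErrors(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // DTD validation, not just well-formedness
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { errors.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // Well-formed, but the employee has no department.
        String anton = "<organization><employee>"
            + "<firstName>Anton</firstName><lastName>Azzameen</lastName>"
            + "</employee></organization>";
        System.out.println(validationErrors(anton));
    }
}
```

A non-validating parser would accept the same document without complaint, which is the XML analogue of the Employee table without its foreign key constraint.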

Consequently, it should be more easily seen that structural validation is *not* unique to the XML world and is directly applicable elsewhere (i.e. the relational model).  However, all of this is tangential to the central issue of where validation should take place.


If I have correctly understood your argument, you are saying that because the XML content must pass through an XML parser (which can be set to validate as well as parse), there is no need to embed this functionality in Xindice.  Basically, since XML parsing (and optionally validating) is required and is presently available outside of Xindice, why go to the effort to embed it within Xindice?

The answer is the same as before.  If Xindice is to become a "dumb" datastore (meaning that it relies on outside applications to validate the data it stores), then what is there to prevent an application that shares Xindice with other applications from choosing a different parser (one whose validation abilities differ from those of the Xerces used by the other applications), or even from choosing not to validate at all?

Again, it really boils down to whether or not all applications using Xindice can rely on the datastore to contain only valid data.

If there is an advantage to keeping the validating parser outside of Xindice, then fine.  However, once the parser and the validation mechanism are chosen for a collection, the use of that parser and mechanism must be bound to the collection so that Xindice does not break its contract with the applications that use it as the data store.  This tight coupling of parser and mechanism (DTD, W3C Schema, Relax, etc.) certainly seems to imply that it should be embedded within Xindice.
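
To make "bound to the collection" concrete, here is a rough sketch under stated assumptions. SchemaBoundCollection is a hypothetical in-memory stand-in, not a Xindice class, and it uses the javax.xml.validation API that was added to Java after this discussion. The idea is only that the schema is fixed when the collection is created, and every insert is validated against it no matter which application performs the insert.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical stand-in for a collection; NOT part of the Xindice API.
public class SchemaBoundCollection {
    // Illustrative schema (an assumption): an employee needs a department.
    public static final String EMPLOYEE_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='employee'><xs:complexType><xs:sequence>"
      + "<xs:element name='department' type='xs:string'/>"
      + "<xs:element name='lastName' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    private final Schema schema; // chosen once, when the collection is created
    private final Map<String, String> documents = new LinkedHashMap<>();

    public SchemaBoundCollection(String xsd) throws SAXException {
        this.schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(xsd)));
    }

    // Validates first; with no error handler set, an invalid document raises
    // a SAXException and is never stored.
    public void insert(String id, String xml) throws SAXException, IOException {
        schema.newValidator().validate(new StreamSource(new StringReader(xml)));
        documents.put(id, xml);
    }

    public int size() {
        return documents.size();
    }
}
```

With this arrangement a careless client cannot skip validation, because the check happens on the storage side of the interface rather than in each application.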

MGM
 

Re: Future of Xindice

Posted by Murray Altheim <mu...@sun.com>.
In response to Mike Mortensen and Timothy M. Dean, I'll try to reiterate
that I'm not against validation features being available in applications
that use Xindice as a data store. I'm against those features being 
embedded in Xindice itself. I guess where there seems to be some confusion
is what "embedded in Xindice" means exactly. 

First, let's differentiate structural vs. content validation.

   Structural Validation:  validates that the XML markup of a document is
     (a) well-formed, and optionally (if a schema is available and the  
      parser is set in validation mode), that  
     (b) the markup structure is valid according to the schema constraints.

   Content Validation: validates that the element and attribute content
     conforms to the constraints expressed in a schema (or perhaps written
     into the application itself).

I certainly understand and agree with your points, and my own experience is
similar to yours. Content validation is appropriate and often critical to 
database applications. Structural validation is a rather unique need in
XML databases, since a relational or object database can't be corrupted by
incoming data (unless there's something strange in the database design).
My point (which I'm guessing was not expressed very clearly) is that any 
Xindice-based application *must* have an XML parser available, and Xindice 
is distributed with Xerces 2, which provides support for DTDs and XML
Schema. If you need stronger content validation, Xerces provides that with
its XML Schema support.

At most stages in the process an XML processor is *required* to handle the
XML content moving in and out of Xindice. All one needs to do to provide
stronger validation support during these processes is to establish those
parsers in validation mode, and provide the schemas necessary to validate
the content. As I mentioned in my previous message, you can even "pipeline
validate", which is to validate the content travelling between components
by validating the SAX events themselves. O'Reilly will soon be publishing 
a SAX book by David Brownell (initial author of Sun's XML parser) that 
describes this type of functionality.
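
As a rough illustration of validating SAX events in a pipeline, the sketch below places a validating handler between the parser and a downstream handler. It assumes the javax.xml.validation API (ValidatorHandler), which was added to Java after this message was written, and a small hypothetical schema; it is not the approach from the book mentioned above, and the class and method names are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxPipelineValidation {
    // Hypothetical schema (an assumption): an employee needs a department.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='employee'><xs:complexType><xs:sequence>"
      + "<xs:element name='department' type='xs:string'/>"
      + "<xs:element name='lastName' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Streams the document through parser -> validator -> downstream handler
    // and returns the validity errors seen along the way.
    static List<String> validate(String xml) throws Exception {
        List<String> errors = new ArrayList<>();
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));

        ValidatorHandler validator = schema.newValidatorHandler();
        validator.setErrorHandler(new DefaultHandler() {
            @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        // Downstream stage; a real application would put its own handler here.
        validator.setContentHandler(new DefaultHandler());

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // schema validation needs namespace-aware events
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(validator);
        reader.parse(new InputSource(new StringReader(xml)));
        return errors;
    }
}
```

Because the validator is just another SAX handler, it can be inserted at any point where events flow between components, without building a DOM first.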

You don't need to include validation features in Xindice itself because
the packages required to support Xindice already provide those features,
and any application built upon Xindice *by necessity* must parse and
process XML content. All XML content going into Xindice must at minimum
be well-formed XML -- that's structural validation at its most basic. If
further structural or content validation is needed, set the parser 
factories to produce validating parsers, and then provide the schemas. 
To put these features into Xindice itself would be redundant and 
unnecessary. Xerces is already doing it.

Now, one thing I can think might be quite valuable would be to create
some utility classes/methods that could be used generally within Xindice
to provide either DOM Document- or Node-level validation at any stage
within the process of managing content. I'd be happy to even contribute
to such an effort. If this is of general interest, writing up a set
of requirements would be a good start.

Murray

...........................................................................
Murray Altheim                         <mailto:murray.altheim&#x40;sun.com>
XML Technology Center, Java and XML Software
Sun Microsystems, Inc., MS MPK17-102, 1601 Willow Rd., Menlo Park, CA 94025

            Corporations do not have human rights, despite the 
          altogether too-human opinions of the US Supreme Court.

RE: Future of Xindice

Posted by "Timothy M. Dean" <td...@visi.com>.


-----Original Message-----
From: altheim@mehitabel.eng.sun.com
[mailto:altheim@mehitabel.eng.sun.com]On Behalf Of Murray Altheim
> > Schema support
> > -----------------------
> > We need to support schemas in an abstracted fashion.  If we can
> > architect a content model API that would allow the system to validate
> > and operate against a content model without needing to know that the
> > content model is based on XML Schemas or Relax NG, that would be
> > ideal.
>
> Why in Xindice? There are several places where validation can occur: 1.
> upon storing in the database; 2. following an XUpdate; and 3. upon retrieving
> content from the database. In all three cases, the DOM nodes to be validated
> are already available to the developer outside of Xindice and can be validated
> using existing validation tools and techniques.

<snip>
> If anyone is unclear as to how to provide Java-based XML validation within their
> application, I'd be happy to suggest several books and available software packages.

I am curious how you would suggest handling the situation mentioned by
others and myself in the past: Consider an environment where many
independently-developed applications need access to the same data
stores. One cannot be guaranteed that every application has been
developed with the same degree of attention towards this
"application-specific" validation that you seem to be suggesting. How
can I, the developer of application X, be certain that application Y
hasn't put corrupt data into the data store I need to access? Do I have
to assume that all applications will behave as well as I'd like them to?
Do I have to explicitly check data for inconsistencies every time I read
something from the DB, just in case some other application hasn't
followed the rules I expect them to? Providing validation within Xindice
seems to be the best way to ensure enterprise-wide data consistency.

I agree with many of your sentiments and have no desire to bog down
Xindice with unnecessary layers. However, without this kind of
validation I fear that I will not be able to use Xindice for many of the
solutions I would like. Is there another way I should look at the issue
that could dismiss this fear I have?

- Tim



RE: Future of Xindice

Posted by Mike Mortensen <mm...@appsware.com>.
-----Original Message-----
From: altheim@mehitabel.eng.sun.com [mailto:altheim@mehitabel.eng.sun.com]On Behalf Of Murray Altheim
Sent: Tuesday, January 15, 2002 4:36 PM
To: xindice-users@xml.apache.org
Cc: xindice-dev@xml.apache.org
Subject: Re: Future of Xindice

Tom Bradford wrote:
[...]
> There are a few of things that need to be addressed in future revisions
> of Xindice.  I'll run through them very quickly, and then I'd like to
> hear people's feedback.
>
[...]
> Schema support
> -----------------------
> We need to support schemas in an abstracted fashion.  If we can
> architect a content model API that would allow the system to validate
> and operate against a content model without needing to know that the
> content model is based on XML Schemas or Relax NG, that would be ideal.

Why in Xindice? There are several places where validation can occur:
1. upon storing in the database; 2. following an XUpdate; and 3. upon
retrieving content from the database. In all three cases, the DOM nodes
to be validated are already available to the developer outside of
Xindice and can be validated using existing validation tools and techniques.

Not that I'm going to fight the issue, but I'm rather against including
support for schema validation within the Xindice, as this is an application-
level issue (as I've described in previous messages). There are many
different types of schema validation, and different validation needs, eg.,
different levels of strictness or different content validation at various
places within a processing regimen. Validation is a complicated issue that
doesn't have a one-size-fits-all type of solution.

There are a plethora of validation options out there and I don't see that
one API could serve the variety of schema languages, structure and content
validation needs that would be within a reasonable scope of effort. You'd
be tackling the same issues that the W3C Schema WG tackled, with the
"data heads" and "document heads" needs on the table.

Perhaps I'm just being daft, but I've never followed the reasoning on
why anyone would *need* to include further validation functionality
*within* XIndice. It only seems to add redundant complexity to the
package. Those who know my history in the markup field know I'm a
big advocate of validation, but this is one place I wouldn't support
its inclusion, code-wise. If anyone is unclear as to how to provide
Java-based XML validation within their application, I'd be happy to
suggest several books and available software packages.

In passing I should mention that Sun has released binaries and source
code for a Multi-Schema XML Validator, which we demoed at XML One. This
tool can work on the command line, can be integrated into applications,
and can even act as a SAX pipe validator.

   http://www.sun.com/software/xml/developers/multischema/

It supports RELAX NG, RELAX Namespace, RELAX Core, TREX, XML DTDs, and a
subset of XML Schema Part 1. Of course, DTD and XML Schema validation is
built into Xerces 2, which is already part of the Xindice distribution.


On this point I disagree.  I think it is not only appropriate to validate in Xindice, but critical.  The reasons for doing so are the same reasons that triggers, foreign-key constraints, check constraints, and the like exist.  If validation occurs close to the data store (i.e., within the application storing the data), then all applications using the data store can rely upon the validity of the data once stored.
Experience is too great a teacher for us not to conclude that there will be multiple applications accessing the same collection.  If we are the author of one application, and have taken appropriate care to validate the documents as we work with them, we can have confidence in our own application, but we can never be sure that the other applications sharing our data will do likewise.
I have become weary of dealing with too many 2-bit Visual Basic applications sharing data in a Microsoft Access database.  Anyone who has had to deal with integrating and upscaling applications of this nature will understand.  I much prefer working with "real" databases which can enforce referential integrity (in the data store).  To fail to include validation in Xindice is to condemn it to irrelevance.  Developers of the 2-bit applications will love it because it will allow them to put their data in another "dumb" data store.  The relevant difference this time is that Xindice would store XML as opposed to Access storing relational data.  Developers of enterprise applications would shun it because it can't offer the guarantees critical to substantial applications.
 

Re: Future of Xindice

Posted by Murray Altheim <mu...@sun.com>.
Tom Bradford wrote:
[...]
> There are a few of things that need to be addressed in future revisions
> of Xindice.  I'll run through them very quickly, and then I'd like to
> hear people's feedback.
> 
[...] 
> Schema support
> -----------------------
> We need to support schemas in an abstracted fashion.  If we can
> architect a content model API that would allow the system to validate
> and operate against a content model without needing to know that the
> content model is based on XML Schemas or Relax NG, that would be ideal.

Why in Xindice? There are several places where validation can occur:
1. upon storing in the database; 2. following an XUpdate; and 3. upon
retrieving content from the database. In all three cases, the DOM nodes
to be validated are already available to the developer outside of 
Xindice and can be validated using existing validation tools and techniques.

Not that I'm going to fight the issue, but I'm rather against including
support for schema validation within Xindice, as this is an application-
level issue (as I've described in previous messages). There are many 
different types of schema validation, and different validation needs, e.g.,
different levels of strictness or different content validation at various 
places within a processing regimen. Validation is a complicated issue that
doesn't have a one-size-fits-all type of solution.

There are a plethora of validation options out there and I don't see that
one API could serve the variety of schema languages, structure and content
validation needs that would be within a reasonable scope of effort. You'd 
be tackling the same issues that the W3C Schema WG tackled, with the 
"data heads" and "document heads" needs on the table. 

Perhaps I'm just being daft, but I've never followed the reasoning on
why anyone would *need* to include further validation functionality
*within* Xindice. It only seems to add redundant complexity to the 
package. Those who know my history in the markup field know I'm a
big advocate of validation, but this is one place I wouldn't support
its inclusion, code-wise. If anyone is unclear as to how to provide
Java-based XML validation within their application, I'd be happy to
suggest several books and available software packages.

In passing I should mention that Sun has released binaries and source 
code for a Multi-Schema XML Validator, which we demoed at XML One. This
tool can work on the command line, can be integrated into applications, 
and can even act as a SAX pipe validator.

   http://www.sun.com/software/xml/developers/multischema/
 
It supports RELAX NG, RELAX Namespace, RELAX Core, TREX, XML DTDs, and a
subset of XML Schema Part 1. Of course, DTD and XML Schema validation is 
built into Xerces 2, which is already part of the Xindice distribution. 

> Context-sensitive indexing
> ------------------------------------
> XML Schemas introduces the idea of contextually-dependant typing.  What
> this means is that for any particular schema, that schema may use the
> same element name in more than one scope, and assign to that element
> name a completely different primitive type for each scope.  So in one
> scope, it may be an int, while in another it may be a string, or even a
> complex structure.
> 
> Xindice's indexing system was originally design when DTDs were the only
> standard way of representing an XML schema, and in DTDs, an element name
> is globally unique.  So we need to rearchitect the indexing system to
> support the ability for attaching a particular index to a schema
> context.  I have some vague ideas of how to do this, but I'd like to get
> a user's perspective on how you'd like to see this made available.

I don't see how this could be done reasonably without hooking deeply
into the XML Schema support code that's in Xerces in order to be
certain that the same context was arrived at by both an application
and Xindice. That is, I believe this would be necessary unless one
believes that Xerces' XML Schema support will provide the same context
under all circumstances as other XML Schema tools. I'm skeptical about 
that. At least we'd be bug-for-bug compatible with other Xerces-based
applications.
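
For readers trying to picture what a context-attached index might look like, here is a deliberately simplified sketch. It approximates "schema context" with the element's ancestor path, which is an assumption made for illustration and not how Xindice indexes or how XML Schema defines scope; the class and method names are hypothetical. The point is only that the same element name (id) is indexed separately per context.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ContextIndexSketch {
    // Maps a context path such as "/shop/order/id" to the text values found there.
    static Map<String, List<String>> index(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        Map<String, List<String>> index = new TreeMap<>();
        walk(doc.getDocumentElement(), "", index);
        return index;
    }

    // Depth-first walk; only leaf elements (no child elements) are indexed.
    private static void walk(Element e, String parentPath, Map<String, List<String>> index) {
        String path = parentPath + "/" + e.getTagName();
        boolean hasChildElement = false;
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                hasChildElement = true;
                walk((Element) n, path, index);
            }
        }
        if (!hasChildElement) {
            index.computeIfAbsent(path, k -> new ArrayList<>()).add(e.getTextContent().trim());
        }
    }

    public static void main(String[] args) throws Exception {
        // "id" is numeric under order but a string code under customer.
        String xml = "<shop><order><id>42</id></order>"
                   + "<customer><id>C-7</id></customer></shop>";
        System.out.println(index(xml));
    }
}
```

A name-only index would mix the two kinds of id together; keying on context keeps them apart, at the cost of needing agreement on what the context is, which is the concern raised above.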

> Large Documents and Document Versioning
> ------------------------------------------------------------
> Xindice needs to be capable of supporting massive documents in a
> scalable fashion and with acceptable performance.  Currently, the
> document representation architecture is based on a tokenized, lazy DOM
> where the bytestream images that feed the DOM are stored and retrieved
> in a paged filing system.  Every document is treated as an atomic unit.
> This has some serious limitations when it comes to massive documents.
> 
> In order to support very large documents, the tokenization system needs
> to be replaced and geared more toward the simplified representation of
> document structure rather than an equal balance of structure and
> content.  Also, the Filer interfaces need to support the notion of
> streaming, and even more importantly, the ability to support random
> access streaming.
> 
> Also, the tokenization system needs to support versioning in one way or
> another.  For small documents, complete document revision links or
> permissible, but for massive documents, there's no way that versioning
> of that nature is acceptible.  So, the tokenization system needs to
> understand the notion of versioned linking.
> 
> The DTSM stuff that I started working on will help with the massive
> document problem, but we'd need to introduce the versioning concept into
> the specification as well.

I'm likely to be tackling something akin to this in the next few months,
trying to hook up javacvs (the netbeans.org version, not the sourceforge
one which is under GPL) to Xindice. I don't have much of a need for large
document support, but the approach I'd take would be perhaps useful in
that regard. Basically, content would be checked into javacvs prior to being
stored in Xindice, hence most revision control issues are handled outside
of the database. I would not be attempting node-based revision control 
support (i.e., as Tom said above, support within the tokenization system),
which would be very valuable but outside the scope of effort I'm willing 
to take on. If someone is willing to do the node-based RCS within Xindice,
I'm quite happy to step aside.

Murray

...........................................................................
Murray Altheim, Staff Engineer          <mailto:murray.altheim&#64;sun.com>
Java and XML Software
Sun Microsystems, 1601 Willow Rd., MS UMPK17-102, Menlo Park, CA 94025

       Ernst Martin comments in 1949, "A certain degree of noise in 
       writing is required for confidence. Without such noise, the 
       writer would not know whether the type was actually printing 
       or not, so he would lose control."

Re: Future of Xindice

Posted by Jeff Greif <jg...@alumni.princeton.edu>.
Tom,
Hope your services are not lost to the Xindice project forever.

Regarding support for schemas, it would be helpful to enumerate what aspects
of schemas need to be supported.  I've supplied a partial list, but perhaps
others could add more?

 - validation of update on existing document (validation of input docs
should probably occur outside Xindice)
 - supplying default values of attributes
 - indexing based on schema (including indexing on combinations of elements
and attributes) and including the context-sensitive indexing mentioned
 - joins when they are implemented
 - detection of queries which will always fail on valid docs for the schema
(is this a frill?  what if the collection is not homogeneous?)
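
As a small illustration of the "default values of attributes" item, the sketch below shows a standard JAXP parser filling in an attribute default declared in a DTD internal subset. The DTD, the attribute name, and the class are assumptions made for the example; it shows the parser-side behaviour only and says nothing about how Xindice itself would supply defaults.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class AttributeDefaultSketch {
    // Hypothetical declaration: employee/@status defaults to "active".
    static final String DOCTYPE =
        "<!DOCTYPE employee [<!ATTLIST employee status CDATA \"active\">]>";

    // Parses DOCTYPE + body and returns the status attribute as seen in the DOM.
    static String statusOf(String body) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(DOCTYPE + body)));
        return doc.getDocumentElement().getAttribute("status");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(statusOf("<employee/>"));                  // default applied
        System.out.println(statusOf("<employee status='retired'/>")); // explicit value kept
    }
}
```

If the stored form of a document omits defaulted attributes, queries and indexes over that attribute only see the default when something on the retrieval path knows the schema, which is why this item belongs on the list.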

Jeff

