Posted to xindice-dev@xml.apache.org by Murray Altheim <mu...@sun.com> on 2002/01/17 00:02:57 UTC

Validation Issues

[I think it's only right that we sub-thread this conversation.]

"Timothy M. Dean" wrote:
> 
> > -----Original Message-----
> > From: altheim@mehitabel.eng.sun.com
> > "Timothy M. Dean" wrote:
> > >
> > Perhaps I'm not understanding what you've explained, but it
> > seems that you're confusing client and server. Xindice is not
> > a client, it's a database server.
> 
> No, I fully understand that Xindice is a server - There's no confusion
> there.
> 
> > A Xindice system would include client software written by you,
> > and I would hope you'd have control over both the
> > installation of the server and how those clients are
> > configured.
> 
> True, but what I don't have is the guarantee that everyone who writes
> applications against this Xindice DB is going to follow the rules that I
> am expecting. Assume that I am working on an application for the ABC
> division of some company, and that my application needs to write/read
> data to a data store. Now assume that another developer in the XYZ
> division of the same company also is working on an application that
> needs access to *the same data store*. This is a scenario that many of
> the companies I work with have encountered.

So what you're saying is that anyone can write anything anytime? Couldn't
such a system be implemented where the client software used to access
the database (and the list of people who could) is restricted? If so,
and if you can restrict access to clients you write, this shouldn't be
such an issue. 

If OTOH anyone *can* write anything they want anytime, you must operate
in Defensive Mode. You can't assume that the data is valid (or even
uses the same XML Namespaces as your data). These sound so much like
what I've called "management issues" that it hardly seems like a 
technical solution is the reasonable solution. I often find designs 
that attempt to solve management problems with technical solutions run
afoul of reality. Such systems are usually fragile. Look at MS Outlook.
 
> If we assume that my application is implemented to perform all of the
> necessary validation before storing documents, then I can be confident
> that my Xindice data store is not "corrupted" for my purposes. My
> project completes development/testing and is deployed to its user base.
> Now consider when the other developer in division XYZ completes their
> application and puts it online, again accessing the same enterprise data
> that my app is accessing. How do I know that this other application is
> following the same validation rules that I have depended on?

You rely on agreements amongst the players within your enterprise to
follow a reasonable set of constraints. If you can't get that agreement
no technical solution is going to really help.

> Unless I
> spend a lot of extra effort in *every application* to make sure that
> everything I read is valid (rather than only validating when I update),
> I could easily find that my application stops working because some other
> application has stored data which I consider to be corrupt.

A terrible scenario, indeed.
 
> You may call this a management issue, but the companies I've worked with
> on this kind of problem have wanted the ability to define some sort of
> contract that can be enforced at the database level. Some way to say
> that "DB will only accept data that conforms to certain rules, so you
> can be guaranteed that these rules have been met for any data retrieved
> from the DB". With relational databases they have this ability: They can
> define the schema of the DB to contain certain tables, each with columns
> of a certain type, etc. Without some way of enforcing this rule at an
> enterprise-wide level (rather than on an application-by-application
> level), I fear that most companies I work with would never consider
> using Xindice as a solution to their needs.

Well, I do think this boils down to a management issue, especially now
that you describe it more fully. Xindice is not a solution to your problem
any more than any other database is: if there are *no* controls on data
access/manipulation, you're sunk before the boat leaves the harbor.
Relational databases have the *ability* to do this, but if the guy in
division XYZ decides to ignore those rules, they're no different than
Xindice.

What might work as a technical solution would be to intercept write
events to Xindice and validate the incoming content against a schema.
I wouldn't consider this as "within" Xindice because you really want
to validate the content prior to its insertion, not after it's in 
the database and has potentially corrupted it.
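
A minimal sketch of that interception idea, assuming a write gate that
sits in front of the database and uses the standard javax.xml.validation
API. ValidatingWriter and its in-memory map are hypothetical stand-ins
for illustration, not part of Xindice; a real gate would pass accepted
documents on to the Xindice client API instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Hypothetical write gate: validates a document BEFORE it reaches the store. */
class ValidatingWriter {
    private final Schema schema;
    // Stand-in for the real collection (assumption; this is not the Xindice API).
    private final Map<String, String> store = new LinkedHashMap<>();

    ValidatingWriter(String xsdSource) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(new StreamSource(new StringReader(xsdSource)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema could not be compiled", e);
        }
    }

    /** Stores the document only if it validates; otherwise stores nothing. */
    boolean write(String key, String xml) {
        try {
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            return false; // rejected prior to insertion
        }
        store.put(key, xml);
        return true;
    }

    int size() {
        return store.size();
    }
}
```

A document that violates the schema makes write() return false and
leaves the store untouched, so the rejection happens before insertion
rather than after the fact.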

> > ID uniqueness is a fairly easy constraint to fulfill. You'd merely
> > create an Xindice index for IDs (using the XPathService) and
> > provide that within your application as a HashSet. Incoming
> > IDs would be checked against the HashSet and if one were
> > already present, you'd throw a duplicate ID exception. This
> > would occur prior to the document being corrupted (i.e.,
> > becoming invalid) by the insertion of new content, which I'd
> > imagine is preferable to corrupting it and *then* having to
> > fix the problem. Another simple means of doing this is to have
> > Xindice return the existing document as a DOM Document node
> > using Xindice's getContentAsDOM() method, and then perform the
> > check against that using Xerces' DOM method getElementById().
> > This would allow you to bypass creating your own ID hash table.
> > Which direction to take would depend upon the application's
> > requirements, how big the documents are, how often they're
> > changed, performance configurations, etc.
> 
> I'll look into this solution: I haven't worked with indexes in Xindice
> yet, so I can't form an educated opinion on this approach yet. However,
> it does seem like a bit of a pain when all I want to do is say "Make
> sure that documents stored in this collection are valid against schema
> X"... Going through and creating indexes to help me enforce this
> constraint seems to be a bit of a hack, but I'll look into it.

Except that what you're talking about is two documents: the existing 
document and the one after the change is applied. When do you want to
validate? And why validate the entire document *after* a potentially
corrupting change if all you need is to ascertain if there's an ID
conflict? Seems like a lot of extra work. The getElementById() method
is the easiest and doesn't require you to even create your own hash.
I don't think you'd need to resort to an index unless it buys you
anything -- here I don't see that it does.
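
A rough sketch of the getElementById() check, using only the standard
JAXP/DOM classes. One caveat worth stating: getElementById() only sees
attributes that are typed as IDs (normally via a DTD or schema), so this
sketch assumes an attribute named "id" and marks it explicitly with the
DOM Level 3 setIdAttribute() call. IdConflictCheck and its method names
are hypothetical; in a Xindice client the Document would come from
getContentAsDOM() rather than from a string.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: checks an incoming ID against the existing document. */
class IdConflictCheck {

    /** Parses XML and marks every "id" attribute as an ID so getElementById() finds it. */
    static Document parseWithIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagName("*");
            for (int i = 0; i < all.getLength(); i++) {
                Element el = (Element) all.item(i);
                if (el.hasAttribute("id")) {
                    el.setIdAttribute("id", true); // assumption: IDs live in "id"
                }
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    /** Throws before any change is applied if the ID is already taken. */
    static void requireUnusedId(Document existing, String incomingId) {
        if (existing.getElementById(incomingId) != null) {
            throw new IllegalStateException("duplicate ID: " + incomingId);
        }
    }
}
```

The check runs against the existing document, so a conflicting insert
is refused before it can make the document invalid, and no separate
hash table has to be maintained.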

As to validating an entire collection, I guess if you have no ability
to regulate the content going into the database (i.e., someone else 
might simply ignore the server-side validation features) you need to 
be able to validate each and every record. This could be an enormously
time-consuming operation if done on each data access, seems extremely
heavy-handed, and the content of one record could be changing even as
you validate the rest. Putting validation features inside Xindice or 
keeping them as part of the server or client application makes no 
difference here. You're still validating 17,000 records against a schema. 

> Any recommendations of where I can find more information about using
> indexes in Xindice? I haven't found any examples or description in the
> docs that I've read thus far.

I've not found any extant documentation and have been just hacking at
the code without any. This is probably the biggest hole in the docs.
 
> > > I would be interested in this kind of project as well - My first
> > > requirements would be an easy way to handle the scenarios listed
> > > above. Any ideas you have on how to proceed would be appreciated.
> >
> > Let me know if the suggestions I've made make any sense in
> > your scenario.
> 
> I still can't figure out how you envision this kind of validation to
> work, so I'll withhold judgement until I understand your recommended
> approach a little better.

I guess where I'm confused is that I don't see how putting validation
features inside of Xindice makes any difference in solving the essential
problems you enumerate. *Where* the validation features live is immaterial,
as it seems like the solution to this management problem is to design
a system where they live *somewhere*. Since you've already got these 
features in Xerces (and therefore in both any server and client code
for Xindice) I'd continue to suggest that they're not necessary in
Xindice itself -- I don't see that putting them in Xindice would solve 
any problem. Since you've not been assured that all clients followed 
the rules, you'd still have that need to validate those 17,000 records.

[I have a funny feeling that this conversation would be a lot shorter
over coffee...]

Murray

...........................................................................
Murray Altheim, Staff Engineer          <mailto:murray.altheim@sun.com>
Java and XML Software
Sun Microsystems, 1601 Willow Rd., MS UMPK17-102, Menlo Park, CA 94025

       Ernst Martin comments in 1949, "A certain degree of noise in 
       writing is required for confidence. Without such noise, the 
       writer would not know whether the type was actually printing 
       or not, so he would lose control."

Re: Validation Issues

Posted by Joel Rosi-Schwartz <jo...@btconnect.com>.

Murray Altheim wrote:

> [I think it's only right that we sub-thread this conversation.]
>
> "Timothy M. Dean" wrote:
> >
> > > -----Original Message-----
> > > From: altheim@mehitabel.eng.sun.com
> > > "Timothy M. Dean" wrote:
> > > >
> > > Perhaps I'm not understanding what you've explained, but it
> > > seems that you're confusing client and server. Xindice is not
> > > a client, it's a database server.
> >
> > No, I fully understand that Xindice is a server - There's no confusion
> > there.
> >
> > > A Xindice system would include client software written by you,
> > > and I would hope you'd have control over both the
> > > installation of the server and how those clients are
> > > configured.
> >
> > True, but what I don't have is the guarantee that everyone who writes
> > applications against this Xindice DB is going to follow the rules that I
> > am expecting. Assume that I am working on an application for the ABC
> > division of some company, and that my application needs to write/read
> > data to a data store. Now assume that another developer in the XYZ
> > division of the same company also is working on an application that
> > needs access to *the same data store*. This is a scenario that many of
> > the companies I work with have encountered.

This is one of the reasons why application servers are an important part of an
enterprise architecture. In your place I would be exploring the viability of
placing JBoss in between the client applications and the database. You would
then have the ability to "validate" at more levels than merely the schema. You
get a shot at applying business rules where they are required, authentication
and authorization can be accommodated, and it is much easier to address
concurrency issues, to name just a few of the advantages.
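
To make the middle-tier idea concrete, here is a small plain-Java sketch
of a gateway that every client must go through. It is only an
illustration under stated assumptions: DocumentGateway, the client IDs
and the in-memory map are hypothetical, and nothing here is JBoss or
Xindice API. An application server would supply the authentication,
transactions and deployment around the same shape.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical single entry point between client applications and the store. */
class DocumentGateway {
    private final Set<String> authorizedClients = new HashSet<>();
    private final List<Predicate<String>> businessRules = new ArrayList<>();
    // Stand-in for the database (assumption; not the Xindice API).
    private final Map<String, String> store = new HashMap<>();

    synchronized void authorize(String clientId) {
        authorizedClients.add(clientId);
    }

    synchronized void addRule(Predicate<String> rule) {
        businessRules.add(rule);
    }

    /** Writes only for authorized clients and only if every rule accepts the document. */
    synchronized boolean write(String clientId, String key, String xml) {
        if (!authorizedClients.contains(clientId)) {
            return false; // authorization check
        }
        for (Predicate<String> rule : businessRules) {
            if (!rule.test(xml)) {
                return false; // business-rule check
            }
        }
        store.put(key, xml);
        return true;
    }

    synchronized String read(String key) {
        return store.get(key);
    }
}
```

Because both the ABC and XYZ applications would have to call write(),
the same rules apply to each of them, and the synchronized methods give
a crude single point for serializing concurrent updates.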
