Posted to xindice-dev@xml.apache.org by Jerry Wang <jw...@elegant.ca> on 2002/01/03 01:32:52 UTC

XSD or DTD validation?

Are there any plans to support XSD or DTD validation when creating or updating documents? I
think it would be useful if, for example, we could bind each collection to an XSD or DTD.

-Jerry Wang
Elegant Solution Consulting Inc.


Re: XSD or DTD validation?

Posted by Murray Altheim <mu...@sun.com>.
Jerry Wang wrote:
> 
> Are there any plans to support XSD or DTD validation when creating or updating documents? I
> think it would be useful if, for example, we could bind each collection to an XSD or DTD.

This is the type of thing that is done on an application-by-application
basis, and would be appropriately established by the parser factory
that creates the XML parser used to provide DOM nodes from a processed
XML document instance. It's not necessary to provide this service within
Xindice itself since this is an application decision everyone must make
anyway. Our project (for example) doesn't require valid content, so we
use a non-validating (well-formedness only) parser, but it would be a
one-line change (or perhaps a command-line parameter, if you wanted that
flexibility) to create a validating parser instead.
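
For illustration, if the parser comes from JAXP the difference is roughly
this (a minimal sketch assuming a DocumentBuilderFactory is used; note that
setValidating only turns on DTD validation, XSD needs Xerces-specific
features on top of it, and none of this is Xindice code):

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ParserSetup {
    public static Document parse(String uri, boolean validate) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(validate);          // the "one line" in question
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Without an error handler, validity errors are only reported, not fatal.
        builder.setErrorHandler(new DefaultHandler() {
            public void error(SAXParseException e) throws SAXParseException {
                throw e;
            }
        });
        return builder.parse(uri);
    }
}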

Murray

...........................................................................
Murray Altheim                         <mailto:murray.altheim@sun.com>
XML Technology Center, Java and XML Software
Sun Microsystems, Inc., MS MPK17-102, 1601 Willow Rd., Menlo Park, CA 94025

               Rally against the evils of iceburg lettuce! 
            Grab a kitchen knife and join the Balsamic Jihad!

RE: XSD or DTD validation?

Posted by "Timothy M. Dean" <td...@visi.com>.
I am aware of how to set up XSD or DTD validation in standalone Xerces.
I could use some help in figuring out how to tie that parsing into an
Xindice-based solution. Specifically, how do I go about changing the XML
parsing used by Xindice's server so that it uses the validating parser
I want? I assume that I must use some sort of customized parser factory
implementation and tie that factory into the server so that my factory
is used instead of the default. Any examples that demonstrate this would
be greatly appreciated.
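
For what it's worth, the stock JAXP lookup would let such a factory be dropped
in from the outside, along these lines (the subclass is hypothetical, and this
assumes the server obtains its parser via DocumentBuilderFactory.newInstance()
rather than instantiating Xerces classes directly):

public class ValidatingDocumentBuilderFactory
        extends org.apache.xerces.jaxp.DocumentBuilderFactoryImpl {
    public javax.xml.parsers.DocumentBuilder newDocumentBuilder()
            throws javax.xml.parsers.ParserConfigurationException {
        setNamespaceAware(true);
        setValidating(true);   // DTD validation; Xerces features would be needed for XSD
        return super.newDocumentBuilder();
    }
}

JAXP consults the javax.xml.parsers.DocumentBuilderFactory system property
before falling back to its default, so starting the VM with

  -Djavax.xml.parsers.DocumentBuilderFactory=ValidatingDocumentBuilderFactory

would select the factory above, provided Xindice really does go through JAXP here.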

In addition to any programmatic means of enforcing validation, it seems
likely that many environments would want to set up per-collection schema
or DTD validation through some kind of administrator's interface. In
many of the applications where I would consider using Xindice, I wouldn't
want to rely on all applications being completely "well-behaved".

- Tim

> -----Original Message-----
> From: Murray.Altheim@eng.sun.com 
> [mailto:Murray.Altheim@eng.sun.com] On Behalf Of Murray Altheim
> Sent: Thursday, January 03, 2002 1:01 AM
> To: xindice-dev@xml.apache.org
> Subject: Re: XSD or DTD validation?
> 
> 
> "Timothy M. Dean" wrote:
> > 
> > I was also curious about the same thing. I would be willing to 
> > contribute to this effort if necessary. Has anyone else given much 
> > thought to this idea?
> 
> To reiterate my answer to Jerry, both XSD and DTD validation 
> are provided by the Xerces 2 parser that is used by Xindice, 
> so it comes down to how you set up the XML parser used to 
> provide content to Xindice. If you have any question on how 
> to set up such a parser, check the 'samples' directory of the 
> Xerces 2 distribution, which shows how to do this for both 
> SAX and DOM parsers. It's really pretty simple. If you need 
> catalog-style entity resolution, you can get Norm Walsh's 
> catalog resolution code at nwalsh.com.
> 
> Murray
> 
> ..............................................................
> .............
> Murray Altheim                         
> <mailto:murray.altheim&#x40;sun.com>
> XML Technology Center, 
> Java and XML Software
> Sun Microsystems, Inc., MS MPK17-102, 1601 Willow Rd., Menlo 
> Park, CA 94025
> 
>                Rally against the evils of iceburg lettuce! 
>             Grab a kitchen knife and join the Balsamic Jihad!
> 


Re: XSD or DTD validation?

Posted by Murray Altheim <mu...@sun.com>.
"Timothy M. Dean" wrote:
> 
> I was also curious about the same thing. I would be willing to
> contribute to this effort if necessary. Has anyone else given much
> thought to this idea?

To reiterate my answer to Jerry, both XSD and DTD validation are provided
by the Xerces 2 parser that is used by Xindice, so it comes down to how
you set up the XML parser used to provide content to Xindice. If you have
any question on how to set up such a parser, check the 'samples' directory
of the Xerces 2 distribution, which shows how to do this for both SAX and
DOM parsers. It's really pretty simple. If you need catalog-style entity
resolution, you can get Norm Walsh's catalog resolution code at nwalsh.com.
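
The DOM case boils down to a few setFeature calls, roughly like this (sketched
from memory against the documented Xerces 2 feature names, not copied from the
samples; the catalog resolver is the optional part):

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;

public class ValidatingParse {
    public static Document parse(String uri) throws Exception {
        DOMParser parser = new DOMParser();
        // DTD validation (standard SAX2 feature).
        parser.setFeature("http://xml.org/sax/features/validation", true);
        // W3C XML Schema validation (Xerces-specific features).
        parser.setFeature("http://apache.org/xml/features/validation/schema", true);
        parser.setFeature("http://apache.org/xml/features/validation/schema-full-checking", true);
        // parser.setEntityResolver(...) is where a catalog resolver would plug in.
        parser.parse(uri);
        return parser.getDocument();
    }
}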

Murray

...........................................................................
Murray Altheim                         <mailto:murray.altheim@sun.com>
XML Technology Center, Java and XML Software
Sun Microsystems, Inc., MS MPK17-102, 1601 Willow Rd., Menlo Park, CA 94025

               Rally against the evils of iceburg lettuce! 
            Grab a kitchen knife and join the Balsamic Jihad!

Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Kimbro Staken <ks...@dbxmlgroup.com>.
On Wednesday, January 9, 2002, at 08:42 AM, ericjs@rcn.com wrote:
>>
>
> Here's a real world example of where node level access control is very 
> useful. Say
> you are implementing a document authoring / management system for a 
> publisher of
> scientific articles. You want staff writers and editors to have access to 
> the body of
> documents for tweaking the writing. But you want only your staff of 
> domain and
> classification experts to have access to certain metadata sections that 
> classify and
> correlate the documents to the proper scientific fields, topics, and 
> specialized
> taxonomies which will be used by researchers for searching. Perhaps only 
> senior
> editors should have access to change certain publication metadata. And 
> only system
> administrators should be able to touch the document's unique identifier 
> once it's
> been assigned.
>

Thanks, that's a good example and it makes sense.

>

> As Stefano points out, you likely don't want to individually control 
> every single node,
> but you want to be able to choose nodes or sections to control, similar 
> to the defining
> of what in the document you want to index.
>
> --
> Eric Schwarzenbach
>
>
Kimbro Staken
XML Database Software, Consulting and Writing
http://www.xmldatabases.org/


Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by er...@rcn.com.
On 9 Jan 2002 at 13:48, Stefano Mazzocchi wrote:

> Kimbro Staken wrote:
> > 
> > > How can I perform access control at the node level without
> > > duplicating the information at the CMS level?
> > 
> > Why do you need node level access control for a CMS? That seems
> > awfully fine grained control and it will be extremely complex to
> > administer and expensive to implement. It's basically like asking to
> > have column level access control for an RDBMS.
> 
> I'm not saying that you have to fine tune your ACL for *every* node,
> but I'm saying that if you consider your nodes are the 'data atoms'
> you need to have access control at that level (think of file
> systems!).
> 

Here's a real world example of where node level access control is very useful. Say 
you are implementing a document authoring / management system for a publisher of 
scientific articles. You want staff writers and editors to have access to the body of 
documents for tweaking the writing. But you want only your staff of domain and 
classification experts to have access to certain metadata sections that classify and 
correlate the documents to the proper scientific fields, topics, and specialized 
taxonomies which will be used by researchers for searching. Perhaps only senior 
editors should have access to change certain publication metadata. And only system 
administrators should be able to touch the document's unique identifier once it's 
been assigned.

As Stefano points out, you likely don't want to individually control every single node, 
but you want to be able to choose nodes or sections to control, similar to the defining 
of what in the document you want to index.
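
To make the granularity concrete, the kind of per-collection configuration I
have in mind would be something like this (every name here is made up; it is
only meant to parallel how index definitions already pick out parts of a
document):

<acl collection="/db/articles">
  <rule match="/article/body"                roles="writer editor"  access="read write"/>
  <rule match="/article/meta/classification" roles="classifier"     access="read write"/>
  <rule match="/article/meta/publication"    roles="senior-editor"  access="read write"/>
  <rule match="/article/@id"                 roles="administrator"  access="write"/>
</acl>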

--
Eric Schwarzenbach

Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Stefano Mazzocchi <st...@apache.org>.
Kimbro Staken wrote:
> 
> On Saturday, January 5, 2002, at 04:01 AM, Stefano Mazzocchi wrote:
> > My point was not to remove CORBA from the picture (BTW, is there
> > anybody here who is using XIndice from CORBA in a real-life
> > application?) but to indicate my impression that time spent on a webDAV
> > connection would have been better spent. No offense intended, just a
> > consideration from the document-oriented world where CORBA will never
> > even enter.
> >
> 
> Everybody who uses the XML:DB API uses CORBA behind the scenes, which
> basically means everybody is using it. I don't know of anyone using the
> CORBA API directly and I wouldn't encourage anyone to do so since we want
> to get rid of CORBA. Now getting rid of CORBA does not mean getting rid of
> that layer. CORBA provides an essential function to the server and that
> function could not be entirely fulfilled by webdav.

Oh, absolutely. I never thought the opposite.

> While webdav would be nice for document oriented applications, dbXML was
> not really designed or conceived for those applications nor has the
> majority of the interest in the server been for those types of
> applications. This isn't to deny that both webdav and document oriented
> applications are important, it is to deny that they are the only
> applications that should be targeted. 

Granted. Again, I was not asking for replacement, I was telling you my
wish list.

> I'm all for adding webdav as an
> option, but you're wrong in saying that our time would have been better
> spent there. In fact you are the only person who has ever "really" wanted
> webdav. It had come up in the past but it was never a real solid request
> from any user of the software. Now it is.

Ok, makes sense.

At the same time, I don't think the webdav layer belongs in this
project, but rather on the CMS side of things (one day I'll try to have
Slide connect to XIndice directly... you'll hear from me if I succeed)
 
> > That's a good point, but again, I'm questioning the darwinistic
> > evolutionary process of this effort: do what people ask, not what
> > architectural elegance suggests or W3C recommends.
> 
> And we've had far more requests for W3C XML Schema than for Relax NG. I'm
> not a fan of XML Schema either, but that doesn't change the fact that it is
> what is being asked for. I'm with Tom though: if we can do things in a schema
> language independent manner, that should be the target.

+1

> >
> > I agree with you on the fact that the engine internals should deal with
> > validation. Just like Cocoon doesn't validate stuff by default.
> >
> 
> Let's not get too caught up in just focusing on validation here. Validation
> is just part of the schema equation. There's also the data-typing issue to
> consider. This will be particularly important with XQuery. In fact I'd say
> data typing is even more important than validation for data oriented apps,
> but you can't really apply types without the structure of the document
> being known. This means some level of schema support has to be built into
> the server.
> 
> Just to be clear, in no way am I suggesting that the server should
> "require" a schema. In fact I'd consider requiring schemas to be
> destroying what I value most about the server.

agreed.

> I agree it would be cool if validation could be done at either client or
> server under the control of the developer. For data oriented apps having
> robust schema support on the server will be essential though.

Oh, even for document systems, but probably on another level. databases
should concentrate on data.

> > The content management system I'd like to have could be built in two
> > ways:
> >
> >  1) single layer: XIndice includes all the required functionality.
> >
> >  2) double layer: XIndice is the db engine, something else wraps it and
> > performs CMS operations like access control, workflow management, data
> > validation, versioning, etc.
> 
> > Separation of Concerns clearly indicates that the second option is the
> > best. This has been my view of the issues since May 2000, when I first
> > took a serious look at dbXML as the engine for such a system.
> >
> 
> Yes number 2 is clearly the way to go.

agreed.

> > This is why I wanted XIndice over to Apache: separation of concerns is a
> > great way to do parallel design and increase productivity and give users
> > more choice, but it can't work without *solid* contracts between the
> > systems that interoperate.
> >
> > So, what I'm asking, is *NOT* to turn XIndice into a CMS, not at all!
> 
> Good, because I certainly wouldn't agree with that.
> 
> > What I'd like to see is XIndice remaining *very* abstract on the XML
> > model, but without sacrificing performance and making it possible to
> > implement more complex systems on top.
> >
> 
> Absolutely, that's the whole point. Xindice is about flexibility.
> 
> >
> > Absolutely. Still, please, let's try to avoid a pissing contest with the
> > RDBMS communities and lead the way for those grounds where the relational
> > model fits, but with a very bad twist.
> >
> 
> I agree, I don't want to get into this battle either. However, that doesn't
> mean that an XML database is not useful in data oriented applications.

All applications using a database require a data-oriented engine. The
entire question is about the 'type' of structure this data has.

> The simple fact that you have semi-structured data is incredibly valuable
> for many applications that are nothing like a CMS. They're still data
> oriented applications though. Just by building a database it doesn't
> automatically mean that you have to suddenly start chanting "death to
> RDBMS".

Absolutely agreed.
 
> >>>
> >>>  - web services
> >>>  - content management systems
> >>
> >> Don't forget health care, legal documents, and scientific applications.
> >
> > These are all examples of the above two.
> >
> 
> Heh, heh, there is no way that I'll buy into the idea that the only two
> places where Xindice is useful are web services and CMS. There's more to
> XML data management than that.

For example? (just curious, not ironically challenging)
 
> > XUpdate is a way to express deltas, differences between trees.
> >
> > In the data-centric world, people are used to sending deltas: change this
> > number to this other one, append this new address, remove this credit
> > card from the valid list.

> > In the document-centric world, people are used to thinking of files, not
> > about their diffs.

> > CVS is a great system because it does all the differential processing on
> > documents by itself, transparently.
> >
> > Now, the use of a delta-oriented update language isn't necessarily bad
> > as a 'wire-transport' (much like CVS sends compressed diffs between the
> > client and the server) but it definitely isn't useful by itself without
> > some application-level adaptation.
> >
> > Now, let me give you a scenario I'd like to see happening: imagine having
> > this CMS system implemented and you provide a WebDAV view of your
> > database.
> >
> > You connect to this 'web folder' (both Windows, Linux and MacOSX come
> > with the ability to mount webdav hosts as if they were file system
> > folders), you browse it and you save your file from your favorite XML
> > editor (or even using stuff like Adobe Illustrator for SVG).
> >
> > The CMS will control your accessibility (after authentication or using
> > client side certification, whatever), perform the necessary steps
> > defined on that folder by the workflow configurations (for example,
> > sending email to the editor and placing the document with a status of
> > 'to be reviewed') and save the document.
> >
> 
> In this scenario though, wouldn't you actually want the webdav impl at the
> CMS layer and not built into Xindice itself?
> 
> The flow would be.
> 
> client <-> webdav <-> CMS <-> XML:DB API <-> CORBA <-> Xindice
> 
> With the goal of making it
> 
> client <-> webdav <-> CMS <-> XML:DB API <-> SOAP <-> Xindice
> 
> or optionally
> 
> client <-> webdav <-> CMS <-> SOAP <-> Xindice
> 
> Personally, I'd like to see webdav available as a module for Xindice. I'm
> not sure it needs to be there by default, but maybe it does. I just don't
> know if it makes sense for the scenario you describe above. Going from the
> CMS to Xindice via the XML:DB API would be much more efficient than going
> through webdav.

I agree with you, the webdav layer should be on top of something else,
probably, SOAP.
 
> > Now, can I use XIndice to provide the storage system underneath this
> > CMS?
> >
> > For example, in order to have a webdav view I need the ability to have
> > 'node flavors': a node can be a 'folder' (currently done with
> > collections), what is a 'document' and what is a 'document fragment' and
> > what is a symlink to another document fragment.
> >
> 
> It seems you would model most of these at the application level. Do you
> think the database needs to support more than just collections and
> documents? If so what and why?

I'll trigger another email for this.
 
> > How can I perform access control at the node level without duplicating
> > the information at the CMS level?
> 
> Why do you need node level access control for a CMS? That seems awfully
> fine grained control and it will be extremely complex to administer and
> expensive to implement. It's basically like asking to have column level
> access control for an RDBMS.

I'm not saying that you have to fine tune your ACL for *every* node, but
I'm saying that if you consider your nodes are the 'data atoms' you need
to have access control at that level (think of file systems!).

How to make this usable is an implementation detail (all nested nodes
inherit the parent ACL, and so on...)
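
Something as simple as "the nearest ancestor with an explicit ACL wins", which
is a one-method walk up the tree (a hypothetical sketch; the Acl lookup is
whatever the server keeps internally):

import org.w3c.dom.Node;
import java.util.Map;

public class AclResolver {
    private final Map explicitAcls; // only the nodes that were given an ACL explicitly

    public AclResolver(Map explicitAcls) {
        this.explicitAcls = explicitAcls;
    }

    /** Returns the ACL of the nearest ancestor (or the node itself) that has one. */
    public Object effectiveAcl(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            Object acl = explicitAcls.get(n);
            if (acl != null) {
                return acl;
            }
        }
        return null; // nothing set anywhere on the path: apply the collection default
    }
}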
 
> > how can I perform versioning without
> > having to duplicate every document entirely?
> 
> I think having versioning in the database would be pretty useful for many
> different applications.

Absolutely.
 
> > Currently, whenever the CMS saves something on top of another document,
> > it has to call for the document, perform the diff, get the XUpdate and
> > send that.
> 
> You can replace the whole document if you want via the XML:DB API. Use of
> XUpdate is completely optional.

Ok, then I'm fine.
 
> >
> > I'm not asking to remove XUpdate from the feature list, but to give the
> > appropriate tools depending on the uses.
> >
> 
> Well that is fine, just don't say something is useless when it is only
> useless to you. :-) It isn't like XUpdate is the only way to change the
> content in the server.

Ok, good point, sorry for that. :)
 
> > Yes, you are right saying that XQuery does include this functionality,
> > but I suggest you consider the following scenario:
> >
> > <db:database xmlns:db="xindice#internal" xmlns:cms="CMS">
> >
> >  <legal db:type="folder">
> >   <copyright db:type="document" db:version="10.2"
> > db:last-modified="20010223">
> >     This is copyright info and blah blah...
> >   </copyright>
> >  </legal>
> >
> >  <press db:type="folder">
> >   <press-releases db:type="folder">
> >    <press-release date="20010212" author="blah"
> >      db:type="document" db:version="10.2" db:last-modified="20010213"
> >      cms:status="published">
> >     <title>XIndice 2.0 released!</title>
> >     <content>
> >      <p>blah blah blah</p>
> >      <p><db:link href="/legal/copyright[text()]"/></p>
> >     </content>
> >    </press-release>
> >   </press-releases>
> >  </press>
> >
> > </db:database>
> >
> > then, you can ask for the document
> >
> >  /press/press-releases/press-release[@date = '20010212']
> >
> > and you get
> >
> >  <press-release>
> >   <title>XIndice 2.0 released!</title>
> >   <content>
> >    <p>blah blah blah</p>
> >    <p>This is copyright info and blah blah...</p>
> >   </content>
> >  </press-release>
> >
> > which allows your users to avoid probably 200 pages of XQuery syntax to
> > accomplish the same task (and also, probably, be much faster!).
> >
> 
> Is your goal here to have the database be specified in XML or just to have
> the linking? 

No, I was showing an XML 'view' of the internal database data, of
course, I'm not proposing to use *this* as the actual data stored. I'm
not that foolish :)

> For the database being specified in XML, that is a bad idea,
> but I don't think that is what you were really trying to convey.  

exactly.

> For the
> linking that actually already exists and has since dbXML 0.2, but we call
> it experimental because there are a lot of issues with it.
> 
> 1. It requires db specific tags in the XML documents. For some apps this
> is OK, for many it is not.

ok

> 2. If you use XLink to solve problem 1 then you lose the ability to
> include XLinks that should be passed through to the client.

you can use XLink 'roles' to identify which links are internal to the database!
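
i.e. something like this, where the role URI tells the server which links to
expand and which to hand to the client untouched (the role URIs are invented
for the example):

<content xmlns:xlink="http://www.w3.org/1999/xlink">
  <!-- expanded by the database before the document is returned -->
  <p><ref xlink:type="simple"
          xlink:href="/legal/copyright"
          xlink:role="urn:example:xindice-inline"/></p>
  <!-- an ordinary XLink, passed through to the client as-is -->
  <p><ref xlink:type="simple"
          xlink:href="http://www.example.org/more-info"
          xlink:role="urn:example:pass-through"/></p>
</content>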

> 3. There is a problem between views on the document. Basically you need
> different views when editing a document vs. retrieving a document. Webdav
> has/had the same problem with dynamic pages; it may be fixed in a later
> spec, I'm not sure.

good point.

> 4. Runaway expansion of links (i.e. circular links) could have some very
> nasty results and could be difficult to detect.

nah, not difficult at all: just mark the nodes you have visited.
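
A sketch of what I mean, with a hypothetical resolve() standing in for the
actual lookup of the link target:

import java.util.HashSet;
import java.util.Set;

public class LinkExpander {
    /** Expands a link target, refusing to follow an href already on the current path. */
    public String expand(String href, Set visited) {
        if (!visited.add(href)) {
            // already on the path of this expansion: a circular link, stop here
            return "<!-- circular link to " + href + " suppressed -->";
        }
        String fragment = resolve(href); // hypothetical database lookup of the fragment
        // ... recursively expand any links inside 'fragment', passing 'visited' along ...
        visited.remove(href); // the same target may legitimately appear on a sibling branch
        return fragment;
    }

    private String resolve(String href) {
        return ""; // placeholder for the real retrieval
    }
}

// usage: new LinkExpander().expand("/legal/copyright", new HashSet());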

> 5. Related to above but applicable even in cases where circular links do
> not exist, linking could bring large portions of the database into memory
> in cases where that would not be the desired behavior.

I expect that if you link something, you do it because you use the same
content in many places. This will actually increase performance by keeping
the shared parts in memory.

The alternative is joining documents with XQuery, and I do not think that
is going to be any faster.

> 6. You have no way to express a relationship that you did not prewire into
> your data model.

well, that's an intrinsic limitation, but something we can all live with
when we choose to use it. I'm not proposing to "remove" XQuery in favor of
internal linking, no way, but this alone would remove 80% of XQuery
usage in the document-centric world, and that is a worthy goal, IMO.
 
> Solutions are possible for most of these things and I'm not sure I agree
> with Tom that this should be abandoned for XQuery. 

Yep, that's what I think as well.

> I see them as being complementary if implemented correctly. 

Absolutely.

> For instance you could use linking
> as a mechanism to optimize XQuery evaluation by prewiring some of the
> relationships. 

Yep.

> Likewise XQuery can be used to express relationships that
> are not known via linking. I like the flexibility of having both, if the
> linking issues can be resolved acceptably.

Absolutely +1!!

> > Without appropriate hooks for caches, any data storage system is
> > destined not to scale in real life systems.
> >
> > I suggest you place the above two features very high in the todo list
> > or you'll find people very disappointed when they start getting
> > scalability problems and you can't give them solutions to avoid
> > saturation.
> >
> 
> No disagreement at all here. I already consider those high priority. It's
> really a matter of exposing it through the API more than anything else.

Ok, cool.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Stefano Mazzocchi <st...@apache.org>.
Kimbro Staken wrote:

> >  Personally, I don't like XQuery, and would prefer it if XUpdate and
> > XSelect were the standards, but I'm not the one who influences the XML
> > world :-)
> >
> 
> Ugh, while XQuery isn't great I'd much rather have that than a cumbersome
> XML syntax language. XUpdate is nice, but I always find it very, very
> cumbersome to use. 

+18393!

> I want better interactive query and update facilities
> and I just don't see XUpdate and XSelect getting us there. XQuery may not
> be the right way either, but it is a lot closer.

+44893847!

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Tom Bradford <br...@dbxmlgroup.com>.
On Wednesday, January 9, 2002, at 06:03 AM, Stefano Mazzocchi wrote:
> You're right to point out that there is some impedance mismatch between
> an application *using* XML and a database storage solution (say Tomcat
> and XIndice).
>
> But I think this is a *feature*, not a bug.
>
> I mean: if you wrap XIndice with a JAXP layer, you allow Tomcat to use
> XIndice as 'XML parser' and you could use it instead of a file system
> (which might be incredibly useful for clusters of tomcats all using the
> same configuration served from an XML db!)

The Xindice DOMBuilder is already bootstrapped using JAXP, but we allow 
you to continue bootstrapping with whatever SAX Factory you already 
had.  At the moment, this is only important for documents going into the 
system, because after the document has reached the database, it doesn't 
need parsing on the way out.  For the most part, none of this is a 
problem if you use Xindice purely as it's architected now, which is in 
its own VM serving down to a client such as a servlet engine or 
stand-alone app.  It's when you embed Xindice that it becomes a 
potential problem.  At some point, I'd like to see the HTTP Server I 
wrote replaced, and use Avalon and (maybe) a scaled down version of 
Tomcat to operate as the server framework.  Problem is, I already see 
this becoming a train wreck.

> I don't question the functionality, but I question the software
> architecture.

What architecture?  Based on the variety of how people have used dbXML 
in the past, I can't even begin to imagine how they're going to employ 
Xindice, so we need to minimize the potential for damage now, before 
somebody else's project seriously suffers from it.

> For example, providing a webdav interface might allow us to make this
> tomcat-cluster happy, with no need to have a direct JAXP implementation
> (also JAXP would require you to use RMI for distributed applications,
> while soap or webdav would do it over the wire).

Oh God!  Enough with the WebDAV already.  It's not the almighty 
elixir.  :-)

> Anyway, besides the implementation details and the examples, my point
> is: let XIndice focus on the engine. The applications will come.

The applications are here.  I'd like to focus on the engine, but the 
applications will drive it.  Let's not forget why we're here.  dbXML 
already had a fairly large user community before joining the Apache 
effort.   Saying the applications will come negates all of the effort 
people have put into writing dbXML applications in the past.  Sorry, but 
my loyalty continues to be to those people.  They have certainly driven 
the project to this point, and have done a very good job of it.

> There is no agreement in the XML world about what "xml updates" mean or
> should do. If there is no agreement between you and Kimbro, go figure
> into the rest of the XML world.

Course there is!  But we don't have enough money to play with the W3C, 
and apparently, we're not 'experts'.

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database)
Creator - Project Labrador (XML Object Broker)


Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Stefano Mazzocchi <st...@apache.org>.
Tom Bradford wrote:
> 
> On Tuesday, January 8, 2002, at 06:54 PM, Kimbro Staken wrote:
> > Would that achieve the exact same thing though? My goal was to be able
> > to prewire certain relationships so that queries could be simplified
> > and maybe even be sped up by removing the join. It won't work for all
> > applications, but for some it could be very handy. It also gets more 
> > mileage out of straight XPath queries.
> 
> It would produce the same result, because you could use the same
> namespaced link attributes, but it would require the document to be
> processed manually instead of automagically, allowing the linking
> functionality to not be tied specifically to our DOM implementation.  A
> single XQuery query could be used to perform the linking, and it could
> even be done using SAX or DOM (where right now, we can only do it using
> DOM).
> 
> >> What I ultimately really want is to have our DOM implementation
> >> function identically to any other DOM you could bootstrap using JAXP,
> >
> > This sounds like a nice goal, but is it really necessary? What does it
> > gain us and what do we lose? I'm just trying to understand the
> > motivation.
> 
> For the client/server model, probably not much, but if you were to embed
> the server into another application (say Tomcat, for example), there may
> be conflicts as to which DOM is used.  Either we can explicitly create
> our own DOM instances ignoring the DOMBuilder stuff, or we can work
> cleanly with JAXP, which has the benefit of not requiring
> inter-implementation conversion, which may slow things down if nodes are
> imported between DOMs.

You're right to point out that there is some impedance mismatch between
an application *using* XML and a database storage solution (say Tomcat
and XIndice).

But I think this is a *feature*, not a bug.

I mean: if you wrap XIndice with a JAXP layer, you allow Tomcat to use
XIndice as 'XML parser' and you could use it instead of a file system
(which might be incredibly useful for clusters of tomcats all using the
same configuration served from an XML db!)

I don't question the functionality, but I question the software
architecture.

In my vision, the perceived impedance mismatch is a sign of the need for
another layer between the two. Something that *uses* XIndice as an
engine but provides all those 'file-system-like' functionalities that
people would need anyway.

For example, providing a webdav interface might allow us to make this
tomcat-cluster happy, with no need to have a direct JAXP implementation
(also JAXP would require you to use RMI for distributed applications,
while soap or webdav would do it over the wire).

Anyway, besides the implementation details and the examples, my point
is: let XIndice focus on the engine. The applications will come.

> >> and offload functionality like AutoLink into another layer, preferably
> >> into an XQuery engine, where the behavior is easily coded, instead of
> >> using Java to do it.
> >
> > I think you need to explain more what you mean here. I'm not seeing the
> > benefit of pushing it into the XQuery layer or even how it would work.
> 
> <above/>
> 
> >>  Personally, I don't like XQuery, and would prefer it if XUpdate and
> >> XSelect were the standards, but I'm not the one who influences the XML
> >> world :-)
> >>
> >
> > Ugh, while XQuery isn't great I'd much rather have that than a
> > cumbersome XML syntax language. XUpdate is nice, but I always find it
> > very, very cumbersome to use. I want better interactive query and
> > update facilities and I just don't see XUpdate and XSelect getting us
> > there. XQuery may not be the right way either, but it is a lot closer.
> 
> Closer?  Like XQuery updates? :-)  I'm not holding my breath.

There is no agreement in the XML world about what "xml updates" mean or
should do. If there is no agreement between you and Kimbro, go figure
into the rest of the XML world.

And I *like* the fact that the W3C doesn't recommend something they
can't focus on (besides XMLSchema, that is).

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------


Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Tom Bradford <br...@dbxmlgroup.com>.
On Wednesday, January 9, 2002, at 05:33 PM, Kimbro Staken wrote:
> It may produce the same end result, but it doesn't achieve my goal. 
> Pushing linking into the XQuery layer, to me at least, defeats the 
> whole point of having linking in the first place. I want to be able to 
> use it to simplify and speed up queries, not make them more complex

The user doesn't have to see or interact with the query.  The way I see 
it working is that you perform a standard retrieval through an expander, 
and the XQuery is used to expand the links.  Really, the same 
functionality could be performed with XSLT, just figured if we were 
gonna XQuery close to the database, we might as well use that instead.

> If there is runtime cost in converting DOM impls then it seems that 
> would be an optimization point for the developer to worry about as part 
> of the price for embedding the server. Anyway, I'm obviously unclear on 
> the details here, I thought our DOM would always have this problem 
> because of the compression system?

When I implement the DTSM stuff, I'm going to layer the compression 
system deeper and expose it using the DTM so that we can layer SAX or 
DOM on top of that.

> Oh come on, let's be realistic here. XQuery is far closer to being 
> complete than XUpdate. You know perfectly well that there is an update 
> syntax that exists and has even been implemented a couple of times. 
> http://www.lehti.de/beruf/diplomarbeit.pdf Regardless of whether or not 
> it is stable or part of XQuery 1.0, the language is still far more 
> complete. XUpdate itself isn't even complete, and the query component 
> required doesn't even exist. XSelect is at a far less mature stage than 
> the update extensions for XQuery. Additionally, XUpdate was never intended 
> to be more than a stop-gap while XQuery was developed. I hate to defend 
> XQuery, but we have to at least keep one foot grounded in reality.

Dude, how long have you been working with me that you still don't know 
when I'm talking out of my ass?  If we can get XQuery implemented and 
it's not a quickly moving target, I can wait for updates.  Maybe we can 
implement UpdateGrams :-)

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database)
Creator - Project Labrador (Web Services Framework)


Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Kimbro Staken <ks...@dbxmlgroup.com>.
On Tuesday, January 8, 2002, at 09:56 PM, Tom Bradford wrote:

> On Tuesday, January 8, 2002, at 06:54 PM, Kimbro Staken wrote:
>
> It would produce the same result, because you could use the same 
> namespaced link attributes, but it would require the document to be 
> processed manually instead of automagically, allowing the linking 
> functionality to not be tied specifically to our DOM implementation.  A 
> single XQuery query could be used to perform the linking, and it could 
> even be done using SAX or DOM (where right now, we can only do it using 
> DOM).
>

It may produce the same end result, but it doesn't achieve my goal. 
Pushing linking into the XQuery layer, to me at least, defeats the whole 
point of having linking in the first place. I want to be able to use it to 
simplify and speed up queries, not make them more complex.
>

> For the client/server model, probably not much, but if you were to embed 
> the server into another application (say Tomcat, for example), there may 
> be conflicts as to which DOM is used.  Either we can explicitly create 
> our own DOM instances ignoring the DOMBuilder stuff, or we can work 
> cleanly with JAXP, which has the benefit of not requiring 
> inter-implementation conversion, which may slow things down if nodes are 
> imported between DOMs.

If there is runtime cost in converting DOM impls then it seems that would 
be an optimization point for the developer to worry about as part of the 
price for embedding the server. Anyway, I'm obviously unclear on the 
details here, I thought our DOM would always have this problem because of 
the compression system?

>>
>> I think you need to explain more what you mean here. I'm not seeing the 
>> benefit of pushing it into the XQuery layer or even how it would work.
>
> <above/>

I guess what I really wanted was an example of how linking would be 
implemented via XQuery.

>
> Closer?  Like XQuery updates? :-)  I'm not holding my breath.
>

Oh come on, let's be realistic here. XQuery is far closer to being 
complete than XUpdate. You know perfectly well that there is an update 
syntax that exists and has even been implemented a couple of times. 
http://www.lehti.de/beruf/diplomarbeit.pdf Regardless of whether or not it 
is stable or part of XQuery 1.0, the language is still far more complete. 
XUpdate itself isn't even complete, and the query component required doesn't 
even exist. XSelect is at a far less mature stage than the update 
extensions for XQuery. Additionally, XUpdate was never intended to be more 
than a stop-gap while XQuery was developed. I hate to defend XQuery, but 
we have to at least keep one foot grounded in reality.

> --
> Tom Bradford - http://www.tbradford.org
> Developer - Apache Xindice (Native XML Database)
> Creator - Project Labrador (XML Object Broker)
>
>
>
Kimbro Staken
XML Database Software, Consulting and Writing
http://www.xmldatabases.org/


Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Tom Bradford <br...@dbxmlgroup.com>.
On Tuesday, January 8, 2002, at 06:54 PM, Kimbro Staken wrote:
> Would that achieve the exact same thing though? My goal was to be able 
> to prewire certain relationships so that queries could be simplified 
> and maybe even be sped up by removing the join. It won't work for all 
> applications, but for some it could be very handy. It also gets more 
> mileage out of straight XPath queries.

It would produce the same result, because you could use the same 
namespaced link attributes, but it would require the document to be 
processed manually instead of automagically, allowing the linking 
functionality to not be tied specifically to our DOM implementation.  A 
single XQuery query could be used to perform the linking, and it could 
even be done using SAX or DOM (where right now, we can only do it using 
DOM).

>> What I ultimately really want is to have our DOM implementation 
>> function identically to any other DOM you could bootstrap using JAXP,
>
> This sounds like a nice goal, but is it really necessary? What does it 
> gain us and what do we lose? I'm just trying to understand the 
> motivation.

For the client/server model, probably not much, but if you were to embed 
the server into another application (say Tomcat, for example), there may 
be conflicts as to which DOM is used.  Either we can explicitly create 
our own DOM instances ignoring the DOMBuilder stuff, or we can work 
cleanly with JAXP, which has the benefit of not requiring 
inter-implementation conversion, which may slow things down if nodes are 
imported between DOMs.

>> and offload functionality like AutoLink into another layer, preferably 
>> into an XQuery engine, where the behavior is easily coded, instead of 
>> using Java to do it.
>
> I think you need to explain more what you mean here. I'm not seeing the 
> benefit of pushing it into the XQuery layer or even how it would work.

<above/>

>>  Personally, I don't like XQuery, and would prefer it if XUpdate and 
>> XSelect were the standards, but I'm not the one who influences the XML 
>> world :-)
>>
>
> Ugh, while XQuery isn't great I'd much rather have that than a 
> cumbersome XML syntax language. XUpdate is nice, but I always find it 
> very, very cumbersome to use. I want better interactive query and 
> update facilities and I just don't see XUpdate and XSelect getting us 
> there. XQuery may not be the right way either, but it is a lot closer.

Closer?  Like XQuery updates? :-)  I'm not holding my breath.

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database)
Creator - Project Labrador (XML Object Broker)


Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Kimbro Staken <ks...@dbxmlgroup.com>.
On Monday, January 7, 2002, at 11:07 AM, Tom Bradford wrote:
>
> You could always express the same hardwired linking format in your 
> documents and use a stock XQuery script to expand them instead of having 
> a custom DOM implementation do it behind your back.

Would that achieve the exact same thing though? My goal was to be able to 
prewire certain relationships so that queries could be simplified and 
maybe even be sped up by removing the join. It won't work for all 
applications, but for some it could be very handy. It also gets more 
mileage out of straight XPath queries.

> What I ultimately really want is to have our DOM implementation function 
> identically to any other DOM you could bootstrap using JAXP,

This sounds like a nice goal, but is it really necessary? What does it 
gain us and what do we lose? I'm just trying to understand the motivation.

> and offload functionality like AutoLink into another layer, preferably 
> into an XQuery engine, where the behavior is easily coded, instead of 
> using Java to do it.

I think you need to explain more what you mean here. I'm not seeing the 
benefit of pushing it into the XQuery layer or even how it would work.

>  Personally, I don't like XQuery, and would prefer it if XUpdate and 
> XSelect were the standards, but I'm not the one who influences the XML 
> world :-)
>

Ugh, while XQuery isn't great I'd much rather have that than a cumbersome 
XML syntax language. XUpdate is nice, but I always find it very, very 
cumbersome to use. I want better interactive query and update facilities 
and I just don't see XUpdate and XSelect getting us there. XQuery may not 
be the right way either, but it is a lot closer.

> --
> Tom Bradford - http://www.tbradford.org
> Developer - Apache Xindice (Native XML Database)
> Creator - Project Labrador (XML Object Broker)
>
>
>
Kimbro Staken
XML Database Software, Consulting and Writing
http://www.xmldatabases.org/


Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Tom Bradford <br...@dbxmlgroup.com>.
On Monday, January 7, 2002, at 04:16 AM, Kimbro Staken wrote:
> Solutions are possible for most of these things and I'm not sure I 
> agree with Tom that this should be abandoned for XQuery. I see them as 
> being complementary if implemented correctly. For instance you could 
> use linking as a mechanism to optimize XQuery evaluation by prewiring 
> some of the relationships. Likewise XQuery can be used to express 
> relationships that are not known via linking. I like the flexibility of 
> having both, if the linking issues can be resolved acceptably.

You could always express the same hardwired linking format in your 
documents and use a stock XQuery script to expand them instead of having 
a custom DOM implementation do it behind your back.  What I ultimately 
really want is to have our DOM implementation function identically to 
any other DOM you could bootstrap using JAXP, and offload functionality 
like AutoLink into another layer, preferably into an XQuery engine, 
where the behavior is easily coded, instead of using Java to do it.  
Personally, I don't like XQuery, and would prefer it if XUpdate and 
XSelect were the standards, but I'm not the one who influences the XML 
world :-)

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database)
Creator - Project Labrador (XML Object Broker)


Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Kimbro Staken <ks...@dbxmlgroup.com>.
On Saturday, January 5, 2002, at 04:01 AM, Stefano Mazzocchi wrote:
> My point was not to remove CORBA from the picture (BTW, is there
> anybody here who is using XIndice from CORBA in a real-life
> application?) but to indicate my impression that time spent on a webDAV
> connection would have been better spent. No offense intended, just a
> consideration from the document-oriented world where CORBA will never
> even enter.
>

Everybody who uses the XML:DB API uses CORBA behind the scenes, which 
basically means everybody is using it. I don't know of anyone using the 
CORBA API directly and I wouldn't encourage anyone to do so since we want 
to get rid of CORBA. Now getting rid of CORBA does not mean getting rid of 
that layer. CORBA provides an essential function to the server and that 
function could not be entirely fulfilled by webdav.

While webdav would be nice for document oriented applications, dbXML was 
not really designed or conceived for those applications nor has the 
majority of the interest in the server been for those types of 
applications. This isn't to deny that both webdav and document oriented 
applications are important, it is to deny that they are the only 
applications that should be targeted. I'm all for adding webdav as an 
option, but you're wrong in saying that our time would have been better 
spent there. In fact you are the only person who has ever "really" wanted 
webdav. It had come up in the past but it was never a real solid request 
from any user of the software. Now it is.

> That's a good point, but again, I'm questioning the darwinistic
> evolutionary process of this effort: do what people ask, not what
> architectural elegance suggests or W3C recommends.

And we've had far more requests for W3C XML Schema than for Relax NG. I'm 
not a fan of XML Schema either, but that doesn't change the fact that it is 
what is being asked for. I'm with Tom though: if we can do things in a schema 
language independent manner, that should be the target.
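
Schema language independence probably means nothing fancier than a small
contract the engine calls without knowing what grammar sits behind it, e.g.
(hypothetical; nothing like this exists in the code today):

import org.w3c.dom.Document;

/** One implementation per grammar language: DTD, W3C XML Schema, RELAX NG, ... */
public interface DocumentValidator {
    /** Returns the violations found; an empty array means the document conforms. */
    String[] validate(Document document) throws Exception;
}

A collection could then be bound to "a validator" in its configuration without
the core engine ever depending on a particular schema language.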

>
> I agree with you on the fact that the engine internals should deal with
> validation. Just like Cocoon doesn't validate stuff by default.
>

Let's not get too caught up in just focusing on validation here. Validation 
is just part of the schema equation. There's also the data-typing issue to 
consider. This will be particularly important with XQuery. In fact I'd say 
data typing is even more important than validation for data oriented apps, 
but you can't really apply types without the structure of the document 
being known. This means some level of schema support has to be built into 
the server.

Just to be clear, in no way am I suggesting that the server should 
"require" a schema. In fact I'd consider requiring schemas to be 
destroying what I value most about the server.

I agree it would be cool if validation could be done at either client or 
server under the control of the developer. For data oriented apps having 
robust schema support on the server will be essential though.

> The content management system I'd like to have could be built in two 
> ways:
>
>  1) single layer: XIndice includes all the required functionality.
>
>  2) double layer: XIndice is the db engine, something else wraps it and
> performs CMS operations like access control, workflow management, data
> validation, versioning, etc.

> Separation of Concerns clearly indicates that the second option is the
> best. This has been my view of the issues since May 2000, when I first
> took a serious look at dbXML as the engine for such a system.
>

Yes number 2 is clearly the way to go.

> This is why I wanted XIndice over to Apache: separation of concerns is a
> great way to do parallel design and increase productivity and give users
> more choice, but it can't work without *solid* contracts between the
> systems that interoperate.
>
> So, what I'm asking, is *NOT* to turn XIndice into a CMS, not at all!

Good, because I certainly wouldn't agree with that.

> What I'd like to see is XIndice remaining *very* abstract on the XML
> model, but without sacrificing performance and making it possible to
> implement more complex systems on top.
>

Absolutely, that's the whole point. Xindice is about flexibility.

>
> Absolutely. Still, please, let's try to avoid a pissing contest with the
> RDBMS communities and lead the way for those grounds where the relational 
> model fits, but with a very bad twist.
>

I agree, I don't want to get into this battle either. However, that doesn't 
mean that an XML database is not useful in data oriented applications. 
The simple fact that you have semi-structured data is incredibly valuable 
for many applications that are nothing like a CMS. They're still data 
oriented applications though. Just by building a database it doesn't 
automatically mean that you have to suddenly start chanting "death to 
RDBMS".

>>>
>>>  - web services
>>>  - content management systems
>>
>> Don't forget health care, legal documents, and scientific applications.
>
> These are all examples of the above two.
>

Heh, heh, there is no way that I'll buy into the idea that the only two 
places where Xindice is useful are web services and CMS. There's more to 
XML data management than that.

> XUpdate is a way to express deltas, differences between trees.
>
> In the data-centric world, people are used to sending deltas: change this
> number to this other one, append this new address, remove this credit
> card from the valid list.

> In the document-centric world, people are used to thinking of files, not
> about their diffs.

> CVS is a great system because it does all the differential processing on
> documents by itself, transparently.
>
> Now, the use of a delta-oriented update language isn't necessarily bad
> as a 'wire-transport' (much like CVS sends compressed diffs between the
> client and the server) but it definitely isn't useful by itself without
> some application-level adaptation.
>
> Now, let me give you a scenario I'd like to see happening: imagine having
> this CMS system implemented and you provide a WebDAV view of your
> database.
>
> You connect to this 'web folder' (both Windows, Linux and MacOSX come
> with the ability to mount webdav hosts as if they were file system
> folders), you browse it and you save your file from your favorite XML
> editor (or even using stuff like Adobe Illustrator for SVG).
>
> The CMS will control your accessibility (after authentication or using
> client side certification, whatever), perform the necessary steps
> defined on that folder by the workflow configurations (for example,
> sending email to the editor and placing the document with a status of
> 'to be reviewed') and save the document.
>

In this scenario though, wouldn't you actually want the webdav impl at the 
CMS layer and not built into Xindice itself?

The flow would be.

client <-> webdav <-> CMS <-> XML:DB API <-> CORBA <-> Xindice

With the goal of making it

client <-> webdav <-> CMS <-> XML:DB API <-> SOAP <-> Xindice

or optionally

client <-> webdav <-> CMS <-> SOAP <-> Xindice

Personally, I'd like to see webdav available as a module for Xindice. I'm 
not sure it needs to be there by default, but maybe it does. I just don't 
know if it makes sense for the scenario you describe above. Going from the 
CMS to Xindice via the XML:DB API would be much more efficient than going 
through webdav.

> Now, can I use XIndice to provide the storage system underneath this
> CMS?
>
> For example, in order to have a webdav view I need the ability to have
> 'node flavors': a node can be a 'folder' (currently done with
> collections), what is a 'document' and what is a 'document fragment' and
> what is a symlink to another document fragment.
>

It seems you would model most of these at the application level. Do you 
think the database needs to support more than just collections and 
documents? If so what and why?

> How can I perform access control at the node level without duplicating
> the information at the CMS level?

Why do you need node level access control for a CMS? That seems awfully 
fine grained control and it will be extremely complex to administer and 
expensive to implement. It's basically like asking to have column level 
access control for an RDBMS.

> how can I perform versioning without
> having to duplicate every document entirely?

I think having versioning in the database would be pretty useful for many 
different applications.

> Currently, whenever the CMS saves something on top of another document,
> it has to call for the document, perform the diff, get the XUpdate and
> send that.

You can replace the whole document if you want via the XML:DB API. Use of 
XUpdate is completely optional.
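
Roughly like this (collection URI and document id are placeholders, and the
driver class name is the Xindice one as documented; substitute whatever the
current build actually ships):

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.modules.XMLResource;

public class ReplaceDocument {
    public static void main(String[] args) throws Exception {
        // Register the XML:DB driver.
        Database driver = (Database)
                Class.forName("org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
        DatabaseManager.registerDatabase(driver);

        Collection col =
                DatabaseManager.getCollection("xmldb:xindice:///db/press-releases");
        try {
            // Storing a resource under an existing id replaces the old document wholesale.
            XMLResource doc = (XMLResource)
                    col.createResource("release-20010212", XMLResource.RESOURCE_TYPE);
            doc.setContent("<press-release date=\"20010212\">...</press-release>");
            col.storeResource(doc);
        } finally {
            col.close();
        }
    }
}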

>
> I'm not asking to remove XUpdate from the feature list, but to give the
> appropriate tools depending on the uses.
>

Well that is fine, just don't say something is useless when it is only 
useless to you. :-) It isn't like XUpdate is the only way to change the 
content in the server.

> Yes, you are right saying that XQuery does include this functionality,
> but I suggest you consider the following scenario:
>
> <db:database xmlns:db="xindice#internal" xmlns:cms="CMS">
>
>  <legal db:type="folder">
>   <copyright db:type="document" db:version="10.2"
> db:last-modified="20010223">
>     This is copyright info and blah blah...
>   </copyright>
>  </legal>
>
>  <press db:type="folder">
>   <press-releases db:type="folder">
>    <press-release date="20010212" author="blah"
>      db:type="document" db:version="10.2" db:last-modified="20010213"
>      cms:status="published">
>     <title>XIndice 2.0 released!</title>
>     <content>
>      <p>blah blah blah</p>
>      <p><db:link href="/legal/copyright[text()]"/></p>
>     </content>
>    </press-release>
>   </press-releases>
>  </press>
>
> </db:database>
>
> then, you can ask for the document
>
>  /press/press-releases/press-release[@date = '20010212']
>
> and you get
>
>  <press-release>
>   <title>XIndice 2.0 released!</title>
>   <content>
>    <p>blah blah blah</p>
>    <p>This is copyright info and blah blah...</p>
>   </content>
>  </press-release>
>
> which allows your users to avoid probably 200 pages of XQuery syntax to
> accomplish the same task (and also, probably, be much faster!).
>

Is your goal here to have the database be specified in XML or just to have 
the linking? For the database being specified in XML, that is a bad idea, 
but I don't think that is what you were really trying to convey.  For the 
linking that actually already exists and has since dbXML 0.2, but we call 
it experimental because there are a lot of issues with it.

1. It requires db specific tags in the XML documents. For some apps this 
is OK, for many it is not.
2. If you use XLink to solve problem 1 then you lose the ability to 
include XLinks that should be passed through to the client.
3. There is a problem between views on the document. Basically you need 
different views when editing a document vs. retrieving a document. Webdav 
has/had the same problem with dynamic pages; it may be fixed in a later 
spec, I'm not sure.
4. Runaway expansion of links (i.e. circular links) could have some very 
nasty results and could be difficult to detect.
5. Related to above but applicable even in cases where circular links do 
not exist, linking could bring large portions of the database into memory 
in cases where that would not be the desired behavior.
6. You have no way to express a relationship that you did not prewire into 
your data model.

Solutions are possible for most of these things and I'm not sure I agree 
with Tom that this should be abandoned for XQuery. I see them as being 
complementary if implemented correctly. For instance you could use linking 
as a mechanism to optimize XQuery evaluation by prewiring some of the 
relationships. Likewise XQuery can be used to express relationships that 
are not known via linking. I like the flexibility of having both, if the 
linking issues can be resolved acceptably.

> Without appropriate hooks for caches, any data storage system is
> destined not to scale in real life systems.
>
> I suggest you place the above two features very high in the todo list 
> or you'll find people very disappointed when they start getting
> scalability problems and you can't give them solutions to avoid
> saturation.
>

No disagreement at all here. I already consider those high priority. It's 
really a matter of exposing it through the API more than anything else.

>

Kimbro Staken
XML Database Software, Consulting and Writing
http://www.xmldatabases.org/


Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Stefano Mazzocchi <st...@apache.org>.
Tom, first of all, many thanks for your polite and useful reply. 

I used a somewhat inflammatory tone on purpose to test the 'heat
dissipation' capabilities of this community (which is a big part of my
job as an ASF sponsor), and the results indicate that my sponsoring job
will definitely be an easy one :)

I already had the feeling this was the case (Sam as well, we talked
about it privately) but if even discussing the very technical foundation
of the effort doesn't create negative energy, there is nothing that will
destroy a community.

Enough for the community points.

Now, with my ASF hat removed, back to the technical points.

Tom Bradford wrote:
> 
> Stefano Mazzocchi wrote:
> > I see a native XML database as an incredibly great DBMS for
> > semi-structured data and an incredibly poor DBMS for structured data.
> 
> I don't think anyone's debating that, though I wouldn't use the label
> 'incredibly poor' for structured data, especially since the definition
> of what structured data is can't be answered by relational DBs
> either...  I don't consider normalization and joins as being structure,
> so much as I consider it to be a rigid decomposition of structure.

Good point.

> > Corba? no thanks, I need WebDAV.
> 
> As much as all of us hate it, CORBA absolutely has its uses.  We could
> never get away with wire-compression if we were using a 'service the
> world' WebDAV style approach.  Wire compression has bought us
> performance gains, though not enough to justify keeping it exclusively.

My point was not to remove CORBA from the picture (BTW, is there
anybody here who is using XIndice from CORBA in a real-life
application?) but to indicate my impression that the time would have
been better spent on a WebDAV connection. No offense intended, just a
consideration from the document-oriented world where CORBA will never
even enter.
 
> > Joins? no thanks, I need document fragment aggregation.
> 
> In the context of XML, I think these are the same.

In terms of functionality you might be right; in terms of performance,
well, I'm not as optimistic as you seem to be.

> > XMLSchemas? no thanks, I need infoset-neutral RelaxNG validation.
> 
> Personally, and I'm just reiterating things I've said in the past, I
> hate W3C XML Schemas, and many others do as well.  

Yep. I've never heard anybody say the opposite.

> I don't want to have
> to put ourselves in a position where we're forced to make a choice on
> any one validation mechanism to the detriment of our users.  

That's a good point, but again, I'm questioning the darwinistic
evolutionary process of this effort: do what people ask, not what
architectural elegance suggests or W3C recommends.

> So if we
> can continue to push validation to the client application, that's the
> track we should take... for a couple of important reasons: (1)
> Performance... validation is slow, Bogging down the server to perform it
> can only cause problems, and (2) Choice: If we standardize on W3C
> Schemas, then we exlude support for other schema specifications.  I
> think that's unwise, especially with the major backlash that XML Schemas
> has received.

I agree with you on the fact that the engine internals should not deal with
validation. Just like Cocoon doesn't validate stuff by default.

The content management system I'd like to have could be built in two
ways:

 1) single layer: XIndice includes all the required functionality.

 2) double layer: XIndice is the db engine, something else wraps it and
performs CMS operations like access control, workflow management, data
validation, versioning, etc.

Separation of Concerns clearly indicates that the second option is the
best. This has been my view of the issues since May 2000, when I first
took a serious look at dbXML as the engine for such a system.

This is why I wanted to bring XIndice over to Apache: separation of concerns
is a great way to do parallel design, increase productivity and give users
more choice, but it can't work without *solid* contracts between the
systems that interoperate.

So, what I'm asking, is *NOT* to turn XIndice into a CMS, not at all!
What I'd like to see is XIndice remaining *very* abstract on the XML
model, but without sacrificing performance and making it possible to
implement more complex systems on top.

> > If you have structured data, you can't beat the relational model. This
> > is the result of 50 years of database research: do we *really* believe
> > we are smarter/wiser/deeper-thinkers than all the people that worked on
> > the database industry since the 50's?
> 
> One might argue that the relational database industry hasn't learned
> very much in the decades that it's been around.  Not that I'm saying XML
> databases are better, but relational databases were created to solve the
> problems of the databases of their time.  That time has passed.  There
> are still a lot of applications that have the problem that relational
> databases are trying to solve, but there are many applications that have
> the problem that XML databases are trying to solve.  Further still,
> there are apps that no database can adequately solve.

Absolutely. Still, please, let's try to avoid a pissing contest with the
RDBMS communities and lead the way on those grounds where the relational
model fits only with a very bad twist.

For example, I've seen a clever implementation of an XML database on top of
a relational DB using the parent-child relation of nodes. The problem was
transforming XPath queries into SQL queries with one inner join per
'slash' in the XPath. Go figure the performance :)
 
> > I see two big fields where XIndice can make a difference (and this is
> > the reason why I wanted this project to move under Apache in the first
> > place!):
> >
> >  - web services
> >  - content management systems
> 
> Don't forget health care, legal documents, and scientific applications.

These are all examples of the above two.

> These are three areas where Xindice has organically found a home in
> since its creation.

Of course.
 
> >  - one big tree with nodes flavor (following .NET blue/red nodes):
> > follows the design patterns of file systems with folders, files,
> > symlinks and such. [great would be the ability to dump the entire thing
> > as a huge namespaced XML file to allow easy backup and duplication]
> 
> >  - node-granular and ACL-based authorization and security [great would
> > be the ability to make nodes 'transparent' for those people who don't
> > have access to see them]
> >
> >  - file system-like direct access (WebDAV instead of useless XUpdate!)
> > [great for editing solutions since XUpdate requires the editor to get
> > the document, perform the diff and send the diff, while the same
> > operation can be performed by the server with one less connection, this
> > is what CVS does!]
> 
> Woah!  Stop right there.  XUpdate is far from useless, and your
> explaination of how it works, in the context of Xindice is incorrect.

No, I think you didn't get my point (see below).

> When you perform an XUpdate query, it's sent to the server which
> performs all of the work.  Never is a document sent to the client except
> for a summary of how many nodes were touched by the update.  It actually
> performs very well, because you can modify every single document in a
> collection, taking several different actions, with a single command.

XUpdate is a way to express deltas: differences between trees. 

In the data-centric world, people are used to sending deltas: change this
number to this other one, append this new address, remove this credit
card from the valid list.

In the document-centric world, people are used to think of files, not
about their diffs. 

CVS is a great system because it does all the differential processing on
documents by itself, transparently.

Now, the use of a delta-oriented update language isn't necessarily bad
as a 'wire transport' (much like CVS sends compressed diffs between the
client and the server), but it definitely isn't useful by itself without
some application-level adaptation.

Now, let me give you a scenario I'd like to see happening: imagine having
this CMS implemented, with a WebDAV view of your database.

You connect to this 'web folder' (Windows, Linux and Mac OS X all come
with the ability to mount WebDAV hosts as if they were file system
folders), you browse it and you save your file from your favorite XML
editor (or even using stuff like Adobe Illustrator for SVG).

The CMS will control your access (after authentication or using
client-side certificates, whatever), perform the necessary steps
defined on that folder by the workflow configuration (for example,
sending email to the editor and placing the document with a status of
'to be reviewed') and save the document.

Now, can I use XIndice to provide the storage system underneath this
CMS?

For example, in order to have a WebDAV view I need the ability to have
'node flavors': a node can be a 'folder' (currently done with
collections), a 'document', a 'document fragment', or a symlink to
another document fragment.

How can I perform access control at the node level without duplicating
the information at the CMS level? How can I perform versioning without
having to duplicate every document entirely?

Currently, whenever the CMS saves something on top of another document,
it has to fetch the document, perform the diff, build the XUpdate and
send that.

I'm not asking to remove XUpdate from the feature list, but to give the
appropriate tools depending on the uses.
 
> >  - internal aggregation of document fragments (the equivalent of file
> > system symlinks) [content aggregation at the database level will be much
> > faster than aggregation at the publishing level, very useful for content
> > that must be included in the same place... should replace the notion of
> > XML entities]
> 
> We have this functionality in a very experimental form.  It's called
> AutoLinking.  It's been around for a while, but it's going away at some
> point, to be replaced by XQuery.  The problem with it is that you have
> to modify the structure of your XML content, so it can't be treated as
> data.  XQuery will allow this aggregation using the data in the
> documents rather than instructions within the document.  Beyond that,
> there's nothing stopping somebody from using XLink, its just not a task
> that the server will perform because of the passive nature of XLinks.

Yes, you are right saying that XQuery does include this functionality,
but I suggest you consider the following scenario:

<db:database xmlns:db="xindice#internal" xmlns:cms="CMS">

 <legal db:type="folder">
  <copyright db:type="document" db:version="10.2"
db:last-modified="20010223">
    This is copyright info and blah blah...
  </copyright>
 </legal>

 <press db:type="folder">
  <press-releases db:type="folder">
   <press-release date="20010212" author="blah" 
     db:type="document" db:version="10.2" db:last-modified="20010213" 
     cms:status="published">
    <title>XIndice 2.0 released!</title>
    <content>
     <p>blah blah blah</p>
     <p><db:link href="/legal/copyright[text()]"/></p>
    </content>
   </press-release>
  </press-releases>
 </press>
 
</db:database>

then, you can ask for the document

 /press/press-releases/press-release[@date = '20010212']

and you get

 <press-release>
  <title>XIndice 2.0 released!</title>
  <content>
   <p>blah blah blah</p>
   <p>This is copyright info and blah blah...</p>
  </content>
 </press-release>

which allows your users to avoid probably 200 pages of XQuery syntax to
accomplish the same task (and also, probably, be much faster!).
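
To make the client side of this concrete, here is a rough, untested sketch of
how such a query could be issued through the XML:DB API that Xindice exposes.
The driver class and URI scheme are the usual Xindice ones, but the collection
path is made up for this example, and whether the db:link is actually expanded
depends on the server's (experimental) linking support, not on the API:

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.base.Resource;
import org.xmldb.api.base.ResourceIterator;
import org.xmldb.api.base.ResourceSet;
import org.xmldb.api.modules.XPathQueryService;

public class PressReleaseLookup {
    public static void main(String[] args) throws Exception {
        // Register the Xindice driver with the XML:DB DatabaseManager.
        Database db = (Database) Class.forName(
            "org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
        DatabaseManager.registerDatabase(db);

        // Hypothetical collection holding the press releases from the scenario.
        Collection col = DatabaseManager.getCollection(
            "xmldb:xindice:///db/press/press-releases");

        // Ask for the press release by date, as in the scenario above.
        XPathQueryService service =
            (XPathQueryService) col.getService("XPathQueryService", "1.0");
        ResourceSet results = service.query("/press-release[@date = '20010212']");

        // Print whatever comes back; whether the copyright text is inlined
        // depends on the (experimental) link expansion, not on this client code.
        ResourceIterator it = results.getIterator();
        while (it.hasMoreResources()) {
            Resource res = it.nextResource();
            System.out.println(res.getContent());
        }
        col.close();
    }
}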

> >  - native metadata support (last modified time, author, etc..) [vital
> > for any useful caching system around the engine!]
> 
> Some of this is already available, there's no way to expose it currently
> though.
>
> >  - node-granular event triggers [inverts the control of the database:
> > when something happens the database does something, useful mostly to
> > avoid expensive validity lookup for cached resources]
> 
> We talked about this early on in developing the product, but decided to
> put it on a back burner for a while... probably for the same reason we
> decided to shelve any specification validation system.

Without appropriate hooks for caches, any data storage system is
destined not to scale in real life systems.

I suggest you place the above two features very high on the todo list,
or you'll find people very disappointed when they start hitting
scalability problems and you can't give them solutions to avoid
saturation.

> > In short: I'd like to have a file system able to decompose XML documents
> > and store each single node as a file, scale to billions of nodes and
> > perform fast queries with XPath-like syntaxes.
> 
> This is not to far from where we are at the moment.  Nodes are
> individually addressable, but we cluster them into Documents for
> atomicity, much like an object database will cluster objects together in
> a way that ensures optimal I/O performance.
> 
> > This is my vision.
> 
> Now if this can work within the framework of my vision then nobody'll
> get hurt. :-)

Absolutely! That's why this project is here in the first place :)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Stefano Mazzocchi <st...@apache.org>.
ericjs@rcn.com wrote:

<snip/>

> As to performance, this may be the bias of my projects and experience, but I think
> performance is most crucial in the accessing / query of documents, not the writing of
> documents where the validation would take place. 

Well, if the data that comes out is different from the data that came in
(think of aggregation!), you need validation in both directions to ensure
complete integrity.

At the same time, if aggregation is performed using 'internal symlinking
nodes', validation can be performed when entering the document (after
augmenting the infoset with the aggregated data).

The same can be said even for XQuery aggregation, if the XQuery templates
are known to the DB: it could be possible to perform validation when
entering a document by looking at the queries that 'cross-cut' that
document, performing them and validating the resulting documents.

> In most systems, large slow-
> validating documents are not going to be added to the system with anything like the
> frequency that accesses will take place. Frequent writes are more likely in systems
> with smaller more data-centric document-records, whose validation shouldn't be as
> time consuming. Small, frequent updates to large documents is another issue which
> might require the development of validation methods that don't revalidate the entire
> document. 

Hmmm, not sure this is entirely possible. I'll have to think more about
it.

> Performance should be a concern but not such a large one to negate the
> need for server-side validation entirely.

Absolutely agreed, validation is a must have, but I'm still not sure
*where* (I mean, at what level) it should live.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Monday 07 January 2002 12:34, Kimbro Staken wrote:
> . . .
> I'd like to hear some opinions on DTDs, though. So far we've been very anti-
> DTD for the server. Does this make sense or should more attention be paid
> there?
> . . .

I think DTDs are still in wide use (mostly users coming from an SGML 
background) and shouldn't be ruled out.

Anyway, making validation pluggable is probably the most important thing; 
IMHO the validation mechanism should also allow custom Java-based validation 
components, which might be very useful, for example, to validate against 
external databases.
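
As a rough illustration of what such a pluggable hook could look like (every
name below is invented for the example; nothing like this interface exists in
Xindice today):

import org.w3c.dom.Document;

/**
 * Hypothetical sketch of a pluggable validation component, along the lines
 * suggested above. Implementations could wrap a DTD or schema validator, or
 * run arbitrary Java checks such as lookups against an external database.
 */
public interface DocumentValidator {

    /** Thrown when a document fails whatever check the component applies. */
    class ValidationException extends Exception {
        public ValidationException(String message) {
            super(message);
        }
    }

    /** Validate a document before it is stored in the named collection. */
    void validate(String collectionPath, Document document)
        throws ValidationException;
}

The server (or a layer wrapping it) would simply call every validator
registered for a collection before accepting a store or update.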

-- 
 -- Bertrand Delacrétaz, www.codeconsult.ch
 -- web technologies consultant - OO, Java, XML, C++






Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Stefano Mazzocchi <st...@apache.org>.
Murray Altheim wrote:

> I don't think Xindice itself needs any of these validation facilities
> built-in, as how and when validation occurs, and what type of validation
> (markup or content), how strict, etc. are definitely application-specific
> issues. For example, one might have varying levels of strict schemas for
> a series of relating document types for different purposes. There may
> be even different schema languages used at different stages of processing
> for even the same markup language (eg., if you don't need content
> validation once the document has been created, you gain performance to
> then perform only structural/markup validation).

I agree. Validation should not be included in the db engine, but should
act as a sort of 'firewall' around it.
 
> Xindice is a core technology that can be utilitized in many places within
> an application framework, and applications vary so widely that it's almost
> impossible to generalize. It'd be sad to see Xindice shackled with
> components that are better attached to an application itself (ie., to the
> application engine that uses Xindice).

Absolutely agreed.

At the same time, it would be sad to see XIndice made less useful by
internal engine limitations (see my next email on internal technology
for this).

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Murray Altheim <mu...@sun.com>.
Kimbro Staken wrote:
> On Friday, January 4, 2002, at 09:44 AM, ericjs@rcn.com wrote:
[...]
> > The W3C Xml Schema may well not be the best tool for this job, and I also
> > would be
> > queasy about tying the server to some such standard at this point in time.
> >  In an ideal
> > future, a mature xml database would support validation against any of the
> > major
> > schema languages. Clearly that would be over-ambitious at this point, but
> > perhaps
> > the server could be equipped with hooks for implementers to install the
> > validation
> > mechanisms of their choice.
> 
> I think this is the likely path we'll pursue. There seems to be two major
> languages worth paying attention to, RelaxNG and W3C XML Schema. If both
> of those are supported most issues will be covered.
> 
> I'd like to hear some opinions on DTDs, though. So far we've been very anti-
> DTD for the server. Does this make sense or should more attention be paid
> there?

Perhaps I could summarize what these three schema languages are good
at:

  DTDs:  validation of markup structure but almost no content (except
         enumerated attribute values). But built into XML and every 
         validating XML parser, fairly easy to use and understand. 
         Also has a pretty terse syntax, though not in instance syntax
         (if that's an issue to you). Tried and true and boring, but
         it does exactly what it was designed to do and is in very
         wide application. It's what most XML languages are specified 
         in, as it's a good specification language. TimBL has forced
         this to change (there's a requirement that XML Schema, a 
         constraint language, now be used for all W3C specifications,
         though I'm not certain how rigidly this is being enforced).

  XML Schema: an extremely complex specification that provides both 
         markup and content validation at the expense of that complexity.
         If you aren't doing much content validation then this is really
         overkill. In markup syntax, but almost impossible to author by
         hand, except for trivial document types. Is this a solid 
         technology, and do implementations interoperate reliably? Very
         good questions.

  RelaxNG: IMO kind of a middle ground between the two. A tight specification
         with a mathematical underpinning, in instance syntax and relatively
         easy to learn and author in. Uses the data typing facilities of
         XML Schema (part 2) and written by a small team of some of the
         world's experts in markup (such as Murata Makoto, James Clark,
         Norm Walsh, etc.).

I don't think Xindice itself needs any of these validation facilities
built-in, as how and when validation occurs, and what type of validation
(markup or content), how strict, etc. are definitely application-specific
issues. For example, one might have varying levels of strict schemas for
a series of related document types for different purposes. There may
even be different schema languages used at different stages of processing
for even the same markup language (e.g., if you don't need content 
validation once the document has been created, you gain performance by
then performing only structural/markup validation).
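
As an illustration of keeping validation at the application layer, here is a
minimal, untested sketch that switches on validation in a Xerces SAX parser
before the content is ever handed to the database. The feature URIs are the
standard SAX and Xerces ones; the rest of the wiring is just one possible
setup:

import org.apache.xerces.parsers.SAXParser;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

public class ValidatingParse {
    public static void main(String[] args) throws Exception {
        SAXParser parser = new SAXParser();

        // Standard SAX feature: validate against the document's DTD.
        parser.setFeature("http://xml.org/sax/features/validation", true);
        // Xerces feature: also validate against a W3C XML Schema if present.
        parser.setFeature("http://apache.org/xml/features/validation/schema", true);

        // Fail loudly on validity errors instead of silently continuing.
        parser.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                System.err.println("warning: " + e.getMessage());
            }
            public void error(SAXParseException e) throws SAXParseException {
                throw e;
            }
            public void fatalError(SAXParseException e) throws SAXParseException {
                throw e;
            }
        });

        // Only documents that survive this parse get passed on to the store.
        parser.parse(args[0]);
    }
}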

Xindice is a core technology that can be utilized in many places within 
an application framework, and applications vary so widely that it's almost
impossible to generalize. It'd be sad to see Xindice shackled with 
components that are better attached to an application itself (i.e., to the
application engine that uses Xindice).

Murray

...........................................................................
Murray Altheim                         <mailto:murray.altheim&#x40;sun.com>
XML Technology Center, Java and XML Software
Sun Microsystems, Inc., MS MPK17-102, 1601 Willow Rd., Menlo Park, CA 94025

            Corporations do not have human rights, despite the 
          altogether too-human opinions of the US Supreme Court.

RE: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by "Timothy M. Dean" <td...@visi.com>.

> -----Original Message-----
> From: Kimbro Staken [mailto:kstaken@dbxmlgroup.com] 
> Sent: Monday, January 07, 2002 5:34 AM
> To: xindice-dev@xml.apache.org
> Subject: Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]
> I'd like to hear some opinions on DTDs, though. So far we've 
> been very anti 
> DTD for the server. Does this make sense or should more 
> attention be paid 
> there?
> 

Personally, I don't see much need for DTDs in my projects. Most of
the projects I've worked on have used DTDs for their validation in the
past, but have found them too limited in scope to do the kinds of
validation they need. XML Schemas, while certainly being more
complicated and harder to learn, have made it much easier to
implement all the needed validation within the schema. Most of the
projects I've been involved with have some DTDs remaining but are
migrating towards XML Schemas.

I suppose there are a lot of DTD-based validating apps out there, and if
there is a significant need to support those apps it would be reasonable
to get DTDs working in Xindice. However, I would definitely place DTD
support at a lower priority than XML Schemas.

Just my .02

- Tim



Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Kimbro Staken <ks...@dbxmlgroup.com>.
On Friday, January 4, 2002, at 09:44 AM, ericjs@rcn.com wrote:
>
> These are very valid reasons, but I think it is very important for any 
> database to be
> able to offer data integrity and consistency. To me, ensuring that a 
> document is valid
> against some schema, is equivalent and just as essential as rdb's ability 
> to enforce
> constraints. And I would go further (in keeping with inevitable document 
> composablity
> / fragment aggregation needs) and insist that inter-document consistency 
> checking is
> needed.
>
> The W3C Xml Schema may well not be the best tool for this job, and I also 
> would be
> queasy about tying the server to some such standard at this point in time.
>  In an ideal
> future, a mature xml database would support validation against any of the 
> major
> schema languages. Clearly that would be over-ambitious at this point, but 
> perhaps
> the server could be equipped with hooks for implementers to install the 
> validation
> mechanisms of their choice.

I think this is the likely path we'll pursue. There seem to be two major 
languages worth paying attention to: RelaxNG and W3C XML Schema. If both 
of those are supported, most issues will be covered.

I'd like to hear some opinions on DTDs, though. So far we've been very 
anti-DTD for the server. Does this make sense, or should more attention be 
paid there?

>
> As to performance, this may be the bias of my projects and experience, 
> but I think
> performance is most crucial in the accessing / query of documents, not 
> the writing of
> documents where the validation would take place. In most systems, large 
> slow-
> validating documents are not going to be added to the system with 
> anything like the
> frequency that accesses will take place. Frequent writes are more likely 
> in systems
> with smaller more data-centric document-records, whose validation shouldn't 
> be as
> time consuming. Small, frequent updates to large documents is another 
> issue which
> might require the development of validation methods that don't revalidate 
> the entire
> document. Performance should be a concern but not such a large one to 
> negate the
> need for server-side validation entirely.
>

You're probably not too far off here and I agree, we shouldn't let 
performance concerns stand in the way of getting the proper functionality 
in place. That's really been our goal all along anyway.

> Eric
>
>
Kimbro Staken
XML Database Software, Consulting and Writing
http://www.xmldatabases.org/


Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by er...@rcn.com.
On 4 Jan 2002 at 9:00, Tom Bradford wrote:

> Personally, and I'm just reiterating things I've said in the past, I
> hate W3C XML Schemas, and many others do as well.  I don't want to
> have to put ourselves in a position where we're forced to make a
> choice on any one validation mechanism to the detriment of our users. 
> So if we can continue to push validation to the client application,
> that's the track we should take... for a couple of important reasons:
> (1) Performance... validation is slow, Bogging down the server to
> perform it can only cause problems, and (2) Choice: If we standardize
> on W3C Schemas, then we exlude support for other schema
> specifications.  I think that's unwise, especially with the major
> backlash that XML Schemas has received.

These are very valid reasons, but I think it is very important for any database to be 
able to offer data integrity and consistency. To me, ensuring that a document is valid 
against some schema is equivalent to, and just as essential as, an RDB's ability to enforce 
constraints. And I would go further (in keeping with inevitable document composability 
/ fragment aggregation needs) and insist that inter-document consistency checking is 
needed. 

The W3C XML Schema may well not be the best tool for this job, and I also would be 
queasy about tying the server to some such standard at this point in time. In an ideal 
future, a mature XML database would support validation against any of the major 
schema languages. Clearly that would be over-ambitious at this point, but perhaps 
the server could be equipped with hooks for implementers to install the validation 
mechanisms of their choice.

As to performance, this may be the bias of my projects and experience, but I think 
performance is most crucial in the accessing / query of documents, not the writing of 
documents where the validation would take place. In most systems, large slow-
validating documents are not going to be added to the system with anything like the 
frequency that accesses will take place. Frequent writes are more likely in systems 
with smaller more data-centric document-records, whose validation shouldn't be as 
time consuming. Small, frequent updates to large documents is another issue which 
might require the development of validation methods that don't revalidate the entire 
document. Performance should be a concern but not such a large one to negate the 
need for server-side validation entirely.

Eric

Re: XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Tom Bradford <br...@dbxmlgroup.com>.
Stefano Mazzocchi wrote:
> I see a native XML database as an incredibly great DBMS for
> semi-structured data and an incredibly poor DBMS for structured data.

I don't think anyone's debating that, though I wouldn't use the label
'incredibly poor' for structured data, especially since the definition
of what structured data is can't be answered by relational DBs
either...  I don't consider normalization and joins as being structure,
so much as I consider it to be a rigid decomposition of structure.

> Corba? no thanks, I need WebDAV.

As much as all of us hate it, CORBA absolutely has its uses.  We could
never get away with wire-compression if we were using a 'service the
world' WebDAV style approach.  Wire compression has bought us
performance gains, though not enough to justify keeping it exclusively.

> Joins? no thanks, I need document fragment aggregation.

In the context of XML, I think these are the same.

> XMLSchemas? no thanks, I need infoset-neutral RelaxNG validation.

Personally, and I'm just reiterating things I've said in the past, I
hate W3C XML Schemas, and many others do as well.  I don't want to have
to put ourselves in a position where we're forced to make a choice on
any one validation mechanism to the detriment of our users.  So if we
can continue to push validation to the client application, that's the
track we should take... for a couple of important reasons: (1)
Performance: validation is slow, and bogging down the server to perform it
can only cause problems; and (2) Choice: if we standardize on W3C
Schemas, then we exclude support for other schema specifications.  I
think that's unwise, especially with the major backlash that XML Schemas
has received.

> If you have structured data, you can't beat the relational model. This
> is the result of 50 years of database research: do we *really* believe
> we are smarter/wiser/deeper-thinkers than all the people that worked on
> the database industry since the 50's?

One might argue that the relational database industry hasn't learned
very much in the decades that it's been around.  Not that I'm saying XML
databases are better, but relational databases were created to solve the
problems of the databases of their time.  That time has passed.  There
are still a lot of applications that have the problem that relational
databases are trying to solve, but there are many applications that have
the problem that XML databases are trying to solve.  Further still,
there are apps that no database can adequately solve.

> I see two big fields where XIndice can make a difference (and this is
> the reason why I wanted this project to move under Apache in the first
> place!):
> 
>  - web services
>  - content management systems

Don't forget health care, legal documents, and scientific applications. 
These are three areas where Xindice has organically found a home
since its creation.

>  - one big tree with nodes flavor (following .NET blue/red nodes):
> follows the design patterns of file systems with folders, files,
> symlinks and such. [great would be the ability to dump the entire thing
> as a huge namespaced XML file to allow easy backup and duplication]



>  - node-granular and ACL-based authorization and security [great would
> be the ability to make nodes 'transparent' for those people who don't
> have access to see them]
> 
>  - file system-like direct access (WebDAV instead of useless XUpdate!)
> [great for editing solutions since XUpdate requires the editor to get
> the document, perform the diff and send the diff, while the same
> operation can be performed by the server with one less connection, this
> is what CVS does!]

Whoa!  Stop right there.  XUpdate is far from useless, and your
explanation of how it works, in the context of Xindice, is incorrect. 
When you perform an XUpdate query, it's sent to the server which
performs all of the work.  Never is a document sent to the client except
for a summary of how many nodes were touched by the update.  It actually
performs very well, because you can modify every single document in a
collection, taking several different actions, with a single command.
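
For readers who haven't used it, here is a small, untested sketch of what that
looks like from a client: one XUpdate command document handed to the XML:DB
XUpdateQueryService and applied entirely on the server. The collection path
and element names are invented for the example:

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.modules.XUpdateQueryService;

public class BulkStatusUpdate {
    public static void main(String[] args) throws Exception {
        Database db = (Database) Class.forName(
            "org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
        DatabaseManager.registerDatabase(db);

        // Hypothetical collection; no document is ever pulled client-side.
        Collection col = DatabaseManager.getCollection(
            "xmldb:xindice:///db/press/press-releases");

        // One XUpdate command, evaluated on the server against every
        // matching document in the collection.
        String xupdate =
            "<xu:modifications version=\"1.0\"" +
            "    xmlns:xu=\"http://www.xmldb.org/xupdate\">" +
            "  <xu:update select=\"/press-release[@date='20010212']/@author\">" +
            "    someone-else" +
            "  </xu:update>" +
            "</xu:modifications>";

        XUpdateQueryService service =
            (XUpdateQueryService) col.getService("XUpdateQueryService", "1.0");
        long modified = service.update(xupdate);
        System.out.println("Nodes modified: " + modified);
        col.close();
    }
}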

>  - internal aggregation of document fragments (the equivalent of file
> system symlinks) [content aggregation at the database level will be much
> faster than aggregation at the publishing level, very useful for content
> that must be included in the same place... should replace the notion of
> XML entities]

We have this functionality in a very experimental form.  It's called
AutoLinking.  It's been around for a while, but it's going away at some
point, to be replaced by XQuery.  The problem with it is that you have
to modify the structure of your XML content, so it can't be treated as
data.  XQuery will allow this aggregation using the data in the
documents rather than instructions within the document.  Beyond that,
there's nothing stopping somebody from using XLink, it's just not a task
that the server will perform because of the passive nature of XLinks.

>  - native metadata support (last modified time, author, etc..) [vital
> for any useful caching system around the engine!]

Some of this is already available, there's no way to expose it currently
though.

>  - node-granular event triggers [inverts the control of the database:
> when something happens the database does something, useful mostly to
> avoid expensive validity lookup for cached resources]

We talked about this early on in developing the product, but decided to
put it on a back burner for a while... probably for the same reason we
decided to shelve any specification validation system.

> In short: I'd like to have a file system able to decompose XML documents
> and store each single node as a file, scale to billions of nodes and
> perform fast queries with XPath-like syntaxes.

This is not too far from where we are at the moment.  Nodes are
individually addressable, but we cluster them into Documents for
atomicity, much like an object database will cluster objects together in
a way that ensures optimal I/O performance.
 
> This is my vision.

Now if this can work within the framework of my vision then nobody'll
get hurt. :-)

> Now, with my years-old asbesto underwear on, I'll be ready for your
> comments :)

-- 
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (formerly dbXML)

XIndice 2.0 [was Re: Data or Documents for Xindice 2.0]

Posted by Stefano Mazzocchi <st...@apache.org>.
DISCLAIMER: personal and potentially inflammable opinions inside.

Kimbro Staken wrote:

<skip/>

> This is actually an important question that affects the overall
> development of Xindice into the future. When Tom and I were developing
> dbXML we definitely leaned in the direction of XML as data. This is why we
> don't really care about DTDs and such. Now we need to decide if that is
> the right thing to continue forward in the future or if a more XML
> document oriented perspective is in order.
> 
> The form of Xindice 1.0 is pretty much set, we've put down the ground work
> and presented one potential path. Now this project needs to decide what is
> the right path to move down from here. It certainly isn't a black and
> white situation, but we do need to try to get a clearer picture so that we
> have some guidelines to help with decisions like this.
> 
> This is really a question about how the server is being used today or more
> likely how it would be used if it did X, Y and Z.  What kind of
> applications are people building? What kind do you want to be building?

I see a native XML database as an incredibly great DBMS for
semi-structured data and an incredibly poor DBMS for structured data.

Corba? no thanks, I need WebDAV.

Joins? no thanks, I need document fragment aggregation.

XMLSchemas? no thanks, I need infoset-neutral RelaxNG validation.

If you have structured data, you can't beat the relational model. This
is the result of 50 years of database research: do we *really* believe
we are smarter/wiser/deeper-thinkers than all the people who worked in
the database industry since the '50s?

I personally don't.

Back to the point: didn't you ever have the feeling that LDAP was crap
but you couldn't find a better way to do those things?

Great, you smelled the problem.

Did you ever try to store and quickly retrieve and compose the
fragments of *millions* of documents with a relational solution? 

are you still sane? lucky you.

But there is more: try to go to a Swiss bank and convince them to install
a native XML DB instead of their relational one. OK, let's aim lower:
go to your financial department and convince them to move away from
their Oracle (or even from their Excel files, for &deity;'s sake!) to
a native XML DB.

The entire XML community is plagued by the 'data vs. document' debate,
but this is *NOT* the problem: documents are data. Period. The fact that
you use the same syntax for both should make it clear already.

The *real* issue is "fully-structured vs. semi-structured" data.

Or, using more understandable terms: "table-oriented vs. tree-oriented"
data

                                     - o -

I see two big fields where XIndice can make a difference (and this is
the reason why I wanted this project to move under Apache in the first
place!):

 - web services
 - content management systems

Interestingly enough, the two ASF members who pushed for this project to
happen (Sam and myself) push exactly in those directions, Sam for web
services, myself for CMS.

And if you think about it, these are exactly those realms where
table-oriented data fits very badly since almost all data is
tree-oriented (hierarchies of nodes).

IMO, an XML DB is nothing more than a mix between a filesystem++ and
LDAP and should try to replace those two: file systems for deeply nested
node clusters (otherwise called "semi-structured documents") and LDAP
for deeply nested single nodes (for example, user profiles)

Guess what: .NET will work on a native XML db exactly to provide a
storage system for those tree-style data (user profiles, passport data,
user pictures, email documents, etc..)

And guess what again: the most useful example of use of XIndice is as a
repository for Cocoon documents. Note that Cocoon already provides
hard-core technologies for adapting relational data to the XML world,
but users find XIndice much more attractive for their tree-oriented data
while remaining loyal to their RDBMS for table-oriented data (and use
Cocoon to adapt the SQL queries to the XML world).

And note I didn't even touch the issues of legacy data, legacy SQL
knowledge, market inertia, complexity of the XML model, stupidity of the
XMLSchema spec, XML hype, etc, etc.

                                     - o -

This is my feature-list for XIndice 2.0:

 - one big tree with nodes flavor (following .NET blue/red nodes):
follows the design patterns of file systems with folders, files,
symlinks and such. [great would be the ability to dump the entire thing
as a huge namespaced XML file to allow easy backup and duplication]

 - node-granular and ACL-based authorization and security [great would
be the ability to make nodes 'transparent' for those people who don't
have access to see them]

 - file system-like direct access (WebDAV instead of useless XUpdate!)
[great for editing solutions since XUpdate requires the editor to get
the document, perform the diff and send the diff, while the same
operation can be performed by the server with one less connection, this
is what CVS does!]

 - internal aggregation of document fragments (the equivalent of file
system symlinks) [content aggregation at the database level will be much
faster than aggregation at the publishing level, very useful for content
that must be included in the same place... should replace the notion of
XML entities]

 - native metadata support (last modified time, author, etc..) [vital
for any useful caching system around the engine!]

 - node-granular event triggers [inverts the control of the database:
when something happens the database does something, useful mostly to
avoid expensive validity lookup for cached resources]

In short: I'd like to have a file system able to decompose XML documents
and store each single node as a file, scale to billions of nodes and
perform fast queries with XPath-like syntaxes.

This is my vision.

Now, with my years-old asbestos underwear on, I'll be ready for your
comments :)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



RE: Data or Documents for Xindice 2.0 (was Re: XSD or DTD validation?)

Posted by "Timothy M. Dean" <td...@visi.com>.

> -----Original Message-----
> From: Kimbro Staken [mailto:kstaken@dbxmlgroup.com] 

> This is really a question about how the server is being used 
> today or more 
> likely how it would be used if it did X, Y and Z.  What kind of 
> applications are people building? What kind do you want to be 
> building?
> 

I am currently developing a web-based application that manages contacts
for users and groups of users. My data is most naturally represented
using XML documents; it would really complicate things if I had to map
to a relational DB. In most cases, all of the documents within a
collection have the same structure.

For my current application, schema validation is not absolutely
required. I have only one application that populates/reads the DB, and I
can code/debug that application to ensure that only valid documents are
stored. However, I could see using this technology for more projects in
which a single database serves many different applications, and in this
case it would be strongly desirable to insulate applications from the
"mistakes" of other applications. Once an application is completed, it
generally expects that its database will be populated with good data. If
new applications are being developed that are not quite debugged yet,
they could easily introduce bad data that might break existing
applications. Allowing schema-based validation to be supported in the DB
server would go a long way to address these problems.

I do consulting for many different clients, many of whom are working on
web applications backed by shared data. While it is true that a large
percentage of these clients require their data to be stored in their
enterprise relational DB (usually Oracle), there is at least a
significant minority of clients that would be amenable to using
alternate technologies such as XML databases. The ability to validate at
the DB server level would make it significantly easier to sell this
technology to a lot of the clients I work with.

For what it's worth, I would only support the idea of schema-based
validation if certain conditions were met:
1) Validation should not be a requirement. For apps not needing
validation, they should be able to avoid it
2) For apps that do not require validation, there should be little or no
overhead introduced by native support for validation
3) Schemas should be in some way assignable to an entire collection, so
that all documents within the collection can be validated against a
single schema. Making them assignable to individual documents is fine,
but would not be a requirement for my needs

- Tim


RE: Data or Documents for Xindice 2.0 (was Re: XSD or DTD validation?)

Posted by Mike Mortensen <mm...@appsware.com>.
Yes.  I think we are in substantial agreement.  I was trying to be careful in my wording, choosing to describe the documents in the collection as being similar in content and structure (there exists a natural association among them).

I would also point out that there is nothing that compels them to be similar.  If validation is not required (as you point out), documents of any arbitrary structure or content can be placed in a collection.

The same argument for refactoring similar classes into a super class and its sub-classes can be made here.  There's something vexing about a collection of similar documents, each with its own DTD or schema.  Due to the similarity of the documents the DTD or schema is meant to describe, the DTDs or schemas must themselves be similar.  It's the value-less redundancy that nags at us and causes us to look for a way to "clean" it up with a single definition that describes them all.

-----Original Message-----
From: Mark J. Stang [mailto:markstang@earthlink.net]
Sent: Thursday, January 03, 2002 9:52 PM
To: xindice-dev@xml.apache.org
Subject: Re: Data or Documents for Xindice 2.0 (was Re: XSD or DTD validation?)

I think one view is that a collection is homogeneous.  And it appears that this
is Mike's view, correct me if I am wrong.   In general, most of the documents
in my collections could be described by a DTD.   However, I have one collection
that is a collection of different types of documents.   I would prefer not to
have to create an individual collection for a single document.   I would also
not want to be constrained in having only one type of document in a collection.

In one of my collections, I am storing RepairOrders.   I also have a document that
is a list of all the "Open" RepairOrders.   I either have to have a very slick DTD to
cover both or put my list in another collection.   Seems artificial to have a single
collection for one document.

DTDs can be very useful when you are receiving documents from the outside
world.   They can help maintain the correctness of the data.   They remind me
of flexible schemas.

I for one do not need any of the above or the overhead that comes with it.
I don't have outside documents.   I will rely on developers and QA for the
correctness of my documents.   And I choose XML because it gives me
the flexibility to morph my document into any form that fits my Customer, not
my DBAs or Developers.

I see xml documents as data that comes in many formats.   In so many formats
that a DTD would be useless.

+1 for NOT requiring DTDs and defining collections as being ANY document.
DTDs as optional with no overhead for not using them is fine with me.

Mike Mortensen wrote:

> I believe that choice made (validating the collection) is the correct one.  It most closely represents what happens in the real world.  Even with widely different data and usage, it still makes sense to validate the collection.
>
> For example, if used in business, invoices generated and sent to customers could be stored in Xindice.  It is appropriate to validate the collection since all documents are, by their nature, similar in content and structure.  The case is likewise true if Xindice were used instead to store the chapters of a book.  Each chapter has similar content and structure.  We could just as easily throw the periodic table of the elements into Xindice and science would give "thumbs up" to the collection approach.
>
> I recognized that there may be occasions where the content is substantially dissimilar.  In this case, we simply put the documents into separate collections and still get the desired validation outcome.
> +1 for the path taken as the logical choice.

Re: Data or Documents for Xindice 2.0 (was Re: XSD or DTD validation?)

Posted by "Mark J. Stang" <ma...@earthlink.net>.
I think one view is that a collection is homogeneous.  And it appears that this
is Mike's view, correct me if I am wrong.   In general, most of the documents
in my collections could be described by a DTD.   However, I have one collection
that is a collection of different types of documents.   I would prefer not to
have to create an individual collection for a single document.   I would also
not want to be constrained in having only one type of document in a collection.

In one of my collections, I am storing RepairOrders.   I also have a document that
is a list of all the "Open" RepairOrders.   I either have to have a very slick DTD to
cover both or put my list in another collection.   Seems artificial to have a single
collection for one document.

DTDs can be very useful when you are receiving documents from the outside
world.   They can help maintain the correctness of the data.   They remind me
of flexible schemas.

I for one do not need any of the above or the overhead that comes with it.
I don't have outside documents.   I will rely on developers and QA for the
correctness of my documents.   And I choose XML because it gives me
the flexibility to morph my document into any form that fits my Customer, not
my DBAs or Developers.

I see XML documents as data that comes in many formats.   In so many formats
that a DTD would be useless.

+1 for NOT requiring DTDs and defining collections as being ANY document.
DTDs as optional with no overhead for not using them is fine with me.

Mike Mortensen wrote:

> I believe that choice made (validating the collection) is the correct one.  It most closely represents what happens in the real world.  Even with widely different data and usage, it still makes sense to validate the collection.
>
> For example, if used in business, invoices generated and sent to customers could be stored in Xindice.  It is appropriate to validate the collection since all documents are, by their nature, similar in content and structure.  The case is likewise true if Xindice were used instead to store the chapters of a book.  Each chapter has similar content and structure.  We could just as easily throw the periodic table of the elements into Xindice and science would give "thumbs up" to the collection approach.
>
> I recognized that there may be occasions where the content is substantially dissimilar.  In this case, we simply put the documents into separate collections and still get the desired validation outcome.
> +1 for the path taken as the logical choice.


RE: Data or Documents for Xindice 2.0 (was Re: XSD or DTD validation?)

Posted by Mike Mortensen <mm...@appsware.com>.
I believe that the choice made (validating the collection) is the correct one.  It most closely represents what happens in the real world.  Even with widely different data and usage, it still makes sense to validate the collection.

For example, if used in business, invoices generated and sent to customers could be stored in Xindice.  It is appropriate to validate the collection since all documents are, by their nature, similar in content and structure.  The case is likewise true if Xindice were used instead to store the chapters of a book.  Each chapter has similar content and structure.  We could just as easily throw the periodic table of the elements into Xindice and science would give "thumbs up" to the collection approach.

I recognize that there may be occasions where the content is substantially dissimilar.  In this case, we simply put the documents into separate collections and still get the desired validation outcome.  
+1 for the path taken as the logical choice.

RE: Data or Documents for Xindice 2.0 (was Re: XSD or DTD validation?)

Posted by Steven Noels <st...@outerthought.org>.
> -----Original Message-----
> From: Kimbro Staken [mailto:kstaken@dbxmlgroup.com]
> Sent: vrijdag 4 januari 2002 0:59
> To: xindice-dev@xml.apache.org
> Subject: Data or Documents for Xindice 2.0 (was Re: XSD or DTD
> validation?)

> This is actually an important question that affects the overall
> development of Xindice into the future. When Tom and I were developing
> dbXML we definitely leaned in the direction of XML as data. This is why we
> don't really care about DTDs and such. Now we need to decide if that is
> the right thing to continue forward in the future or if a more XML
> document oriented perspective is in order.

> This is really a question about how the server is being used today or more
> likely how it would be used if it did X, Y and Z.  What kind of
> applications are people building? What kind do you want to be building?

XIndice should go for documents, just as XHive, Tamino & Excelon do. The XML, DBMS
& Data market is already oversaturated with numerous XML/Java/RDBMS binding
frameworks, and the big RDBMS vendors (Ora, MS, IBM) are continuously adding
useful XML support to their DB engines, which will be mainly used for XML &
Data applications.

Documents or semi-structured datasets is the way to go, which implies *some*
Schema awareness inside XIndice.

I believe XIndice will be used quite heavily as the storage layer of a Web
CMS, perhaps with Cocoon2 running on top of it.

Cheers,

</Steven>


Data or Documents for Xindice 2.0 (was Re: XSD or DTD validation?)

Posted by Kimbro Staken <ks...@dbxmlgroup.com>.
On Thursday, January 3, 2002, at 02:11 PM, Timothy M. Dean wrote:

> I can't see any use for per-document schemas in my application(s), but
> if others see the use in it who am I to dispute that. I would think that
> per-collection validation would be more the norm, so that any attempt to
> support per-document validation would be in addition to (and not instead
> of) per-collection validation.
>

Whether it makes sense or not depends on the view you take of the database 
itself. You can view the database as a DBMS for XML data or as a 
repository for XML documents. If you view it as XML data, then making it 
like traditional databases with per-collection constraints makes sense. 
However, if you take the XML document view, then your validation is 
attached to the document instance and separating it would probably be 
unexpected.

This is a kind of schizophrenic pull that has troubled the whole native XML 
database industry. There are a lot of document-oriented things that just 
don't make a lot of sense in a database (e.g., DTDs, external parsed 
entities), and there are a lot of database-oriented things that don't exist 
in XML (e.g., joins, declarative updates). You can even see this if you look 
at specs like XQuery. Even though XQuery is a data-oriented spec and is 
supposed to be for databases, it still takes a very document-oriented 
approach to a lot of things.

This is actually an important question that affects the overall 
development of Xindice into the future. When Tom and I were developing 
dbXML we definitely leaned in the direction of XML as data. This is why we 
don't really care about DTDs and such. Now we need to decide if that is 
the right thing to continue forward in the future or if a more XML 
document oriented perspective is in order.

The form of Xindice 1.0 is pretty much set, we've put down the ground work 
and presented one potential path. Now this project needs to decide what is 
the right path to move down from here. It certainly isn't a black and 
white situation, but we do need to try to get a clearer picture so that we 
have some guidelines to help with decisions like this.

This is really a question about how the server is being used today or more 
likely how it would be used if it did X, Y and Z.  What kind of 
applications are people building? What kind do you want to be building?


> Did the discussions of the past yield any results? Was there a consensus
> on a preferred direction, even if nobody has worked on it yet? I would
> be willing to take a look at some implementations if someone can point
> me to discussions of how the desired functionality should work.
>
> - Tim
>
>
Kimbro Staken
XML Database Software, Consulting and Writing
http://www.xmldatabases.org/


RE: XSD or DTD validation?

Posted by "Timothy M. Dean" <td...@visi.com>.
I can't see any use for per-document schemas in my application(s), but
if others see the use in it who am I to dispute that. I would think that
per-collection validation would be more the norm, so that any attempt to
support per-document validation would be in addition to (and not instead
of) per-collection validation.

Did the discussions of the past yield any results? Was there a consensus
on a preferred direction, even if nobody has worked on it yet? I would
be willing to take a look at some implementations if someone can point
me to discussions of how the desired functionality should work.

- Tim


> -----Original Message-----
> From: Kimbro Staken [mailto:kstaken@dbxmlgroup.com] 
> Sent: Thursday, January 03, 2002 12:57 PM
> To: xindice-dev@xml.apache.org
> Subject: Re: XSD or DTD validation?
> 
> 
> We've discussed this a lot in the past, but there isn't 
> currently any work 
> going on for it.  We had always talked about having an entire 
> collection 
> constrained to a particular schema. This may or may not be 
> the right way 
> to do things, I'm not sure. An argument can easily be made for per 
> document schemas, but then you lose some of the optimization 
> facilities 
> you'd have with a full collection schema.
> 
> On Wednesday, January 2, 2002, at 09:32 PM, Timothy M. Dean wrote:
> 
> > I was also curious about the same thing. I would be willing to 
> > contribute to this effort if necessary. Has anyone else given much 
> > thought to this idea?
> >
> > - Tim
> >
> > -----Original Message-----
> > From: Jerry Wang [mailto:jwang@elegant.ca]
> > Sent: Wednesday, January 02, 2002 6:33 PM
> > To: xindice-dev@xml.apache.org
> > Subject: XSD or DTD validation?
> >
> >
> > Any plan to support XSD or DTD validation when creating or updating 
> > document? I think it will be good for example we bound each 
> collection 
> > with an XSD or DTD.
> >
> > -Jerry Wang
> > Elegant Solution Consulting Inc.
> >
> >
> >
> >
> >
> Kimbro Staken
> XML Database Software, Consulting and Writing 
> http://www.xmldatabases.org/
> 
> 


Re: XSD or DTD validation?

Posted by Kimbro Staken <ks...@dbxmlgroup.com>.
We've discussed this a lot in the past, but there isn't currently any work 
going on for it.  We had always talked about having an entire collection 
constrained to a particular schema. This may or may not be the right way 
to do things, I'm not sure. An argument can easily be made for 
per-document schemas, but then you lose some of the optimization facilities 
you'd have with a full collection schema.
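
As an illustration of the per-collection idea, here is a rough, untested
sketch of a client-side wrapper that validates every document against a single
collection-wide schema before storing it. The JAXP validation API is used
purely for illustration, and the schema file and collection URI are made up:

import java.io.File;
import java.io.FileInputStream;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.modules.XMLResource;

public class ValidatedStore {
    public static void main(String[] args) throws Exception {
        File doc = new File(args[0]);

        // One schema for the whole collection (hypothetical location).
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("invoices.xsd"));
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(doc));   // throws if invalid

        // Only documents that validated make it into the collection.
        Database db = (Database) Class.forName(
            "org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
        DatabaseManager.registerDatabase(db);
        Collection col = DatabaseManager.getCollection(
            "xmldb:xindice:///db/invoices");

        XMLResource res = (XMLResource)
            col.createResource(doc.getName(), XMLResource.RESOURCE_TYPE);
        res.setContent(readFile(doc));
        col.storeResource(res);
        col.close();
    }

    // Read the whole file as a UTF-8 string for setContent().
    private static String readFile(File f) throws Exception {
        FileInputStream in = new FileInputStream(f);
        byte[] buf = new byte[(int) f.length()];
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                break;
            }
            off += n;
        }
        in.close();
        return new String(buf, "UTF-8");
    }
}

Pushing the same check into the server, bound to the collection's
configuration, would give the insulation between applications discussed
elsewhere in this thread.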

On Wednesday, January 2, 2002, at 09:32 PM, Timothy M. Dean wrote:

> I was also curious about the same thing. I would be willing to
> contribute to this effort if necessary. Has anyone else given much
> thought to this idea?
>
> - Tim
>
> -----Original Message-----
> From: Jerry Wang [mailto:jwang@elegant.ca]
> Sent: Wednesday, January 02, 2002 6:33 PM
> To: xindice-dev@xml.apache.org
> Subject: XSD or DTD validation?
>
>
> Any plan to support XSD or DTD validation when creating or updating
> document? I think it will be good for example we bound each collection
> with an XSD or DTD.
>
> -Jerry Wang
> Elegant Solution Consulting Inc.
>
>
>
>
>
Kimbro Staken
XML Database Software, Consulting and Writing
http://www.xmldatabases.org/


RE: XSD or DTD validation?

Posted by "Timothy M. Dean" <td...@visi.com>.
I was also curious about the same thing. I would be willing to
contribute to this effort if necessary. Has anyone else given much
thought to this idea?

- Tim

-----Original Message-----
From: Jerry Wang [mailto:jwang@elegant.ca] 
Sent: Wednesday, January 02, 2002 6:33 PM
To: xindice-dev@xml.apache.org
Subject: XSD or DTD validation?


Any plan to support XSD or DTD validation when creating or updating
document? I think it will be good for example we bound each collection
with an XSD or DTD.

-Jerry Wang
Elegant Solution Consulting Inc.