Posted to xindice-dev@xml.apache.org by Gianugo Rabellino <gi...@apache.org> on 2002/11/27 12:15:58 UTC

[RT] Xindice 2.0

This is probably a good time to start thinking about Xindice 2.0. The
major number switch should come from a major evolution of the current
architecture: we now have a fairly solid XML database, but there is
still a lot of work to do to make Xindice a viable solution for the
use cases that our candidate users have anticipated.

This is just a starting point to try and set things straight, so that
together we can come up with a sort of guideline for future
developments. Please feel free to fire at will, and remember that these
are just Random Thoughts. :-)

There are some major points that I would like to address in the near
future. In no particular order, I think we need to work on:

1. XML:DB API
This is not strictly a Xindice issue, yet I think that, since dbXML
first and Xindice afterwards are the de facto standards for this API,
the XML:DB APIs should be the primary way to access the database. I
still think it's really important to have a vendor-neutral API for
accessing XML databases, so I would like to invest more and more in
this: we might try to push on the xapi-dev list and see what happens.
If we fail, it will always be possible to run wild and do our own
extensions.

I think that we need to extend the API in order to accommodate the needs
anticipated by the users. These points, at least, are crucial to me:

- metadata: we need a neutral way to query metadata for collections and
resources. I like David's solution of having a MetaData object with a
set of fixed, basic metadata (author, creation, modification), a set
of "properties" and a custom XML-based system: we don't really need much
more than that, but we do need to refine it in order to come up with
a complete solution that addresses the most basic needs (I, for one,
would like to add the collection and the document ID to the MetaData).
Once the MetaData object is carved in stone we can decide how to get it:
I'm all in favor of something like getMetaData() calls on Collection
and Resource.

- transaction support: the API should have basic support for atomic
operations and for transactions;

- capabilities (is that the right English term?). There should be a way
to query the Database (or maybe the Collection?) to find out whether it
supports a given feature (e.g. transactions). The JDBC parallel would
be the DatabaseMetaData object, even if I'm not really sure about its
plethora of supports* methods. The alternative is a SAX-feature-like
(URI-based) set of capabilities and a single method to query for
support, with pseudocode such as:

if (database.supports(Capabilities.TRANSACTIONS)) {
	begin()/work()/commit()
} else {
	workAndHopeForTheBest()
}
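
To make the pseudocode a bit more concrete, here is a minimal Java
sketch of a URI-based capability query. Everything in it is an
assumption made for illustration: the Database interface shown is not
the XML:DB one, and the Capabilities constants, their example.org URIs
and SimpleDatabase are hypothetical names.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical interface: a single method answers all capability queries.
interface Database {
    /** Returns true if the feature identified by the URI is supported. */
    boolean supports(String capabilityUri);
}

// Made-up capability URIs, in the style of SAX feature identifiers.
final class Capabilities {
    static final String TRANSACTIONS = "http://example.org/xmldb/capabilities/transactions";
    static final String METADATA = "http://example.org/xmldb/capabilities/metadata";

    private Capabilities() {}
}

// Toy implementation that simply remembers the set of URIs it supports.
final class SimpleDatabase implements Database {
    private final Set<String> supported;

    SimpleDatabase(Set<String> supported) {
        this.supported = new HashSet<>(supported);
    }

    @Override
    public boolean supports(String capabilityUri) {
        return supported.contains(capabilityUri);
    }
}

class CapabilityDemo {
    /** Picks a strategy depending on what the database reports. */
    static String store(Database db) {
        if (db.supports(Capabilities.TRANSACTIONS)) {
            return "begin/work/commit";
        }
        return "workAndHopeForTheBest";
    }

    public static void main(String[] args) {
        Database db = new SimpleDatabase(Collections.singleton(Capabilities.METADATA));
        System.out.println(store(db));
    }
}
```

Compared with a supports* method per feature, new capabilities only
need a new URI, so the interface itself never has to change.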

Again: this is not exactly the right place to discuss this, but before 
going to xapi-dev I'd like to hear your opinion and put together a draft 
that comprises all our present and (possibly :-)) future needs.
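
As a starting point for that draft, the MetaData idea from the first
bullet could look roughly like the Java sketch below. It is only one
possible shape under my own assumptions: the field names, the
Describable interface and the explicit timestamps are hypothetical and
are not part of the XML:DB API or of David's proposal.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a MetaData object; all names are illustrative.
final class MetaData {
    // Fixed, basic metadata.
    private final String collection;  // path of the owning collection
    private final String documentId;  // null when describing a collection
    private final String author;
    private final Instant created;
    private Instant modified;
    // Free-form name/value properties.
    private final Map<String, String> properties = new HashMap<>();
    // Custom, application-defined XML fragment, kept verbatim.
    private String customXml;

    MetaData(String collection, String documentId, String author, Instant created) {
        this.collection = collection;
        this.documentId = documentId;
        this.author = author;
        this.created = created;
        this.modified = created;
    }

    String getCollection() { return collection; }
    String getDocumentId() { return documentId; }
    String getAuthor() { return author; }
    Instant getCreated() { return created; }
    Instant getModified() { return modified; }
    String getCustomXml() { return customXml; }

    /** Read-only view, so callers must go through setProperty(). */
    Map<String, String> getProperties() {
        return Collections.unmodifiableMap(properties);
    }

    void setProperty(String name, String value, Instant when) {
        properties.put(name, value);
        modified = when;
    }

    void setCustomXml(String xml, Instant when) {
        customXml = xml;
        modified = when;
    }
}

/** What Collection and Resource might both implement. */
interface Describable {
    MetaData getMetaData();
}

class MetaDataDemo {
    public static void main(String[] args) {
        MetaData md = new MetaData("/db/addressbook", "entry1", "someone",
                Instant.parse("2002-11-27T12:15:58Z"));
        md.setProperty("status", "draft", Instant.parse("2002-11-28T09:00:00Z"));
        System.out.println(md.getCollection() + "/" + md.getDocumentId()
                + " " + md.getProperties());
    }
}
```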

2. PERFORMANCE
Face it: we are slow. We are fair enough for small jobs but we cannot
stand high loads or huge documents, no matter how accurate your indexes
might be. I put a great deal of hope into Tom's work on Xalan DTM
(http://xml.apache.org/xalan-j/dtm.html) to improve Xindice
performance, but as of now I'm afraid that Tom is MIA too, so unless he
shows up we have no choice but to do it on our own and decide what might
be the best way to improve Xindice storage and retrieval
performance. I see some possible directions:

a. Stefano pointed me to the Lore documentation. The guys at Stanford
did a whole lot of work thinking about storage of semi-structured data;
we might borrow something from there, if it's still up to date
(http://www-db.stanford.edu/lore/);

b. DTM (http://xml.apache.org/xalan-j/dtm.html). I had a small chat with
Shane Curcuru from Xalan at ApacheCon and he was cautious about using
DTM for persistent storage. But it might be worth trying (by asking on
xalan-dev) to see if the DTM model is good enough (or can possibly be
extended) to accommodate our needs;

c. SAX events. There is almost no doubt about SAX being the most
efficient way to deal with XML, speed- and memory-wise. As of now Xindice
is heavily based on DOM (albeit compressed and finely tuned); it might be
worth investigating whether this should change. Cocoon had very good
results using SAX even for the internal cache, by compiling SAX events to
byte streams and interpreting them at a later time: see
http://cvs.apache.org/viewcvs.cgi/xml-cocoon2/src/java/org/apache/cocoon/components/sax/
and look for XMLByteStream[Compiler|Interpreter]. We might borrow that
at least for the transport of SAX events over the wire in the XML-RPC
protocol: if we have a Compiler on the server side (or, even better, if
the documents are already stored in a compiled format) and an
Interpreter on the client side, things might be a whole lot faster, esp.
when dealing with SAX-based applications such as Cocoon.
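
To illustrate the compile/interpret idea from point c, here is a small
self-contained Java sketch using only the JDK's SAX classes. It is NOT
Cocoon's XMLByteStreamCompiler/Interpreter and does not use its byte
format: the SaxBytes class, its opcodes and the layout are invented for
this example, and it only records document, element and character
events (no prefix mappings, processing instructions or ignorable
whitespace).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: the opcodes and layout are invented, not Cocoon's format.
final class SaxBytes {
    private static final int START_DOC = 0, END_DOC = 1, START_EL = 2, END_EL = 3, CHARS = 4;

    /** "Compiler": a SAX handler that serializes each event it receives. */
    static final class Recorder extends DefaultHandler {
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private final DataOutputStream out = new DataOutputStream(buffer);

        byte[] toByteArray() { return buffer.toByteArray(); }

        @Override public void startDocument() throws SAXException { writeOp(START_DOC); }
        @Override public void endDocument() throws SAXException { writeOp(END_DOC); }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            try {
                out.writeByte(START_EL);
                out.writeUTF(uri); out.writeUTF(local); out.writeUTF(qName);
                out.writeInt(atts.getLength());
                for (int i = 0; i < atts.getLength(); i++) {
                    out.writeUTF(atts.getURI(i)); out.writeUTF(atts.getLocalName(i));
                    out.writeUTF(atts.getQName(i)); out.writeUTF(atts.getType(i));
                    out.writeUTF(atts.getValue(i));
                }
            } catch (IOException e) { throw new SAXException(e); }
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            try {
                out.writeByte(END_EL);
                out.writeUTF(uri); out.writeUTF(local); out.writeUTF(qName);
            } catch (IOException e) { throw new SAXException(e); }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                out.writeByte(CHARS);
                // writeUTF caps one string at 65535 encoded bytes; fine for a sketch.
                out.writeUTF(new String(ch, start, length));
            } catch (IOException e) { throw new SAXException(e); }
        }

        private void writeOp(int op) throws SAXException {
            try { out.writeByte(op); } catch (IOException e) { throw new SAXException(e); }
        }
    }

    /** Parses XML text once and returns the recorded event stream. */
    static byte[] compile(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        Recorder recorder = new Recorder();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), recorder);
        return recorder.toByteArray();
    }

    /** "Interpreter": replays a recorded stream into any handler, without reparsing. */
    static void replay(byte[] events, ContentHandler handler) throws IOException, SAXException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(events));
        while (in.available() > 0) {
            switch (in.readByte()) {
                case START_DOC: handler.startDocument(); break;
                case END_DOC: handler.endDocument(); break;
                case START_EL: {
                    String uri = in.readUTF(), local = in.readUTF(), qName = in.readUTF();
                    AttributesImpl atts = new AttributesImpl();
                    for (int n = in.readInt(); n > 0; n--) {
                        atts.addAttribute(in.readUTF(), in.readUTF(), in.readUTF(),
                                in.readUTF(), in.readUTF());
                    }
                    handler.startElement(uri, local, qName, atts);
                    break;
                }
                case END_EL: handler.endElement(in.readUTF(), in.readUTF(), in.readUTF()); break;
                case CHARS: {
                    char[] c = in.readUTF().toCharArray();
                    handler.characters(c, 0, c.length);
                    break;
                }
                default: throw new IOException("unknown opcode");
            }
        }
    }

    /** Demo handler: rebuilds a textual trace of the events it receives. */
    static final class Trace extends DefaultHandler {
        final StringBuilder sb = new StringBuilder();
        @Override public void startElement(String u, String l, String q, Attributes a) {
            sb.append('<').append(q);
            for (int i = 0; i < a.getLength(); i++) {
                sb.append(' ').append(a.getQName(i)).append("=\"").append(a.getValue(i)).append('"');
            }
            sb.append('>');
        }
        @Override public void endElement(String u, String l, String q) { sb.append("</").append(q).append('>'); }
        @Override public void characters(char[] c, int s, int n) { sb.append(c, s, n); }
    }

    public static void main(String[] args) throws Exception {
        Trace trace = new Trace();
        replay(compile("<doc id=\"1\"><p>hi</p></doc>"), trace);
        System.out.println(trace.sb);
    }
}
```

The point of the exercise is that replay() never touches an XML
parser: whoever holds the byte array, be it a cache or a client at the
other end of the wire, can feed the events straight into its own SAX
pipeline.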

3. AAA
Badly needed, on two sides:

a. Server side: not that hard to implement, after all, at least in a
not-so-granular way. We might go the hard way with security-oriented
markup languages and node-based security, or just rely on URI-based
authentication, with a Tomcat/Slide/younameit-like role system. I'd go
for the latter: Collection-based security should be enough for most needs.

b. Transport: if we are going to have usernames and passwords flying over
the wire, we need to protect them. XML-RPC over HTTPS? CHAP? Kerberos?
Other thoughts?
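
For the server side (point a), collection-based security with roles
could be as simple as the Java sketch below. The CollectionAcl class and
its rules are hypothetical: in particular, the assumption that a grant
on a collection also covers everything beneath it is mine, and a real
design might prefer explicit denies or nearest-ancestor-wins semantics.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: access control keyed by collection path.
final class CollectionAcl {
    private final Map<String, Set<String>> rolesByCollection = new HashMap<>();

    /** Grants a role access to a collection and, implicitly, to everything below it. */
    void grant(String collectionPath, String role) {
        rolesByCollection.computeIfAbsent(collectionPath, k -> new HashSet<>()).add(role);
    }

    /** Walks from the given path up to the root, looking for a grant matching any user role. */
    boolean isAllowed(String path, Set<String> userRoles) {
        String current = path;
        while (!current.isEmpty()) {
            Set<String> allowed = rolesByCollection.get(current);
            if (allowed != null && !Collections.disjoint(allowed, userRoles)) {
                return true;
            }
            // Strip the last path segment to move to the parent collection.
            int slash = current.lastIndexOf('/');
            current = slash <= 0 ? "" : current.substring(0, slash);
        }
        return false;
    }
}

class CollectionAclDemo {
    public static void main(String[] args) {
        CollectionAcl acl = new CollectionAcl();
        acl.grant("/db/addressbook", "editor");
        Set<String> roles = Collections.singleton("editor");
        System.out.println(acl.isAllowed("/db/addressbook/entry1", roles));
        System.out.println(acl.isAllowed("/db/other/entry1", roles));
    }
}
```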

4. TRANSACTION
This is needed too. I don't know how JTA might help here: I have no
knowledge of the API and have never worked with it. Any expert around?
We would need to know not only whether JTA would do the job, but also
whether, performance-wise, it would suffice without imposing severe
penalties on the system.

======================================================================

OK, this was the first stone in the lake: I hope to spark some
discussion on it and, once we manage to agree on what we want from 2.0,
to start writing docs and code. I'm now borrowing the world-famous
asbestos underwear from Stefano & Sam and I'm eagerly waiting for your
replies.

Ciao,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by Kurt Ward <ku...@yahoo.com>.

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Vladimir R. Bossicard wrote:

> there's no "dream.xml" file but feel free to add one.

Actually, I'm wondering if it wouldn't be better to start a Wiki on 
that. It would allow for fast, live, and collaborative content editing 
for future developments. I might be able to set up one, but I have to 
check first: would you guys appreciate that?

Ciao,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by "Vladimir R. Bossicard" <vl...@apache.org>.

> Yes, definitely it was a great job. Thanks Vladimir!

I'm blushing :-)

> $ find src/documentation -name \*.xml -exec grep -i dream {} \;

there's no "dream.xml" file but feel free to add one.  Currently we have 
two todo lists:
- from the original website (copy/paste so may not be accurate anymore)
- for urgent things to be done (in the dev zone).  The idea was to show 
how close we are to a release.

+1 for dream.xml

-Vladimir

-- 
Vladimir R. Bossicard
Apache Xindice - http://xml.apache.org/xindice

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Kevin Ross wrote:

> Wow, much better website!  I don't think I have to mention what this
> will do for our appearance to the general user community, showing real
> progress.  Kudos to Vladimir and everyone else who helped out.
>
Yes, definitely it was a great job. Thanks Vladimir!

> If we can agree to set some goals, I believe that Vladimir already has
> (had?) some on one of the web pages.  When you look at forrest, it is
> the 'dreams' link.  I can't seem to find that now Vladimir, you know
> where it is?

:-?

$ find src/documentation -name \*.xml -exec grep -i dream {} \;

$

nothing found here...

> PS- I need cross-collection XQuery

Argh... that was the magic word, XQuery... or did you mean
cross-collection XPathQuery? In the former case, well, XQuery looks to
me like a giant beast, and I'm not even willing to tackle it until I'm
dead sure that the specs are final. And even then it would be pretty
hard to implement... I had a look at qexo
(http://www.gnu.org/software/qexo/) but I'm not that convinced after
all. It's not just a matter of implementing that huge spec: the real
problem is implementing it in a *fast* way, or it will be totally useless.

Ciao,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by Kevin Ross <Ke...@iVerticalLeap.com>.

Wow, much better website!  I don't think I have to mention what this 
will do for our appearance to the general user community, showing real 
progress.  Kudos to Vladimir and everyone else who helped out.

If we can agree to set some goals, I believe that Vladimir already has 
(had?) some on one of the web pages.  When you look at forrest, it is 
the 'dreams' link.  I can't seem to find that now Vladimir, you know 
where it is?

I think 'dreams' and 'todo' are a little different, in that we are
committed to delivering todo's in the immediate or near timeline.

just my 2 cents...

-Kevin

PS- I need cross-collection XQuery

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

John Merrells wrote:

> Gianugo Rabellino wrote:
>
> > Nice to see you here, John. :-)
>
>
> I'm spying on you ;-)

So am I, on the other side: we are even. ;-)

> > As for transactions, sorry, let me rephrase that, I didn't make
> > myself clear: if you want ACID then you need transactions for
> > operations that modify the database. But on a mostly-read database the
> > transaction code might well be slow, since you are not doing many
> > writes. So, even if JTA might impose a performance penalty, if the
> > model is mostly read then it might be worth using that in the
> > immediate future. Does this sound better?
>
> The read performance difference between a transacted store and an
> untransacted store should be very small.

Which is exactly my point: transactions should not impose a severe 
performance hit on read operations. I don't know why, but we keep saying 
the same thing. :-)

> I don't know much about JTA... but I thought it was an interface for
> distributed transactions?
> XA for Java?

Neither do I. I'm scratching the surface as of now.

> DB XML is released under the same license as Berkeley DB. Basically, if
> you're open source then we're open source, if you're proprietary then
> you need to buy a license. We have alpha code available if anyone wants
> to play :-)

I've been actively playing for a while, and so far so good. :-) I will
seriously consider using bdbxml as a possible backend (though it would
break the cross-platform compatibility, it's still worth a thought).

Ciao,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by John Merrells <me...@sleepycat.com>.

Gianugo Rabellino wrote:

> Nice to see you here, John. :-) 

I'm spying on you ;-)

> As for transactions, sorry, let me rephrase that, I didn't make
> myself clear: if you want ACID then you need transactions for
> operations that modify the database. But on a mostly-read database the
> transaction code might well be slow, since you are not doing many
> writes. So, even if JTA might impose a performance penalty, if the
> model is mostly read then it might be worth using that in the
> immediate future. Does this sound better?

The read performance difference between a transacted store and an
untransacted store should be very small. You'd only notice the
difference with very high concurrency and very high query rates
(typically many thousands per second). I think that you are assuming
that a transacted store means that reads have to be performed within a
transaction. This is not the case. LDAP servers typically transact
updates and don't transact reads.

I don't know much about JTA... but I thought it was an interface for
distributed transactions? XA for Java?

Transactions are also a troublesome topic to discuss, as the term is
overloaded with meaning. For your server you need to think about
transactions at three layers... internal, external, and distributed.

> BTW, John, do you know if Sleepycat made a final resolution about what
> the bdbxml license will be? I.e.: will we be able to use bdbxml as a
> backend for Xindice like MySQL does with Berkeley DB? This is one of
> the paths I'm currently exploring, but I might well be wasting my time
> here if the license doesn't allow us to use it.

DB XML is released under the same license as Berkeley DB. Basically, if
you're open source then we're open source, if you're proprietary then
you need to buy a license. We have alpha code available if anyone wants
to play :-)

John

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

John Merrells wrote:

> Gianugo Rabellino wrote:
>
> >> >4. TRANSACTION
> >
> > Surely it's not in my personal top list. I think that XML databases
> > are closer to the LDAP model than to RDBMS: most of the time the
> > operations are pure reads, not writes. Yet, if it doesn't cost too
> > much, it might be worth considering.
>
> I find that statement curious. The ratio of reads to writes doesn't
> really have any bearing on the need for transactions.... If your user
> wants their data to be durable they need a transacted store.

Nice to see you here, John. :-)

As for transactions, sorry, let me rephrase that, I didn't make
myself clear: if you want ACID then you need transactions for operations
that modify the database. But on a mostly-read database the transaction
code might well be slow, since you are not doing many writes. So, even
if JTA might impose a performance penalty, if the model is mostly read
then it might be worth using that in the immediate future. Does this
sound better?

BTW, John, do you know if Sleepycat made a final resolution about what
the bdbxml license will be? I.e.: will we be able to use bdbxml as a
backend for Xindice like MySQL does with Berkeley DB? This is one of the
paths I'm currently exploring, but I might well be wasting my time here
if the license doesn't allow us to use it.

TIA,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by John Merrells <me...@sleepycat.com>.

Gianugo Rabellino wrote:

>> >4. TRANSACTION
>
> Surely it's not in my personal top list. I think that XML databases
> are closer to the LDAP model than to RDBMS: most of the time the
> operations are pure reads, not writes. Yet, if it doesn't cost too
> much, it might be worth considering.

I find that statement curious. The ratio of reads to writes doesn't
really have any bearing on the need for transactions.... If your user
wants their data to be durable they need a transacted store.

John

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Ahmed wrote:

>
> >4. TRANSACTION
> >This is needed too.
>
>
> Surely important, might not be urgent.

It's certainly not at the top of my personal list. I think that XML
databases are closer to the LDAP model than to RDBMS: most of the time
the operations are pure reads, not writes. Yet, if it doesn't cost too
much, it might be worth considering.

Ciao,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by Ahmed <ah...@baizid.org>.

On Wed, 2002-11-27 at 12:15, Gianugo Rabellino wrote:
> There are some major points that I would like to address in the next 
> future. 

> 4. TRANSACTION
> This is needed too. 

Surely important, might not be urgent.
MySQL became widely used without any transaction support.
Of course, it has finally been added.

-- 
Ahmed <ah...@baizid.org>

Re: [RT] Xindice 2.0

Posted by Gary Frederick <ga...@jsoft.com>.

I am happy with whatever comes out.

I used XUpdate by applying XSLT stylesheets to XML to generate XUpdate commands. I then ran the commands in the xsh shell to update the XML. I was not using XUpdate from a program and had problems (that I don't remember...) with the command line stuff that came with Xindice.

Would sixdml have something similar, where I can create commands and update xml from the command line?

Gary

pwilkinson@thirdfloor.com.au wrote:
>>> I like the XUpdate stuff. Where does that fit into 2.0?
>>
>> It does fit, it just doesn't really need much work, it just works. :-)
>>
>> That said, I don't like XUpdate that much, I'd rather settle for
>> sixdml.
>>
>> Ciao,
>>
>> --
>> Gianugo Rabellino
>
> I'm definitely a vote for sixdml - when we were originally writing our
> apps we ran into a number of issues that limited us with xupdate, so we
> basically gave up and now drag docs out, mess with them and then stuff
> them back to make it easier - sixdml certainly looks very interesting
> and more complete.
>
> I'd just like to say a quick thanks for all the work going on over the
> last little while - possibilities of future dev for Xindice were looking
> a bit dim for a while but it's certainly looking up - great work.
> 
> Peter Wilkinson.

RE: [RT] Xindice 2.0

Posted by pw...@thirdfloor.com.au.

> >
> > I like the XUpdate stuff. Where does that fit into 2.0?
> >
> It does fit, it just doesn't really need much work, it just works.:-)
>
> That said, I don't like that much XUpdate, I'd rather settle for
> sixdml.
> 
> Ciao,
> 
> --
> Gianugo Rabellino

I'm definitely a vote for sixdml - when we were originally writing our
apps we ran into a number of issues that limited us with xupdate, so we
basically gave up and now drag docs out, mess with them and then stuff
them back to make it easier - sixdml certainly looks very interesting
and more complete.

I'd just like to say a quick thanks for all the work going on over the
last little while - possibilities of future dev for Xindice were looking
a bit dim for a while but it's certainly looking up - great work.

Peter Wilkinson.

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Gary Frederick wrote:

> >
> > This is probably a good time to start thinking about Xindice 2.0.
>
> trim
>
> I like the XUpdate stuff. Where does that fit into 2.0?
>
It does fit, it just doesn't really need much work, it just works. :-)

That said, I don't like XUpdate that much, I'd rather settle for sixdml.

Ciao,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by Gary Frederick <ga...@jsoft.com>.


Gianugo Rabellino wrote:
> 
> This is probably a good time to start thinking about Xindice 2.0. 
trim

I like the XUpdate stuff. Where does that fit into 2.0?

Gary

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Thanks Kurt for this commit. It will be at the top of my list to test in
the next few days.

About encryption: have you tried stunnel for that 
(http://www.stunnel.org)? I'd be curious to see how it performs against 
Tomcat's SSL.

Ciao,

-- 
Gianugo


Kurt Ward wrote:

> > To make things work under SSL, create a cert for Tomcat and install it
> > using port 8443 and
> > 1. Uncomment the url starting with "https" in XindiceAdmin.java
>
> 1a. Rebuild with ant!
>
> > 2. Restart the admin tool
>
> On a side note for encryption, I have also used an SSH tunnel set up for
> Xindice.  The performance is quite a bit slower than SSL, but it's a quick
> workaround that works without the need for SSL on both 1.0 and 1.1b:
>
> ssh -f -N -L 7000:127.0.0.1:8080 192.168.100.1
>
> Changing the url to http://127.0.0.1:7000/xindice1.1b will then route
> everything through SSH.
>
> More of a speed experiment than anything else.
>
> Kurt
>

Re: [RT] Xindice 2.0

Posted by Kurt Ward <ku...@yahoo.com>.

A few changes:

> To make things work under SSL, create a cert for Tomcat and install it
> using port 8443 and
> 1. Uncomment the url starting with "https" in XindiceAdmin.java

1a. Rebuild with ant!

> 2. Restart the admin tool

On a side note for encryption, I have also used an SSH tunnel set up for
Xindice.  The performance is quite a bit slower than SSL, but it's a quick
workaround that works without the need for SSL on both 1.0 and 1.1b:

ssh -f -N -L 7000:127.0.0.1:8080 192.168.100.1

Changing the url to http://127.0.0.1:7000/xindice1.1b will then route
everything through SSH.

More of a speed experiment than anything else.

Kurt

Re: [RT] Xindice 2.0

Posted by Kurt Ward <ku...@yahoo.com>.

I have committed several changes to the scratchpad admin, including the SSL
support (sloppy at the moment, but working).

To use the code:

1. I am using JDK 1.3.1 which does not include the JSSE packages, 1.4.? has
these included.  If you are running JDK 1.3.x, you will need to install the
Sun JSSE package.
2. Run "ant build" from xml-xindice/java/scratchpad/admin
3. Run xml-xindice/java/scratchpad/admin/xindiceadmin.sh to start the admin
tool
4. Entering 'help' will return a list of available commands (not many yet!)
5. To run a batch of commands, type 'execute script_file_here'; you can see
the script file format and/or execute the sample script from
java/scratchpad/admin/test.scr
6. 'exit' to exit the app

To make things work under SSL, create a cert for Tomcat and install it using
port 8443 and
1. Uncomment the url starting with "https" in XindiceAdmin.java
2. Restart the admin tool

There are a couple other things in the XindiceAdmin.java file worth looking
at:
//XmlRpc.setKeepAlive(true);
//XmlRpc.setDebug(true);

Every command is using a stopwatch borrowed from the main tree to display
execution times.
Let me know if you have any problems getting this to work.

Kurt

----- Original Message -----
From: "Gianugo Rabellino" <gi...@apache.org>
To: <xi...@xml.apache.org>
Sent: Tuesday, December 03, 2002 3:52 PM
Subject: Re: [RT] Xindice 2.0

> Kurt Ward wrote:
> > Kurt Ward wrote:
> >
> >
> >>> b. transport: if we are going to have usernames and passwords flying over
> >>> the wire, we need to protect them. XML-RPC over HTTPS? CHAP? Kerberos?
> >>> Other thoughts?
> >>>
> >>> XML-RPC over HTTPS is pretty straightforward and easy for users to
> >>> implement.  The interactive admin tools I have been working on already do
> >>> this using the Sun JSSE package (although I have not committed it to the
> >>> scratchpad area yet).  Creation of the random key to start an SSL connection
> >>> is slow, but the performance is satisfactory in my opinion.
> >>
> >> Hmmm... I'm wondering if this startup delay will occur on every XML-RPC
> >> request: in this case it might be unacceptable in a production
> >> environment with lots of queries. Is it possible (I'm ignorant about
> >> JSSE) to cache a session-wide random key for reuse?
> >
> > The random key is not generated on each request.  On initial startup of an
> > XML-RPC client, it takes ~3-4 seconds to generate the key.  After that, the
> > key is reused and speed degradation is not very noticeable. (Maybe 30ms or
> > so?).
>
> This is good news. I'm very curious to see how it works, if and when
> you're ready, count on me for testing. :-)
>
> Ciao,
>
> --
> Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Kurt Ward wrote:
> Kurt Ward wrote:
> 
> 
>>> b. transport: if we are going to have usernames and passwords flying over
>>> the wire, we need to protect them. XML-RPC over HTTPS? CHAP? Kerberos?
>>> Other thoughts?
>>>
>>> XML-RPC over HTTPS is pretty straightforward and easy for users to
>>> implement.  The interactive admin tools I have been working on already do
>>> this using the Sun JSSE package (although I have not committed it to the
>>> scratchpad area yet).  Creation of the random key to start an SSL connection
>>> is slow, but the performance is satisfactory in my opinion.
>>
>> Hmmm... I'm wondering if this startup delay will occur on every XML-RPC
>> request: in this case it might be unacceptable in a production
>> environment with lots of queries. Is it possible (I'm ignorant about
>> JSSE) to cache a session-wide random key for reuse?
>
> The random key is not generated on each request.  On initial startup of an
> XML-RPC client, it takes ~3-4 seconds to generate the key.  After that, the
> key is reused and speed degradation is not very noticeable. (Maybe 30ms or
> so?).

This is good news. I'm very curious to see how it works, if and when 
you're ready, count on me for testing. :-)

Ciao,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by Kurt Ward <ku...@yahoo.com>.

> Kurt Ward wrote:
>
> > b. transport: if we are going to have usernames and passwords flying over
> > the wire, we need to protect them. XML-RPC over HTTPS? CHAP? Kerberos?
> > Other thoughts?
> >
> > XML-RPC over HTTPS is pretty straightforward and easy for users to
> > implement.  The interactive admin tools I have been working on already do
> > this using the Sun JSSE package (although I have not committed it to the
> > scratchpad area yet).  Creation of the random key to start an SSL connection
> > is slow, but the performance is satisfactory in my opinion.
>
> Hmmm... I'm wondering if this startup delay will occur on every XML-RPC
> request: in this case it might be unacceptable in a production
> environment with lots of queries. Is it possible (I'm ignorant about
> JSSE) to cache a session-wide random key for reuse?

The random key is not generated on each request.  On initial startup of an
XML-RPC client, it takes ~3-4 seconds to generate the key.  After that, the
key is reused and speed degradation is not very noticeable. (Maybe 30ms or
so?).
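
A minimal sketch of the reuse described above, assuming a current JDK where JSSE ships as javax.net.ssl (the 2002 admin tool used the separately installed Sun JSSE package, and its actual code is not shown in this thread). The class name SharedTls is hypothetical. The SSLContext is set up once, which is where the secure random source gets initialised, and the same socket factory is handed to every later connection:

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical helper: pays the TLS setup cost once per JVM, then reuses it.
final class SharedTls {
    private static SSLSocketFactory factory;

    // Returns the shared socket factory, creating it on first use only.
    static synchronized SSLSocketFactory factory() {
        if (factory == null) {
            try {
                SSLContext context = SSLContext.getInstance("TLS");
                // Default key and trust managers; the SecureRandom is created here, once.
                context.init(null, null, new SecureRandom());
                factory = context.getSocketFactory();
            } catch (GeneralSecurityException e) {
                throw new IllegalStateException("TLS is not available", e);
            }
        }
        return factory;
    }

    public static void main(String[] args) {
        // Both calls return the same object, so no second setup takes place.
        System.out.println(factory() == factory());
    }
}
```

A client would pass SharedTls.factory() to each HTTPS connection it opens instead of building a new context per request.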

Kurt

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Kurt Ward wrote:

> b. transport: if we are going to have usernames and passwords flying over
> the wire, we need to protect them. XML-RPC over HTTPS? CHAP? Kerberos?
> Other thoughts?
> 
> 
> XML-RPC over HTTPS is pretty straightforward and easy for users to
> implement.  The interactive admin tools I have been working on already do
> this using the Sun JSSE package (although I have not committed it to the
> scratchpad area yet).  Creation of the random key to start an SSL connection
> is slow, but the performance is satisfactory in my opinion.

Hmmm... I'm wondering if this startup delay will occur on every XML-RPC 
request: in this case it might be unacceptable in a production 
environment with lots of queries. Is it possible (I'm ignorant about 
JSSE) to cache a session-wide random key for reuse?

Ciao,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by Kurt Ward <ku...@yahoo.com>.

> 3. AAA
> Badly needed, on two sides:
>
> a. Server side: not that hard to implement, after all, at least in a
> not-so-granular way. We might go the hard way with security-oriented
> markup languages and node based security or just rely on URI-based
> authentication, with a Tomcat/Slide/younameit-like role system. I'd go
> for the latter: Collection based security should be enough for most needs.
>
> b. transport: if we are going to have usernames and passwords flying over
> the wire, we need to protect them. XML-RPC over HTTPS? CHAP? Kerberos?
> Other thoughts?

XML-RPC over HTTPS is pretty straightforward and easy for users to
implement.  The interactive admin tools I have been working on already do
this using the Sun JSSE package (although I have not committed it to the
scratchpad area yet).  Creation of the random key to start an SSL connection
is slow, but the performance is satisfactory in my opinion.

Kurt

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Vladimir R. Bossicard wrote:

> > 1. XML:DB API
> > we might try to push on the xapi-dev list and see what happens, if we
> > fail it will be always possible to run wild and do our own extensions.
>
> I'm already in contact with the xapi-dev ml and they are responsive.
> Slowly but everyone can understand why. 

So am I. This is why I don't really want to spark a discussion there 
until we have a concrete proposal (at least about metadata).

> On a general POV, I would like to take the opportunity of 2.0 for:
>
> - moving the code into the 'src' directory
> - reformatting it to adhere to the Apache conventions
> - correcting some package names (xindice.server.rpc but
> xindice.client.xmldb.xmlrpc)


+1 from me.

> What is badly lacking are:
>
> - unit tests
> - load tests
>
> Starting to write unit tests for every piece of code would help to clean
> up some spaghetti code that we have in some classes and tremendously
> increase the quality of the code.

Definitely. Will take a look at it in the next few days. At least we 
should be writing unit tests for all the new code that is brought into CVS.

Ciao,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by Kurt Ward <ku...@yahoo.com>.

----- Original Message -----
From: "Gianugo Rabellino" <gi...@apache.org>
To: <xi...@xml.apache.org>
Sent: Tuesday, December 03, 2002 4:54 AM
Subject: Re: [RT] Xindice 2.0


> Kurt Ward wrote:
> > On a general POV, I would like to take the opportunity of 2.0 for:
> >
> > - moving the code into the 'src' directory
> > - reformatting it to adhere to the Apache conventions
> > - correcting some package names (xindice.server.rpc but
> > xindice.client.xmldb.xmlrpc)
> >
> >
> > -1 on package names.  The xindice.server.rpc package has nothing to do with
> > xmldb (well, at least to the user). I don't disagree with renaming the
> > package entirely, but remember that the end goal is to have XML:DB, XML-RPC,
> > and SOAP as available methods to directly access the server, and not just
> > XML:DB so we should either have extensions (as mentioned in a previous
> > thread) or some other package naming for XML-RPC and SOAP apart from XML:DB.
>
> Makes sense. Though I still think that XML:DB should remain the primary
> way to access Xindice, we should work together with the xmldb group to
> extend the API and, if it doesn't work, make our own additions. But yes,
> it would be nice to have alternative ways of access.
>

Yes, the XML:DB API should be the primary way to access Xindice.  XML-RPC
(already in the 1.1 tree) and future SOAP messages allow non-Java users to
access the server.  There are already several applications and users using
the XML-RPC message interface for 1.0 (from Perl, PHP, etc.).  With the
removal of CORBA, this was the agreed upon solution.

Kurt

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Kurt Ward wrote:
> On a general POV, I would like to take the opportunity of 2.0 for:
> 
> - moving the code into the 'src' directory
> - reformatting it to adhere to the Apache conventions
> - correcting some package names (xindice.server.rpc but
> xindice.client.xmldb.xmlrpc)
> 
> 
> -1 on package names.  The xindice.server.rpc package has nothing to do with
> xmldb (well, at least to the user). I don't disagree with renaming the
> package entirely, but remember that the end goal is to have XML:DB, XML-RPC,
> and SOAP as available methods to directly access the server, and not just
> XML:DB so we should either have extensions (as mentioned in a previous
> thread) or some other package naming for XML-RPC and SOAP apart from XML:DB.

Makes sense. Though I still think that XML:DB should remain the primary 
way to access Xindice, we should work together with the xmldb group to 
extend the API and, if it doesn't work, make our own additions. But yes, 
it would be nice to have alternative ways of access.

Ciao,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by Kurt Ward <ku...@yahoo.com>.

> On a general POV, I would like to take the opportunity of 2.0 for:
>
> - moving the code into the 'src' directory
> - reformatting it to adhere to the Apache conventions
> - correcting some package names (xindice.server.rpc but
> xindice.client.xmldb.xmlrpc)

-1 on package names.  The xindice.server.rpc package has nothing to do with
xmldb (well, at least to the user). I don't disagree with renaming the
package entirely, but remember that the end goal is to have XML:DB, XML-RPC,
and SOAP as available methods to directly access the server, and not just
XML:DB so we should either have extensions (as mentioned in a previous
thread) or some other package naming for XML-RPC and SOAP apart from XML:DB.

Kurt

Re: [RT] Xindice 2.0

Posted by "Vladimir R. Bossicard" <vl...@apache.org>.

> 1. XML:DB API
> we might try to push on the xapi-dev list and see what happens, if 
> we fail it will be always possible to run wild and do our own extensions.

I'm already in contact with the xapi-dev ml and they are responsive. 
Slowly, but everyone can understand why.  In fact Xindice uses a patched 
version of xmldb-api (sorry, should have mentioned it).  But they are 
fixing the problem in the official code.

The problem I see with xmldb-api is that there's no official version. 
It's up to the different projects to build their version from cvs.  Not 
optimal IMO.  But this is a minor problem.

On a general POV, I would like to take the opportunity of 2.0 for:

- moving the code into the 'src' directory
- reformatting it to adhere to the Apache conventions
- correcting some package names (xindice.server.rpc but 
xindice.client.xmldb.xmlrpc)

What is badly lacking are:

- unit tests
- load tests

Starting to write unit tests for every piece of code would help to clean 
up some spaghetti code that we have in some classes and tremendously 
increase the quality of the code.

-Vladimir

-- 
Vladimir R. Bossicard
Apache Xindice - http://xml.apache.org/xindice

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Steven Noels wrote:

> > 4. TRANSACTION
> > This is needed too. I don't know how JTA might help here, I have no
> > idea of the API and never worked with it. Any expert around? We would
> > need to know not only if JTA would do the job, but also if,
> > performance-wise, it will suffice without imposing severe penalties on
> > the system.
>
>
> 5. QUERYING
> I know there is a new JSR being planned to introduce a common Querying
> API for Java XML Native databases. Whether this will overlap with xml:db
> or not, I'm not sure.

From a quick look at the JCP I couldn't find that JSR. I'd really like to 
know more about it, i.e. to understand if there is someone from Apache 
involved in it or not (if that's the case we should try our best to speak up). 
Can you point me to more information?

TIA,

-- 
Gianugo Rabellino

Re: [RT] Xindice 2.0

Posted by Steven Noels <st...@outerthought.org>.

Gianugo Rabellino wrote:

<snip/>

> 4. TRANSACTION
> This is needed too. I don't know how JTA might help here, I have no idea 
> of the API and never worked with it. Any expert around? We would need to 
> know not only if JTA would do the job, but also if, performance-wise, 
> it will suffice without imposing severe penalties on the system.

5. QUERYING
I know there is a new JSR being planned to introduce a common Querying 
API for Java XML Native databases. Whether this will overlap with xml:db 
or not, I'm not sure.

> ======================================================================
> 
> OK, this was the first stone in the lake: I hope to spark some 
> discussion on it and, once we manage to agree on what we want from 2.0, 
> to start writing docs and code. I'm now borrowing the world-famous 
> asbestos underwear from Stefano & Sam and I'm eagerly waiting for your 
> replies.

That underwear must smell yucky by now ;-)

</Steven>
-- 
Steven Noels                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at              http://radio.weblogs.com/0103539/
stevenn at outerthought.org                stevenn at apache.org

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

John Wright wrote:

> It doesn't seem like the OR-DBM system is a very good solution to
> semi-structured data (something which I'm trying to benchmark with LORE,
> Xindice & a commercial DBMS.)  With all of the OR-DBM systems providing
> fairly efficient solutions for data-centric applications, it might be
> more beneficial to focus on document-centric XML.

Good to see I'm not alone. :-)

>
> I also have some ideas regarding the storage & search structures used
> (and please forgive me if they're already in place, as I'm very new to
> this project.)

OK, you are my man! :-)

Please go ahead with ideas and suggestions: you are most welcome. As of 
now we badly need skills like yours, so please, *please*, go ahead and 
share your opinions with us.

> 1. Are the binary tree representations stored as linked trees or arrays?
> Typically, if the B-trees can be stored as arrays, the spatial locality
> for programs increases substantially.
>
> 2. Are the B-trees balanced?

The short answer to these questions is "I really don't know". I have 
never been that deep into the Xindice internals, maybe others might help 
you here.

> 3. Could we use a hybrid approach to storage and indexing, using hash
> tables, array-based B+ trees, and/or Patricia tries?

Look, I'll be very straight with you: I know little or nothing about 
hard-core programming with data structures. I've been using databases 
for a long while, but this is my first attempt at making one. While I 
try to catch up by reading papers and doing exercises, please try to 
explain to me what might be in your opinion the best solution (any comment 
about my previous email on SAX events storage?). Don't waste time 
typing: a pointer to some web documentation would be more than enough. I 
promise to learn fast. :-) And if you want to join the team, of course, 
you are more than welcome.

Ciao,

-- 
Gianugo Rabellino

RE: [RT] Xindice 2.0

Posted by John Wright <wr...@ufl.edu>.

> Xindice has its own B-Tree files for data storage and search. Could we
> consider leveraging existing RDBM systems? RDBM has been developed and
> fine tuned for so many years, and they have solved many issues that we are
> going to tackle (performance, transaction, and security).

It doesn't seem like the OR-DBM system is a very good solution to
semi-structured data (something which I'm trying to benchmark with LORE,
Xindice & a commercial DBMS.)  With all of the OR-DBM systems providing
fairly efficient solutions for data-centric applications, it might be
more beneficial to focus on document-centric XML.

I also have some ideas regarding the storage & search structures used
(and please forgive me if they're already in place, as I'm very new to
this project.)

1. Are the binary tree representations stored as linked trees or arrays?
Typically, if the B-trees can be stored as arrays, the spatial locality
for programs increases substantially.

2. Are the B-trees balanced?

3. Could we use a hybrid approach to storage and indexing, using hash
tables, array-based B+ trees, and/or Patricia tries?
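
Question 1 contrasts linked and array storage. As a generic illustration only (this is not Xindice's B-tree code, which the thread does not describe), here is a sketch of a binary search tree kept in a single array, where the children of slot i sit at slots 2i+1 and 2i+2, so no per-node pointers are followed. The class name ImplicitTree is hypothetical:

```java
// Hypothetical sketch: a search tree laid out implicitly in one int array.
final class ImplicitTree {
    private final int[] nodes;
    private int cursor; // next position to take from the sorted input

    // Builds the layout from keys that are already sorted in ascending order.
    ImplicitTree(int[] sortedKeys) {
        nodes = new int[sortedKeys.length];
        fill(sortedKeys, 0);
    }

    // In-order walk over the slots: left subtree, this slot, right subtree.
    // Assigning sorted keys in that order yields a valid search tree.
    private void fill(int[] sortedKeys, int slot) {
        if (slot >= nodes.length) {
            return;
        }
        fill(sortedKeys, 2 * slot + 1);
        nodes[slot] = sortedKeys[cursor++];
        fill(sortedKeys, 2 * slot + 2);
    }

    // Descends from slot 0 using index arithmetic instead of child references.
    boolean contains(int key) {
        int slot = 0;
        while (slot < nodes.length) {
            if (key == nodes[slot]) {
                return true;
            }
            slot = key < nodes[slot] ? 2 * slot + 1 : 2 * slot + 2;
        }
        return false;
    }

    public static void main(String[] args) {
        ImplicitTree tree = new ImplicitTree(new int[] {1, 3, 5, 7, 9, 11});
        System.out.println(tree.contains(9));
        System.out.println(tree.contains(4));
    }
}
```

The layout is fixed once built, so it suits read-mostly indexes; inserting a key would mean rebuilding the array.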

RE: [RT] Xindice 2.0

Posted by Lixin Meng <lx...@yahoo.com>.

>
> OK. So just help me understand why you would want to use a *relational*
> database if your main target is to avoid *relations*. :-) I still think

As an alternative to embedding a plain db engine, using an off-the-shelf
product gives users more choices and preserves their previous investment.

Lixin

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

John Wright wrote:

> >I have lost you here. If not on disk or memory, where are you supposed
> >to store indexes? Do you mean that actually it might not be the case to
> >have some indexes?
>
>
> Either user chosen indices, 

OK, this is exactly what is done now: in order to have an index the 
administrator has to explicitly ask for it.

> or system adaptive indices, where frequently
> run queries "create" more indices on the data being queried.  It would
> be similar to the idea that results from frequent queries are cached;
> instead of caching, new index creation could be triggered, thus giving
> sort of an on-demand system.

And this is exactly what I had in mind. I was wondering if this 
procedure should be fully automated or if it should just give the 
administrator some hints about indexes to create. Probably something in 
between would be the best bet.
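
A rough sketch of that idea, under loud assumptions: the class IndexAdvisor, its threshold, and the use of a plain string as the "query pattern" are all hypothetical, and nothing here touches the real indexer classes. It counts how often a pattern is queried and, once a threshold is reached, either marks the pattern for automatic index creation or records it as a hint for the administrator:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of on-demand indexing driven by query frequency.
final class IndexAdvisor {
    private final int threshold;
    private final boolean autoCreate;
    private final Map<String, Integer> hits = new HashMap<>();
    private final Set<String> indexed = new TreeSet<>();
    private final Set<String> suggested = new TreeSet<>();

    IndexAdvisor(int threshold, boolean autoCreate) {
        this.threshold = threshold;
        this.autoCreate = autoCreate;
    }

    // Called once per executed query with the pattern it searched on.
    void recordQuery(String pattern) {
        if (indexed.contains(pattern)) {
            return; // already served by an index
        }
        int count = hits.merge(pattern, 1, Integer::sum);
        if (count >= threshold) {
            if (autoCreate) {
                indexed.add(pattern); // a real system would build the index here
            } else {
                suggested.add(pattern); // only a hint for the administrator
            }
        }
    }

    Set<String> indexed() {
        return indexed;
    }

    Set<String> suggested() {
        return suggested;
    }

    public static void main(String[] args) {
        IndexAdvisor advisor = new IndexAdvisor(3, false);
        for (int i = 0; i < 3; i++) {
            advisor.recordQuery("//Temperature");
        }
        System.out.println(advisor.suggested());
    }
}
```

The "something in between" option could be the same counter with two thresholds: a low one that produces hints and a much higher one that triggers creation.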

Ciao,

-- 
Gianugo Rabellino

RE: [RT] Xindice 2.0

Posted by John Wright <wr...@ufl.edu>.

> I have lost you here. If not on disk or memory, where are you supposed
> to store indexes? Do you mean that actually it might not be the case to
> have some indexes?

Either user chosen indices, or system adaptive indices, where frequently
run queries "create" more indices on the data being queried.  It would
be similar to the idea that results from frequent queries are cached;
instead of caching, new index creation could be triggered, thus giving
sort of an on-demand system.

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

John Wright wrote:

> One interesting thing I've seen, particularly with the LORE project, is
> the use of multiple indices, since XML and semi-structured data have at
> least two distinct components - paths in the graph and nodes for the
> data.  This is also where you start to lose storage efficiency, though,

I don't think that storage is that much of an issue as of now. Storage 
is cheap ATM, so I'd personally focus on efficiency and speed of the 
whole engine (true... when you start having giga-sized indexes you lose 
speed too).

> and updates are similarly not very efficient... my first thought is that
> all the indices don't need to be stored on disk, or even in memory.

I have lost you here. If not on disk or memory, where are you supposed 
to store indexes? Do you mean that actually it might not be the case to 
have some indexes?

> What sort of indexing schemes are we using right now?
>
Look at org.apache.xindice.core.indexer.*, basically there are two 
indexes (b-tree based): one for element/attributes names and one for 
element/attributes values.

Ciao,

-- 
Gianugo Rabellino

RE: [RT] Xindice 2.0

Posted by John Wright <wr...@ufl.edu>.

> I agree one should avoid JOINs at all costs. If one wants to build a DOM tree
> in an RDBMS, JOINs will be inevitable (that's why I have some reservations
> about eXist). The preliminary idea in my previous email is not to build the
> DOM tree, in order to minimize the JOINs, at the price of preparing those
> XPaths when inserting the document (kind of like an index).

Why avoid joins?  True, they are costly, and it will be difficult to find
an equivalent of the relational technique of pushing selections and
projections down, but joins are a necessary evil, particularly if we are
going to pursue XQuery or almost any query language other than XPath.
That's where the efficiency of the index structures comes in!

One interesting thing I've seen, particularly with the LORE project, is
the use of multiple indices, since XML and semi-structured data have at
least two distinct components - paths in the graph and nodes for the
data.  This is also where you start to lose storage efficiency, though,
and updates are similarly not very efficient... my first thought is that
all the indices don't need to be stored on disk, or even in memory.

What sort of indexing schemes are we using right now?

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Lixin Meng wrote:

> > Do you mean that there might be a use case for a metadata that returns
> > the *whole* database content? What would happen on a database with
> > millions of documents? Is this feature available in RDBMS and JDBC? I
> > assume that you want to "clone" something like "SELECT * FROM CAT" or
> > "SHOW TABLES", am I right? If so, those commands will return you the
> > tables (in our case, roughly speaking, the Collections) but never ever
> > the whole data. Sorry if I'm not getting the point, but I feel a bit lost...
>
>
> Sorry to have confused (maybe even scared?) you. It is definitely not the
> whole database content, only the *meta* information about the structure. It
> is more like in an RDBMS, where you can get the db schema from its system
> tables (like 'select table_name from user_tables' for Oracle). For an RDBMS,
> one may only need to know table and field information. For an XML database,
> building a global meta tree will be much deeper and more expensive than that.

I see your point, yet I'm still scared (especially when using Xindice as 
a persistence engine or with Web Services, like the use cases that you 
were citing in your previous email) that this might turn into a 
performance bottleneck, with metadata requests hammering the database 
too much. We might consider it to some extent, maybe adding something 
like the WebDAV Depth concept to limit what would be returned.

> That's the beauty of virtualization. By default, we return both. If you
> think the XPath actually represents the semantic meaning of the result,
> there is no difference at the semantic level. Also, why do people want to
> create or categorize those collections in the first place? Because they want
> to give some meaning to the content. Isn't that the same idea behind those
> XML tags? Crazy?

Not crazy, actually it might make sense, but I hear some FS bells 
ringing and I foresee a dead-end alley somewhere, at least from the 
performance POV. Think about looking at all possible permutations of 
such XPaths in the database: we should look if under USA there is any 
collection called California OR any document called California 
containing /BayArea/Temperature OR any document containing 
/California/BayArea/Temperature. Then we move on to collection 
California, and we have to check if there is any collection called 
BayArea OR any document called BayArea containing /Temperature OR any 
document containing /BayArea/Temperature...

Computationally scary, don't you think? :-) Also, if we have more than 
one result, we need to return it in an intelligent way so that users 
can tell where we have collections and where we are talking about 
documents... all in all it looks at least very difficult and not user 
friendly to me.
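
To make the enumeration above concrete, here is a small sketch that lists the readings of one ambiguous path. The class name PathReadings and the wording of its output are hypothetical; it only enumerates the candidate interpretations and does not query any database. For every prefix taken as a collection path, the rest is either an XPath over any document in that collection, or a document name followed by an XPath:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: enumerate how a "virtual" path could be interpreted.
final class PathReadings {
    // Expects an absolute path such as /USA/California/BayArea/Temperature.
    static List<String> enumerate(String path) {
        String[] seg = path.substring(1).split("/");
        List<String> readings = new ArrayList<>();
        for (int k = 1; k < seg.length; k++) {
            String collection = "/" + String.join("/", Arrays.copyOfRange(seg, 0, k));
            // Reading 1: the remainder is an XPath over every document in the collection.
            readings.add("collection " + collection + ", any document, XPath /"
                    + String.join("/", Arrays.copyOfRange(seg, k, seg.length)));
            // Reading 2: the next segment names a document, the remainder is the XPath.
            if (k + 1 < seg.length) {
                readings.add("collection " + collection + ", document " + seg[k] + ", XPath /"
                        + String.join("/", Arrays.copyOfRange(seg, k + 1, seg.length)));
            }
        }
        return readings;
    }

    public static void main(String[] args) {
        for (String reading : enumerate("/USA/California/BayArea/Temperature")) {
            System.out.println(reading);
        }
    }
}
```

Each reading would cost at least one lookup against the store, and the number of readings grows with the depth of the path, which is the performance concern raised above.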

> I agree one should avoid JOINs at all costs. If one wants to build a DOM tree
> in an RDBMS, JOINs will be inevitable (that's why I have some reservations
> about eXist). The preliminary idea in my previous email is not to build the
> DOM tree, in order to minimize the JOINs, at the price of preparing those
> XPaths when inserting the document (kind of like an index).

OK. So just help me understand why you would want to use a *relational* 
database if your main target is to avoid *relations*. :-) I still think 
that, while I see that RDBMSs have been optimized for ages, a plain 
database would be the best tool for the job. But I'd most probably +1 an 
RDBMS-based implementation of *indexes* as an alternative. I don't see 
the need for having it as storage (it looks really *ugly* to me to just 
dump a BLOB into the DB).

> If the 'network latency' is
> referring to cost associated with JDBC connections, I guess it can be
> ignored at this stage, 

Not that sure. Remember that if we go to a RDBMS we are adding another 
level of indirection (client->server->RDBMS), so we need to take into 
account even that.

I don't want to reinvent the wheel. My point is that if all I have is a 
car, I need a car wheel, I don't need a truck or a bicycle wheel. :-)

Ciao,

-- 
Gianugo Rabellino

RE: [RT] Xindice 2.0

Posted by Lixin Meng <lx...@yahoo.com>.

I guess I need to clarify a little bit here.

>
> > /
> > |
> > +--/USA
> >       +--*Statistics
> >       |     |
> >       |     +--<California>
> >       |             |
> >       |             +--<BayArea>
> >       |                   |
> >       |                   +--<Temperature>
> >       |
> >       +--/California
> >               |
> >               +--*BayArea  (B)
> >                      |
> >                      +--<Temperature>
> >

The 'BayArea' in my original example is an XML node. It is not at the file
level. So, the tree is more like the following when I say both queries are
equal to each other. The '*Statistics' is just meta information.

>
> > /
> > |
> > +--/USA
> >       +--<California>
> >       |             |
> >       |             +--<BayArea>
> >       |                   |
> >       |                   +--<Temperature>
> >       |
> >       +--/California
> >               |
> >               +--<BayArea>
> >                      |
> >                      +--<Temperature>
> >

Lixin

RE: [RT] Xindice 2.0

Posted by Lixin Meng <lx...@yahoo.com>.

> Do you mean that there might be a use case for a metadata that returns
> the *whole* database content? What would happen on a database with
> millions of documents? Is this feature available in RDBMS and JDBC? I
> assume that you want to "clone" something like "SELECT * FROM CAT" or
> "SHOW TABLES", am I right? If so, those commands will return you the
> tables (in our case, roughly speaking, the Collections) but never ever
> the whole data. Sorry if I'm not getting the point, but I feel a bit lost...

Sorry to have confused (maybe even scared?) you. It is definitely not the
whole database content, only the *meta* information about the structure. It
is more like in an RDBMS, where you can get the db schema from its system
tables (like 'select table_name from user_tables' for Oracle). For an RDBMS,
one may only need to know table and field information. For an XML database,
building a global meta tree will be much deeper and more expensive than that.
Therefore, some rules need to be introduced, like only following up to n
levels or ignoring '*/HTML/*', for example.
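
A minimal sketch of those two rules, with everything in it hypothetical: the MetaNode class stands in for whatever structure would hold the meta tree, the depth limit is counted in levels below the starting node, and the ignore rule is simplified to skipping any node with a given name:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: walk a meta tree with a depth limit and an ignore rule.
final class MetaNode {
    final String name;
    final List<MetaNode> children = new ArrayList<>();

    MetaNode(String name, MetaNode... kids) {
        this.name = name;
        children.addAll(Arrays.asList(kids));
    }

    // Returns the paths reachable within 'depth' levels below 'node',
    // skipping every subtree whose root is named 'ignoredName'.
    static List<String> outline(MetaNode node, int depth, String ignoredName) {
        List<String> paths = new ArrayList<>();
        collect(node, "", depth, ignoredName, paths);
        return paths;
    }

    private static void collect(MetaNode node, String parentPath, int depth,
                                String ignoredName, List<String> out) {
        if (node.name.equals(ignoredName)) {
            return; // ignore rule: drop this node and everything under it
        }
        String path = parentPath + "/" + node.name;
        out.add(path);
        if (depth == 0) {
            return; // depth rule: do not descend any further
        }
        for (MetaNode child : node.children) {
            collect(child, path, depth - 1, ignoredName, out);
        }
    }

    public static void main(String[] args) {
        MetaNode usa = new MetaNode("USA",
                new MetaNode("California", new MetaNode("BayArea", new MetaNode("Temperature"))),
                new MetaNode("HTML", new MetaNode("body")));
        System.out.println(outline(usa, 2, "HTML"));
    }
}
```

With a depth of 2 the walk stops at /USA/California/BayArea and never reaches Temperature, which keeps a metadata request cheap even when the documents themselves are deep.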

A database with millions of documents doesn't mean the meta tree will have
millions of nodes. Otherwise, whoever is using the database in this way is
just treating it as a dumping ground. No system can help with that.


> The user possibly doesn't, but we definitely do. :-) Imagine we have a
> tree like this ("/" are Collections, "*" are Resources, "<>" are nodes
> in Resources):
>

I like the notation.

> /
> |
> +--/USA
>       +--*Statistics
>       |     |
>       |     +--<California>
>       |             |
>       |             +--<BayArea>
>       |                   |
>       |                   +--<Temperature>
>       |
>       +--/California
>               |
>               +--*BayArea
>                      |
>                      +--<Temperature>
>
> How can we decide if Joe user wanted to know the value of the element
> <Temperature> on resource "BayArea" contained inside the sub-collection
> "California" or if he wanted to query the USA collection for documents
> having an XPath of /California/BayArea/Temperature? Same XPath, but
> definitely different results...

That's the beauty of virtualization. By default, we return both. If you
think the XPath actually represents the semantic meaning of the result, there
is no difference at the semantic level. Also, why do people want to create or
categorize those collections in the first place? Because they want to give
some meaning to the content. Isn't that the same idea behind those XML tags?
Crazy?

> > On the other hand, if user really want to be specific, they can say
> > 	/USA/California[system_type='collection']/...
> > where 'system_type' is the meta information.
>
> A bit clumsy but it might work, yet you would need to specify
> that even
> USA is a collection, so just in case I'd rather go for something like:
> 	/collection[name='USA']/collection[name='California']/...
>

You can do it, but as you pointed out, it is just not very user friendly. On
the other hand, it comes in handy when you allow users to use any character in
the collection name.

If we consider the *meaning* rather than its physical appearance, you can
just specify it as '/USA'. If you worry about introducing things like
'system_type': the current metadata proposal will introduce system-defined
attribute names, e.g. 'last-modified', anyway. Of course, it is still being
debated whether to wrap that meta information into XMLObject (?) or to keep
it separate.

> relational database: in the end you would end up by using at most a
> handful of tables (while performing horrible and expensive
> JOINs)

I agree one should avoid JOINs at all costs. If one wants to build a DOM tree
in an RDBMS, JOINs will be inevitable (that's why I have some reservations
about eXist). The preliminary idea in my previous email is not to build the
DOM tree, in order to minimize the JOINs, at the price of preparing those
XPaths when inserting the document (kind of like an index). Of course, this
may make one table particularly huge, but an RDBMS is designed to handle
millions of records in any table. Also, as I said, it has a problem returning
a sub-tree instead of the whole file. Therefore, I think it will be more
suitable for situations that have fewer writes or updates but need faster
queries.

> to mention the overhead for serializing XML to SQL and SQL to
> XML. Add
> to this the network latency and you're set with a possibly suboptimal
> setup. On the other hand, once you manage to have a tabular output you
> can use hashes, arrays and the like, so any DBM would
> suffice. Don't you
> think so?

First, the output is not tabular. Each record still returns the original
XML file, which can be stored as a BLOB in the database. If the 'network
latency' refers to the cost associated with JDBC connections, I guess it can
be ignored at this stage, if we are talking about minute-level queries as some
users reported. So many optimizations have already been done by those RDBMS
folks, while we still need to start by thinking about whether our B-Tree is
balanced. Do we really need to reinvent the wheel?

Lixin

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Lixin Meng wrote:

> However, if the database return meta information tree
> 	/db
> 		/addressbook
> 				/ email
> 				/ phone
> 				...
> It will open a door for a new breed of applications, such as a GUI tool
> that supports ad hoc query.

Do you mean that there might be a use case for a metadata that returns 
the *whole* database content? What would happen on a database with 
millions of documents? Is this feature available in RDBMS and JDBC? I 
assume that you want to "clone" something like "SELECT * FROM CAT" or 
"SHOW TABLES", am I right? If so, those commands will return you the 
tables (in our case, roughly speaking, the Collections) but never ever 
the whole data. Sorry if I'm not getting the point, but I feel a bit lost...

> For end user, if one wants to get some weather information from a 
> system, he
> naturally thinks about '/USA/California/Bayarea/Temperature'. Do they 
> really
> care about that '/USA' is a collection or '/USA/California' is a 
> collection?

The user possibly doesn't, but we definitely do. :-) Imagine we have a 
tree like this ("/" are Collections, "*" are Resources, "<>" are nodes 
in Resources):

/
|
+--/USA
      +--*Statistics
      |     |
      |     +--<California>
      |             |
      |             +--<BayArea>
      |                   |
      |                   +--<Temperature>
      |
      +--/California
              |
              +--*BayArea
                     |
                     +--<Temperature>

  How can we decide if Joe user wanted to know the value of the element 
<Temperature> on resource "Bayarea" contained inside the sub-collection 
"California" or if he wanted to query the USA collection for documents 
having an XPath of /California/BayArea/Temperature? Same XPath, but 
definitely different results...
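
To make the ambiguity concrete, here is a small Python sketch over an
in-memory stand-in for the tree above. The dictionary layout, the function
names and the temperature values (21 and 18) are invented for illustration;
this is not how Xindice stores or resolves anything. The resolver walks the
path and, at every collection it reaches, tries both readings of the
remaining steps:

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: a collection holds sub-collections and resources
# (resource name -> XML text). Temperature values are made up.
DB = {
    "collections": {
        "USA": {
            "collections": {
                "California": {
                    "collections": {},
                    "resources": {"BayArea": "<Temperature>18</Temperature>"},
                }
            },
            "resources": {
                "Statistics": "<California><BayArea><Temperature>21"
                              "</Temperature></BayArea></California>"
            },
        }
    },
    "resources": {},
}


def find_in_document(xml_text, steps):
    """Match an absolute, name-only path against one document."""
    root = ET.fromstring(xml_text)
    if not steps or root.tag != steps[0]:
        return []
    if len(steps) == 1:
        return [root]
    return root.findall("/".join(steps[1:]))


def resolve(path):
    """Return (collection, resource, value) for every possible reading."""
    steps = path.strip("/").split("/")
    hits = []
    collection = DB
    for i in range(len(steps) + 1):
        where = "/" + "/".join(steps[:i])
        for name, xml_text in collection["resources"].items():
            # Reading A: the remaining steps are an XPath run against
            # every resource of this collection.
            for node in find_in_document(xml_text, steps[i:]):
                hits.append((where, name, node.text))
            # Reading B: the next step names this resource and the rest
            # is an XPath inside it.
            if i < len(steps) and steps[i] == name:
                for node in find_in_document(xml_text, steps[i + 1:]):
                    hits.append((where, name, node.text))
        if i == len(steps) or steps[i] not in collection["collections"]:
            break
        collection = collection["collections"][steps[i]]
    return hits


results = resolve("/USA/California/BayArea/Temperature")
for hit in results:
    print(hit)
```

The single path comes back with two hits, one from resource "Statistics" in
/USA and one from resource "BayArea" in /USA/California, and nothing in the
path itself says which one Joe meant.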

> On the other hand, if user really want to be specific, they can say
> 	/USA/California[system_type='collection']/...
> where 'system_type' is the meta information.

A bit clumsy but it might work, yet you would need to specify that even 
USA is a collection, so just in case I'd rather go for something like:

	/collection[name='USA']/collection[name='California']/...

but then again if someone decides to put a Resource and call it
"collection" you would be stuck anyway. True, you can add a namespace
but it all feels so far away from Joe User as to make it not really
worthwhile. But if we come up with a good syntax which is possibly
compliant with the XPath specs (IIRC eXist had to go proprietary for some
particular queries, I would do it only as a last resort), then I'm all ears.

> I have my reservation on the issue that we should focus more on
> document-centric XML files too. At least, there is a 50-50 chance in the
> real world. As I said, my original motivation on searching XML database is
> from data processing not content management. More and more Web Services
> implementations mean more SOAP messages need to be logged and retrieved.

True. And not only that: I foresee a great potential for Xindice (and
XML databases in general) to become a great persistence engine. We have
all sorts of object serialization to XML, so we would end up with an
OODBMS at little or no cost.

But my point, actually, is to try to build an engine that is capable of
dealing efficiently with both kinds of XML. After all, in XML, you don't
need the "R" in RDBMS, so it is intrinsically overkill to use a
relational database: in the end you would end up using at most a
handful of tables (while performing horrible and expensive JOINs), not
to mention the overhead of serializing XML to SQL and SQL to XML. Add
to this the network latency and you're set with a possibly suboptimal
setup. On the other hand, once you manage to have a tabular output you
can use hashes, arrays and the like, so any DBM would suffice. Don't you
think so?

Ciao,

-- 
Gianugo Rabellino

RE: [RT] Xindice 2.0

Posted by Lixin Meng <lx...@yahoo.com>.

> > I also hope we can have metadata at the database level.
> > http://marc.theaimsgroup.com/?l=xindice-dev&m=103790372009713&w=2
>
>
> Can you be more specific on that? I saw the message on the
> archive, but
> I fail to see how would Database metadata help here. I tend to think
> that Database metadata are capabilities (like transaction
> support) and
> maybe the collection tree, nothing more really.

Just having access to the collection tree may not be enough. For example, if
the database only tells the user there is a collection hierarchy
'/db/addressbook', there is no way for the user to come up with a query like
'/db/addressbook/email', unless they iterate over every document in that
collection.

However, if the database returns a meta information tree
	/db
		/addressbook
				/ email
				/ phone
				...
it will open the door to a new breed of applications, such as a GUI tool
that supports ad hoc queries.

>
> As per XPath queries sent on the database, I understand that
> they might
> be useful, but I see a problem. Given an XPath like
> /db/content/whatever/A/B, how can you tell which one of the
> tokens is a
> collection, which one is a document and which one is a real XML XPath?
> This would become even more difficult with XPaths like
> //*/A/B. But I'd
> be happy to be proven wrong, since I see lots of use cases for that.

To me, XPath as a query language has limitations in its syntax. But its
biggest advantage is that it is more natural for the end user. Each level is a
container or collection. It also provides another level of virtualization.
If an end user wants to get some weather information from a system, he
naturally thinks of '/USA/California/Bayarea/Temperature'. Does he really
care whether '/USA' is a collection or '/USA/California' is a collection?
What he cares about is that he is going to send a query to the system '/'.
Conceptually, every XML node is a collection too.

On the other hand, if users really want to be specific, they can say
	/USA/California[system_type='collection']/...
where 'system_type' is the meta information.

> > >2. PERFORMANCE
>
> Here I disagree. My point is that XML database should solve
> the problem
> of semistructured data. Pushing semistructured data on a
> relational DB
> looks at least suboptimal to me. I can see a reason when dealing with
> data oriented XML (like just tags and attributes), but things become
> really messy on text oriented documents: how could you
> efficiently break
> into a tabular format something like
>

I think I need to restate it a little bit. There are data-centric XML files
and there are document-centric XML files. I probably inherited more genes
from a data processing background (including the proposal for database-level
meta information). I agree the proposal may not be a good fit for the
document-centric scenario, and I don't expect one size to fit all. I *briefly*
(forgive my ignorance, if anyone here is from eXist :) ) scanned through
eXist's SQL scripts before, and I was not totally convinced that building a
DOM tree in an RDBMS will help.

I also have my reservations about the idea that we should focus more on
document-centric XML files. At least, there is a 50-50 chance in the
real world. As I said, my original motivation for looking into XML databases
comes from data processing, not content management. More and more Web Services
implementations mean more SOAP messages need to be logged and retrieved.
Even traditional middleware (JMS, MQSeries, ...) users tend to
wrap their messages in XML too. So, the point is that there are plenty of
data-centric use cases. If we can give users options to pick a suitable
configuration, won't that be great?

Lixin

Re: [RT] Xindice 2.0

Posted by Gianugo Rabellino <gi...@apache.org>.

Lixin Meng wrote:

> >- metadata: we need a neutral way to query metadata for
> >collections and
> >resources. I like David's solution of having a MetaData object with a
>
>
> I also hope we can have metadata at the database level.
> http://marc.theaimsgroup.com/?l=xindice-dev&m=103790372009713&w=2

Can you be more specific on that? I saw the message on the archive, but
I fail to see how Database metadata would help here. I tend to think
that Database metadata are capabilities (like transaction support) and
maybe the collection tree, nothing more really.

As for XPath queries sent to the database, I understand that they might
be useful, but I see a problem. Given an XPath like
/db/content/whatever/A/B, how can you tell which one of the tokens is a
collection, which one is a document and which one is a real XML XPath?
This would become even more difficult with XPaths like //*/A/B. But I'd
be happy to be proven wrong, since I see lots of use cases for that.

> >2. PERFORMANCE
> >Face it: we are slow. We are fair enough for small jobs but we cannot
> >stand high loads or huge documents, no matter how accurate
> >your indexes
>
>
> Xindice has its own B-Tree files for data storage and search. Could we
> consider leveraging existing RDBM systems? RDBM has been developed and 
> fine
> tuned for so many years, and they have solved many issues that we are 
> going
> to tackle (performance, transaction, and security). 

Here I disagree. My point is that an XML database should solve the problem
of semistructured data. Pushing semistructured data onto a relational DB
looks at least suboptimal to me. I can see a reason when dealing with
data oriented XML (like just tags and attributes), but things become
really messy on text oriented documents: how could you efficiently break
into a tabular format something like

<p>
This is a <i>text</i>. There are text <b>nodes</b> all over the place: I 
dare you to insert this stuff <emphasis>efficiently</emphasis> in a
<a href="http://www.mysql.com">relational database</a>.
</p>

Besides, I see no real reason to follow that path since there is another
Open Source XML database (eXist) that is doing exactly that (not to
mention that every database vendor has its own XML->DB engine). As a
side matter, actually, I'd love to see the two projects merge together
but it looks like it's not the right time: yet Wolfgang and his team
have my total appreciation for the job they are doing.

If we were to choose that path (tabular XML), I would actually
investigate more on the forthcoming XML database from the Sleepycat
guys: this way, as MySQL uses Berkeley DB for storage, we might leverage
Berkeley XML DB.

But then again I'm starting to ask myself if we need a storage at all. I
know it sounds provocative, but try to follow me and my crappy English on
these two use cases:

1. Use case 1: we are asked for a particular resource (say an XML document),
and all we need to do is find it and deliver it *as fast as possible* to
the user. This means that all we need to do is try to reduce bottlenecks
and, apart from network bottlenecks, the only real limitation that I see
is *parsing*: if we are to parse a file, then deliver it to a client
over the network in a form that in turn needs some kind of parsing, we
are just wasting our time. As of now we are dealing with DOM, which is
the most expensive and slow XML data structure around. I am currently
looking at DTM from Xalan (which however is showing some serious
limitations) and I'm willing to try the SAX events compilation way "à
la Cocoon", where all you have is just a byte stream containing
"recorded" SAX events. All we need to do is:

a. when writing a document, write it on disk as a byte stream of 
compiled events;

b. when we are requested a document, just send that byte stream over the 
network to the client;

c. let the client perform the reverse operation, by interpreting
(playing back) the recorded SAX events (possibly to a DOM builder if the
client application is requesting a DOM tree).

2. Use case 2: we are asked to run an XPath query (or, in the future,
XQuery). Here we need to have real fast indexes and a real fast XPath
engine. Here Xalan DTM might play a key role.
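
Steps a to c of use case 1 can be sketched roughly as follows. This is only an
illustration in Python, and everything in it is an assumption: the class and
function names are invented, and JSON is used as the "compiled" format purely
for readability, where a real implementation would want a compact binary
encoding. ElementTree's TreeBuilder plays the role of the DOM builder on the
client side.

```python
import json
import xml.sax
import xml.etree.ElementTree as ET


class Recorder(xml.sax.ContentHandler):
    """Record SAX events as plain lists so they can be serialized."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(["start", name, dict(attrs.items())])

    def endElement(self, name):
        self.events.append(["end", name])

    def characters(self, content):
        self.events.append(["text", content])


def compile_document(xml_bytes):
    """Step a (server, on write): parse once, keep the recorded events."""
    recorder = Recorder()
    xml.sax.parseString(xml_bytes, recorder)
    return json.dumps(recorder.events).encode("utf-8")


def play_back(stream, builder):
    """Step c (client): replay the events into a tree builder."""
    for event in json.loads(stream.decode("utf-8")):
        if event[0] == "start":
            builder.start(event[1], event[2])
        elif event[0] == "end":
            builder.end(event[1])
        else:
            builder.data(event[1])
    return builder.close()


source = b'<p>This is a <i>text</i> with <b class="x">nodes</b>.</p>'
stream = compile_document(source)        # stored on disk as-is
# Step b: the server just ships `stream`; no XML parsing on read.
root = play_back(stream, ET.TreeBuilder())
print(root.tag, "".join(root.itertext()))
```

The XML parser runs exactly once, at write time. On read the server only
copies bytes, and the client decides whether to build a tree or to consume
the events directly.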

Now, I know that there are more (write oriented) use cases such as
XUpdate, Sixdml and the like, but I still think that those kinds of
operations might be accomplished in a longer timeframe. Again, the
parallel with LDAP stands: LDAP writes are *slow* but reads are
*blazingly fast*. Not to mention that there might be a way to optimize
that part too.

How does it sound? Crazy? :-)

Ciao,

-- 
Gianugo Rabellino

RE: [RT] Xindice 2.0

Posted by Lixin Meng <lx...@yahoo.com>.

> - metadata: we need a neutral way to query metadata for
> collections and
> resources. I like David's solution of having a MetaData object with a

I also hope we can have metadata at the database level.
http://marc.theaimsgroup.com/?l=xindice-dev&m=103790372009713&w=2

>
> 2. PERFORMANCE
> Face it: we are slow. We are fair enough for small jobs but we cannot
> stand high loads or huge documents, no matter how accurate
> your indexes

Xindice has its own B-Tree files for data storage and search. Could we
consider leveraging existing RDBMSs? They have been developed and fine
tuned for so many years, and they have solved many issues that we are going
to tackle (performance, transactions, and security). What needs to be done is
to define an efficient data schema and provide query language translation
(XPath to SQL). A very preliminary thought is attached at the end of this
mail.

n. Cross-collection search.

n+1. Vocabulary mapping

The '/A/B' in namespace one might be equal to '/X/Y' in namespace two. If
we allow users to set such rules, the system may return both when searching
for either '/A/B' or '/X/Y'.
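
As a tiny illustration of such a rule table (Python, with invented names and
with the namespaces written as plain prefixes, which is my assumption and not
a proposed syntax):

```python
# Hypothetical user-defined rules: each set lists paths that mean the same.
EQUIVALENT_PATHS = [
    {"ns1:/A/B", "ns2:/X/Y"},
]


def expand_query(path):
    """Return every path that should be searched for the given one."""
    for group in EQUIVALENT_PATHS:
        if path in group:
            return sorted(group)
    return [path]


print(expand_query("ns1:/A/B"))
print(expand_query("ns2:/X/Y"))
```

Both calls expand to the same pair of paths, so a search for either one would
return matches for both.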

Regards,
Lixin

-------------
Store and search arbitrary XML files with an off-the-shelf relational
database.

For XML file:
	 <A a='some attribute value'>
	    <B>
	        <C c='attribute for c'>Something here</C>
	        <D>first D</D>
	        <D>second D</D>
	        <D d='third'>third D</D>
	     </B>
	 </A>

	Break it down to a list of name-value pairs:

	"/A/@a"		"some attribute value"
	"/A/B/C/@c"		"attribute for c"
	"/A/B/C"		"Something here"
	"/A/B/D[1]"		"first D"
	"/A/B/D[2]"		"second D"
	"/A/B/D[3]"		"third D"
	"/A/B/D/@d"		"third"

Save them into tables (there should be more tables, such as one holding the
original file so we won't have to reconstruct it. I haven't thought about how
to return a sub-tree):

	Value table
	ID		Value
	id001		Some attribute value
	id002		attribute for c
	id003		Something here
	id004		first D
	id005		second D
	...	...

	Meaning/Meta table
	IDRef		DocID		path		Index
	id001		doc1		/A/@a
	id002		doc1		/A/B/C/@c
	id003		doc1		/A/B/C
	id004		doc1		/A/B/D	1
	id005		doc1		/A/B/D	2
	...	...	...

For the query /A/B/C[@c='attribute for c'], convert it into SQL:
	select ...
	from   ...
	where ... "/A/B/C/@c" and ... "attribute for c"

The SQL will return a set of 'DocID' values, for example, for the matching
documents. The SQL might be complex, but RDBMSs are proven at handling large
amounts of data.
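
To show the idea end to end, here is a rough Python sketch using the sqlite3
module from the standard library as a stand-in for a real RDBMS. The table
and column names are my own guesses at the layout above ('idx' is used
because INDEX is a reserved word in SQL), the second document is made up, and
the SQL is only one possible translation, not a worked-out design.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter


def shred(element, path="", index=None):
    """Yield (path, index, value) rows for attributes and text content."""
    path = f"{path}/{element.tag}"
    for name, value in element.attrib.items():
        yield (f"{path}/@{name}", index, value)
    text = (element.text or "").strip()
    if text:
        yield (path, index, text)
    totals = Counter(child.tag for child in element)
    seen = Counter()
    for child in element:
        seen[child.tag] += 1
        # Only repeated siblings get a position, as in the table above.
        child_index = seen[child.tag] if totals[child.tag] > 1 else None
        yield from shred(child, path, child_index)


def store(conn, doc_id, xml_text):
    for path, index, value in shred(ET.fromstring(xml_text)):
        cur = conn.execute(
            "INSERT INTO value_table (value) VALUES (?)", (value,))
        conn.execute(
            "INSERT INTO meta_table (id_ref, doc_id, path, idx) "
            "VALUES (?, ?, ?, ?)",
            (cur.lastrowid, doc_id, path, index))


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE value_table (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE meta_table (id_ref INTEGER, doc_id TEXT,
                             path TEXT, idx INTEGER);
""")
store(conn, "doc1",
      "<A a='some attribute value'><B>"
      "<C c='attribute for c'>Something here</C>"
      "<D>first D</D><D>second D</D><D d='third'>third D</D>"
      "</B></A>")
store(conn, "doc2", "<A><B><C c='other'>x</C></B></A>")

# One possible translation of /A/B/C[@c='attribute for c'].
MATCHING_DOCS = """
    SELECT DISTINCT m.doc_id
    FROM meta_table AS m JOIN value_table AS v ON v.id = m.id_ref
    WHERE m.path = ? AND v.value = ?
"""
rows = conn.execute(MATCHING_DOCS, ("/A/B/C/@c", "attribute for c"))
matching = [row[0] for row in rows]
print(matching)
```

Note that the two-table layout still costs one join per query. Keeping the
value in the same row as the path would avoid it, at the price of a wider
table.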