
Posted to xindice-users@xml.apache.org by Honglin Ye <hy...@aoc.nrao.edu> on 2004/02/03 22:36:55 UTC

xindice xpath query over multiple collections


Hello, Vadim,
       Thank you for the message.
       I think searching over multiple collections of similar documents is a
needed feature. Partitioning the database into smaller collections may improve
performance. I have documents to store in a collection tree like:

        /db/proposals
              /VLA
                  /200402
                  /200406
                  /200410
              /VLBA
                  /200402
                  /200406
                  /200410
              /EVLA
                  /200402
                  /200406
                  /200410
              /GBT
                  /200402
                  /200406
                  /200410

I modified XPathQuery.java to search multiple collections.
      There are 2 ways to specify a collection list. The first uses a comma- or
semicolon-separated list, for example:
      xindice xpath -c xmldb:xindice://localhost:8080/db/proposals/VLA/200402;xmldb:xindice://localhost:8080/db/proposals/VLBA/200406

      The second uses * to stand for all the child collections, for example:
      xindice xpath -c xmldb:xindice://localhost:8080/db/proposals/*/200402
or
      xindice xpath -c xmldb:xindice://localhost:8080/db/proposals/VLA/*

The two forms can also be combined.

      The attached XPathQuery.java contains my modifications to the original file.
I added 2 private functions that extract the collection list, and modified the
execute method to loop over every collection in the list.
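
A minimal sketch of how such a collection list might be built, for
illustration only; it is not the attached XPathQuery.java. The class name
CollectionListSketch, the method expand, and the childLister callback are all
hypothetical, and the sketch works on the path portion (/db/...) rather than on
full xmldb: URIs. In a real client the callback would presumably ask the
database for the child collection names of a path.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of Xindice.
final class CollectionListSketch {

    // Splits spec on ',' or ';' and expands every "*" path segment into the
    // child names reported by childLister for the path built so far.
    // Non-wildcard segments are appended without checking that they exist.
    static List<String> expand(String spec, Function<String, List<String>> childLister) {
        Set<String> result = new LinkedHashSet<>(); // keeps order, drops repeats
        for (String entry : spec.split("[,;]")) {
            List<String> partial = new ArrayList<>();
            partial.add("");
            for (String segment : entry.trim().split("/")) {
                if (segment.isEmpty()) {
                    continue; // leading or doubled slash
                }
                List<String> next = new ArrayList<>();
                for (String prefix : partial) {
                    if (segment.equals("*")) {
                        for (String child : childLister.apply(prefix)) {
                            next.add(prefix + "/" + child);
                        }
                    } else {
                        next.add(prefix + "/" + segment);
                    }
                }
                partial = next;
            }
            for (String path : partial) {
                if (!path.isEmpty()) {
                    result.add(path);
                }
            }
        }
        return new ArrayList<>(result);
    }
}
```

The execute method would then loop over the returned list and run the same
XPath query once per collection.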

      I am new to Xindice and am not sure whether this modification makes sense
or whether it has an impact on other modules.


Honglin Ye
National Radio Astronomy Observatory
Socorro, NM USA

Honglin Ye wrote:

 > Honglin Ye wrote:
 >
 >> Is it possible to query the db over multiple collections?
 >>
 >> For example
 >>
 >> xindice xpath -c xmldb:xindice://host:8080/db/root/ -q "/root/childre"
 >>
 > sorry, it should read
 >
 > xindice xpath -c xmldb:xindice://host:8080/db/root/* -q "/root/childre"



See the Xindice website, Todo page:
    The Query Engine: The query engine has basic
    functionality right now.  Indexing and XPath query work against a
    Collection, but no unified cross-Collection query system currently
    exists.

In short:
No, Xindice currently does not support querying over multiple collections. You are welcome to enhance Xindice and send in a patch.

Thanks,
Vadim

Re: xindice xpath query over multiple collections

Posted by Honglin Ye <hy...@aoc.nrao.edu>.

Vadim,
    Thanks for the explanation. It cleared things up a lot.
Honglin

Re: xindice xpath query over multiple collections

Posted by Honglin Ye <hy...@aoc.nrao.edu>.

Vadim,
      Retrieving a whole document, modifying it, and storing it back is mostly
what I need. I also have a more demanding query requirement. I am building a
proposal submission and handling system for radio telescopes and am required to
keep the documents as XML. They should be searchable by proposer names, proposal
titles, observation types, telescopes, configurations, target sources,
frequencies, abstract text, etc. From my understanding, there is no performance
penalty for having more indexes. It also looks like the indexes are updated
automatically as new proposals are added; is that true?
       Do you have pointers to systems like the one I am building that use
XML as storage? Do you have suggestions on how I should proceed?

Honglin

Re: xindice xpath query over multiple collections

Posted by Vadim Gritsenko <va...@reverycodes.com>.

Honglin Ye wrote:

> Vadim Gritsenko wrote:
>
>> Honglin Ye wrote:
>>
>>> Partitioning the database into smaller collections may improve
>>> performance.
>>
>>
>> I'm curious, why do you think so? Is there any reason or observation 
>> for this? 
>
>
> I am curious too. Suppose we have 4 groups, and each group adds 1000 files
> every month. Sometimes we only need to search files from one group;
> sometimes we only need to search a particular month. Will it be easier if
> we separate the files by group and month so that we can work with smaller
> sets?

Hm, I guess so. Certainly, a search over 1000 files will be faster than a
search over 1000*4*months files. During the search, Xindice will not go
through each file if you are using indexes, but still, I would expect a
search over a smaller collection / smaller indexes to be faster.

> It may make little difference in searching?

Suppose you have everything in one collection. Then you'll need at
least three indexes: one for the group, one for the month, and another on
the element/attribute you are searching for. The intersection of the
results from these three indexes gives you the resulting set of documents.
First, the month index will return 4000 documents, then the group index
will return 1000*months documents, and then the third index will be used.
So Xindice will take some time reading / searching the indexes and
building the intersection.

OTOH, if you have a collection per group and per month, you will not
need the first two indexes, so this will make the search faster.
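
To make the intersection step concrete, here is a small sketch under stated
assumptions: each index lookup is modelled as a plain set of document keys, and
the class name IndexIntersectionSketch and the method intersect are made up for
this example. It illustrates the reasoning above, not how Xindice implements
its query engine.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not Xindice code.
final class IndexIntersectionSketch {

    // Returns the document keys present in every index result.
    // Each extra index costs one more lookup plus one more retainAll pass.
    static Set<String> intersect(List<Set<String>> indexResults) {
        if (indexResults.isEmpty()) {
            return new TreeSet<>();
        }
        Set<String> matches = new TreeSet<>(indexResults.get(0));
        for (Set<String> hits : indexResults.subList(1, indexResults.size())) {
            matches.retainAll(hits);
        }
        return matches;
    }
}
```

With one collection per group and month, only the result of the third index
remains, so there is nothing left to intersect.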

> How about updates? When we update a resource, will
> Xindice directly modify that resource in the .tbl and leave the other
> parts untouched?

Getting a document by ID, setting a document by ID, and performing an
XUpdate on a document are all fast operations. Xindice uses a BTree to
store the key -> document association, where each node in the BTree stores
(4096 / KeySize) keys. So, for keys 128 bytes in size (32 keys per
4096-byte page), an access costs on the order of log32(N) node lookups,
where N is the number of documents in the collection.
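
As a rough worked example of that arithmetic (a sketch only; the class
BTreeDepthSketch and its methods are hypothetical and ignore real-world details
such as partially filled nodes):

```java
// Hypothetical illustration of the fan-out arithmetic, not Xindice code.
final class BTreeDepthSketch {

    // Number of keys that fit in one page, e.g. 4096 / 128 = 32.
    static int keysPerPage(int pageSize, int keySize) {
        return pageSize / keySize;
    }

    // Smallest depth d with fanout^d >= documents, i.e. ceil(log_fanout(N)).
    static int levels(long documents, int fanout) {
        if (fanout < 2) {
            throw new IllegalArgumentException("fanout must be at least 2");
        }
        int depth = 0;
        long reachable = 1;
        while (reachable < documents) {
            reachable *= fanout;
            depth++;
        }
        return depth;
    }
}
```

With 32 keys per page, levels(1000000, 32) is 4, because 32^3 = 32768 is less
than one million while 32^4 = 1048576 is not. A collection of 4000 documents
needs 3 levels, so partitioning changes the lookup depth only slightly.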

Xindice stores documents in a paged file, so when you update a document,
only the pages containing that document are updated. The rest of the .tbl
will not be touched.

> Partitioning is a useful strategy in RDBMS; is there anything similar
> in Xindice?

If you are planning to have a large database (>2 GB), then you'll have to
partition due to limits on file size (and these limits vary across
different operating systems / file systems).

Either way, let us know how many documents you were able to store and
how fast it worked ;-)

Vadim

Re: xindice xpath query over multiple collections

Posted by Honglin Ye <hy...@aoc.nrao.edu>.

Vadim Gritsenko wrote:

> Honglin Ye wrote:
>
>> I think searching over multiple collections of similar documents is
>> a needed feature.
>
> Well, it was and still is on the Xindice TODO list :-)
>
>> Partitioning the database into smaller collections may improve
>> performance.
>
> I'm curious, why do you think so? Is there any reason or observation for
> this?
>
> Please bear in mind also that lots of collections will exhaust your
> system resources, particularly file handles (see the archives: somebody
> already hit that limit; I guess he had thousands of collections).

Vadim,

I am curious too. Suppose we have 4 groups, and each group adds 1000 files every
month. Sometimes we only need to search files from one group; sometimes we only
need to search a particular month. Will it be easier if we separate the files by
group and month so that we can work with smaller sets? Or would it make little
difference in searching? How about updates? When we update a resource, will
Xindice directly modify that resource in the .tbl and leave the other parts
untouched? Partitioning is a useful strategy in RDBMS; is there anything similar
in Xindice?

Honglin

Re: xindice xpath query over multiple collections

Posted by Vadim Gritsenko <va...@reverycodes.com>.

Honglin Ye wrote:

> I think searching over multiple collections of similar documents is
> a needed feature.

Well, it was and still is on the Xindice TODO list :-)

> Partitioning the database into smaller collections may improve
> performance.

I'm curious, why do you think so? Is there any reason or observation for 
this?

Please bear in mind also that lots of collections will exhaust your
system resources, particularly file handles (see the archives: somebody
already hit that limit; I guess he had thousands of collections).

...

> I modified XPathQuery.java to do multiple collection search.

...

> I am new to Xindice and am not sure whether this modification makes sense
> or whether it has an impact on other modules.

Thanks for your contribution; I'll review your changes as soon as I get 
some time.

Vadim

Re: xindice xpath query over multiple collections

Posted by Honglin Ye <hy...@aoc.nrao.edu>.

Vadim Gritsenko wrote:

> Honglin Ye wrote:
>
>> The attached XPathQuery.java contains my modifications to the original
>> file. I added 2 private functions that extract the collection list, and
>> modified the execute method to loop over every collection in the list.
>
> Ok, I took a look at your changes.
>
> You have modified the xindice admin tool to search over several
> collections. But where / how are you planning to use this? I mean, with
> this patch, Xindice still does not provide programmatic access to
> multiple-collection search, so you can't use this feature in your
> application.
>
> I was thinking that this feature should be built either into the
> QueryService (package org.apache.xindice.client.xmldb.services), or it
> should be some new service.
>
> What do you think?
>
> Vadim

Vadim,
      The change I made is not intended to be a formal patch. It only suggests
how we can build a collection list and loop over it, and it only works from the
command line. If the basic logic makes sense and multiple-collection search
is of practical use, then we can consider how to incorporate it.
I think we can put this in QueryService.
Honglin

Re: xindice xpath query over multiple collections + CollectionList.java

Posted by Honglin Ye <hy...@aoc.nrao.edu>.

Hi, Vadim,

     I made a class to extract the collection list (see attached). The class
contains 2 static methods,

     public static ArrayList getCollectionList(Hashtable t) {...}
and
     public static ArrayList getCollectionList(String s) {..}

The first one is to be used for command-line queries
and the second for programmatic queries.

I tested it, and it seems to work just fine.

Also, is there login control in Xindice?

Honglin

Re: xindice xpath query over multiple collections

Posted by Vadim Gritsenko <va...@reverycodes.com>.

Honglin Ye wrote:

> The attached XPathQuery.java contains my modifications to the original
> file. I added 2 private functions that extract the collection list, and
> modified the execute method to loop over every collection in the list.

Ok, I took a look at your changes.

You have modified the xindice admin tool to search over several
collections. But where / how are you planning to use this? I mean, with
this patch, Xindice still does not provide programmatic access to
multiple-collection search, so you can't use this feature in your
application.

I was thinking that this feature should be built either into the 
QueryService (package org.apache.xindice.client.xmldb.services), or it 
should be some new service.
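
To illustrate what programmatic access could look like, here is a sketch under
stated assumptions. The interface SingleCollectionQuery and the class
MultiCollectionQuerySketch are hypothetical names invented for this example;
they are not part of Xindice or of the XML:DB API. The interface stands in for
whatever call runs an XPath query against one collection, and results are
modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for "run this XPath against one collection".
interface SingleCollectionQuery {
    List<String> query(String collectionUri, String xpath);
}

// Hypothetical service-style wrapper, not an existing Xindice class.
final class MultiCollectionQuerySketch {

    // Runs the same XPath against each collection in turn and concatenates
    // the results in collection order.
    static List<String> queryAll(List<String> collectionUris,
                                 String xpath,
                                 SingleCollectionQuery single) {
        List<String> merged = new ArrayList<>();
        for (String uri : collectionUris) {
            merged.addAll(single.query(uri, xpath));
        }
        return merged;
    }
}
```

Putting the loop behind a method like this, rather than in the command-line
tool, is what would let an application call it directly.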

What do you think?

Vadim