Posted to xindice-users@xml.apache.org by Honglin Ye <hy...@aoc.nrao.edu> on 2004/02/03 22:36:55 UTC
xindice xpath query over multiple collections
Hello, Vadim,
Thank you for the message.
I think searching over multiple collections of similar documents is a needed
feature. Partitioning the database into smaller collections may improve
performance. I have documents to store in a collection tree like:
/db/proposals
    /VLA
        /200402
        /200406
        /200410
    /VLBA
        /200402
        /200406
        /200410
    /EVLA
        /200402
        /200406
        /200410
    /GBT
        /200402
        /200406
        /200410
I modified XPathQuery.java to do multiple collection search.
There are 2 ways to specify a collection list. The first uses a list separated by
commas or semicolons, for example:
xindice xpath -c xmldb:xindice://localhost:8080/db/proposals/VLA/200402;xmldb:xindice://localhost:8080/db/proposals/VLBA/200406
The second uses * to match all the child collections at that level, for example:
xindice xpath -c xmldb:xindice://localhost:8080/db/proposals/*/200402
or
xindice xpath -c xmldb:xindice://localhost:8080/db/proposals/VLA/*
The 2 ways can also be combined.
The attached XPathQuery.java contains my modification to the original file.
I added 2 private functions to extract the collection list and modified the execute
method to loop over all the collections in the list.
I am new to Xindice and am not sure whether this modification makes sense or
whether it has an impact on other modules.
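A minimal sketch of how such a collection list could be built, assuming the
spec is split on ',' or ';' and that a '*' stands for exactly one path
segment. This is not the attached XPathQuery.java: the class name
CollectionSpec, its methods, and the in-memory list of known collection paths
are all hypothetical, and plain collection paths stand in for full xmldb URIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Xindice. */
class CollectionSpec {

    /** Splits a spec on ',' or ';' and drops empty entries. */
    static List<String> split(String spec) {
        List<String> parts = new ArrayList<>();
        for (String part : spec.split("[,;]")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    /** True if path matches pattern; a "*" segment matches any single segment. */
    static boolean matches(String pattern, String path) {
        String[] p = pattern.split("/");
        String[] s = path.split("/");
        if (p.length != s.length) {
            return false;
        }
        for (int i = 0; i < p.length; i++) {
            if (!p[i].equals("*") && !p[i].equals(s[i])) {
                return false;
            }
        }
        return true;
    }

    /** Expands a spec against the known collection paths, without duplicates. */
    static List<String> expand(String spec, List<String> known) {
        Set<String> result = new LinkedHashSet<>();
        for (String pattern : split(spec)) {
            for (String path : known) {
                if (matches(pattern, path)) {
                    result.add(path);
                }
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList(
            "/db/proposals/VLA/200402", "/db/proposals/VLA/200406",
            "/db/proposals/VLBA/200402", "/db/proposals/VLBA/200406");
        // Prints [/db/proposals/VLA/200402, /db/proposals/VLBA/200402, /db/proposals/VLBA/200406]
        System.out.println(
            expand("/db/proposals/*/200402;/db/proposals/VLBA/200406", known));
    }
}
```

On a POSIX shell, a spec containing ';' or '*' most likely needs to be quoted
on the command line, since the shell otherwise treats ';' as a command
separator and may try to expand '*'.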
Honglin Ye
National Radio Astronomy Observatory
Socorro, NM USA
Honglin Ye wrote:
> Honglin Ye wrote:
>
>> Is it possible to query db over multiple collection?
>>
>> For example
>>
>> xindice xpath -c xmldb:xindice://host:8080/db/root/ -q "/root/childre"
>>
> sorry, it should read
>
> xindice xpath -c xmldb:xindice://host:8080/db/root/* -q "/root/childre"
See Xindice website, Todo page:
The Query Engine: The query engine has basic
functionality right now. Indexing and XPath query work against a
Collection, but no unified cross-Collection query system currently
exists.
In short:
No, Xindice currently does not support querying over multiple collections. You are welcome to enhance Xindice and send in a patch.
Thanks,
Vadim
Re: xindice xpath query over multiple collections
Posted by Honglin Ye <hy...@aoc.nrao.edu>.
Vadim Gritsenko wrote:
> Honglin Ye wrote:
>
>> Vadim Gritsenko wrote:
>>
>>> Honglin Ye wrote:
>>>
>>>> By partitioning the database into smaller collections may improve the
>>>> performance.
>>>
>>>
>>>
>>> I'm curious, why do you think so? Is there any reason or observation
>>> for this?
>>
>>
>>
>> I am curious too. Suppose we have 4 groups, each group add 1000 files
>> every month.
>> Some time we only need to search files from one group, sometime we
>> only need to
>> search for a particular month. Will it be easier if we separate the
>> files by groups
>> and months such that we can work with smaller groups?
>
>
>
> Hm, I guess so. Certainly, search over 1000 files will be faster than
> search over 1000*4*months files. During the search, Xindice will not go
> through each file if you are using indexes, but still, I can think that
> search over smaller collection / smaller indexes should be faster.
>
>
>> It may make little difference in searching?
>
>
>
> Suppose you have everything in one collection. Then, you'll need at
> least three indexes: one for a group, one for a month, and another on an
> element/attribute you are searching for. Intersection of these three
> indexes will give you a resulting set of documents. First, month index
> will return you 4000 documents, then group index will return you
> 1000*months documents, and then third index will be used. Intersection
> of results returned by the index will give you resulting set of
> documents. So Xindice will take some time reading / searching using
> indexes and building intersection.
>
> OTOH, if you have collection for a group and for a month, you will not
> need first two indexes, so this will make search faster.
>
>
>> How about updates? When we update a resource, will
>> Xindice directly modify that resource in the .tbl and leave the other parts
>> untouched?
>
>
>
> Get document by ID / Set document by ID / perform an XUpdate of the
> document are all fast operations. Xindice uses BTree to store key ->
> document association, where each node in the BTree stores (4096 /
> KeySize) keys. So, for keys 128 bytes in size (32 keys per 4096 page),
> access speed will be log32(N), where N is count of documents in the
> collection.
>
> Xindice stores documents in paged file, so when you update a document,
> only pages containing document will be updated. The rest of the .tbl
> will not be touched.
>
>
>> Partitioning is a useful strategy in RDBMSs; is there anything similar
>> in Xindice?
>
>
>
> If you are planning to have large database (>2Gb) then you'll have to
> partition due to limits on file size (and this limits varies on
> different operating systems / file systems).
>
> Either way, let us know how many documents you were able to store and
> how fast did it work ;-)
>
> Vadim
Vadim,
Thanks for the explanation. It cleared things up a lot.
Honglin
Re: xindice xpath query over multiple collections
Posted by Honglin Ye <hy...@aoc.nrao.edu>.
Vadim,
Whole-document retrieve - modify - store is mostly what I need.
I have a more demanding query requirement. I am building a proposal
submission and handling system for radio telescope systems and am required to keep the documents as XML.
It should be searchable by proposer names, proposal titles, observation
types, telescopes, configurations, target sources, frequencies,
abstract text, etc. From my understanding, there is no performance penalty
for having more indexes. It also looks like the indexes are updated
automatically as new proposals are added; is that true?
Do you have pointers to systems like the one I am building that use
XML as storage? Do you have suggestions on how I should proceed?
Honglin
Re: xindice xpath query over multiple collections
Posted by Vadim Gritsenko <va...@reverycodes.com>.
Honglin Ye wrote:
> Vadim Gritsenko wrote:
>
>> Honglin Ye wrote:
>>
>>> By partitioning the database into smaller collections may improve the
>>> performance.
>>
>>
>> I'm curious, why do you think so? Is there any reason or observation
>> for this?
>
>
> I am curious too. Suppose we have 4 groups, each group add 1000 files
> every month.
> Some time we only need to search files from one group, sometime we
> only need to
> search for a particular month. Will it be easier if we separate the
> files by groups
> and months such that we can work with smaller groups?
Hm, I guess so. Certainly, a search over 1000 files will be faster than
a search over 1000*4*months files. During the search, Xindice will not go
through each file if you are using indexes, but still, I would expect a
search over a smaller collection / smaller indexes to be faster.
> It may make little difference in searching?
Suppose you have everything in one collection. Then you'll need at
least three indexes: one for the group, one for the month, and another on the
element/attribute you are searching for. The intersection of these three
indexes gives you the resulting set of documents. First, the month index
will return 4000 documents, then the group index will return
1000*months documents, and then the third index will be used. So Xindice
will take some time reading / searching the indexes and building the
intersection.
OTOH, if you have a collection per group and per month, you will not
need the first two indexes, so this will make the search faster.
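To make the intersection step concrete, here is a small sketch that treats
each index lookup as a set of document IDs and intersects them. It only
illustrates the idea described above; it is not how Xindice implements its
indexers, and the class name IndexIntersection and the sample IDs are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of intersecting index hits; not Xindice code. */
class IndexIntersection {

    /** Returns the document IDs present in every hit set (empty if none given). */
    static Set<Integer> intersect(List<Set<Integer>> hits) {
        if (hits.isEmpty()) {
            return new TreeSet<>();
        }
        Set<Integer> result = new TreeSet<>(hits.get(0));
        for (int i = 1; i < hits.size(); i++) {
            result.retainAll(hits.get(i)); // keep only IDs also found by index i
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> monthHits = new TreeSet<>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> groupHits = new TreeSet<>(Arrays.asList(2, 4, 6));
        Set<Integer> valueHits = new TreeSet<>(Arrays.asList(4, 5, 6));
        // Only document 4 is returned by all three indexes: prints [4]
        System.out.println(intersect(Arrays.asList(monthHits, groupHits, valueHits)));
    }
}
```

With one collection per group and month, the first two sets are implied by
the collection itself, so only the third lookup remains.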
> How about updates? When we update a resource, will
> Xindice directly modify that resource in the .tbl and leave the other parts
> untouched?
Getting a document by ID, setting a document by ID, and performing an XUpdate
of a document are all fast operations. Xindice uses a BTree to store the key ->
document association, where each node in the BTree stores (4096 /
KeySize) keys. So, for keys 128 bytes in size (32 keys per 4096-byte page),
a lookup will take on the order of log32(N) page accesses, where N is the
count of documents in the collection.
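As a rough worked example of the log32(N) estimate, the sketch below computes
the keys per page and the number of tree levels needed for a given document
count. It takes the simplified model above at face value (every node holds
pageSize / keySize keys); the class name BTreeDepth is hypothetical and the
real on-disk layout may differ.

```java
/** Hypothetical back-of-the-envelope calculator; not Xindice code. */
class BTreeDepth {

    /** Keys per node under the simplified model: page size / key size. */
    static int keysPerPage(int pageSize, int keySize) {
        return pageSize / keySize;
    }

    /** Smallest number of levels such that fanout^levels >= documents (at least 1). */
    static int levels(long documents, int fanout) {
        if (fanout < 2) {
            throw new IllegalArgumentException("fanout must be at least 2");
        }
        int levels = 1;
        long capacity = fanout;
        while (capacity < documents) {
            capacity *= fanout;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        int fanout = keysPerPage(4096, 128);
        System.out.println(fanout);                 // prints 32
        System.out.println(levels(1000, fanout));   // prints 2 (one group, one month)
        System.out.println(levels(48000, fanout));  // prints 4 (4 groups * 12 months * 1000)
    }
}
```

Under these assumptions, going from 1000 to 48000 documents adds only about
two levels to a lookup by key.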
Xindice stores documents in a paged file, so when you update a document,
only the pages containing that document will be updated. The rest of the .tbl
will not be touched.
> Partitioning is a useful strategy in RDBMSs; is there anything similar
> in Xindice?
If you are planning to have a large database (>2 GB) then you'll have to
partition due to limits on file size (these limits vary across
different operating systems / file systems).
Either way, let us know how many documents you were able to store and
how fast it worked ;-)
Vadim
Re: xindice xpath query over multiple collections
Posted by Honglin Ye <hy...@aoc.nrao.edu>.
Vadim Gritsenko wrote:
> Honglin Ye wrote:
>
>> I think search over multiple collections of similar document is
>> a needed
>> feature.
>
>
>
> Well, it was and still is on Xindice TODO list :-)
>
>
>> By partitioning the database into smaller collections may improve the
>> performance.
>
>
>
> I'm curious, why do you think so? Is there any reason or observation for
> this?
>
> Please bear in mind also that lots of collections will exhaust your
> system resources, particularly, file handles (see archives - somebody
> already hit that limit, I guess he had thousands of collections).
>
>
Vadim,
I am curious too. Suppose we have 4 groups, and each group adds 1000 files every month.
Sometimes we only need to search files from one group, and sometimes we only need to
search a particular month. Will it be easier if we separate the files by group
and month so that we can work with smaller sets? Or might it make little
difference in searching? How about updates? When we update a resource, will
Xindice directly modify that resource in the .tbl and leave the other parts untouched?
Partitioning is a useful strategy in RDBMSs; is there anything similar in Xindice?
Honglin
Re: xindice xpath query over multiple collections
Posted by Vadim Gritsenko <va...@reverycodes.com>.
Honglin Ye wrote:
> I think search over multiple collections of similar document is
> a needed
> feature.
Well, it was and still is on the Xindice TODO list :-)
> By partitioning the database into smaller collections may improve the
> performance.
I'm curious, why do you think so? Is there any reason or observation for
this?
Please bear in mind also that lots of collections will exhaust your
system resources, particularly, file handles (see archives - somebody
already hit that limit, I guess he had thousands of collections).
...
> I modified XPathQuery.java to do multiple collection search.
...
> I am new to xindice and am not sure if this modification make sense or
> if it has impact on other modules.
Thanks for your contribution; I'll review your changes as soon as I get
some time.
Vadim
Re: xindice xpath query over multiple collections
Posted by Honglin Ye <hy...@aoc.nrao.edu>.
Vadim Gritsenko wrote:
> Honglin Ye wrote:
>
>> Attached XPathQuery.java contains my modification to the original
>> file.
>> I added 2 private functions to extract collection list and modified
>> execute
>> method to loop over all the collection list.
>
>
>
> Ok, I took a look at your changes.
>
> You have modified xindice admin tool to search over several collections.
> But where / how you are planning to use this? I mean, with this patch,
> Xindice still does not provide programmatic access to multiple
> collections search, so you can't use this feature in your application.
>
> I was thinking that this feature should be built either into the
> QueryService (package org.apache.xindice.client.xmldb.services), or it
> should be some new service.
>
> What do you think?
>
> Vadim
Vadim,
The change I made is not intended to be a formal patch. It only suggests
how we can build a collection list and loop over it. It only works from the command
line. If the basic logic makes sense and multiple collection search
is of any practical use, then we can consider how to incorporate it.
I think we can put this in QueryService.
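A rough sketch of what "build a collection list and loop over it" could look
like programmatically. It is an assumption-laden illustration, not the
attached code and not an existing Xindice service: the class
MultiCollectionQuery and the queryOne callback are hypothetical, and the
callback is a stand-in for whatever per-collection XPath query the client
would really run (presumably through the XML:DB API query service).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical fan-out helper for illustration; not part of Xindice. */
class MultiCollectionQuery {

    /**
     * Runs the same XPath against every collection URI and concatenates the
     * results in collection order. queryOne(uri, xpath) is a placeholder for
     * the real single-collection query.
     */
    static List<String> queryAll(List<String> collectionUris, String xpath,
                                 BiFunction<String, String, List<String>> queryOne) {
        List<String> merged = new ArrayList<>();
        for (String uri : collectionUris) {
            merged.addAll(queryOne.apply(uri, xpath));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Canned results standing in for a live database.
        Map<String, List<String>> canned = new HashMap<>();
        canned.put("/db/proposals/VLA/200402", Arrays.asList("<title>A</title>"));
        canned.put("/db/proposals/VLBA/200402", Arrays.asList("<title>B</title>"));

        List<String> hits = queryAll(
            Arrays.asList("/db/proposals/VLA/200402", "/db/proposals/VLBA/200402"),
            "/proposal/title",
            (uri, xpath) -> canned.getOrDefault(uri, Collections.emptyList()));

        // Prints [<title>A</title>, <title>B</title>]
        System.out.println(hits);
    }
}
```

Keeping the per-collection query behind a callback means the same loop could
serve both the command-line tool and a client-side service.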
Honglin
Re: xindice xpath query over multiple collections + CollectionList.java
Posted by Honglin Ye <hy...@aoc.nrao.edu>.
Hi, Vadim,
I made a class to extract the collection list (see attached). The class
contains 2 static methods,
public static ArrayList getCollectionList(Hashtable t) {...}
and
public static ArrayList getCollectionList(String s) {...}
The first one is to be used for command-line queries
and the second for programmatic queries.
I tested it, and it seems to work just fine.
Also, is there login control in Xindice?
Honglin
Re: xindice xpath query over multiple collections
Posted by Vadim Gritsenko <va...@reverycodes.com>.
Honglin Ye wrote:
> Attached XPathQuery.java contains my modification to the original
> file.
> I added 2 private functions to extract collection list and modified
> execute
> method to loop over all the collection list.
Ok, I took a look at your changes.
You have modified the xindice admin tool to search over several collections.
But where / how are you planning to use this? I mean, with this patch,
Xindice still does not provide programmatic access to multi-collection
search, so you can't use this feature in your application.
I was thinking that this feature should be built either into the
QueryService (package org.apache.xindice.client.xmldb.services), or it
should be some new service.
What do you think?
Vadim