Posted to xindice-dev@xml.apache.org by Natalia Shilenkova <ns...@gmail.com> on 2007/07/13 15:27:30 UTC

Multiple full text indexes

Hi All,

I've been looking at the full text indexing patch that was submitted
by Andy Armstrong a couple years ago. It uses plain Lucene query
syntax to search the indexes.

A full text index (like any other index) has a pattern parameter that
determines which elements/attributes are indexed, and it is possible
to create several indexes with different patterns.

If there are several indexes, which one should be used to execute a
query? The existing patch always uses the index with the shortest
pattern, but that does not really mean a better match, and the overall
effect is the same as if there were only one index, since the only way
to use another one is to drop the index with the shortest pattern.

So the question is, does it make sense to have more than one full text
index per collection? If so, how do we find out which index is a better
match for a particular query (modifying the query language to include
hints? using field names to find the right pattern?), and can a query
be run against multiple indexes?

Any ideas?

Regards,
Natalia

Re: Multiple full text indexes

Posted by Vadim Gritsenko <va...@reverycodes.com>.

Natalia Shilenkova wrote:
> I've been looking at the full text indexing patch that was submitted
> by Andy Armstrong a couple years ago. It uses plain Lucene query
> syntax to search the indexes.
> 
> Full text index (like any other index) has a pattern parameter that
> determines what elements/attributes are going to be indexed. And it is
> possible to create several indexes with different patterns.

Hmmm... Ok...

> If there are several indexes, which one should be used to execute a
> query? The existing patch always uses the index with the shortest
> pattern, but that does not really mean a better match, and the overall
> effect is the same as if there were only one index, since the only way
> to use another one is to drop the index with the shortest pattern.

This does not sound good to me. If the index with the shortest pattern did not
index the XML element(s) requested in the query, it would not find anything.
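
To make the problem concrete, here is a minimal sketch in plain Java. It is
not the patch's code: the class and method names are hypothetical, and the
patterns are just illustrative. It contrasts "always pick the shortest
pattern" with picking the index whose pattern equals the field named in the
query, which is one of the options raised in the original question.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; not Xindice's or the patch's actual selection code.
final class IndexSelectionSketch {

    // Mimics the behaviour described above: the shortest pattern always wins,
    // regardless of what the query asks for.
    static String pickShortest(List<String> patterns) {
        return patterns.stream()
                .min(Comparator.comparingInt(String::length))
                .orElseThrow(() -> new IllegalArgumentException("no indexes"));
    }

    // Assumed alternative: use the field name from the query to find the
    // index whose pattern matches it exactly.
    static Optional<String> pickByField(List<String> patterns, String queryField) {
        return patterns.stream().filter(p -> p.equals(queryField)).findFirst();
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("name", "phone", "phone@type");
        // A query on phone@type would still be sent to the "name" index.
        System.out.println(pickShortest(patterns));              // prints "name"
        System.out.println(pickByField(patterns, "phone@type")); // prints "Optional[phone@type]"
    }
}
```

With shortest-pattern selection, a query on phone@type is answered by an
index that never saw that attribute, so it returns nothing.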

> So the question is, does it make sense to have more than one full text
> index per collection?

I'm not quite sure there is any real need for more than one Lucene index. For
most cases, IMHO, it is sufficient to have a single Lucene index per
collection. Each such Lucene index can be associated with multiple Xindice
Indexer objects, each of which contributes a pattern to be indexed by this
collection's Lucene index.

To illustrate this thought, say you create several Xindice full text Indexers 
with patterns:

   name
   phone
   phone@type

All three of these Indexers could be backed by a single Lucene index. For each
document stored in Xindice, that index would hold one
org.apache.lucene.document.Document containing multiple fields
(org.apache.lucene.document.Field):

  Document:
   Field id=abcdeff  -- Stored field with Xindice document ID
   Field name=John   -- Indexed field created from <name>John</name> element
   Field phone=123-456-5555
   Field phone@type=work

So now, when querying, it is possible to assemble a complex query, such as
"give me all Johns who have a work phone". The Lucene index should store the
id field, so that we can retrieve the ids of matching documents from the
search result.
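
As a rough illustration of the idea, the sketch below models one index with
several fields per document and a conjunctive query across fields. It uses
only plain Java collections and exact value matching; it is not Lucene's API
and does no tokenizing or analysis. The class name, the second and third
sample documents, and their ids are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for "one index, many fields per document".
final class FieldIndexSketch {

    // Document id -> (field name -> value). Single-valued fields for brevity.
    private final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    void add(String id, String field, String value) {
        docs.computeIfAbsent(id, k -> new HashMap<>()).put(field, value);
    }

    // Returns ids of documents for which every field/value pair matches (AND).
    List<String> search(String... fieldValuePairs) {
        if (fieldValuePairs.length % 2 != 0) {
            throw new IllegalArgumentException("expected field/value pairs");
        }
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> doc : docs.entrySet()) {
            boolean allMatch = true;
            for (int i = 0; i < fieldValuePairs.length; i += 2) {
                String actual = doc.getValue().get(fieldValuePairs[i]);
                if (!fieldValuePairs[i + 1].equals(actual)) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) ids.add(doc.getKey());
        }
        return ids;
    }

    // The first document is the example above; the other two are invented.
    static FieldIndexSketch sample() {
        FieldIndexSketch index = new FieldIndexSketch();
        index.add("abcdeff", "name", "John");
        index.add("abcdeff", "phone", "123-456-5555");
        index.add("abcdeff", "phone@type", "work");
        index.add("doc2", "name", "John");
        index.add("doc2", "phone@type", "home");
        index.add("doc3", "name", "Mary");
        index.add("doc3", "phone@type", "work");
        return index;
    }

    public static void main(String[] args) {
        // All Johns who have a work phone.
        System.out.println(sample().search("name", "John", "phone@type", "work")); // prints "[abcdeff]"
    }
}
```

Because every pattern lands in the same index as a separate field, there is
no need to choose between indexes at query time; the field names in the query
do the selection.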

PS As an aside... There are at least two options for how a Xindice document
can correspond to a Lucene Document:

  * 1:1 mapping. It would only allow searching for whole documents, since all
we would know from the search result is the document id.

  * Create a Document for each matching element, which would include its own
data and the data of all nested matches as well. It would make it possible to
search for the particular elements matching a query, if we can figure out how
to express this in a Lucene query.

Either way, we can start with the simpler option and think about how to do
more complex searches later.
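
The difference between the two mappings can be sketched as follows. This is
plain Java using the JDK's DOM parser, not Lucene; the class name, the sample
XML, and the "element" key used to identify a hit are all assumptions made
for illustration. Patterns are treated as plain element names here, and
attribute patterns are left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of the two mapping options; entries are plain maps.
final class MappingSketch {

    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Option 1: one entry per Xindice document. A hit yields only the id.
    static Map<String, String> oneToOne(String docId, Element root, List<String> patterns) {
        Map<String, String> entry = new LinkedHashMap<>();
        entry.put("id", docId);
        for (String p : patterns) {
            NodeList hits = root.getElementsByTagName(p);
            for (int i = 0; i < hits.getLength(); i++) {
                // Repeated elements are folded into one space-separated value.
                entry.merge(p, hits.item(i).getTextContent(), (a, b) -> a + " " + b);
            }
        }
        return entry;
    }

    // Option 2: one entry per matching element. A hit also names the element.
    static List<Map<String, String>> perElement(String docId, Element root, List<String> patterns) {
        List<Map<String, String>> entries = new ArrayList<>();
        for (String p : patterns) {
            NodeList hits = root.getElementsByTagName(p);
            for (int i = 0; i < hits.getLength(); i++) {
                Map<String, String> entry = new LinkedHashMap<>();
                entry.put("id", docId);
                entry.put("element", p + "[" + (i + 1) + "]");
                // getTextContent() includes the text of nested elements too.
                entry.put("text", hits.item(i).getTextContent());
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        Element root = parse("<person><name>John</name>"
                + "<phone type=\"work\">123-456-5555</phone>"
                + "<phone type=\"home\">123-456-7777</phone></person>");
        List<String> patterns = Arrays.asList("name", "phone");
        System.out.println(oneToOne("abcdeff", root, patterns));
        System.out.println(perElement("abcdeff", root, patterns));
    }
}
```

The per-element variant produces more entries for the same document, but each
hit can be traced back to a specific element rather than only to the document.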

Vadim

> If so, how do we find out which index is a better
> match for a particular query (modifying the query language to include
> hints? using field names to find the right pattern?), and can a query
> be run against multiple indexes?
> 
> Any ideas?
> 
> Regards,
> Natalia