Posted to java-user@lucene.apache.org by Chris Conrad <cc...@vasoftware.com> on 2005/01/26 01:38:09 UTC

Optimal index structure

I'm currently working on building a search function for my application 
and am looking for guidance on what the optimal way to store the index 
would be.

The application has several different document types with documents 
split into different categories.  Each category has differing numbers 
of documents of each type.  A small category may have as few as 0 to 5 
documents of each type; a large category might have as many as 50,000+ 
documents of each type.  There are upwards of 100,000 categories.  The 
search function would never have to search documents from more than one 
category at a time, but should be able to search either a single 
document type or multiple document types together.  I need to be able 
to handle over 1,000,000 searches a day with as many as 50 simultaneous 
searches at peak times.
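
For scale, the load figures above can be turned into rates.  This is only a
back-of-the-envelope sketch: the 200 ms per-search latency is an assumed
placeholder rather than a measurement, and the class and method names are
made up for illustration.

```java
// Rough load arithmetic for the figures quoted above.
final class LoadEstimate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average searches per second if the daily total were spread evenly. */
    static double averagePerSecond(long searchesPerDay) {
        return (double) searchesPerDay / SECONDS_PER_DAY;
    }

    /**
     * Searches completed per second when `concurrent` searches are always in
     * flight and each one takes `latencyMillis` (an assumed figure).
     */
    static double peakPerSecond(int concurrent, double latencyMillis) {
        return concurrent * 1000.0 / latencyMillis;
    }

    public static void main(String[] args) {
        System.out.printf("average: %.1f searches/s%n", averagePerSecond(1_000_000));
        // 200 ms per search is a placeholder, not a measurement.
        System.out.printf("peak:    %.1f searches/s%n", peakPerSecond(50, 200.0));
    }
}
```

So 1,000,000 searches a day is only about 11.6 searches per second on
average; it is the 50 simultaneous searches at peak that set the real
requirement.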

My current thinking is that each category would get its own index.  
Each document type would have a keyword which indicates which document 
type it is.  When doing a search, I can either add a filter for that 
particular document type, or if the search is over all document types I 
can leave the filter out.  Alternatively, I could put everything in one 
very large index and select category and document type with filters.  Or 
I can have an index for each document type for each category and use 
multi-index searchers when necessary.

I'm afraid that the description above is quite convoluted, so let me 
know if further clarification is necessary.

Any advice is welcome.

Thanks


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Optimal index structure

Posted by Chris Conrad <cc...@vasoftware.com>.
On Jan 25, 2005, at 5:29 PM, Tea Yu wrote:

>   How many total documents will be there?  I'll opt for a single index if
> search in "all categories" meets the performance target, else you may want
> to consider distributed searchers.  arguments for a single index:
>

Fortunately, there is no need for an all categories search.  I won't be 
searching across categories, just across document types.  In total, there 
will be somewhere near 15,000,000 documents across about 100,000 
categories.  But, again, the distribution across categories is very 
uneven.  There will be categories with a total of 5 or so documents, 
with other categories having over 100,000.

>   1) all doc scores will have to be calculated anyway leveraging Searcher or
> (Parallel)MultiSearcher which should be most expensive (with a slight
> overhead to aggregate and sort the docs in the latter)
>   2) you'll most likely want to aggregate N categories into an index anyway
> to avoid having too many opened files

I am concerned about the number of concurrent open files, but I think 
that may be mitigated since some categories will receive virtually no 
searches (since they have very few documents or those documents are 
mostly very old).  I would say that the number of categories searched 
frequently will be under 5000.  I was thinking of using an LRU cache of 
open indexes which would keep the number of open files under control 
and ensure that frequently used indexes are quickly available.
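
A minimal sketch of that LRU idea, using only the JDK.  OpenIndexCache, the
opener callback and the limit of 2 open indexes are all hypothetical; the
handle type H stands in for whatever searcher object the index library
provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// LRU cache of open index handles, keyed by category ID (illustrative only).
final class OpenIndexCache<H extends AutoCloseable> {
    private final Function<String, H> opener;
    private final LinkedHashMap<String, H> open;

    OpenIndexCache(int maxOpen, Function<String, H> opener) {
        this.opener = opener;
        // accessOrder=true makes iteration order least-recently-used first.
        this.open = new LinkedHashMap<String, H>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, H> eldest) {
                if (size() <= maxOpen) {
                    return false;
                }
                closeQuietly(eldest.getValue()); // release its file handles
                return true;                     // and drop it from the map
            }
        };
    }

    /** Returns the open handle for a category, opening it on a cache miss. */
    synchronized H get(String categoryId) {
        H handle = open.get(categoryId); // also marks it most recently used
        if (handle == null) {
            handle = opener.apply(categoryId);
            open.put(categoryId, handle); // may evict the least recently used
        }
        return handle;
    }

    synchronized int size() {
        return open.size();
    }

    private static void closeQuietly(AutoCloseable c) {
        try {
            c.close();
        } catch (Exception e) {
            // Ignored in this sketch; a real version would log it.
        }
    }

    public static void main(String[] args) {
        OpenIndexCache<AutoCloseable> cache = new OpenIndexCache<AutoCloseable>(
                2, id -> () -> System.out.println("closing index for " + id));
        cache.get("cat-1");
        cache.get("cat-2");
        cache.get("cat-1"); // cat-2 is now the least recently used
        cache.get("cat-3"); // prints "closing index for cat-2"
    }
}
```

One caveat with this sketch: it closes a handle as soon as it is evicted, so
a real version would have to make sure no search is still using that handle,
for example by reference counting.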


>   3) most of the time will be spent in context switching if having too many
> indexes searched in parallel
>

I will be limiting the number of search threads to 4-12 (this will most 
likely be running on a dedicated quad Xeon).
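
One way to cap the number of search threads is a fixed-size thread pool, so
that extra requests queue up instead of running in parallel.  The sketch
below assumes a pool of 8 threads (any value in the 4-12 range would do) and
uses a short sleep as a stand-in for the actual search; BoundedSearchPool is
a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Runs searches on a fixed number of worker threads (illustrative only).
final class BoundedSearchPool {
    private final ExecutorService workers;

    BoundedSearchPool(int searchThreads) {
        this.workers = Executors.newFixedThreadPool(searchThreads);
    }

    /** Queues a search; it starts once one of the worker threads is free. */
    <T> Future<T> submit(Callable<T> search) {
        return workers.submit(search);
    }

    void shutdown() {
        workers.shutdown();
    }

    /** Submits `requests` dummy searches and reports the peak parallelism. */
    static int peakConcurrency(int searchThreads, int requests) {
        BoundedSearchPool pool = new BoundedSearchPool(searchThreads);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<Integer>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < requests; i++) {
                pending.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(5); // stand-in for the actual index search
                    running.decrementAndGet();
                    return now;
                }));
            }
            for (Future<Integer> f : pending) {
                f.get(); // wait for every search to finish
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // 50 simultaneous requests, but never more than 8 searches running.
        System.out.println("peak parallel searches: " + peakConcurrency(8, 50));
    }
}
```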

>   an alternative will be to optimize the structure base on usage pattern,
> e.g. having 1 full category index and several sub-categories indexes, if
> reindexing is not a problem
>

Re-indexing will be an issue since it looks like it will take on the 
order of 3-4 days to index everything.
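
To put that in perspective, here is the indexing rate those numbers imply,
assuming the 3-4 days covers all of the roughly 15,000,000 documents
mentioned above and that indexing runs around the clock.  ReindexRate is
just an illustrative name.

```java
// Indexing throughput implied by "15,000,000 documents in 3-4 days".
final class ReindexRate {
    /** Documents indexed per second for a full rebuild taking `days`. */
    static double docsPerSecond(long totalDocs, double days) {
        return totalDocs / (days * 24 * 60 * 60);
    }

    public static void main(String[] args) {
        long totalDocs = 15_000_000L;
        System.out.printf("3 days: %.1f docs/s%n", docsPerSecond(totalDocs, 3));
        System.out.printf("4 days: %.1f docs/s%n", docsPerSecond(totalDocs, 4));
    }
}
```

That works out to somewhere between about 43 and 58 documents per second.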

Thanks for your input.




Re: Optimal index structure

Posted by Tea Yu <te...@netvigator.com>.
  How many documents will there be in total?  I'd opt for a single index if
searching "all categories" meets the performance target; otherwise you may
want to consider distributed searchers.  Arguments for a single index:

  1) All doc scores have to be calculated anyway, whether through Searcher or
(Parallel)MultiSearcher, and that should be the most expensive part (the
latter adds a slight overhead to aggregate and sort the docs)
  2) You'll most likely want to aggregate N categories into an index anyway
to avoid having too many open files
  3) Most of the time will be spent context switching if too many indexes are
searched in parallel

  An alternative is to optimize the structure based on usage patterns,
e.g. one full category index and several sub-category indexes, if
reindexing is not a problem

  Tea


