Posted to java-user@lucene.apache.org by Saurabh Gokhale <sa...@gmail.com> on 2011/08/15 19:14:51 UTC

Regarding multiple index creation and Searching

Hi All,

In my application, we have been maintaining a Lucene index covering three years
of data (a single Lucene index of more than 70GB). To improve performance, it
was recently decided to break the index into three indexes of one year's worth
of data each. Before we work on the required change, I wanted to get
clarification on a few questions. My setup has the following properties:

OS: Red Hat Enterprise
Lucene: 3.2
Java: JDK 1.6

Questions:
1. What is the average acceptable size for a Lucene index that is considered
OK for searching (before it should be broken down into multiple indexes)?
2. Other than performance, what should be the criteria for splitting the index
into multiple indexes? (Criteria like: no single file in the index should be
more than 2GB, the total Lucene index folder size should not be above 10GB,
etc.)


(Regarding the code changes required to split the documents by year:)
I will be reindexing all the documents using a modified code base. For that I
will be required to:

3. Create multiple IndexWriters and index each document using the appropriate
writer, as per the date of the document.
4. While searching, use MultiSearcher or ParallelMultiSearcher to search
across all indexes at once.

Do I have to make any other changes as far as the code is concerned?
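As a sketch of the routing in point 3: a hypothetical helper could pick the
per-year index directory from the document's date (the name indexFor and the
"index-<year>" naming are illustrative only; the actual IndexWriter wiring is
omitted):

```java
import java.util.Calendar;
import java.util.Date;

public class YearRouter {
    // Pick the per-year index directory for a document by its date.
    // Hypothetical naming scheme: "index-2009", "index-2010", ...
    static String indexFor(Date docDate) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(docDate);
        return "index-" + cal.get(Calendar.YEAR); // e.g. "index-2010"
    }

    public static void main(String[] args) {
        Calendar cal = Calendar.getInstance();
        cal.set(2010, Calendar.MARCH, 15);
        System.out.println(indexFor(cal.getTime())); // index-2010
    }
}
```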

Thanks

Saurabh

Re: Regarding multiple index creation and Searching

Posted by Mihai Caraman <ca...@gmail.com>.
I've heard ~80 million docs per index (varying with average document size).

@Uwe Schindler: Is hashed distribution really necessary when using
MultiReader? I did hear that Solr uses a consistent hashing algorithm with
shards of indexes, but the MultiReader documentation doesn't say anything
about hashing.

RE: Regarding multiple index creation and Searching

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

This all depends on your index contents and hardware. In general, the size of
a single index / index segment vs. multiple segments / indexes is not an
issue on a single machine. To scale, you should also think of using more than
one machine with e.g. ElasticSearch or Apache Solr instead of plain Lucene
(they provide that functionality out of the box). For the single-machine
case, you can only speed things up through parallelization.

> 1. What is the average acceptable size for a Lucene index that is
> considered OK for searching (before it should be broken down into multiple
> indexes)?
> 2. Other than performance, what should be the criteria for splitting the
> index into multiple indexes? (Criteria like: no single file in the index
> should be more than 2GB, the total Lucene index folder size should not be
> above 10GB, etc.)

Depends. On a single machine it does not matter how big the files are. When
searching, an index consisting of several sub-indexes / segments behaves
almost identically to one big optimized one. This only makes a difference
when you parallelize.

> (Regarding the code changes required to split the documents by year:)
> I will be reindexing all the documents using a modified code base. For
> that I will be required to:
>
> 3. Create multiple IndexWriters and index each document using the
> appropriate writer, as per the date of the document.

That's fine. The question is whether it makes sense. Will the results of
search queries be distributed equally across all indexes? If you want to
parallelize, it's often better to use a hash-based distribution.
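Such a hash-based distribution could look like the following sketch (the
method name and shard count are illustrative, not part of any Lucene API):

```java
public class ShardRouter {
    // Route a document to one of numShards indexes by hashing its ID,
    // so that matches for a typical query come from all shards roughly
    // equally, and a parallel search keeps every shard busy.
    static int shardFor(String docId, int numShards) {
        // Mask the sign bit instead of calling Math.abs(), which
        // overflows when hashCode() returns Integer.MIN_VALUE.
        return (docId.hashCode() & 0x7fffffff) % numShards;
    }

    public static void main(String[] args) {
        // A stable value in [0, 3) for the same ID on every run.
        System.out.println(shardFor("doc-42", 3));
    }
}
```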

> 4. While searching, use multiSearcher or ParallelMultiSearcher to search
> across all indexes at once.

MultiSearcher and ParallelMultiSearcher are deprecated and broken (and no
longer supported). The correct way to search different indexes is to wrap all
sub-indexes in a MultiReader and then use a single IndexSearcher on top of
it. To parallelize, pass an ExecutorService to its constructor. Please note:
IndexSearcher can only parallelize if there are sub-indexes, so one big
optimized index does not help here :-) Ideally you would create several
separate indexes using a hash-based distribution of documents.
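A minimal sketch of that wiring against the Lucene 3.x API, assuming three
hypothetical per-year index directories under /indexes (paths and thread-pool
size are illustrative only):

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class YearlySearch {
    public static void main(String[] args) throws Exception {
        // Open one reader per sub-index (hypothetical locations).
        IndexReader r2009 = IndexReader.open(FSDirectory.open(new File("/indexes/2009")));
        IndexReader r2010 = IndexReader.open(FSDirectory.open(new File("/indexes/2010")));
        IndexReader r2011 = IndexReader.open(FSDirectory.open(new File("/indexes/2011")));

        // Wrap the sub-indexes in a single MultiReader instead of the
        // deprecated MultiSearcher / ParallelMultiSearcher.
        MultiReader multi = new MultiReader(r2009, r2010, r2011);

        // Passing an ExecutorService makes IndexSearcher search the
        // sub-readers in parallel.
        ExecutorService pool = Executors.newFixedThreadPool(3);
        IndexSearcher searcher = new IndexSearcher(multi, pool);

        // ... run queries with searcher.search(query, n) as usual ...

        searcher.close();
        multi.close(); // also closes the sub-readers by default
        pool.shutdown();
    }
}
```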

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org