Posted to dev@lucene.apache.org by lu...@jakarta.apache.org on 2004/07/20 23:39:16 UTC

[Jakarta Lucene Wiki] New: PainlessIndexing

   Date: 2004-07-20T14:39:16
   Editor: JulienNioche <ju...@lingway.com>
   Wiki: Jakarta Lucene Wiki
   Page: PainlessIndexing
   URL: http://wiki.apache.org/jakarta-lucene/PainlessIndexing

   hint for indexing with lucene

New Page:

IndexWriter has a useful method called (at least temporarily) '''setMinMergeDocs'''
that should be used to avoid file-handle problems and reduce
indexing time.

File-handle problems often arise because people use large '''mergeFactor'''
values to speed up indexing.  The maximum number of files open during a merge is around mergeFactor * (5 + number of indexed fields),
which can be too many for an FSDirectory.
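
As a rough illustration of the estimate above, the sketch below just evaluates mergeFactor * (5 + number of indexed fields) for a few values. The class and method names are hypothetical and the field count of 10 is an assumed example; this is arithmetic on the rule of thumb, not a call into Lucene.

```java
// Hypothetical helper: evaluates the rule of thumb quoted above.
// It does not touch Lucene; it only does the arithmetic.
final class FileHandleEstimate {

    // Approximate peak number of open files during a merge.
    static int maxOpenFiles(int mergeFactor, int indexedFields) {
        return mergeFactor * (5 + indexedFields);
    }

    public static void main(String[] args) {
        int indexedFields = 10; // assumed value, for illustration only
        for (int mergeFactor : new int[] {10, 50, 100}) {
            System.out.println("mergeFactor=" + mergeFactor
                    + " -> about " + maxOpenFiles(mergeFactor, indexedFields)
                    + " open files");
        }
    }
}
```

With 10 indexed fields, a mergeFactor of 10 gives about 150 open files, while a mergeFactor of 100 gives about 1500, which is already above the per-process limit on many default system configurations.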

By setting '''minMergeDocs''' to a higher value, you index and merge in the
RAMDirectory that the IndexWriter uses internally. When the limit set by '''minMergeDocs''' is reached (e.g. 1000 documents), a segment is written to
the file system. '''mergeFactor''' controls how many segments are merged at once, so with a '''mergeFactor''' of 10, once
you have 10 segments on the file system (already 10 x 1000 docs), the
IndexWriter merges them all into a single segment. This is equivalent to
an optimize, I think. The process continues like this until indexing is finished.
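
The flush-and-merge cycle described above can be sketched as a toy simulation. This is a simplified model under stated assumptions, not Lucene's actual merge code: segments are represented only by their document counts, a merge fires whenever the newest mergeFactor segments have the same size, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the behaviour described above (not Lucene's implementation).
final class MergeSimulation {

    // Returns the sizes, in documents, of the on-disk segments
    // after docCount documents have been added and the writer is closed.
    static List<Integer> simulate(int docCount, int minMergeDocs, int mergeFactor) {
        List<Integer> onDisk = new ArrayList<>();
        int bufferedInRam = 0;
        for (int i = 0; i < docCount; i++) {
            bufferedInRam++;
            if (bufferedInRam == minMergeDocs) {
                onDisk.add(bufferedInRam);   // write one segment to the file system
                bufferedInRam = 0;
                mergeIfNeeded(onDisk, mergeFactor);
            }
        }
        if (bufferedInRam > 0) {
            onDisk.add(bufferedInRam);       // remaining documents are flushed on close
        }
        return onDisk;
    }

    // Merges the newest mergeFactor segments while they all have the same size.
    private static void mergeIfNeeded(List<Integer> onDisk, int mergeFactor) {
        while (onDisk.size() >= mergeFactor) {
            int from = onDisk.size() - mergeFactor;
            int size = onDisk.get(onDisk.size() - 1);
            int total = 0;
            for (int i = from; i < onDisk.size(); i++) {
                int current = onDisk.get(i);
                if (current != size) {
                    return;                  // sizes differ: nothing to merge yet
                }
                total += current;
            }
            onDisk.subList(from, onDisk.size()).clear();
            onDisk.add(total);               // one merged segment replaces them
        }
    }
}
```

In this model, simulate(12500, 1000, 10) ends with segments of 10000, 1000, 1000 and 500 documents: the first ten 1000-document segments have been merged into one, and the rest are still waiting for enough neighbours.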

Combining these parameters should be enough to achieve good performance.
The advantage of '''minMergeDocs''' is that you make heavy use of the
RAMDirectory inside your IndexWriter (== fast) without having to watch
RAM usage as closely as you would when managing a RAMDirectory yourself. At the
same time, keeping '''mergeFactor''' low limits the risk of running out of file
handles.

<hint given by JulienNioche>
