Posted to general@lucene.apache.org by sarfaraz masood <sa...@yahoo.com> on 2010/07/12 20:14:04 UTC

Indexing large amount of data

I have a large amount of data (120 GB) to index, and I want to improve the indexing performance. I went through the documentation on the Lucene website, which describes several ways to improve indexing speed.

I am working on Debian Linux on amd64, so very large files are supported. The Java version is 1.6.

I tried several of the suggestions in that documentation but got unusual results.

1) Reuse field & document objects to reduce the GC overhead using the field.setValue() method.. By doing this, instead of speeding up, the indexing speed reduced drastically. i know this is unusual but thats what happened.

2) Tuning parameters via setMergeFactor() and setMaxBufferedDocs().
The default value for both is 10. I increased both to 1000; by doing so, the number of .cfs files in the index directory grew many times over, and I got java.io.IOException: Too many open files.
    If I keep the default value of 10 for both parameters, that error is avoided, but then the .fdt file in the index becomes very large.
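
For reference, these are the knobs I am talking about (again just a sketch, assuming Lucene 3.0.x; the values shown are what I tried, not recommendations):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TuningDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("index")),            // illustrative path
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);

        writer.setMergeFactor(1000);      // segments per level before a merge kicks in
        writer.setMaxBufferedDocs(1000);  // docs buffered in RAM before a flush to disk

        writer.close();
    }
}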

So where am I going wrong? How can I overcome these problems and speed up my indexing process?



Re: Indexing large amount of data

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Jul 12, 2010 at 11:14 AM, sarfaraz masood <
sarfarazmasood2002@yahoo.com> wrote:

> 1) Reuse field & document objects to reduce the GC overhead using the
> field.setValue() method.. By doing this, instead of speeding up, the
> indexing speed reduced drastically. i know this is unusual but thats what
> happened.
>

GC overhead is much, much lower on recent JVMs such as the one you are using.  Avoiding
*copying* still pays very large dividends, but avoiding allocation rarely does.

You should look at the new TokenStream API.
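
For illustration only, a minimal sketch of consuming the attribute-based TokenStream API (assuming Lucene 3.0.x; the field name and sample text are placeholders). The point is that the stream reuses a single set of attributes rather than allocating a new Token object per term:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class TokenStreamDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream ts = analyzer.tokenStream("body",
                new StringReader("indexing a large amount of data"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);  // reused attribute
        ts.reset();
        while (ts.incrementToken()) {    // no new Token object per term
            System.out.println(term.term());
        }
        ts.end();
        ts.close();
    }
}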


>
> 2) Tuning parameters via setMergeFactor() and setMaxBufferedDocs().
> The default value for both is 10. I increased both to 1000; by doing so,
> the number of .cfs files in the index directory grew many times over, and
> I got java.io.IOException: Too many open files.
>

Have you raised the operating system's open-file limit to the maximum possible?  It is
common for the default limit to be unreasonably small.

> So where am I going wrong? How can I overcome these problems and speed up
> my indexing process?
>
>
Another thing you might investigate is indexing on multiple machines in
anticipation of doing sharded search using Solr Cloud or Katta.  Of the
changes you can make relatively easily, that will have the largest impact on
total indexing time.
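
As a rough illustration of the idea (a sketch only, assuming Lucene 3.0.x; the shard count, paths, and field name are made up), documents can be split round-robin across independent indexes that are later searched together, for example via Solr/Katta or a MultiReader:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ShardedIndexer {
    public static void main(String[] args) throws Exception {
        int numShards = 4;                                   // illustrative
        IndexWriter[] shards = new IndexWriter[numShards];
        for (int i = 0; i < numShards; i++) {
            shards[i] = new IndexWriter(
                FSDirectory.open(new File("shard-" + i)),    // one index per shard
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        }

        // Assign documents to shards round-robin; each shard flushes and
        // merges independently, so the work parallelizes across writers
        // (or across machines, with one shard per machine).
        for (int docId = 0; docId < 1000; docId++) {
            Document doc = new Document();
            doc.add(new Field("body", "document " + docId,
                              Field.Store.NO, Field.Index.ANALYZED));
            shards[docId % numShards].addDocument(doc);
        }

        for (IndexWriter w : shards) {
            w.close();   // commits each shard
        }
    }
}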