Posted to user@lucenenet.apache.org by Некрасов Александр Сергеевич <ne...@granit.ru> on 2009/06/30 15:18:47 UTC

40000 segments for index with 2000 documents

I'm creating an ASP.NET web site that uses Lucene.Net via NHibernate.Search as its search engine, and I've run into a very bad performance problem: removing one document from the index (as part of an update) takes more than 5 minutes, which is unacceptable. The site runs under IIS on Windows.

 

There are about 3000 documents with one indexed field that are being updated 3-5 times per minute. It looks like a new segment is created per transaction, because right now there are about 40000 .cfs/.del (paired) files, which makes 80000 files in the index, and the index size is about 25 MB. But after optimization (which took 7 minutes) the index size shrank to 350 KB.

 

I'm not sure if it's a misconfiguration issue or something else. Here are the Lucene settings (the defaults):

maxBufferedDeleteTerms = 1000, maxMergeDocs = 2147483647, mergeFactor = 10, minMergeDocs = 10, useCompoundFile = true
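
For the record, those settings map onto the IndexWriter roughly like the sketch below (assuming a Lucene 2.1-era Lucene.Net API; exact overloads vary by version, the path is made up, and NHibernate.Search normally creates and configures the writer itself):

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    class WriterSettingsDemo
    {
        static void Main()
        {
            // Hypothetical path; second argument false = open an existing index.
            Directory dir = FSDirectory.GetDirectory(@"C:\indexes\site", false);
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);

            writer.SetMaxBufferedDeleteTerms(1000); // flush deletes after 1000 buffered terms
            writer.SetMaxMergeDocs(int.MaxValue);   // 2147483647: no cap on merged segment size
            writer.SetMergeFactor(10);              // merge once 10 segments pile up at one level
            writer.SetMaxBufferedDocs(10);          // "minMergeDocs" under its Lucene 2.x name
            writer.SetUseCompoundFile(true);        // pack each segment into a single .cfs file

            writer.Close();
        }
    }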


RE: 40000 segments for index with 2000 documents

Posted by George Aroush <ge...@aroush.net>.
Optimization is disk bound -- it will read the whole index and write it
back.  If the 7 minutes it took to optimize your index is not acceptable,
get a faster hard drive (faster RPM, seek time, etc.).

Btw, 3000 documents is a small index, but if *all* (or most) of them are
being updated 3-5 times a minute, you will run into fragmentation issues
(and many segment files), as you discovered.

-- George



RE: 40000 segments for index with 2000 documents

Posted by Dean Harding <de...@dload.com.au>.
> There are about 3000 documents with one indexed field that are being
> updated 3-5 times per minute. It looks like a new segment is created
> per transaction, because right now there are about 40000 .cfs/.del
> (paired) files, which makes 80000 files in the index, and the index
> size is about 25 MB. But after optimization (which took 7 minutes)
> the index size shrank to 350 KB.

So what's the performance like after optimization? Optimization doesn't
happen automatically in Lucene; you must do it manually. Adding a document
simply appends it to the end of the index, and removing a document simply
marks it as deleted. Updating a document is a remove-then-add operation.
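
In code, that remove-then-add looks roughly like this sketch (the "Id" field name, and the writer, entityId and newDoc variables, are made up for illustration; NHibernate.Search normally issues these calls for you):

    // Sketch of an update as remove-then-add, assuming Lucene.Net 2.1+ and a
    // hypothetical unique "Id" field stored on every document.
    // writer, entityId and newDoc are assumed to be in scope.
    Term idTerm = new Term("Id", entityId);
    writer.DeleteDocuments(idTerm); // only marks the old copy deleted (a .del file)
    writer.AddDocument(newDoc);     // appends the new copy to a fresh segment

    // Lucene 2.1+ also wraps both steps in a single call:
    // writer.UpdateDocument(idTerm, newDoc);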

It's only when you call Optimize() that Lucene actually rearranges things on
disk for faster access, and that's something you should be doing on a
regular basis. Here, we do an Optimize() after every 1000 "modifications"
(adds, deletes, updates). For a relatively small index like yours, regular
optimization shouldn't take more than a couple of seconds (it only took 7
minutes because things had gotten so far out of hand), and you can
continue to query the index while the optimization is happening.
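
A minimal sketch of that policy (the counter and threshold are our own application-side bookkeeping, not a Lucene feature):

    // Sketch: call Optimize() after every 1000 modifications. Lucene itself
    // never optimizes automatically, so the count lives in the application.
    private int _modCount;

    void OnIndexModified(IndexWriter writer)
    {
        int n = System.Threading.Interlocked.Increment(ref _modCount);
        if (n % 1000 == 0)
            writer.Optimize(); // merges all segments into one; searches can continue meanwhile
    }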

At least, that's always been my understanding.

Dean.