Posted to java-user@lucene.apache.org by Scott Smith <ss...@mainstreamdata.com> on 2012/07/16 22:29:21 UTC

Lucene reorganizing indexes

We have an application that has to do "real time" indexing of a number of documents.  About every 20 seconds it wakes up and updates the index with any changes that have been queued since the last run.  This involves adding and deleting several hundred documents, all in a single thread.  Multiple threads may be searching at the same time as the update thread (the searches run in a different process).

Back in the days of 1.42, we would force an index optimization once each day.  However, my impression is that later versions of Lucene (we are currently using 3.5) will often reorganize the index on their own when certain criteria are met.  I've been told that optimizing the index is, perhaps, no longer necessary.  Can someone describe what happens here?

The reason I'm asking is that we see our application periodically using excessive amounts of kernel time (on Windows), which normally indicates a lot of disk activity.  We are unable to correlate this with anything our code is doing.  Obviously, we expect Lucene to cause disk activity; it just seems that the last upgrade (we were at 3.02 before going to 3.5) severely increased the disk activity, which is interfering with other things running on the boxes.

Does any of this make sense to anyone?  Is there an explanation?  Thoughts about what we might do about it?

Thanks in advance.

Scott

RE: Lucene reorganizing indexes

Posted by Uwe Schindler <uw...@thetaphi.de>.

You may want to read:
http://www.searchworkings.org/blog/-/blogs/simon-says%3A-optimize-is-bad-for-you

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene reorganizing indexes

Posted by googoo <li...@gmail.com>.

Optimize will release disk space if you have a lot of deletes (a merge does
the same thing).
In my experience, optimizing also speeds up search a little.

Which JRE are you using? On Windows, if you are using a 64-bit JRE, Lucene
tries to map the index into memory.
That will use a lot of memory and also involve a lot of disk I/O.
You can rebuild the Lucene code to disable this behavior.

See the FSDirectory.open code below, which selects MMapDirectory:
org.apache.lucene.store.FSDirectory
  public static FSDirectory open(File path, LockFactory lockFactory) throws IOException {
    if ((Constants.WINDOWS || Constants.SUN_OS)
          && Constants.JRE_IS_64BIT && MMapDirectory.UNMAP_SUPPORTED) {
      return new MMapDirectory(path, lockFactory);
    } else if (Constants.WINDOWS) {
      return new SimpleFSDirectory(path, lockFactory);
    } else {
      return new NIOFSDirectory(path, lockFactory);
    }
  }
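
To see which branch of the excerpt above a given machine would take, the decision can be mirrored in plain Java with no Lucene dependency. This is only a sketch: the class DirectoryChoice and its choose method are hypothetical names, not part of Lucene, and it assumes that Constants.WINDOWS and Constants.SUN_OS are prefix checks on the os.name system property. The unmap capability is passed in as a flag because it cannot be probed without Lucene itself.

```java
// Hypothetical, Lucene-free sketch of the selection logic quoted above.
// The class and method names are illustrative and are not part of Lucene.
final class DirectoryChoice {

    /**
     * Returns the simple name of the FSDirectory subclass that the quoted
     * FSDirectory.open logic would pick for the given platform facts.
     *
     * @param osName         value of the "os.name" system property
     * @param jreIs64Bit     whether the JRE is 64-bit
     * @param unmapSupported whether mapped buffers can be unmapped
     *                       (MMapDirectory.UNMAP_SUPPORTED in the excerpt)
     */
    static String choose(String osName, boolean jreIs64Bit, boolean unmapSupported) {
        // Assumption: WINDOWS and SUN_OS are prefix checks on os.name.
        boolean windows = osName.startsWith("Windows");
        boolean sunOs = osName.startsWith("SunOS");

        if ((windows || sunOs) && jreIs64Bit && unmapSupported) {
            return "MMapDirectory";
        } else if (windows) {
            return "SimpleFSDirectory";
        } else {
            return "NIOFSDirectory";
        }
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        // Assumption: "os.arch" containing "64" is a rough stand-in for a 64-bit JRE check.
        boolean is64Bit = System.getProperty("os.arch", "").contains("64");
        // Unmap support cannot be probed here, so both cases are printed.
        System.out.println("unmap supported:   " + choose(osName, is64Bit, true));
        System.out.println("unmap unsupported: " + choose(osName, is64Bit, false));
    }
}
```

If the result on the affected boxes is MMapDirectory, one option to try before rebuilding Lucene may be to construct a specific implementation directly (for example new SimpleFSDirectory(path)) instead of calling FSDirectory.open, and then compare the disk activity.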

--
View this message in context: http://lucene.472066.n3.nabble.com/Lucene-reorganizing-indexes-tp3995399p3995702.html

RE: Lucene reorganizing indexes

Posted by Scott Smith <ss...@mainstreamdata.com>.

It's Lucene.

-----Original Message-----
From: Ralf Heyde [mailto:ralf.heyde@gmx.de] 
Sent: Monday, July 16, 2012 10:42 PM
To: java-user@lucene.apache.org
Subject: AW: Lucene reorganizing indexes 

Do you use Lucene or Solr?

We faced this problem in Solr due to overly large caches, which were (re)warmed after each commit, and never-ending full GCs.

Greets Ralf


AW: Lucene reorganizing indexes

Posted by Ralf Heyde <ra...@gmx.de>.

Do you use Lucene or Solr?

We faced this problem in Solr due to overly large caches, which were (re)warmed
after each commit, and never-ending full GCs.

Greets Ralf
