Posted to java-user@lucene.apache.org by Robert Schultz <ro...@cosmicrealms.com> on 2005/08/01 05:20:36 UTC

Any problems with a failed IndexWriter optimize call?

Hello! I am using Lucene 1.4.3

I'm building a Lucene index, that will have about 25 million documents 
when it is done.
I'm adding 250,000 at a time.

Currently there are about 1.2 million documents in there, and I ran into a problem.
After I had added a batch of 250,000, I got a 'java.lang.OutOfMemoryError'
thrown by writer.optimize(); (a standard IndexWriter).

The exception caused my program to quit out, and it didn't call 
'writer.close();'
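
(For illustration, a minimal sketch of one batch run with optimize() and
close() in a try/finally, so the writer is always closed even if optimize()
throws; the index path and field contents are placeholders, Lucene 1.4-style
API:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class AddBatch {
    public static void main(String[] args) throws Exception {
        // false = append to the existing index rather than create a new one
        IndexWriter writer = new IndexWriter("/data/lucene-index",
                                             new StandardAnalyzer(), false);
        try {
            for (int i = 0; i < 250000; i++) {
                Document doc = new Document();
                doc.add(Field.Text("contents", "document text " + i)); // placeholder content
                writer.addDocument(doc);
            }
            writer.optimize();
        } finally {
            writer.close();   // still runs if optimize() throws, so the lock is released
        }
    }
}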

First, with it dying in the middle of an optimize(), is there any chance
my index is corrupted?

Second, I know I can remove the /tmp/lucene*.lock file to clear the
lock so I can add more documents, but is it safe to do that?

I've since figured out that I can pass -Xmx to the 'java' program in
order to increase the maximum heap size.
It was using the default of 64M; I plan on increasing that to 175M to
start with.
That should solve the memory problems (I can allocate more if necessary
down the line).
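
For example (class name and classpath are placeholders):

java -Xmx175m -cp lucene-1.4.3.jar:. AddBatch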

Lastly, when I go back, open it again, and add another 250,000 and then 
call optimize again, will a failed previous optimize hurt the index at all?




Re: Any problems with a failed IndexWriter optimize call?

Posted by Tony Schwartz <to...@simpleobjects.com>.
Your index should be fine.  You could use Luke if you want to remove any dangling
files that are no longer in use.  Just run optimize again to fix it all up...  You might
want to allocate more memory than 175M, though.  Depending on your document sizes and how
quickly you want Lucene to index the data, you will want to give
Lucene plenty of memory to work with.

Tony Schwartz
tony@simpleobjects.com





Re: Any problems with a failed IndexWriter optimize call?

Posted by Dan Armbrust <da...@gmail.com>.
May I suggest:

Don't call optimize.  You don't need it.  Here is my approach: 

Keep each of your 250,000-document indexes separate - so run your
batch, build the index, and then just close it.  Don't try to optimize
it.  Put each 250,000-document batch into its own folder.

Now, when you have finished building your entire index, you will have a
bunch of separate, unoptimized Lucene indexes.  Open up a new, blank
index, and merge all of your other indexes into this one.  The end
result will be a single large (already optimized) index.
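
A rough sketch of that merge step, assuming each batch landed in its own
folder under a single parent directory (paths are made up; in the 1.4 API,
IndexWriter.addIndexes(Directory[]) optimizes the result as part of the merge):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeBatches {
    public static void main(String[] args) throws Exception {
        // One sub-folder per 250,000-document batch, built earlier without optimize().
        File[] batchDirs = new File("/data/batches").listFiles();
        Directory[] sources = new Directory[batchDirs.length];
        for (int i = 0; i < batchDirs.length; i++) {
            sources[i] = FSDirectory.getDirectory(batchDirs[i], false); // false = open existing
        }

        // Open a new, blank index and merge every batch into it.
        IndexWriter writer = new IndexWriter("/data/merged-index",
                                             new StandardAnalyzer(), true); // true = create
        try {
            writer.addIndexes(sources);   // only reads the batch indexes; result ends up optimized
        } finally {
            writer.close();
        }
    }
}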


This approach has several benefits:
- You can keep the indexing parameters tuned for speed without running
  into out-of-file-handles issues (see the sketch after this list).
- If a failure occurs, you only have to redo that batch, not start the
  entire process over.
- You avoid the unnecessary IO of constantly rewriting your data with
  optimize() calls.
- You can very easily break the indexing up across multiple machines.
- If a failure occurs while merging all of the indexes together, you
  don't lose anything, since you are only reading the existing indexes.
  You know they will all still be valid.
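
On the tuning point above: if memory serves, in the 1.4.x API these knobs are
public fields on IndexWriter (later releases turned them into setters).  A
rough sketch with arbitrary values, not recommendations:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedBatchWriter {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/data/batches/batch-0001",
                                             new StandardAnalyzer(), true);
        // Example values only -- tune to your hardware and heap size.
        writer.mergeFactor  = 50;     // merge less often: faster indexing, more open files
        writer.minMergeDocs = 1000;   // buffer more documents in RAM before writing a segment
        try {
            // ... addDocument() loop for this batch goes here ...
        } finally {
            writer.close();
        }
    }
}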

I actually wrote a wrapper for Lucene that does all of this under the
covers.  At some point, I should get it released as open source :)

Dan

-- 
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/



Re: Any problems with a failed IndexWriter optimize call?

Posted by Robert Schultz <ro...@cosmicrealms.com>.
I am going to play it safe.
I'm going to wipe the index files and start over (I've only put about 2 days
of processing time into it so far).

This time with a maximum heap size of over 512MB.


Re: Any problems with a failed IndexWriter optimize call?

Posted by Yonik Seeley <ys...@gmail.com>.
If all segments were flushed to disk (no adds since the last time
the index writer was opened), then it seems like the index should be
fine.

The big question I have is what happens when there are in-memory
segments and an OOM exception hits during an optimize?  Is data
loss possible?

-Yonik



Re: Any problems with a failed IndexWriter optimize call?

Posted by Chris Hostetter <ho...@fucit.org>.
If I remember correctly, what you'll find when you remove the lock file is
that your index is still usable, and from the perspective of new
IndexWriters/IndexReaders it's in the same state it was in prior to the call
to optimize, but from the perspective of an external observer, the index
directory will contain a bunch of garbage files from the aborted optimize.
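
(For reference, a small sketch of clearing a stale lock through the public API
instead of deleting the lock file by hand; the index path is made up:)

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ClearStaleLock {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/data/lucene-index", false);
        // Only safe when no live IndexWriter/IndexReader is still using the index.
        if (IndexReader.isLocked(dir)) {
            IndexReader.unlock(dir);
        }
        dir.close();
    }
}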

At my work, we've taken the "safe" attitude that if you get an OOM
exception, you should assume your index is corrupted and rebuild from
scratch -- but I think it's safe to clean up the garbage
files manually.

Which brings up something I meant to ask a while back: has anyone written
any index cleaning code?  Something that locks an index (using the public API) and
then inspects the files (using the API, or using low-level knowledge of the
file structure) to generate a list of 'garbage' files in the index
directory that should be safely deletable?

(I considered writing this a few months ago, but then our "play it safe,
treat it as corrupt" policy came out, and it wasn't all that necessary
for me.)

It seems like it might be a handy addition to the sandbox.


-Hoss

