Posted to java-user@lucene.apache.org by "Kevin A. Burton" <bu...@newsmonster.org> on 2004/07/08 20:02:17 UTC

Way to repair an index broken during 1/2 optimize?

So... the other day I sent an email about building an index with 14M 
documents.

That went well, but the optimize() was taking FOREVER.  It took 7 hours 
to generate the whole index, and as of 10 AM (6 hours later) it was still 
optimizing and I needed the box back.

So is it possible to fix this index now?  Can I just delete the most 
recent segment that was created?  I can find it with ls -alt.

Also... what can I do to speed up this optimize?  Ideally it wouldn't 
take 6 hours.

Kevin

-- 

Please reply using PGP.

    http://peerfear.org/pubkey.asc    
    
    NewsMonster - http://www.newsmonster.org/
    
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Way to repair an index broken during 1/2 optimize?

Posted by Doug Cutting <cu...@apache.org>.
Kevin A. Burton wrote:
>> With the typical handful of fields, one should never see more than 
>> hundreds of files.
>>
> We only have 13 fields... though to be honest I'm worried that even if I 
> COULD do the optimize, it would run out of file handles.

Optimization doesn't open all files at once.  The most files that are 
ever opened by an IndexWriter is just:

4 + (5 + numIndexedFields) * (mergeFactor-1)

This includes during optimization.

However, when searching, an IndexReader must keep most files open.  In 
particular, the maximum number of files an unoptimized, non-compound 
IndexReader can have open is:

(5 + numIndexedFields) * (mergeFactor-1) * 
(log_base_mergeFactor(numDocs/minMergeDocs))

A compound IndexReader, on the other hand, should open at most, just:

(mergeFactor-1) * (log_base_mergeFactor(numDocs/minMergeDocs))

An optimized, non-compound IndexReader will open just (5 + 
numIndexedFields) files.

And an optimized, compound IndexReader should only keep one file open.
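
These bounds are easy to sanity-check numerically. Below is a small sketch that encodes the two formulas above; the helper names are mine and are not part of the Lucene API.

```java
// Encodes the open-file formulas quoted above. Helper names are mine;
// they are not Lucene APIs.
public class OpenFileEstimates {

    // Max files an IndexWriter ever has open, including during optimize:
    // 4 + (5 + numIndexedFields) * (mergeFactor - 1)
    static int writerMaxOpen(int numIndexedFields, int mergeFactor) {
        return 4 + (5 + numIndexedFields) * (mergeFactor - 1);
    }

    // Max files an unoptimized, non-compound IndexReader has open:
    // (5 + numIndexedFields) * (mergeFactor - 1)
    //   * log_mergeFactor(numDocs / minMergeDocs)
    static int readerMaxOpen(int numIndexedFields, int mergeFactor,
                             long numDocs, int minMergeDocs) {
        int levels = (int) Math.ceil(Math.log((double) numDocs / minMergeDocs)
                                     / Math.log(mergeFactor));
        return (5 + numIndexedFields) * (mergeFactor - 1) * levels;
    }
}
```

For Kevin's 13 fields and a mergeFactor of 10, the writer tops out at 4 + 18*9 = 166 open files, well under a typical file-handle limit; the expensive case is an unoptimized, non-compound reader over 14M documents.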

Doug



Re: Way to repair an index broken during 1/2 optimize?

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:

>
> Something sounds very wrong for there to be that many files.
>
> The maximum number of files should be around:
>
> (7 + numIndexedFields) * (mergeFactor-1) * 
> (log_base_mergeFactor(numDocs/minMergeDocs))
>
> With 14M documents, log_10(14M/1000) is 4, which gives, for you:
>
> (7 + numIndexedFields) * 36 = 230k
> 7*36 + numIndexedFields*36 = 230k
> numIndexedFields = (230k - 7*36) / 36 =~ 6k
>
> So you'd have to have around 6k unique field names to get 230k files. 
> Or something else must be wrong. Are you running on win32, where file 
> deletion can be difficult?
>
> With the typical handful of fields, one should never see more than 
> hundreds of files.
>
We only have 13 fields... though to be honest I'm worried that even if I 
COULD do the optimize, it would run out of file handles.

This is very strange...

I'm going to increase minMergeDocs to 10000, run the full conversion on 
one box, try an optimize of the corrupt index on another box, and see 
which one finishes first.

I assume optimize() can be sped up the same way that indexing is...

Kevin



Re: Way to repair an index broken during 1/2 optimize?

Posted by Doug Cutting <cu...@apache.org>.
Kevin A. Burton wrote:
> During an optimize I assume Lucene starts writing to a new segment and 
> leaves all others in place until everything is done and THEN deletes them?

That's correct.

> The only settings I use are:
> 
> targetIndex.mergeFactor=10;
> targetIndex.minMergeDocs=1000;
> 
> the resulting index has 230k files in it :-/

Something sounds very wrong for there to be that many files.

The maximum number of files should be around:

   (7 + numIndexedFields) * (mergeFactor-1) * 
(log_base_mergeFactor(numDocs/minMergeDocs))

With 14M documents, log_10(14M/1000) is about 4, which gives, for you:

   (7 + numIndexedFields) * 36 = 230k
    7*36 + numIndexedFields*36 = 230k
    numIndexedFields = (230k - 7*36) / 36 =~ 6k

So you'd have to have around 6k unique field names to get 230k files. 
Or something else must be wrong.  Are you running on win32, where file 
deletion can be difficult?

With the typical handful of fields, one should never see more than 
hundreds of files.
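
This bound can also be run backwards, to see how implausible 230k files is for 13 fields. A quick check (helper names mine, for illustration only):

```java
// Doug's on-disk file-count bound, and the same bound solved for the
// number of indexed fields. Helper names are mine, not Lucene APIs.
public class FileCountCheck {

    // log_mergeFactor(numDocs / minMergeDocs), rounded as in the thread
    static int levels(int mergeFactor, long numDocs, int minMergeDocs) {
        return (int) Math.round(Math.log((double) numDocs / minMergeDocs)
                                / Math.log(mergeFactor));
    }

    // (7 + fields) * (mergeFactor - 1) * levels
    static long maxFiles(int fields, int mergeFactor,
                         long numDocs, int minMergeDocs) {
        return (long) (7 + fields) * (mergeFactor - 1)
               * levels(mergeFactor, numDocs, minMergeDocs);
    }

    // Solve the bound backwards: how many fields would `files` imply?
    static long impliedFields(long files, int mergeFactor,
                              long numDocs, int minMergeDocs) {
        long perField = (long) (mergeFactor - 1)
                        * levels(mergeFactor, numDocs, minMergeDocs);
        return files / perField - 7;
    }
}
```

With 13 fields, mergeFactor 10, 14M docs, and minMergeDocs 1000 the bound comes out to 720 files; 230k files would imply roughly 6.4k field names, matching Doug's ~6k estimate, so something else must be wrong.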

Doug



Re: Way to repair an index broken during 1/2 optimize?

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:

> Kevin A. Burton wrote:
>
>> So is it possible to fix this index now? Can I just delete the most 
>> recent segment that was created? I can find this by ls -alt
>
>
> Sorry, I forgot to answer your question: this should work fine. I 
> don't think you should even have to delete that segment.

I'm worried about duplicate or missing content from the original index. 
I'd rather rebuild the index and waste another 6 hours (I've probably 
blown 100 hours of CPU time on this already) and have a correct index :)

During an optimize I assume Lucene starts writing to a new segment and 
leaves all others in place until everything is done and THEN deletes them?

> Also, to elaborate on my previous comment, a mergeFactor of 5000 not 
> only delays the work until the end, but it also makes the disk 
> workload more seek-dominated, which is not optimal. 

The only settings I use are:

targetIndex.mergeFactor=10;
targetIndex.minMergeDocs=1000;

the resulting index has 230k files in it :-/

I assume this is contributing to all the disk seeks.

> So I suspect a smaller merge factor, together with a larger 
> minMergeDocs, will be much faster overall, including the final 
> optimize(). Please tell us how it goes.
>
This is what I did for this last round, but then I ended up with this 
highly fragmented index.

hm...

Thanks for all the help btw!

Kevin



Re: Way to repair an index broken during 1/2 optimize?

Posted by Doug Cutting <cu...@apache.org>.
Kevin A. Burton wrote:
> So is it possible to fix this index now?  Can I just delete the most 
> recent segment that was created?  I can find this by ls -alt

Sorry, I forgot to answer your question: this should work fine.  I don't 
think you should even have to delete that segment.

Also, to elaborate on my previous comment, a mergeFactor of 5000 not 
only delays the work until the end, but it also makes the disk workload 
more seek-dominated, which is not optimal.  So I suspect a smaller merge 
factor, together with a larger minMergeDocs, will be much faster 
overall, including the final optimize().  Please tell us how it goes.
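
Why a mergeFactor of 5000 defers all the work can be seen with a toy model of the logarithmic merge policy: the segments on disk behave like the digits of the flush count written in base mergeFactor, so a huge base leaves nearly everything for the final optimize(). This is a simplified sketch of the policy, not Lucene code:

```java
// Toy model of the logarithmic merge policy: every minMergeDocs docs
// flush one level-0 segment; whenever mergeFactor segments pile up at
// a level, they merge into one segment at the next level. The steady
// state is the base-mergeFactor digit representation of the flush count.
public class MergeModel {

    // Segments left on disk (i.e. work left for optimize) after
    // `batches` flushes: the digit sum of `batches` in base `mergeFactor`.
    static long segmentsBeforeOptimize(long batches, int mergeFactor) {
        long segments = 0;
        while (batches > 0) {
            segments += batches % mergeFactor;
            batches /= mergeFactor;
        }
        return segments;
    }
}
```

14M docs with minMergeDocs=1000 means 14000 flushes: a mergeFactor of 10 leaves only 5 segments for optimize() to merge, while 5000 leaves 4002, so nearly all the merge I/O, and its seeks, lands in the final optimize().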

Doug




Re: Way to repair an index broken during 1/2 optimize?

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Peter M Cipollone wrote:

>You might try merging the existing index into a new index located on a ram
>disk.  Once it is done, you can move the directory from ram disk back to
>your hard disk.  I think this will work as long as the old index did not
>finish merging.  You might do a "strings" command on the segments file to
>make sure the new (merged) segment is not in there, and if there's a
>"deletable" file, make sure there are no segments from the old index listed
>therein.
>  
>
It's a HUGE index.  It won't fit in memory ;)  Right now it's at 8G...

Thanks though! :)

Kevin



Re: Way to repair an index broken during 1/2 optimize?

Posted by Peter M Cipollone <lu...@bihvhar.com>.
You might try merging the existing index into a new index located on a ram
disk.  Once it is done, you can move the directory from ram disk back to
your hard disk.  I think this will work as long as the old index did not
finish merging.  You might do a "strings" command on the segments file to
make sure the new (merged) segment is not in there, and if there's a
"deletable" file, make sure there are no segments from the old index listed
therein.

----- Original Message ----- 
From: "Kevin A. Burton" <bu...@newsmonster.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, July 08, 2004 2:02 PM
Subject: Way to repair an index broken during 1/2 optimize?


> So.. the other day I sent an email about building an index with 14M
> documents.
>
> That went well but the optimize() was taking FOREVER.  It took 7 hours
> to generate the whole index and when complete as of 10AM it was still
> optimizing (6 hours later) and I needed the box back.
>
> So is it possible to fix this index now?  Can I just delete the most
> recent segment that was created?  I can find this by ls -alt
>
> Also... what can I do to speed up this optimize?  Ideally it wouldn't
> take 6 hours.
>
> Kevin
>




Re: Way to repair an index broken during 1/2 optimize?

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:

> Kevin A. Burton wrote:
>
>> No... I changed the mergeFactor back to 10 as you suggested.
>
>
> Then I am confused about why it should take so long.
>
> Did you by chance set the IndexWriter.infoStream to something, so that 
> it logs merges? If so, it would be interesting to see that output, 
> especially the last entry.
>
No I didn't actually... If I run it again I'll be sure to do this.



Re: Way to repair an index broken during 1/2 optimize?

Posted by Doug Cutting <cu...@apache.org>.
Kevin A. Burton wrote:
> No... I changed the mergeFactor back to 10 as you suggested.

Then I am confused about why it should take so long.

Did you by chance set the IndexWriter.infoStream to something, so that 
it logs merges?  If so, it would be interesting to see that output, 
especially the last entry.

Doug



Re: Way to repair an index broken during 1/2 optimize?

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:

> Kevin A. Burton wrote:
>
>> Also... what can I do to speed up this optimize? Ideally it wouldn't 
>> take 6 hours.
>
>
> Was this the index with the mergeFactor of 5000? If so, that's why 
> it's so slow: you've delayed all of the work until the end. Indexing 
> on a ramfs will make things faster in general, however, if you have 
> enough RAM...

No... I changed the mergeFactor back to 10 as you suggested.

Kevin



Re: Way to repair an index broken during 1/2 optimize?

Posted by Doug Cutting <cu...@apache.org>.
Kevin A. Burton wrote:
> Also... what can I do to speed up this optimize?  Ideally it wouldn't 
> take 6 hours.

Was this the index with the mergeFactor of 5000?  If so, that's why it's 
so slow: you've delayed all of the work until the end.  Indexing on a 
ramfs will make things faster in general, if you have enough RAM...

Doug


