Posted to java-user@lucene.apache.org by "Kevin A. Burton" <bu...@newsmonster.org> on 2004/07/08 20:02:17 UTC
Way to repair an index broken during 1/2 optimize?
So... the other day I sent an email about building an index with 14M
documents.
That went well, but the optimize() was taking FOREVER. It took 7 hours
to generate the whole index, and as of 10 AM (6 hours later) it was
still optimizing, and I needed the box back.
So is it possible to fix this index now? Can I just delete the most
recent segment that was created? I can find it with ls -alt.
Also... what can I do to speed up this optimize? Ideally it wouldn't
take 6 hours.
Kevin
--
Please reply using PGP.
http://peerfear.org/pubkey.asc
NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Way to repair an index broken during 1/2 optimize?
Posted by Doug Cutting <cu...@apache.org>.
Kevin A. Burton wrote:
>> With the typical handful of fields, one should never see more than
>> hundreds of files.
>>
> We only have 13 fields... Though to be honest I'm worried that even if I
> COULD do the optimize, it would run out of file handles.
Optimization doesn't open all files at once. The most files that are
ever opened by an IndexWriter is just:
4 + (5 + numIndexedFields) * (mergeFactor-1)
This includes during optimization.
However, when searching, an IndexReader must keep most files open. In
particular, the maximum number of files an unoptimized, non-compound
IndexReader can have open is:
(5 + numIndexedFields) * (mergeFactor-1) *
(log_base_mergeFactor(numDocs/minMergeDocs))
A compound IndexReader, on the other hand, should open at most:
(mergeFactor-1) * (log_base_mergeFactor(numDocs/minMergeDocs))
An optimized, non-compound IndexReader will open just (5 +
numIndexedFields) files.
And an optimized, compound IndexReader should only keep one file open.
Doug
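Doug's bounds can be checked with plain arithmetic. The sketch below uses hypothetical helper names and plugs in the values from this thread (13 indexed fields, mergeFactor 10, minMergeDocs 1000, 14M documents); the log term counts merge levels:

```java
// Sketch of the open-file bounds described above. Helper names are
// hypothetical; inputs are the values discussed in this thread.
public class OpenFileBounds {

    // Writer bound: 4 + (5 + numIndexedFields) * (mergeFactor - 1)
    public static long writerMax(int fields, int mergeFactor) {
        return 4 + (5 + fields) * (mergeFactor - 1);
    }

    // Unoptimized, non-compound reader bound:
    // (5 + numIndexedFields) * (mergeFactor - 1) * log_mf(numDocs / minMergeDocs)
    public static long readerMaxNonCompound(int fields, int mergeFactor,
                                            long numDocs, int minMergeDocs) {
        long levels = (long) (Math.log((double) numDocs / minMergeDocs)
                              / Math.log(mergeFactor));
        return (long) (5 + fields) * (mergeFactor - 1) * levels;
    }

    public static void main(String[] args) {
        System.out.println(writerMax(13, 10));                               // 166
        System.out.println(readerMaxNonCompound(13, 10, 14_000_000L, 1000)); // 648
    }
}
```

With the thread's 13 fields, the writer never needs more than a couple hundred descriptors at once, which is why an index with 230k files signals that something else is wrong.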
Re: Way to repair an index broken during 1/2 optimize?
Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:
>
> Something sounds very wrong for there to be that many files.
>
> The maximum number of files should be around:
>
> (7 + numIndexedFields) * (mergeFactor-1) *
> (log_base_mergeFactor(numDocs/minMergeDocs))
>
> With 14M documents, log_10(14M/1000) is 4, which gives, for you:
>
> (7 + numIndexedFields) * 36 = 230k
> 7*36 + numIndexedFields*36 = 230k
> numIndexedFields = (230k - 7*36) / 36 =~ 6k
>
> So you'd have to have around 6k unique field names to get 230k files.
> Or something else must be wrong. Are you running on win32, where file
> deletion can be difficult?
>
> With the typical handful of fields, one should never see more than
> hundreds of files.
>
We only have 13 fields... Though to be honest I'm worried that even if I
COULD do the optimize, it would run out of file handles.
This is very strange...
I'm going to increase minMergeDocs to 10000, run the full conversion on
one box, and try an optimize of the corrupt index on another box. See
which one finishes first.
I assume the speed of optimize() can be increased the same way that
indexing speed is...
Kevin
Re: Way to repair an index broken during 1/2 optimize?
Posted by Doug Cutting <cu...@apache.org>.
Kevin A. Burton wrote:
> During an optimize I assume Lucene starts writing to a new segment and
> leaves all others in place until everything is done and THEN deletes them?
That's correct.
> The only settings I use are:
>
> targetIndex.mergeFactor=10;
> targetIndex.minMergeDocs=1000;
>
> the resulting index has 230k files in it :-/
Something sounds very wrong for there to be that many files.
The maximum number of files should be around:
(7 + numIndexedFields) * (mergeFactor-1) *
(log_base_mergeFactor(numDocs/minMergeDocs))
With 14M documents, log_10(14M/1000) is 4, which gives, for you:
(7 + numIndexedFields) * 36 = 230k
7*36 + numIndexedFields*36 = 230k
numIndexedFields = (230k - 7*36) / 36 =~ 6k
So you'd have to have around 6k unique field names to get 230k files.
Or something else must be wrong. Are you running on win32, where file
deletion can be difficult?
With the typical handful of fields, one should never see more than
hundreds of files.
Doug
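Doug's back-of-envelope can be reproduced in a few lines (hypothetical helper name; it just solves his formula for the field count implied by an observed file count):

```java
// Solve Doug's bound for numIndexedFields, given an observed file count:
// maxFiles ≈ (7 + fields) * (mergeFactor - 1) * log_mf(numDocs / minMergeDocs)
public class FileCountEstimate {

    public static long impliedFields(long observedFiles, int mergeFactor,
                                     long numDocs, int minMergeDocs) {
        long levels = (long) (Math.log((double) numDocs / minMergeDocs)
                              / Math.log(mergeFactor));       // 4 for 14M / 1000
        long perField = (long) (mergeFactor - 1) * levels;    // 36
        return observedFiles / perField - 7;
    }

    public static void main(String[] args) {
        // ~6k unique fields would be needed to legitimately explain 230k files.
        System.out.println(impliedFields(230_000L, 10, 14_000_000L, 1000)); // 6381
    }
}
```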
Re: Way to repair an index broken during 1/2 optimize?
Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:
> Kevin A. Burton wrote:
>
>> So is it possible to fix this index now? Can I just delete the most
>> recent segment that was created? I can find this by ls -alt
>
>
> Sorry, I forgot to answer your question: this should work fine. I
> don't think you should even have to delete that segment.
I'm worried about duplicate or missing content from the original index.
I'd rather rebuild the index and waste another 6 hours (I've probably
blown 100 hours of CPU time on this already) and have a correct index :)
During an optimize I assume Lucene starts writing to a new segment and
leaves all others in place until everything is done and THEN deletes them?
> Also, to elaborate on my previous comment, a mergeFactor of 5000 not
> only delays the work until the end, but it also makes the disk
> workload more seek-dominated, which is not optimal.
The only settings I use are:
targetIndex.mergeFactor=10;
targetIndex.minMergeDocs=1000;
the resulting index has 230k files in it :-/
I assume this is contributing to all the disk seeks.
> So I suspect a smaller merge factor, together with a larger
> minMergeDocs, will be much faster overall, including the final
> optimize(). Please tell us how it goes.
>
This is what I did for this last round but then I ended up with the
highly fragmented index.
hm...
Thanks for all the help btw!
Kevin
Re: Way to repair an index broken during 1/2 optimize?
Posted by Doug Cutting <cu...@apache.org>.
Kevin A. Burton wrote:
> So is it possible to fix this index now? Can I just delete the most
> recent segment that was created? I can find this by ls -alt
Sorry, I forgot to answer your question: this should work fine. I don't
think you should even have to delete that segment.
Also, to elaborate on my previous comment, a mergeFactor of 5000 not
only delays the work until the end, but it also makes the disk workload
more seek-dominated, which is not optimal. So I suspect a smaller merge
factor, together with a larger minMergeDocs, will be much faster
overall, including the final optimize(). Please tell us how it goes.
Doug
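As a reference point, here is a minimal sketch of the tuning Doug suggests, written against the Lucene 1.x-era API in which mergeFactor and minMergeDocs were public fields on IndexWriter (the index path and analyzer choice are illustrative, and this is a config sketch that won't run without Lucene on the classpath):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TuneIndex {
    public static void main(String[] args) throws IOException {
        // Lucene 1.x sketch: mergeFactor and minMergeDocs were public fields.
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), true);
        writer.mergeFactor = 10;      // small merge factor: steady, sequential merges
        writer.minMergeDocs = 10000;  // buffer more documents in RAM per segment
        // ... writer.addDocument(doc) for each document ...
        writer.optimize();            // final merge down to a single segment
        writer.close();
    }
}
```

The idea is to pay merge costs incrementally and in larger RAM-buffered batches, rather than deferring nearly all merge work to the final optimize().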
Re: Way to repair an index broken during 1/2 optimize?
Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Peter M Cipollone wrote:
>You might try merging the existing index into a new index located on a ram
>disk. Once it is done, you can move the directory from ram disk back to
>your hard disk. I think this will work as long as the old index did not
>finish merging. You might do a "strings" command on the segments file to
>make sure the new (merged) segment is not in there, and if there's a
>"deletable" file, make sure there are no segments from the old index listed
>therein.
>
>
It's a HUGE index. It won't fit in memory ;) Right now it's at 8G...
Thanks though! :)
Kevin
Re: Way to repair an index broken during 1/2 optimize?
Posted by Peter M Cipollone <lu...@bihvhar.com>.
You might try merging the existing index into a new index located on a ram
disk. Once it is done, you can move the directory from ram disk back to
your hard disk. I think this will work as long as the old index did not
finish merging. You might do a "strings" command on the segments file to
make sure the new (merged) segment is not in there, and if there's a
"deletable" file, make sure there are no segments from the old index listed
therein.
----- Original Message -----
From: "Kevin A. Burton" <bu...@newsmonster.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, July 08, 2004 2:02 PM
Subject: Way to repair an index broken during 1/2 optimize?
Re: Way to repair an index broken during 1/2 optimize?
Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:
> Kevin A. Burton wrote:
>
>> No... I changed the mergeFactor back to 10 as you suggested.
>
>
> Then I am confused about why it should take so long.
>
> Did you by chance set the IndexWriter.infoStream to something, so that
> it logs merges? If so, it would be interesting to see that output,
> especially the last entry.
>
No, I didn't, actually... If I run it again I'll be sure to do this.
Re: Way to repair an index broken during 1/2 optimize?
Posted by Doug Cutting <cu...@apache.org>.
Kevin A. Burton wrote:
> No... I changed the mergeFactor back to 10 as you suggested.
Then I am confused about why it should take so long.
Did you by chance set the IndexWriter.infoStream to something, so that
it logs merges? If so, it would be interesting to see that output,
especially the last entry.
Doug
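For reference, infoStream in that era was a public PrintStream field on IndexWriter (the thread itself refers to "IndexWriter.infoStream"), so the merge log Doug asks about takes one assignment. A sketch, with illustrative path and analyzer, needing Lucene on the classpath:

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class LogMerges {
    public static void main(String[] args) throws IOException {
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), true);
        // Each merge is logged to this stream; the last entry before a
        // hang shows which merge the optimize was stuck in.
        writer.infoStream = System.err;
        // ... indexing / optimize ...
        writer.close();
    }
}
```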
Re: Way to repair an index broken during 1/2 optimize?
Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:
> Kevin A. Burton wrote:
>
>> Also... what can I do to speed up this optimize? Ideally it wouldn't
>> take 6 hours.
>
>
> Was this the index with the mergeFactor of 5000? If so, that's why
> it's so slow: you've delayed all of the work until the end. Indexing
> on a ramfs will make things faster in general, however, if you have
> enough RAM...
No... I changed the mergeFactor back to 10 as you suggested.
Kevin
Re: Way to repair an index broken during 1/2 optimize?
Posted by Doug Cutting <cu...@apache.org>.
Kevin A. Burton wrote:
> Also... what can I do to speed up this optimize? Ideally it wouldn't
> take 6 hours.
Was this the index with the mergeFactor of 5000? If so, that's why it's
so slow: you've delayed all of the work until the end. Indexing on a
ramfs will make things faster in general, however, if you have enough RAM...
Doug