Posted to dev@lucene.apache.org by Bernhard Messer <Be...@intrafind.de> on 2004/08/07 14:12:53 UTC

possible SegmentMerger optimization

hi developers,

Maybe there is a small but effective way to optimize the
SegmentMerger class when the compound file option is enabled, which has
been the default since Lucene 1.4.

The current implementation creates and writes the compound index file
every time the merge() method is called. Because I/O operations are
expensive and time-consuming, it would be better to write the compound
index file only when optimizing the index. The change itself wouldn't
be a big deal: add a boolean parameter to
SegmentMerger.merge(boolean finalize). The compound file is created only
if finalize==true and the compound option is enabled. To complete the
implementation, the same parameter could be added to
mergeSegments(int minSegment, boolean finalize) within IndexWriter. When
mergeSegments is called from flushRamSegments() or maybeMergeSegments(),
finalize is set to false. Only when it is called from optimize() is
finalize set to true and the compound file written.
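
The control flow proposed above can be sketched with a toy model. This
is only an illustration under stated assumptions: ToySegmentMerger,
ToyIndexWriter and ProposalDemo are hypothetical stand-ins that count
compound-file writes, not Lucene's classes, and a single merger
instance is reused purely to keep the counting simple.

```java
// Toy model only: these classes are hypothetical stand-ins that count
// compound-file writes. They are not Lucene's SegmentMerger/IndexWriter.
class ToySegmentMerger {
    private final boolean useCompoundFile;
    private int compoundFilesWritten = 0;

    ToySegmentMerger(boolean useCompoundFile) {
        this.useCompoundFile = useCompoundFile;
    }

    // Proposed signature: build the compound file only on the final merge.
    void merge(boolean finalize) {
        // ... the actual merging of segment data would happen here ...
        if (useCompoundFile && finalize) {
            compoundFilesWritten++; // stands in for writing the compound file
        }
    }

    int getCompoundFilesWritten() {
        return compoundFilesWritten;
    }
}

class ToyIndexWriter {
    private final ToySegmentMerger merger;

    ToyIndexWriter(boolean useCompoundFile) {
        this.merger = new ToySegmentMerger(useCompoundFile);
    }

    // minSegment is unused here; it only mirrors the signature in the mail.
    private void mergeSegments(int minSegment, boolean finalize) {
        merger.merge(finalize);
    }

    // Intermediate merges pass finalize == false.
    void flushRamSegments()   { mergeSegments(0, false); }
    void maybeMergeSegments() { mergeSegments(0, false); }

    // Only optimize() passes finalize == true.
    void optimize()           { mergeSegments(0, true); }

    int compoundFilesWritten() {
        return merger.getCompoundFilesWritten();
    }
}

class ProposalDemo {
    public static void main(String[] args) {
        ToyIndexWriter writer = new ToyIndexWriter(true);
        for (int i = 0; i < 5; i++) {
            writer.flushRamSegments();
            writer.maybeMergeSegments();
        }
        writer.optimize();
        System.out.println(writer.compoundFilesWritten()); // prints 1
    }
}
```

In this model ten intermediate merges write no compound file and the
single optimize() call writes one. It also makes the first drawback
visible: if optimize() is never called, the count stays at zero even
though the compound option is enabled.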

The downside is having to explain to developers that, if they do not
optimize the index before closing it, the compound file option has no
effect. The other issue is that we might run into the "too many open
files" problem, which was sometimes reported before the compound option
was introduced.

Both drawbacks could be addressed by making the optimization optional
through IndexWriter, so developers using Lucene could decide for
themselves whether they want the "single compound write" option.

If you would like to see the patch, leave me a note and I'll create it.

best regards
Bernhard


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: possible SegmentMerger optimization

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Bernhard Messer wrote:

> Dmitry,
>
> Yep, you're right. Switching the compound file option off and on would
> do the trick and simulate the behavior I described. I ran some tests
> and found that it works perfectly.

Great! I'm glad that helps with your issue. By the way, I like what you 
did with reducing disk size requirements. That sounds like a great idea!

Thanks for taking this on. :)
Dmitry.





Re: possible SegmentMerger optimization

Posted by Bernhard Messer <Be...@intrafind.de>.
Dmitry,

Yep, you're right. Switching the compound file option off and on would
do the trick and simulate the behavior I described. I ran some tests
and found that it works perfectly. I think we can leave everything as it
is, but maybe we should document it somewhere.

Is there something like a "tips and tricks" section on the Lucene
website?

Bernhard

Dmitry Serebrennikov wrote:

> Bernhard Messer wrote:
>
>> hi developers,
>>
>> Maybe there is a small but effective way to optimize the
>> SegmentMerger class when the compound file option is enabled, which
>> has been the default since Lucene 1.4.
>>
>> The current implementation creates and writes the compound index file
>> every time the merge() method is called. Because I/O operations are
>> expensive and time-consuming, it would be better to write the
>> compound index file only when optimizing the index. The change itself
>> wouldn't be a big deal: add a boolean parameter to
>> SegmentMerger.merge(boolean finalize). The compound file is created
>> only if finalize==true and the compound option is enabled. To
>> complete the implementation, the same parameter could be added to
>> mergeSegments(int minSegment, boolean finalize) within IndexWriter.
>> When mergeSegments is called from flushRamSegments() or
>> maybeMergeSegments(), finalize is set to false. Only when it is
>> called from optimize() is finalize set to true and the compound file
>> written.
>>
>> The downside is having to explain to developers that, if they do not
>> optimize the index before closing it, the compound file option has no
>> effect. The other issue is that we might run into the "too many open
>> files" problem, which was sometimes reported before the compound
>> option was introduced.
>
>
> Yeah, that was kind of the point of compound files: to avoid too many
> open file handles, especially during indexing. I hear you on the
> inefficient use of disk I/O, though.
>
>>
>> Both drawbacks could be addressed by making the optimization optional
>> through IndexWriter, so developers using Lucene could decide for
>> themselves whether they want the "single compound write" option.
>
>
> One could do that today. Just call setUseCompoundFile(false) during
> indexing and setUseCompoundFile(true) before the final optimize.
> Would that do the trick?

> Dmitry.

>
>>
>> If you would like to see the patch, leave me a note and I'll create
>> it.
>>
>> best regards
>> Bernhard


Re: possible SegmentMerger optimization

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Bernhard Messer wrote:

> hi developers,
>
> Maybe there is a small but effective way to optimize the
> SegmentMerger class when the compound file option is enabled, which
> has been the default since Lucene 1.4.
>
> The current implementation creates and writes the compound index file
> every time the merge() method is called. Because I/O operations are
> expensive and time-consuming, it would be better to write the compound
> index file only when optimizing the index. The change itself wouldn't
> be a big deal: add a boolean parameter to
> SegmentMerger.merge(boolean finalize). The compound file is created
> only if finalize==true and the compound option is enabled. To complete
> the implementation, the same parameter could be added to
> mergeSegments(int minSegment, boolean finalize) within IndexWriter.
> When mergeSegments is called from flushRamSegments() or
> maybeMergeSegments(), finalize is set to false. Only when it is called
> from optimize() is finalize set to true and the compound file written.
>
> The downside is having to explain to developers that, if they do not
> optimize the index before closing it, the compound file option has no
> effect. The other issue is that we might run into the "too many open
> files" problem, which was sometimes reported before the compound
> option was introduced.

Yeah, that was kind of the point of compound files: to avoid too many
open file handles, especially during indexing. I hear you on the
inefficient use of disk I/O, though.

>
> Both drawbacks could be addressed by making the optimization optional
> through IndexWriter, so developers using Lucene could decide for
> themselves whether they want the "single compound write" option.

One could do that today. Just call setUseCompoundFile(false) during
indexing and setUseCompoundFile(true) before the final optimize. Would
that do the trick?
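
The effect of toggling the flag can be sketched with a toy model. As
far as I recall, the IndexWriter setter in Lucene 1.4 is
setUseCompoundFile(boolean), and against the real API the call order
would be: setUseCompoundFile(false), the addDocument() calls,
setUseCompoundFile(true), optimize(), close(). The code below does not
use Lucene. ToggleWriter and ToggleDemo are hypothetical stand-ins, and
the assumption that every addDocument() triggers a merge is a
simplification made only for counting.

```java
// Toy model only: ToggleWriter is a hypothetical stand-in, not Lucene's
// IndexWriter. It records how many merges also wrote a compound file.
class ToggleWriter {
    private boolean useCompoundFile = true; // on by default, as in the thread
    private int merges = 0;
    private int compoundFilesWritten = 0;

    void setUseCompoundFile(boolean value) {
        useCompoundFile = value;
    }

    // Each merge consults the flag as it is set at that moment.
    private void mergeSegments() {
        merges++;
        if (useCompoundFile) {
            compoundFilesWritten++; // stands in for writing the compound file
        }
    }

    // Simplification: every added document triggers one merge.
    void addDocument(String doc) { mergeSegments(); }

    void optimize() { mergeSegments(); }

    int merges() { return merges; }

    int compoundFilesWritten() { return compoundFilesWritten; }
}

class ToggleDemo {
    // Returns {merges, compoundFilesWritten} for one indexing run.
    static int[] index(boolean toggle, int docs) {
        ToggleWriter writer = new ToggleWriter();
        if (toggle) {
            writer.setUseCompoundFile(false); // no compound writes while indexing
        }
        for (int i = 0; i < docs; i++) {
            writer.addDocument("doc" + i);
        }
        if (toggle) {
            writer.setUseCompoundFile(true);  // re-enable before the final optimize
        }
        writer.optimize();
        return new int[] { writer.merges(), writer.compoundFilesWritten() };
    }

    public static void main(String[] args) {
        int[] plain = index(false, 10);
        int[] toggled = index(true, 10);
        System.out.println("compound writes without toggle: " + plain[1]);
        System.out.println("compound writes with toggle: " + toggled[1]);
    }
}
```

In this model both runs perform the same number of merges, but the
toggled run writes the compound file once, at the final optimize. The
trade-off raised earlier still applies: while the flag is off, each
segment keeps its separate files, so more file handles may be open
during indexing.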

Dmitry.

>
> If you would like to see the patch, leave me a note and I'll create
> it.
>
> best regards
> Bernhard