Posted to dev@lucene.apache.org by Dmitry Serebrennikov <dm...@earthlink.net> on 2003/09/23 01:15:48 UTC

Re: file handle changes

Greetings again.

I've implemented the file handle reduction changes, roughly as proposed 
before. Here are the patches for your enjoyment! :)

------------------------------------------
SUMMARY:
The goal of this patch is to drastically reduce the number of file 
handles required by Lucene. This is achieved by reducing the number of 
files required by a single index segment from N to 1, where N depends on 
the number of indexed fields in the segment. Typically, one should see a 
drop in the number of file handles by an order of magnitude! It could 
even be greater for indexes that contain large numbers of indexed fields.
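
A rough way to see the N-to-1 effect is to count files per segment. The 
sketch below assumes that a multi-file segment keeps a fixed set of core 
files plus one norms file per indexed field. The constant 7 and the class 
and method names are illustrative assumptions, not values taken from the 
patch.

```java
// Back-of-the-envelope estimate of open files, before and after the patch.
final class FileHandleEstimate {

    // Assumed number of core files per multi-file segment (field infos,
    // stored fields, term dictionary, postings). Adjust for your version.
    static final int CORE_FILES_PER_SEGMENT = 7;

    // Multi-file format: core files plus one norms file per indexed field.
    static int multiFileHandles(int segments, int indexedFields) {
        return segments * (CORE_FILES_PER_SEGMENT + indexedFields);
    }

    // Compound format: one .cfs file per segment (ignoring any .del files).
    static int compoundFileHandles(int segments) {
        return segments;
    }

    public static void main(String[] args) {
        int segments = 10;
        int indexedFields = 13;
        System.out.println("multi-file: " + multiFileHandles(segments, indexedFields));
        System.out.println("compound:   " + compoundFileHandles(segments));
    }
}
```

Under these assumptions the saving grows with the number of indexed 
fields, which is why field-heavy indexes should benefit the most.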

The best part is that to take advantage of this feature, one simply 
needs to call setUseCompoundFiles(true) on an IndexWriter before putting 
documents into it. Everything else is automatic!

------------------------------------------
DETAILS:
The proposed implementation adds a new property to the IndexWriter -- 
get/setUseCompoundFiles(boolean). This property defaults to false, which 
is the existing behavior prior to this patch. If the property is set to 
true, all segments created by this IndexWriter will be of the "compound 
file" format. Compound file segments have only one main file - <id>.cfs. 
Document deletions are handled as before -- if documents from this 
segment are deleted, a second file named <id>.del is created (I didn't 
change this code).
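
To illustrate the idea of folding a segment's files into a single file, 
here is a minimal in-memory sketch. The layout (an entry table followed 
by the raw data) and the names CompoundFileSketch, pack and read are 
hypothetical. The actual <id>.cfs format in the patch may well differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; NOT the .cfs layout used by the patch.
final class CompoundFileSketch {

    // Writes the entry count, then (name, length) per entry, then each
    // entry's raw bytes in the same order. Written once, never updated.
    static byte[] pack(Map<String, byte[]> files) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(files.size());
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue().length);
            }
            for (byte[] data : files.values()) {
                out.write(data);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the entry table, then slices out the requested sub-file.
    static byte[] read(byte[] compound, String name) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(compound));
            int count = in.readInt();
            String[] names = new String[count];
            int[] lengths = new int[count];
            for (int i = 0; i < count; i++) {
                names[i] = in.readUTF();
                lengths[i] = in.readInt();
            }
            // The data region starts where the entry table ends.
            int offset = compound.length - in.available();
            for (int i = 0; i < count; i++) {
                if (names[i].equals(name)) {
                    return Arrays.copyOfRange(compound, offset, offset + lengths[i]);
                }
                offset += lengths[i];
            }
            throw new IllegalArgumentException("no such sub-file: " + name);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("_1.fnm", new byte[] {1, 2, 3});
        files.put("_1.tis", new byte[] {4, 5});
        byte[] cfs = pack(files);
        System.out.println(Arrays.toString(read(cfs, "_1.tis")));
    }
}
```

Note that the sketch only ever writes the combined file once and then 
reads from it, so nothing needs to support in-place writes to it.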

The get/setUseCompoundFiles setting can be changed at any time during 
the existence of the IndexWriter and takes effect the next time 
the IndexWriter merges segments in its target directory. 
SegmentIndexReader can now work with either type of segment.

This change does not affect how the segments are handled in the 
temporary RAMDirectory used by the IndexWriter internally, only the 
final segments written to the target directory. Also, a given directory 
can contain both types of segments and everything works out automagically.
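
One simple way a reader could tell the two segment types apart in a mixed 
directory is to look for the <id>.cfs file. The sketch below shows that 
rule. It is an assumption made for illustration (as are the names 
SegmentFormatSketch and detect), not a description of how the patched 
reader actually decides.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of picking a segment format from a listing.
final class SegmentFormatSketch {

    enum Format { COMPOUND, MULTI_FILE }

    // Assumed rule: a segment is compound if "<id>.cfs" is present.
    static Format detect(Set<String> directoryListing, String segmentId) {
        return directoryListing.contains(segmentId + ".cfs")
                ? Format.COMPOUND
                : Format.MULTI_FILE;
    }

    public static void main(String[] args) {
        // A directory holding one compound and one multi-file segment.
        Set<String> listing = new HashSet<>(
                Arrays.asList("_1.cfs", "_2.fnm", "_2.tis", "_2.frq"));
        System.out.println("_1: " + detect(listing, "_1"));
        System.out.println("_2: " + detect(listing, "_2"));
    }
}
```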

-----------------------------------------
I have also created a new JUnit test case to test these features, which 
runs successfully. For the moment it creates files under the current 
working directory in which JUnit is executed. I also converted some 
of the older tests "XXXTest" into "TestXXX", and made sure they work 
with the old implementation and the new one. These tests do not yet do 
enough assert(...) calls, but they now execute twice: with the 
multi-file indexes and the new compound file indexes, and assert that 
the output is the same. The old files are still there; I just added new 
ones with the inverted names. In one case - ThreadSafetyTest.java - I 
actually made changes to that file because I thought this test was too 
long to run as an automatic test in JUnit. Build.xml required a small 
change to add a class from the src/demo tree to the classpath.

----------------------------------------
Doug, I've really considered keeping everything at the Directory level, 
as you suggested. This would have been preferred, I agree, but I really 
couldn't find a way to reconcile this approach with the other two goals 
I had: (a) keep specific file extension knowledge at the lucene.index.* 
level where it is now, and (b) avoid having to support writes to the 
compound file.

----------------------------------------
I'm attaching the patches against the current Lucene CVS source 
(basically output of "cvs diff -Buw"). The files listed as "?" are new 
files and are also attached.

(BTW, there are currently two failures in the existing JUnit test cases, 
but they occur with or without these patches, as has already been noted 
by Otis, Doug and Eric).

Finally, I should theoretically have commit access to Lucene's CVS, but 
I've never tried using it yet. If these changes seem ok, I could commit 
them myself (provided I can find my password, etc., etc.).

Enjoy.
Dmitry.


Re: file handle changes

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Doug Cutting wrote:

> Dmitry Serebrennikov wrote:
>
>> Doug, I've really considered keeping everything at the Directory 
>> level, as you suggested. This would have been preferred, I agree, but 
>> I really couldn't find a way to reconcile this approach with the 
>> other two goals I had: (a) keep specific file extension knowledge at 
>> the lucene.index.* level where it is now, and (b) avoid having to 
>> support writes to the compound file.
>
>
> Sorry I'm a little behind on my Lucene email.
>
> I just posted an alternate design before I saw this message.  But now 
> I see that your changes layer nicely on the existing Directory API, so 
> I don't really have a problem with them.  Although perhaps the 
> proposal in my previous message still has merit... What do you think? 

See my reply to that message. Generally, I think it would require too 
much change to the Directory API (and everyone's favorite directory 
implementations). Plus I like the way directory is structured now - nice 
and clean. And any way you slice it, adding any kind of call to say to a 
directory "please consider that you can now combine these files, even 
though you may not care since they are really database records for you, 
or something", just didn't seem very compelling :). The best I could 
come up with was something like marking some files as "read-only" and 
letting directory take it from there, but even then we still need to 
group related files...

Anyway, I think it all worked out nicely. Even better than I thought it 
might - just one call on the IndexWriter does the job, plus IndexWriter 
is something that Lucene users have to deal with already.

>
>
> Does CompoundFileReader need to be public? 

No, it doesn't. I didn't think about that in the cleanup after the 
implementation. Feel free to make it package level.

>
>
> Doug
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>



Re: file handle changes

Posted by Doug Cutting <cu...@lucene.com>.
Dmitry Serebrennikov wrote:
> Doug, I've really considered keeping everything at the Directory level, 
> as you suggested. This would have been preferred, I agree, but I really 
> couldn't find a way to reconcile this approach with the other two goals 
> I had: (a) keep specific file extension knowledge at the lucene.index.* 
> level where it is now, and (b) avoid having to support writes to the 
> compound file.

Sorry I'm a little behind on my Lucene email.

I just posted an alternate design before I saw this message.  But now I 
see that your changes layer nicely on the existing Directory API, so I 
don't really have a problem with them.  Although perhaps the proposal in 
my previous message still has merit... What do you think?

Does CompoundFileReader need to be public?

Doug


Re: file handle changes

Posted by Bruce Ritchie <br...@jivesoftware.com>.
Dmitry Serebrennikov wrote:

> Bruce, PA, and possibly others
> 
> Thanks for giving the file handle patch a try. I'm very glad that it's 
> working for you. I wonder if either one of you has any scripts / data to 
> monitor performance of your Lucene instance. If so, I would be very 
> curious to know if you have seen any performance impact of this patch. 
> Even without scripts, have you noticed anything just informally?

From my casual testing I haven't noticed any differences, good or bad. I do have a little utility I 
wrote to stress searching (it's linked to in one of my previous messages on this list), so I'll run a 
bit of an informal test to see if there are any noticeable differences in search times.


Regards,

Bruce Ritchie

Re: file handle changes

Posted by petite_abeille <pe...@mac.com>.
Hi Dmitry,

On Tuesday, Sep 23, 2003, at 20:26 Europe/Amsterdam, Dmitry 
Serebrennikov wrote:

> Thanks for giving the file handle patch a try. I'm very glad that it's 
> working for you. I wonder if either one of you has any scripts / data 
> to monitor performance of your Lucene instance. If so, I would be very 
> curious to know if you have seen any performance impact of this patch. 
> Even without scripts, have you noticed anything just informally?

I haven't run any formal test yet, but I haven't noticed any negative 
impact either. In fact, at least in the case of ZOE, your patch should 
have a very positive impact on responsiveness as it will allow the app 
to keep more indices open concurrently. Will let you know when I get 
some real numbers.

Cheers,

PA.


Re: file handle changes

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Bruce, PA, and possibly others

Thanks for giving the file handle patch a try. I'm very glad that it's 
working for you. I wonder if either one of you has any scripts / data to 
monitor performance of your Lucene instance. If so, I would be very 
curious to know if you have seen any performance impact of this patch. 
Even without scripts, have you noticed anything just informally?

From what I know, the patch in itself should have very little effect on 
performance. In fact, searching could be slightly slower or slightly 
faster, depending on the OS you are running on, the size of your index, 
the size of your RAM, and how that OS implements file caching. I'd like 
to know what kinds of things people are seeing in the real world with 
it. Certainly, if there is a slowdown, I'd like to know! This is mostly 
related to searching, but could apply to indexing as well. But for 
indexing, there is exciting news: I believe that with this patch, the 
merge factor can be turned up quite a bit higher without running out of 
file handles. This probably means that the indexing rate can go up by a 
factor of two or more!

Thanks very much.
Dmitry.



Re: file handle changes

Posted by Bruce Ritchie <br...@jivesoftware.com>.
Dmitry Serebrennikov wrote:

> I've implemented the file handle reduction changes, roughly as proposed 
> before. Here are the patches for your enjoyment! :)
> 
> ------------------------------------------
> SUMMARY:
> The goal of this patch is to drastically reduce the number of file 
> handles required by Lucene. This is achieved by reducing the number of 
> files required by a single index segment from N to 1, where N depends on 
> the number of indexed fields in the segment. Typically, one should see a 
> drop in the number of file handles by an order of magnitude! It could 
> even be greater for indexes that contain large numbers of indexed fields.
> 
> The best part is that to take advantage of this feature, one simply 
> needs to call setUseCompoundFiles(true) on an IndexWriter before putting 
> documents into it. Everything else is automatic!

This is a most welcome addition, many thanks! Quite a few of our customers have run into file 
descriptor issues that this patch would solve.

I've patched my local test server with this patch and so far it's working perfectly. I'll be testing 
it over the next week to see if I can uncover any issues.


Regards,

Bruce Ritchie

Re: file handle changes

Posted by petite_abeille <pe...@mac.com>.
Hi Dmitry,

On Tuesday, Sep 23, 2003, at 01:15 Europe/Amsterdam, Dmitry 
Serebrennikov wrote:

> I've implemented the file handle reduction changes, roughly as 
> proposed before. Here are the patches for your enjoyment! :)

Well... I just applied your patches to see how it goes and... I enjoy 
it immensely so far :)

Just flipped the switch on my IndexWriter and... all the indices files 
miraculously melted to one...

Very, very, very nice :)

I haven't done any extensive testing so far, but everything seems to be 
working just fine :)

Being one of those poor souls sporadically suffering from "file handles 
fever", I would strongly encourage each and every Lucene committer to 
give this patch all the attention it deserves. In my opinion, Dmitry's 
patch elegantly, but decisively, addresses one of the few shortcomings 
of Lucene...

Thanks Dmitry! Very nice work :)

Cheers,

PA.


