Posted to dev@lucene.apache.org by Dmitry Serebrennikov <dm...@earthlink.net> on 2003/09/23 01:15:48 UTC
Re: file handle changes
Greetings again.
I've implemented the file handle reduction changes, roughly as proposed
before. Here are the patches for your enjoyment! :)
------------------------------------------
SUMMARY:
The goal of this patch is to drastically reduce the number of file
handles required by Lucene. This is achieved by reducing the number of
files required by a single index segment from N to 1, where N depends on
the number of indexed fields in the segment. Typically, one should see a
drop in the number of file handles by an order of magnitude! It could
even be greater for indexes that contain large numbers of indexed fields.
The best part is that to take advantage of this feature, one simply
needs to call setUseCompoundFiles(true) on an IndexWriter before putting
documents into it. Everything else is automatic!
------------------------------------------
DETAILS:
The proposed implementation adds a new property to the IndexWriter --
get/setUseCompoundFiles(boolean). This property defaults to false, which
is the existing behavior prior to this patch. If the property is set to
true, all segments created by this IndexWriter will be of the "compound
file" format. Compound file segments have only one main file - <id>.cfs.
Document deletions are handled as before -- if documents from this
segment are deleted, a second file named <id>.del is created (I didn't
change this code).
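(Editorial sketch, not taken from the patch: the thread does not show the actual .cfs layout, so the container format below is purely an assumption for illustration. It packs several named sub-files into one blob with a small table of offsets up front, then reads a single sub-file back by name. The class name CompoundFileSketch and the methods pack/read are hypothetical, and a byte array stands in for the single on-disk file. Note that the container is written once and only read afterwards, which matches the goal of not supporting writes to the compound file.)

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not the patch's code and not the real .cfs layout.
final class CompoundFileSketch {

    // Packs named sub-files into one container laid out as:
    // [entry count] [name length, name, offset, length] per entry, then the raw data.
    static byte[] pack(Map<String, byte[]> files) {
        int headerSize = Integer.BYTES; // the entry count
        int dataSize = 0;
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            int nameLength = e.getKey().getBytes(StandardCharsets.UTF_8).length;
            headerSize += Integer.BYTES + nameLength + 2 * Integer.BYTES;
            dataSize += e.getValue().length;
        }
        ByteBuffer buf = ByteBuffer.allocate(headerSize + dataSize);
        buf.putInt(files.size());
        int offset = headerSize; // absolute position of the first sub-file's data
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            buf.putInt(name.length).put(name).putInt(offset).putInt(e.getValue().length);
            offset += e.getValue().length;
        }
        for (byte[] content : files.values()) {
            buf.put(content); // same iteration order as the header entries
        }
        return buf.array();
    }

    // Scans the offset table and returns a copy of the named sub-file's bytes.
    static byte[] read(byte[] compound, String name) {
        ByteBuffer buf = ByteBuffer.wrap(compound);
        int count = buf.getInt();
        for (int i = 0; i < count; i++) {
            byte[] entryName = new byte[buf.getInt()];
            buf.get(entryName);
            int offset = buf.getInt();
            int length = buf.getInt();
            if (name.equals(new String(entryName, StandardCharsets.UTF_8))) {
                return Arrays.copyOfRange(compound, offset, offset + length);
            }
        }
        throw new IllegalArgumentException("no such sub-file: " + name);
    }

    public static void main(String[] args) {
        Map<String, byte[]> segment = new LinkedHashMap<>();
        segment.put("_1.fnm", "field names".getBytes(StandardCharsets.UTF_8));
        segment.put("_1.frq", "frequencies".getBytes(StandardCharsets.UTF_8));
        byte[] cfs = pack(segment); // stands in for the single _1.cfs file
        System.out.println(new String(read(cfs, "_1.frq"), StandardCharsets.UTF_8)); // prints "frequencies"
    }
}
```

With a layout along these lines, a reader needs one open file per segment and can serve each logical sub-file as a slice of it.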
The get/setUseCompoundFiles setting can be changed at any time during
the existence of the IndexWriter and takes effect the next time
the IndexWriter merges segments in its target directory.
SegmentIndexReader can now work with either type of segment.
This change does not affect how the segments are handled in the
temporary RAMDirectory used by the IndexWriter internally, only the
final segments written to the target directory. Also, a given directory
can contain both types of segments and everything works out automagically.
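(Editorial sketch, not taken from the patch: the message does not say how a reader tells the two segment types apart. The code below simply assumes it can be decided from the directory listing by checking for <id>.cfs, and that <id>.del signals deletions in either format. SegmentFormatSketch, detect and hasDeletions are hypothetical names, and the segment ids and multi-file extensions in the example listing are illustrative.)

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of telling segment types apart in a mixed directory.
final class SegmentFormatSketch {

    enum Format { COMPOUND, MULTI_FILE }

    // Assumption: a segment is in compound format exactly when <id>.cfs is present.
    static Format detect(Set<String> listing, String segmentId) {
        return listing.contains(segmentId + ".cfs") ? Format.COMPOUND : Format.MULTI_FILE;
    }

    // Deletions live in a separate <id>.del file regardless of the segment format.
    static boolean hasDeletions(Set<String> listing, String segmentId) {
        return listing.contains(segmentId + ".del");
    }

    public static void main(String[] args) {
        // One compound segment with deletions next to one multi-file segment.
        Set<String> listing = new HashSet<>(
                Arrays.asList("_1.cfs", "_1.del", "_2.fnm", "_2.frq", "_2.prx"));
        for (String id : new String[] {"_1", "_2"}) {
            System.out.println(id + ": " + detect(listing, id)
                    + ", deletions=" + hasDeletions(listing, id));
        }
    }
}
```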
-----------------------------------------
I have also created a new JUnit test case to test these features, which
runs successfully. For the moment it creates files under the current
working directory in which JUnit is executed. I also converted some
of the older tests "XXXTest" into "TestXXX", and made sure they work
with the old implementation and the new one. These tests do not yet make
enough assert(...) calls, but they now execute twice, once with the
multi-file indexes and once with the new compound file indexes, and assert
that the output is the same. The old files are still there; I just added new
ones with the inverted names. In one case - ThreadSafetyTest.java - I
actually made changes to that file because I thought this test was too
long to run as an automatic test in JUnit. Build.xml required a small
change to add a class from the src/demo tree to the classpath.
----------------------------------------
Doug, I've really considered keeping everything at the Directory level,
as you suggested. This would have been preferred, I agree, but I really
couldn't find a way to reconcile this approach with the other two goals
I had: (a) keep specific file extension knowledge at the lucene.index.*
level where it is now, and (b) avoid having to support writes to the
compound file.
----------------------------------------
I'm attaching the patches against the current Lucene CVS source
(basically output of "cvs diff -Buw"). The files listed as "?" are new
files and are also attached.
(BTW, there are currently two failures in the existing JUnit test cases,
but they occur with or without these patches, as has already been noted
by Otis, Doug and Eric).
Finally, I should theoretically have commit access to Lucene's CVS, but
I haven't tried using it yet. If these changes seem ok, I could commit
them myself (provided I can find my password, etc., etc.).
Enjoy.
Dmitry.
Re: file handle changes
Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Doug Cutting wrote:
> Dmitry Serebrennikov wrote:
>
>> Doug, I've really considered keeping everything at the Directory
>> level, as you suggested. This would have been preferred, I agree, but
>> I really couldn't find a way to reconcile this approach with the
>> other two goals I had: (a) keep specific file extension knowledge at
>> the lucene.index.* level where it is now, and (b) avoid having to
>> support writes to the compound file.
>
>
> Sorry I'm a little behind on my Lucene email.
>
> I just posted an alternate design before I saw this message. But now
> I see that your changes layer nicely on the existing Directory API, so
> I don't really have a problem with them. Although perhaps the
> proposal in my previous message still has merit... What do you think?
See my reply to that message. Generally, I think it would require too
much change to the Directory API (and everyone's favorite directory
implementations). Plus I like the way directory is structured now - nice
and clean. And any way you slice it, adding any kind of call to say to a
directory "please consider that you can now combine these files, even
though you may not care since they are really database records for you,
or something", just didn't seem very compelling :). The best I could
come up with was something like setting some files as "read-only" and
letting directory take it from there, but even then we still need to
group related files...
Anyway, I think it all worked out nicely. Even better than I thought it
might - just one call on the IndexWriter does the job, plus IndexWriter
is something that Lucene users have to deal with already.
>
>
> Does CompoundFileReader need to be public?
No, it doesn't. I didn't think about that in the cleanup after the
implementation. Feel free to make it package level.
>
>
> Doug
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>
Re: file handle changes
Posted by Doug Cutting <cu...@lucene.com>.
Dmitry Serebrennikov wrote:
> Doug, I've really considered keeping everything at the Directory level,
> as you suggested. This would have been preferred, I agree, but I really
> couldn't find a way to reconcile this approach with the other two goals
> I had: (a) keep specific file extension knowledge at the lucene.index.*
> level where it is now, and (b) avoid having to support writes to the
> compound file.
Sorry I'm a little behind on my Lucene email.
I just posted an alternate design before I saw this message. But now I
see that your changes layer nicely on the existing Directory API, so I
don't really have a problem with them. Although perhaps the proposal in
my previous message still has merit... What do you think?
Does CompoundFileReader need to be public?
Doug
Re: file handle changes
Posted by Bruce Ritchie <br...@jivesoftware.com>.
Dmitry Serebrennikov wrote:
> Bruce, PA, and possibly others
>
> Thanks for giving the file handle patch a try. I'm very glad that it's
> working for you. I wonder if either one of you has any scripts / data to
> monitor performance of your Lucene instance. If so, I would be very
> curious to know if you have seen any performance impact of this patch.
> Even without scripts, have you noticed anything just informally?
From my casual testing I haven't noticed any differences, good or bad. I do have a little utility I
wrote to stress searching (it's linked to in one of my previous messages on this list), so I'll run a
bit of an informal test to see if there are any noticeable differences in search times.
Regards,
Bruce Ritchie
Re: file handle changes
Posted by petite_abeille <pe...@mac.com>.
Hi Dmitry,
On Tuesday, Sep 23, 2003, at 20:26 Europe/Amsterdam, Dmitry
Serebrennikov wrote:
> Thanks for giving the file handle patch a try. I'm very glad that it's
> working for you. I wonder if either one of you has any scripts / data
> to monitor performance of your Lucene instance. If so, I would be very
> curious to know if you have seen any performance impact of this patch.
> Even without scripts, have you noticed anything just informally?
I haven't run any formal test yet, but I haven't noticed any negative
impact either. In fact, at least in the case of ZOE, your patch should
have a very positive impact on responsiveness as it will allow the app
to keep more indices open concurrently. Will let you know when I get
some real numbers.
Cheers,
PA.
Re: file handle changes
Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Bruce, PA, and possibly others
Thanks for giving the file handle patch a try. I'm very glad that it's
working for you. I wonder if either one of you has any scripts / data to
monitor performance of your Lucene instance. If so, I would be very
curious to know if you have seen any performance impact of this patch.
Even without scripts, have you noticed anything just informally?
From what I know, the patch in itself should have very little effect on
performance. In fact, searching could be slightly slower or slightly
faster, depending on the OS you are running on, the size of your index,
the size of your RAM, and how that OS implements file caching. I'd like
to know what kinds of things people are seeing in the real world with
it. Certainly, if there is a slowdown, I'd like to know! This is mostly
related to searching, but could apply to indexing as well. But for
indexing, there is exciting news: I believe that with this patch, the
merge factor can be turned up quite a bit higher without running out of
file handles. This probably means that the indexing rate can go up by a
factor of two or more!
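(Editorial back-of-the-envelope sketch, not from the thread: a higher merge factor generally leaves more segments on disk before they get merged, so the per-segment file count is what limits it. The code assumes a multi-file segment consists of 7 fixed files plus one file per indexed field. That figure is an assumption about the pre-patch layout, since the thread only says N depends on the number of indexed fields. It counts files rather than handles, which is an upper bound because a reader may not keep every file open at once, and it ignores .del files. FileHandleEstimate and its methods are hypothetical names.)

```java
// Hypothetical estimate of how many index files stand behind a set of segments.
final class FileHandleEstimate {

    // Assumed number of per-segment files that do not depend on the fields.
    static final int FIXED_FILES_PER_SEGMENT = 7;

    // One file for a compound segment; otherwise fixed files plus one per indexed field.
    static int filesPerSegment(int indexedFields, boolean compound) {
        return compound ? 1 : FIXED_FILES_PER_SEGMENT + indexedFields;
    }

    // Total files across all segments, a rough upper bound on open handles.
    static int filesNeeded(int segments, int indexedFields, boolean compound) {
        return segments * filesPerSegment(indexedFields, compound);
    }

    public static void main(String[] args) {
        int indexedFields = 10;
        for (int segments : new int[] {10, 20, 50}) {
            System.out.println(segments + " segments: multi-file="
                    + filesNeeded(segments, indexedFields, false)
                    + ", compound=" + filesNeeded(segments, indexedFields, true));
        }
        // For 20 segments and 10 indexed fields this prints multi-file=340, compound=20.
    }
}
```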
Thanks very much.
Dmitry.
Re: file handle changes
Posted by Bruce Ritchie <br...@jivesoftware.com>.
Dmitry Serebrennikov wrote:
> I've implemented the file handle reduction changes, roughly as proposed
> before. Here are the patches for your enjoyment! :)
>
> ------------------------------------------
> SUMMARY:
> The goal of this patch is to drastically reduce the number of file
> handles required by Lucene. This is achieved by reducing the number of
> files required by a single index segment from N to 1, where N depends on
> the number of indexed fields in the segment. Typically, one should see a
> drop in the number of file handles by an order of magnitude! It could
> even be greater for indexes that contain large numbers of indexed fields.
>
> The best part is that to take advantage of this feature, one simply
> needs to call setUseCompoundFiles(true) on an IndexWriter before putting
> documents into it. Everything else is automatic!
This is a most welcome addition, many thanks! Quite a few of our customers have run into file
descriptor issues that this patch would solve.
I've patched my local test server with this patch and so far it's working perfectly. I'll be testing
it over the next week to see if I can uncover any issues.
Regards,
Bruce Ritchie
Re: file handle changes
Posted by petite_abeille <pe...@mac.com>.
Hi Dmitry,
On Tuesday, Sep 23, 2003, at 01:15 Europe/Amsterdam, Dmitry
Serebrennikov wrote:
> I've implemented the file handle reduction changes, roughly as
> proposed before. Here are the patches for your enjoyment! :)
Well... I just applied your patches to see how it goes and... I enjoy
it immensely so far :)
Just flipped the switch on my IndexWriter and... all the index files
miraculously melted into one...
Very, very, very nice :)
I haven't done any extensive testing so far, but everything seems to be
working just fine :)
Being one of those poor souls sporadically suffering from "file handles
fever", I would strongly encourage each and every Lucene committer to
give this patch all the attention it deserves. In my opinion, Dmitry's
patch elegantly, but decisively, addresses one of the few shortcomings
of Lucene...
Thanks Dmitry! Very nice work :)
Cheers,
PA.