Posted to dev@lucene.apache.org by Michael Busch <bu...@gmail.com> on 2007/05/17 15:17:30 UTC

IndexWriter shutdown

Hi,

If you run Lucene as a service, you want to be able to shut it down within a 
certain period of time (usually 1-2 mins). This can be a problem if the 
IndexWriter is in the middle of a merge when the service shutdown 
request is received.

Therefore it would be nice if we had a method in IndexWriter called, e.g., 
shutdown(), which satisfies the following two requirements:
- if a merge is happening, abort it
- flush the buffered docs but do not trigger a merge

The latter is easy: we just need a flush method that does not trigger a 
merge. That's a two line change in IndexWriter.

The former is more complex. The first way of implementing this that came 
to my mind was to add checks to the different merge loops, like "only 
continue if shutdown hasn't been called yet". The obvious drawback of 
this approach is a performance impact and the need to make code changes 
in different places: merging fields, merging postings, merging 
termvectors, writing compound files. So I think this is a quite ugly 
approach.

The approach I implemented is sort of a hack, but I'd like to describe 
it briefly here. I extended the FSDirectory and FSIndexOutput:

    public static class ExtendedFSDirectory extends FSDirectory {
        // volatile: interrupt() is expected to be called from another thread
        private volatile boolean interrupted = false;
        
        public void interrupt() {
            this.interrupted = true;
        }
        
        public void clearInterrupt() {
            this.interrupted = false;
        }
        
        public IndexOutput createOutput(String name) throws IOException {
            File file = new File(getFile(), name);
            // delete existing, if any
            if (file.exists() && !file.delete())
              throw new IOException("Cannot overwrite: " + file);

            return new FSIndexOutput(file) {
                public void flushBuffer(byte[] b, int offset, int size)
                        throws IOException {
                    if (ExtendedFSDirectory.this.interrupted) {
                        throw new IndexWriterInterruptException();
                    }
                    
                    super.flushBuffer(b, offset, size);
                }

            };
        }
    }
    
    // This exception is used to signal an interrupt request    
    static final class IndexWriterInterruptException extends IOException {
        private static final long serialVersionUID = 1L;
    }

So now FSIndexOutput.flushBuffer() throws an 
IndexWriterInterruptException if interrupt() has been called. This 
causes the IndexWriter to abort the merge and to roll back the transaction.

I have another class that extends IndexWriter and overrides the 
addDocument() and updateDocument() methods. In these methods I catch the 
IndexWriterInterruptException. If it is thrown, 
IndexWriter.flushRamSegments(boolean triggerMerge) is called with 
triggerMerge=false.
An advantage of this implementation is that almost all changes can be 
made on top of Lucene. The only core change is the protected method 
flushRamSegments(boolean triggerMerge) in IndexWriter.
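
A minimal, self-contained sketch of the control flow described above. It uses
toy stand-ins rather than the real Lucene classes: ToyWriter, InterruptSignal
and the maxBuffered threshold are hypothetical names, and the simulated merge
only checks the flag instead of writing anything. It is meant to show the
shape of the catch-and-flush logic, not the actual patch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for IndexWriterInterruptException.
class InterruptSignal extends IOException {
    private static final long serialVersionUID = 1L;
}

// Toy model of a writer; this is NOT Lucene's IndexWriter.
class ToyWriter {
    private final int maxBuffered;
    private final List<String> buffered = new ArrayList<>();
    private final List<String> flushed = new ArrayList<>();
    private volatile boolean interrupted = false;
    private int mergesCompleted = 0;

    ToyWriter(int maxBuffered) {
        this.maxBuffered = maxBuffered;
    }

    void interrupt() {
        interrupted = true;
    }

    // Models flushRamSegments(boolean triggerMerge): move the buffered docs
    // to "disk", then optionally start a merge.
    void flushBuffered(boolean triggerMerge) throws IOException {
        flushed.addAll(buffered);
        buffered.clear();
        if (triggerMerge) {
            merge();
        }
    }

    // Simulated merge: each step stands for one buffer write that checks the flag.
    private void merge() throws IOException {
        for (int step = 0; step < 3; step++) {
            if (interrupted) {
                throw new InterruptSignal();
            }
        }
        mergesCompleted++;
    }

    // Returns false once an interrupt has aborted a merge.
    boolean addDocument(String doc) throws IOException {
        buffered.add(doc);
        if (buffered.size() < maxBuffered) {
            return true;
        }
        try {
            flushBuffered(true);
            return true;
        } catch (InterruptSignal e) {
            // Merge aborted: make sure nothing stays in RAM, but do not merge again.
            flushBuffered(false);
            return false;
        }
    }

    int flushedCount() { return flushed.size(); }
    int bufferedCount() { return buffered.size(); }
    int mergesCompleted() { return mergesCompleted; }
}
```

In this toy model the flush itself never checks the flag, so the buffered
documents always reach "disk"; only the merge step is interruptible.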

My question is whether people think that the shutdown feature is something 
we would like to add to the Lucene core. If yes, I can go ahead and attach 
my code to a JIRA issue; if no, I'd like to make the small change to 
IndexWriter (add the protected method flushRamSegments(triggerMerge)). 
My approach seems to work quite well, but maybe others (e.g. the 
IndexWriter "experts") have different/better ideas how to implement it.

Thanks,
Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: IndexWriter shutdown

Posted by Michael Busch <bu...@gmail.com>.
Steven Parkes wrote:
>> I'm not certain, but would parts of your goal be achieved by the work
>> I've seen floating around Jira to refactor the MergePolicy so that it
>> can be handled by multiple threads?
>
> Well, in what I've been working on for LUCENE-847 (merge policy
> factoring) and LUCENE-870 (concurrent merge policy), what Michael's
> talking about really wouldn't be affected.
>
> The way I envision factoring the merge policy, the policy doesn't get
> involved in the actual merge itself. It simply defines what merges will
> occur. (This makes the merge policy variants very clean and gets them
> out of the segment merging which is a bit tricky.) So since Michael is
> asking for a way to abort an in-flight merge, the merge policy really
> doesn't get involved. 

Exactly. The merge policy decides *when* to merge. For the shutdown 
feature however we want to be able to stop an ongoing merge.

> (Well, it does a little: the merge policy will in
> general generate from the abstract merge or optimize request, a sequence
> of individual merges, each generating a new segment, so it could check
> between individual merge operations. However, since a single merge
> operation of large segments can take a long time, this isn't sufficient
> to bound the time.)
>   

Yes, we could do this already with the current merge policy in 
IndexWriter, but you are right, a single merge operation can already 
take too long.

> I thought about this when the commit/rollback stuff got added to
> IndexWriter. At that point, all it would take to get an immediate abort
> would be to convince the bottom writer to throw an I/O Exception, which
> it looks like is effectively what Michael is talking about, at least for
> the FSDirectory case.
>
> So my thoughts:
>
> I think something like what Michael has suggested is a good idea, but I
> would be in favor of putting it in the core, rather than making it a
> derived thing for a single Directory implementation. Seems to me like
> it's a pretty small code change for a very nice thing to have. Doesn't
> seem to add much complexity.
>   

Okay, it seems that this is a desired feature, so I will go ahead and 
open a Jira issue. I will attach the code that I have so far, even 
though it extends IndexWriter and FSDirectory and lacks test cases.

> As to what happens in the middle of a merge or optimize: I think it
> might depend on the autoCommit flag. 

In either case we have to ensure that the buffered docs get flushed to disk.

> Since an optimize may be done in
> stages, whether the intermediary stages are kept or not is going to
> depend on when the segments file gets updated (and I haven't checked the
> current status of this.) I can see it either way: keeping partial work
> (to resume) or throwing everything away on a shutdown.
>
>   
Good idea. I'm not too familiar with the new autoCommit code yet. I 
implemented the shutdown code before autoCommit was added. Will look 
into that...



RE: IndexWriter shutdown

Posted by Steven Parkes <st...@esseff.org>.
> Or instead of "shutdown" it's more of a "interrupt the
> merge if it's in progress" which then doesn't prevent further IO?

At a high level, this would seem like the most valuable approach. But I
think we would want to distinguish between writing new documents and
merges of existing segments. The way things stand, that means that the
merges from the ram segments onto disk should not be interruptible,
though disk to disk merges should be.

It does sound a little more complicated, but it might still not be too
messy. Especially given that, the way things are looking, the merge
from ram segments to disk is going to go away/be handled differently
than disk to disk anyway.
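
One way to picture this distinction, as a minimal sketch under assumed names
(WritePurpose, GuardedOutput and MergeInterruptedException are hypothetical
and not part of Lucene): each output is tagged with the reason it was opened,
and only outputs opened for disk-to-disk merges honour the interrupt flag.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Why an output was opened (hypothetical classification).
enum WritePurpose { RAM_FLUSH, DISK_MERGE }

// Hypothetical exception signalling an aborted disk-to-disk merge.
class MergeInterruptedException extends IOException {
    private static final long serialVersionUID = 1L;
}

// Toy output that writes to memory; a real one would write to a file.
class GuardedOutput {
    private final WritePurpose purpose;
    private final AtomicBoolean interruptFlag;
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();

    GuardedOutput(WritePurpose purpose, AtomicBoolean interruptFlag) {
        this.purpose = purpose;
        this.interruptFlag = interruptFlag;
    }

    void write(byte[] data) throws IOException {
        // Only merges of existing segments may be aborted; flushing new
        // documents from RAM always proceeds.
        if (purpose == WritePurpose.DISK_MERGE && interruptFlag.get()) {
            throw new MergeInterruptedException();
        }
        sink.write(data, 0, data.length);
    }

    int size() {
        return sink.size();
    }
}
```

With the flag set, a RAM_FLUSH output still accepts writes while a DISK_MERGE
output throws on its next write.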



RE: IndexWriter shutdown

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Doron Cohen" <DO...@il.ibm.com> wrote:
>
> "Michael McCandless" wrote:
> 
> > That's correct.
> >
> > On seeing the "shutdown in progress" exception, the current "finally"
> > clause in mergeSegments would revert the internal state of the
> > IndexWriter to be consistent, ie, put back the segments that were in
> > the process of being merged into its segmentInfos.  It will also
> > remove any partially created but now unusable newly merged segments
> > files.
> >
> > If the application catches this exception and calls
> > IndexWriter.close(), then the state until just before the aborted
> > merge would be committed to the index.  If instead the application
> > catches the exception and does nothing, then the state of the index
> > reverts back to where it was when this IndexWriter instance was first
> > opened.
> >
> > So the semantics of autoCommit=false will be correctly enforced if any
> > exception (not just this new one) comes up through mergeSegments.
> 
> Great.
> 
> So my comment on Antony's "mini-optimize" scenario was
> partially wrong, because under autoCommit=true (which is
> the default), those sub-merges that completed before shutdown
> are not lost, only the last one, the one that was interrupted.

Right.
 
> Mmmm... I can see how autoCommit=true works fine, because
> anything (auto)committed is already saved, and there
> is no need to write anything more.  But for autoCommit=false
> it is not clear to me how such further call to indexWriter.close()
> by the application can work - because a shutdown state is in
> effect, and any attempt to write/flush anything would just throw
> the same exception again...  or am I missing something?

Ahh, you are correct: the global/static shutdown state would prevent
any further writes, so if the IndexWriter.close() tried to write the
new segments_N, it would hit the same exception.

Maybe this isn't really a big deal?  Ie people who open an IndexWriter
with autoCommit=false should be prepared on shutdown to lose all that
had been done during the lifetime of that writer?  Presumably faced
with this you would just open a new writer exclusively to do the
optimize.  Though for the merging case, which you can't control (just
happens on certain addDocument(...) calls) that's harder because you
could then lose added documents.

Or, maybe, you have a way to "un-shutdown" and you call this before
calling close?  Or instead of "shutdown" it's more of an "interrupt the
merge if it's in progress" which then doesn't prevent further IO?
This is getting somewhat complex...
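
To make the failure mode and the "un-shutdown" idea concrete, here is a
minimal sketch. ToyIndex and its methods are hypothetical and only count
documents; nothing here is Lucene API. It assumes that every write, including
the final commit done by close(), goes through the same interrupt check.

```java
import java.io.IOException;

// Toy model of a writer opened with autoCommit=false; NOT Lucene's IndexWriter.
class ToyIndex {
    private volatile boolean interruptRequested = false;
    private int committedDocs = 0;
    private int pendingDocs = 0;

    void interrupt() { interruptRequested = true; }

    // The "un-shutdown" step: allow IO again before closing.
    void clearInterrupt() { interruptRequested = false; }

    // Buffers documents; nothing is visible in the index until close().
    void addDocuments(int count) { pendingDocs += count; }

    // Every simulated write goes through this check.
    private void checkedWrite() throws IOException {
        if (interruptRequested) {
            throw new IOException("shutdown in progress");
        }
    }

    // Simulated merge: performs one checked write.
    void merge() throws IOException {
        checkedWrite();
    }

    // Models writing the new segments_N file on close.
    void close() throws IOException {
        checkedWrite();
        committedDocs += pendingDocs;
        pendingDocs = 0;
    }

    int committedDocs() { return committedDocs; }
}
```

While the flag is set, close() fails just like the merge did, so the pending
documents stay uncommitted; calling clearInterrupt() first lets close() commit
them.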

Maybe we should leave this out of the core, and instead implement it as
an [external] subclass of FSDirectory, until we can get a better handle
on it?

Mike



RE: IndexWriter shutdown

Posted by Doron Cohen <DO...@il.ibm.com>.
"Michael McCandless" wrote:

> That's correct.
>
> On seeing the "shutdown in progress" exception, the current "finally"
> clause in mergeSegments would revert the internal state of the
> IndexWriter to be consistent, ie, put back the segments that were in
> the process of being merged into its segmentInfos.  It will also
> remove any partially created but now unusable newly merged segments
> files.
>
> If the application catches this exception and calls
> IndexWriter.close(), then the state until just before the aborted
> merge would be committed to the index.  If instead the application
> catches the exception and does nothing, then the state of the index
> reverts back to where it was when this IndexWriter instance was first
> opened.
>
> So the semantics of autoCommit=false will be correctly enforced if any
> exception (not just this new one) comes up through mergeSegments.

Great.

So my comment on Antony's "mini-optimize" scenario was
partially wrong, because under autoCommit=true (which is
the default), those sub-merges that completed before shutdown
are not lost, only the last one, the one that was interrupted.

Mmmm... I can see how autoCommit=true works fine, because
anything (auto)committed is already saved, and there
is no need to write anything more.  But for autoCommit=false
it is not clear to me how such further call to indexWriter.close()
by the application can work - because a shutdown state is in
effect, and any attempt to write/flush anything would just throw
the same exception again...  or am I missing something?




RE: IndexWriter shutdown

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Doron Cohen" <DO...@il.ibm.com> wrote:
> Michael McCandless wrote on 23/May/2007:
> 
> > Actually if autoCommit=true then the only choice is to keep
> > the partial work.  In this mode, optimize indeed goes in
> > "stages" (merging mergeFactor segments at a time) and after
> > each stage it commits a new segments_N file and removes the
> > now-merged segments.  Of course if it's a smallish index
> > (<= mergeFactor segments) then there is only 1
> > stage anyway.
> >
> > If autoCommit=false, then I think the index should rollback
> > to the state when the IndexWriter was opened.  This is the
> > point of autoCommit=false: either all or none of the changes
> > made during the lifetime of the IndexWriter instance
> > "make it" into the index.
> 
> Just to clarify - so this would just "happen by itself", because
> (if autoCommit is implemented as I think), the new Segments_N
> file is written only upon a completed commit. So whenever we
> stop due to shutdown, fully committed merges are already saved,
> interrupted ones are "lost" (as expected). Right?

That's correct.

On seeing the "shutdown in progress" exception, the current "finally"
clause in mergeSegments would revert the internal state of the
IndexWriter to be consistent, ie, put back the segments that were in
the process of being merged into its segmentInfos.  It will also
remove any partially created but now unusable newly merged segments
files.

If the application catches this exception and calls
IndexWriter.close(), then the state until just before the aborted
merge would be committed to the index.  If instead the application
catches the exception and does nothing, then the state of the index
reverts back to where it was when this IndexWriter instance was first
opened.

So the semantics of autoCommit=false will be correctly enforced if any
exception (not just this new one) comes up through mergeSegments.

Mike



RE: IndexWriter shutdown

Posted by Doron Cohen <DO...@il.ibm.com>.
Michael McCandless wrote on 23/May/2007:

> Actually if autoCommit=true then the only choice is to keep
> the partial work.  In this mode, optimize indeed goes in
> "stages" (merging mergeFactor segments at a time) and after
> each stage it commits a new segments_N file and removes the
> now-merged segments.  Of course if it's a smallish index
> (<= mergeFactor segments) then there is only 1
> stage anyway.
>
> If autoCommit=false, then I think the index should rollback
> to the state when the IndexWriter was opened.  This is the
> point of autoCommit=false: either all or none of the changes
> made during the lifetime of the IndexWriter instance
> "make it" into the index.

Just to clarify - so this would just "happen by itself", because
(if autoCommit is implemented as I think), the new Segments_N
file is written only upon a completed commit. So whenever we
stop due to shutdown, fully committed merges are already saved,
interrupted ones are "lost" (as expected). Right?




RE: IndexWriter shutdown

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Steven Parkes" <st...@esseff.org> wrote:

> As to what happens in the middle of a merge or optimize: I think it
> might depend on the autoCommit flag. Since an optimize may be done
> in stages, whether the intermediary stages are kept or not is going
> to depend on when the segments file gets updated (and I haven't
> checked the current status of this.) I can see it either way:
> keeping partial work (to resume) or throwing everything away on a
> shutdown.

Actually if autoCommit=true then the only choice is to keep the
partial work.  In this mode, optimize indeed goes in "stages" (merging
mergeFactor segments at a time) and after each stage it commits a new
segments_N file and removes the now-merged segments.  Of course if
it's a smallish index (<= mergeFactor segments) then there is only 1
stage anyway.

If autoCommit=false, then I think the index should rollback to the
state when the IndexWriter was opened.  This is the point of
autoCommit=false: either all or none of the changes made during the
lifetime of the IndexWriter instance "make it" into the index.
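
The two behaviours can be illustrated with a deliberately simplified model.
StagedOptimize, its run() method and the abortAtStage parameter are
hypothetical; the model merges the first mergeFactor segments at each stage,
which is not necessarily how IndexWriter picks segments, and with
autoCommit=false an abort is treated as the application never calling close().

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a staged optimize; segments are represented by their sizes.
class StagedOptimize {

    // Returns the segment sizes visible in the last committed state.
    // abortAtStage < 0 means the optimize runs to completion.
    static List<Integer> run(List<Integer> segments, int mergeFactor,
                             boolean autoCommit, int abortAtStage) {
        List<Integer> committed = new ArrayList<>(segments);
        List<Integer> working = new ArrayList<>(segments);
        int stage = 0;
        while (working.size() > 1) {
            if (stage == abortAtStage) {
                // Interrupted: the stage in progress is discarded and only
                // what was committed before it survives.
                return committed;
            }
            int n = Math.min(mergeFactor, working.size());
            int merged = 0;
            for (int i = 0; i < n; i++) {
                merged += working.remove(0);
            }
            working.add(merged);
            stage++;
            if (autoCommit) {
                // autoCommit=true: each finished stage writes a new commit.
                committed = new ArrayList<>(working);
            }
        }
        if (!autoCommit) {
            // autoCommit=false: a single commit when the writer is closed.
            committed = new ArrayList<>(working);
        }
        return committed;
    }
}
```

For five equal segments and mergeFactor=3, aborting during the second stage
leaves three segments committed under autoCommit=true and the original five
under autoCommit=false.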

Mike



RE: IndexWriter shutdown

Posted by Doron Cohen <DO...@il.ibm.com>.
Michael Busch wrote on 17/May/2007:

> public static class ExtendedFSDirectory extends FSDirectory {
> ...
> public void flushBuffer(byte[] b, int offset, int size)
>      throws IOException {
>   if (ExtendedFSDirectory.this.interrupted) {
>     throw new IndexWriterInterruptException();
>   }
>
>   super.flushBuffer(b, offset, size);
> }
>

Steven Parkes wrote on 22/May/2007:

> I think something like what Michael has suggested is a good
> idea, but I would be in favor of putting it in the core,
> rather than making it a derived thing for a single Directory
> implementation. Seems to me like it's a pretty small code
> change for a very nice thing to have. Doesn't seem to add
> much complexity.

I was thinking of how this can be made a "core logic". If we
regard the "shutdown()" op/status as system wide, ie static,
perhaps the shutdown feature can be supported only for
implementations of BufferedIndexOutput?

If so, would it work to check for a shutdown request in
two locations only:
  BufferedIndexOutput.writeBytes(byte[],int,int)
  BufferedIndexOutput.flush()

Doron
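
A minimal sketch of what such a two-location check could look like, using toy
classes only: ShutdownState, ShutdownInProgressException and
CheckedBufferedOutput are hypothetical names, and the output writes to memory
rather than extending the real BufferedIndexOutput.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical exception for the "shutdown in progress" condition.
class ShutdownInProgressException extends IOException {
    private static final long serialVersionUID = 1L;
}

// System-wide (static) shutdown state, as suggested above.
final class ShutdownState {
    private static volatile boolean requested = false;

    private ShutdownState() {}

    static void request() { requested = true; }
    static void clear() { requested = false; }

    static void check() throws ShutdownInProgressException {
        if (requested) {
            throw new ShutdownInProgressException();
        }
    }
}

// Toy buffered output with the check in writeBytes() and flush() only.
class CheckedBufferedOutput {
    private final byte[] buffer = new byte[16];
    private int pos = 0;
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();

    void writeBytes(byte[] b, int offset, int length) throws IOException {
        ShutdownState.check();
        for (int i = 0; i < length; i++) {
            if (pos == buffer.length) {
                flush();
            }
            buffer[pos++] = b[offset + i];
        }
    }

    void flush() throws IOException {
        ShutdownState.check();
        sink.write(buffer, 0, pos);
        pos = 0;
    }

    // Number of bytes that have left the buffer.
    int bytesFlushed() {
        return sink.size();
    }
}
```

Because the state is static, one request() call affects every output in the
JVM, which is the property that later makes a final flush or close fail unless
the state is cleared first.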




RE: IndexWriter shutdown

Posted by Steven Parkes <st...@esseff.org>.
> I'm not certain, but would parts of your goal be achieved by the work
> I've seen floating around Jira to refactor the MergePolicy so that it
> can be handled by multiple threads?

Well, in what I've been working on for LUCENE-847 (merge policy
factoring) and LUCENE-870 (concurrent merge policy), what Michael's
talking about really wouldn't be affected.

The way I envision factoring the merge policy, the policy doesn't get
involved in the actual merge itself. It simply defines what merges will
occur. (This makes the merge policy variants very clean and gets them
out of the segment merging which is a bit tricky.) So since Michael is
asking for a way to abort an in-flight merge, the merge policy really
doesn't get involved. (Well, it does a little: the merge policy will in
general generate from the abstract merge or optimize request, a sequence
of individual merges, each generating a new segment, so it could check
between individual merge operations. However, since a single merge
operation of large segments can take a long time, this isn't sufficient
to bound the time.)

I thought about this when the commit/rollback stuff got added to
IndexWriter. At that point, all it would take to get an immediate abort
would be to convince the bottom writer to throw an I/O Exception, which
it looks like is effectively what Michael is talking about, at least for
the FSDirectory case.

So my thoughts:

I think something like what Michael has suggested is a good idea, but I
would be in favor of putting it in the core, rather than making it a
derived thing for a single Directory implementation. Seems to me like
it's a pretty small code change for a very nice thing to have. Doesn't
seem to add much complexity.

As to what happens in the middle of a merge or optimize: I think it
might depend on the autoCommit flag. Since an optimize may be done in
stages, whether the intermediary stages are kept or not is going to
depend on when the segments file gets updated (and I haven't checked the
current status of this.) I can see it either way: keeping partial work
(to resume) or throwing everything away on a shutdown.



Re: IndexWriter shutdown

Posted by Chris Hostetter <ho...@fucit.org>.
I'm not certain, but would parts of your goal be achieved by the work I've
seen floating around Jira to refactor the MergePolicy so that it can be
handled by multiple threads? ... if MergePolicies are refactored into their
own class with ways of indicating when/if/how a merge should be done, then
perhaps there could be an InteruptibleMergePolicy that could abort the
merge when you tickle one of its methods.

: My question is if people think that the shutdown feature is something we
: would like to add to the Lucene core? If yes, I can go ahead and attach




-Hoss




Re: IndexWriter shutdown

Posted by Antony Bowesman <ad...@teamware.com>.
Doron Cohen wrote:
> Antony Bowesman wrote:
>> Another use this may have is that mini-optimize operations could be done
>> at more regular intervals to reduce the time for a full optimize.  I could
>> then schedule mini-optimise to run for a couple of minutes at more frequent
>> intervals.
> 
> This seems to assume the proposed feature allows one to continue an
> interrupted merge at a later time, from where it was stopped. But
> if I understood correctly then the proposed feature does not work
> this way - so all the (uncommitted) work done until shutdown will
> be "lost" - i.e. next merge() would start from scratch.

Yes, it does (wrongly) assume that.  For some reason, I had thought the optimize 
operation was a copy+pack operation, but of course, it's not, so I can see why 
this incremental approach is not possible (or at least non trivial).

Still, the shutdown function would be useful on its own.
Antony






Re: IndexWriter shutdown

Posted by Doron Cohen <DO...@il.ibm.com>.
Antony Bowesman wrote:
> Another use this may have is that mini-optimize operations could be done
> at more regular intervals to reduce the time for a full optimize.  I could
> then schedule mini-optimise to run for a couple of minutes at more frequent
> intervals.

This seems to assume the proposed feature allows one to continue an
interrupted merge at a later time, from where it was stopped. But
if I understood correctly then the proposed feature does not work
this way - so all the (uncommitted) work done until shutdown will
be "lost" - i.e. next merge() would start from scratch.




Re: IndexWriter shutdown

Posted by Antony Bowesman <ad...@teamware.com>.
Michael Busch wrote:
> Hi,
> 
> if you run Lucene as a service you want to be able to shut it down in a 
> certain period of time (usually 1-2 mins). This can be a problem if the 
> IndexWriter is in the middle of a merge when the service shutdown 
> request is received.


> My question is if people think that the shutdown feature is something we 
> would like to add to the Lucene core? If yes, I can go ahead and attach 
> my code to a JIRA issue, if no I'd like to make the small change to 
> IndexWriter (add the protected method flushRamSegments(triggerMerge)). 
> My approach seems to work quite well, but maybe others (e. g. the 
> IndexWriter "experts") have different/better ideas how to implement it.

If these are conditions that also apply during an optimize(), then yes, I would 
vote for this feature.  I have a Lucene based service and optimisation takes 
over an hour for a freshly created 18GB index with 1.3M documents.

Although optimisation can be scheduled to run at whatever time, it could be 
necessary to shut down the service during the optimisation and this presents a 
problem in how to safely interrupt the optimize process.

Another use this may have is that mini-optimize operations could be done at more 
regular intervals to reduce the time for a full optimize.  I could then schedule 
mini-optimise to run for a couple of minutes at more frequent intervals.

Antony

