Posted to user@lucenenet.apache.org by Jonathan Resnick <jr...@gmail.com> on 2017/09/27 12:58:58 UTC

confused about segment merging and commits

Hi,

I am trying to understand how segment merging interacts with commits.

Consider the following timeline of events:

1. IndexWriter is opened on an index.
2. IndexWriter is used to add/update/delete docs, but not yet commit the
changes.
3. Activity in step 2 triggers segment merging on a background thread.
4. Commit() is called on IndexWriter while merging in step 3 has not yet
finished.

Does the Commit() in step 4 block while the segment merge in step 3
finishes?
If not, then when is the segment merge in 3 "committed" to the index? (i.e.
at what point would a new IndexReader see the merged segment file?)
Or does segment merging happen entirely independently of commits?

[More context: we are trying to build a backup system that copies the index
files to a backup server after every commit. Initially I thought it would
be sufficient to just keep track of file add/update/deletes since the
previous commit, but if segment merging is happening concurrently then
perhaps it's not so simple?]

More generally, is there any in-depth documentation available describing
how segment merging interacts with commits (even if it's for the Java
version of Lucene)?  My web searches have not turned up much...

Many thanks,
Jonathan

Re: confused about segment merging and commits

Posted by JA Purcell <9u...@gmail.com>.
Hey Jonathan,

I can only speak from my experience with Lucene.Net, but the XML doc
comment on the Commit method states that it does not wait for any
background merges to finish.  There is a WaitForMerges method on the
IndexWriter if you want to wait for those to finish.  As I understand
it, the segments being merged are still part of the searchable index, so
a reader will still read them in a search.  You merge them together to
reduce the number of files that the reader must parse, which reduces
search time.

If you have access to the Lucene in Action book, section 2.13 goes into
depth on how most of this works.  The XML docs are pretty good too.

I've never used it, but you can also have the IndexWriter write its
verbose logging to a file via the SetInfoStream method so that you can see
exactly what it's doing.

Hope that helps some,
Adam

Re: confused about segment merging and commits

Posted by Jonathan Resnick <jr...@gmail.com>.
Thanks guys for helping me piece this together - much appreciated!

The key piece of information that I was missing is explained in Lucene in
Action 2.13.4: During a commit, Lucene will "Sync all newly created files,
including newly flushed files *as well as any files produced by merges that
have finished since commit was last called* or since the IndexWriter was
opened." and then "Write and sync the next segments_N file." (For
specifics on how segments_N files work:
https://lucene.apache.org/core/3_0_3/fileformats.html#Segments File)

I take this to mean that, by default, segment files generated by a merge
that is in progress at the time of a given commit will show up in the
segments_N file of the subsequent commit?
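
The reading above can be sketched with a toy model. This is only an
illustration of the quoted passage, not Lucene code: ToyWriter, its
methods and the segment names are all hypothetical, and the model assumes
a commit simply records whichever segments are live at the moment it is
called.

```python
class ToyWriter:
    """Toy stand-in for an index writer; not the Lucene API."""

    def __init__(self, segments):
        self.live = list(segments)   # writer's in-memory segment list
        self.generation = 0          # N in the hypothetical segments_N name
        self.commits = {}            # N -> segments recorded by that commit

    def finish_merge(self, inputs, merged):
        # A finished merge replaces its input segments with one new segment.
        self.live = [s for s in self.live if s not in inputs] + [merged]

    def commit(self):
        # A commit records whatever segment list is live at that moment.
        self.generation += 1
        self.commits[self.generation] = tuple(self.live)
        return self.generation


writer = ToyWriter(["_0", "_1"])
first = writer.commit()                    # merge of _0 and _1 still running
writer.finish_merge({"_0", "_1"}, "_2")    # merge finishes after the commit
second = writer.commit()

print(writer.commits[first])   # ('_0', '_1')
print(writer.commits[second])  # ('_2',)
```

In this model a reader opened on the first commit still sees the unmerged
segments, and the merged segment only appears in the second commit.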

Re index writer locks ... My assumption was that we could just leave the
IndexWriter open (hence locking the index) for the duration of the backup,
assuming we don't actually write anything to the index during that period
of time. Based on this discussion, it sounds like we'd also have to call
WaitForMerges() prior to starting a backup to ensure that we aren't backing
up any partially written files.
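
For the "track changes since the previous commit" idea, a minimal sketch
of the bookkeeping might look like the following. It assumes that the
list of files belonging to each commit is already known (how to obtain it
from Lucene.Net is not shown) and that index files are never modified in
place once written, so comparing file names is enough. The plan_backup
function and the file names are made up for illustration.

```python
def plan_backup(previous_files, current_files):
    """Return which files to copy to, and delete from, the backup."""
    previous, current = set(previous_files), set(current_files)
    return {
        # Files in the new commit that the backup does not have yet.
        "copy": sorted(current - previous),
        # Files the new commit no longer references (e.g. merged away).
        "delete": sorted(previous - current),
    }


# Hypothetical file lists for two consecutive commits, where a merge
# replaced segments _0 and _1 with _2 in between.
prev_commit = ["segments_1", "_0.cfs", "_1.cfs"]
curr_commit = ["segments_2", "_2.cfs"]

plan = plan_backup(prev_commit, curr_commit)
print(plan["copy"])    # ['_2.cfs', 'segments_2']
print(plan["delete"])  # ['_0.cfs', '_1.cfs', 'segments_1']
```

Only files named in a commit are considered, so a segment that a merge is
still writing would not be copied under these assumptions.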



RE: confused about segment merging and commits

Posted by Paul Irwin <pi...@feature23.com>.
Thinking about this further, you should probably also consider the lock implications of the index writer. By default, disposing of the writer waits for merges to finish and releases the lock. I don't know whether you should have the index locked when you back up. If anyone has thoughts on this, they would be appreciated, as I'd be interested in learning what others would do here.

Paul

RE: confused about segment merging and commits

Posted by Paul Irwin <pi...@feature23.com>.
Merging does indeed happen in the background and does not block unless you're waiting on an Optimize call or WaitForMerges/Dispose(true). The idea is that the segments being merged have already been committed. Since the merge creates a new, larger segment instead of altering any existing segments, it can happen safely in the background, in parallel with other work. Once the merge is complete, the writer records that the index should use the new segment in place of the old ones, then deletes the old segments. Think of the information about which segments are in the index as an atomic pointer. You do all the hard, slow work of copying data in the background as part of the merge, then once that is complete you do the trivial, fast work of changing the pointer.
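
The "atomic pointer" analogy can be sketched in a few lines. This is a toy model under stated assumptions, not how Lucene is implemented: ToyIndex and the segment names are hypothetical, the slow copy step is only indicated by a comment, and the "pointer" is an immutable tuple that is replaced in one step.

```python
import threading


class ToyIndex:
    """Toy model: the live segment list is replaced, never edited in place."""

    def __init__(self, segments):
        self._lock = threading.Lock()
        self._live = tuple(segments)   # immutable snapshot of live segments

    def snapshot(self):
        # Readers keep whatever tuple was current when they asked.
        return self._live

    def merge(self, to_merge, merged_name):
        # The slow work of copying data into merged_name would happen
        # here, outside the lock, without touching existing segments.
        with self._lock:
            # The fast part: swap in a new tuple naming the merged segment.
            remaining = [s for s in self._live if s not in to_merge]
            self._live = tuple(remaining + [merged_name])


index = ToyIndex(["_0", "_1", "_2"])
before = index.snapshot()

worker = threading.Thread(target=index.merge, args=({"_0", "_1"}, "_3"))
worker.start()
worker.join()   # wait for the background merge to finish

after = index.snapshot()
print(before)  # ('_0', '_1', '_2')
print(after)   # ('_2', '_3')
```

The snapshot taken before the merge is unchanged afterwards, which is why the merge can run in parallel in this model.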

With regard to your backup system, to be on the safe side I would probably call .Commit() followed by .WaitForMerges() if you want the quickest wait, or .Optimize() if you want to "force" it to merge before backing up. Both .WaitForMerges() and .Optimize() end up waiting on merges to finish, but Optimize will also try to merge if needed before waiting. Conventional wisdom, especially post-3.0 in the Java Lucene world, is to not call Optimize because Lucene can make better decisions about that than you can. However, in my experience, Optimize is very useful at the end of large batch index writes (e.g. daily) so that the merge doesn't have to wait until the next batch runs. That helps with consistent day-to-day performance and semi-predictable index run times.

Paul
