Posted to dev@lucenenet.apache.org by Andy Pook <an...@gmail.com> on 2013/04/30 20:56:02 UTC

Segments files

I'm having some trouble understanding when the "segments_*" and
"segments.gen" files are in sync with the segment files.

New segment files will be created when things like RAMBufferSizeMB are
exceeded and when merges happen (i.e. when Flush() is called). All good.
However, the corresponding segments_x and segments.gen files are only
updated when a Commit() is performed.

This means that there are valid segment files on disk that are not
referenced by the current segments_x file. So if the process hosting the
IndexWriter dies (e.g. machine power failure), a large number of segments
will be deleted on restart because they are not referenced via the
segments.gen and segments_x files.
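
To make the behaviour above concrete, here is a toy model of it in Java. It
is only a sketch: ToyIndex, filesOnDisk, committed and recoverAfterCrash are
made-up names for illustration, not Lucene or Lucene.NET APIs, and flush()
and commit() merely stand in for the real Flush() and Commit().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these names are Lucene or Lucene.NET APIs.
class ToyIndex {
    // Segment files physically present on "disk" (sorted for stable output).
    final Set<String> filesOnDisk = new TreeSet<>();
    // Segments listed by the last commit point (stands in for segments_x).
    List<String> committed = new ArrayList<>();
    // Segments known to the live writer, committed or not.
    private final List<String> live = new ArrayList<>();
    private int nextSegment = 0;

    // Flush: writes a new segment file but leaves the commit point alone.
    String flush() {
        String name = "_" + nextSegment++;
        filesOnDisk.add(name);
        live.add(name);
        return name;
    }

    // Commit: publish the writer's current segment list.
    void commit() {
        committed = new ArrayList<>(live);
    }

    // Restart after a crash: the writer's in-memory state is gone, so only
    // the commit point is trusted and unreferenced files are deleted.
    List<String> recoverAfterCrash() {
        List<String> deleted = new ArrayList<>();
        for (String file : new ArrayList<>(filesOnDisk)) {
            if (!committed.contains(file)) {
                filesOnDisk.remove(file);
                deleted.add(file);
            }
        }
        live.clear();
        live.addAll(committed);
        return deleted;
    }
}
```

In this model, flushing once, committing, flushing twice more and then
calling recoverAfterCrash() deletes the two segments flushed after the
commit and keeps the first.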

Looking at the code this seems to be "by design". But my naive perspective
suggests that these should be kept in sync with the actual segments written
to disk.

Flush() will write the segment files but only Commit() will write the
segments.gen and segments_x files.

Can anyone give some background on this? (My Google-fu doesn't seem to be
working today.)

Thanks,
  Andy

Re: Segments files

Posted by Andy Pook <an...@gmail.com>.
Thanks for the perspective.

So what's the recommendation for when to commit? My use case is adding a
stream of docs (approx. 50-200 per min). Conceptually, there are no
transactions, simply adding new docs with a small percentage of updates.

What approaches are typically used: Commit periodically? Sync commits with
merges? Some other heuristic?
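
To illustrate what I mean by "commit periodically", here is a rough sketch
in Java of committing after N docs or T milliseconds, whichever comes first.
CommitPolicy and its methods are hypothetical names, not anything in Lucene
or Lucene.NET, and the thresholds are placeholders rather than recommended
values.

```java
// Hypothetical helper; not part of Lucene or Lucene.NET.
class CommitPolicy {
    private final int maxDocs;        // commit after this many pending docs
    private final long maxAgeMillis;  // or after this long since last commit
    private int pendingDocs = 0;
    private long lastCommitMillis;

    CommitPolicy(int maxDocs, long maxAgeMillis, long nowMillis) {
        this.maxDocs = maxDocs;
        this.maxAgeMillis = maxAgeMillis;
        this.lastCommitMillis = nowMillis;
    }

    // Record one added or updated doc. Returns true when either threshold
    // has been reached and the caller should commit now.
    boolean onDocument(long nowMillis) {
        pendingDocs++;
        return pendingDocs >= maxDocs
                || nowMillis - lastCommitMillis >= maxAgeMillis;
    }

    // Call after the writer's commit has succeeded.
    void onCommitted(long nowMillis) {
        pendingDocs = 0;
        lastCommitMillis = nowMillis;
    }
}
```

The indexing loop would call onDocument() after each add, call Commit() on
the writer when it returns true, and then call onCommitted(). Time is passed
in explicitly so the policy can be exercised without a real clock.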

Any techniques/theories appreciated. Even if they don't fit my scenario.

Cheers,



On 6 May 2013 03:16, Christopher Currens <cu...@gmail.com> wrote:

> NOTE: This is mostly from memory, but I think it's correct.
>
> Lucene's IndexWriter follows transactional writes, so the segments_N file
> isn't updated until Commit is called.  In fact, updating the segments files
> is the last thing that is done in a commit, since Commit() can throw an OOM
> exception.  If things were kept in sync during each write, it would no
> longer be transactional, and you could end up with bad state in the index
> (i.e. segments files pointing to segments that aren't complete, or didn't
> merge properly).
>
> Technically, there are multiple segments that are written to disk but not
> referenced in the segments file, as you've alluded to.  Without the
> careful tracking by the index writer, things could get corrupted pretty
> quickly if it tried to sync each time, considering the default segment
> merge policy has merging done in a background thread.  It gets dicey when
> an exception is thrown on the background thread, and state can't always be
> restored in the index.  NRT search isn't really affected by this, because
> it's using a reader that's returned from the writer.  It has access to all
> of the segments that are on disk or haven't been committed yet.
>

Re: Segments files

Posted by Christopher Currens <cu...@gmail.com>.
NOTE: This is mostly from memory, but I think it's correct.

Lucene's IndexWriter follows transactional writes, so the segments_N file
isn't updated until Commit is called.  In fact, updating the segments files
is the last thing that is done in a commit, since Commit() can throw an OOM
exception.  If things were kept in sync during each write, it would no
longer be transactional, and you could end up with bad state in the index
(i.e. segments files pointing to segments that aren't complete, or didn't
merge properly).

Technically, there are multiple segments that are written to disk but not
referenced in the segments file, as you've alluded to.  Without the
careful tracking by the index writer, things could get corrupted pretty
quickly if it tried to sync each time, considering the default segment
merge policy has merging done in a background thread.  It gets dicey when
an exception is thrown on the background thread, and state can't always be
restored in the index.  NRT search isn't really affected by this, because
it's using a reader that's returned from the writer.  It has access to all
of the segments that are on disk or haven't been committed yet.