Posted to java-user@lucene.apache.org by Larry White <lw...@tracelink.com> on 2015/09/12 16:59:38 UTC

mutability of lucene index files

Hi,

I'm writing a backup routine for a system that includes Lucene for
full-text search. The primary data store is based on immutable files, so it
can be backed up incrementally by copying any new files (and removing any
files that have been deleted from earlier backups). It's my understanding
from brief comments found on the internet that most, if not all, of the files
that comprise a Lucene index are similarly immutable.

Can someone please confirm or deny that statement?

If the Lucene files are mostly, but not entirely, immutable, it would be
greatly appreciated if the exceptions could be identified. I would imagine
there might be log files that would be mutable, for example.

Thank you very much for your help.

Larry

RE: mutability of lucene index files

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

"segments.gen" no longer exists in Lucene 5.x (because of the Java 7 NIO.2 update). Every commit point (segments_xxx) also gets a new filename.

This means: Yes, every (and really every) file in a Lucene index is write-once. That is the basis of the whole snapshotting concept that Lucene internally uses.
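Because every file is write-once, an incremental backup can compare file names alone. Here is a minimal sketch of that idea using only java.nio.file. The names IncrementalBackup, sync and listFiles are made up for illustration. It assumes nothing is writing to the index directory while it runs (or that you restrict it to the files of a snapshotted commit), and it skips write.lock, which is the writer's lock file rather than index data.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class IncrementalBackup {

    /** Names of the regular files directly inside dir (an index is a flat directory). */
    static Set<String> listFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile)
                          .map(p -> p.getFileName().toString())
                          .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    /**
     * Brings backupDir in line with indexDir and returns the names newly copied.
     * Files are compared by name only, which is valid only for write-once files.
     */
    static Set<String> sync(Path indexDir, Path backupDir) throws IOException {
        Files.createDirectories(backupDir);
        Set<String> source = listFiles(indexDir);
        Set<String> backedUp = listFiles(backupDir);
        Set<String> copied = new TreeSet<>();

        // Pass 1 copies data files, pass 2 copies commit points (segments_N), so a
        // commit point never reaches the backup before the files it references.
        for (boolean commitPass : new boolean[] {false, true}) {
            for (String name : source) {
                if (name.equals("write.lock")) continue;               // lock file, not data
                if (name.startsWith("segments") != commitPass) continue;
                if (backedUp.contains(name)) continue;                  // already backed up
                Files.copy(indexDir.resolve(name), backupDir.resolve(name));
                copied.add(name);
            }
        }

        // Drop files that have disappeared from the index, e.g. merged-away segments.
        for (String name : backedUp) {
            if (!source.contains(name)) {
                Files.delete(backupDir.resolve(name));
            }
        }
        return copied;
    }
}
```

Calling IncrementalBackup.sync(indexDir, backupDir) repeatedly copies only what is new since the previous run. Copying the commit points last is a design choice, not something Lucene requires: it keeps an interrupted backup from containing a segments_N file that points at files that were never copied.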

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: mutability of lucene index files

Posted by Larry White <lw...@tracelink.com>.
Hi Erick,

Thank you.

Deleting old files is fine (and expected), so it sounds like the segment
files are immutable (prior to deletion) and the file that handles deletion
is renamed with every change, so it's effectively immutable, too.

That leaves the segments_* files and segments.gen, if I understand
correctly.

And thank you for the pointer. I'm hoping to use the same process to back up
and restore all my data (Lucene and otherwise), and to be able to use an
incremental approach so that the system doesn't need to be offline too
long, but I'll definitely take another look at snapshots.

Thanks again




-- 
*Larry White |  TraceLink Inc. | Principal Software Architect*
400 Riverpark Dr. | North Reading, MA | 01864
e: lwhite@tracelink.com
www.tracelink.com


*Protect patients, enable health, grow profits, ensure compliance*

Re: mutability of lucene index files

Posted by Erick Erickson <er...@gmail.com>.
The Lucene index segment files are immutable: once they're closed,
they are never changed. These are things like _1.fdt, _1.tim, etc. All
of the files with the same prefix (_1 in my example) comprise a single
"segment". Segments _will_, however, disappear. During indexing, two
or more segments are merged into a new segment, so _1.*, _2.* and
_3.* could be merged into _4.*, after which _1.*, _2.* and _3.* are removed.
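To make the "same prefix" rule concrete, here is a small sketch that groups a directory listing by segment. SegmentFiles, segmentOf and groupBySegment are hypothetical names. It assumes the segment prefix ends at the first '.' or at the second '_', which as far as I know is how Lucene names per-segment files (for example _1.fdt or _1_2.liv), but check that against the version you run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SegmentFiles {

    /** Segment prefix of a per-segment file name, or null for any other file. */
    static String segmentOf(String fileName) {
        if (!fileName.startsWith("_")) {
            return null; // e.g. segments_N or write.lock
        }
        // The prefix ends at the first '.' or the second '_', whichever comes first.
        int end = fileName.length();
        for (int i = 1; i < fileName.length(); i++) {
            char c = fileName.charAt(i);
            if (c == '.' || c == '_') {
                end = i;
                break;
            }
        }
        return fileName.substring(0, end);
    }

    /** Maps each segment prefix to the file names that belong to it, in input order. */
    static Map<String, List<String>> groupBySegment(Iterable<String> fileNames) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : fileNames) {
            String segment = segmentOf(name);
            if (segment != null) {
                groups.computeIfAbsent(segment, k -> new ArrayList<>()).add(name);
            }
        }
        return groups;
    }
}
```

After a merge, a whole group such as _1 simply vanishes from the listing and a new group such as _4 appears; no file inside a surviving group changes.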

There is one exception to the rule "segment files are not changed",
and that's the file that contains information about documents in that
segment that have been deleted. Even that file is not modified in place,
though: it is re-written under a new name at each commit in which docs
have been deleted from the segment.

And another exception is that there is a file or two that contains the
information about which segments make up the most recent (hard)
commit. In 4.x these are segments_* and segments.gen.
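If a backup tool needs to know which segments_* file is the newest commit, the suffix can be parsed: to my knowledge it is the commit generation written in base 36 (Character.MAX_RADIX), so segments_a is newer than segments_9. A rough sketch, with CommitPoints, generationOf and latestCommit being invented names:

```java
import java.util.Collection;

class CommitPoints {

    /** Generation encoded in a commit file name, or -1 if the name is not segments_N. */
    static long generationOf(String fileName) {
        String prefix = "segments_";
        if (!fileName.startsWith(prefix) || fileName.length() == prefix.length()) {
            return -1; // also rejects segments.gen, which has no generation suffix
        }
        try {
            // Assumption: the suffix is the generation in radix 36.
            return Long.parseLong(fileName.substring(prefix.length()), Character.MAX_RADIX);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Name of the commit file with the highest generation, or null if there is none. */
    static String latestCommit(Collection<String> fileNames) {
        String latest = null;
        long best = -1;
        for (String name : fileNames) {
            long generation = generationOf(name);
            if (generation > best) {
                best = generation;
                latest = name;
            }
        }
        return latest;
    }
}
```

Comparing the names as plain strings would get this wrong once the suffix grows to two characters (segments_10 is generation 36 but sorts before segments_9), which is why the sketch parses the number.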

So rather than try to wrap your head around all this and then worry
about what changes when the next major release comes out, would it
work to just use the built-in snapshot process? Here's something I
found (but didn't look at very closely) to get you started:

http://stackoverflow.com/questions/17753226/lucene-4-3-1-backup-process
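For orientation, the snapshot approach looks roughly like the sketch below. It needs lucene-core on the classpath and is written against the 5.x API as I understand it (SnapshotDeletionPolicy.snapshot(), IndexCommit.getFileNames(), SnapshotDeletionPolicy.release()); SnapshotBackup and backup are hypothetical names, and error handling is kept to a minimum.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.SnapshotDeletionPolicy;

class SnapshotBackup {

    /**
     * Copies the files of the most recent commit into backupDir while indexing continues.
     * The policy must be the one the running IndexWriter was configured with, e.g.
     *   SnapshotDeletionPolicy policy =
     *       new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
     *   indexWriterConfig.setIndexDeletionPolicy(policy);
     */
    static void backup(SnapshotDeletionPolicy policy, Path indexDir, Path backupDir)
            throws IOException {
        Files.createDirectories(backupDir);
        // Pins the latest commit: its files are not deleted until release() is called.
        // Throws IllegalStateException if the index has no commit yet.
        IndexCommit commit = policy.snapshot();
        try {
            for (String name : commit.getFileNames()) {
                Path target = backupDir.resolve(name);
                if (Files.notExists(target)) { // write-once: same name means same content
                    Files.copy(indexDir.resolve(name), target);
                }
            }
        } finally {
            policy.release(commit);
        }
    }
}
```

The snapshot only guarantees that the listed files stay on disk while you copy them; pruning files from the backup that no longer belong to any commit you want to keep is still up to you. A copy that is interrupted halfway leaves a partial file under its final name, so a real tool would probably copy to a temporary name and rename afterwards.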

And there's a link to the Lucene user's list where the question was answered.

Best,
Erick

