Posted to java-user@lucene.apache.org by Vlad K <ku...@gmail.com> on 2015/09/02 08:06:56 UTC

Lucene 5.2.1: FSDirectory, is it possible to open existing output for append?

FSDirectory.createOutput re-creates the file because it opens the stream
with TRUNCATE_EXISTING. What is the way to open an existing file and append
data? I used this in Lucene 4.1 to create a store with raw messages. I could
use Files.newOutputStream directly to do that, but I want to understand the
design idea that prohibits appending to existing data. I can't keep an
IndexOutput always open; at least after a restart of the application I
have to re-open the existing data and continue to append. What does Lucene
suggest for that now?
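
For reference, the direct Files.newOutputStream route mentioned in the question could look roughly like the sketch below. It uses only java.nio and stays outside Lucene's Directory API; the class and method names (AppendLog, append) are made up for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Lucene: appends raw bytes to a file
// that may already exist from a previous run of the application.
final class AppendLog {

    // Opens the file with CREATE + APPEND (instead of TRUNCATE_EXISTING),
    // writes the bytes at the current end of file, closes the stream and
    // returns the file length after the append.
    static long append(Path file, byte[] data) {
        try {
            try (OutputStream out = Files.newOutputStream(
                    file, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                out.write(data);
            }
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("repository", ".partial");
        // Two separate open/close cycles stand in for an application restart.
        append(file, "first\n".getBytes(StandardCharsets.UTF_8));
        long size = append(file, "second\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(size + " bytes: " + Files.readAllLines(file));
        Files.delete(file);
    }
}
```

Because the stream is opened and closed per call, nothing has to stay open across restarts; the trade-off is one open/close per append.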

Re: Lucene 5.2.1: FSDirectory, is it possible to open existing output for append?

Posted by Vlad K <ku...@gmail.com>.
Thanks Uwe, this will help us to clarify details and make a decision.

> If you write your own data files into a given index directory, it is very
> likely that you may corrupt your index. In later Lucene versions (5.x) we
> are very strict with not allowing files in the index directory, which were
> not created by Lucene. So it is better to have your repository data files
> completely separated from the index. Alternatively implement your
> additional repository data as a Lucene Codec or better store it in
> docvalues or stored fields, then it is completely under control of Lucene
> and commits work as expected. Why do you need your own logic to write the
> repository files into the index directory? Lucene is a perfect datastore,
> too. If you use it, you also make sure commits on index and your repository
> data is in a consistent state after committing.
>

Sure, we keep the index and the repository in different directories, like:

parent: bucket_N
     child: repository (it has 2 files: compressed "main" and uncompressed
"partial"; background process merges "partial" into main file)
     child: index (it uses segment number + offset to refer to the document
in the repository; we don't use docvalues/stored fields for that)

As I understand it, these are the reasons we have a repository instead of
"Lucene as datastore":
- Our data is a stream, a sort of data feed. Having the raw data lets us
archive it without indexes, export it, etc. We have no other store from
which we could retrieve it again.
- Also, our application parses the incoming data stream and saves it into
the repository first, and only then calls Lucene to produce indexes (this
reduces the risk of data loss).
- I am not sure, but maybe Lucene didn't provide good compression in old
versions and this is our legacy (we started with version 3). I wasn't
around when the decision was made.
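
A repository file that the index refers to by offset, as in the layout above, could be kept outside Lucene along these lines. This is only a sketch under assumptions: the length-prefixed record format and the names RecordFile, append and read are invented here and are not the poster's actual implementation.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical append-only record file: each record is a 4-byte length
// followed by the payload. The returned offset is what an index entry
// could store to find the raw document again.
final class RecordFile {

    static long append(Path file, byte[] payload) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            long offset = ch.size(); // the record starts at the current end of file
            ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
            buf.putInt(payload.length).put(payload).flip();
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // flush to disk before the document is indexed
            return offset;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] read(Path file, long offset) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer len = ByteBuffer.allocate(4);
            readFully(ch, len, offset);
            len.flip();
            ByteBuffer payload = ByteBuffer.allocate(len.getInt());
            readFully(ch, payload, offset + 4);
            return payload.array();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fills dst from the file, starting at file position pos.
    private static void readFully(FileChannel ch, ByteBuffer dst, long pos) throws IOException {
        while (dst.hasRemaining()) {
            if (ch.read(dst, pos + dst.position()) < 0) {
                throw new EOFException("truncated record at offset " + pos);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("partial", ".bin");
        append(file, "doc-1".getBytes(StandardCharsets.UTF_8));
        long second = append(file, "doc-2".getBytes(StandardCharsets.UTF_8));
        System.out.println(second + " -> "
                + new String(read(file, second), StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

Writing and forcing the record before the document is handed to the indexer matches the "repository first, index second" order described above.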

RE: Lucene 5.2.1: FSDirectory, is it possible to open existing output for append?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Vlad,

> Can you please share some details about that design decision "Whenever
> Lucene updates something in the index, it creates a new file". Is it right
> understanding that while IndexOutput is open, Lucene continues to use the
> same output/file. But after it closed (for instance application was restarted),
> lucene will create a new output-file? Does it mean that in case of many
> restarts, lucene will create many small files or it reads previous one and
> writes to the newly created (like merge)?

Lucene never keeps files open for a long time. It creates the index structures in memory, then opens a new file for output, writes the index structure and additional data in one go, and finally closes the file. IndexWriter does not keep the file open any longer than that.

The reason for this is the segmented index structure of Lucene: whenever something is flushed or committed to disk, this is done into a new set of files: an index segment. The Lucene index consists of several segments and each is write-once (the names of those segments are the prefixes you see in the index directory, like '_1l9i', which is the segment number in base-36 encoding). A segment of a Lucene index is a small index of its own (atomic/leaf reader in the API). When you search with IndexSearcher/IndexReader, it executes your query on all those segments in parallel and collects the results.
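
The naming and write-once rule described above can be sketched in plain Java. This is an illustration, not Lucene code: the class WriteOnceSegments, its methods and the ".dat" extension are made up here, and it assumes the segment counter is rendered with Character.MAX_RADIX (radix 36), which maps 74214 to the '_1l9i' name used as the example above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Toy model of write-once segment files (hypothetical, not Lucene's classes).
final class WriteOnceSegments {

    // Renders a segment counter as "_" + radix-36 digits.
    static String segmentName(long counter) {
        return "_" + Long.toString(counter, Character.MAX_RADIX);
    }

    // Writes the whole file in one go. CREATE_NEW makes the call fail with a
    // FileAlreadyExistsException (wrapped here) if the file already exists,
    // so an existing segment file is never overwritten or appended to.
    static Path writeSegment(Path dir, long counter, byte[] data) {
        Path file = dir.resolve(segmentName(counter) + ".dat");
        try {
            return Files.write(file, data,
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(segmentName(74214)); // prints _1l9i
        Path dir = Files.createTempDirectory("index");
        System.out.println(writeSegment(dir, 0, new byte[] {42}).getFileName());
    }
}
```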

Because the number of segment files will grow during indexing (you cannot change them), there is a background process during indexing which merges segments (smaller indexes) into larger ones. This process reads the smaller index files, creates a new, larger index segment out of them, and writes it into new index files. The old index files are removed afterwards. See a video of how this looks: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
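
The file lifecycle of such a merge (read the small files, write one new file, delete the old ones) can be shown with a simplified sketch. Note the assumption: a real Lucene merge rebuilds the index structures, whereas this hypothetical SegmentMerge class simply concatenates bytes, which would be closer to what a raw-data repository might do.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

// Toy merge (hypothetical, not Lucene's merge logic): the sources are never
// modified, the result goes into a brand-new file, and the sources are
// deleted only after the new file has been completely written and closed.
final class SegmentMerge {

    static Path merge(List<Path> sources, Path target) {
        try {
            try (OutputStream out = Files.newOutputStream(target,
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
                for (Path src : sources) {
                    Files.copy(src, out); // append the source's bytes to the target
                }
            }
            for (Path src : sources) {
                Files.delete(src);
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("bucket");
        Path a = Files.write(dir.resolve("_0.dat"), new byte[] {1, 2});
        Path b = Files.write(dir.resolve("_1.dat"), new byte[] {3});
        Path merged = merge(Arrays.asList(a, b), dir.resolve("_2.dat"));
        System.out.println(merged.getFileName() + ": " + Files.size(merged) + " bytes");
    }
}
```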

One reason for this design is snapshotting. You can take a given set of unmodifiable files and see them as a "snapshot" of your index. IndexReader/IndexSearcher operate on such a set of files. IndexWriter can write and merge data to your index in parallel without affecting the readers: they only see the set of files they were opened with. This is also the reason why you need to reopen IndexReaders to see new changes added to the index. Reopening just reads the metadata file from the index directory to get all "active" segments and then opens the sub-readers on the small segment indexes (AtomicReader/LeafReader interfaces).
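
The snapshot behaviour can be modelled in a few lines of plain Java. This is a toy model under assumptions, not Lucene's API: SnapshotModel, commitSegment and openReader are invented names, the "metadata file" is reduced to an in-memory list of segment names, and a single writer thread is assumed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of snapshot isolation over write-once segments (hypothetical).
final class SnapshotModel {

    // Stands in for the metadata file: an immutable list of active segments.
    private final AtomicReference<List<String>> committed =
            new AtomicReference<>(Collections.<String>emptyList());

    // Writer side: never mutates the published list, publishes a new one.
    void commitSegment(String name) {
        List<String> next = new ArrayList<>(committed.get());
        next.add(name);
        committed.set(Collections.unmodifiableList(next));
    }

    // Reader side: captures the list that is active right now. Later commits
    // do not change it; the reader has to be reopened to see them.
    List<String> openReader() {
        return committed.get();
    }

    public static void main(String[] args) {
        SnapshotModel index = new SnapshotModel();
        index.commitSegment("_0");
        List<String> reader = index.openReader();
        index.commitSegment("_1");
        System.out.println("old reader sees " + reader);             // [_0]
        System.out.println("reopened sees   " + index.openReader()); // [_0, _1]
    }
}
```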

> I am asking because at Lucene 4.1 we used lucene way and interfaces to
> work with our own data files. We have implemented RepositoryDirectory (FS
> and
> RAM) that implemented Directory interface and provided IndexInput and
> IndexOutput that we used to work with files. We write index and repository
> data into bucket directory and create a new bucket directory when index +
> repository reaches 1GB. That's why our raw data file size is usually 300Mb and
> we appended to it after close/restart. Now to upgrade to lucene 5 and higher
> we in a position to make a decision: either use our own interface to work
> with repository (data files) or understand lucene internals/motivation and
> continue to use it. I believe that lucene should use effective way how it
> works with Directory and maybe we could continue to use it for "raw data
> directory" too, but as results we may produce many small files (for every
> restart) or we will need to merge too big files.

If you write your own data files into a given index directory, it is very likely that you will corrupt your index. In later Lucene versions (5.x) we are very strict about not allowing files in the index directory that were not created by Lucene. So it is better to keep your repository data files completely separate from the index. Alternatively, implement your additional repository data as a Lucene Codec or, better, store it in doc values or stored fields; then it is completely under the control of Lucene and commits work as expected. Why do you need your own logic to write the repository files into the index directory? Lucene is a perfect datastore, too. If you use it, you also make sure that the index and your repository data are in a consistent state after committing.

> Can you point to some internals details?
> 
> Thanks!
> Vladimir Kuzmin
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene 5.2.1: FSDirectory, is it possible to open existing output for append?

Posted by Vlad K <ku...@gmail.com>.
Hi Uwe,

Can you please share some details about the design decision "Whenever
Lucene updates something in the index, it creates a new file"? Is it the
right understanding that while an IndexOutput is open, Lucene continues to
use the same output/file, but after it is closed (for instance, when the
application is restarted), Lucene will create a new output file? Does it
mean that in case of many restarts Lucene will create many small files, or
does it read the previous one and write to the newly created one (like a
merge)?

I am asking because in Lucene 4.1 we used the Lucene way and interfaces to
work with our own data files. We implemented a RepositoryDirectory (FS and
RAM) that implements the Directory interface and provides the IndexInput
and IndexOutput that we use to work with files. We write index and
repository data into a bucket directory and create a new bucket directory
when index + repository reaches 1 GB. That's why our raw data file size is
usually 300 MB, and we appended to it after close/restart. Now, to upgrade
to Lucene 5 and higher, we are in a position to make a decision: either use
our own interface to work with the repository (data files), or understand
Lucene's internals/motivation and continue to use it. I believe that Lucene
uses an effective way of working with Directory, and maybe we could
continue to use it for the "raw data directory" too, but as a result we may
produce many small files (one for every restart) or we will need to merge
files that are too big.

Can you point me to some details on the internals?

Thanks!
Vladimir Kuzmin


RE: Lucene 5.2.1: FSDirectory, is it possible to open existing output for append?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

Lucene never appends to files, so this is something that is not used anywhere. Whenever Lucene updates something in the index, it creates a new file. In earlier Lucene versions seeking was supported, but this was removed in Lucene 4.7 (I think). It was just a hack around some problems (the requirement to modify the header after writing a file), but this is now solved, so seek() was removed completely. And it won't come back.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
