Posted to java-user@lucene.apache.org by András Péteri <ap...@b2international.com> on 2014/01/05 17:00:28 UTC

Merge policy for branching data model

Hello,

Our application uses Lucene to index documents received from a
back-end that supports storage of temporal data with branches, similar
to revision control systems like SVN: when looking at a single object,
one can choose to either retrieve the current state, go back to a
previous point in time, or switch to an alternative timeline (branch)
altogether. For indexing, we are only considering the latest revision
("HEAD") of any object on these branches. Indexes are stored in
separate Directories, and the file system's directory layout imitates
the nesting of created branches.

Creating a new branch (from the indexes' point of view) ended up very
similar to what SVN does: we used SnapshotDeletionPolicy to capture a
snapshot of the parent writer, copied the files referenced in the
resulting IndexCommit to the directory of the new branch, and released
the snapshot. Since every branch starts out as a full copy of the
parent index, this method quickly became expensive in terms of disk
space: many branches are edited simultaneously, while the number of
changed documents per branch is usually small (the dataset has about
10 million documents and the index size is about 2.5 GB).
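
As a rough illustration of the copy step described above, here is a
minimal sketch in plain Java. It does not use Lucene: the list of file
names passed to copyCommit stands in for whatever the held IndexCommit
reports, and the class name, method names and sample file names are
all made up for the example. In the real setup the snapshot would have
to stay held for the duration of the copy and be released afterwards.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

/** Plain-Java model of "copy the files of one commit into a new branch directory". */
final class BranchCopy {

    /**
     * Copies the named files from parentDir into branchDir.
     * commitFiles is a stand-in (assumption) for the file names of the snapshotted commit.
     */
    static void copyCommit(Path parentDir, Path branchDir, Collection<String> commitFiles) {
        try {
            Files.createDirectories(branchDir);
            for (String name : commitFiles) {
                Files.copy(parentDir.resolve(name), branchDir.resolve(name));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a throwaway parent directory and branches from two of its three files. */
    static List<String> demo() {
        try {
            Path parent = Files.createTempDirectory("parent");
            // The branch directory is nested inside the parent, mirroring the branch layout.
            Path branch = parent.resolve("branch1");
            for (String name : Arrays.asList("_0.cfs", "_1.cfs", "segments_1")) {
                Files.write(parent.resolve(name), new byte[] {1, 2, 3});
            }
            // Only the files that belong to the (hypothetical) commit are copied.
            copyCommit(parent, branch, Arrays.asList("_0.cfs", "segments_1"));

            List<String> copied = new ArrayList<>();
            try (Stream<Path> files = Files.list(branch)) {
                files.forEach(p -> copied.add(p.getFileName().toString()));
            }
            Collections.sort(copied);
            return copied;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sketch makes the cost visible: every file of the commit is
duplicated per branch, regardless of how few documents the branch
later changes.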

To facilitate better sharing of unchanged data between branches, we
switched to a customized IndexDeletionPolicy for writers that keeps
the branch creation points as IndexCommits in the parent index. We
also wrote a custom Directory implementation for the branch index,
similar to FileSwitchDirectory, that supplies files either from the
(writable) branch directory or from the (read-only) IndexCommit of the
parent. Attempts to sync or delete files that belong to the
IndexCommit are treated as no-ops, and output files can only be
created in the writable part. This resulted in much better disk space
utilization: after extensive editing, branch directories now typically
grow to somewhere between a few hundred kilobytes and a few megabytes
each.
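
The lookup rules of such an overlay can be sketched roughly as
follows. This is a plain-Java model working on file names only, not a
Lucene Directory; the class BranchOverlay, its methods and the Source
enum are hypothetical. Checking the branch part first and rejecting
new outputs that would shadow a shared file are assumptions of the
sketch, not something stated above.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Models where a branch index finds each file: its own directory or the parent's commit. */
final class BranchOverlay {

    enum Source { BRANCH, PARENT_COMMIT }

    private final Set<String> sharedFiles;                    // read-only part
    private final Set<String> branchFiles = new HashSet<>();  // writable part

    BranchOverlay(Set<String> parentCommitFiles) {
        this.sharedFiles = Collections.unmodifiableSet(new HashSet<>(parentCommitFiles));
    }

    /** New output always lands in the writable branch part. */
    void createOutput(String name) {
        if (sharedFiles.contains(name)) {
            // Assumption: a branch never overwrites a file of the parent commit.
            throw new IllegalArgumentException("shared file is read-only: " + name);
        }
        branchFiles.add(name);
    }

    /** Tells which part would serve the given file. */
    Source locate(String name) {
        if (branchFiles.contains(name)) {
            return Source.BRANCH;
        }
        if (sharedFiles.contains(name)) {
            return Source.PARENT_COMMIT;
        }
        throw new IllegalArgumentException("no such file: " + name);
    }

    /** Deleting a shared file is a no-op; returns true only if a branch file was removed. */
    boolean delete(String name) {
        if (sharedFiles.contains(name)) {
            return false;
        }
        return branchFiles.remove(name);
    }
}
```

With a parent commit of { "_0.cfs", "segments_1" }, a later
createOutput("_1.cfs") is served from the branch, while
delete("_0.cfs") leaves the shared file in place.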

One issue appears when the parent IndexWriter's configured merge
policy selects segments for merging from the shared part of two
branches: these segments cannot be deleted by the IndexFileDeleter
after merging, since another IndexCommit (representing the creation of
the branch) still refers to them. This leaves both the merged segment
and the original, unmerged segments in the same directory, which
increases disk space usage over time. Currently, the only way I see to
prevent this is to create a filtering MergePolicy implementation that
removes segments from the list of merge candidates if they come from
these shared parts. Can you give me some pointers on what would be the
best way to do so?
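
To make the filtering idea concrete, here is a minimal sketch in plain
Java. It is only a model: candidate merges are represented as lists of
segment names and the shared segments as a set of names, and
SharedSegmentFilter is a hypothetical helper, not a Lucene class. In
an actual MergePolicy the equivalent step would presumably run on the
merges proposed by the wrapped policy, and the exact types involved
depend on the Lucene version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Drops every candidate merge that involves a segment still referenced by a branch. */
final class SharedSegmentFilter {

    /**
     * @param candidateMerges   merges proposed by the underlying policy, as lists of segment names
     * @param protectedSegments names of segments referenced by a branch creation commit
     * @return the merges that touch no protected segment, in their original order
     */
    static List<List<String>> filter(List<List<String>> candidateMerges,
                                     Set<String> protectedSegments) {
        List<List<String>> allowed = new ArrayList<>();
        for (List<String> merge : candidateMerges) {
            // disjoint(...) is true when the merge shares no element with the protected set.
            if (Collections.disjoint(merge, protectedSegments)) {
                allowed.add(merge);
            }
        }
        return allowed;
    }
}
```

Note that this variant discards a whole merge as soon as it contains
one shared segment. An alternative would be to hide the shared
segments from the underlying policy, so that it can still pick merges
among the remaining ones.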

Thanks in advance,
András
