
Posted to general@lucene.apache.org by Alex vB <ma...@avomberg.de> on 2010/07/13 17:53:59 UTC

Indexing & Index format details

Hello everybody,

I read the "Lucene Index Format" paper, but some points are still not
clear to me.
I understand the file concept behind Lucene and the compound file format.
As far as I understand, adding documents and merging them works like this
with a SimpleFSDirectory implementation:

1. Documents are first buffered in RAM until MaxBufferedDocs is reached.
2. The buffered docs are flushed to the hard drive as a segment.
3. Steps 1 and 2 are repeated until MergeFactor segments exist (a larger
MergeFactor leads to more segments and fewer merge operations).
4. The segments are merged into one single segment.
5. Steps 1 – 4 are repeated until everything is indexed.

With MaxBufferedDocs = 10 and MergeFactor = 10 (which I take to be the
standard settings) this means that adding 100 documents gives me 9 segments
on disk, and the last doc triggers a merge that produces a single segment
with 100 documents (the 10th segment is held in RAM until then). Are merges
done in RAM or also on the hard drive?
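
To make my understanding of steps 1 – 5 concrete, here is a toy Java model of
the flush/merge cycle as I described it. It is only a sketch of my assumption:
the class and method names are made up, it merges all segments at once as soon
as MergeFactor segments exist, and it is not meant to reflect Lucene's real
merge policy.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of steps 1-5 above. Hypothetical names, not Lucene code.
class FlushMergeSketch {
    final int maxBufferedDocs;
    final int mergeFactor;
    int buffered = 0;                                  // docs currently held in RAM
    final List<Integer> segments = new ArrayList<>();  // doc count of each flushed segment
    int merges = 0;                                    // number of merges performed so far

    FlushMergeSketch(int maxBufferedDocs, int mergeFactor) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.mergeFactor = mergeFactor;
    }

    // Step 1: buffer a document; flush once the buffer is full.
    void addDocument() {
        buffered++;
        if (buffered == maxBufferedDocs) {
            flush();
        }
    }

    // Step 2: write the buffered docs out as a new segment.
    // Step 3: once mergeFactor segments exist, trigger a merge.
    void flush() {
        if (buffered == 0) {
            return;
        }
        segments.add(buffered);
        buffered = 0;
        if (segments.size() == mergeFactor) {
            mergeAll();
        }
    }

    // Step 4: replace all segments by one segment holding all their docs.
    void mergeAll() {
        int total = 0;
        for (int docCount : segments) {
            total += docCount;
        }
        segments.clear();
        segments.add(total);
        merges++;
    }
}
```

With `new FlushMergeSketch(10, 10)`, calling `addDocument()` 99 times leaves 9
segments plus 9 buffered docs, and the 100th call flushes the 10th segment and
merges everything into one segment, which is the behaviour I assume above.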

My questions are about the details:
1) How exactly is merging done? What is the algorithm for it?
2) When I store documents in segments, each document gets a unique number
within its segment, starting at zero. Does this imply a renumbering when I
merge several segments?

For example:
Segment1(0,1,2,3) and Segment2(0,1,2) --> Segment(0,1,2,3 (from here it’s
Segment2),4,5,6).

If I change the order in which the segments are added, the numbering changes
accordingly.
Segment2(0,1,2) and Segment1(0,1,2,3) --> Segment(0,1,2(from here it’s
Segment1),3,4,5,6).
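
The renumbering I have in mind can be sketched in a few lines of Java. This is
only an illustration of my assumption (each segment's local numbers are
shifted by the total size of the segments in front of it); the names are
hypothetical and deleted documents are ignored.

```java
// Hypothetical illustration, not Lucene code: concatenating segments by
// shifting each segment's local doc numbers by a running base offset.
class DocIdRemapSketch {
    // segmentSizes[i] is the number of docs in the i-th segment, in merge order.
    // The result holds, for each segment, the new number of its local doc 0.
    static int[] docBases(int[] segmentSizes) {
        int[] bases = new int[segmentSizes.length];
        int base = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = base;
            base += segmentSizes[i];
        }
        return bases;
    }

    // New doc number in the merged segment for a doc of the given input segment.
    static int mergedDocId(int[] bases, int segment, int localDocId) {
        return bases[segment] + localDocId;
    }
}
```

For Segment1 (4 docs) followed by Segment2 (3 docs) the bases are 0 and 4, so
Segment2's docs 0,1,2 become 4,5,6. In the reversed order the bases are 0 and
3, so Segment1's docs 0,1,2,3 become 3,4,5,6.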

3) If I merge two segments, is the second segment simply appended "behind"
the first one, with the DocIDs adjusted so that no reordering or sorting on
the hard drive is necessary? Just "copy & paste" with the renumbering
mentioned above?
4) How does Lucene write the index to the hard drive? Are the blocks written
sequentially? The API documentation says:
"A Directory is a flat list of files. Files may be written once, when they
are created. Once a file is created it may only be opened for read, or
deleted. Random access is permitted both when reading and writing."

Does this mean Lucene writes the segments sequentially and only creates holes
through deleting/updating? Does my index therefore become fragmented over
time? If so, can I defragment it by running IndexWriter.optimize() so that
the blocks on the hard drive are sequential again? Or does this just
renumber my DocIDs? Or am I totally wrong? :P

When I search the created index, Lucene first looks in the .tii file, which
is held in RAM, and then "skips" to the correct position on the hard drive.
Does this file really contain the exact hard drive positions for the term
dictionary? That would mean my .tii file holds every 128th term of the whole
index dictionary (unless I set IndexInterval to something other than 128).
The rest is done with binary search as far as I know.
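
Here is a small Java sketch of the lookup as I picture it: a sampled term list
in RAM is binary searched, and the remaining terms of that block are then
scanned in order. All names are hypothetical, the sorted array only stands in
for the on-disk term dictionary, and a "position" here is an index into that
array rather than a real file or disk position.

```java
import java.util.Arrays;

// Hypothetical sketch of a sampled term index, not Lucene code.
class TermIndexSketch {
    final String[] dictionary;    // full sorted term list (stand-in for the on-disk dictionary)
    final String[] sampledTerms;  // every interval-th term, kept in RAM
    final int interval;

    TermIndexSketch(String[] sortedTerms, int interval) {
        this.dictionary = sortedTerms;
        this.interval = interval;
        int samples = (sortedTerms.length + interval - 1) / interval;
        this.sampledTerms = new String[samples];
        for (int i = 0; i < samples; i++) {
            sampledTerms[i] = sortedTerms[i * interval];
        }
    }

    // Returns the position of the term in the dictionary, or -1 if it is absent.
    int lookup(String term) {
        int idx = Arrays.binarySearch(sampledTerms, term);
        if (idx < 0) {
            idx = -idx - 2;  // greatest sampled term that sorts before the target
        }
        if (idx < 0) {
            return -1;       // target sorts before the first term of the dictionary
        }
        int start = idx * interval;
        int end = Math.min(start + interval, dictionary.length);
        // Scan at most `interval` consecutive terms, starting at the sampled one.
        for (int pos = start; pos < end; pos++) {
            int cmp = dictionary[pos].compareTo(term);
            if (cmp == 0) {
                return pos;
            }
            if (cmp > 0) {
                break;       // passed the place where the term would be
            }
        }
        return -1;
    }
}
```

With an interval of 128 this would keep one in-memory entry per 128 dictionary
terms and touch at most 128 terms per lookup after the binary search.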

5) What happens if I copy a directory containing a built index onto another
drive? Are the positions in the .tii file still correct? Or should I copy an
index with Directory.copy(Directory src, Directory dest,
boolean closeDirSrc)? Does this readjust the positions in the .tii file?
What happens if my other hard drive has a totally different block size?

Thanks in advance and kind regards
Alex

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-Index-format-details-tp963861p963861.html