You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@lucene.apache.org by rm...@apache.org on 2012/10/09 17:16:32 UTC
svn commit: r1396064 - in /lucene/dev/branches/branch_4x: ./ lucene/
lucene/codecs/ lucene/codecs/src/java/org/apache/lucene/codecs/block/
lucene/core/ lucene/core/src/java/org/apache/lucene/codecs/
lucene/core/src/java/org/apache/lucene/codecs/lucene40/
Author: rmuir
Date: Tue Oct 9 15:16:31 2012
New Revision: 1396064
URL: http://svn.apache.org/viewvc?rev=1396064&view=rev
Log:
LUCENE-4446: factor blocktree's documentation out of Lucene40/Block, minor cleanups to Block's docs
Modified:
lucene/dev/branches/branch_4x/ (props changed)
lucene/dev/branches/branch_4x/lucene/ (props changed)
lucene/dev/branches/branch_4x/lucene/codecs/ (props changed)
lucene/dev/branches/branch_4x/lucene/codecs/src/java/org/apache/lucene/codecs/block/BlockPostingsFormat.java
lucene/dev/branches/branch_4x/lucene/core/ (props changed)
lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/BlockTreeTermsWriter.java
lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.java
Modified: lucene/dev/branches/branch_4x/lucene/codecs/src/java/org/apache/lucene/codecs/block/BlockPostingsFormat.java
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/codecs/src/java/org/apache/lucene/codecs/block/BlockPostingsFormat.java?rev=1396064&r1=1396063&r2=1396064&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/codecs/src/java/org/apache/lucene/codecs/block/BlockPostingsFormat.java (original)
+++ lucene/dev/branches/branch_4x/lucene/codecs/src/java/org/apache/lucene/codecs/block/BlockPostingsFormat.java Tue Oct 9 15:16:31 2012
@@ -38,8 +38,8 @@ import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.packed.PackedInts;
/**
- * Block postings format, which encodes postings in packed int blocks
- * for faster decode.
+ * Block postings format, which encodes postings in packed integer blocks
+ * for fast decode.
*
* <p><b>NOTE</b>: this format is still experimental and
* subject to change without backwards compatibility.
@@ -48,21 +48,22 @@ import org.apache.lucene.util.packed.Pac
* Basic idea:
* <ul>
* <li>
- * <b>Packed Block and VInt Block</b>:
- * <p>In packed block, integers are encoded with the same bit width ({@link PackedInts packed format}),
- * the block size (i.e. number of integers inside block) is fixed. </p>
- * <p>In VInt block, integers are encoded as {@link DataOutput#writeVInt VInt},
+ * <b>Packed Blocks and VInt Blocks</b>:
+ * <p>In packed blocks, integers are encoded with the same bit width ({@link PackedInts packed format}):
+ * the block size (i.e. number of integers inside block) is fixed (currently 128). Additionally blocks
+ * that are all the same value are encoded in an optimized way.</p>
+ * <p>In VInt blocks, integers are encoded as {@link DataOutput#writeVInt VInt}:
* the block size is variable.</p>
* </li>
*
* <li>
* <b>Block structure</b>:
- * <p>When the postings is long enough, BlockPostingsFormat will try to encode most integer data
- * as packed block.</p>
- * <p>Take a term with 259 documents as example, the first 256 document ids are encoded as two packed
- * blocks, while the remaining 3 as one VInt block. </p>
+ * <p>When the postings are long enough, BlockPostingsFormat will try to encode most integer data
+ * as a packed block.</p>
+ * <p>Take a term with 259 documents as an example, the first 256 document ids are encoded as two packed
+ * blocks, while the remaining 3 are encoded as one VInt block. </p>
* <p>Different kinds of data are always encoded separately into different packed blocks, but may
- * possible be encoded into a same VInt block. </p>
+ * possibly be interleaved into the same VInt block. </p>
* <p>This strategy is applied to pairs:
* <document number, frequency>,
* <position, payload length>,
@@ -71,25 +72,25 @@ import org.apache.lucene.util.packed.Pac
* </li>
*
* <li>
- * <b>Skipper setting</b>:
- * <p>The structure of skip table is quite similar to Lucene40PostingsFormat. Skip interval is the
+ * <b>Skipdata settings</b>:
+ * <p>The structure of skip table is quite similar to previous version of Lucene. Skip interval is the
* same as block size, and each skip entry points to the beginning of each block. However, for
* the first block, skip data is omitted.</p>
* </li>
*
* <li>
* <b>Positions, Payloads, and Offsets</b>:
- * <p>A position is an integer indicating where the term occurs at within one document.
+ * <p>A position is an integer indicating where the term occurs within one document.
* A payload is a blob of metadata associated with current position.
* An offset is a pair of integers indicating the tokenized start/end offsets for given term
- * in current position. </p>
+ * in current position: it is essentially a specialized payload. </p>
* <p>When payloads and offsets are not omitted, numPositions==numPayloads==numOffsets (assuming a
* null payload contributes one count). As mentioned in block structure, it is possible to encode
* these three either combined or separately.
- * <p>For all the cases, payloads and offsets are stored together. When encoded as packed block,
+ * <p>In all cases, payloads and offsets are stored together. When encoded as a packed block,
* position data is separated out as .pos, while payloads and offsets are encoded in .pay (payload
- * metadata will also be stored directly in .pay). When encoded as VInt block, all these three are
- * stored in .pos (so as payload metadata).</p>
+ * metadata will also be stored directly in .pay). When encoded as VInt blocks, all these three are
+ * stored interleaved into the .pos (so is payload metadata).</p>
* <p>With this strategy, the majority of payload and offset data will be outside .pos file.
* So for queries that require only position data, running on a full index with payloads and offsets,
* this reduces disk pre-fetches.</p>
@@ -113,35 +114,29 @@ import org.apache.lucene.util.packed.Pac
* <dd>
* <b>Term Dictionary</b>
*
- * <p>The .tim file format is quite similar to Lucene40PostingsFormat,
- * with minor difference in MetadataBlock</p>
+ * <p>The .tim file contains the list of terms in each
+ * field along with per-term statistics (such as docfreq)
+ * and pointers to the frequencies, positions, payload and
+ * skip data in the .doc, .pos, and .pay files.
+ * See {@link BlockTreeTermsWriter} for more details on the format.
+ * </p>
+ *
+ * <p>NOTE: The term dictionary can plug into different postings implementations:
+ * the postings writer/reader are actually responsible for encoding
+ * and decoding the Postings Metadata and Term Metadata sections described here:</p>
*
* <ul>
- * <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
- * <li>TermDictionary(.tim) --> Header, PostingsHeader, PackedBlockSize,
- * <Block><sup>NumBlocks</sup>, FieldSummary, DirOffset</li>
- * <li>Block --> SuffixBlock, StatsBlock, MetadataBlock</li>
- * <li>SuffixBlock --> EntryCount, SuffixLength, {@link DataOutput#writeByte byte}<sup>SuffixLength</sup></li>
- * <li>StatsBlock --> StatsLength, <DocFreq, TotalTermFreq><sup>EntryCount</sup></li>
- * <li>MetadataBlock --> MetaLength, <DocFPDelta,
- * <PosFPDelta, PosVIntBlockFPDelta?, PayFPDelta?>?,
- * SkipFPDelta?><sup>EntryCount</sup></li>
- * <li>FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength,
- * {@link DataOutput#writeByte byte}<sup>RootCodeLength</sup>, SumDocFreq, DocCount>
- * <sup>NumFields</sup></li>
- * <li>Header, PostingsHeader --> {@link CodecUtil#writeHeader CodecHeader}</li>
- * <li>DirOffset --> {@link DataOutput#writeLong Uint64}</li>
- * <li>PackedBlockSize, EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength,
- * PosVIntBlockFPDelta, SkipFPDelta, NumFields, FieldNumber, RootCodeLength, DocCount -->
- * {@link DataOutput#writeVInt VInt}</li>
- * <li>TotalTermFreq, DocFPDelta, PosFPDelta, PayFPDelta, NumTerms, SumTotalTermFreq, SumDocFreq -->
- * {@link DataOutput#writeVLong VLong}</li>
+ * <li>Postings Metadata --> Header, PackedBlockSize</li>
+ * <li>Term Metadata --> DocFPDelta, PosFPDelta?, PosVIntBlockFPDelta?, PayFPDelta?,
+ * SkipFPDelta?</li>
+ * <li>Header, --> {@link CodecUtil#writeHeader CodecHeader}</li>
+ * <li>PackedBlockSize, PosVIntBlockFPDelta, SkipFPDelta --> {@link DataOutput#writeVInt VInt}</li>
+ * <li>DocFPDelta, PosFPDelta, PayFPDelta --> {@link DataOutput#writeVLong VLong}</li>
* </ul>
* <p>Notes:</p>
* <ul>
- * <li>Here explains MetadataBlock only, other fields are mentioned in
- * <a href="{@docRoot}/../core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary">Lucene40PostingsFormat:TermDictionary</a>
- * </li>
+ * <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
+ * for the postings.</li>
* <li>PackedBlockSize is the fixed block size for packed blocks. In packed block, bit width is
* determined by the largest integer. Smaller block size result in smaller variance among width
* of integers hence smaller indexes. Larger block size result in more efficient bulk i/o hence
@@ -175,8 +170,8 @@ import org.apache.lucene.util.packed.Pac
* <dl>
* <dd>
* <b>Term Index</b>
- * <p>The .tim file format is mentioned in
- * <a href="{@docRoot}/../core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termindex">Lucene40PostingsFormat:TermIndex</a>
+ * <p>The .tip file contains an index into the term dictionary, so that it can be
+ * accessed randomly. See {@link BlockTreeTermsWriter} for more details on the format.</p>
* </dd>
* </dl>
*
@@ -298,7 +293,7 @@ import org.apache.lucene.util.packed.Pac
* <dd>
* <b>Payloads and Offsets</b>
* <p>The .pay file will store payloads and offsets associated with certain term-document positions.
- * Some payloads and offsets will be separated out into .pos file, for speedup reason.</p>
+ * Some payloads and offsets will be separated out into .pos file, for performance reasons.</p>
* <ul>
* <li>PayFile(.pay): --> Header, <TermPayloads, TermOffsets?> <sup>TermCount</sup></li>
* <li>Header --> {@link CodecUtil#writeHeader CodecHeader}</li>
@@ -319,7 +314,7 @@ import org.apache.lucene.util.packed.Pac
* for PackedOffsetBlockNum.</li>
* <li>SumPayLength is the total length of payloads written within one block, should be the sum
* of PayLengths in one packed block.</li>
- * <li>PayLength in PackedPayLengthBlock is the length of each payload, associated with current
+ * <li>PayLength in PackedPayLengthBlock is the length of each payload associated with the current
* position.</li>
* </ul>
* </dd>
Modified: lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/BlockTreeTermsWriter.java
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/BlockTreeTermsWriter.java?rev=1396064&r1=1396063&r2=1396064&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/BlockTreeTermsWriter.java (original)
+++ lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/BlockTreeTermsWriter.java Tue Oct 9 15:16:31 2012
@@ -23,10 +23,12 @@ import java.util.Comparator;
import java.util.List;
import org.apache.lucene.index.FieldInfo.IndexOptions;
+import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.store.DataOutput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RAMOutputStream;
import org.apache.lucene.util.ArrayUtil;
@@ -76,6 +78,97 @@ import org.apache.lucene.util.fst.Util;
* Writes terms dict and index, block-encoding (column
* stride) each term's metadata for each set of terms
* between two index terms.
+ * <p>
+ * Files:
+ * <ul>
+ * <li><tt>.tim</tt>: <a href="#Termdictionary">Term Dictionary</a></li>
+ * <li><tt>.tip</tt>: <a href="#Termindex">Term Index</a></li>
+ * </ul>
+ * <p>
+ * <a name="Termdictionary" id="Termdictionary"></a>
+ * <h3>Term Dictionary</h3>
+ *
+ * <p>The .tim file contains the list of terms in each
+ * field along with per-term statistics (such as docfreq)
+ * and per-term metadata (typically pointers to the postings list
+ * for that term in the inverted index).
+ * </p>
+ *
+ * <p>The .tim is arranged in blocks: with blocks containing
+ * a variable number of entries (by default 25-48), where
+ * each entry is either a term or a reference to a
+ * sub-block.</p>
+ *
+ * <p>NOTE: The term dictionary can plug into different postings implementations:
+ * the postings writer/reader are actually responsible for encoding
+ * and decoding the Postings Metadata and Term Metadata sections.</p>
+ *
+ * <ul>
+ * <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
+ * <li>TermsDict (.tim) --> Header, <i>Postings Metadata</i>, Block<sup>NumBlocks</sup>,
+ * FieldSummary, DirOffset</li>
+ * <li>Block --> SuffixBlock, StatsBlock, MetadataBlock</li>
+ * <li>SuffixBlock --> EntryCount, SuffixLength, Byte<sup>SuffixLength</sup></li>
+ * <li>StatsBlock --> StatsLength, <DocFreq, TotalTermFreq><sup>EntryCount</sup></li>
+ * <li>MetadataBlock --> MetaLength, <<i>Term Metadata</i>><sup>EntryCount</sup></li>
+ * <li>FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte<sup>RootCodeLength</sup>,
+ * SumDocFreq, DocCount><sup>NumFields</sup></li>
+ * <li>Header --> {@link CodecUtil#writeHeader CodecHeader}</li>
+ * <li>DirOffset --> {@link DataOutput#writeLong Uint64}</li>
+ * <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength,NumFields,
+ * FieldNumber,RootCodeLength,DocCount --> {@link DataOutput#writeVInt VInt}</li>
+ * <li>TotalTermFreq,NumTerms,SumTotalTermFreq,SumDocFreq -->
+ * {@link DataOutput#writeVLong VLong}</li>
+ * </ul>
+ * <p>Notes:</p>
+ * <ul>
+ * <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
+ * for the BlockTree implementation.</li>
+ * <li>DirOffset is a pointer to the FieldSummary section.</li>
+ * <li>DocFreq is the count of documents which contain the term.</li>
+ * <li>TotalTermFreq is the total number of occurrences of the term. This is encoded
+ * as the difference between the total number of occurrences and the DocFreq.</li>
+ * <li>FieldNumber is the fields number from {@link FieldInfos}. (.fnm)</li>
+ * <li>NumTerms is the number of unique terms for the field.</li>
+ * <li>RootCode points to the root block for the field.</li>
+ * <li>SumDocFreq is the total number of postings, the number of term-document pairs across
+ * the entire field.</li>
+ * <li>DocCount is the number of documents that have at least one posting for this field.</li>
+ * <li>PostingsMetadata and TermMetadata are plugged into by the specific postings implementation:
+ * these contain arbitrary per-file data (such as parameters or versioning information)
+ * and per-term data (such as pointers to inverted files).
+ * </ul>
+ * <a name="Termindex" id="Termindex"></a>
+ * <h3>Term Index</h3>
+ * <p>The .tip file contains an index into the term dictionary, so that it can be
+ * accessed randomly. The index is also used to determine
+ * when a given term cannot exist on disk (in the .tim file), saving a disk seek.</p>
+ * <ul>
+ * <li>TermsIndex (.tip) --> Header, FSTIndex<sup>NumFields</sup>
+ * <IndexStartFP><sup>NumFields</sup>, DirOffset</li>
+ * <li>Header --> {@link CodecUtil#writeHeader CodecHeader}</li>
+ * <li>DirOffset --> {@link DataOutput#writeLong Uint64}</li>
+ * <li>IndexStartFP --> {@link DataOutput#writeVLong VLong}</li>
+ * <!-- TODO: better describe FST output here -->
+ * <li>FSTIndex --> {@link FST FST<byte[]>}</li>
+ * </ul>
+ * <p>Notes:</p>
+ * <ul>
+ * <li>The .tip file contains a separate FST for each
+ * field. The FST maps a term prefix to the on-disk
+ * block that holds all terms starting with that
+ * prefix. Each field's IndexStartFP points to its
+ * FST.</li>
+ * <li>DirOffset is a pointer to the start of the IndexStartFPs
+ * for all fields</li>
+ * <li>It's possible that an on-disk block would contain
+ * too many terms (more than the allowed maximum
+ * (default: 48)). When this happens, the block is
+ * sub-divided into new blocks (called "floor
+ * blocks"), and then the output in the FST for the
+ * block's prefix encodes the leading byte of each
+ * sub-block, and its file pointer.
+ * </ul>
*
* @see BlockTreeTermsReader
* @lucene.experimental
Modified: lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.java
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.java?rev=1396064&r1=1396063&r2=1396064&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.java (original)
+++ lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.java Tue Oct 9 15:16:31 2012
@@ -54,43 +54,24 @@ import org.apache.lucene.util.fst.FST; /
* field along with per-term statistics (such as docfreq)
* and pointers to the frequencies, positions and
* skip data in the .frq and .prx files.
+ * See {@link BlockTreeTermsWriter} for more details on the format.
* </p>
*
- * <p>The .tim is arranged in blocks: with blocks containing
- * a variable number of entries (by default 25-48), where
- * each entry is either a term or a reference to a
- * sub-block. It's written by {@link BlockTreeTermsWriter}
- * and read by {@link BlockTreeTermsReader}.</p>
- *
* <p>NOTE: The term dictionary can plug into different postings implementations:
- * for example the postings writer/reader are actually responsible for encoding
- * and decoding the MetadataBlock.</p>
- *
+ * the postings writer/reader are actually responsible for encoding
+ * and decoding the Postings Metadata and Term Metadata sections described here:</p>
* <ul>
- * <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
- * <li>TermsDict (.tim) --> Header, PostingsHeader, SkipInterval,
- * MaxSkipLevels, SkipMinimum, Block<sup>NumBlocks</sup>,
- * FieldSummary, DirOffset</li>
- * <li>Block --> SuffixBlock, StatsBlock, MetadataBlock</li>
- * <li>SuffixBlock --> EntryCount, SuffixLength, Byte<sup>SuffixLength</sup></li>
- * <li>StatsBlock --> StatsLength, <DocFreq, TotalTermFreq><sup>EntryCount</sup></li>
- * <li>MetadataBlock --> MetaLength, <FreqDelta, SkipDelta?, ProxDelta?><sup>EntryCount</sup></li>
- * <li>FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte<sup>RootCodeLength</sup>,
- * SumDocFreq, DocCount><sup>NumFields</sup></li>
- * <li>Header,PostingsHeader --> {@link CodecUtil#writeHeader CodecHeader}</li>
- * <li>DirOffset --> {@link DataOutput#writeLong Uint64}</li>
+ * <li>Postings Metadata --> Header, SkipInterval, MaxSkipLevels, SkipMinimum</li>
+ * <li>Term Metadata --> FreqDelta, SkipDelta?, ProxDelta?
+ * <li>Header --> {@link CodecUtil#writeHeader CodecHeader}</li>
* <li>SkipInterval,MaxSkipLevels,SkipMinimum --> {@link DataOutput#writeInt Uint32}</li>
- * <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength,SkipDelta,NumFields,
- * FieldNumber,RootCodeLength,DocCount --> {@link DataOutput#writeVInt VInt}</li>
- * <li>TotalTermFreq,FreqDelta,ProxDelta,NumTerms,SumTotalTermFreq,SumDocFreq -->
- * {@link DataOutput#writeVLong VLong}</li>
+ * <li>SkipDelta --> {@link DataOutput#writeVInt VInt}</li>
+ * <li>FreqDelta,ProxDelta --> {@link DataOutput#writeVLong VLong}</li>
* </ul>
* <p>Notes:</p>
* <ul>
* <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
- * for the BlockTree implementation. On the other hand, PostingsHeader stores the version
- * information for the postings reader/writer.</li>
- * <li>DirOffset is a pointer to the FieldSummary section.</li>
+ * for the postings.</li>
* <li>SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate
* {@link DocsEnum#advance(int)}. Larger values result in smaller indexes, greater
* acceleration, but fewer accelerable cases, while smaller values result in bigger indexes,
@@ -102,9 +83,6 @@ import org.apache.lucene.util.fst.FST; /
* information about skip levels.</li>
* <li>SkipMinimum is the minimum document frequency a term must have in order to write any
* skip data at all.</li>
- * <li>DocFreq is the count of documents which contain the term.</li>
- * <li>TotalTermFreq is the total number of occurrences of the term. This is encoded
- * as the difference between the total number of occurrences and the DocFreq.</li>
* <li>FreqDelta determines the position of this term's TermFreqs within the .frq
* file. In particular, it is the difference between the position of this term's
* data in that file and the position of the previous term's data (or zero, for
@@ -118,44 +96,11 @@ import org.apache.lucene.util.fst.FST; /
* file. In particular, it is the number of bytes after TermFreqs that the
* SkipData starts. In other words, it is the length of the TermFreq data.
* SkipDelta is only stored if DocFreq is not smaller than SkipMinimum.</li>
- * <li>FieldNumber is the fields number from {@link FieldInfos}. (.fnm)</li>
- * <li>NumTerms is the number of unique terms for the field.</li>
- * <li>RootCode points to the root block for the field.</li>
- * <li>SumDocFreq is the total number of postings, the number of term-document pairs across
- * the entire field.</li>
- * <li>DocCount is the number of documents that have at least one posting for this field.</li>
* </ul>
* <a name="Termindex" id="Termindex"></a>
* <h3>Term Index</h3>
* <p>The .tip file contains an index into the term dictionary, so that it can be
- * accessed randomly. The index is also used to determine
- * when a given term cannot exist on disk (in the .tim file), saving a disk seek.</p>
- * <ul>
- * <li>TermsIndex (.tip) --> Header, FSTIndex<sup>NumFields</sup>,
- * <IndexStartFP><sup>NumFields</sup>, DirOffset</li>
- * <li>Header --> {@link CodecUtil#writeHeader CodecHeader}</li>
- * <li>IndexStartFP --> {@link DataOutput#writeVLong VLong}</li>
- * <!-- TODO: better describe FST output here -->
- * <li>FSTIndex --> {@link FST FST<byte[]>}</li>
- * <li>DirOffset --> {@link DataOutput#writeLong Uint64}</li>
- * </ul>
- * <p>Notes:</p>
- * <ul>
- * <li>The .tip file contains a separate FST for each
- * field. The FST maps a term prefix to the on-disk
- * block that holds all terms starting with that
- * prefix. Each field's IndexStartFP points to its
- * FST.</li>
- * <li>DirOffset is a pointer to the start of the IndexStartFPs
- * for all fields</li>
- * <li>It's possible that an on-disk block would contain
- * too many terms (more than the allowed maximum
- * (default: 48)). When this happens, the block is
- * sub-divided into new blocks (called "floor
- * blocks"), and then the output in the FST for the
- * block's prefix encodes the leading byte of each
- * sub-block, and its file pointer.
- * </ul>
+ * accessed randomly. See {@link BlockTreeTermsWriter} for more details on the format.</p>
* <a name="Frequencies" id="Frequencies"></a>
* <h3>Frequencies</h3>
* <p>The .frq file contains the lists of documents which contain each term, along