You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@lucene.apache.org by rm...@apache.org on 2012/10/09 17:16:32 UTC
svn commit: r1396064 - in /lucene/dev/branches/branch_4x: ./ lucene/ lucene/codecs/ lucene/codecs/src/java/org/apache/lucene/codecs/block/ lucene/core/ lucene/core/src/java/org/apache/lucene/codecs/ lucene/core/src/java/org/apache/lucene/codecs/lucene40/

Author: rmuir
Date: Tue Oct  9 15:16:31 2012
New Revision: 1396064

URL: http://svn.apache.org/viewvc?rev=1396064&view=rev
Log:
LUCENE-4446: factor blocktree's documentation out of Lucene40/Block, minor cleanups to Block's docs

Modified:
    lucene/dev/branches/branch_4x/   (props changed)
    lucene/dev/branches/branch_4x/lucene/   (props changed)
    lucene/dev/branches/branch_4x/lucene/codecs/   (props changed)
    lucene/dev/branches/branch_4x/lucene/codecs/src/java/org/apache/lucene/codecs/block/BlockPostingsFormat.java
    lucene/dev/branches/branch_4x/lucene/core/   (props changed)
    lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/BlockTreeTermsWriter.java
    lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.java

Modified: lucene/dev/branches/branch_4x/lucene/codecs/src/java/org/apache/lucene/codecs/block/BlockPostingsFormat.java
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/codecs/src/java/org/apache/lucene/codecs/block/BlockPostingsFormat.java?rev=1396064&r1=1396063&r2=1396064&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/codecs/src/java/org/apache/lucene/codecs/block/BlockPostingsFormat.java (original)
+++ lucene/dev/branches/branch_4x/lucene/codecs/src/java/org/apache/lucene/codecs/block/BlockPostingsFormat.java Tue Oct  9 15:16:31 2012
@@ -38,8 +38,8 @@ import org.apache.lucene.util.IOUtils;
 import org.apache.lucene.util.packed.PackedInts;
 
 /**
- * Block postings format, which encodes postings in packed int blocks 
- * for faster decode.
+ * Block postings format, which encodes postings in packed integer blocks 
+ * for fast decode.
  *
  * <p><b>NOTE</b>: this format is still experimental and
  * subject to change without backwards compatibility.
@@ -48,21 +48,22 @@ import org.apache.lucene.util.packed.Pac
  * Basic idea:
  * <ul>
  *   <li>
- *   <b>Packed Block and VInt Block</b>: 
- *   <p>In packed block, integers are encoded with the same bit width ({@link PackedInts packed format}), 
- *      the block size (i.e. number of integers inside block) is fixed. </p>
- *   <p>In VInt block, integers are encoded as {@link DataOutput#writeVInt VInt}, 
+ *   <b>Packed Blocks and VInt Blocks</b>: 
+ *   <p>In packed blocks, integers are encoded with the same bit width ({@link PackedInts packed format}):
+ *      the block size (i.e. number of integers inside block) is fixed (currently 128). Additionally blocks
+ *      that are all the same value are encoded in an optimized way.</p>
+ *   <p>In VInt blocks, integers are encoded as {@link DataOutput#writeVInt VInt}:
  *      the block size is variable.</p>
  *   </li>
  *
  *   <li> 
  *   <b>Block structure</b>: 
- *   <p>When the postings is long enough, BlockPostingsFormat will try to encode most integer data 
- *      as packed block.</p> 
- *   <p>Take a term with 259 documents as example, the first 256 document ids are encoded as two packed 
- *      blocks, while the remaining 3 as one VInt block. </p>
+ *   <p>When the postings are long enough, BlockPostingsFormat will try to encode most integer data 
+ *      as a packed block.</p> 
+ *   <p>Take a term with 259 documents as an example, the first 256 document ids are encoded as two packed 
+ *      blocks, while the remaining 3 are encoded as one VInt block. </p>
  *   <p>Different kinds of data are always encoded separately into different packed blocks, but may 
- *      possible be encoded into a same VInt block. </p>
+ *      possibly be interleaved into the same VInt block. </p>
  *   <p>This strategy is applied to pairs: 
  *      &lt;document number, frequency&gt;,
  *      &lt;position, payload length&gt;, 
@@ -71,25 +72,25 @@ import org.apache.lucene.util.packed.Pac
  *   </li>
  *
  *   <li>
- *   <b>Skipper setting</b>: 
- *   <p>The structure of skip table is quite similar to Lucene40PostingsFormat. Skip interval is the 
+ *   <b>Skipdata settings</b>: 
+ *   <p>The structure of skip table is quite similar to previous version of Lucene. Skip interval is the 
  *      same as block size, and each skip entry points to the beginning of each block. However, for 
  *      the first block, skip data is omitted.</p>
  *   </li>
  *
  *   <li>
  *   <b>Positions, Payloads, and Offsets</b>: 
- *   <p>A position is an integer indicating where the term occurs at within one document. 
+ *   <p>A position is an integer indicating where the term occurs within one document. 
  *      A payload is a blob of metadata associated with current position. 
  *      An offset is a pair of integers indicating the tokenized start/end offsets for given term 
- *      in current position. </p>
+ *      in current position: it is essentially a specialized payload. </p>
  *   <p>When payloads and offsets are not omitted, numPositions==numPayloads==numOffsets (assuming a 
  *      null payload contributes one count). As mentioned in block structure, it is possible to encode 
  *      these three either combined or separately. 
- *   <p>For all the cases, payloads and offsets are stored together. When encoded as packed block, 
+ *   <p>In all cases, payloads and offsets are stored together. When encoded as a packed block, 
  *      position data is separated out as .pos, while payloads and offsets are encoded in .pay (payload 
- *      metadata will also be stored directly in .pay). When encoded as VInt block, all these three are 
- *      stored in .pos (so as payload metadata).</p>
+ *      metadata will also be stored directly in .pay). When encoded as VInt blocks, all these three are 
+ *      stored interleaved into the .pos (so is payload metadata).</p>
  *   <p>With this strategy, the majority of payload and offset data will be outside .pos file. 
  *      So for queries that require only position data, running on a full index with payloads and offsets, 
  *      this reduces disk pre-fetches.</p>
@@ -113,35 +114,29 @@ import org.apache.lucene.util.packed.Pac
  * <dd>
  * <b>Term Dictionary</b>
  *
- * <p>The .tim file format is quite similar to Lucene40PostingsFormat, 
- *  with minor difference in MetadataBlock</p>
+ * <p>The .tim file contains the list of terms in each
+ * field along with per-term statistics (such as docfreq)
+ * and pointers to the frequencies, positions, payload and
+ * skip data in the .doc, .pos, and .pay files.
+ * See {@link BlockTreeTermsWriter} for more details on the format.
+ * </p>
+ *
+ * <p>NOTE: The term dictionary can plug into different postings implementations:
+ * the postings writer/reader are actually responsible for encoding 
+ * and decoding the Postings Metadata and Term Metadata sections described here:</p>
  *
  * <ul>
- * <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
- *   <li>TermDictionary(.tim) --&gt; Header, PostingsHeader, PackedBlockSize, 
- *                                   &lt;Block&gt;<sup>NumBlocks</sup>, FieldSummary, DirOffset</li>
- *   <li>Block --&gt; SuffixBlock, StatsBlock, MetadataBlock</li>
- *   <li>SuffixBlock --&gt; EntryCount, SuffixLength, {@link DataOutput#writeByte byte}<sup>SuffixLength</sup></li>
- *   <li>StatsBlock --&gt; StatsLength, &lt;DocFreq, TotalTermFreq&gt;<sup>EntryCount</sup></li>
- *   <li>MetadataBlock --&gt; MetaLength, &lt;DocFPDelta, 
- *                            &lt;PosFPDelta, PosVIntBlockFPDelta?, PayFPDelta?&gt;?, 
- *                            SkipFPDelta?&gt;<sup>EntryCount</sup></li>
- *   <li>FieldSummary --&gt; NumFields, &lt;FieldNumber, NumTerms, RootCodeLength, 
- *                           {@link DataOutput#writeByte byte}<sup>RootCodeLength</sup>, SumDocFreq, DocCount&gt;
- *                           <sup>NumFields</sup></li>
- *   <li>Header, PostingsHeader --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
- *   <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}</li>
- *   <li>PackedBlockSize, EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength, 
- *       PosVIntBlockFPDelta, SkipFPDelta, NumFields, FieldNumber, RootCodeLength, DocCount --&gt; 
- *       {@link DataOutput#writeVInt VInt}</li>
- *   <li>TotalTermFreq, DocFPDelta, PosFPDelta, PayFPDelta, NumTerms, SumTotalTermFreq, SumDocFreq --&gt; 
- *       {@link DataOutput#writeVLong VLong}</li>
+ *   <li>Postings Metadata --&gt; Header, PackedBlockSize</li>
+ *   <li>Term Metadata --&gt; DocFPDelta, PosFPDelta?, PosVIntBlockFPDelta?, PayFPDelta?, 
+ *                            SkipFPDelta?</li>
+ *   <li>Header, --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
+ *   <li>PackedBlockSize, PosVIntBlockFPDelta, SkipFPDelta --&gt; {@link DataOutput#writeVInt VInt}</li>
+ *   <li>DocFPDelta, PosFPDelta, PayFPDelta --&gt; {@link DataOutput#writeVLong VLong}</li>
  * </ul>
  * <p>Notes:</p>
  * <ul>
- *    <li>Here explains MetadataBlock only, other fields are mentioned in 
- *   <a href="{@docRoot}/../core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary">Lucene40PostingsFormat:TermDictionary</a>
- *    </li>
+ *    <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
+ *        for the postings.</li>
  *    <li>PackedBlockSize is the fixed block size for packed blocks. In packed block, bit width is 
  *        determined by the largest integer. Smaller block size result in smaller variance among width 
  *        of integers hence smaller indexes. Larger block size result in more efficient bulk i/o hence
@@ -175,8 +170,8 @@ import org.apache.lucene.util.packed.Pac
  * <dl>
  * <dd>
  * <b>Term Index</b>
- * <p>The .tim file format is mentioned in
- *   <a href="{@docRoot}/../core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termindex">Lucene40PostingsFormat:TermIndex</a>
+ * <p>The .tip file contains an index into the term dictionary, so that it can be 
+ * accessed randomly.  See {@link BlockTreeTermsWriter} for more details on the format.</p>
  * </dd>
  * </dl>
  *
@@ -298,7 +293,7 @@ import org.apache.lucene.util.packed.Pac
  * <dd>
  * <b>Payloads and Offsets</b>
  * <p>The .pay file will store payloads and offsets associated with certain term-document positions. 
- *    Some payloads and offsets will be separated out into .pos file, for speedup reason.</p>
+ *    Some payloads and offsets will be separated out into .pos file, for performance reasons.</p>
  * <ul>
  *   <li>PayFile(.pay): --&gt; Header, &lt;TermPayloads, TermOffsets?&gt; <sup>TermCount</sup></li>
  *   <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
@@ -319,7 +314,7 @@ import org.apache.lucene.util.packed.Pac
  *       for PackedOffsetBlockNum.</li>
  *   <li>SumPayLength is the total length of payloads written within one block, should be the sum
  *       of PayLengths in one packed block.</li>
- *   <li>PayLength in PackedPayLengthBlock is the length of each payload, associated with current 
+ *   <li>PayLength in PackedPayLengthBlock is the length of each payload associated with the current 
  *       position.</li>
  * </ul>
  * </dd>

Modified: lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/BlockTreeTermsWriter.java
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/BlockTreeTermsWriter.java?rev=1396064&r1=1396063&r2=1396064&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/BlockTreeTermsWriter.java (original)
+++ lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/BlockTreeTermsWriter.java Tue Oct  9 15:16:31 2012
@@ -23,10 +23,12 @@ import java.util.Comparator;
 import java.util.List;
 
 import org.apache.lucene.index.FieldInfo.IndexOptions;
+import org.apache.lucene.index.DocsEnum;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.FieldInfos;
 import org.apache.lucene.index.IndexFileNames;
 import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.store.DataOutput;
 import org.apache.lucene.store.IndexOutput;
 import org.apache.lucene.store.RAMOutputStream;
 import org.apache.lucene.util.ArrayUtil;
@@ -76,6 +78,97 @@ import org.apache.lucene.util.fst.Util;
  * Writes terms dict and index, block-encoding (column
  * stride) each term's metadata for each set of terms
  * between two index terms.
+ * <p>
+ * Files:
+ * <ul>
+ *   <li><tt>.tim</tt>: <a href="#Termdictionary">Term Dictionary</a></li>
+ *   <li><tt>.tip</tt>: <a href="#Termindex">Term Index</a></li>
+ * </ul>
+ * <p>
+ * <a name="Termdictionary" id="Termdictionary"></a>
+ * <h3>Term Dictionary</h3>
+ *
+ * <p>The .tim file contains the list of terms in each
+ * field along with per-term statistics (such as docfreq)
+ * and per-term metadata (typically pointers to the postings list
+ * for that term in the inverted index).
+ * </p>
+ *
+ * <p>The .tim is arranged in blocks: with blocks containing
+ * a variable number of entries (by default 25-48), where
+ * each entry is either a term or a reference to a
+ * sub-block.</p>
+ *
+ * <p>NOTE: The term dictionary can plug into different postings implementations:
+ * the postings writer/reader are actually responsible for encoding 
+ * and decoding the Postings Metadata and Term Metadata sections.</p>
+ *
+ * <ul>
+ * <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
+ *    <li>TermsDict (.tim) --&gt; Header, <i>Postings Metadata</i>, Block<sup>NumBlocks</sup>,
+ *                               FieldSummary, DirOffset</li>
+ *    <li>Block --&gt; SuffixBlock, StatsBlock, MetadataBlock</li>
+ *    <li>SuffixBlock --&gt; EntryCount, SuffixLength, Byte<sup>SuffixLength</sup></li>
+ *    <li>StatsBlock --&gt; StatsLength, &lt;DocFreq, TotalTermFreq&gt;<sup>EntryCount</sup></li>
+ *    <li>MetadataBlock --&gt; MetaLength, &lt;<i>Term Metadata</i>&gt;<sup>EntryCount</sup></li>
+ *    <li>FieldSummary --&gt; NumFields, &lt;FieldNumber, NumTerms, RootCodeLength, Byte<sup>RootCodeLength</sup>,
+ *                            SumDocFreq, DocCount&gt;<sup>NumFields</sup></li>
+ *    <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
+ *    <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}</li>
+ *    <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength,NumFields,
+ *        FieldNumber,RootCodeLength,DocCount --&gt; {@link DataOutput#writeVInt VInt}</li>
+ *    <li>TotalTermFreq,NumTerms,SumTotalTermFreq,SumDocFreq --&gt; 
+ *        {@link DataOutput#writeVLong VLong}</li>
+ * </ul>
+ * <p>Notes:</p>
+ * <ul>
+ *    <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
+ *        for the BlockTree implementation.</li>
+ *    <li>DirOffset is a pointer to the FieldSummary section.</li>
+ *    <li>DocFreq is the count of documents which contain the term.</li>
+ *    <li>TotalTermFreq is the total number of occurrences of the term. This is encoded
+ *        as the difference between the total number of occurrences and the DocFreq.</li>
+ *    <li>FieldNumber is the fields number from {@link FieldInfos}. (.fnm)</li>
+ *    <li>NumTerms is the number of unique terms for the field.</li>
+ *    <li>RootCode points to the root block for the field.</li>
+ *    <li>SumDocFreq is the total number of postings, the number of term-document pairs across
+ *        the entire field.</li>
+ *    <li>DocCount is the number of documents that have at least one posting for this field.</li>
+ *    <li>PostingsMetadata and TermMetadata are plugged into by the specific postings implementation:
+ *        these contain arbitrary per-file data (such as parameters or versioning information) 
+ *        and per-term data (such as pointers to inverted files).
+ * </ul>
+ * <a name="Termindex" id="Termindex"></a>
+ * <h3>Term Index</h3>
+ * <p>The .tip file contains an index into the term dictionary, so that it can be 
+ * accessed randomly.  The index is also used to determine
+ * when a given term cannot exist on disk (in the .tim file), saving a disk seek.</p>
+ * <ul>
+ *   <li>TermsIndex (.tip) --&gt; Header, FSTIndex<sup>NumFields</sup>
+ *                                &lt;IndexStartFP&gt;<sup>NumFields</sup>, DirOffset</li>
+ *   <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
+ *   <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}</li>
+ *   <li>IndexStartFP --&gt; {@link DataOutput#writeVLong VLong}</li>
+ *   <!-- TODO: better describe FST output here -->
+ *   <li>FSTIndex --&gt; {@link FST FST&lt;byte[]&gt;}</li>
+ * </ul>
+ * <p>Notes:</p>
+ * <ul>
+ *   <li>The .tip file contains a separate FST for each
+ *       field.  The FST maps a term prefix to the on-disk
+ *       block that holds all terms starting with that
+ *       prefix.  Each field's IndexStartFP points to its
+ *       FST.</li>
+ *   <li>DirOffset is a pointer to the start of the IndexStartFPs
+ *       for all fields</li>
+ *   <li>It's possible that an on-disk block would contain
+ *       too many terms (more than the allowed maximum
+ *       (default: 48)).  When this happens, the block is
+ *       sub-divided into new blocks (called "floor
+ *       blocks"), and then the output in the FST for the
+ *       block's prefix encodes the leading byte of each
+ *       sub-block, and its file pointer.
+ * </ul>
  *
  * @see BlockTreeTermsReader
  * @lucene.experimental

Modified: lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.java
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.java?rev=1396064&r1=1396063&r2=1396064&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.java (original)
+++ lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.java Tue Oct  9 15:16:31 2012
@@ -54,43 +54,24 @@ import org.apache.lucene.util.fst.FST; /
  * field along with per-term statistics (such as docfreq)
  * and pointers to the frequencies, positions and
  * skip data in the .frq and .prx files.
+ * See {@link BlockTreeTermsWriter} for more details on the format.
  * </p>
  *
- * <p>The .tim is arranged in blocks: with blocks containing
- * a variable number of entries (by default 25-48), where
- * each entry is either a term or a reference to a
- * sub-block.  It's written by {@link BlockTreeTermsWriter}
- * and read by {@link BlockTreeTermsReader}.</p>
- *
  * <p>NOTE: The term dictionary can plug into different postings implementations:
- * for example the postings writer/reader are actually responsible for encoding 
- * and decoding the MetadataBlock.</p>
- *
+ * the postings writer/reader are actually responsible for encoding 
+ * and decoding the Postings Metadata and Term Metadata sections described here:</p>
  * <ul>
- * <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
- *    <li>TermsDict (.tim) --&gt; Header, PostingsHeader, SkipInterval,
- *                               MaxSkipLevels, SkipMinimum, Block<sup>NumBlocks</sup>,
- *                               FieldSummary, DirOffset</li>
- *    <li>Block --&gt; SuffixBlock, StatsBlock, MetadataBlock</li>
- *    <li>SuffixBlock --&gt; EntryCount, SuffixLength, Byte<sup>SuffixLength</sup></li>
- *    <li>StatsBlock --&gt; StatsLength, &lt;DocFreq, TotalTermFreq&gt;<sup>EntryCount</sup></li>
- *    <li>MetadataBlock --&gt; MetaLength, &lt;FreqDelta, SkipDelta?, ProxDelta?&gt;<sup>EntryCount</sup></li>
- *    <li>FieldSummary --&gt; NumFields, &lt;FieldNumber, NumTerms, RootCodeLength, Byte<sup>RootCodeLength</sup>,
- *                            SumDocFreq, DocCount&gt;<sup>NumFields</sup></li>
- *    <li>Header,PostingsHeader --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
- *    <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}</li>
+ *    <li>Postings Metadata --&gt; Header, SkipInterval, MaxSkipLevels, SkipMinimum</li>
+ *    <li>Term Metadata --&gt; FreqDelta, SkipDelta?, ProxDelta?
+ *    <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
  *    <li>SkipInterval,MaxSkipLevels,SkipMinimum --&gt; {@link DataOutput#writeInt Uint32}</li>
- *    <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength,SkipDelta,NumFields,
- *        FieldNumber,RootCodeLength,DocCount --&gt; {@link DataOutput#writeVInt VInt}</li>
- *    <li>TotalTermFreq,FreqDelta,ProxDelta,NumTerms,SumTotalTermFreq,SumDocFreq --&gt; 
- *        {@link DataOutput#writeVLong VLong}</li>
+ *    <li>SkipDelta --&gt; {@link DataOutput#writeVInt VInt}</li>
+ *    <li>FreqDelta,ProxDelta --&gt; {@link DataOutput#writeVLong VLong}</li>
  * </ul>
  * <p>Notes:</p>
  * <ul>
  *    <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
- *        for the BlockTree implementation. On the other hand, PostingsHeader stores the version
- *        information for the postings reader/writer.</li>
- *    <li>DirOffset is a pointer to the FieldSummary section.</li>
+ *        for the postings.</li>
  *    <li>SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate 
  *        {@link DocsEnum#advance(int)}. Larger values result in smaller indexes, greater 
  *        acceleration, but fewer accelerable cases, while smaller values result in bigger indexes, 
@@ -102,9 +83,6 @@ import org.apache.lucene.util.fst.FST; /
  *        information about skip levels.</li>
  *    <li>SkipMinimum is the minimum document frequency a term must have in order to write any 
  *        skip data at all.</li>
- *    <li>DocFreq is the count of documents which contain the term.</li>
- *    <li>TotalTermFreq is the total number of occurrences of the term. This is encoded
- *        as the difference between the total number of occurrences and the DocFreq.</li>
  *    <li>FreqDelta determines the position of this term's TermFreqs within the .frq
  *        file. In particular, it is the difference between the position of this term's
  *        data in that file and the position of the previous term's data (or zero, for
@@ -118,44 +96,11 @@ import org.apache.lucene.util.fst.FST; /
  *        file. In particular, it is the number of bytes after TermFreqs that the
  *        SkipData starts. In other words, it is the length of the TermFreq data.
  *        SkipDelta is only stored if DocFreq is not smaller than SkipMinimum.</li>
- *    <li>FieldNumber is the fields number from {@link FieldInfos}. (.fnm)</li>
- *    <li>NumTerms is the number of unique terms for the field.</li>
- *    <li>RootCode points to the root block for the field.</li>
- *    <li>SumDocFreq is the total number of postings, the number of term-document pairs across
- *        the entire field.</li>
- *    <li>DocCount is the number of documents that have at least one posting for this field.</li>
  * </ul>
  * <a name="Termindex" id="Termindex"></a>
  * <h3>Term Index</h3>
  * <p>The .tip file contains an index into the term dictionary, so that it can be 
- * accessed randomly.  The index is also used to determine
- * when a given term cannot exist on disk (in the .tim file), saving a disk seek.</p>
- * <ul>
- *   <li>TermsIndex (.tip) --&gt; Header, FSTIndex<sup>NumFields</sup>, 
- *                                &lt;IndexStartFP&gt;<sup>NumFields</sup>, DirOffset</li>
- *   <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
- *   <li>IndexStartFP --&gt; {@link DataOutput#writeVLong VLong}</li>
- *   <!-- TODO: better describe FST output here -->
- *   <li>FSTIndex --&gt; {@link FST FST&lt;byte[]&gt;}</li>
- *   <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}</li>
- * </ul>
- * <p>Notes:</p>
- * <ul>
- *   <li>The .tip file contains a separate FST for each
- *       field.  The FST maps a term prefix to the on-disk
- *       block that holds all terms starting with that
- *       prefix.  Each field's IndexStartFP points to its
- *       FST.</li>
- *   <li>DirOffset is a pointer to the start of the IndexStartFPs
- *       for all fields</li>
- *   <li>It's possible that an on-disk block would contain
- *       too many terms (more than the allowed maximum
- *       (default: 48)).  When this happens, the block is
- *       sub-divided into new blocks (called "floor
- *       blocks"), and then the output in the FST for the
- *       block's prefix encodes the leading byte of each
- *       sub-block, and its file pointer.
- * </ul>
+ * accessed randomly.  See {@link BlockTreeTermsWriter} for more details on the format.</p>
  * <a name="Frequencies" id="Frequencies"></a>
  * <h3>Frequencies</h3>
  * <p>The .frq file contains the lists of documents which contain each term, along