Posted to commits@lucene.apache.org by to...@apache.org on 2021/05/08 23:45:31 UTC
[lucene] branch main updated: reorganize termvectors format description (javadocs). (#130)
This is an automated email from the ASF dual-hosted git repository.
tomoko pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/lucene.git
The following commit(s) were added to refs/heads/main by this push:
new 6ebf959 reorganize termvectors format description (javadocs). (#130)
6ebf959 is described below
commit 6ebf959502cc0ad125ca0cf88a0c28071a4fa70e
Author: Tomoko Uchida <to...@gmail.com>
AuthorDate: Sun May 9 08:45:24 2021 +0900
reorganize termvectors format description (javadocs). (#130)
---
.../codecs/lucene90/Lucene90TermVectorsFormat.java | 115 +++++++++++----------
1 file changed, 61 insertions(+), 54 deletions(-)
diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90TermVectorsFormat.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90TermVectorsFormat.java
index 0142f54..e19168f 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90TermVectorsFormat.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90TermVectorsFormat.java
@@ -56,15 +56,20 @@ import org.apache.lucene.util.packed.PackedInts;
* <li>VectorMeta (.tvm) --> <Header>, PackedIntsVersion, ChunkSize,
* ChunkIndexMetadata, ChunkCount, DirtyChunkCount, DirtyDocsCount, Footer
* <li>Header --> {@link CodecUtil#writeIndexHeader IndexHeader}
- * <li>PackedIntsVersion --> {@link PackedInts#VERSION_CURRENT} as a {@link
- * DataOutput#writeVInt VInt}
- * <li>ChunkSize is the number of bytes of terms to accumulate before flushing, as a {@link
- * DataOutput#writeVInt VInt}
- * <li>ChunkCount is not known in advance and is the number of chunks necessary to store all
- * document of the segment
- * <li>DirtyChunkCount --> the number of prematurely flushed chunks in the .tvd file
+ * <li>PackedIntsVersion, ChunkSize --> {@link DataOutput#writeVInt VInt}
+ * <li>ChunkCount, DirtyChunkCount, DirtyDocsCount --> {@link DataOutput#writeVLong
+ * VLong}
+ * <li>ChunkIndexMetadata --> {@link FieldsIndexWriter}
* <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
* </ul>
+ * <p>Notes:
+ * <ul>
+ * <li>PackedIntsVersion is {@link PackedInts#VERSION_CURRENT}.
+ * <li>ChunkSize is the number of bytes of terms to accumulate before flushing.
+ * <li>ChunkCount is not known in advance and is the number of chunks necessary to store all
+ * documents of the segment.
+ * <li>DirtyChunkCount is the number of prematurely flushed chunks in the .tvd file.
+ * </ul>
* <li><a id="vector_data"></a>
* <p>A vector data file (extension <code>.tvd</code>). This file stores terms, frequencies,
* positions, offsets and payloads for every document. Upon writing a new segment, it
@@ -80,76 +85,78 @@ import org.apache.lucene.util.packed.PackedInts;
* FieldNumOffs >, < Flags >, < NumTerms >, < TermLengths >, <
* TermFreqs >, < Positions >, < StartOffsets >, < Lengths >, <
* PayloadLengths >, < TermAndPayloads >
- * <li>DocBase is the ID of the first doc of the chunk as a {@link DataOutput#writeVInt
- * VInt}
- * <li>ChunkDocs is the number of documents in the chunk
* <li>NumFields --> DocNumFields<sup>ChunkDocs</sup>
- * <li>DocNumFields is the number of fields for each doc, written as a {@link
- * DataOutput#writeVInt VInt} if ChunkDocs==1 and as a {@link PackedInts} array
- * otherwise
- * <li>FieldNums --> FieldNumDelta<sup>TotalDistincFields</sup>, a delta-encoded list of
- * the sorted unique field numbers present in the chunk
- * <li>FieldNumOffs --> FieldNumOff<sup>TotalFields</sup>, as a {@link PackedInts} array
- * <li>FieldNumOff is the offset of the field number in FieldNums
- * <li>TotalFields is the total number of fields (sum of the values of NumFields)
+ * <li>FieldNums --> FieldNumDelta<sup>TotalDistinctFields</sup>
* <li>Flags --> Bit < FieldFlags >
- * <li>Bit is a single bit which when true means that fields have the same options for every
- * document in the chunk
* <li>FieldFlags --> if Bit==1: Flag<sup>TotalDistinctFields</sup> else
* Flag<sup>TotalFields</sup>
- * <li>Flag: a 3-bits int where:
- * <ul>
- * <li>the first bit means that the field has positions
- * <li>the second bit means that the field has offsets
- * <li>the third bit means that the field has payloads
- * </ul>
* <li>NumTerms --> FieldNumTerms<sup>TotalFields</sup>
- * <li>FieldNumTerms: the number of terms for each field, using {@link BlockPackedWriter
- * blocks of 64 packed ints}
* <li>TermLengths --> PrefixLength<sup>TotalTerms</sup>
* SuffixLength<sup>TotalTerms</sup>
- * <li>TotalTerms: total number of terms (sum of NumTerms)
- * <li>PrefixLength: 0 for the first term of a field, the common prefix with the previous
- * term otherwise using {@link BlockPackedWriter blocks of 64 packed ints}
- * <li>SuffixLength: length of the term minus PrefixLength for every term using {@link
- * BlockPackedWriter blocks of 64 packed ints}
* <li>TermFreqs --> TermFreqMinus1<sup>TotalTerms</sup>
- * <li>TermFreqMinus1: (frequency - 1) for each term using {@link BlockPackedWriter blocks
- * of 64 packed ints}
* <li>Positions --> PositionDelta<sup>TotalPositions</sup>
- * <li>TotalPositions is the sum of frequencies of terms of all fields that have positions
- * <li>PositionDelta: the absolute position for the first position of a term, and the
- * difference with the previous positions for following positions using {@link
- * BlockPackedWriter blocks of 64 packed ints}
* <li>StartOffsets --> (AvgCharsPerTerm<sup>TotalDistinctFields</sup>)
* StartOffsetDelta<sup>TotalOffsets</sup>
- * <li>TotalOffsets is the sum of frequencies of terms of all fields that have offsets
- * <li>AvgCharsPerTerm: average number of chars per term, encoded as a float on 4 bytes.
- * They are not present if no field has both positions and offsets enabled.
- * <li>StartOffsetDelta: (startOffset - previousStartOffset - AvgCharsPerTerm *
- * PositionDelta). previousStartOffset is 0 for the first offset and AvgCharsPerTerm is
- * 0 if the field has no positions using {@link BlockPackedWriter blocks of 64 packed
- * ints}
* <li>Lengths --> LengthMinusTermLength<sup>TotalOffsets</sup>
- * <li>LengthMinusTermLength: (endOffset - startOffset - termLength) using {@link
- * BlockPackedWriter blocks of 64 packed ints}
* <li>PayloadLengths --> PayloadLength<sup>TotalPayloads</sup>
- * <li>TotalPayloads is the sum of frequencies of terms of all fields that have payloads
- * <li>PayloadLength is the payload length encoded using {@link BlockPackedWriter blocks of
- * 64 packed ints}
* <li>TermAndPayloads --> LZ4-compressed representation of < FieldTermsAndPayLoads
* ><sup>TotalFields</sup>
* <li>FieldTermsAndPayLoads --> Terms (Payloads)
- * <li>Terms: term bytes
- * <li>Payloads: payload bytes (if the field has payloads)
+ * <li>DocBase, ChunkDocs, DocNumFields (with ChunkDocs==1) --> {@link
+ * DataOutput#writeVInt VInt}
+ * <li>AvgCharsPerTerm --> {@link DataOutput#writeInt Int}
+ * <li>DocNumFields (with ChunkDocs>1), FieldNumOffs --> {@link PackedInts} array
+ * <li>FieldNumTerms, PrefixLength, SuffixLength, TermFreqMinus1, PositionDelta,
+ * StartOffsetDelta, LengthMinusTermLength, PayloadLength --> {@link
+ * BlockPackedWriter blocks of 64 packed ints}
* <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
* </ul>
+ * <p>Notes:
+ * <ul>
+ * <li>DocBase is the ID of the first doc of the chunk.
+ * <li>ChunkDocs is the number of documents in the chunk.
+ * <li>DocNumFields is the number of fields for each doc.
+ * <li>FieldNums is a delta-encoded list of the sorted unique field numbers present in the
+ * chunk.
+ * <li>FieldNumOffs is the array of FieldNumOff; array size is the total number of fields in
+ * the chunk.
+ * <li>FieldNumOff is the offset of the field number in FieldNums.
+ * <li>TotalFields is the total number of fields (sum of the values of NumFields).
+ * <li>Bit in Flags is a single bit which when true means that fields have the same options
+ * for every document in the chunk.
+ * <li>Flag: a 3-bit int where:
+ * <ul>
+ * <li>the first bit means that the field has positions
+ * <li>the second bit means that the field has offsets
+ * <li>the third bit means that the field has payloads
+ * </ul>
+ * <li>FieldNumTerms is the number of terms for each field.
+ * <li>TotalTerms is the total number of terms (sum of NumTerms).
+ * <li>PrefixLength is 0 for the first term of a field, the common prefix with the previous
+ * term otherwise.
+ * <li>SuffixLength is the length of the term minus PrefixLength for every term.
+ * <li>TermFreqMinus1 is (frequency - 1) for each term.
+ * <li>TotalPositions is the sum of frequencies of terms of all fields that have positions.
+ * <li>PositionDelta is the absolute position for the first position of a term, and the
+ * difference with the previous positions for following positions.
+ * <li>TotalOffsets is the sum of frequencies of terms of all fields that have offsets.
+ * <li>AvgCharsPerTerm is the average number of chars per term, encoded as a float on 4
+ * bytes. They are not present if no field has both positions and offsets enabled.
+ * <li>StartOffsetDelta is (startOffset - previousStartOffset - AvgCharsPerTerm *
+ * PositionDelta). previousStartOffset is 0 for the first offset and AvgCharsPerTerm is
+ * 0 if the field has no positions.
+ * <li>LengthMinusTermLength is (endOffset - startOffset - termLength).
+ * <li>TotalPayloads is the sum of frequencies of terms of all fields that have payloads.
+ * <li>PayloadLength is the payload length.
+ * <li>Terms is term bytes.
+ * <li>Payloads is payload bytes (if the field has payloads).
+ * </ul>
* <li><a id="vector_index"></a>
* <p>An index file (extension <code>.tvx</code>).
* <ul>
* <li>VectorIndex (.tvx) --> <Header>, <ChunkIndex>, Footer
* <li>Header --> {@link CodecUtil#writeIndexHeader IndexHeader}
- * <li>ChunkIndex: See {@link FieldsIndexWriter}
+ * <li>ChunkIndex --> {@link FieldsIndexWriter}
* <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
* </ul>
* </ol>
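
For readers of the notes in the .tvd section above, the following is a minimal Java sketch of three of the described quantities: the 3-bit Flag, the PrefixLength/SuffixLength pair, and StartOffsetDelta. It is not Lucene code. The class and method names are hypothetical, the bit order (first bit = least significant) is an assumption, and truncating AvgCharsPerTerm * PositionDelta to an int is also an assumption made only so the result fits in a packed int.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative sketch only; all names here are hypothetical, not Lucene APIs. */
final class TermVectorNotesSketch {

  // Assumption: "first bit" in the notes means the least significant bit.
  static final int POSITIONS = 0x01;
  static final int OFFSETS = 0x02;
  static final int PAYLOADS = 0x04;

  static boolean hasPositions(int flag) { return (flag & POSITIONS) != 0; }
  static boolean hasOffsets(int flag) { return (flag & OFFSETS) != 0; }
  static boolean hasPayloads(int flag) { return (flag & PAYLOADS) != 0; }

  /** PrefixLength: 0 for the first term, else the prefix shared with the previous term. */
  static int[] prefixLengths(byte[][] terms) {
    int[] out = new int[terms.length];
    for (int i = 1; i < terms.length; i++) {
      byte[] prev = terms[i - 1];
      byte[] cur = terms[i];
      int max = Math.min(prev.length, cur.length);
      int shared = 0;
      while (shared < max && prev[shared] == cur[shared]) {
        shared++;
      }
      out[i] = shared;
    }
    return out;
  }

  /** SuffixLength: length of the term minus its PrefixLength. */
  static int[] suffixLengths(byte[][] terms, int[] prefixLengths) {
    int[] out = new int[terms.length];
    for (int i = 0; i < terms.length; i++) {
      out[i] = terms[i].length - prefixLengths[i];
    }
    return out;
  }

  /**
   * StartOffsetDelta = startOffset - previousStartOffset - AvgCharsPerTerm * PositionDelta.
   * Pass avgCharsPerTerm = 0 for a field without positions.
   * Assumption: the float product is truncated toward zero.
   */
  static int startOffsetDelta(
      int startOffset, int previousStartOffset, float avgCharsPerTerm, int positionDelta) {
    return startOffset - previousStartOffset - (int) (avgCharsPerTerm * positionDelta);
  }

  /** Helper: encode terms as UTF-8 bytes, in the order given. */
  static byte[][] utf8(String... terms) {
    byte[][] out = new byte[terms.length][];
    for (int i = 0; i < terms.length; i++) {
      out[i] = terms[i].getBytes(StandardCharsets.UTF_8);
    }
    return out;
  }

  public static void main(String[] args) {
    byte[][] terms = utf8("apple", "apply", "banana");
    int[] prefix = prefixLengths(terms);
    int[] suffix = suffixLengths(terms, prefix);
    System.out.println("PrefixLength: " + Arrays.toString(prefix));
    System.out.println("SuffixLength: " + Arrays.toString(suffix));
    System.out.println("StartOffsetDelta: " + startOffsetDelta(12, 5, 5.0f, 1));
    System.out.println("has offsets: " + hasOffsets(POSITIONS | PAYLOADS));
  }
}
```

When the average term length predicts the offsets well, StartOffsetDelta stays close to zero, which is presumably why the format subtracts the AvgCharsPerTerm estimate before writing the value into blocks of packed ints.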