Posted to commits@lucene.apache.org by rm...@apache.org on 2012/08/20 19:29:03 UTC

svn commit: r1375119 - /lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/index/package.html

Author: rmuir
Date: Mon Aug 20 17:29:02 2012
New Revision: 1375119

URL: http://svn.apache.org/viewvc?rev=1375119&view=rev
Log:
first start at some package javadoc for o.a.l.index

Modified:
    lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/index/package.html

Modified: lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/index/package.html
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/index/package.html?rev=1375119&r1=1375118&r2=1375119&view=diff
==============================================================================
--- lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/index/package.html (original)
+++ lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/index/package.html Mon Aug 20 17:29:02 2012
@@ -21,6 +21,244 @@
 </head>
 <body>
 Code to maintain and access indices.
-<!-- TODO: add a BASIC overview here, including code examples of using postings apis -->
+<!-- TODO: add IndexWriter, IndexWriterConfig, DocValues, etc etc -->
+<h2>Table Of Contents</h2>
+<p>
+    <ol>
+        <li><a href="#postings">Postings APIs</a>
+            <ul>
+                <li><a href="#fields">Fields</a></li>
+                <li><a href="#terms">Terms</a></li>
+                <li><a href="#documents">Documents</a></li>
+                <li><a href="#positions">Positions</a></li>
+            </ul>
+        </li>
+        <li><a href="#stats">Index Statistics</a>
+            <ul>
+                <li><a href="#termstats">Term-level</a></li>
+                <li><a href="#fieldstats">Field-level</a></li>
+                <li><a href="#segmentstats">Segment-level</a></li>
+                <li><a href="#documentstats">Document-level</a></li>
+            </ul>
+        </li>
+    </ol>
+</p>
+<a name="postings"></a>
+<h2>Postings APIs</h2>
+<a name="fields"></a>
+<h4>
+    Fields
+</h4>
+<p>
+{@link org.apache.lucene.index.Fields} is the initial entry point into the
+postings APIs. It can be obtained in several ways:
+<pre class="prettyprint">
+// access indexed fields for an index segment
+Fields fields = reader.fields();
+// access term vector fields for a specified document
+Fields fields = reader.getTermVectors(docid);
+</pre>
+Fields implements Java's Iterable interface, so it's easy to enumerate the
+list of fields:
+<pre class="prettyprint">
+// enumerate list of fields
+for (String field : fields) {
+  // access the terms for this field
+  Terms terms = fields.terms(field);
+}
+</pre>
+</p>
+<a name="terms"></a>
+<h4>
+    Terms
+</h4>
+<p>
+{@link org.apache.lucene.index.Terms} represents the collection of terms
+within a field. It exposes some metadata and <a href="#fieldstats">statistics</a>,
+along with an API for enumeration.
+<pre class="prettyprint">
+// metadata about the field
+System.out.println("positions? " + terms.hasPositions());
+System.out.println("offsets? " + terms.hasOffsets());
+System.out.println("payloads? " + terms.hasPayloads());
+// iterate through terms
+TermsEnum termsEnum = terms.iterator(null);
+BytesRef term = null;
+while ((term = termsEnum.next()) != null) {
+  doSomethingWith(termsEnum.term());
+}
+</pre>
+{@link org.apache.lucene.index.TermsEnum} provides an iterator over the list
+of terms within a field, some <a href="#termstats">statistics</a> about the term,
+and methods to access the term's <a href="#documents">documents</a> and
+<a href="#positions">positions</a>.
+<pre class="prettyprint">
+// seek to a specific term
+boolean found = termsEnum.seekExact(new BytesRef("foobar"), true);
+if (found) {
+  // get the document frequency
+  System.out.println(termsEnum.docFreq());
+  // enumerate through documents
+  DocsEnum docs = termsEnum.docs(null, null);
+  // enumerate through documents and positions
+  DocsAndPositionsEnum docsAndPositions = termsEnum.docsAndPositions(null, null);
+}
+</pre>
+</p>
+<a name="documents"></a>
+<h4>
+    Documents
+</h4>
+<p>
+{@link org.apache.lucene.index.DocsEnum} is an extension of
+{@link org.apache.lucene.search.DocIdSetIterator} that iterates over the list of
+documents for a term, along with the term frequency within each document.
+<pre class="prettyprint">
+int docid;
+while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+  System.out.println(docid);
+  System.out.println(docsEnum.freq());
+}
+</pre>
+</p>
+<a name="positions"></a>
+<h4>
+    Positions
+</h4>
+<p>
+{@link org.apache.lucene.index.DocsAndPositionsEnum} is an extension of
+{@link org.apache.lucene.index.DocsEnum} that additionally allows iteration
+over the positions at which a term occurred within the document, along with any
+additional per-position information (offsets and payload).
+<pre class="prettyprint">
+int docid;
+while ((docid = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+  System.out.println(docid);
+  int freq = docsAndPositionsEnum.freq();
+  for (int i = 0; i < freq; i++) {
+    System.out.println(docsAndPositionsEnum.nextPosition());
+    System.out.println(docsAndPositionsEnum.startOffset());
+    System.out.println(docsAndPositionsEnum.endOffset());
+    System.out.println(docsAndPositionsEnum.getPayload());
+  }
+}
+</pre>
+</p>
+<a name="stats"></a>
+<h2>Index Statistics</h2>
+<a name="termstats"></a>
+<h4>
+    Term statistics
+</h4>
+<p>
+    <ul>
+       <li>{@link org.apache.lucene.index.TermsEnum#docFreq}: Returns the number of 
+           documents that contain at least one occurrence of the term. This statistic 
+           is always available for an indexed term. Note that it will also count
+           deleted documents; when segments are merged, the statistic is updated as
+           those deleted documents are merged away.
+       <li>{@link org.apache.lucene.index.TermsEnum#totalTermFreq}: Returns the number 
+           of occurrences of this term across all documents. Note that this statistic 
+           is unavailable (returns <code>-1</code>) if term frequencies were omitted 
+           from the index 
+           ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) 
+           for the field. Like docFreq(), it will also count occurrences that appear in 
+           deleted documents.
+    </ul>
+</p>
+<a name="fieldstats"></a>
+<h4>
+    Field statistics
+</h4>
+<p>
+    <ul>
+       <li>{@link org.apache.lucene.index.Terms#size}: Returns the number of 
+           unique terms in the field. This statistic may be unavailable 
+           (returns <code>-1</code>) for some Terms implementations such as
+           {@link org.apache.lucene.index.MultiTerms}, where it cannot be efficiently
+           computed.  Note that this count also includes terms that appear only
+           in deleted documents: when segments are merged such terms are also merged
+           away and the statistic is then updated.
+       <li>{@link org.apache.lucene.index.Terms#getDocCount}: Returns the number of
+           documents that contain at least one occurrence of any term for this field.
+           This can be thought of as a field-level docFreq(). Like docFreq(), it will
+           also count deleted documents.
+       <li>{@link org.apache.lucene.index.Terms#getSumDocFreq}: Returns the number of
+           postings (term-document mappings in the inverted index) for the field. This
+           can be thought of as the sum of {@link org.apache.lucene.index.TermsEnum#docFreq}
+           across all terms in the field, and like docFreq() it will also count postings
+           that appear in deleted documents.
+       <li>{@link org.apache.lucene.index.Terms#getSumTotalTermFreq}: Returns the number
+           of tokens for the field. This can be thought of as the sum of 
+           {@link org.apache.lucene.index.TermsEnum#totalTermFreq} across all terms in the
+           field, and like totalTermFreq() it will also count occurrences that appear in
+           deleted documents, and will be unavailable (returns <code>-1</code>) if term 
+           frequencies were omitted from the index 
+           ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) 
+           for the field.
+    </ul>
+</p>
+<a name="segmentstats"></a>
+<h4>
+    Segment statistics
+</h4>
+<p>
+    <ul>
+       <li>{@link org.apache.lucene.index.IndexReader#maxDoc}: Returns the number of 
+           documents (including deleted documents) in the index. 
+       <li>{@link org.apache.lucene.index.IndexReader#numDocs}: Returns the number 
+           of live documents (excluding deleted documents) in the index.
+       <li>{@link org.apache.lucene.index.IndexReader#numDeletedDocs}: Returns the
+           number of deleted documents in the index.
+       <li>{@link org.apache.lucene.index.Fields#size}: Returns the number of indexed
+           fields.
+       <li>{@link org.apache.lucene.index.Fields#getUniqueTermCount}: Returns the number 
+           of indexed terms, the sum of {@link org.apache.lucene.index.Terms#size}
+           across all fields.
+    </ul>
+</p>
+<a name="documentstats"></a>
+<h4>
+    Document statistics
+</h4>
+<p>
+Document statistics are available during the indexing process for an indexed field: typically
+a {@link org.apache.lucene.search.similarities.Similarity} implementation will store some
+of these values (possibly in a lossy way) into the normalization value for the document in
+its {@link org.apache.lucene.search.similarities.Similarity#computeNorm} method.
+</p>
+<p>
+    <ul>
+       <li>{@link org.apache.lucene.index.FieldInvertState#getLength}: Returns the number of 
+           tokens for this field in the document. Note that this is just the number
+           of times that {@link org.apache.lucene.analysis.TokenStream#incrementToken} returned
+           true, and is unrelated to the values in 
+           {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}.
+       <li>{@link org.apache.lucene.index.FieldInvertState#getNumOverlap}: Returns the number
+           of tokens for this field in the document that had a position increment of zero. This
+           can be used to compute a document length that discounts artificial tokens
+           such as synonyms.
+       <li>{@link org.apache.lucene.index.FieldInvertState#getPosition}: Returns the accumulated
+           position value for this field in the document: computed from the values of
+           {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute} and including
+           {@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap}s across multivalued
+           fields.
+       <li>{@link org.apache.lucene.index.FieldInvertState#getOffset}: Returns the total
+           character offset value for this field in the document: computed from the values of
+           {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute} returned by 
+           {@link org.apache.lucene.analysis.TokenStream#end}, and including
+           {@link org.apache.lucene.analysis.Analyzer#getOffsetGap}s across multivalued
+           fields.
+       <li>{@link org.apache.lucene.index.FieldInvertState#getUniqueTermCount}: Returns the number
+           of unique terms encountered for this field in the document.
+       <li>{@link org.apache.lucene.index.FieldInvertState#getMaxTermFrequency}: Returns the maximum
+           frequency across all unique terms encountered for this field in the document. 
+    </ul>
+</p>
+<p>
+Additional user-supplied statistics can be added to the document as DocValues fields and
+accessed via {@link org.apache.lucene.index.AtomicReader#docValues}.
+</p>
+
 </body>
 </html>
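
The term-level and field-level statistics documented in the patch above can be
illustrated with a minimal sketch. It is not part of this commit and does not
use Lucene's API: ToyFieldStats and all of its methods are hypothetical names,
and the sketch assumes a single field, whitespace tokenization, and no deleted
documents. Each method notes the Lucene statistic it is loosely modeled on.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy single-field inverted index (hypothetical; not a Lucene class). */
final class ToyFieldStats {
    // term -> (docid -> positions of the term within that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Indexes one document given as whitespace-separated tokens. */
    void addDocument(int docid, String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return; // a document with no tokens contributes no postings
        }
        String[] tokens = trimmed.split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docid, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Documents containing the term at least once (modeled on TermsEnum#docFreq). */
    int docFreq(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    /** Occurrences of the term across all documents (modeled on TermsEnum#totalTermFreq). */
    long totalTermFreq(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        long total = 0;
        if (docs != null) {
            for (List<Integer> positions : docs.values()) {
                total += positions.size();
            }
        }
        return total;
    }

    /** Number of unique terms in the field (modeled on Terms#size). */
    long size() {
        return postings.size();
    }

    /** Documents with at least one term in the field (modeled on Terms#getDocCount). */
    int docCount() {
        Set<Integer> docs = new HashSet<>();
        for (Map<Integer, List<Integer>> perTerm : postings.values()) {
            docs.addAll(perTerm.keySet());
        }
        return docs.size();
    }

    /** Sum of docFreq over all terms, i.e. the number of postings (modeled on Terms#getSumDocFreq). */
    long sumDocFreq() {
        long sum = 0;
        for (String term : postings.keySet()) {
            sum += docFreq(term);
        }
        return sum;
    }

    /** Sum of totalTermFreq over all terms, i.e. the number of tokens (modeled on Terms#getSumTotalTermFreq). */
    long sumTotalTermFreq() {
        long sum = 0;
        for (String term : postings.keySet()) {
            sum += totalTermFreq(term);
        }
        return sum;
    }
}
```

With three documents "a b a", "b c" and "a", the term "a" has a docFreq of 2
but a totalTermFreq of 3, and the field has 5 postings (sumDocFreq) built from
6 tokens (sumTotalTermFreq). A real index would also count deleted documents
in these numbers until a merge, which this sketch does not model.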