Posted to java-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/15 12:14:32 UTC

[Lucene-java Wiki] Update of "FlexibleIndexing" by MikeMcCandless

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The "FlexibleIndexing" page has been changed by MikeMcCandless.
The comment on this change is: Update to flex indexing as committed to 4.0-dev.
http://wiki.apache.org/lucene-java/FlexibleIndexing?action=diff&rev1=8&rev2=9

--------------------------------------------------

  == Related Information ==
  ConversationsBetweenDougMarvinAndGrant
  
- == Further steps towards flexible indexing ==
+ == Flexible indexing implemented in 4.0-dev (trunk) ==
  
- This section describes the high-level design of [[https://issues.apache.org/jira/browse/LUCENE-1458|LUCENE-1458]].
+ This section describes the changes committed to 4.0-dev under [[https://issues.apache.org/jira/browse/LUCENE-1458|LUCENE-1458]] and [[https://issues.apache.org/jira/browse/LUCENE-2111|LUCENE-2111]].
  
- The top goal is to make Lucene extensible, even at its lowest levels, on what it records into the index, and how.  Your app should be able to easily store new things into the index, or, alter how existing things (doc IDs, positions, payloads, etc.) are encoded.
+ The overall goal is to make Lucene extensible, even at its lowest levels, in what it records into the index and how.  Your app should be able to easily store new things into the index, or alter how existing things (doc IDs, positions, payloads, etc.) are encoded.  To accomplish this, a {{{Codec}}} class was introduced.  The {{{Codec}}} currently covers the postings API (fields, terms, docs, positions+payloads enumerators); other elements in the index (norms, deleted docs, stored docs/fields, term vectors) are not covered.
  
- While storing new things into the index is possible with this change, it hasn't really been tested yet.  I've been focusing so far on alternate ways to encode the "normal postings" (terms, doc, freq, pos, payload) that Lucene stores.
+ === Changes in how postings are consumed ===
  
+ The first big change in flexible indexing is the consumption of the postings enumerators APIs:
  
- === Major pieces ===
+   * A term is now an arbitrary {{{byte[]}}}, represented by a {{{BytesRef}}} (which references an offset + length "slice" into an existing {{{byte[]}}}).  By default a term is a UTF8-encoded character string, created during indexing, but your analysis chain can produce terms that are not UTF8 bytes.
  
-  1. New postings enumeration API
+   * Fields are separately enumerated (via {{{FieldsEnum}}}) from term text.  Consumers of the flex API no longer need to check {{{Term.field()}}} on each {{{.next()}}} call; instead, they obtain a {{{TermsEnum}}} for the specific field they need and iterate it until exhaustion.
  
+   * {{{TermsEnum}}} iterates and seeks to all terms (returned as {{{BytesRef}}}) in the index.  A {{{TermsEnum}}} is optionally able to seek to the ordinal (long) for a term, and to return the ordinal for the current term.  {{{SegmentReader}}} implements this but {{{MultiReader}}} does not, because working with ords across sub-readers is far too costly (it requires merging).
  
-  A new "4d" (four dimensional) enumeration API for reading postings data (FieldsEnum -> TermsEnum -> DocsEnum -> PositionsEnum).  A consumer can choose to only iterate over eg fields & terms (eg a MultiTermQuery), or over everything (eg SegmentMerger).  This replaces today's TermEnum/TermDocs/TermPositions.
+   * Deleted documents are no longer implicitly filtered by {{{DocsEnum}}} (previously {{{TermDocs}}}).  Instead, you provide an arbitrary {{{skipDocs}}} bit set ({{{Bits}}}) stating which documents should be skipped during enumeration.  For example, this could be used with a cached filter to enforce your own deletions.  {{{IndexReader.getDeletedDocs}}} returns a {{{Bits}}} for the current deleted docs of this reader.
  
+   * Seeking to a term is no longer done by the docs/positions enums; instead, you must use {{{TermsEnum.seek}}} and then {{{TermsEnum.docs}}} or {{{.docsAndPositions}}} to obtain the enumerator (there are also sugar APIs to accomplish this).  {{{TermsEnum}}}'s seek method has three return values: {{{FOUND}}} (the exact term matched), {{{NOT_FOUND}}} (another term matched) and {{{END}}} (you seek'd past the end of the enum).
  
-  These classes extend AttributeSource, so that an app could plug in its own attributes.  For example, payloads could [in theory] now be implemented externally to Lucene.
+   * Composite readers (currently {{{MultiReader}}} or {{{DirectoryReader}}}) are not able to provide these postings enumerators directly; instead, one must use the static methods on {{{MultiFields}}} to obtain the enumerators.
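The skipDocs idea described above can be illustrated with a minimal, self-contained sketch.  Note these are hypothetical toy classes written for this page, not the actual Lucene flex classes; only the names mirror the real API:

```java
// Toy stand-in for Lucene's Bits interface: a random-access bit set.
interface Bits {
    boolean get(int index);
}

// Toy docs enumerator over a sorted doc ID list.  Unlike the old TermDocs,
// it does NOT filter deleted docs itself; the caller supplies skipDocs.
class ToyDocsEnum {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docIDs;
    private final Bits skipDocs; // may be null: nothing is skipped
    private int upto = -1;

    ToyDocsEnum(int[] docIDs, Bits skipDocs) {
        this.docIDs = docIDs;
        this.skipDocs = skipDocs;
    }

    int nextDoc() {
        while (++upto < docIDs.length) {
            int doc = docIDs[upto];
            if (skipDocs == null || !skipDocs.get(doc)) {
                return doc; // doc survives the skip set
            }
        }
        return NO_MORE_DOCS;
    }
}

public class SkipDocsDemo {
    public static void main(String[] args) {
        int[] postings = {1, 4, 7, 9};
        // Pretend doc 4 is deleted (or filtered out by a cached filter).
        Bits deleted = index -> index == 4;
        ToyDocsEnum docs = new ToyDocsEnum(postings, deleted);
        StringBuilder sb = new StringBuilder();
        for (int doc = docs.nextDoc(); doc != ToyDocsEnum.NO_MORE_DOCS; doc = docs.nextDoc()) {
            sb.append(doc).append(' ');
        }
        System.out.println(sb.toString().trim()); // prints: 1 7 9
    }
}
```

Because the skip set is just an arbitrary {{{Bits}}}, the same mechanism serves deleted docs, cached filters, or any caller-defined exclusion.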
  
-  This API represents terms in RAM more efficiently, by 1) keeping them in UTF8 form (byte[] instead of char[]) which is more efficient for ASCII-only terms data and trie terms, and 2) allowing reuse of block byte[] with the TermRef class.  (Whereas Lucene today uses String field (interned) and String text for every Term instance).
+ === Codec/CodecProvider ===
  
-  One important API is TermsEnum.docs, which returns the DocsEnum for the current term.  That method now takes an arbitrary "skipDocs", of type Bits, a new interface with just the method {{{public boolean get(int index)}}}.  And, IndexReader.getDeletedDocs now returns the Bits.  The idea is to allow enumeration of the docs with a custom skip-list.  This will also make it easier to implement random-access filters (LUCENE-1536).
+ The second big change in flexible indexing is the {{{Codec}}} and {{{CodecProvider}}} APIs, which enable apps to plug in different implementations for writing and reading postings data in the index.  When you obtain an {{{IndexWriter}}} or {{{IndexReader}}}, you can optionally pass in a {{{CodecProvider}}}, which knows 1) which {{{Codec}}} should be used when writing a new segment, and 2) how to resolve a codec name ({{{String}}}) to a {{{Codec}}} instance when reading from the index.
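The two responsibilities of the provider can be sketched in a few lines of self-contained Java.  These are hypothetical stand-in classes for illustration only, not the real {{{Codec}}}/{{{CodecProvider}}}:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a Codec: identified by a name that gets recorded
// in the index so the right codec can be found again at read time.
abstract class ToyCodec {
    abstract String name();
}

// Toy stand-in for CodecProvider.
class ToyCodecProvider {
    private final Map<String, ToyCodec> codecs = new HashMap<>();
    private String defaultName;

    void register(ToyCodec codec) {
        codecs.put(codec.name(), codec);
        if (defaultName == null) {
            defaultName = codec.name(); // first registered codec is the default
        }
    }

    // 1) which codec to use when writing a new segment
    ToyCodec getWriter() {
        return codecs.get(defaultName);
    }

    // 2) resolve a codec name read from a segment back to a Codec
    ToyCodec lookup(String name) {
        ToyCodec c = codecs.get(name);
        if (c == null) {
            throw new IllegalArgumentException("unknown codec: " + name);
        }
        return c;
    }
}
```

The key design point is that the index stores only the codec's name per segment, so a single index can contain segments written by different codecs, as long as the provider can resolve every name at read time.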
  
-  2. Codec based pluggability for postings
+ The default codec is {{{StandardCodec}}}, whose format is similar to the pre-4.0 index format but introduces sizable improvements in how the terms index is stored.  In particular, the RAM required by the terms index when reading a segment has been substantially reduced.  The on-disk format of the .tis/.tii files is also slightly smaller.
  
-  Make the postings files (terms dict+index, freq/doc/pos/payload) writers and readers pluggable.  A new Codec class hides all details of how the 4d data is written.
+ There are some experimental core codecs:
  
-  All index format specifics have been moved out of oal.index.* and under oal.index.codecs.*.  For example there is no more TermInfo class.  SegmentReader is now given a Codec impl that knows how to decode the files into the 4d API.
+   * {{{PulsingCodec}}} stores rare terms directly into the terms dicts.  This is an excellent match for primary key fields (see [[http://chbits.blogspot.com/2010/06/lucenes-pulsingcodec-on-primary-key.html|here]] for details), and should also help even "normal" fields (so this [[https://issues.apache.org/jira/browse/LUCENE-2492|may become the default codec at some point]]).
  
-  Separately, there is a Codecs class that is responsible for providing 1) the default writer (when creating a new segment) and 2) lookup a given codec by its name (when reading segments previously written with different codecs).
+   * {{{SepCodec}}} and {{{IntBlockCodec}}} are not useful by themselves; rather, they serve as the base for block-based codecs.  They separately store the docs, freqs, positions and payloads data, allowing int block codecs to encode the docs, freqs and positions.
  
-  3. A new "standard" (default) Codec, with improved terms dict index
+ [[https://issues.apache.org/jira/browse/LUCENE-1410|LUCENE-1410]] has a prototype PforDelta codec, an int block codec using [[http://cis.poly.edu/cs912/indexcomp.pdf|PFOR-DELTA]] encoding.
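The core idea underlying such int block codecs is delta encoding: a sorted doc ID list is stored as gaps, which are small ints that compress well.  Here is a simplified, self-contained sketch of just the delta step (real PFOR-DELTA additionally bit-packs the gaps into fixed-width frames with patched exceptions for outliers):

```java
import java.util.Arrays;

public class DeltaCodingDemo {
    // Encode a sorted doc ID list as gaps (the first value is kept as-is,
    // i.e. as its gap from an implicit 0).
    static int[] encodeDeltas(int[] sortedDocIDs) {
        int[] gaps = new int[sortedDocIDs.length];
        int prev = 0;
        for (int i = 0; i < sortedDocIDs.length; i++) {
            gaps[i] = sortedDocIDs[i] - prev;
            prev = sortedDocIDs[i];
        }
        return gaps;
    }

    // Decode gaps back into absolute doc IDs by running a prefix sum.
    static int[] decodeDeltas(int[] gaps) {
        int[] docIDs = new int[gaps.length];
        int sum = 0;
        for (int i = 0; i < gaps.length; i++) {
            sum += gaps[i];
            docIDs[i] = sum;
        }
        return docIDs;
    }

    public static void main(String[] args) {
        int[] docIDs = {5, 8, 9, 20};
        int[] gaps = encodeDeltas(docIDs);
        System.out.println(Arrays.toString(gaps));               // [5, 3, 1, 11]
        System.out.println(Arrays.toString(decodeDeltas(gaps))); // [5, 8, 9, 20]
    }
}
```

Note that gaps are only guaranteed non-negative when the input is sorted, which is why codecs built on this idea assume strictly increasing doc IDs within a postings list.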
  
-  The "standard" codec implements Lucene's default Codec for writing new segment files.  The doc/freq/pos/payload format is nearly identical (except for a new header) to the format today, but the terms dict/index is quite a bit more efficient in that it requires much less RAM to load the terms index.
+ Apps can also create custom {{{Codec}}}s.  Please report back if you do!  All of these APIs are very new and need some good baking-in time.
  
-  4. Some other interesting codecs
- 
-  These are largely for testing, but some of them we will want to make available.  The pulsing codec inlines postings for low-frequency terms directly into the terms dict.  The pfordelta codec uses the PForDelta impl from [[https://issues.apache.org/jira/browse/LUCENE-1410|LUCENE-1410]] to encode doc, freq, pos into their own files using PForDelta.
- 
- 
- === Current status ===
- 
- All tests pass for all the codecs except pfordelta, which fails because it's unable to encode negative ints.  But, Lucene only does this due to the deprecated bug from [[https://issues.apache.org/jira/browse/LUCENE-1542|LUCENE-1542]].
- 
- There are still many "nocommits" in the code, and more tests are needed.
-