Posted to commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2013/04/24 23:00:07 UTC
[Solr Wiki] Trivial Update of "DocValues" by AndyLester

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocValues" page has been changed by AndyLester:
http://wiki.apache.org/solr/DocValues?action=diff&rev1=1&rev2=2

Comment:
Teeny typos

  <<TableOfContents>>
  
  = Introduction =
- 
- With a search engine you typically build an inverted index ({{{indexed="true"}}}) for a field: where values point to documents. DocValues is a way to build a forward index ({{{docValues="true"}}}) so that documents point to values. 
+ With a search engine you typically build an inverted index ({{{indexed="true"}}}) for a field: where values point to documents. DocValues is a way to build a forward index ({{{docValues="true"}}}) so that documents point to values.
  
   1. What docvalues are:
-     * NRT-compatible: These are per-segment datastructures built at index-time and designed to be efficient for the use case where data is changing rapidly. 
+   * NRT-compatible: These are per-segment datastructures built at index-time and designed to be efficient for the use case where data is changing rapidly.
-     * Basic query/filter support: You can do basic term, range, etc queries on docvalues fields without also indexing them, but these are constant-score only and typically slower. If you care about performance and scoring, index the field too.
+   * Basic query/filter support: You can do basic term, range, etc queries on docvalues fields without also indexing them, but these are constant-score only and typically slower. If you care about performance and scoring, index the field too.
-     * Better compression than fieldcache: Docvalues fields compress better than fieldcache, and "insanity" is impossible.
+   * Better compression than fieldcache: Docvalues fields compress better than fieldcache, and "insanity" is impossible.
-     * Able to store data outside of heap memory: You can specify a different docValuesFormat on the fieldType ({{{docValuesFormat="Disk"}}}) to only load minimal data on the heap, keeping other data structures on disk.
+   * Able to store data outside of heap memory: You can specify a different docValuesFormat on the fieldType ({{{docValuesFormat="Disk"}}}) to only load minimal data on the heap, keeping other data structures on disk.
   1. What docvalues are not:
-     * Not a replacement for stored fields: These are unrelated to stored fields in every way and instead datastructures for search (sort/facet/group/join/scoring).
+   * Not a replacement for stored fields: These are unrelated to stored fields in every way and instead datastructures for search (sort/facet/group/join/scoring).
-     * Not a huge improvement for a static index: If you have a completely static index, docvalues won't seem very interesting to you. On the other hand if you are fighting the fieldcache, read on.
+   * Not a huge improvement for a static index: If you have a completely static index, docvalues won't seem very interesting to you. On the other hand if you are fighting the fieldcache, read on.
-     * Not for the risk-adverse: The integration with solr is very new and probably still has some exciting bugs!
+   * Not for the risk-averse: The integration with Solr is very new and probably still has some exciting bugs!
  
  = Lucene's DocValues types =
- 
  Lucene has four underlying types that a docvalues field can have. Currently Solr uses three of these:
  
-  1. NUMERIC: a single-valued per-document numeric type. This is like having a large long[] array for the whole index, though the data is compressed based upon the values that are actually used. 
+  1. NUMERIC: a single-valued per-document numeric type. This is like having a large long[] array for the whole index, though the data is compressed based upon the values that are actually used.
- 
-     For example, consider 3 documents with these values:
+   . For example, consider 3 documents with these values:
-     {{{
+   {{{
         doc[0] = 1005
         doc[1] = 1006
         doc[2] = 1005
-     }}}
+ }}}
-     In this example the field would use around 1 bit per document, since that is all that is needed.
+   In this example the field would use around 1 bit per document, since that is all that is needed.
  
   1. SORTED: a single-valued per-document string type. This is like having a large String[] array for the whole index, but with an additional level of indirection. Each unique value is assigned a term number that represents its ordinal value. So each document really stores a compressed integer, and separately there is a "dictionary" mapping these term numbers back to term values.
-     
-     For example, consider 3 documents with these values:
+   . For example, consider 3 documents with these values:
-     {{{
+   {{{
         doc[0] = "aardvark"
         doc[1] = "beaver"
         doc[2] = "aardvark"
+ }}}
-     }}}
- 
-     Value "aardvark" will be assigned ordinal 0, and "beaver" 1, creating these two data structures:
+   Value "aardvark" will be assigned ordinal 0, and "beaver" 1, creating these two data structures:
- 
-     {{{
+   {{{
         doc[0] = 0
         doc[1] = 1
         doc[2] = 0
  
         term[0] = "aardvark"
         term[1] = "beaver"
-     }}}
+ }}}
  
   1. SORTED_SET: a multi-valued per-document string type. It's similar to SORTED, except each document has a "set" of values (in increasing sorted order). So it intentionally discards duplicate values (frequency) within a document and loses order within the document.
- 
-     For example, consider 3 documents with these values:
+   . For example, consider 3 documents with these values:
-     {{{
+   {{{
         doc[0] = "cat", "aardvark", "beaver", "aardvark"
-        doc[1] = 
+        doc[1] =
         doc[2] = "cat"
+ }}}
-     }}}
- 
-     Value "aardvark" will be assigned ordinal 0, "beaver" 1, and "cat" 2, creating these two data structures:
+   Value "aardvark" will be assigned ordinal 0, "beaver" 1, and "cat" 2, creating these two data structures:
-     {{{
+   {{{
         doc[0] = [0, 1, 2]
         doc[1] = []
         doc[2] = [2]
@@ -70, +62 @@

         term[0] = "aardvark"
         term[1] = "beaver"
         term[2] = "cat"
-     }}}
+ }}}
  
   1. BINARY: a single-valued per-document byte[] array. This can be used for encoding custom per-document datastructures.
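  The "around 1 bit per document" figure in the NUMERIC example above can be reproduced with a small calculation. The following is a minimal sketch, assuming each value is stored as an offset from the smallest value in the field and then bit-packed. The class and method names are hypothetical, and this is not Lucene's actual encoder, which may choose other strategies.

```java
class NumericBitsSketch {
    // Bits needed per document if every value is stored as (value - min).
    // Assumption for illustration only; the real docvalues format decides its own encoding.
    static int bitsPerValue(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        long range = max - min; // largest offset that must be representable
        // Number of significant bits in the range, with a floor of 1 bit.
        return Math.max(1, 64 - Long.numberOfLeadingZeros(range));
    }

    public static void main(String[] args) {
        long[] values = {1005, 1006, 1005}; // the three documents from the example
        System.out.println(bitsPerValue(values)); // prints 1
    }
}
```

  The offsets here are 0, 1 and 0, so a single bit per document is enough, which matches the estimate given for the NUMERIC example.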
  
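  The ordinal mapping described for SORTED and SORTED_SET can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, terms are ordered with String's natural ordering (Lucene orders terms by their bytes), and nothing here uses the Lucene API.

```java
import java.util.Arrays;
import java.util.TreeSet;

class OrdinalSketch {
    // Collect the unique terms in sorted order; a term's array index is its ordinal.
    static String[] buildDictionary(String[][] docs) {
        TreeSet<String> unique = new TreeSet<>();
        for (String[] doc : docs) {
            unique.addAll(Arrays.asList(doc));
        }
        return unique.toArray(new String[0]);
    }

    // Replace each document's values with a sorted, duplicate-free array of ordinals.
    // A SORTED field is the special case where every document has exactly one value.
    static int[][] toOrdinals(String[][] docs, String[] dictionary) {
        int[][] ords = new int[docs.length][];
        for (int i = 0; i < docs.length; i++) {
            TreeSet<Integer> set = new TreeSet<>();
            for (String value : docs[i]) {
                set.add(Arrays.binarySearch(dictionary, value));
            }
            ords[i] = set.stream().mapToInt(Integer::intValue).toArray();
        }
        return ords;
    }

    public static void main(String[] args) {
        // The SORTED_SET example: doc[1] has no values, doc[0] repeats "aardvark".
        String[][] docs = {
            {"cat", "aardvark", "beaver", "aardvark"},
            {},
            {"cat"}
        };
        String[] dictionary = buildDictionary(docs);
        int[][] ords = toOrdinals(docs, dictionary);
        System.out.println(Arrays.toString(dictionary)); // [aardvark, beaver, cat]
        System.out.println(Arrays.deepToString(ords));   // [[0, 1, 2], [], [2]]
    }
}
```

  The repeated "aardvark" in doc[0] collapses to a single ordinal and the original value order is lost, which is the behaviour the SORTED_SET description calls out.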
  = Solr's DocValues types =
- 
   1. StrField (multiValued=false): This uses the SORTED type behind the scenes. This is a good choice for a sort field.
-     Example:
+   . Example:
-     {{{<field name="manu_exact" type="str" indexed="false" stored="false" docValues="true" default=""/>}}}
+   {{{<field name="manu_exact" type="str" indexed="false" stored="false" docValues="true" default=""/>}}}
   1. StrField (multiValued=true): This uses the SORTED_SET type behind the scenes.
-     Example:
+   . Example:
-     {{{<field name="productCategories" type="str" indexed="false" stored="false" multiValued="true" docValues="true"/>}}}
+   {{{<field name="productCategories" type="str" indexed="false" stored="false" multiValued="true" docValues="true"/>}}}
   1. TrieXXXField (multiValued=false): This uses the NUMERIC type behind the scenes. This is a good choice for a sort field or a scoring factor used in function queries.
-     Example:
+   . Example:
-     {{{<field name="popularity" type="int" indexed="false" stored="false" docValues="true" default="0"/>}}}
+   {{{<field name="popularity" type="int" indexed="false" stored="false" docValues="true" default="0"/>}}}
   1. TrieXXXField (multiValued=true): This uses the SORTED_SET type behind the scenes, encoding the numeric values such that ordinals reflect numeric sort order.
-     Example:
+   . Example:
-     {{{<field name="specialCodes" type="int" indexed="false" stored="false" multiValued="true" docValues="true"/>}}}
+   {{{<field name="specialCodes" type="int" indexed="false" stored="false" multiValued="true" docValues="true"/>}}}
  
  = Specifying a different Codec implementation =
+  . You can specify the {{{docValuesFormat}}} attribute on the fieldType to control the underlying implementation. Note that only the default implementation is supported by future versions of Lucene: if you try an alternative format, you may need to switch back to the default and rewrite your index (e.g. forceMerge) before upgrading.
  
-   You can specify the {{{docValuesFormat}}} attribute on the fieldType to control the underlying implementation. Note that only the default implementation is supported
-   by future version of Lucene: if you try an alternative format, you may need to switch back to the default and rewrite your index (e.g. forceMerge) before upgrading.
- 
-   * {{{docValuesFormat="Lucene42"}}}: This is the default, which loads everything into heap memory.
+  * {{{docValuesFormat="Lucene42"}}}: This is the default, which loads everything into heap memory.
-   * {{{docValuesFormat="Disk"}}}: This implementation has a different layout, to try to keep most data on disk but with reasonable performance.
+  * {{{docValuesFormat="Disk"}}}: This implementation has a different layout, to try to keep most data on disk but with reasonable performance.
-   * {{{docValuesFormat="SimpleText"}}}: Plain-text, slow, and not for production.
+  * {{{docValuesFormat="SimpleText"}}}: Plain-text, slow, and not for production.
  
  Example of altering the codec implementation:
+ 
  {{{
    <fieldType name="string_disk" class="solr.StrField" docValuesFormat="Disk" />
  }}}