Posted to commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2013/04/24 23:00:07 UTC
[Solr Wiki] Trivial Update of "DocValues" by AndyLester
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "DocValues" page has been changed by AndyLester:
http://wiki.apache.org/solr/DocValues?action=diff&rev1=1&rev2=2
Comment:
Teeny typos
<<TableOfContents>>
= Introduction =
-
- With a search engine you typically build an inverted index ({{{indexed="true"}}}) for a field: where values point to documents. DocValues is a way to build a forward index ({{{docValues="true"}}}) so that documents point to values.
+ With a search engine you typically build an inverted index ({{{indexed="true"}}}) for a field: where values point to documents. DocValues is a way to build a forward index ({{{docValues="true"}}}) so that documents point to values.
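A minimal sketch of the two directions described above, using plain Python containers rather than anything from Lucene or Solr. The sample documents and the names `inverted` and `forward` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical sample data: document id -> field value
docs = {0: "aardvark", 1: "beaver", 2: "aardvark"}

# Inverted index: each value points to the documents that contain it
inverted = defaultdict(list)
for doc_id, value in sorted(docs.items()):
    inverted[value].append(doc_id)

# Forward index: each document points to its value
forward = [docs[doc_id] for doc_id in range(len(docs))]

print(dict(inverted))
print(forward)
```

Looking up which documents contain a term is a single access into `inverted`, while looking up the value of a given document (as sorting or faceting must do) is a single access into `forward`.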
1. What docvalues are:
- * NRT-compatible: These are per-segment datastructures built at index-time and designed to be efficient for the use case where data is changing rapidly.
+ * NRT-compatible: These are per-segment datastructures built at index-time and designed to be efficient for the use case where data is changing rapidly.
- * Basic query/filter support: You can do basic term, range, etc queries on docvalues fields without also indexing them, but these are constant-score only and typically slower. If you care about performance and scoring, index the field too.
+ * Basic query/filter support: You can do basic term, range, etc queries on docvalues fields without also indexing them, but these are constant-score only and typically slower. If you care about performance and scoring, index the field too.
- * Better compression than fieldcache: Docvalues fields compress better than fieldcache, and "insanity" is impossible.
+ * Better compression than fieldcache: Docvalues fields compress better than fieldcache, and "insanity" is impossible.
- * Able to store data outside of heap memory: You can specify a different docValuesFormat on the fieldType ({{{docValuesFormat="Disk"}}}) to only load minimal data on the heap, keeping other data structures on disk.
+ * Able to store data outside of heap memory: You can specify a different docValuesFormat on the fieldType ({{{docValuesFormat="Disk"}}}) to only load minimal data on the heap, keeping other data structures on disk.
1. What docvalues are not:
- * Not a replacement for stored fields: These are unrelated to stored fields in every way and are instead datastructures for search (sort/facet/group/join/scoring).
+ * Not a replacement for stored fields: These are unrelated to stored fields in every way and are instead datastructures for search (sort/facet/group/join/scoring).
- * Not a huge improvement for a static index: If you have a completely static index, docvalues won't seem very interesting to you. On the other hand if you are fighting the fieldcache, read on.
+ * Not a huge improvement for a static index: If you have a completely static index, docvalues won't seem very interesting to you. On the other hand if you are fighting the fieldcache, read on.
- * Not for the risk-adverse: The integration with solr is very new and probably still has some exciting bugs!
+ * Not for the risk-averse: The integration with Solr is very new and probably still has some exciting bugs!
= Lucene's DocValues types =
-
Lucene has four underlying types that a docvalues field can have. Currently Solr uses three of these:
- 1. NUMERIC: a single-valued per-document numeric type. This is like having a large long[] array for the whole index, though the data is compressed based upon the values that are actually used.
+ 1. NUMERIC: a single-valued per-document numeric type. This is like having a large long[] array for the whole index, though the data is compressed based upon the values that are actually used.
-
- For example, consider 3 documents with these values:
+ . For example, consider 3 documents with these values:
- {{{
+ {{{
doc[0] = 1005
doc[1] = 1006
doc[2] = 1005
- }}}
+ }}}
- In this example the field would use around 1 bit per document, since that is all that is needed.
+ In this example the field would use around 1 bit per document, since that is all that is needed.
1. SORTED: a single-valued per-document string type. This is like having a large String[] array for the whole index, but with an additional level of indirection. Each unique value is assigned a term number that represents its ordinal value. So each document really stores a compressed integer, and separately there is a "dictionary" mapping these term numbers back to term values.
-
- For example, consider 3 documents with these values:
+ . For example, consider 3 documents with these values:
- {{{
+ {{{
doc[0] = "aardvark"
doc[1] = "beaver"
doc[2] = "aardvark"
+ }}}
- }}}
-
- Value "aardvark" will be assigned ordinal 0, and "beaver" 1, creating these two data structures:
+ Value "aardvark" will be assigned ordinal 0, and "beaver" 1, creating these two data structures:
-
- {{{
+ {{{
doc[0] = 0
doc[1] = 1
doc[2] = 0
term[0] = "aardvark"
term[1] = "beaver"
- }}}
+ }}}
1. SORTED_SET: a multi-valued per-document string type. It's similar to SORTED, except each document has a "set" of values (in increasing sorted order). So it intentionally discards duplicate values (frequency) within a document and loses order within the document.
-
- For example, consider 3 documents with these values:
+ . For example, consider 3 documents with these values:
- {{{
+ {{{
doc[0] = "cat", "aardvark", "beaver", "aardvark"
- doc[1] =
+ doc[1] =
doc[2] = "cat"
+ }}}
- }}}
-
- Value "aardvark" will be assigned ordinal 0, "beaver" 1, and "cat" 2, creating these two data structures:
+ Value "aardvark" will be assigned ordinal 0, "beaver" 1, and "cat" 2, creating these two data structures:
- {{{
+ {{{
doc[0] = [0, 1, 2]
doc[1] = []
doc[2] = [2]
@@ -70, +62 @@
term[0] = "aardvark"
term[1] = "beaver"
term[2] = "cat"
- }}}
+ }}}
1. BINARY: a single-valued per-document byte[] array. This can be used for encoding custom per-document datastructures.
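The encodings in the list above can be sketched in a few lines of Python. This is an illustration only, not Lucene's implementation: the function names are hypothetical, and the bit-width estimate assumes values are stored as offsets from the minimum value, which is one simple way to arrive at the "around 1 bit per document" figure in the NUMERIC example.

```python
def bits_per_value(values):
    """Estimate bits needed per document if each value is stored as an
    offset from the minimum (an assumed scheme, for illustration)."""
    spread = max(values) - min(values)
    return max(1, spread.bit_length())


def encode_sorted(values):
    """SORTED-style encoding: one ordinal per document plus a term dictionary."""
    terms = sorted(set(values))
    ord_of = {term: i for i, term in enumerate(terms)}
    return [ord_of[v] for v in values], terms


def encode_sorted_set(docs):
    """SORTED_SET-style encoding: a sorted, de-duplicated list of ordinals
    per document plus a shared term dictionary."""
    terms = sorted({v for doc in docs for v in doc})
    ord_of = {term: i for i, term in enumerate(terms)}
    return [sorted({ord_of[v] for v in doc}) for doc in docs], terms


numeric_bits = bits_per_value([1005, 1006, 1005])
sorted_ords, sorted_terms = encode_sorted(["aardvark", "beaver", "aardvark"])
set_ords, set_terms = encode_sorted_set(
    [["cat", "aardvark", "beaver", "aardvark"], [], ["cat"]]
)
```

With the sample documents from the list, `sorted_ords` and `set_ords` hold the per-document ordinals and `sorted_terms` and `set_terms` hold the dictionaries. Note how the duplicate "aardvark" and the original value order in the first SORTED_SET document are both lost.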
= Solr's DocValues types =
-
1. StrField (multiValued=false): This uses the SORTED type behind the scenes. This is a good choice for a sort field.
- Example:
+ . Example:
- {{{<field name="manu_exact" type="str" indexed="false" stored="false" docValues="true" default=""/>}}}
+ {{{<field name="manu_exact" type="str" indexed="false" stored="false" docValues="true" default=""/>}}}
1. StrField (multiValued=true): This uses the SORTED_SET type behind the scenes.
- Example:
+ . Example:
- {{{<field name="productCategories" type="str" indexed="false" stored="false" multiValued="true" docValues="true"/>}}}
+ {{{<field name="productCategories" type="str" indexed="false" stored="false" multiValued="true" docValues="true"/>}}}
1. TrieXXXField (multiValued=false): This uses the NUMERIC type behind the scenes. This is a good choice for a sort field or a scoring factor used in function queries.
- Example:
+ . Example:
- {{{<field name="popularity" type="int" indexed="false" stored="false" docValues="true" default="0"/>}}}
+ {{{<field name="popularity" type="int" indexed="false" stored="false" docValues="true" default="0"/>}}}
1. TrieXXXField (multiValued=true): This uses the SORTED_SET type behind the scenes, encoding the numeric values such that ordinals reflect numeric sort order.
- Example:
+ . Example:
- {{{<field name="specialCodes" type="int" indexed="false" stored="false" multiValued="true" docValues="true"/>}}}
+ {{{<field name="specialCodes" type="int" indexed="false" stored="false" multiValued="true" docValues="true"/>}}}
= Specifying a different Codec implementation =
+ . You can specify the {{{docValuesFormat}}} attribute on the fieldType to control the underlying implementation. Note that only the default implementation is supported by future versions of Lucene: if you try an alternative format, you may need to switch back to the default and rewrite your index (e.g. forceMerge) before upgrading.
- You can specify the {{{docValuesFormat}}} attribute on the fieldType to control the underlying implementation. Note that only the default implementation is supported
- by future versions of Lucene: if you try an alternative format, you may need to switch back to the default and rewrite your index (e.g. forceMerge) before upgrading.
-
- * {{{docValuesFormat="Lucene42"}}}: This is the default, which loads everything into heap memory.
+ * {{{docValuesFormat="Lucene42"}}}: This is the default, which loads everything into heap memory.
- * {{{docValuesFormat="Disk"}}}: This implementation has a different layout, to try to keep most data on disk but with reasonable performance.
+ * {{{docValuesFormat="Disk"}}}: This implementation has a different layout, to try to keep most data on disk but with reasonable performance.
- * {{{docValuesFormat="SimpleText"}}}: Plain-text, slow, and not for production.
+ * {{{docValuesFormat="SimpleText"}}}: Plain-text, slow, and not for production.
Example of altering the codec implementation:
+
{{{
<fieldType name="string_disk" class="solr.StrField" docValuesFormat="Disk" />
}}}