Posted to solr-dev@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2006/12/28 04:37:42 UTC
Re: [Solr Wiki] Update of "SolrFacetingOverview" by JJLarrea
JJ: Fantastic - this is excellent info, and sharing it helps a LOT!
Erik
On Dec 27, 2006, at 7:25 PM, Apache Wiki wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki"
> for change notification.
>
> The following page has been changed by JJLarrea:
> http://wiki.apache.org/solr/SolrFacetingOverview
>
> The comment on the change is:
> Added page per 12/8/06 suggestion by Yonik
>
> New page:
> = Faceting Overview =
>
> Solr provides a [http://incubator.apache.org/solr/docs/api/org/apache/solr/request/SimpleFacets.html Simple Faceting toolkit]
> which can be reused by various Request Handlers to include "Facet
> counts" based on some simple criteria. Both the
> StandardRequestHandler and the DisMaxRequestHandler currently use
> these utilities. Detailed descriptions of the parameters used to
> control faceting can be found (along with several examples) at
> [SimpleFacetParameters].
>
> This page briefly provides some general background information:
>
> = Facet Indexing =
>
> Faceting is done on __indexed__ rather than __stored__ values.
> This is because the primary use for faceting is drilldown into a
> subset of hits resulting from a query, and so the chosen facet
> value is used to construct a filter query which literally matches
> that value in the index. For the stock Solr request handlers this
> is done by adding an `fq=<facet-field>:<quoted facet-value>`
> parameter and resubmitting the query.
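As a rough illustration of that drill-down step, here is a minimal Python sketch that appends an `fq` filter to an existing parameter list. The field name `author_facet` and the query `programming` are made up for the example, and the quote escaping is an assumption about what a phrase-quoted value needs, not something taken from the wiki page.

```python
from urllib.parse import urlencode, parse_qsl

def drilldown_params(base_params, facet_field, facet_value):
    """Return a copy of base_params with one added fq filter for a facet value."""
    # Escape backslashes and double quotes so the value survives phrase quoting
    # (assumed escaping rules; check them against your query parser).
    escaped = facet_value.replace("\\", "\\\\").replace('"', '\\"')
    return list(base_params) + [("fq", f'{facet_field}:"{escaped}"')]

# Hypothetical original query and facet field.
params = drilldown_params([("q", "programming")], "author_facet", "Schildt, Herbert")
query_string = urlencode(params)  # URL-encodes the colon, quotes, comma and space
print(query_string)
```

The resulting string would be appended to the request handler URL when the query is resubmitted.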
>
> Because faceting fields are often specified to serve two purposes,
> human-readable text and drill-down query value, they are frequently
> indexed differently from fields used for searching and sorting:
> * They are not tokenized into separate words
> * They are not mapped into lower case
> * Human-readable punctuation is not removed (other than double-
> quotes)
> * There is often no need to store them, since stored values would
> look much like indexed values and the faceting mechanism is used
> for value retrieval.
> * Depending on how the field is defined the SimpleFacets
> mechanism may only allow for a single value per field per document
> (see below)
>
> As an example, if I had a field with a list of authors, such as:
>
> Schildt, Herbert; Wolpert, Lewis; Davies, P.
>
> I might want to index the same data differently in three different
> fields (perhaps using the Solr [:SchemaXml#Copy Fields:copyField]
> directive):
> * For searching: Tokenized, case-folded, punctuation-stripped:
> schildt / herbert / wolpert / lewis / davies / p
> * For sorting: Untokenized, case-folded, punctuation-stripped:
> schildt herbert wolpert lewis davies p
> * For faceting: Primary author only, using a `solr.StrField`:
> Schildt, Herbert
>
> Then when the user drills down on the "Schildt, Herbert" string I
> would reissue the query with an added `fq=<facet-field>:"Schildt, Herbert"` parameter.
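To make the three index forms concrete, here is a small Python sketch that derives each of them from the raw author list. In Solr this work would be done by the field types and analyzers configured in the schema; the function names below are hypothetical and exist only for illustration.

```python
import re

raw_authors = "Schildt, Herbert; Wolpert, Lewis; Davies, P."

def search_tokens(text):
    """Searching: tokenized, case-folded, punctuation-stripped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def sort_key(text):
    """Sorting: one untokenized, case-folded, punctuation-stripped string."""
    return " ".join(search_tokens(text))

def facet_value(text):
    """Faceting: primary author only, kept verbatim (case and comma intact)."""
    return text.split(";")[0].strip()

print(search_tokens(raw_authors))
print(sort_key(raw_authors))
print(facet_value(raw_authors))
```

Only the faceting form keeps the original case and punctuation, which is why it can double as the human-readable label and the drill-down value.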
>
> = Facet Operation =
>
> Currently SimpleFacets has 3 modes of operation:
>
> == FacetQueries ==
>
> Any number of [:SimpleFacetParameters#facet.query:facet.query]
> parameters can be passed to the request handler. Each distinct
> facet.query will first be executed against the entire index, with
> the results cached as a hash set (if the number of matching
> documents is below the hashDocSet maximum size) or a bit set (if
> larger) of document IDs (see [:SolrCaching#The hashDocSet Max
> Size:hashDocSet]). Then every time that facet.query
> is used for faceting a query, the cached set will be intersected
> against the set of document ids returned by the query to count the
> number of documents for which the facet.query condition is true.
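The cache-then-intersect idea can be sketched in a few lines of Python. This is a toy model, not Solr's implementation: the index is a plain dict, each facet query is a hypothetical predicate function, and the choice between hash set and bit set is not modeled (a `frozenset` stands in for both).

```python
def make_facet_counter(index, facet_queries):
    """Return (count, cache); count(name, result_ids) gives the facet count."""
    cache = {}

    def count(name, result_ids):
        if name not in cache:
            # First use: run the facet query against the entire index and cache it.
            predicate = facet_queries[name]
            cache[name] = frozenset(
                doc_id for doc_id, doc in index.items() if predicate(doc)
            )
        # Every use: intersect the cached set with the main query's results.
        return len(cache[name] & result_ids)

    return count, cache

# Toy index and two hypothetical price-range facet queries.
index = {1: {"price": 5}, 2: {"price": 50}, 3: {"price": 500}, 4: {"price": 8}}
facet_queries = {
    "price:[0 TO 10]": lambda doc: 0 <= doc["price"] <= 10,
    "price:[11 TO 100]": lambda doc: 11 <= doc["price"] <= 100,
}
count, cache = make_facet_counter(index, facet_queries)

result_ids = {1, 2, 3}  # document IDs matched by the user's main query
cheap = count("price:[0 TO 10]", result_ids)
mid_range = count("price:[11 TO 100]", result_ids)
print(cheap, mid_range)
```

Document 4 matches the first facet query but is not in the main result set, so it is cached yet does not contribute to the count.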
>
> == FacetFields ==
>
> Any number of [:SimpleFacetParameters#facet.field:facet.field]
> parameters can be passed to the request handler. For each
> facet.field, one of two approaches will be used:
>
> * Field Queries: If the facet field is defined in the schema
> as multi-valued, boolean, or tokenized, then every indexed value
> for the field will be iterated and a facet query will be executed
> and cached (as described above). This is excellent for fields
> where there is a small set of distinct values. For example,
> faceting on a field of U.S. states, e.g. `Alabama, Alaska, ...
> Wyoming` would lead to fifty cached queries which would be used
> over and over again. It also works in the case when the facet
> field can have multiple values for each document. However, it
> requires excessive amounts of memory and time when the number of
> field values is large and especially when it exceeds the filter
> cache size defined in [:SolrCaching#filterCache:filterCache].
>
> * Field Cache: If the facet field is not tokenized, not
> multi-valued, and not boolean, then a field-cache approach will be used.
> This is currently implemented with the Lucene
> [http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.html FieldCache]
> mechanism used for results sorting. An
> array of integers (one for every document in the index) is
> allocated, pre-filled with the first indexed value for that field
> in each document (offset into a table of strings for fields indexed
> as strings), and cached. Every time that facet.field is used for
> faceting a query, all the document IDs resulting from the query are
> looked up in the field cache and any value found has its tally
> incremented. This is excellent for situations where the number of
> indexed values for the field is too large to be practical using the
> field queries mechanism, such as faceting against authors or
> titles. However, it is currently much slower and more
> memory-intensive than the field query mechanism for fields with a
> small number of values.
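The two facet.field strategies can be contrasted with a simplified Python model. It assumes document IDs are consecutive integers starting at 0 and that each document's field values are given as a list; all function names are hypothetical, and real Solr and Lucene data structures differ in detail.

```python
from collections import Counter

def facet_by_field_queries(docs, result_ids):
    """Field-queries style: one document set per distinct value, then intersect."""
    term_sets = {}
    for doc_id, values in enumerate(docs):
        for value in values:  # handles several values per document
            term_sets.setdefault(value, set()).add(doc_id)
    return {term: len(ids & result_ids) for term, ids in term_sets.items()}

def build_field_cache(docs):
    """Field-cache style: a string table plus one integer ordinal per document."""
    terms = sorted({values[0] for values in docs if values})
    ordinal = {term: i for i, term in enumerate(terms)}
    # -1 marks documents with no value; only the first value is recorded.
    order = [ordinal[values[0]] if values else -1 for values in docs]
    return terms, order

def facet_by_field_cache(terms, order, result_ids):
    """Look up every matching document in the array and tally its value."""
    tally = Counter()
    for doc_id in result_ids:
        if order[doc_id] >= 0:
            tally[terms[order[doc_id]]] += 1
    return dict(tally)

docs = [["Alabama"], ["Alaska"], ["Alabama"], [], ["Wyoming"]]
result_ids = {0, 2, 4}  # documents matched by the main query

by_queries = facet_by_field_queries(docs, result_ids)
terms, order = build_field_cache(docs)
by_cache = facet_by_field_cache(terms, order, result_ids)
print(by_queries)
print(by_cache)
```

In this model the first approach does work proportional to the number of distinct values (one set per value), while the second does work proportional to the number of matching documents, which mirrors the trade-off described above.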
>
> Note that at this time there is no way to manually control whether
> facet.field is handled via field queries or field cache other than
> defining in the schema whether the field is single- or multi-valued
> and the analyzer used: `solr.TextField` is always tokenized while
> `solr.StrField` is never. Control may be improved in the future,
> along with a means to handle multi-valued fields with a variant of
> the Field Cache mechanism.
> ----
> CategoryCategory