Posted to commits@lucene.apache.org by "Steve Rowe (Confluence)" <co...@apache.org> on 2013/09/26 14:28:00 UTC

[CONF] Apache Solr Reference Guide > What Is An Analyzer?

Space: Apache Solr Reference Guide (https://cwiki.apache.org/confluence/display/solr)
Page: What Is An Analyzer? (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=32604227)

Change Comment:
---------------------------------------------------------------------
Removed newline between {code} and first line of example, so that PDF export won't have an extra blank line.

Edited by Steve Rowe:
---------------------------------------------------------------------
An analyzer examines the text of fields and generates a token stream. Analyzers are specified as a child of the {{<fieldType>}} element in the {{schema.xml}} configuration file, which can be found in the {{solr/conf}} directory, or wherever {{solrconfig.xml}} is located.

In normal usage, only fields of type {{solr.TextField}} will specify an analyzer.  The simplest way to configure an analyzer is with a single {{<analyzer>}} element whose class attribute is a fully qualified Java class name. The named class must derive from {{org.apache.lucene.analysis.Analyzer}}. For example:

{code:xml|borderStyle=solid|borderColor=#666666}<fieldType name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldType>
{code}

In this case, a single class, {{WhitespaceAnalyzer}}, is responsible for analyzing the content of the named text field and emitting the corresponding tokens. For simple cases, such as plain English prose, a single analyzer class like this may be sufficient.  But it's often necessary to perform more complex analysis of the field content.

Even the most complex analysis requirements can usually be decomposed into a series of discrete, relatively simple processing steps. As you will soon discover, the Solr distribution comes with a large selection of tokenizers and filters that covers most scenarios you are likely to encounter. Setting up an analyzer chain is very straightforward; you specify a simple {{<analyzer>}} element (no class attribute) with child elements that name factory classes for the tokenizer and filters to use, in the order you want them to run.

For example:

{code:xml|borderStyle=solid|borderColor=#666666}<fieldType name="nametext" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"/>
    </analyzer>
</fieldType>
{code}

Note that classes in the {{org.apache.solr.analysis}} package may be referred to here with the shorthand {{solr.}} prefix.
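For instance, the following two filter declarations refer to the same factory class (a minimal fragment for illustration, assuming the package location stated in the note above):

{code:xml|borderStyle=solid|borderColor=#666666}<!-- fully qualified name and shorthand form are equivalent -->
<filter class="org.apache.solr.analysis.LowerCaseFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
{code}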

In this case, no Analyzer class is specified on the {{<analyzer>}} element.  Rather, a sequence of more specialized classes is wired together to act collectively as the Analyzer for the field.  The text of the field is passed to the first item in the list ({{solr.StandardTokenizerFactory}}), and the tokens that emerge from the last one ({{solr.EnglishPorterFilterFactory}}) are the terms used for indexing or querying any fields that use the "nametext" {{fieldType}}.

h2. Analysis Phases

Analysis takes place in two contexts. At index time, when a field is being created, the token stream that results from analysis is added to an index and defines the set of terms (including positions, sizes, and so on) for the field. At query time, the values being searched for are analyzed and the terms that result are matched against those that are stored in the field's index.

In many cases, the same analysis should be applied to both phases. This is desirable when you want to query for exact string matches, possibly with case-insensitivity, for example. In other cases, you may want to apply slightly different analysis steps during indexing than those used at query time.

If you provide a simple {{<analyzer>}} definition for a field type, as in the examples above, then it will be used for both indexing and queries. If you want distinct analyzers for each phase, you may include two {{<analyzer>}} definitions distinguished with a type attribute. For example:

{code:xml|borderStyle=solid|borderColor=#666666}<fieldType name="nametext" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
{code}

In this theoretical example, at index time the text is tokenized, the tokens are set to lowercase, any tokens not listed in {{keepwords.txt}} are discarded, and those that remain are mapped to alternate values as defined by the synonym rules in the file {{syns.txt}}. This essentially builds an index from a restricted set of possible values and then normalizes them to values that may not even occur in the original text.

At query time, the only normalization that happens is to convert the query terms to lowercase. The filtering and mapping steps that occur at index time are not applied to the query terms.  In this example, queries must therefore be very precise, using only the normalized terms that were stored at index time.
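The contrast can be traced with a hypothetical input, assuming {{keepwords.txt}} contains only {{canaria}} and {{syns.txt}} maps {{canaria => canary}} (both file contents are invented for illustration):

{code}Index time, input "A Canaria Bird":
  StandardTokenizer:  [A] [Canaria] [Bird]
  LowerCaseFilter:    [a] [canaria] [bird]
  KeepWordFilter:     [canaria]
  SynonymFilter:      [canary]        <-- term stored in the index

Query time, input "Canaria":
  StandardTokenizer:  [Canaria]
  LowerCaseFilter:    [canaria]       <-- does NOT match the indexed term "canary"
{code}

Only a query for "canary" (or "Canary") would match, since query analysis stops after lowercasing and never applies the keep-word or synonym steps.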


