Posted to commits@lucene.apache.org by "Hoss Man (Confluence)" <co...@apache.org> on 2013/12/09 19:20:00 UTC

[CONF] Apache Solr Reference Guide > Filter Descriptions

Space: Apache Solr Reference Guide (https://cwiki.apache.org/confluence/display/solr)
Page: Filter Descriptions (https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions)

Change Comment:
---------------------------------------------------------------------
updateOffsets no longer supported by TrimFilter

Edited by Hoss Man:
---------------------------------------------------------------------
{section}
{column:width=65%}

You configure each filter with a {{<filter>}} element in {{schema.xml}} as a child of {{<analyzer>}}, following the {{<tokenizer>}} element. Filter definitions should follow a tokenizer or another filter definition because they take a {{TokenStream}} as input. For example:

{code:xml|borderStyle=solid|borderColor=#666666}
<fieldType name="text" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>...
    </analyzer>
</fieldType>
{code}

The class attribute names a factory class that will instantiate a filter object as needed. Filter factory classes must implement the {{org.apache.solr.analysis.TokenFilterFactory}} interface. Like tokenizers, filters are also instances of TokenStream and thus are producers of tokens. Unlike tokenizers, filters also consume tokens from a TokenStream. This allows you to mix and match filters, in any order you prefer, downstream of a tokenizer.
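The consume-and-produce relationship described above can be sketched outside of Solr as a chain of generators (this is an illustrative analogy, not Lucene's actual {{TokenStream}} API):

```python
# Sketch (not Solr's actual API): a tokenizer produces tokens, and each
# filter both consumes and produces a token stream, so filters can be
# stacked in any order downstream of the tokenizer.

def whitespace_tokenizer(text):
    for token in text.split():
        yield token

def lowercase_filter(tokens):
    for token in tokens:
        yield token.lower()

def length_filter(tokens, min_len, max_len):
    for token in tokens:
        if min_len <= len(token) <= max_len:
            yield token

# Filters stack in declaration order, like <filter> elements
# listed after a <tokenizer> in schema.xml.
stream = whitespace_tokenizer("Down With CamelCase")
stream = lowercase_filter(stream)
stream = length_filter(stream, 2, 7)
print(list(stream))  # ['down', 'with']
```

Reordering the chain changes the result, which is why filter order in {{schema.xml}} matters.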

Arguments may be passed to filter factories to modify their behavior by setting attributes on the {{<filter>}} element. For example:

{code:xml|borderStyle=solid|borderColor=#666666}
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; " />
    <filter class="solr.LengthFilterFactory" min="2" max="7"/>
  </analyzer>
</fieldType>
{code}

The following sections describe the filter factories that are included in this release of Solr.

For more information about Solr's filters, see [http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters].
{column}
{column:width=35%}
{panel}
Filters discussed in this section:
{toc}
{panel}
{column}
{section}

h2. ASCII Folding Filter

This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 128 ASCII characters) to their ASCII equivalents, if one exists. This filter converts characters from the following Unicode blocks:

* [C1 Controls and Latin-1 Supplement|http://www.unicode.org/charts/PDF/U0080.pdf] (PDF)
* [Latin Extended-A|http://www.unicode.org/charts/PDF/U0100.pdf] (PDF)
* [Latin Extended-B|http://www.unicode.org/charts/PDF/U0180.pdf] (PDF)
* [Latin Extended Additional|http://www.unicode.org/charts/PDF/U1E00.pdf] (PDF)
* [Latin Extended-C|http://www.unicode.org/charts/PDF/U2C60.pdf] (PDF)
* [Latin Extended-D|http://www.unicode.org/charts/PDF/UA720.pdf] (PDF)
* [IPA Extensions|http://www.unicode.org/charts/PDF/U0250.pdf] (PDF)
* [Phonetic Extensions|http://www.unicode.org/charts/PDF/U1D00.pdf] (PDF)
* [Phonetic Extensions Supplement|http://www.unicode.org/charts/PDF/U1D80.pdf] (PDF)
* [General Punctuation|http://www.unicode.org/charts/PDF/U2000.pdf] (PDF)
* [Superscripts and Subscripts|http://www.unicode.org/charts/PDF/U2070.pdf] (PDF)
* [Enclosed Alphanumerics|http://www.unicode.org/charts/PDF/U2460.pdf] (PDF)
* [Dingbats|http://www.unicode.org/charts/PDF/U2700.pdf] (PDF)
* [Supplemental Punctuation|http://www.unicode.org/charts/PDF/U2E00.pdf] (PDF)
* [Alphabetic Presentation Forms|http://www.unicode.org/charts/PDF/UFB00.pdf] (PDF)
* [Halfwidth and Fullwidth Forms|http://www.unicode.org/charts/PDF/UFF00.pdf] (PDF)

*Factory class:* {{solr.ASCIIFoldingFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
{code}

*In:* "á" (Unicode character 00E1)

*Out:* "a" (ASCII character 97)

h2. Beider-Morse Filter

Implements the Beider-Morse Phonetic Matching (BMPM) algorithm, which allows identification of similar names, even if they are spelled differently or in different languages. More information about how this works is available in the section on [solr:Phonetic Matching].

*Factory class:* {{solr.BeiderMorseFilterFactory}}

*Arguments:*

{{nameType}}: Types of names. Valid values are GENERIC, ASHKENAZI, or SEPHARDIC. If not processing Ashkenazi or Sephardic names, use GENERIC.

{{ruleType}}: Types of rules to apply. Valid values are APPROX or EXACT.

{{concat}}: Defines if multiple possible matches should be combined with a pipe ("|").

{{languageSet}}: The language set to use. The value "auto" will allow the Filter to identify the language, or a comma-separated list can be supplied.

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" 
      concat="true" languageSet="auto">
   </filter>
</analyzer>
{code}


h2. Classic Filter

This filter takes the output of the [Classic Tokenizer|Tokenizers#Classic Tokenizer] and strips periods from acronyms and "'s" from possessives.

*Factory class:* {{solr.ClassicFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.ClassicFilterFactory"/>
</analyzer>
{code}

*In:* "I.B.M. cat's can't"

*Tokenizer to Filter:* "I.B.M.", "cat's", "can't"

*Out:* "IBM", "cat", "can't"
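The two rules above can be sketched as a small helper (a hypothetical illustration, not Solr's implementation): strip a trailing "'s" from possessives, and remove the periods from tokens that look like acronyms; everything else passes through unchanged.

```python
import re

# Illustrative sketch of the Classic Filter's two rules (hypothetical
# helper, not Solr's implementation).
def classic_filter(token):
    if token.endswith("'s"):           # possessive: cat's -> cat
        token = token[:-2]
    # acronym: letters separated by periods, e.g. I.B.M. -> IBM
    if re.fullmatch(r"(?:[A-Za-z]\.)+[A-Za-z]?\.?", token):
        token = token.replace(".", "")
    return token

print([classic_filter(t) for t in ["I.B.M.", "cat's", "can't"]])
# ['IBM', 'cat', "can't"]
```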

h2. Common Grams Filter

This filter creates word shingles by combining common tokens such as stop words with regular tokens. This is useful for creating phrase queries containing common words, such as "the cat." Solr normally ignores stop words in queried phrases, so searching for "the cat" would return all matches for the word "cat."

*Factory class:* {{solr.CommonGramsFilterFactory}}

*Arguments:*

{{words}}: (a common word file in .txt format) Provide the name of a common word file, such as {{stopwords.txt}}.

{{format}}: (optional) If the stopwords list has been formatted for Snowball, you can specify {{format="snowball"}} so Solr can read the stopwords file.

{{ignoreCase}}: (boolean) If true, the filter ignores the case of words when comparing them to the common word file.  The default is false.

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
{code}

*In:* "the Cat"

*Tokenizer to Filter:* "the", "Cat"

*Out:* "the_cat"

h2. Collation Key Filter

Collation allows sorting of text in a language-sensitive way. It is usually used for sorting, but can also be used with advanced searches. We've covered this in much more detail in the section on [Unicode Collation|Language Analysis#Unicode Collation].

h2. Edge N-Gram Filter

This filter generates edge n-gram tokens of sizes within the given range.

*Factory class:* {{solr.EdgeNGramFilterFactory}}

*Arguments:*

{{minGramSize}}: (integer, default 1) The minimum gram size.

{{maxGramSize}}: (integer, default 1) The maximum gram size.

*Example:*

Default behavior.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.EdgeNGramFilterFactory"/>
</analyzer>
{code}

*In:* "four score and twenty"

*Tokenizer to Filter:* "four", "score", "and", "twenty"

*Out:* "f", "s", "a", "t"

*Example:*

A range of 1 to 4.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="4"/>
</analyzer>
{code}

*In:* "four score"

*Tokenizer to Filter:* "four", "score"

*Out:* "f", "fo", "fou", "four", "s", "sc", "sco", "scor"

*Example:*

A range of 4 to 6.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="6"/>
</analyzer>
{code}

*In:* "four score and twenty"

*Tokenizer to Filter:* "four", "score", "and", "twenty"

*Out:* "four", "scor", "score", "twen", "twent", "twenty"
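Edge n-gram generation amounts to emitting the leading substrings of each token whose lengths fall within the configured range; a token shorter than {{minGramSize}} produces no output. A minimal sketch (not Lucene's implementation):

```python
# Minimal sketch of edge n-gram generation (not Lucene's
# implementation): emit the leading substrings of each token with
# lengths in [min_size, max_size].
def edge_ngrams(token, min_size, max_size):
    upper = min(max_size, len(token))
    return [token[:n] for n in range(min_size, upper + 1)]

# The 1-to-4 example: "four score"
grams = [g for t in ["four", "score"] for g in edge_ngrams(t, 1, 4)]
print(grams)  # ['f', 'fo', 'fou', 'four', 's', 'sc', 'sco', 'scor']
```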

h2. English Minimal Stem Filter

This filter stems plural English words to their singular form.

*Factory class:* {{solr.EnglishMinimalStemFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
{code}

*In:* "dogs cats"

*Tokenizer to Filter:* "dogs", "cats"

*Out:* "dog", "cat"

h2. Hunspell Stem Filter

The [Hunspell Stem Filter|http://wiki.apache.org/solr/Hunspell] provides support for several languages. You must provide the dictionary ({{.dic}}) and rules ({{.aff}}) files for each language you wish to use with the Hunspell Stem Filter. You can download those language files [here|http://wiki.services.openoffice.org/wiki/Dictionaries]. Be aware that your results will vary widely based on the quality of the provided dictionary and rules files. For example, some languages have only a minimal word list with no morphological information. On the other hand, for languages that have no stemmer but do have an extensive dictionary file, the Hunspell stemmer may be a good choice.

*Factory class:* {{solr.HunspellStemFilterFactory}}

*Arguments:*

{{dictionary}}: (required) The path of a dictionary file. 
{{affix}}: (required) The path of a rules file. 
{{ignoreCase}}: (boolean) controls whether matching is case sensitive or not. The default is false.
{{strictAffixParsing}}: (boolean) controls whether the affix parsing is strict or not. If true, an error while reading an affix rule causes a ParseException; otherwise it is ignored. The default is true.

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.HunspellStemFilterFactory"
    dictionary="en_GB.dic"
    affix="en_GB.aff"
    ignoreCase="true"
    strictAffixParsing="true" />
</analyzer>
{code}

*In:* "jump jumping jumped"

*Tokenizer to Filter:* "jump", "jumping", "jumped"

*Out:* "jump", "jump", "jump"

h2. Hyphenated Words Filter

This filter reconstructs hyphenated words that have been tokenized as two tokens because of a line break or other intervening whitespace in the field text. If a token ends with a hyphen, it is joined with the following token and the hyphen is discarded. Note that for this filter to work properly, the upstream tokenizer must not remove trailing hyphen characters. This filter is generally only useful at index time.

*Factory class:* {{solr.HyphenatedWordsFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.HyphenatedWordsFilterFactory"/>
</analyzer>
{code}

*In:* "A hyphen\- ated word"

*Tokenizer to Filter:* "A", "hyphen-", "ated", "word"

*Out:* "A", "hyphenated", "word"
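The rejoining rule is simple enough to sketch in a few lines (an illustration, not Solr's implementation): a token ending in a hyphen is held back and glued to the next token.

```python
# Sketch of the hyphen-rejoining rule (not Solr's implementation):
# a token ending in "-" is joined with the next token and the hyphen
# is discarded.
def join_hyphenated(tokens):
    out, pending = [], None
    for token in tokens:
        if pending is not None:
            token = pending + token
            pending = None
        if token.endswith("-"):
            pending = token[:-1]
        else:
            out.append(token)
    if pending is not None:  # dangling hyphen at end of stream
        out.append(pending)
    return out

print(join_hyphenated(["A", "hyphen-", "ated", "word"]))
# ['A', 'hyphenated', 'word']
```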

h2. ICU Folding Filter

This filter is a custom Unicode normalization form that applies the foldings specified in [Unicode Technical Report 30|http://www.unicode.org/reports/tr30/tr30-4.html] in addition to the {{NFKC_Casefold}} normalization form as described in [ICU Normalizer 2 Filter|#ICU Normalizer 2 Filter]. This filter is a better substitute for the combined behavior of the [ASCII Folding Filter|#ASCII Folding Filter], [Lower Case Filter|#Lower Case Filter], and [ICU Normalizer 2 Filter|#ICU Normalizer 2 Filter].

To use this filter, see {{solr/contrib/analysis-extras/README.txt}} for instructions on which jars you need to add to your {{solr_home/lib}}.

*Factory class:* {{solr.ICUFoldingFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
{code}

For detailed information on this normalization form, see [http://www.unicode.org/reports/tr30/tr30-4.html].

h2. ICU Normalizer 2 Filter

This filter factory normalizes text according to one of five Unicode Normalization Forms as described in [Unicode Standard Annex #15|http://unicode.org/reports/tr15/]:

* NFC: (name="nfc" mode="compose") Normalization Form C, canonical decomposition, followed by canonical composition
* NFD: (name="nfc" mode="decompose") Normalization Form D, canonical decomposition
* NFKC: (name="nfkc" mode="compose") Normalization Form KC, compatibility decomposition, followed by canonical composition
* NFKD: (name="nfkc" mode="decompose") Normalization Form KD, compatibility decomposition
* NFKC_Casefold: (name="nfkc_cf" mode="compose") Normalization Form KC, with additional Unicode case folding. Using the ICU Normalizer 2 Filter is a better-performing substitution for the [Lower Case Filter|#Lower Case Filter] and NFKC normalization.

*Factory class:* {{solr.ICUNormalizer2FilterFactory}}

*Arguments:*

{{name}}: (string) The name of the normalization form; {{nfc}}, {{nfd}}, {{nfkc}}, {{nfkd}}, {{nfkc_cf}}

{{mode}}: (string) The mode of Unicode character composition and decomposition; {{compose}} or {{decompose}}

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
</analyzer>
{code}

For detailed information about these Unicode Normalization Forms, see [http://unicode.org/reports/tr15/].

To use this filter, see {{solr/contrib/analysis-extras/README.txt}} for instructions on which jars you need to add to your {{solr_home/lib}}.

h2. ICU Transform Filter

This filter applies [ICU Transforms|http://userguide.icu-project.org/transforms/general] to text. This filter supports only ICU System Transforms. Custom rule sets are not supported.

*Factory class:* {{solr.ICUTransformFilterFactory}}

*Arguments:*

{{id}}: (string) The identifier for the ICU System Transform you wish to apply with this filter. For a full list of ICU System Transforms, see [http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/translit_rule_main.html].

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>
{code}

For detailed information about ICU Transforms, see [http://userguide.icu-project.org/transforms/general].

To use this filter, see {{solr/contrib/analysis-extras/README.txt}} for instructions on which jars you need to add to your {{solr_home/lib}}.

h2. Keep Words Filter

This filter discards all tokens except those that are listed in the given word list. This is the inverse of the Stop Words Filter. This filter can be useful for building specialized indices for a constrained set of terms.

*Factory class:* {{solr.KeepWordFilterFactory}}

*Arguments:*

{{words}}: (required) Path of a text file containing the list of keep words, one per line. Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or a simple filename in the Solr config directory.

{{ignoreCase}}: (true/false) If *true* then comparisons are done case-insensitively. If this argument is true, then the words file is assumed to contain only lowercase words. The default is *false*.

*Example:*

Where {{keepwords.txt}} contains:

 {{happy}}

 {{funny}}

 {{silly}}

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
</analyzer>
{code}

*In:* "Happy, sad or funny"

*Tokenizer to Filter:* "Happy", "sad", "or", "funny"

*Out:* "funny"

*Example:*

Same {{keepwords.txt}}, case insensitive:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
</analyzer>
{code}

*In:* "Happy, sad or funny"

*Tokenizer to Filter:* "Happy", "sad", "or", "funny"

*Out:* "Happy", "funny"

*Example:*

Using LowerCaseFilterFactory before filtering for keep words, no {{ignoreCase}} flag.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
</analyzer>
{code}

*In:* "Happy, sad or funny"

*Tokenizer to Filter:* "Happy", "sad", "or", "funny"

*Filter to Filter:* "happy", "sad", "or", "funny"

*Out:* "happy", "funny"
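The {{ignoreCase}} behavior can be sketched as follows (a hypothetical helper, not Solr's implementation): with {{ignoreCase="true"}} the keep list is assumed to be lowercase and tokens are lowercased for the comparison only, so the original token text survives in the output.

```python
# Sketch of keep-word filtering (not Solr's implementation). With
# ignore_case=True the keep set is assumed lowercase; tokens are
# lowercased only for comparison, and emitted unchanged.
def keep_words(tokens, keep, ignore_case=False):
    if ignore_case:
        return [t for t in tokens if t.lower() in keep]
    return [t for t in tokens if t in keep]

keep = {"happy", "funny", "silly"}  # contents of keepwords.txt
print(keep_words(["Happy", "sad", "or", "funny"], keep, ignore_case=True))
# ['Happy', 'funny']
print(keep_words(["Happy", "sad", "or", "funny"], keep))
# ['funny']
```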

h2. KStem Filter

KStem is an alternative to the Porter Stem Filter for developers looking for a less aggressive stemmer. KStem was written by Bob Krovetz, ported to Lucene by Sergio Guzman-Lara (UMASS Amherst). This stemmer is only appropriate for English language text.

*Factory class:* {{solr.KStemFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.KStemFilterFactory"/>
</analyzer>
{code}

*In:* "jump jumping jumped"

*Tokenizer to Filter:* "jump", "jumping", "jumped"

*Out:* "jump", "jump", "jump"

h2. Length Filter

This filter passes tokens whose length falls within the min/max limit specified. All other tokens are discarded.

*Factory class:* {{solr.LengthFilterFactory}}

*Arguments:*

{{min}}: (integer, required) Minimum token length. Tokens shorter than this are discarded.

{{max}}: (integer, required, must be >= min) Maximum token length. Tokens longer than this are discarded.

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LengthFilterFactory" min="3" max="7"/>
</analyzer>
{code}

*In:* "turn right at Albuquerque"

*Tokenizer to Filter:* "turn", "right", "at", "Albuquerque"

*Out:* "turn", "right"

h2. Lower Case Filter

Converts any uppercase letters in a token to the equivalent lowercase token. All other characters are left unchanged.

*Factory class:* {{solr.LowerCaseFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
{code}

*In:* "Down With CamelCase"

*Tokenizer to Filter:* "Down", "With", "CamelCase"

*Out:* "down", "with", "camelcase"

h2. N-Gram Filter

Generates n-gram tokens of sizes in the given range. Note that tokens are ordered by position and then by gram size.

*Factory class:* {{solr.NGramFilterFactory}}

*Arguments:*

{{minGramSize}}: (integer, default 1) The minimum gram size.

{{maxGramSize}}: (integer, default 2)  The maximum gram size.

*Example:*

Default behavior.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.NGramFilterFactory"/>
</analyzer>
{code}

*In:* "four score"

*Tokenizer to Filter:* "four", "score"

*Out:* "f", "o", "u", "r", "fo", "ou", "ur", "s", "c", "o", "r", "e", "sc", "co", "or", "re"

*Example:*

A range of 1 to 4.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="4"/>
</analyzer>
{code}

*In:* "four score"

*Tokenizer to Filter:* "four", "score"

*Out:* "f", "fo", "fou", "four", "s", "sc", "sco", "scor"

*Example:*

A range of 3 to 5.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="5"/>
</analyzer>
{code}

*In:* "four score"

*Tokenizer to Filter:* "four", "score"

*Out:* "fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore"
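N-gram generation emits every substring of each token whose length falls in the configured range. The exact output order has varied across Lucene versions; the sketch below (not Lucene's implementation) orders grams by start offset and then by size, which matches the 3-to-5 example.

```python
# Sketch of n-gram generation (not Lucene's implementation): emit
# every substring with length in [min_size, max_size], ordered by
# start offset, then by gram size.
def ngrams(token, min_size, max_size):
    out = []
    for start in range(len(token)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(token):
                out.append(token[start:start + size])
    return out

grams = [g for t in ["four", "score"] for g in ngrams(t, 3, 5)]
print(grams)
# ['fou', 'four', 'our', 'sco', 'scor', 'score', 'cor', 'core', 'ore']
```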

h2. Numeric Payload Token Filter

This filter adds a numeric floating point payload value to tokens that match a given type. Refer to the Javadoc for the {{org.apache.lucene.analysis.Token}} class for more information about token types and payloads.

*Factory class:* {{solr.NumericPayloadTokenFilterFactory}}

*Arguments:*

{{payload}}: (required) A floating point value that will be added to all matching tokens.

{{typeMatch}}: (required) A token type name string. Tokens with a matching type name will have their payload set to the above floating point value.

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.NumericPayloadTokenFilterFactory" payload="0.75" typeMatch="word"/>
</analyzer>
{code}

*In:* "bing bang boom"

*Tokenizer to Filter:* "bing", "bang", "boom"

*Out:* "bing"\[0.75\], "bang"\[0.75\], "boom"\[0.75\]

h2. Pattern Replace Filter

This filter applies a regular expression to each token and, for those that match, substitutes the given replacement string in place of the matched pattern. Tokens which do not match are passed through unchanged.

*Factory class:* {{solr.PatternReplaceFilterFactory}}

*Arguments:*

{{pattern}}: (required) The regular expression to test against each token, as per {{java.util.regex.Pattern}}.

{{replacement}}: (required) A string to substitute in place of the matched pattern. This string may contain references to capture groups in the regex pattern. See the Javadoc for {{java.util.regex.Matcher}}.

{{replace}}: ("all" or "first", default "all") Indicates whether all occurrences of the pattern in the token should be replaced, or only the first.

*Example:*

Simple string replace:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog"/>
</analyzer>
{code}

*In:* "cat concatenate catycat"

*Tokenizer to Filter:* "cat", "concatenate", "catycat"

*Out:* "dog", "condogenate", "dogydog"

*Example:*

String replacement, first occurrence only:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog" replace="first"/>
</analyzer>
{code}

*In:* "cat concatenate catycat"

*Tokenizer to Filter:* "cat", "concatenate", "catycat"

*Out:* "dog", "condogenate", "dogycat"

*Example:*

More complex pattern with capture group reference in the replacement. Tokens that start with non-numeric characters and end with digits will have an underscore inserted before the numbers. Otherwise the token is passed through.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="(\D+)(\d+)$" replacement="$1_$2"/>
</analyzer>
{code}

*In:* "cat foo1234 9987 blah1234foo"

*Tokenizer to Filter:* "cat", "foo1234", "9987", "blah1234foo"

*Out:* "cat", "foo_1234", "9987", "blah1234foo"
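The capture-group example can be reproduced with Python's {{re}} module, which uses the same regex syntax here (Java's {{$1}}/{{$2}} group references become {{\1}}/{{\2}}):

```python
import re

# Reproduce the capture-group replacement: insert "_" between a
# non-digit prefix and a trailing run of digits.
pattern = re.compile(r"(\D+)(\d+)$")
tokens = ["cat", "foo1234", "9987", "blah1234foo"]
print([pattern.sub(r"\1_\2", t) for t in tokens])
# ['cat', 'foo_1234', '9987', 'blah1234foo']
```

"9987" is untouched because {{(\D+)}} requires at least one non-digit, and "blah1234foo" is untouched because the digits are not at the end of the token.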

h2. Phonetic Filter

This filter creates tokens using one of the phonetic encoding algorithms in the {{org.apache.commons.codec.language}} package.

*Factory class:* {{solr.PhoneticFilterFactory}}

*Arguments:*

{{encoder}}: (required) The name of the encoder to use. The encoder name must be one of the following (case insensitive): "[DoubleMetaphone|http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/DoubleMetaphone.html]", "[Metaphone|http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/Metaphone.html]", "[Soundex|http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/Soundex.html]", "[RefinedSoundex|http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/RefinedSoundex.html]", "[Caverphone|http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/Caverphone.html]", or "[ColognePhonetic|http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/ColognePhonetic.html]"

{{inject}}: (true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact spelling of the target word may not match.

{{maxCodeLength}}: (integer) The maximum length of the code to be generated by the Metaphone or Double Metaphone encoders.

*Example:*

Default behavior for DoubleMetaphone encoding.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"/>
</analyzer>
{code}

*In:* "four score and twenty"

*Tokenizer to Filter:* "four"(1), "score"(2), "and"(3), "twenty"(4)

*Out:* "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "twenty"(4), "TNT"(4)

The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the token they were derived from (immediately preceding).

*Example:*

Discard original token.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
</analyzer>
{code}

*In:* "four score and twenty"

*Tokenizer to Filter:* "four"(1), "score"(2), "and"(3), "twenty"(4)

*Out:* "FR"(1), "SKR"(2), "ANT"(3), "TWNT"(4)

*Example:*

Default Soundex encoder.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PhoneticFilterFactory" encoder="Soundex"/>
</analyzer>
{code}

*In:* "four score and twenty"

*Tokenizer to Filter:* "four"(1), "score"(2), "and"(3), "twenty"(4)

*Out:* "four"(1), "F600"(1), "score"(2), "S600"(2), "and"(3), "A530"(3), "twenty"(4), "T530"(4)
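To illustrate how a phonetic encoder collapses different spellings onto the same code, here is a compact sketch of classic American Soundex (a simplified variant for alphabetic input; the commons-codec encoder handles more edge cases):

```python
# Simplified classic American Soundex (illustrative; the commons-codec
# encoder used by Solr handles more edge cases).
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    word = word.lower()
    result = word[0].upper()          # keep the first letter
    last = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":                # h and w are transparent
            continue
        code = CODES.get(ch, "")
        if code and code != last:     # skip adjacent duplicates
            result += code
        last = code                   # vowels reset the duplicate check
    return (result + "000")[:4]       # pad/truncate to 4 characters

print([soundex(w) for w in ["four", "score", "and", "twenty"]])
# ['F600', 'S600', 'A530', 'T530']
```

Note how "score" becomes S600, not S260: the "c" is dropped because it shares a code with the leading "s".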

h2. Porter Stem Filter

This filter applies the Porter Stemming Algorithm for English. The results are similar to using the Snowball Porter Stemmer with the {{language="English"}} argument. But this stemmer is coded directly in Java and is not based on Snowball. It does not accept a list of protected words and is only appropriate for English language text. However, it has been benchmarked as [four times faster|http://markmail.org/thread/d2c443z63z37rwf6] than the English Snowball stemmer, so it can provide a performance enhancement.

*Factory class:* {{solr.PorterStemFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
{code}

*In:* "jump jumping jumped"

*Tokenizer to Filter:* "jump", "jumping", "jumped"

*Out:* "jump", "jump", "jump"

h2. Position Filter Factory

This filter sets the position increment values of all tokens in a token stream except the first, which retains its original position increment value. This filter *has been deprecated* and will be removed in Solr 5.

*Factory class:* {{solr.PositionFilterFactory}}

*Arguments:*

{{positionIncrement}}: (integer, default = 0) The position increment value to apply to all tokens in a token stream except the first.

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.PositionFilterFactory" positionIncrement="1"/>
</analyzer>
{code}

*In:* "hello world"

*Tokenizer to Filter:* "hello", "world"

*Out:* "hello" (token position 1), "world" (token position 2)


h2. Remove Duplicates Token Filter

The filter removes duplicate tokens in the stream. Tokens are considered to be duplicates if they have the same text and position values.

*Factory class:* {{solr.RemoveDuplicatesTokenFilterFactory}}

*Arguments:* None

*Example:*

One example of where {{RemoveDuplicatesTokenFilterFactory}} is useful is in situations where a synonym file is used in conjunction with a stemmer, causing some synonyms to be reduced to the same stem. Consider the following entry from a {{synonyms.txt}} file:

{noformat}
Television, Televisions, TV, TVs
{noformat}

When used in the following configuration:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
{code}

*In:* "Watch TV"

*Tokenizer to Synonym Filter:* "Watch"(1) "TV"(2)

*Synonym Filter to Stem Filter:* "Watch"(1) "Television"(2) "Televisions"(2) "TV"(2) "TVs"(2) 

*Stem Filter to Remove Dups Filter:* "Watch"(1) "Television"(2) "Television"(2) "TV"(2) "TV"(2) 

*Out:* "Watch"(1) "Television"(2) "TV"(2) 
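The deduplication rule itself is straightforward (a sketch, not Lucene's implementation): a token is dropped if an identical (text, position) pair has already been seen.

```python
# Sketch of duplicate removal (not Lucene's implementation): drop a
# token when the same (text, position) pair was already emitted.
def remove_duplicates(tokens):
    seen, out = set(), []
    for text, position in tokens:
        if (text, position) not in seen:
            seen.add((text, position))
            out.append((text, position))
    return out

stream = [("Watch", 1), ("Television", 2), ("Television", 2),
          ("TV", 2), ("TV", 2)]
print(remove_duplicates(stream))
# [('Watch', 1), ('Television', 2), ('TV', 2)]
```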

h2. Reversed Wildcard Filter

This filter reverses tokens to provide faster leading wildcard and prefix queries. Tokens without wildcards are not reversed.

*Factory class:* {{solr.ReversedWildcardFilterFactory}}

*Arguments:*

{{withOriginal}} (boolean) If true, the filter produces both original and reversed tokens at the same positions. If false, produces only reversed tokens.

{{maxPosAsterisk}} (integer, default = 2) The maximum position of the asterisk wildcard ('*') that triggers the reversal of the query term. Terms with asterisks at positions above this value are not reversed.

{{maxPosQuestion}} (integer, default = 1) The maximum position of the question mark wildcard ('?') that triggers the reversal of query term. To reverse only pure suffix queries (queries with a single leading asterisk), set this to 0 and {{maxPosAsterisk}} to 1.

{{maxFractionAsterisk}} (float, default = 0.0) An additional parameter that triggers the reversal if asterisk ('*') position is less than this fraction of the query token length.

{{minTrailing}} (integer, default = 2) The minimum number of trailing characters in a query token after the last wildcard character. For good performance this should be set to a value larger than 1.

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
   maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
</analyzer>
{code}

*In:* "\*foo \*bar"

*Tokenizer to Filter:* "*foo", "*bar"

*Out:* "oof*", "rab*"
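At index time the transformation is simply string reversal, optionally keeping the original token at the same position. A simplified sketch (the real filter also prepends an internal marker character to reversed tokens):

```python
# Index-side sketch (simplified; the real filter also marks reversed
# tokens internally): emit each token reversed, optionally alongside
# the original.
def reversed_wildcard(tokens, with_original=False):
    out = []
    for token in tokens:
        if with_original:
            out.append(token)
        out.append(token[::-1])     # "*foo" -> "oof*"
    return out

print(reversed_wildcard(["*foo", "*bar"]))  # ['oof*', 'rab*']
```

A leading-wildcard query such as {{\*foo}} can then be rewritten to the prefix query {{oof\*}} against the reversed terms, which is far cheaper to execute.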

h2. Shingle Filter

This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a single token.

*Factory class:* {{solr.ShingleFilterFactory}}

*Arguments:*

{{minShingleSize}}: (integer, default 2) The minimum number of tokens per shingle.

{{maxShingleSize}}: (integer, must be >= 2, default 2) The maximum number of tokens per shingle.

{{outputUnigrams}}: (true/false) If true (the default), then each individual token is also included at its original position.

{{outputUnigramsIfNoShingles}}: (true/false) If true, then individual tokens will be output if no shingles are possible. The default is false.

{{tokenSeparator}}: (string, default is " ") The string to use when joining adjacent tokens to form a shingle.

*Example:*

Default behavior.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ShingleFilterFactory"/>
</analyzer>
{code}

*In:* "To be, or what?"

*Tokenizer to Filter:* "To"(1), "be"(2), "or"(3), "what"(4)

*Out:* "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)

*Example:*

A maximum shingle size of four, without including the original tokens.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="false"/>
</analyzer>
{code}

*In:* "To be, or not to be."

*Tokenizer to Filter:* "To"(1), "be"(2), "or"(3), "not"(4), "to"(5), "be"(6)

*Out:* "To be"(1), "To be or"(1), "To be or not"(1), "be or"(2), "be or not"(2), "be or not to"(2), "or not"(3), "or not to"(3), "or not to be"(3), "not to"(4), "not to be"(4), "to be"(5)
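The shingling logic can be sketched in a few lines of Python (a simplified illustration, not Solr's implementation; positions are 1-based as in the examples above):

```python
def shingles(tokens, min_size=2, max_size=2, sep=" ", output_unigrams=True):
    """Emit (text, position) pairs: each unigram (optionally), plus every
    run of min_size..max_size adjacent tokens at the run's start position."""
    out = []
    for i in range(len(tokens)):
        pos = i + 1
        if output_unigrams:
            out.append((tokens[i], pos))
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append((sep.join(tokens[i:i + n]), pos))
    return out
```

With the defaults this reproduces the first example; with {{maxShingleSize="4"}} and {{outputUnigrams="false"}} it reproduces the second.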

h2. Snowball Porter Stemmer Filter

This filter factory instantiates a language-specific stemmer generated by Snowball. Snowball is a software package that generates pattern-based word stemmers. This type of stemmer is not as accurate as a table-based stemmer, but is faster and less complex. Table-driven stemmers are labor intensive to create and maintain and so are typically commercial products.

Solr contains Snowball stemmers for Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish. For more information on Snowball, visit [http://snowball.tartarus.org/].

{{StopFilterFactory}}, {{CommonGramsFilterFactory}}, and {{CommonGramsQueryFilterFactory}} can optionally read stopwords in Snowball format (specify {{format="snowball"}} in the configuration of those FilterFactories).

*Factory class:* {{solr.SnowballPorterFilterFactory}}

*Arguments:*

{{language}}: (default "English") The name of a language, used to select the appropriate Porter stemmer. Case is significant. This string is used to select a package name in the "org.tartarus.snowball.ext" class hierarchy.

{{protected}}: Path of a text file containing a list of protected words, one per line. Protected words will not be stemmed. Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or a simple file name in the Solr config directory.

*Example:*

Default behavior:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SnowballPorterFilterFactory"/>
</analyzer>
{code}

*In:* "flip flipped flipping"

*Tokenizer to Filter:* "flip", "flipped", "flipping"

*Out:* "flip", "flip", "flip"

*Example:*

French stemmer, English words:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
{code}

*In:* "flip flipped flipping"

*Tokenizer to Filter:* "flip", "flipped", "flipping"

*Out:* "flip", "flipped", "flipping"

*Example:*

Spanish stemmer, Spanish words:

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
</analyzer>
{code}

*In:* "cante canta"

*Tokenizer to Filter:* "cante", "canta"

*Out:* "cant", "cant"

h2. Standard Filter

This filter removes dots from acronyms and the substring "'s" from the end of tokens. This filter depends on the tokens being tagged with the appropriate term-type to recognize acronyms and words with apostrophes.

*Factory class:* {{solr.StandardFilterFactory}}

*Arguments:* None

{note}
This filter is no longer operational in Solr when the {{luceneMatchVersion}} (in {{solrconfig.xml}}) is higher than "3.1".
{note}

h2. Stop Filter

This filter discards, or _stops_ analysis of, tokens that are on the given stop words list. A standard stop words list is included in the Solr config directory, named {{stopwords.txt}}, which is appropriate for typical English language text.

*Factory class:* {{solr.StopFilterFactory}}

*Arguments:*

{{words}}: (optional) The path to a file that contains a list of stop words, one per line. Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or path relative to the Solr config directory.

{{format}}: (optional) If the stopwords list has been formatted for Snowball, you can specify {{format="snowball"}} so Solr can read the stopwords file.

{{ignoreCase}}: (true/false, default false) Ignore case when testing for stop words. If true, the stop list should contain lowercase words.

{note}
As of Solr 4.4, the {{enablePositionIncrements}} argument is no longer supported.
{note}

*Example:*

Case-sensitive matching, capitalized words not stopped. Token positions skip stopped words.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
</analyzer>
{code}

*In:* "To be or what?"

*Tokenizer to Filter:* "To"(1), "be"(2), "or"(3), "what"(4)

*Out:* "To"(1), "what"(4)

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
{code}

*In:* "To be or what?"

*Tokenizer to Filter:* "To"(1), "be"(2), "or"(3), "what"(4)

*Out:* "what"(4)
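Both examples can be sketched with a small Python function (an illustration of the behavior, not Solr's implementation); surviving tokens keep their original positions, so positions skip over removed words:

```python
def stop_filter(tokens, stopwords, ignore_case=False):
    """Drop tokens found in the stop list; emit (token, position) pairs
    with the original 1-based positions preserved."""
    out = []
    for pos, tok in enumerate(tokens, start=1):
        key = tok.lower() if ignore_case else tok
        if key not in stopwords:
            out.append((tok, pos))  # position gaps remain where words stopped
    return out
```

Assuming a lowercase stop list containing "to", "be", and "or", case-sensitive matching keeps the capitalized "To", while {{ignoreCase="true"}} removes it as well.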

h2. Synonym Filter

This filter does synonym mapping. Each token is looked up in the list of synonyms and if a match is found, then the synonym is emitted in place of the token. The position values of the new tokens are set such that they all occur at the same position as the original token.

*Factory class:* {{solr.SynonymFilterFactory}}

*Arguments:*

{{synonyms}}: (required) The path of a file that contains a list of synonyms, one per line. Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or path relative to the Solr config directory. There are two ways to specify synonym mappings:

* A comma-separated list of words. If the token matches any of the words, then all the words in the list are substituted, which will include the original token.

* Two comma-separated lists of words with the symbol "=>" between them. If the token matches any word on the left, then the list on the right is substituted. The original token will not be included unless it is also in the list on the right.

For the following examples, assume a synonyms file named {{mysynonyms.txt}}:

{code:language=none|borderStyle=solid|borderColor=#666666}
couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny
{code}

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
{code}

*In:* "teh small couch"

*Tokenizer to Filter:* "teh"(1), "small"(2), "couch"(3)

*Out:* "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
{code}

*In:* "teh ginormous, humungous sofa"

*Tokenizer to Filter:* "teh"(1), "ginormous"(2), "humungous"(3), "sofa"(4)

*Out:* "the"(1), "large"(2), "large"(3), "couch"(4), "sofa"(4), "divan"(4)
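The two mapping forms and the position handling can be sketched in Python (a simplified illustration, not Solr's parser; it handles only single-token synonyms):

```python
def parse_synonyms(lines):
    """Build token -> replacement-list rules from the two mapping forms."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # blank lines and comments are ignored
        if '=>' in line:
            lhs, rhs = line.split('=>', 1)
            repl = [w.strip() for w in rhs.split(',')]
            for w in lhs.split(','):
                rules[w.strip()] = repl  # original not kept unless on the right
        else:
            group = [w.strip() for w in line.split(',')]
            for w in group:
                rules[w] = group  # the whole group, original included
    return rules

def apply_synonyms(tokens, rules):
    """Emit every replacement at the original token's 1-based position."""
    out = []
    for pos, tok in enumerate(tokens, start=1):
        for repl in rules.get(tok, [tok]):
            out.append((repl, pos))
    return out
```

Applied to the {{mysynonyms.txt}} rules above, "teh small couch" expands exactly as shown in the first example.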

h2. Token Offset Payload Filter

This filter adds the numeric character offsets of the token as a payload value for that token.

*Factory class:* {{solr.TokenOffsetPayloadTokenFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.TokenOffsetPayloadTokenFilterFactory"/>
</analyzer>
{code}

*In:* "bing bang boom"

*Tokenizer to Filter:* "bing", "bang", "boom"

*Out:* "bing"\[0,4\], "bang"\[5,9\], "boom"\[10,14\]
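The offsets in the example come straight from the tokens' character positions in the input. A small Python sketch (illustrative only) of computing them for whitespace tokens:

```python
def tokens_with_offsets(text):
    """Whitespace-tokenize and record each token's [start, end) character
    offsets -- the values this filter would attach as payloads."""
    out, search_from = [], 0
    for tok in text.split():
        start = text.index(tok, search_from)
        end = start + len(tok)
        out.append((tok, start, end))
        search_from = end
    return out
```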

h2. Trim Filter

This filter trims leading and/or trailing whitespace from tokens.  Most tokenizers break tokens at whitespace, so this filter is most often used for special situations.

*Factory class:* {{solr.TrimFilterFactory}}

*Arguments:* None

*Example:*

The PatternTokenizerFactory configuration used here splits the input on simple commas; it does not remove whitespace.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
  <filter class="solr.TrimFilterFactory"/>
</analyzer>
{code}

*In:* "one, two , three ,four "

*Tokenizer to Filter:* "one", " two ", " three ", "four "

*Out:* "one", "two", "three", "four"
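The whole example reduces to a split followed by a strip; a minimal Python sketch of the pipeline (illustration only):

```python
def pattern_tokenize(text, pattern=","):
    """Stand-in for PatternTokenizerFactory with pattern=",": split only,
    leaving surrounding whitespace on the tokens."""
    return text.split(pattern)

def trim_filter(tokens):
    """Strip leading and trailing whitespace from each token."""
    return [t.strip() for t in tokens]
```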

h2. Type As Payload Filter

This filter adds the token's type, as an encoded byte sequence, as its payload.

*Factory class:* {{solr.TypeAsPayloadTokenFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.TypeAsPayloadTokenFilterFactory"/>
</analyzer>
{code}

*In:* "Pay Bob's I.O.U."

*Tokenizer to Filter:* "Pay", "Bob's", "I.O.U."

*Out:* "Pay"\[<ALPHANUM>\], "Bob's"\[<APOSTROPHE>\], "I.O.U."\[<ACRONYM>\]

h2. Type Token Filter

This filter blacklists or whitelists a specified list of token types, assuming the tokens have type metadata associated with them. For example, the [UAX29 URL Email Tokenizer|Tokenizers#UAX29 URL Email Tokenizer] emits "<URL>" and "<EMAIL>" typed tokens, as well as other types. This filter would allow you to pull out only e-mail addresses from text as tokens, if you wish.

*Factory class:* {{solr.TypeTokenFilterFactory}}

*Arguments:*

{{types}}: Defines the location of a file of types to filter.

{{useWhitelist}}: If *true*, the file defined in {{types}} is used as an include list. If *false*, or undefined, the file defined in {{types}} is used as a blacklist.

{note}
As of Solr 4.4, the {{enablePositionIncrements}} argument is no longer supported.
{note}

*Example:*

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
   <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt" useWhitelist="true"/>
</analyzer>
{code}

h2. Word Delimiter Filter

This filter splits tokens at word delimiters. Delimiters are determined by the following rules:

* A change in case within a word: "CamelCase" *\->* "Camel", "Case". This can be disabled by setting {{splitOnCaseChange="0"}}.

* A transition from alpha to numeric characters or vice versa: "Gonzo5000" *\->* "Gonzo", "5000"; "4500XL" *\->* "4500", "XL". This can be disabled by setting {{splitOnNumerics="0"}}.

* Non-alphanumeric characters (discarded): "hot-spot" *\->* "hot", "spot"

* A trailing "'s" is removed: "O'Reilly's" *\->* "O", "Reilly"

* Any leading or trailing delimiters are discarded: "\-\-hot-spot\-\-" *\->* "hot", "spot"

*Factory class:* {{solr.WordDelimiterFilterFactory}}

*Arguments:*

{{generateWordParts}}: (integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" *\->* "Camel", "Case", "hot", "spot"

{{generateNumberParts}}: (integer, default 1) If non-zero, splits numeric strings at delimiters: "1947-32" *\->* "1947", "32"

{{splitOnCaseChange}}: (integer, default 1) If 0, words are not split on camel-case changes: "BugBlaster-XL" *\->* "BugBlaster", "XL". Example 1 below illustrates the default (non-zero) splitting behavior.

{{splitOnNumerics}}: (integer, default 1) If 0, don't split words on transitions from alpha to numeric: "FemBot3000" *\->* "Fem", "Bot3000"

{{catenateWords}}: (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor's" *\->* "hotspotsensor"

{{catenateNumbers}}: (integer, default 0) If non-zero, maximal runs of number parts will be joined: "1947-32" *\->* "194732"

{{catenateAll}}: (0/1, default 0) If non-zero, runs of word and number parts will be joined: "Zap-Master-9000" *\->* "ZapMaster9000"

{{preserveOriginal}}: (integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" *\->* "Zap-Master-9000", "Zap", "Master", "9000"

{{protected}}: (optional) The pathname of a file that contains a list of protected words that should be passed through without splitting.

{{stemEnglishPossessive}}: (integer, default 1) If 1, strips the possessive "'s" from each subword.

*Example:*

Default behavior. The whitespace tokenizer is used here to preserve non-alphanumeric characters.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"/>
</analyzer>
{code}

*In:* "hot-spot  RoboBlaster/9000 100XL"

*Tokenizer to Filter:* "hot-spot", "RoboBlaster/9000", "100XL"

*Out:* "hot", "spot", "Robo", "Blaster", "9000", "100", "XL"
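The default splitting behavior can be approximated in Python (a simplified sketch, not Solr's implementation; it ignores catenation, possessive stripping, and position handling):

```python
import re

# Zero-width split points: lower-to-UPPER case changes, and alpha<->digit
# transitions (each can be disabled, mirroring splitOnCaseChange and
# splitOnNumerics).
_CASE = r'(?<=[a-z])(?=[A-Z])'
_NUM = r'(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])'

def word_delimiter(token, split_on_case_change=True, split_on_numerics=True):
    """Split on non-alphanumeric characters, then on intra-word transitions.
    Requires Python 3.7+ (re.split on zero-width patterns)."""
    rules = []
    if split_on_case_change:
        rules.append(_CASE)
    if split_on_numerics:
        rules.append(_NUM)
    out = []
    for part in re.split(r'[^A-Za-z0-9]+', token):
        if not part:
            continue  # leading/trailing delimiters are discarded
        out.extend(re.split('|'.join(rules), part) if rules else [part])
    return out
```

Applied to the tokens in the example above, this yields the same word and number parts.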

*Example:*

Do not split on case changes, and do not generate number parts. Note that by not generating number parts, tokens containing only numeric parts are ultimately discarded.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateNumberParts="0" splitOnCaseChange="0"/>
</analyzer>
{code}

*In:* "hot-spot  RoboBlaster/9000 100-42"

*Tokenizer to Filter:* "hot-spot", "RoboBlaster/9000", "100-42"

*Out:* "hot", "spot", "RoboBlaster", "9000"

*Example:*

Concatenate word parts and number parts, but not word and number parts that occur in the same token.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateNumbers="1"/>
</analyzer>
{code}

*In:* "hot-spot 100+42 XL40"

*Tokenizer to Filter:* "hot-spot"(1), "100+42"(2), "XL40"(3)

*Out:* "hot"(1), "spot"(2), "hotspot"(2), "100"(3), "42"(4), "10042"(4),  "XL"(5),  "40"(6)

*Example:*

Concatenate all. Word and/or number parts are joined together.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" catenateAll="1"/>
</analyzer>
{code}

*In:* "XL-4000/ES"

*Tokenizer to Filter:* "XL-4000/ES"(1)

*Out:* "XL"(1), "4000"(2), "ES"(3), "XL4000ES"(3)

*Example:*

Using a protected words list that contains "AstroBlaster" and "XL-5000" (among others).

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"/>
</analyzer>
{code}

*In:* "FooBar AstroBlaster XL-5000 ==ES-34-"

*Tokenizer to Filter:* "FooBar", "AstroBlaster", "XL-5000", "==ES-34-"

*Out:* "Foo", "Bar", "AstroBlaster", "XL-5000", "ES", "34"

h2. Related Topics

* [TokenFilterFactories|http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#TokenFilterFactories]

