Posted to commits@lucene.apache.org by "Cassandra Targett (Confluence)" <co...@apache.org> on 2013/09/27 17:56:00 UTC
[CONF] Apache Solr Reference Guide > Running Your Analyzer

Space: Apache Solr Reference Guide (https://cwiki.apache.org/confluence/display/solr)
Page: Running Your Analyzer (https://cwiki.apache.org/confluence/display/solr/Running+Your+Analyzer)

Change Comment:
---------------------------------------------------------------------
remove sentence with reference to old removed section

Edited by Cassandra Targett:
---------------------------------------------------------------------
Once you've defined a field type in {{schema.xml}} and specified the analysis steps that you want applied to it, you should test it out to make sure that it behaves the way you expect it to. Luckily, there is a very handy page in the Solr [admin interface|Using the Solr Administration User Interface] that lets you do just that. You can invoke the analyzer for any text field, provide sample input, and display the resulting token stream.

For example, assume that the following field type definition has been added to {{schema.xml}}:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldType name="mytextfield" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.HyphenatedWordsFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
{code}

The objective here (during indexing) is to reconstruct hyphenated words, which may have been split across lines in the text, then to set all words to lowercase. For queries, you want to skip the de-hyphenation step.
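
To pick a field rather than a field type in the steps that follow, some field has to reference the type by name. A minimal sketch, assuming a hypothetical field called {{mytext}} (the field name and the {{indexed}}/{{stored}} settings are illustrative and are not part of the definition above):

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<!-- Hypothetical field declaration; "mytext" is an assumed name -->
<field name="mytext" type="mytextfield" indexed="true" stored="true"/>
{code}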

To test this out, point your browser at the [solr:Analysis Screen] of the Solr Admin Web interface. By default, this will be at the following URL (adjust the hostname and/or port to match your configuration): [http://localhost:8983/solr/#/collection1/analysis]. You should see a page like this.

!analysis.png|border=1!
_Empty Analysis screen_
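
The Analysis screen gets its results from Solr's field analysis request handler, so a screen that fails to load results may point to a missing handler registration. As a sketch, assuming your {{solrconfig.xml}} follows the example configuration shipped with Solr 4.x, the registration looks roughly like this (check your own file, since it may differ):

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<!-- Assumed to match the stock example solrconfig.xml; verify against your own configuration -->
<requestHandler name="/analysis/field"
                startup="lazy"
                class="solr.FieldAnalysisRequestHandler"/>
{code}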

We want to test the field type definition for "mytextfield", defined above. The drop-down labeled "Analyse Fieldname/FieldType" allows choosing the field or field type to use for the analysis.

There are two "Field Value" boxes, one for how text will be analyzed during indexing and a second for how text will be analyzed for query processing. In the "Field Value (Index)" box, enter some sample text ("Super-computer" in this example) to be processed by the analyzer. We will leave the query field value empty for now.

The result we expect is that {{HyphenatedWordsFilter}} will join the hyphenated pair "Super-" and "computer" into the single word "Supercomputer", and then {{LowerCaseFilter}} will set it to "supercomputer". Let's see what happens:

!analysis-supercomputer-verbose.png|border=1!\\
_Running index-time analyzer, verbose output._

The result is two distinct tokens rather than the one we expected. What went wrong? Looking at the first token that came out of {{StandardTokenizer}}, we can see the trailing hyphen has been stripped off of "Super-". Checking the documentation for {{StandardTokenizer}}, we see that it treats all punctuation characters as delimiters and discards them. What we really want in this case is a whitespace tokenizer that will preserve the hyphen character when it breaks the text into tokens.

Let's make this change and try again:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldType name="mytextfield" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.HyphenatedWordsFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
{code}


Re-submitting the form by clicking "Analyse Values" again, we see the result in the screen shot below.

!analysis-supercomputer-verbose2.png|border=1!\\
_Using WhitespaceTokenizer, expected results._

That's more like it. Because the whitespace tokenizer preserved the trailing hyphen on the first token, {{HyphenatedWordsFilter}} was able to reconstruct the hyphenated word and pass it on to {{LowerCaseFilter}}, which set the capital letters to lowercase.

Now let's see what happens when invoking the analyzer for query processing. For query terms, we don't want to do de-hyphenation and we _do_ want to discard punctuation, so let's try the same input against the query analyzer. We'll copy the same text to the "Field Value (Query)" box and clear the one for index analysis. We'll also include the full, unhyphenated word as another term to make sure it is converted to lowercase as we expect. Submitting again yields these results:

!analysis-query-verbose.png|border=1!\\
_Query-time analyzer, good results._

We can see that for queries the analyzer behaves the way we want it to. Punctuation is stripped out, {{HyphenatedWordsFilter}} doesn't run, and we wind up with the three tokens we expected.

{scrollbar}

