Posted to dev@lucene.apache.org by "Malte Hübner (JIRA)" <ji...@apache.org> on 2014/03/27 09:51:14 UTC

[jira] [Created] (SOLR-5921) WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)

Malte Hübner created SOLR-5921:
----------------------------------

             Summary: WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)
                 Key: SOLR-5921
                 URL: https://issues.apache.org/jira/browse/SOLR-5921
             Project: Solr
          Issue Type: Bug
          Components: Schema and Analysis
    Affects Versions: 4.7
            Reporter: Malte Hübner
             Fix For: 4.7.1
         Attachments: 2014-03-27 09_50_33-Solr Admin.png

WordDelimiterFilterFactory generates word parts although all splitting options are deactivated.

This is the fieldType setup:

{code}
		<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
			<analyzer type="index">
				<tokenizer class="solr.WhitespaceTokenizerFactory" />
				<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
				<filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="0" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/>
				<filter class="solr.LowerCaseFilterFactory" />
			</analyzer>
			<analyzer type="query">
				<tokenizer class="solr.WhitespaceTokenizerFactory" />
				<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de.txt" ignoreCase="true" expand="true" />
				<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
				<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"  preserveOriginal="1"/>
				<filter class="solr.LowerCaseFilterFactory" />
			</analyzer>
		</fieldType>
{code}

The given search term is: *X-002-99-495*

WordDelimiterFilterFactory indexes the following word parts:

* X-002-99-495
* X
* 00299495
* X00299495

But the 'X' should not be indexed or queried as a single term. You can see that splitting is completely deactivated in the schema: generateWordParts, generateNumberParts, splitOnCaseChange and splitOnNumerics are all set to 0.

I can move the character part around in the search term:

Searching for *002-abc-99-495* gives me

* 002-abc-99-495
* 002
* abc
* 99495
* 002abc99495

This is not what I expect from the configuration! I think this must be a bug.
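
To narrow this down, the filter could be tested on its own. The following is only a sketch: the fieldType name "text_wdf_repro" is made up for illustration and is not part of my schema. It keeps the WhitespaceTokenizerFactory and the WordDelimiterFilterFactory with the same attributes as the index analyzer above and drops the stop word, synonym and lowercase filters, so that any extra terms shown on the Analysis screen of the Solr Admin can only come from the WordDelimiterFilterFactory.

{code}
		<!-- Hypothetical reproduction type; the name is an assumption, not from the original schema -->
		<fieldType name="text_wdf_repro" class="solr.TextField" positionIncrementGap="100">
			<analyzer>
				<tokenizer class="solr.WhitespaceTokenizerFactory" />
				<!-- Same attributes as the index analyzer of text_de -->
				<filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="0" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/>
			</analyzer>
		</fieldType>
{code}

Setting catenateWords and catenateNumbers to 0 one at a time in this type might show which of the remaining options is responsible for the additional terms.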






