Posted to dev@lucene.apache.org by "Malte Hübner (JIRA)" <ji...@apache.org> on 2014/03/27 16:15:16 UTC

[jira] [Commented] (SOLR-5921) WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)

    [ https://issues.apache.org/jira/browse/SOLR-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13949438#comment-13949438 ] 

Malte Hübner commented on SOLR-5921:
------------------------------------

Erik, it would be nice if you could explain what you mean by "Hyphens are not disabled". The problem here is that Solr only "sometimes" splits hyphenated terms. I think this *must* be a bug, as I have analysed this behaviour in detail.

Why not set the priority to "Critical" and leave the bug report open?

> WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-5921
>                 URL: https://issues.apache.org/jira/browse/SOLR-5921
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 4.7
>            Reporter: Malte Hübner
>            Priority: Blocker
>             Fix For: 4.7.1
>
>         Attachments: 2014-03-27 09_50_33-Solr Admin.png, 2014-03-27 10_43_24-Solr Admin.png
>
>
> WordDelimiterFilterFactory generates word parts although splitting configuration is deactivated.
> *This is the fieldType setup from my schema:*
> {code}
> 		<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> 			<analyzer type="index">
> 				<tokenizer class="solr.WhitespaceTokenizerFactory" />
> 				<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
> 				<filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="0" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/>
> 				<filter class="solr.LowerCaseFilterFactory" />
> 			</analyzer>
> 			<analyzer type="query">
> 				<tokenizer class="solr.WhitespaceTokenizerFactory" />
> 				<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de.txt" ignoreCase="true" expand="true" />
> 				<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
> 				<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"  preserveOriginal="1"/>
> 				<filter class="solr.LowerCaseFilterFactory" />
> 			</analyzer>
> 		</fieldType>
> {code}
> The given search term is: *X-002-99-495*
> WordDelimiterFilterFactory indexes the following word parts:
> * X-002-99-495
> * X (shouldn't be there)
> * 00299495 (shouldn't be there)
> * X00299495
> But the 'X' should not be indexed or queried as a single term. You can see that splitting is completely deactivated in the schema.
> I can move the character part around in the search term:
> Searching for *002-abc-99-495* gives me
> * 002-abc-99-495 
> * 002 (shouldn't be there)
> * abc (shouldn't be there)
> * 99495 (shouldn't be there)
> * 002abc99495
> Even if the term has the following content (F00-22-761), WDF splits it up:
> * F00-22-761
> * F00  (shouldn't be there)
> * 22761  (shouldn't be there)
> * F0022761
> Please have a look at the screenshot.
> This is not what I expect from the configuration! I think this must be a bug.
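
The three examples in the description follow a regular pattern, which the rough sketch below tries to make explicit. It assumes (this is not confirmed anywhere in this thread) that catenateWords="1" and catenateNumbers="1" join each run of adjacent word parts or number parts even when generateWordParts and generateNumberParts are 0, and that a mixed part such as F00 counts as a word part when splitOnNumerics is 0. The helper model_catenation is hypothetical and is not Lucene or Solr code. It reproduces only the tokens marked "shouldn't be there"; it does not account for the preserved original or for the fully concatenated token.

```python
import re


def model_catenation(term, catenate_words=True, catenate_numbers=True):
    """Simplified, hypothetical model of the reported sub-tokens.

    Splits the term on non-alphanumeric delimiters, classifies each part as
    "number" (digits only) or "word" (anything else), and joins every maximal
    run of adjacent parts of the same kind. This is an assumption about the
    observed behaviour, not the actual WordDelimiterFilter implementation.
    """
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", term) if p]

    runs = []  # list of (kind, joined_text)
    for part in parts:
        kind = "number" if part.isdigit() else "word"
        if runs and runs[-1][0] == kind:
            # Same kind as the previous part: extend the current run.
            runs[-1] = (kind, runs[-1][1] + part)
        else:
            runs.append((kind, part))

    tokens = []
    for kind, text in runs:
        if kind == "word" and catenate_words:
            tokens.append(text)
        elif kind == "number" and catenate_numbers:
            tokens.append(text)
    return tokens


if __name__ == "__main__":
    for term in ("X-002-99-495", "002-abc-99-495", "F00-22-761"):
        print(term, model_catenation(term))
```

Under these assumptions the model returns X and 00299495 for X-002-99-495, which matches the two unexpected tokens listed for that term. With both catenate flags switched off it returns no sub-tokens at all.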



--
This message was sent by Atlassian JIRA
(v6.2#6252)
