Posted to issues@lucene.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2019/10/18 16:15:00 UTC

[jira] [Commented] (LUCENE-9009) Support for keyword protect when using ICUFoldingFilter and KeywordRepetFilter

    [ https://issues.apache.org/jira/browse/LUCENE-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954731#comment-16954731 ] 

David Smiley commented on LUCENE-9009:
--------------------------------------

It's a shame to see these common options implemented ad hoc across some, but not all, of our token filters.  It would be nice if *any* filter could be made conditional based on a provided Predicate that might look at any Attribute.  At the factory level, there might be a special attribute like "unless", which could take special values like "keyword" or a file/resource pointing to a word list.

I realize my comment deserves its own issue, but I wanted to throw this out here first.
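
To make the idea above concrete, here is a minimal, self-contained sketch of a filter made conditional on a Predicate. It deliberately does not use Lucene's classes: `Token`, `unless`, and `apply` are hypothetical names invented for illustration, and upper-casing merely stands in for any filter that has no keyword support of its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

final class ConditionalFilterSketch {

    // Hypothetical stand-in for a token and its keyword flag; not a Lucene class.
    static final class Token {
        final String term;
        final boolean keyword;

        Token(String term, boolean keyword) {
            this.term = term;
            this.keyword = keyword;
        }
    }

    // Wraps a term transformation so that it is skipped for tokens matching skipIf.
    static UnaryOperator<Token> unless(Predicate<Token> skipIf, UnaryOperator<String> transform) {
        return t -> skipIf.test(t) ? t : new Token(transform.apply(t.term), t.keyword);
    }

    // Applies one filter to every token, preserving order.
    static List<Token> apply(List<Token> tokens, UnaryOperator<Token> filter) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(filter.apply(t));
        }
        return out;
    }

    public static void main(String[] args) {
        // "unless keyword": the wrapped transformation never sees keyword tokens.
        UnaryOperator<Token> upperUnlessKeyword =
            unless(t -> t.keyword, s -> s.toUpperCase(Locale.ROOT));

        // "gro\u0161" is "groš", written with an escape to stay encoding-independent.
        List<Token> in = Arrays.asList(new Token("gro\u0161", true), new Token("gro\u0161", false));
        for (Token t : apply(in, upperUnlessKeyword)) {
            System.out.println(t.term + " keyword=" + t.keyword);
        }
    }
}
```

The point of the sketch is that the keyword check lives in one wrapper rather than inside each filter; a word-list condition would simply be a different Predicate passed to the same wrapper.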

> Support for keyword protect when using ICUFoldingFilter and KeywordRepetFilter
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-9009
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9009
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 7.7.2
>            Reporter: Profimedia
>            Priority: Major
>              Labels: ICUFoldingFilterFactory
>
> It would be great to support keyword protection when KeywordRepeatFilter is used, like the implementation in PorterStemFilter.
> {code:java}
> @Override
>   public final boolean incrementToken() throws IOException {
>     if (!input.incrementToken())
>       return false;
>     if ((!keywordAttr.isKeyword()) && stemmer.stem(termAtt.buffer(), 0, termAtt.length()))
>       termAtt.copyBuffer(stemmer.getResultBuffer(), 0, stemmer.getResultLength());
>     return true;
>   }
> {code}
>  
> Scenario:
> We are analyzing words with accents, for example "groš", and we would like search to behave as follows:
> 1. Search for all items that have "groš" or "gros" in some field.
> 2. If the search phrase is exactly "groš", rank items containing "groš" above items containing "gros" (via a higher score). That is why we use KeywordProtectFilter.
>  
> Example field definition:
> {code:xml}
> <fieldType name="some_text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.KeywordRepeatFilterFactory" />
>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>       <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>       <filter class="solr.HunspellStemFilterFactory"
> 					dictionary="lang/our_en_US.dic"
> 					affix="lang/our_en_US.aff"
> 					ignoreCase="true"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>   </analyzer>
>   <analyzer type="query">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.KeywordRepeatFilterFactory" />
>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>       <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>       <filter class="solr.HunspellStemFilterFactory"
> 					dictionary="lang/our_en_US.dic"
> 					affix="lang/our_en_US.aff"
> 					ignoreCase="true" />
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>   </analyzer>
> </fieldType>
> {code}
>  
> The output of both the query and index analyzers should look like this: 
> |text|groš|gros|
> |raw_bytes|[67 72 6f c5 a1]|[67 72 6f 73]|
> |start|0|0|
> |end|4|4|
> |positionLength|1|1|
> |type|<ALPHANUM>|<ALPHANUM>|
> |termFrequency|1|1|
> |position|1|1|
> |keyword|true|false|
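
The requested behavior can be modeled in a few lines of plain Java. This is only a sketch under stated assumptions, not the Lucene implementation: `Token`, `fold`, `analyze`, and `hex` are hypothetical helpers, accent folding is approximated with `java.text.Normalizer` (ICUFoldingFilter does considerably more), and stemming is left out entirely. Each term is emitted twice, only the non-keyword copy is folded, and a copy that folding leaves unchanged is dropped as a duplicate.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KeywordRepeatFoldingSketch {

    // Hypothetical stand-in for a token and its keyword flag; not a Lucene class.
    static final class Token {
        final String term;
        final boolean keyword;

        Token(String term, boolean keyword) {
            this.term = term;
            this.keyword = keyword;
        }
    }

    // Simplified accent folding: decompose (NFD), then drop combining marks.
    static String fold(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Emits the protected original first, then the folded copy if it differs.
    static List<Token> analyze(List<String> terms) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, true));        // keyword copy, left untouched
            String folded = fold(term);
            if (!folded.equals(term)) {            // models duplicate removal
                out.add(new Token(folded, false)); // non-keyword copy, folded
            }
        }
        return out;
    }

    // Formats the UTF-8 bytes of a term as space-separated lowercase hex.
    static String hex(String term) {
        StringBuilder sb = new StringBuilder();
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "gro\u0161" is "groš", written with an escape to stay encoding-independent.
        for (Token t : analyze(Arrays.asList("gro\u0161"))) {
            System.out.println(t.term + " [" + hex(t.term) + "] keyword=" + t.keyword);
        }
    }
}
```

With both terms at the same position, a query for "groš" matches the exact term as well as the folded one, which is what allows exact matches to score higher.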


