Posted to issues@lucene.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2019/10/18 16:15:00 UTC
[jira] [Commented] (LUCENE-9009) Support for keyword protect when
using ICUFoldingFilter and KeywordRepetFilter
[ https://issues.apache.org/jira/browse/LUCENE-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954731#comment-16954731 ]
David Smiley commented on LUCENE-9009:
--------------------------------------
It's a shame to see these common options implemented ad hoc in some but not all of our token filters. It would be nice if *any* filter could be made conditional on a provided Predicate that might look at any Attribute. At the factory level, there might be a special attribute like "unless", which could take special values like "keyword" or a file/resource containing a word list.
I realize my comment deserves its own issue but wanted to throw this out here first.
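To make the idea concrete, here is a minimal sketch that deliberately avoids the Lucene API. Everything in it is hypothetical and chosen for illustration only: the `Token` class stands in for a term plus its keyword attribute, `foldAccents` is a rough stand-in for accent folding (it only strips combining marks and is not what ICUFoldingFilter does in full), and `filterUnless` plays the role of the proposed "unless" option.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

final class ConditionalFilterSketch {

    // Hypothetical, Lucene-free model of a token: its text plus a keyword flag.
    static final class Token {
        final String text;
        final boolean keyword;

        Token(String text, boolean keyword) {
            this.text = text;
            this.keyword = keyword;
        }
    }

    // Stand-in for accent folding: decompose to NFD, then drop combining marks.
    static String foldAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Applies `transform` to every token except those matching `unless`,
    // which pass through untouched. The keyword flag is always preserved.
    static List<Token> filterUnless(List<Token> in,
                                    Predicate<Token> unless,
                                    UnaryOperator<String> transform) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(unless.test(t) ? t : new Token(transform.apply(t.text), t.keyword));
        }
        return out;
    }
}
```

With a predicate such as `t -> t.keyword`, a keyword-marked token keeps its accents while an unmarked copy of the same term is folded. A word-list variant would only need a different predicate, for example one backed by a `Set<String>`.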
> Support for keyword protect when using ICUFoldingFilter and KeywordRepetFilter
> ------------------------------------------------------------------------------
>
> Key: LUCENE-9009
> URL: https://issues.apache.org/jira/browse/LUCENE-9009
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 7.7.2
> Reporter: Profimedia
> Priority: Major
> Labels: ICUFoldingFilterFactory
>
> It would be great if ICUFoldingFilter supported keyword protection when KeywordRepeatFilter is used, like the implementation in PorterStemFilter.
> {code:java}
> @Override
> public final boolean incrementToken() throws IOException {
>   if (!input.incrementToken())
>     return false;
>   if ((!keywordAttr.isKeyword()) && stemmer.stem(termAtt.buffer(), 0, termAtt.length()))
>     termAtt.copyBuffer(stemmer.getResultBuffer(), 0, stemmer.getResultLength());
>   return true;
> }
> {code}
>
> Scenario:
> We analyze words with accents, for example "groš", and we would like searching to work like this:
> 1. Search for all items that have "groš" or "gros" in some field.
> 2. If the search phrase is exactly "groš", we want items containing "groš" to rank ahead of items containing "gros" (by higher score). Therefore we use KeywordProtectFilter.
>
> Example field definition:
> {code:xml}
> <fieldType name="some_text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.KeywordRepeatFilterFactory" />
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>     <filter class="solr.HunspellStemFilterFactory"
>             dictionary="lang/our_en_US.dic"
>             affix="lang/our_en_US.aff"
>             ignoreCase="true"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.KeywordRepeatFilterFactory" />
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>     <filter class="solr.HunspellStemFilterFactory"
>             dictionary="lang/our_en_US.dic"
>             affix="lang/our_en_US.aff"
>             ignoreCase="true" />
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>   </analyzer>
> </fieldType>
> {code}
>
> The result of query and index analyzers should be like this:
> |text|groš|gros|
> |raw_bytes|[67 72 6f c5 a1]|[67 72 6f c5 a1]|
> |start|0|0|
> |end|4|4|
> |positionLength|1|1|
> |type|<ALPHANUM>|<ALPHANUM>|
> |termFrequency|1|1|
> |position|1|1|
> |keyword|true|false|
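
The token table above (a keyword-marked original and a folded copy at the same position) can be modelled without Lucene as a repeat, fold, de-duplicate pipeline. This is only a sketch under stated assumptions: `RepeatFoldSketch`, `Tok`, `fold`, and `analyze` are hypothetical names, `fold` merely strips combining marks rather than reproducing ICUFoldingFilter, and stemming is left out.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RepeatFoldSketch {

    // Hypothetical token model: term text plus keyword flag.
    static final class Tok {
        final String text;
        final boolean keyword;

        Tok(String text, boolean keyword) {
            this.text = text;
            this.keyword = keyword;
        }
    }

    // Rough stand-in for folding: decompose to NFD and drop combining marks.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Models a single position: emit the term twice (keyword first), fold only
    // the non-keyword copy, then drop a copy whose text was already emitted.
    static List<Tok> analyze(String term) {
        List<Tok> repeated = List.of(new Tok(term, true), new Tok(term, false));
        Set<String> seen = new HashSet<>();
        List<Tok> out = new ArrayList<>();
        for (Tok t : repeated) {
            String text = t.keyword ? t.text : fold(t.text);
            if (seen.add(text)) {
                out.add(new Tok(text, t.keyword));
            }
        }
        return out;
    }
}
```

In this model, an accented term yields two tokens (the protected original and its folded form), while an unaccented term collapses to a single token because the folded copy is a duplicate.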