Posted to issues@lucene.apache.org by "Profimedia (Jira)" <ji...@apache.org> on 2019/10/16 07:39:00 UTC

[jira] [Created] (LUCENE-9009) Support for keyword protect when using ICUFoldingFilter and KeywordRepetFilter

Profimedia created LUCENE-9009:
----------------------------------

             Summary: Support for keyword protect when using ICUFoldingFilter and KeywordRepetFilter
                 Key: LUCENE-9009
                 URL: https://issues.apache.org/jira/browse/LUCENE-9009
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/analysis
    Affects Versions: 7.7.2
            Reporter: Profimedia


It would be great if ICUFoldingFilter supported keyword protection when KeywordRepeatFilter is used, similar to the implementation in PorterStemFilter:
{code:java}
  @Override
  public final boolean incrementToken() throws IOException {
    if (!input.incrementToken())
      return false;

    // Stem only tokens that are not marked as keywords; keyword tokens pass through unchanged.
    if ((!keywordAttr.isKeyword()) && stemmer.stem(termAtt.buffer(), 0, termAtt.length()))
      termAtt.copyBuffer(stemmer.getResultBuffer(), 0, stemmer.getResultLength());
    return true;
  }
{code}
 

Scenario:

We analyze words with accents, for example "groš", and we would like searching to work like this:

1. Search for all items that have "groš" or "gros" in some field.

2. If the search phrase is exactly "groš", prefer items containing "groš" over items containing "gros" (by giving them a higher score).
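
To illustrate why keeping both the original and the folded term would give this preference, here is a toy model in plain Java. It is only a sketch under stated assumptions: `AccentPreferenceDemo`, `fold`, `terms` and `score` are hypothetical names, the folding is a simplified stand-in for ICUFoldingFilter, and the score is just a count of shared terms rather than Lucene's actual similarity.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

public class AccentPreferenceDemo {

    /** Simplified accent folding (NFD, then strip combining marks); not the real ICU folding. */
    static String fold(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    /** Terms assumed to be emitted for one word: the original form and its folded form. */
    static Set<String> terms(String word) {
        Set<String> out = new LinkedHashSet<>();
        out.add(word);
        out.add(fold(word));
        return out;
    }

    /** Toy score: number of query terms that also occur among the document's terms. */
    static int score(String queryWord, String documentWord) {
        Set<String> matches = new LinkedHashSet<>(terms(queryWord));
        matches.retainAll(terms(documentWord));
        return matches.size();
    }

    public static void main(String[] args) {
        String accented = "gro\u0161"; // "groš", escaped to avoid source-encoding issues
        System.out.println(score(accented, accented)); // document contains "groš"
        System.out.println(score(accented, "gros"));   // document contains "gros"
        System.out.println(score("gros", accented));   // plain query still finds "groš"
    }
}
```

In this model a query for "groš" shares two terms with a document containing "groš" but only one with a document containing "gros", while a query for "gros" still matches both documents.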

 

Example field definition:
{code:xml}
<fieldType name="some_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.KeywordRepeatFilterFactory" />
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>

      <filter class="solr.HunspellStemFilterFactory"
              dictionary="lang/our_en_US.dic"
              affix="lang/our_en_US.aff"
              ignoreCase="true"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
  <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.KeywordRepeatFilterFactory" />
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
      <filter class="solr.HunspellStemFilterFactory"
              dictionary="lang/our_en_US.dic"
              affix="lang/our_en_US.aff"
              ignoreCase="true"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>
{code}
 

Both the query and index analyzers should produce this result:
||text||raw_bytes||start||end||positionLength||type||termFrequency||position||keyword||
|groš|[67 72 6f c5 a1]|0|4|1|<ALPHANUM>|1|1|true|
|gros|[67 72 6f 73]|0|4|1|<ALPHANUM>|1|1|false|
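
The requested behaviour can be modelled without Lucene as a rough sketch. Everything here is an assumption made for illustration: `KeywordAwareFoldingDemo` and its `Token` class are hypothetical stand-ins for a token stream with a keyword flag, the folding is a simplified NFD-based approximation of ICUFoldingFilter, and the KeywordMarkerFilter and Hunspell steps of the chain are left out.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordAwareFoldingDemo {

    /** Hypothetical stand-in for a Lucene token: term text plus its keyword flag. */
    static final class Token {
        final String term;
        final boolean keyword;

        Token(String term, boolean keyword) {
            this.term = term;
            this.keyword = keyword;
        }

        @Override
        public String toString() {
            return term + "(keyword=" + keyword + ")";
        }
    }

    /** Models KeywordRepeatFilter: emit the term once as keyword, once as non-keyword. */
    static List<Token> keywordRepeat(String term) {
        List<Token> out = new ArrayList<>();
        out.add(new Token(term, true));
        out.add(new Token(term, false));
        return out;
    }

    /** Simplified accent folding (NFD, then strip combining marks); not the real ICU folding. */
    static String fold(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    /** The requested behaviour: fold only tokens that are not marked as keywords. */
    static List<Token> foldUnlessKeyword(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.keyword ? t : new Token(fold(t.term), false));
        }
        return out;
    }

    /** Models RemoveDuplicatesTokenFilter for tokens that all share one position. */
    static List<Token> removeDuplicates(List<Token> tokens) {
        Set<String> seen = new HashSet<>();
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            if (seen.add(t.term)) {
                out.add(t);
            }
        }
        return out;
    }

    static List<Token> analyze(String term) {
        return removeDuplicates(foldUnlessKeyword(keywordRepeat(term)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("gro\u0161")); // "groš", escaped to avoid source-encoding issues
        System.out.println(analyze("gros"));
    }
}
```

For "groš" the model keeps the accented keyword token and adds the folded non-keyword token, which corresponds to the two rows in the table. For "gros" the two copies are identical after folding, so only one token survives duplicate removal.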



