Posted to issues@lucene.apache.org by "Profimedia (Jira)" <ji...@apache.org> on 2019/10/16 07:43:00 UTC
[jira] [Updated] (LUCENE-9009) Support for keyword protect when using ICUFoldingFilter and KeywordRepetFilter

     [ https://issues.apache.org/jira/browse/LUCENE-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Profimedia updated LUCENE-9009:
-------------------------------
    Description: 
It would be great if ICUFoldingFilter supported keyword protection when KeywordRepeatFilter is used, similar to the implementation in PorterStemFilter:
{code:java}
@Override
public final boolean incrementToken() throws IOException {
  if (!input.incrementToken()) {
    return false;
  }

  // Stem only tokens that are not marked as keywords.
  if (!keywordAttr.isKeyword() && stemmer.stem(termAtt.buffer(), 0, termAtt.length())) {
    termAtt.copyBuffer(stemmer.getResultBuffer(), 0, stemmer.getResultLength());
  }
  return true;
}
{code}
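
A minimal sketch of the same guard applied to folding. It is a plain Java model, not the Lucene API: the names KeywordAwareFoldingSketch, Token, fold and foldUnlessKeyword are hypothetical, and java.text.Normalizer is only a rough stand-in for what ICUFoldingFilter does.

```java
import java.text.Normalizer;
import java.util.Locale;

final class KeywordAwareFoldingSketch {

  // Simplified token: the term text plus the flag that KeywordAttribute carries in Lucene.
  static final class Token {
    final String text;
    final boolean keyword;

    Token(String text, boolean keyword) {
      this.text = text;
      this.keyword = keyword;
    }
  }

  // Rough approximation of folding: decompose, drop combining marks, lowercase.
  static String fold(String s) {
    String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
  }

  // Same guard as in PorterStemFilter: keyword tokens pass through unchanged.
  static Token foldUnlessKeyword(Token in) {
    if (in.keyword) {
      return in;
    }
    return new Token(fold(in.text), false);
  }
}
```

With this guard, a token "groš" marked as a keyword keeps its accent, while the non-keyword copy of the same token is folded to "gros".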
 

Scenario:

We are analyzing words with accents, for example "groš", and we would like search to behave as follows:

1. A search should match all items that have "groš" or "gros" in a given field.

2. If the search phrase is exactly "groš", items containing "groš" should be preferred (scored higher) over items containing "gros".
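
The ranking idea can be illustrated with a small sketch. It assumes that the analyzer keeps both the original and the folded term, and it counts matching query terms as a crude stand-in for relevance. Lucene's real scoring is more involved, so only the relative order is meaningful here. All names (OverlapScoreSketch, score, and the constants) are hypothetical.

```java
import java.util.Set;

final class OverlapScoreSketch {

  // Assumed analyzer output: the accented word yields both the original and the folded term.
  static final Set<String> QUERY_ACCENTED = Set.of("gro\u0161", "gros");
  static final Set<String> DOC_ACCENTED = Set.of("gro\u0161", "gros");
  // The unaccented word yields a single term, because both copies are identical.
  static final Set<String> DOC_PLAIN = Set.of("gros");

  // Crude relevance: the number of query terms that occur in the document.
  static int score(Set<String> queryTerms, Set<String> docTerms) {
    int hits = 0;
    for (String term : queryTerms) {
      if (docTerms.contains(term)) {
        hits++;
      }
    }
    return hits;
  }
}
```

Under these assumptions the query "groš" matches both documents, but the document containing "groš" matches two query terms and the document containing "gros" matches only one.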

 

Example field definition:
{code:xml}
<fieldType name="some_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.KeywordRepeatFilterFactory" />
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>

      <filter class="solr.HunspellStemFilterFactory"
              dictionary="lang/our_en_US.dic"
              affix="lang/our_en_US.aff"
              ignoreCase="true"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
  <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.KeywordRepeatFilterFactory" />
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
      <filter class="solr.HunspellStemFilterFactory"
              dictionary="lang/our_en_US.dic"
              affix="lang/our_en_US.aff"
              ignoreCase="true"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>
{code}
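
To make the intended token flow concrete, here is a simplified simulation of the chain in plain Java. It is a sketch under assumptions, not Lucene code: it models only the repeat, fold and de-duplication steps, leaves out KeywordMarkerFilter (protwords.txt) and the Hunspell stemmer, and takes the folding function as a parameter. The names AnalysisChainSketch, Token, repeat, foldNonKeywords, removeDuplicates and analyze are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

final class AnalysisChainSketch {

  static final class Token {
    final String text;
    final int position;
    final boolean keyword;

    Token(String text, int position, boolean keyword) {
      this.text = text;
      this.position = position;
      this.keyword = keyword;
    }
  }

  // Models KeywordRepeatFilter: each word is emitted twice at the same position,
  // first as a keyword and then as a non-keyword.
  static List<Token> repeat(List<String> words) {
    List<Token> out = new ArrayList<>();
    int position = 1;
    for (String word : words) {
      out.add(new Token(word, position, true));
      out.add(new Token(word, position, false));
      position++;
    }
    return out;
  }

  // The requested behavior: fold only the tokens that are not keywords.
  static List<Token> foldNonKeywords(List<Token> tokens, UnaryOperator<String> fold) {
    List<Token> out = new ArrayList<>();
    for (Token t : tokens) {
      out.add(t.keyword ? t : new Token(fold.apply(t.text), t.position, false));
    }
    return out;
  }

  // Models RemoveDuplicatesTokenFilter: drop a token if the same text
  // was already seen at the same position.
  static List<Token> removeDuplicates(List<Token> tokens) {
    Set<String> seen = new HashSet<>();
    List<Token> out = new ArrayList<>();
    for (Token t : tokens) {
      if (seen.add(t.position + ":" + t.text)) {
        out.add(t);
      }
    }
    return out;
  }

  static List<Token> analyze(List<String> words, UnaryOperator<String> fold) {
    return removeDuplicates(foldNonKeywords(repeat(words), fold));
  }
}
```

In this model, analyzing "groš" yields two tokens at position 1, a keyword "groš" and a non-keyword "gros", while analyzing "gros" yields a single token because both copies are identical and the second one is removed.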
 

Both the query and index analyzers should produce the following tokens:
|text|groš|gros|
|raw_bytes|[67 72 6f c5 a1]|[67 72 6f 73]|
|start|0|0|
|end|4|4|
|positionLength|1|1|
|type|<ALPHANUM>|<ALPHANUM>|
|termFrequency|1|1|
|position|1|1|
|keyword|true|false|



> Support for keyword protect when using ICUFoldingFilter and KeywordRepetFilter
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-9009
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9009
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 7.7.2
>            Reporter: Profimedia
>            Priority: Major
>              Labels: ICUFoldingFilterFactory


