Posted to solr-user@lucene.apache.org by Michael Tracey <mt...@biblio.com> on 2014/05/05 22:52:21 UTC

Turning on KeywordRepeat and RemoveDups on an existing fieldType.

As per the stemming docs ( https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming ), I want to score the original term higher than the stemmed version by adding:

   <filter class="solr.KeywordRepeatFilterFactory"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

to a field type that is already created (with Stemming). I have 100M documents in this index, and it gets slowly reindexed every month as records change.  My question is, can I add this to the existing fieldType, or do I need to make a new fieldType, and copyField the data over to it, and after it's all reindexed switch my code?  I'd rather be able to just add the lines to my fieldType because I don't think I have enough disk space on my cloud members to hold my primary fulltext field twice.

Just in case it helps, I'm running 4.4.0 and the field I'm wanting to mod looks like this:

    <fieldType name="keywordText" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="keyword_stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="keyword_stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
    </fieldType>

Thanks,

M.

Re: Turning on KeywordRepeat and RemoveDups on an existing fieldType.

Posted by Jack Krupansky <ja...@basetechnology.com>.

I haven't personally used this technique, but I gather that the intent is 
that the unstemmed term will occur in fewer documents (be more unique) than 
the stemmed term, since a number of different source terms may produce the 
same stem, and so a match on the unstemmed term should contribute more to 
the score.

To answer your question: no, you don't need a separate field or type for 
this feature. It will, however, tend to generate a lot more terms in your 
index, since any word whose stem differs from the original is indexed as 
two terms (the original and the stem) at the same position.

Only use the repeat/remove filters for the index analyzer.
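
As a rough sketch of what that could look like for the index analyzer of the 
keywordText type from the question: the repeat filter goes immediately 
before the stemmer and the remove-duplicates filter immediately after it. 
The other filters are copied unchanged from the original post. I have not 
tested this on 4.4.0, so treat the placement as an assumption to verify with 
the Analysis screen in the admin UI.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="keyword_stopwords.txt" enablePositionIncrements="true"/>
  <!-- Emit every token twice: one copy flagged as a keyword (the stemmer skips it), one normal copy -->
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  <!-- Drop the second copy when stemming left the token unchanged -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

The query analyzer stays as it is.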

You will need to reindex to see the full effect immediately, but you can 
also reindex incrementally (as you replace existing documents) if you don't 
mind the difference in relevancy taking an extended time to become 
apparent.

-- Jack Krupansky
