Posted to solr-user@lucene.apache.org by geeky2 <ge...@hotmail.com> on 2012/04/24 17:20:15 UTC

correct location in chain for EdgeNGramFilterFactory ?

Hello all,

I want to experiment with the EdgeNGramFilterFactory at index time.

I believe it needs to go after tokenization, but I am also doing a pattern
replace and other filtering.

Should the EdgeNGramFilterFactory go right after the pattern replace?




    <fieldType name="text_en_splitting" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="\."
                replacement="" replace="all"/>

        <!-- put EdgeNGramFilterFactory here ===> ? -->

        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="\."
                replacement="" replace="all"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

Thanks for any help,



--
View this message in context: http://lucene.472066.n3.nabble.com/correct-location-in-chain-for-EdgeNGramFilterFactory-tp3935589p3935589.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: correct location in chain for EdgeNGramFilterFactory ?

Posted by Erick Erickson <er...@gmail.com>.
Well, what effect do you _want_?

I'd probably put it after the PorterStemFilterFactory. Where you propose
to put it, it'll form a bunch of ngrams, then WordDelimiterFilterFactory
will try to break them up according to _its_ rules, and eventually
you'll be sending absolute gibberish to the stemmer. I mean,
what is the stemmer going to think of (starting out with "running")
ru, run, runn, runni, runnin, running?

I suggest you spend some time with admin/analysis with various
orderings to understand better how all the parts interact.
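
For illustration, here is a minimal sketch of the index-time analyzer with
the ngram filter moved to the end of the chain, after the stemmer. The
minGramSize and maxGramSize values are assumptions picked for the example,
not anything from the original config, so tune them for your data. The
query analyzer would stay as it is.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="\."
          replacement="" replace="all"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
          preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"
          protected="protwords.txt"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <!-- Last in the chain: the stemmer sees whole words, and the edge
       ngrams are built from its output. Gram sizes are assumed values. -->
  <filter class="solr.EdgeNGramFilterFactory"
          minGramSize="2" maxGramSize="15"/>
</analyzer>
```

Pasting some sample text into admin/analysis with this chain should show
what each filter emits, stage by stage.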

Best
Erick
