Posted to solr-user@lucene.apache.org by geeky2 <ge...@hotmail.com> on 2012/04/24 17:20:15 UTC
correct location in chain for EdgeNGramFilterFactory ?
Hello all,
I want to experiment with the EdgeNGramFilterFactory at index time.
I believe it needs to go after tokenization, but I am also doing a pattern
replace, among other things.
Should the EdgeNGramFilterFactory go in right after the pattern replace?
<fieldType name="text_en_splitting" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\."
            replacement="" replace="all"/>
    <!-- put EdgeNGramFilterFactory here ===> ? -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\."
            replacement="" replace="all"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
thanks for any help,
--
View this message in context: http://lucene.472066.n3.nabble.com/correct-location-in-chain-for-EdgeNGramFilterFactory-tp3935589p3935589.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: correct location in chain for EdgeNGramFilterFactory ?
Posted by Erick Erickson <er...@gmail.com>.
Well, what effect do you _want_?
I'd probably put it after the PorterStemFilterFactory. Where you
propose putting it, it'll form a bunch of ngrams, then
WordDelimiterFilterFactory will try to break them up according to
_its_ rules, and eventually you'll be sending absolute gibberish to
the stemmer. I mean, starting out with "running", what is the stemmer
going to make of ru, run, runn, runni, runnin, running?
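As a rough sketch of that ordering, the index analyzer might look
something like the following. The filter chain is the one from the
original message; the minGramSize and maxGramSize values are only
illustrative assumptions, not something taken from this thread, so
adjust them to whatever prefix lengths you actually need.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="\."
          replacement="" replace="all"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
          preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"
          protected="protwords.txt"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <!-- Last in the chain, so grams are built from whole, stemmed tokens.
       Assumption: gram sizes below are placeholders for illustration. -->
  <filter class="solr.EdgeNGramFilterFactory"
          minGramSize="2" maxGramSize="15"/>
</analyzer>
```

With this order the stemmer only ever sees whole words. Unless
protwords.txt protects it, "running" should be stemmed to "run" first,
and with the assumed sizes only the grams of the stem (ru, run) would
be indexed. The query analyzer is left as it was, since the goal is
ngrams at index time only.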
I suggest you spend some time with admin/analysis with various
orderings to understand better how all the parts interact.
Best
Erick