Posted to solr-user@lucene.apache.org by Matt Mitchell <go...@gmail.com> on 2011/06/08 16:59:53 UTC
KeywordTokenizerFactory and stopwords
Hi,
I have an "autocomplete" fieldType that works really well, but because
the KeywordTokenizerFactory (if I understand correctly) is emitting a
single token, the stopword filter will not detect any stopwords.
Anyone know of a way to strip out stopwords when using
KeywordTokenizerFactory? I did try the reg-exp replace filter, but I'm
not sure I want to add a bunch of reg-exps for replacing every
stopword.
Thanks,
Matt
Here's the fieldType definition:
<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
Re: KeywordTokenizerFactory and stopwords
Posted by Matt Mitchell <go...@gmail.com>.
Hi Erik. Yes, something like what you describe would do the trick. I
did find this:
http://lucene.472066.n3.nabble.com/Concatenate-multiple-tokens-into-one-td1879611.html
I might try the pattern replace filter with stopwords, even though
that feels kinda clunky.
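
For reference, a rough sketch of what the pattern replace approach might
look like. The stopword list here (a, an, and, of, the) is only an
illustrative assumption, not a real stopwords.txt, and the placement is a
guess: after the lowercase/folding filters in both analyzers, and before
the EdgeNGramFilterFactory on the index side. A single alternation keeps
it to one filter instead of one regex per stopword:

```xml
<!-- Assumed placement: after ASCIIFoldingFilterFactory in both the index
     and query analyzers (before EdgeNGramFilterFactory at index time). -->
<!-- Strips whole-word stopwords plus any whitespace that follows them.
     The word list is illustrative only. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="\b(?:a|an|and|of|the)\b\s*"
        replacement=""
        replace="all"/>
<!-- Removing a trailing stopword can leave a dangling space, so trim again. -->
<filter class="solr.TrimFilterFactory"/>
```

With that in place, "the lord of the rings" should come out of the chain
as the single token "lord rings". A partially typed stopword at query
time (e.g. "lord of th") is not stripped, so it would not match.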
Matt
On Wed, Jun 8, 2011 at 11:04 AM, Erik Hatcher <er...@gmail.com> wrote:
> This seems like it deserves some kind of "collecting" TokenFilter(Factory) that will slurp up all incoming tokens and glue them together with a space (and allow separator to be configurable). Hmmm.... surprised one of those doesn't already exist. With something like that you could have a standard tokenization chain, and put it all back together at the end.
>
> Erik
Re: KeywordTokenizerFactory and stopwords
Posted by Erik Hatcher <er...@gmail.com>.
This seems like it deserves some kind of "collecting" TokenFilter(Factory) that will slurp up all incoming tokens and glue them together with a space (and allow separator to be configurable). Hmmm.... surprised one of those doesn't already exist. With something like that you could have a standard tokenization chain, and put it all back together at the end.
Erik