Posted to solr-user@lucene.apache.org by Matt Mitchell <go...@gmail.com> on 2011/06/08 16:59:53 UTC

KeywordTokenizerFactory and stopwords

Hi,

I have an "autocomplete" fieldType that works really well, but because
KeywordTokenizerFactory (if I understand correctly) emits the whole
input as a single token, the stopword filter never sees the individual
words and will not remove any stopwords. Does anyone know of a way to
strip out stopwords when using KeywordTokenizerFactory? I did try the
reg-exp replace filter, but I'm not sure I want to add a bunch of
reg-exps, one for every stopword.

Thanks,
Matt

Here's the fieldType definition:

<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

Re: KeywordTokenizerFactory and stopwords

Posted by Matt Mitchell <go...@gmail.com>.

Hi Erik. Yes, something like what you describe would do the trick. I
did find this:

http://lucene.472066.n3.nabble.com/Concatenate-multiple-tokens-into-one-td1879611.html

I might try the pattern replace filter with stopwords, even though
that feels kinda clunky.

Matt
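
For reference, a minimal sketch of the pattern-replace approach mentioned
above, using a single alternation instead of one reg-exp per stopword. The
stopword list here (a, an, and, of, the) is only an assumed example, and the
same filter would need to go into both the index and query analyzers so the
two sides agree. It runs after lowercasing, so the pattern only needs
lowercase forms, and TrimFilterFactory is moved after it to clean up any
leading or trailing space left behind:

  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- assumed example stopword list; extend the alternation as needed -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\b(a|an|and|of|the)\b\s*" replacement="" replace="all"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>

With this chain, an input such as "The Lord of the Rings" would come out as
the single token "lord rings". The drawback is that the stopword list lives
inside the regular expression rather than in a stopwords file.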

On Wed, Jun 8, 2011 at 11:04 AM, Erik Hatcher <er...@gmail.com> wrote:
> This seems like it deserves some kind of "collecting" TokenFilter(Factory) that will slurp up all incoming tokens and glue them together with a space (and allow the separator to be configurable). Hmmm... I'm surprised one of those doesn't already exist. With something like that you could have a standard tokenization chain, and put it all back together at the end.
>
>        Erik

Re: KeywordTokenizerFactory and stopwords

Posted by Erik Hatcher <er...@gmail.com>.

This seems like it deserves some kind of "collecting" TokenFilter(Factory) that will slurp up all incoming tokens and glue them together with a space (and allow the separator to be configurable). Hmmm... I'm surprised one of those doesn't already exist. With something like that you could have a standard tokenization chain, and put it all back together at the end.

	Erik
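
Sketched as an index analyzer, the idea above might look something like
this. StandardTokenizerFactory and StopFilterFactory are stock Solr
factories. com.example.ConcatenateFilterFactory and its separator attribute
are hypothetical placeholders for the "collecting" filter, which would have
to be written as a custom TokenFilter, and stopwords.txt is assumed to be a
stopword file in the core's conf directory:

  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- hypothetical filter: glues the remaining tokens back into one token -->
    <filter class="com.example.ConcatenateFilterFactory" separator=" "/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
  </analyzer>

The query analyzer would presumably use the same chain without the
EdgeNGramFilterFactory step, mirroring the original fieldType.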
