Posted to solr-user@lucene.apache.org by Paul Dlug <pa...@gmail.com> on 2010/07/29 15:44:32 UTC

Excluding large tokens from indexing

Is there a filter available that will remove large tokens from the
token stream? Ideally something configurable to a character limit? I
have a noisy data set that has some large tokens (in this case more
than 50 characters) that I'd like to just strip. They're unlikely to
ever match a user query and will just take up space since there are a
large number of them that are not distinct.


--Paul

Re: Excluding large tokens from indexing

Posted by Paul Dlug <pa...@gmail.com>.
Thanks, that's exactly what I was looking for, not sure how I missed it.

On Thu, Jul 29, 2010 at 11:28 AM, Chantal Ackermann
<ch...@btelligent.de> wrote:
> This is probably what you want?
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory

Re: Excluding large tokens from indexing

Posted by Chantal Ackermann <ch...@btelligent.de>.
This is probably what you want?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
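
For the 50-character case in the original question, a minimal sketch of how
this filter might be wired into schema.xml follows. The field type name
text_no_long_tokens and the choice of solr.WhitespaceTokenizerFactory are
assumptions for illustration only and do not come from this thread. The min
and max attributes are the length bounds the factory accepts; tokens whose
length falls outside that range are dropped from the stream.

```xml
<!-- Hypothetical field type; the name and tokenizer are illustrative assumptions. -->
<fieldType name="text_no_long_tokens" class="solr.TextField">
  <analyzer>
    <!-- Split on whitespace; substitute whatever tokenizer the field already uses. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Keep tokens of 1 to 50 characters; longer tokens are discarded. -->
    <filter class="solr.LengthFilterFactory" min="1" max="50"/>
  </analyzer>
</fieldType>
```

Fields already indexed with a different analysis chain would presumably need
to be reindexed before the dropped tokens stop taking up space.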


