Posted to solr-dev@lucene.apache.org by Lance Norskog <go...@gmail.com> on 2009/12/23 01:47:21 UTC

RemoveDuplicatesTokenFilter redundancy problem?

It looks like the inner loop of
org.apache.solr.analysis.RemoveDuplicatesTokenFilter could use a
'break'. I don't remember enough Big-O analysis to give the
difference, but they will be two different formulae.

For people doing large documents (I've heard gigabytes for email
forensics) this would matter...
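
To illustrate the point, here is a minimal sketch of a duplicate check with
and without the 'break'. It is not the Solr source: the class and method
names (LookaheadScan, comparisonsNoBreak, comparisonsWithBreak) are
hypothetical, and plain strings stand in for tokens. With n terms already
seen at a position, the version without the break always does n
comparisons; the version with it stops at the first match, though it still
does n comparisons when there is no duplicate.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from RemoveDuplicatesTokenFilter.
class LookaheadScan {

    // Counts comparisons when the loop keeps going after a match is found.
    static int comparisonsNoBreak(List<String> seen, String term) {
        int comparisons = 0;
        boolean dup = false;
        for (String s : seen) {
            comparisons++;
            if (s.equals(term)) {
                dup = true; // match found, but the loop continues anyway
            }
        }
        return comparisons;
    }

    // Counts comparisons when the loop stops at the first match.
    static int comparisonsWithBreak(List<String> seen, String term) {
        int comparisons = 0;
        for (String s : seen) {
            comparisons++;
            if (s.equals(term)) {
                break; // nothing more to learn once a duplicate is found
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        List<String> seen = Arrays.asList("a", "b", "c", "d");
        System.out.println(comparisonsNoBreak(seen, "a"));   // prints 4
        System.out.println(comparisonsWithBreak(seen, "a")); // prints 1
    }
}
```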

-- 
Lance Norskog
goksron@gmail.com

Re: RemoveDuplicatesTokenFilter redundancy problem?

Posted by Robert Muir <rc...@gmail.com>.

Another option: instead of looking ahead with the weird Big-O runtime
you noticed, use a set to keep track of which terms have been seen (cleared
after each word with posInc>0).

I implemented this with the new TokenStream API already and will plop the
patch on SOLR-1657.
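
Roughly, the set-based idea looks like the sketch below. This is not the
patch itself and does not use the Lucene/Solr classes: Tok is a
hypothetical stand-in carrying just a term and a position increment, and
SetDedupSketch and removeDuplicates are made-up names. The set is cleared
whenever a token with posInc>0 starts a new position, so each duplicate
check is a hash lookup rather than a scan.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the set-based approach; not the SOLR-1657 patch.
class SetDedupSketch {

    // Stand-in for a token: term text plus position increment.
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    // Drops tokens whose term was already seen at the same position.
    static List<Tok> removeDuplicates(List<Tok> tokens) {
        Set<String> seenAtPosition = new HashSet<>();
        List<Tok> out = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.posInc > 0) {
                seenAtPosition.clear(); // new position: forget earlier terms
            }
            // Set.add returns false when the term is already present.
            if (seenAtPosition.add(t.term)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("fast", 1));
        tokens.add(new Tok("quick", 0));
        tokens.add(new Tok("fast", 0)); // duplicate at the same position
        tokens.add(new Tok("car", 1));
        for (Tok t : removeDuplicates(tokens)) {
            System.out.println(t.term + " posInc=" + t.posInc);
        }
    }
}
```

Only tokens with posInc=0 can ever be dropped here, since the first token
at a position always finds an empty set, so the positions of the surviving
tokens are unchanged.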

On Tue, Dec 22, 2009 at 7:47 PM, Lance Norskog <go...@gmail.com> wrote:

> It looks like the inner loop of
> org.apache.solr.analysis.RemoveDuplicatesTokenFilter could use a
> 'break'. I don't remember enough Big-O analysis to give the
> difference, but they will be two different formulae.
>
> For people doing large documents (I've heard gigabytes for email
> forensics) this would matter...
>
> --
> Lance Norskog
> goksron@gmail.com
>

-- 
Robert Muir
rcmuir@gmail.com