Posted to solr-dev@lucene.apache.org by Lance Norskog <go...@gmail.com> on 2009/12/23 01:47:21 UTC
RemoveDuplicatesTokenFilter redundancy problem?
It looks like the inner loop of
org.apache.solr.analysis.RemoveDuplicatesTokenFilter could use a
'break'. I don't remember enough Big-O analysis to give the
difference, but they will be two different formulae.
For people doing large documents (I've heard gigabytes for email
forensics) this would matter...
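To make the point about the missing 'break' concrete, here is a minimal sketch in plain Java. It is not the Solr source: the class BreakSketch, the helper isDuplicate, and the comparison counter are hypothetical names introduced only for illustration. It scans the terms already emitted at the current position and counts how many comparisons are made with and without an early exit.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a stand-in for a look-ahead duplicate check,
// not the code in org.apache.solr.analysis.RemoveDuplicatesTokenFilter.
final class BreakSketch {
    // Number of term comparisons made by the most recent isDuplicate call.
    static int comparisons;

    // Returns true if 'term' already appears in 'emitted'.
    // With earlyExit=true the scan stops at the first match.
    static boolean isDuplicate(List<String> emitted, String term, boolean earlyExit) {
        comparisons = 0;
        boolean dup = false;
        for (String e : emitted) {
            comparisons++;
            if (e.equals(term)) {
                dup = true;
                if (earlyExit) {
                    break; // nothing more to learn once a match is found
                }
            }
        }
        return dup;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("a", "b", "c", "d");

        isDuplicate(emitted, "a", false);
        System.out.println("without break: " + comparisons + " comparisons");

        isDuplicate(emitted, "a", true);
        System.out.println("with break:    " + comparisons + " comparisons");
    }
}
```

The early exit only helps when a match exists. A term with no duplicate still scans the whole list, so the worst case stays quadratic in the number of tokens stacked at a single position.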
--
Lance Norskog
goksron@gmail.com
Re: RemoveDuplicatesTokenFilter redundancy problem?
Posted by Robert Muir <rc...@gmail.com>.
Another option, instead of looking ahead with the weird Big-O runtime
you noticed, is to use a set to keep track of which terms have been seen (cleared
at each token with posInc>0).
I implemented this with the new TokenStream API already and will attach the patch to
SOLR-1657.
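The set-based idea can be sketched in plain Java as follows. This is not the patch on SOLR-1657 and does not use the Lucene TokenStream API: the Tok class and the removeDuplicates method are hypothetical stand-ins that carry only a term and a position increment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: plain-Java sketch of set-based duplicate removal.
final class SetDedupSketch {

    // Hypothetical stand-in for an analysis token (not the Lucene Token class).
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    // Drops any token whose term was already emitted at the same position.
    static List<Tok> removeDuplicates(List<Tok> in) {
        Set<String> seen = new HashSet<>();
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            if (t.posInc > 0) {
                seen.clear(); // new position: forget terms from the previous one
            }
            if (seen.add(t.term)) { // add() returns false if the term was already present
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> in = Arrays.asList(
            new Tok("quick", 1),
            new Tok("fast", 0),
            new Tok("quick", 0), // same term, same position: dropped
            new Tok("quick", 1)  // same term, new position: kept
        );
        System.out.println(removeDuplicates(in));
    }
}
```

Each token costs one expected constant-time set operation, so the work per position grows roughly linearly with the number of stacked tokens rather than with its square.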
On Tue, Dec 22, 2009 at 7:47 PM, Lance Norskog <go...@gmail.com> wrote:
> It looks like the inner loop of
> org.apache.solr.analysis.RemoveDuplicatesTokenFilter could use a
> 'break'. I don't remember enough Big-O analysis to give the
> difference, but they will be two different formulae.
>
> For people doing large documents (I've heard gigabytes for email
> forensics) this would matter...
>
> --
> Lance Norskog
> goksron@gmail.com
>
--
Robert Muir
rcmuir@gmail.com