Posted to dev@lucene.apache.org by Kaktu Chakarabati <ji...@gmail.com> on 2010/09/28 20:23:23 UTC

Field Collapsing Performance

hey guys,
Any word on this? Has anyone done any benchmarking / used this in a
production-like environment?
We are considering using this feature at a large scale for deduplication and
were wondering if anyone has some numbers before I go ahead and start my
own series of tests...


thanks,
Chak

Re: Field Collapsing Performance

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Sep 28, 2010 at 8:14 PM, Li Li <fa...@gmail.com> wrote:
> I think the current implementation is slow, because it does the collapse over
> all the hit docs. In our environment, a query takes more than 1s with
> collapsing and only 200ms-300ms without it. So we modified it as follows --
> when the user needs the top 100 docs, we collect the top 200 docs and do
> the collapse within those 200 docs.

Yep, like faceting, there's no one algorithm that's fast for all types
of distributions.
If you expect groups to be relatively unique, the most efficient way is just
for the client to over-request a bit and do the collapse themselves.

We'll be adding more implementations as time goes on of course, but I think
tackling something first that the client *couldn't* easily do was a good choice.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
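For illustration, a minimal sketch of the client-side over-request-and-collapse
approach described above, using the SolrJ client of that era; the Solr URL, the
"dedup_field" field name and the 2x over-request factor are made-up examples,
not anything taken from this thread:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ClientSideCollapse {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr endpoint, query string and collapse field.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        int wanted = 100;                  // rows the caller actually needs
        SolrQuery query = new SolrQuery("ipod");
        query.setRows(wanted * 2);         // over-request a bit

        QueryResponse rsp = server.query(query);

        // Keep only the first (highest-ranked) doc per collapse key.
        Set<Object> seen = new HashSet<Object>();
        List<SolrDocument> collapsed = new ArrayList<SolrDocument>();
        for (SolrDocument doc : rsp.getResults()) {
            Object key = doc.getFieldValue("dedup_field");
            if (key == null || seen.add(key)) {
                collapsed.add(doc);
            }
            if (collapsed.size() >= wanted) {
                break;                     // enough distinct docs collected
            }
        }
        System.out.println("kept " + collapsed.size() + " of "
                + rsp.getResults().size() + " fetched docs");
    }
}

If the groups really are mostly unique, the loop usually breaks out well before
the end of the over-requested window, so the extra cost over a plain query
stays small.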



Re: Field Collapsing Performance

Posted by Kaktu Chakarabati <ji...@gmail.com>.
Hey Li,
Thanks - great answer, it touched exactly on the points I was interested in.

One last Q - once you tweaked it to work in a 'top K' way, what was the
performance impact like?
I've written similar components in the past that iterate over the top result
set docs (on the order of 400-600 top results), and these would usually run
in no more than 4-5ms. Is this close to the numbers you're seeing for this
component?

Thanks,
Chak
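As a rough way to check numbers like that, a minimal sketch of timing just the
in-memory collapse pass over a pre-fetched window of top results; the field
name is hypothetical, and the query time itself is deliberately excluded:

import java.util.HashSet;
import java.util.Set;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class CollapseTiming {
    // docs: a pre-fetched window of top results (e.g. 400-600 docs).
    // Times only the dedup loop, not the query that produced the docs.
    public static long collapseMillis(SolrDocumentList docs, String field) {
        long start = System.nanoTime();
        Set<Object> seen = new HashSet<Object>();
        int kept = 0;
        for (SolrDocument doc : docs) {
            Object key = doc.getFieldValue(field);
            if (key == null || seen.add(key)) {
                kept++;                    // first doc per key wins
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1000000L;
        System.out.println("kept " + kept + " docs, collapse pass took "
                + elapsedMs + " ms");
        return elapsedMs;
    }
}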

On Tue, Sep 28, 2010 at 5:14 PM, Li Li <fa...@gmail.com> wrote:

> I think the current implementation is slow, because it does the collapse over
> all the hit docs. In our environment, a query takes more than 1s with
> collapsing and only 200ms-300ms without it. So we modified it as follows --
> when the user needs the top 100 docs, we collect the top 200 docs and do
> the collapse within those 200 docs. Of course, the docs may not be collapsed
> as thoroughly as before, but I think that's not so important -- anyway,
> collapsing is not clustering.
>
> 2010/9/29 Kaktu Chakarabati <ji...@gmail.com>:
> > hey guys,
> > Any word on this? Has anyone done any benchmarking / used this in a
> > production-like environment?
> > We are considering using this feature at a large scale for deduplication
> > and were wondering if anyone has some numbers before I go ahead and
> > start my own series of tests...
> >
> >
> > thanks,
> > Chak
> >
>

Re: Field Collapsing Performance

Posted by Li Li <fa...@gmail.com>.
I think the current implementation is slow, because it does the collapse over
all the hit docs. In our environment, a query takes more than 1s with
collapsing and only 200ms-300ms without it. So we modified it as follows --
when the user needs the top 100 docs, we collect the top 200 docs and do
the collapse within those 200 docs. Of course, the docs may not be collapsed
as thoroughly as before, but I think that's not so important -- anyway,
collapsing is not clustering.
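To make the window approach concrete, here is a rough sketch (not the actual
patch discussed here) of collecting 2*k hits with the plain Lucene search API
and collapsing on a stored field within that window only; the field name and
the 2x factor are purely illustrative:

import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class TopKCollapse {

    // Collect 2*k hits and collapse on a field within that window,
    // instead of collapsing across every hit in the index.
    public static ScoreDoc[] collapseTopK(IndexSearcher searcher, Query query,
                                          String collapseField, int k) throws Exception {
        TopDocs top = searcher.search(query, k * 2);   // over-collect

        Set<String> seen = new HashSet<String>();
        ScoreDoc[] kept = new ScoreDoc[Math.min(k, top.scoreDocs.length)];
        int n = 0;
        for (ScoreDoc sd : top.scoreDocs) {
            if (n >= kept.length) {
                break;                                 // already have k distinct docs
            }
            Document doc = searcher.doc(sd.doc);
            String key = doc.get(collapseField);
            if (key == null || seen.add(key)) {        // first doc per key wins
                kept[n++] = sd;
            }
        }
        ScoreDoc[] result = new ScoreDoc[n];            // trim if dups shrank the set
        System.arraycopy(kept, 0, result, 0, n);
        return result;
    }
}

Reading the key from a stored field keeps the sketch short; in practice the
per-doc key would more likely come from the FieldCache so that the cost over a
normal top-k search stays small.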

2010/9/29 Kaktu Chakarabati <ji...@gmail.com>:
> hey guys,
> Any word on this? Has anyone done any benchmarking / used this in a
> production-like environment?
> We are considering using this feature at a large scale for deduplication
> and were wondering if anyone has some numbers before I go ahead and
> start my own series of tests...
>
>
> thanks,
> Chak
>
