Posted to user@lucy.apache.org by goran kent <go...@gmail.com> on 2011/11/09 22:14:31 UTC

[lucy-user] Aggregating multiple searchers

Hi,

Just in case Marvin doesn't get around to ClusterSearcher, I'm
wondering whether I can cobble something together using POE::Session
to fire off multiple remote searcher requests
(LucyX::Remote::SearchClient), wait for all to complete, then
aggregate the results.

That last bit has me stumped.

How can I aggregate the results from a bunch of
LucyX::Remote::SearchClient objects?  Unfortunately there's no
Lucy::Search::Aggregate.

Any ideas?

Re: [lucy-user] Aggregating multiple searchers

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 09, 2011 at 11:14:31PM +0200, goran kent wrote:
> Just in case Marvin doesn't get around to ClusterSearcher, I'm
> wondering whether I can cobble something together using POE::Session
> to fire off multiple remote searcher requests
> (LucyX::Remote::SearchClient), wait for all to complete, then
> aggregate the results.
> 
> That last bit has me stumped.
> 
> How can I aggregate the results from a bunch of
> LucyX::Remote::SearchClient objects?  Unfortunately there's no
> Lucy::Search::Aggregate.

The problem is that queries run against different indexes do not produce
comparable scores.

A naive implementation of an aggregator would do this:

  my $hits_a = $searcher_a->hits(query => $query);
  my $hits_b = $searcher_b->hits(query => $query);
  my @hit_docs;
  push(@hit_docs, $_) while $_ = $hits_a->next;
  push(@hit_docs, $_) while $_ = $hits_b->next;
  my @sorted = sort { $b->get_score <=> $a->get_score } @hit_docs;

However, say that you are searching for 'iphone' in two news archives, one
from 2001 and one from 2011.  In the more recent news archive, 'iphone'
will be a reasonably common term.  In the older news archive, 'iphone' will be
very rare -- let's imagine that it only appears in a single document, as a
typo.  Rare terms make for high scores -- so the top hit in your search for
'iphone' may well be the typo[1].

That's why you want to know the doc_freq for each term across the *entire*
corpus when performing query weighting.
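To make the effect concrete, here is a small self-contained sketch (with made-up
document counts, not real Lucy API calls) comparing an IDF-style weight,
log(N/df), computed per shard in isolation against the same weight computed
from summed corpus-wide statistics, as a ClusterSearcher would compute it:

```perl
use strict;
use warnings;

# Hypothetical per-shard stats for the term 'iphone' (made-up numbers):
# 10,000 docs per archive; common in the 2011 news, a lone typo in 2001.
my %shards = (
    news_2011 => { doc_freq => 500, doc_count => 10_000 },
    news_2001 => { doc_freq => 1,   doc_count => 10_000 },
);

# An IDF-style weight, log(N / df), computed per shard in isolation.
my %local_idf = map {
    $_ => log( $shards{$_}{doc_count} / $shards{$_}{doc_freq} )
} keys %shards;

# The same weight computed from corpus-wide totals, i.e. summing
# doc_freq and doc_count across all shards before weighting.
my ( $total_df, $total_docs ) = ( 0, 0 );
for my $stats ( values %shards ) {
    $total_df   += $stats->{doc_freq};
    $total_docs += $stats->{doc_count};
}
my $global_idf = log( $total_docs / $total_df );

printf "2001 local idf: %.2f\n", $local_idf{news_2001};   # 9.21
printf "2011 local idf: %.2f\n", $local_idf{news_2011};   # 3.00
printf "global idf:     %.2f\n", $global_idf;             # 3.69
```

The 2001 shard, scoring in isolation, weights 'iphone' three times as heavily
as the corpus-wide statistics would justify -- which is how the typo document
floats to the top of the merged hit list.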

That's not the only problem, but it's illustrative.

Marvin Humphrey

[1] I got this excellent example from Chris Hostetter.