Posted to user@lucy.apache.org by goran kent <go...@gmail.com> on 2011/11/09 22:14:31 UTC
[lucy-user] Aggregating multiple searchers
Hi,
Just in case Marvin doesn't get around to ClusterSearcher, I'm
wondering whether I can cobble something together using POE::Session
to fire off multiple remote searcher requests
(LucyX::Remote::SearchClient), wait for all to complete, then
aggregate the results.
That last bit has me stumped.
How can I aggregate the results from a bunch of
LucyX::Remote::SearchClient objects? Unfortunately there's no
Lucy::Search::Aggregate.
Any ideas?
Re: [lucy-user] Aggregating multiple searchers
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 09, 2011 at 11:14:31PM +0200, goran kent wrote:
> Just in case Marvin doesn't get around to ClusterSearcher, I'm
> wondering whether I can cobble something together using POE::Session
> to fire off multiple remote searcher requests
> (LucyX::Remote::SearchClient), wait for all to complete, then
> aggregate the results.
>
> That last bit has me stumped.
> How can I aggregate the results from a bunch of
> LucyX::Remote::SearchClient objects? Unfortunately there's no
> Lucy::Search::Aggregate.
The problem is that queries run against different indexes do not produce
comparable scores.
A naive implementation of an aggregator would do this:
    my $hits_a = $searcher_a->hits(query => $query);
    my $hits_b = $searcher_b->hits(query => $query);
    my @hit_docs;
    push(@hit_docs, $_) while $_ = $hits_a->next;
    push(@hit_docs, $_) while $_ = $hits_b->next;
    my @sorted = sort { $b->get_score <=> $a->get_score } @hit_docs;
However, say that you are searching for 'iphone' in two news archives, one
from 2001 and one from 2011. In the more recent news archive, 'iphone'
will be a reasonably common term. In the older news archive, 'iphone' will be
very rare -- let's imagine that it only appears in a single document, as a
typo. Rare terms make for high scores -- so the top hit in your search for
'iphone' may well be the typo[1].
That's why you want to know the doc_freq for each term across the *entire*
corpus when performing query weighting.
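The effect is easy to see with classic IDF weighting, log(N / doc_freq). A small
sketch with made-up numbers (the corpus sizes and doc freqs below are purely
illustrative, not from any real index):

    use strict;
    use warnings;

    # Hypothetical per-index stats for the term 'iphone'.
    my %archive_2001 = ( docs => 100_000, doc_freq => 1 );     # lone typo
    my %archive_2011 = ( docs => 100_000, doc_freq => 5_000 ); # common term

    # Classic IDF: log(total docs / doc freq).
    sub idf { my ( $n, $df ) = @_; return log( $n / $df ); }

    # Each searcher weighting against only its own index:
    my $idf_2001 = idf( $archive_2001{docs}, $archive_2001{doc_freq} ); # ~11.51
    my $idf_2011 = idf( $archive_2011{docs}, $archive_2011{doc_freq} ); # ~3.00

    # Weighting against corpus-wide stats instead:
    my $idf_global = idf(
        $archive_2001{docs}     + $archive_2011{docs},
        $archive_2001{doc_freq} + $archive_2011{doc_freq},
    );  # ~3.69, applied uniformly to both indexes

    printf "2001: %.2f  2011: %.2f  global: %.2f\n",
        $idf_2001, $idf_2011, $idf_global;

With per-index weighting, the 2001 typo document gets nearly four times the
term weight of a genuine 2011 match, so it floats to the top of the merged
list. With the corpus-wide figure, both indexes score the term on the same
scale and the merged ranking is meaningful.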
That's not the only problem, but it's illustrative.
Marvin Humphrey
[1] I got this excellent example from Chris Hostetter.