Posted to user@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2006/04/27 23:32:05 UTC

MultiSearcher & skewed IDF values

Hi all,

I'm curious as to whether MultiSearcher (as of 1.9) does a good job 
of blending search results, when the various indexes being searched 
have significantly different characteristics.

For example, let's say I've got two indexes. One consists of 
documents where the term "platypus" almost never occurs. This index 
will have a very high IDF for that term.

The second index happens to be from the portion of the crawl that was 
covering biology departments in Australian universities, so the term 
"platypus" is significantly more common.

If I do a search on "platypus lifespan" using MultiSearcher, will 
hits from the first index get an unnatural boost because of the 
corresponding high IDF in that particular slice of the data? Or 
should I expect that the results will (closely) match what I'd get 
back if I merged the two indexes and used a regular searcher?
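
To make the concern concrete, here is a minimal Java sketch. It assumes the
classic Lucene DefaultSimilarity formula, idf = ln(numDocs / (docFreq + 1)) + 1;
the class name IdfSkew and every document count below are invented for
illustration and do not come from a real crawl.

```java
// Sketch only: compares the IDF each index computes on its own with the
// IDF a single merged index would compute. All counts are made up.
class IdfSkew {

    /** Assumed classic Lucene formula: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Index 1: general crawl, "platypus" almost never occurs.
        long generalDocs = 1_000_000, generalDf = 10;
        // Index 2: Australian biology departments, "platypus" is common.
        long biologyDocs = 50_000, biologyDf = 5_000;

        double idfGeneral = idf(generalDf, generalDocs);
        double idfBiology = idf(biologyDf, biologyDocs);
        // A merged index sees the summed document frequency and document count.
        double idfMerged = idf(generalDf + biologyDf, generalDocs + biologyDocs);

        System.out.printf("idf(platypus), general index only: %.3f%n", idfGeneral);
        System.out.printf("idf(platypus), biology index only: %.3f%n", idfBiology);
        System.out.printf("idf(platypus), merged index:       %.3f%n", idfMerged);
    }
}
```

With these counts the general index reports a higher IDF than the merged
index would, and the biology index a lower one. If each sub-index scores
with its own local value, hits from the first index are weighted up relative
to a merged search, which is exactly the skew I'm asking about.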

Thanks,

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: MultiSearcher & skewed IDF values

Posted by Doug Cutting <cu...@apache.org>.

Andrzej Bialecki wrote:
> Unfortunately, this is still an open problem: neither Nutch nor 
> Lucene handles this correctly. Please see NUTCH-92 for more 
> information and a sketch of a solution.

Lucene's MultiSearcher now implements this correctly, no?  But Nutch's 
distributed search does not.  Two round trips to each node are required: 
the first to get IDF information for the query, and the second to get hits.
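
The two round trips can be sketched roughly as follows. This is an
illustration only, not Nutch's or Lucene's actual code: the Node interface,
FixedNode, and globalIdf are hypothetical names, and the IDF formula is
assumed to be the classic ln(numDocs / (docFreq + 1)) + 1.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch only: gathers per-node statistics, then derives one global IDF.
class TwoPhaseSearch {

    /** Hypothetical view of one search node; not a Nutch or Lucene interface. */
    interface Node {
        long numDocs();
        long docFreq(String term);
    }

    /** In-memory stand-in for a remote node, used only for this demonstration. */
    static final class FixedNode implements Node {
        private final long numDocs;
        private final Map<String, Long> docFreqs;

        FixedNode(long numDocs, Map<String, Long> docFreqs) {
            this.numDocs = numDocs;
            this.docFreqs = docFreqs;
        }

        public long numDocs() { return numDocs; }

        public long docFreq(String term) { return docFreqs.getOrDefault(term, 0L); }
    }

    /** Assumed classic formula: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** Round trip 1: sum docFreq and numDocs over all nodes, then compute one IDF. */
    static double globalIdf(String term, List<Node> nodes) {
        long totalDocFreq = 0;
        long totalDocs = 0;
        for (Node node : nodes) {
            totalDocFreq += node.docFreq(term);
            totalDocs += node.numDocs();
        }
        return idf(totalDocFreq, totalDocs);
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
            new FixedNode(1_000_000, Collections.singletonMap("platypus", 10L)),
            new FixedNode(50_000, Collections.singletonMap("platypus", 5_000L)));

        double idf = globalIdf("platypus", nodes);
        // Round trip 2 (not shown): send the query together with this shared
        // IDF to every node, so all hits are scored on the same scale.
        System.out.printf("global idf(platypus) = %.3f%n", idf);
    }
}
```

Because every node scores with the same value in the second round trip, the
merged hit list should order documents the way a single combined index would,
at the cost of the extra request per node.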

Doug

Re: MultiSearcher & skewed IDF values

Posted by Andrzej Bialecki <ab...@getopt.org>.

Ken Krugler wrote:
> Hi all,
>
> I'm curious as to whether MultiSearcher (as of 1.9) does a good job of 
> blending search results, when the various indexes being searched have 
> significantly different characteristics.
>
> For example, let's say I've got two indexes. One consists of documents 
> where the term "platypus" almost never occurs. This index will have a 
> very high IDF for that term.
>
> The second index happens to be from the portion of the crawl that was 
> covering biology departments in Australian universities, so the term 
> "platypus" is significantly more common.
>
> If I do a search on "platypus lifespan" using MultiSearcher, will hits 
> from the first index get an unnatural boost because of the 
> corresponding high IDF in that particular slice of the data? Or should 
> I expect that the results will (closely) match what I'd get back if I 
> merged the two indexes and used a regular searcher?

Unfortunately, this is still an open problem: neither Nutch nor 
Lucene handles this correctly. Please see NUTCH-92 for more 
information and a sketch of a solution.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com