Posted to solr-dev@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2008/04/12 05:39:49 UTC

IDF in Distributed Search

Hi,

With a well-mixed set of distributed indices, not having distributed/global IDF won't hurt much.
But what if one has a not-so-well-mixed set of shards?  One might want to apply rules when assigning documents to shards, so that certain types of documents are grouped into a subset of the shards instead of being spread across all of them.  Such careful sharding might allow the searcher to be smarter about which shards to search, based on the query, the client running the query, etc.

Thus, I've run through comments on SOLR-303 to see what has been said about distributed IDF.
Here is what I extracted:

"## I'm not quite sure about GlobalCollectionStat. Is its purpose just to normalize weights from the shards?"
  
"It's to make a distributed search score the same as it would if everything was in a single index.
 idf (inverse document frequency) is part of the scoring, so that component essentially does a distributed idf."

"...distributed idf... this has a performance cost, and should matter little in a well mixed index."


So, I'd like to see what it would take to add distributed IDF info to Solr's distributed search.
Here are some questions to get the discussion going:
- Is anyone already working on it?
- Does anyone plan on working on it in the very near future?
- Does anyone already have thoughts on how and where dist. idf could be plugged in?
- There is a mention of dist idf and performance cost up there - any idea how costly dist idf would be?

Thanks,
Otis 
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: IDF in Distributed Search

Posted by Walter Underwood <wu...@netflix.com>.
Global IDF does not require another request/response.
It is nearly free if you return the right info.

Return the total number of docs and the df in the original
response. Sum the doc counts and dfs, recompute the idf,
and re-rank.
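
The steps above can be sketched roughly as follows. This is an illustration in Python, not Solr code: the ShardResponse shape, the toy tf * idf score, and the idf formula (written in the style of Lucene's classic similarity) are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class ShardResponse:
    """Hypothetical per-shard response that carries its own statistics."""
    num_docs: int  # total number of docs in this shard
    df: dict       # term -> document frequency within this shard
    hits: list     # list of (doc_id, {term: term frequency}) pairs


def idf(num_docs, doc_freq):
    # Assumed formula, in the style of Lucene's classic similarity.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def global_rerank(responses, terms):
    """Sum doc counts and dfs across shards, recompute idf, and re-rank."""
    total_docs = sum(r.num_docs for r in responses)
    total_df = {t: sum(r.df.get(t, 0) for r in responses) for t in terms}
    weights = {t: idf(total_docs, total_df[t]) for t in terms}

    scored = []
    for r in responses:
        for doc_id, tfs in r.hits:
            # Toy score: tf * global idf, summed over the query terms.
            score = sum(tfs.get(t, 0) * weights[t] for t in terms)
            scored.append((score, doc_id))

    # Best score first; doc_id breaks ties so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


# Two badly mixed shards: "solr" is common in one and rare in the other.
shard_a = ShardResponse(1000, {"solr": 900}, [("a1", {"solr": 3})])
shard_b = ShardResponse(1000, {"solr": 9}, [("b1", {"solr": 2})])
ranked = global_rerank([shard_a, shard_b], ["solr"])
```

With shard-local statistics, b1 would outrank a1 in this example, because "solr" looks rare on its shard. With the summed statistics both hits share one idf, so the hit with the higher term frequency ranks first.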

See this post for an efficient way to do it:

  
http://wunderwood.org/most_casual_observer/2007/04/progressive_reranking.html

This works best if you treat the results from each server as
a queue and refill just that queue when it is exhausted. All the
good results might be from one server.
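
A minimal sketch of that queue-per-server merge, again in Python and under stated assumptions: fetch_page is a hypothetical callable standing in for a request to one server, returning (score, doc_id) pairs best first, and the scores are assumed to be comparable across servers already (for example after a global-idf re-score).

```python
class ShardQueue:
    """Buffers one server's results and refills only when it runs dry."""

    def __init__(self, fetch_page, page_size):
        self.fetch_page = fetch_page  # hypothetical: (offset, limit) -> hits
        self.page_size = page_size
        self.offset = 0
        self.buffer = []
        self.exhausted = False

    def peek(self):
        if not self.buffer and not self.exhausted:
            page = self.fetch_page(self.offset, self.page_size)
            self.offset += len(page)
            self.buffer = list(page)
            # A short page means the server has nothing more to give.
            if len(page) < self.page_size:
                self.exhausted = True
        return self.buffer[0] if self.buffer else None

    def pop(self):
        head = self.peek()
        if head is not None:
            self.buffer.pop(0)
        return head


def merge_top_k(queues, k):
    """Repeatedly take the best head among all queues until k hits are found."""
    results = []
    while len(results) < k:
        best = None
        for q in queues:
            head = q.peek()
            if head is not None and (best is None or head[0] > best.peek()[0]):
                best = q
        if best is None:
            break  # every queue is empty and exhausted
        results.append(best.pop())
    return results


def make_fetcher(hits, log):
    """Demo stand-in for a server: pages through a list, logging each offset."""
    def fetch(offset, limit):
        log.append(offset)
        return hits[offset:offset + limit]
    return fetch
```

Only the queue that runs dry is refilled, so a server whose results never reach the top of the merge is asked for a single page, even when all the good results come from another server.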

wunder

On 4/11/08 8:50 PM, "Yonik Seeley" <yo...@apache.org> wrote:

> On Fri, Apr 11, 2008 at 11:39 PM, Otis Gospodnetic
> <ot...@yahoo.com> wrote:
>>  So, I'd like to see what it would take to add distributed IDF info to Solr's
>> distributed search.
>>  Here are some questions to get the discussion going:
>>  - Is anyone already working on it?
>>  - Does anyone plan on working on it in the very near future?
>>  - Does anyone already have thoughts how and where dist. idf could be plugged
>> in?
>>  - There is a mention of dist idf and performance cost up there - any idea
>> how costly dist idf would be?
> 
> It's relatively easy to implement, but the performance cost is not
> negligible since it adds another search "phase" (another
> request-response).  It should be optional of course (globalidf=true),
> so there is no reason not to add this feature.
> 
> I also left room for this stage (ResponseBuilder.STAGE_PARSE_QUERY),
> which is ordered before query execution.
> 
> -Yonik


Re: IDF in Distributed Search

Posted by Yonik Seeley <yo...@apache.org>.
On Fri, Apr 11, 2008 at 11:39 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
>  So, I'd like to see what it would take to add distributed IDF info to Solr's distributed search.
>  Here are some questions to get the discussion going:
>  - Is anyone already working on it?
>  - Does anyone plan on working on it in the very near future?
>  - Does anyone already have thoughts how and where dist. idf could be plugged in?
>  - There is a mention of dist idf and performance cost up there - any idea how costly dist idf would be?

It's relatively easy to implement, but the performance cost is not
negligible since it adds another search "phase" (another
request-response).  It should be optional of course (globalidf=true),
so there is no reason not to add this feature.

I also left room for this stage (ResponseBuilder.STAGE_PARSE_QUERY),
which is ordered before query execution.

-Yonik