Posted to dev@nutch.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2005/09/15 22:19:55 UTC

[jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results

    [ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12329473 ] 

Otis Gospodnetic commented on NUTCH-92:
---------------------------------------

I recall a discussion on the lucene-dev list several (6+?) months back about this or a very similar issue.  Lucene's MultiSearcher has the same problem.  Chuck led the discussion and had some proposed solutions, if I recall correctly, but I don't think they ever made it into the Lucene core.

I mention this because this could perhaps be fixed at a lower (Lucene) level and benefit both Lucene and Nutch.
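To illustrate the discrepancy being discussed, here is a minimal sketch (hypothetical code, not the Nutch or Lucene API; `IdfSketch` and its `idf` helper are invented for illustration) showing how the same term gets different local IDFs on two servers, and how summing docFreqs and document counts first yields one global IDF. The formula used is the classic Lucene-style idf, 1 + ln(numDocs / (docFreq + 1)):

```java
// Hypothetical sketch: why per-index IDFs diverge in a distributed setup.
public class IdfSketch {

    // Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1)).
    public static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Two servers with an uneven distribution of the same term.
        int[] docFreqs = {5, 500};    // term's docFreq on server A and B
        int[] numDocs  = {1000, 1000};

        // Local IDFs differ, so scores from A and B are not comparable.
        for (int i = 0; i < 2; i++) {
            System.out.println("local idf, server " + (char) ('A' + i)
                + " = " + idf(docFreqs[i], numDocs[i]));
        }

        // Global IDF: sum docFreqs and numDocs across all indexes first.
        int totalDf = docFreqs[0] + docFreqs[1];
        int totalDocs = numDocs[0] + numDocs[1];
        System.out.println("global idf = " + idf(totalDf, totalDocs));
    }
}
```

With these numbers the two local IDFs differ by roughly a factor of 3.5, which is exactly why scores from different servers cannot be compared directly.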


> DistributedSearch incorrectly scores results
> --------------------------------------------
>
>          Key: NUTCH-92
>          URL: http://issues.apache.org/jira/browse/NUTCH-92
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.8-dev, 0.7
>     Reporter: Andrzej Bialecki 
>     Assignee: Andrzej Bialecki 

>
> When running search servers in a distributed setup, using DistributedSearch$Server and Client, total scores are incorrectly calculated. The symptom is that scores differ depending on how segments are deployed to Servers: if terms are unevenly distributed across segment indexes (due to segment size or content differences), then scores will differ depending on how many and which segments are deployed on a particular Server. This may lead to non-relevant results being ranked above more relevant ones.
> The underlying reason for this is that each IndexSearcher (which uses local index on each Server) calculates scores based on the local IDFs of query terms, and not the global IDFs from all indexes together. This means that scores arriving from different Servers to the Client cannot be meaningfully compared, unless all indexes have similar distribution of Terms and similar numbers of documents in them. However, currently the Client mixes all scores together, sorts them by absolute values and picks top hits. These absolute values will change if segments are un-evenly deployed to Servers.
> Currently the workaround is to deploy the same number of documents in segments per Server, and to ensure that segments contain well-randomized content so that term frequencies for common terms are very similar.
> The solution proposed here (as a result of discussion between ab and cutting, patches are coming) is to calculate global IDFs prior to running the query, and pre-boost query Terms with these global IDFs. This will require one more RPC call per query (this can be optimized later, e.g. through caching). Then the scores will become normalized according to the global IDFs, and the Client will be able to meaningfully compare them. Scores will also become independent of the segment content or the local number of documents per Server. This will involve at least the following changes:
> * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This enables us to manipulate scores independently of local IDFs.
> * add a new method to Searcher interface, int[] getDocFreqs(Term[]), which will return document frequencies for query terms.
> * modify getSegmentNames() so that it also returns the total number of documents in each segment, or implement this as a separate method (this will be called once during segment initialization)
> * in DistributedSearch$Client.search() first make a call to servers to return local IDFs for the current query, and calculate global IDFs for each relevant Term in that query.
> * multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for all of its terms
> This solution should be applicable with only minor changes to all branches, but initially the patches will be relative to trunk/ .
> Comments, suggestions and review are welcome!
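The proposed Client-side flow above can be sketched roughly as follows. This is a hypothetical illustration, not a patch: `getDocFreqs()` and `getNumDocs()` follow the proposal, but the `Server` interface and `GlobalIdfClient` class are stand-ins invented here, and the idf formula (1 + ln(numDocs / (docFreq + 1))) is only assumed:

```java
// Hypothetical sketch of the proposed two-phase flow: collect global
// term statistics first, then boost query terms by global IDFs.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GlobalIdfClient {

    // Stand-in for a remote search Server: reports local docFreqs
    // for the query terms and its local document count.
    public interface Server {
        int[] getDocFreqs(String[] terms); // the proposed new RPC
        int getNumDocs();
    }

    // Assumed Lucene-style idf: 1 + ln(numDocs / (docFreq + 1)).
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Phase 1: one extra RPC round to every server to sum docFreqs and
    // document counts, then compute one global-IDF boost per term.
    public static Map<String, Float> globalBoosts(List<Server> servers,
                                                  String[] terms) {
        int[] totalDf = new int[terms.length];
        int totalDocs = 0;
        for (Server s : servers) {
            int[] dfs = s.getDocFreqs(terms);
            for (int i = 0; i < terms.length; i++) {
                totalDf[i] += dfs[i];
            }
            totalDocs += s.getNumDocs();
        }
        Map<String, Float> boosts = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            boosts.put(terms[i], idf(totalDf[i], totalDocs));
        }
        // Phase 2 (not shown): send the query with these boosts applied to
        // TermQuery (and summed for PhraseQuery terms); because
        // NutchSimilarity.idf() would return 1.0f, servers then score
        // using the global IDFs instead of their local ones.
        return boosts;
    }
}
```

Because every term's boost is derived from the summed statistics, the resulting scores no longer depend on which Server a segment happens to live on.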

-- 
This message is automatically generated by JIRA.