Posted to user@nutch.apache.org by Tim Pease <ti...@gmail.com> on 2011/02/01 05:09:31 UTC

document boost of "Infinity"

I have Nutch configured to crawl several dozen sites and store the results in Solr for searching. An adaptive fetch schedule is being used so that pages which change less frequently are crawled less often. However, I've run into an issue where many of the documents have a boost of "Infinity". Consequently, the document score in Solr is extremely high for these documents, and they swamp other, more relevant search results.

My question is: why is the boost value "Infinity" for these documents?

I'm running Nutch 1.2 via a simple script that iterates through a full crawl cycle. I'm wondering if re-crawling these pages (because of the adaptive fetch schedule) is affecting the score, i.e. whether a link from PageA to PageB is being counted multiple times and boosting the score of PageB.

I can't follow the code in the OPICScoringFilter and the "updatedb" process well enough to be sure that links are not being double counted. Any insights or pointers would be greatly appreciated.

Blessings,
TwP

PS  I am seeing a steady growth of boost values in the Solr search results. There are documents with normal boost values (around 1.0) and documents with increasing boost values all the way up to "Infinity". Obviously they plateau at "Infinity".
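
To illustrate the double-counting hypothesis above, here is a minimal sketch in Java. It is a toy model and not the actual OPICScoringFilter or "updatedb" logic: the class name, the method name, and the rule "each cycle adds the other page's full score again" are all assumptions made for illustration. It only shows that if two pages link to each other and their contributions are re-added on every cycle, a float score grows steadily and eventually overflows to "Infinity", where it stays.

```java
public class BoostOverflowSketch {

    // Hypothetical toy model (NOT Nutch code): PageA and PageB link to each
    // other, and on every crawl cycle each page receives the other page's
    // current score again, without discounting what was already counted.
    // Returns the cycle at which the score overflows, or -1 if it never does
    // within maxCycles.
    static int cyclesUntilInfinite(float initialScore, int maxCycles) {
        float a = initialScore;
        float b = initialScore;
        for (int cycle = 1; cycle <= maxCycles; cycle++) {
            float newA = a + b; // link PageB -> PageA counted again
            float newB = b + a; // link PageA -> PageB counted again
            a = newA;
            b = newB;
            if (Float.isInfinite(a)) {
                return cycle;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int cycles = cyclesUntilInfinite(1.0f, 1000);
        System.out.println("Score overflowed after " + cycles + " cycles");
        // Adding to an infinite float leaves it infinite, hence the plateau.
        System.out.println("Infinity + 1 = " + (Float.POSITIVE_INFINITY + 1.0f));
    }
}
```

If the real scoring behaves anything like this, it would match the pattern I'm seeing: normal boosts near 1.0 for some documents, and ever larger boosts for documents that sit in link loops and get re-crawled.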