Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2007/03/08 01:52:43 UTC

[Nutch Wiki] Update of "FixingOpicScoring" by AndrzejBialecki

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by AndrzejBialecki:
http://wiki.apache.org/nutch/FixingOpicScoring

------------------------------------------------------------------------------
   1. You can't recrawl, at least not without having the recrawled pages get inflated scores. This isn't such a problem when the score is just being used to sort the fetch list, but when the score is also used to determine the page's boost (for the corresponding Lucene document), as Nutch currently does, then this is a Bad Thing.
   1. The total score of the "system" as defined by the graph of pages continues to increase, as each newly crawled page adds to the summed score of the system. This then penalizes pages that aren't recrawled, as their score (as a percentage of the total system) keeps dropping.
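To make the first problem concrete, here is a minimal Java sketch (not the actual Nutch scoring plugin; the 0.2 cash-per-crawl figure is made up for illustration) of what happens when a page's incoming cash is simply added to its score on every fetch:

{{{
// A minimal sketch illustrating the inflation problem: if a page's cash is
// simply added to its score on every fetch, refetching alone drives the
// score up, even when the link structure never changes.
public class OpicInflationDemo {
    public static void main(String[] args) {
        float inboundCashPerCrawl = 0.2f;   // hypothetical cash arriving from inlinks
        float score = 0.0f;
        for (int fetch = 1; fetch <= 5; fetch++) {
            score += inboundCashPerCrawl;   // naive update: cash accumulates forever
            System.out.println("after fetch " + fetch + ": score = " + score);
        }
        // Prints 0.2, 0.4, ... up to 1.0, so a frequently recrawled page ends up
        // with a higher boost than an identical page that was crawled only once.
    }
}
}}}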
  
- There were other problems that have recently (as of August 2006) been fixed, such as self-referencial links creating positive feedback loops.
+ There were other problems that have recently (as of August 2006) been fixed, such as self-referential links creating positive feedback loops.
  
  == The Adaptive OPIC Algorithm ==
  
  I'm going to try to summarize the implementation proposed by the original paper, as I think it applies to Nutch, but there are still a few open issues that I'm trying to resolve with the authors.
  
   1. Each page has a "current cash value" that represents the weight of inbound links.
-  1. Whenever a page is processed, the page's current cash is distributed to outlinks.
+  1. Whenever a page is processed, the page's current cash is distributed to outlinks, and zeroed.
-  1. Each page also has a "historical cash value" that represents the cash that's flowed through the page. Initially this starts out as 0.0.
+  1. Each page also has a "historical cash value" that represents the cash that's flowed into the page in the last iteration. Initially this starts out as 0.0.
   1. The score of a page is represented by the sum of the historical and current cash values.
   1. There is one special, virtual root page that has bidirectional links with every other page in the entire web graph. 
    a. When a crawl is initially started, the root page has a cash value of 1.0, and this is then distributed (as 1/n) to the n injected pages. When more pages are injected, it's not clear what happens, but I imagine that some of the cash that has accumulated in the root page is distributed to the injected pages, thus keeping the total "energy" of the system constant.
-   a. Whenever a page is being processed, the root page can receive some of the page's current cash, due to the implicit link from every page to the root page.
+   a. Whenever a page is being processed, the root page can receive some of the page's current cash, due to the implicit link from every page to the root page. So-called "dangling nodes", i.e. pages without outlinks, give all of their cash to the root page.
   1. To handle recrawling, every page also has the last time it was processed. In addition, there's a fixed "time window" that's used to calculate the historical cash value of a page. For the Xyleme crawler, this was set at 3 months, but it seems to be heavily dependent on the rate of re-crawling (average time between page refetches). ''We could use a value derived from fetchInterval.''
   1. When a page is being processed, its historical cash value is calculated from the page's current cash value and the previous historical cash value. The historical cash value is estimated via interpolation to come up with an "expected" historical cash value that is close to what you'd get if every page were re-fetched and processed at the same regular interval. Details are below.
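Putting the per-page step together, here is a minimal in-memory sketch in Java (not the actual Nutch ScoringFilter code; the class and field names are made up for illustration) of how a processed page's cash is split among its outlinks and the virtual root page, how dangling pages hand everything to the root, and how the score is read as history plus current cash:

{{{
import java.util.List;
import java.util.Map;

// Simplified, in-memory sketch of the per-page OPIC processing step.
public class OpicDistributionSketch {

    /** Per-page OPIC state: current cash, historical cash, last processing time. */
    static class PageState {
        float cash;         // "current cash value" -- weight of recently arrived inlink cash
        float history;      // "historical cash value"
        long lastProcessed; // used by the adaptive (recrawl) step
    }

    static final String ROOT = "__root__";  // the special, virtual root page

    /**
     * Distribute the cash of {@code url} to its outlinks and the virtual root,
     * then zero it. Dangling pages (no outlinks) give everything to the root.
     */
    static void distributeCash(String url, List<String> outlinks,
                               Map<String, PageState> graph) {
        PageState page = graph.get(url);
        PageState root = graph.get(ROOT);

        if (outlinks.isEmpty()) {
            // dangling node: all cash flows to the virtual root page
            root.cash += page.cash;
        } else {
            // every page also has an implicit link to the root,
            // so split the cash over (outlinks + 1) targets
            float share = page.cash / (outlinks.size() + 1);
            for (String target : outlinks) {
                graph.get(target).cash += share;
            }
            root.cash += share;
        }
        page.cash = 0.0f;  // the page's current cash is zeroed after distribution
    }

    /** Score used for sorting the fetch list and boosting: history + current cash. */
    static float score(PageState p) {
        return p.history + p.cash;
    }
}
}}}

Note that this step never creates or destroys cash, it only moves it between pages and the root, which is exactly the constant-"energy" property the virtual root page is there to preserve.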
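For the adaptive history estimate, the following sketch shows one plausible reading of the interpolation over the fixed time window from the OPIC paper; the exact formula is what the details below describe, so treat this as an assumption rather than the final Nutch implementation:

{{{
// Hedged sketch of the "expected" historical cash estimate, interpolated
// over a fixed time window, as I read the OPIC paper.
public class AdaptiveHistorySketch {

    /**
     * Estimate the historical cash for a page being processed now.
     *
     * @param oldHistory  historical cash recorded at the previous visit
     * @param currentCash cash accumulated since the previous visit
     * @param deltaT      time since the page was last processed
     * @param window      the fixed time window T (e.g. ~3 months for Xyleme,
     *                    or something derived from fetchInterval in Nutch)
     */
    static float estimateHistory(float oldHistory, float currentCash,
                                 long deltaT, long window) {
        if (deltaT >= window) {
            // the cash gathered since the last visit covers more than one window:
            // scale it down to the window's share of the elapsed time
            return currentCash * ((float) window / deltaT);
        } else {
            // only part of the window is new: keep the proportional tail of the
            // old history and add the newly gathered cash
            return oldHistory * ((float) (window - deltaT) / window) + currentCash;
        }
    }
}
}}}

The intent is that the history always reflects roughly one window's worth of cash flow, so pages refetched at very different intervals remain comparable.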