Posted to dev@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2005/06/20 15:48:27 UTC

Optimizing which links to fetch

Hi all,

It seems that the default behavior of Nutch when sorting links to 
fetch is to use scoreByLinkCount. This then sets the fetch score for 
links on a page to be the same as the containing page's "in-bound 
link" score (or actually the log of same).

What I'd like to do is rate each link on a page separately, based on 
its proximity to key words and other calculated hot-spots. Has this 
been done before? Is the support already there, and I haven't found 
it yet?

If I need to do it myself, the most straightforward approach would be 
to modify emitFetchList() to parse each page (from webdb.pages()), 
matching up the anchors with what's returned by 
dbAnchors.getAnchors(). But this seems inefficient and awkward. Would 
it be better to do this analysis when parsing the HTML originally, 
and somehow save each anchor's score in the web DB?
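To make the idea concrete, here's a rough sketch of the kind of per-anchor scoring I have in mind, done at parse time. Everything here is hypothetical (the class name, the decay formula, and the offset-based inputs are my own, not anything in Nutch):

```java
import java.util.List;

// Hypothetical parse-time anchor scorer: rate each outlink higher the
// closer it sits to a keyword occurrence in the page text. This is an
// illustration of the approach, not Nutch code.
public class AnchorScorer {

    // anchorOffset: character offset of the anchor in the page text.
    // keywordOffsets: character offsets of keyword/hot-spot matches.
    // Returns a score in (0, 1], decaying with distance to the nearest hit.
    public static float score(int anchorOffset, List<Integer> keywordOffsets) {
        if (keywordOffsets.isEmpty()) {
            return 0.0f; // no hot-spots on the page: nothing to boost
        }
        int nearest = Integer.MAX_VALUE;
        for (int k : keywordOffsets) {
            nearest = Math.min(nearest, Math.abs(anchorOffset - k));
        }
        // 1 / (1 + distance/100): 1.0 right at a keyword,
        // 0.5 at 100 characters away, tailing off beyond that.
        return 1.0f / (1.0f + nearest / 100.0f);
    }
}
```

The score could then be carried alongside each outlink when the parse output is written, rather than recomputed in emitFetchList(). The 100-character decay constant is arbitrary; in practice it would want tuning against real pages.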

Thanks,

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

Re: Optimizing which links to fetch

Posted by Doug Cutting <cu...@nutch.org>.
Ken Krugler wrote:
> It seems that the default behavior of Nutch when sorting links to fetch 
> is to use scoreByLinkCount. This then sets the fetch score for links on 
> a page to be the same as the containing page's "in-bound link" score (or 
> actually the log of same).

Please also see:

http://issues.apache.org/jira/browse/NUTCH-61

This is an extensible mechanism for altering the fetch schedule. 
Similarly, we need an extensible mechanism for computing page scores, 
which are used to prioritize the fetching of scheduled pages.  Note that 
the scoring mechanism has changed substantially in the development trunk 
from what is in the 0.7 release.

Doug