Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/03/14 15:14:40 UTC

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

    [ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370356 ] 

Andrzej Bialecki  commented on NUTCH-230:
-----------------------------------------

Hmmm, this is a deeply philosophical question... Should you spread out the OPIC score to all links that a page sports, or just to the links that you are interested in? Which option is closer to the real meaning of the OPIC score?

Let's consider this argument: the OPIC score is a "cash value", and it represents an intrinsic value of a page, or its usefulness. If a page contains useless links, it should lose some "cash" over those links, i.e. because of them the value of the page and its outlinks should be lowered. That's the effect we achieve in the current code.

On the other hand, if we were to change the calculation the way you propose, pages with a lot of bad links would heavily promote those few good links that they have. This seems to contradict the idea of OPIC, which is that "good" pages should promote all outlink-ed pages. If we follow your proposal, bad pages would promote more aggressively than good pages...
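
To make the trade-off concrete, here is a minimal sketch of the two ways of splitting the score. It is not the actual ParseOutputFormat code: the class OpicSplit, both method names, and the sample numbers (a page score of 1.0 with 10 outlinks) are made up for illustration only.

```java
// Illustrative sketch only; names and numbers are hypothetical, not Nutch code.
class OpicSplit {

    // Behaviour described in the issue: divide the page score by ALL outlinks,
    // so the share belonging to filtered links is simply lost.
    static float perLinkByTotal(float pageScore, int totalLinks) {
        return totalLinks == 0 ? 0.0f : pageScore / totalLinks;
    }

    // Proposed behaviour: divide only by the links that survive
    // normalization and filtering, so no score is lost.
    static float perLinkByValid(float pageScore, int validLinks) {
        return validLinks == 0 ? 0.0f : pageScore / validLinks;
    }

    public static void main(String[] args) {
        float pageScore = 1.0f;      // assumed score, same for both pages
        int totalLinks = 10;         // assumed outlink count, same for both pages
        int[] validCounts = {10, 2}; // a "good" page vs. a page with 8 filtered links

        for (int valid : validCounts) {
            float byTotal = perLinkByTotal(pageScore, totalLinks);
            float byValid = perLinkByValid(pageScore, valid);
            System.out.printf(
                "valid=%d/%d  per-link by total=%.2f (passed on: %.2f)  per-link by valid=%.2f%n",
                valid, totalLinks, byTotal, byTotal * valid, byValid);
        }
    }
}
```

With these assumed numbers, the page with only 2 valid links gives each of them 0.1 under the current division (0.8 of its score is dropped), but 0.5 under the proposed one, i.e. five times what the all-valid page gives to each of its outlinks.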

> OPIC score for outlinks should be based on # of valid links, not total # of links.
> ----------------------------------------------------------------------------------
>
>          Key: NUTCH-230
>          URL: http://issues.apache.org/jira/browse/NUTCH-230
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Ken Krugler
>     Priority: Minor

>
> In ParseOutputFormat.java, the write() method currently divides the page score by the # of outlinks:
>           score /= links.length;
> It then loops over the links, and any that pass the normalize/filter gauntlet get added to the crawl output.
> But this means that any filtered links result in some amount of the page's OPIC score being "lost".
> For Nutch 0.7, I built a list of valid (post-filter) links, and then used that to determine the per-link OPIC score, after which I iterated over the list, adding entries to the crawl output.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira