Posted to dev@nutch.apache.org by massimo miccoli <mm...@iltrovatore.it> on 2005/10/03 20:10:07 UTC
Re: [Fwd: Fetch list priority]
+1
I have read the paper about OPIC and it seems very good. I think it is
a must for Nutch to have a good (and fast) webgraph-based ranking
algorithm. I have fetched about 250 million pages, and what I see is
that inlink count alone does not give good results for a big crawl.
Thanks,
Massimo
On 29 Sep 2005, at 23:38, Doug Cutting wrote:
> Here's some interesting stuff about OPIC, an easy-to-calculate
> link-based measure of page quality. I'm going to read the papers, and
> if it is as good as it sounds, perhaps implement this in the mapred
> branch. Does anyone have experience with OPIC?
>
> -------- Original Message --------
> Subject: Fetch list priority
> Date: Thu, 29 Sep 2005 10:57:31 +0200
> From: Carlos Alberto-Alejandro CASTILLO-Ocaranza
> Organization: Universitat Pompeu Fabra
>
> Hi Doug, I'm ChaTo, developer of the WIRE crawler; we met in Compiegne
> during the OSWIR workshop.
>
> I told you I would contact you about the priorities of the crawler,
> and that there are better strategies than using log(indegree). I
> suggested using OPIC (online page importance computation).
>
> OPIC is described here by Abiteboul et al.:
>
> http://www.citeulike.org/user/ChaTo/article/240858
>
> We did experiments with OPIC on two collections of 2 million pages
> each, and we verified that these collections have the same power-law
> exponents as the full web [I'm attaching a graph of PageRank vs.
> pages downloaded]. Ordering pages by indegree is as bad as random:
>
> http://www.citeulike.org/user/ChaTo/article/240824
>
> http://www.citeulike.org/user/ChaTo/article/240898
>
> Why? Because the crawler tends to focus on a few Web sites. See for
> instance Boldi et al., "Do your worst to make the best":
>
> http://www.citeulike.org/user/ChaTo/article/240866
>
> =======================================================================
>
> Here is the general idea of OPIC: at the beginning, each page has the
> same score. Let's call it 'opic':
>
>     for all initial pages i:
>         opic[i] = 1;
>
> Whenever you find a link:
>
>     opic[destination] += opic[source] / outdegree[source];
>
> This is it. Abiteboul's paper proves that this converges even in a
> changing graph, and that it is a good estimator of quality. He also
> suggests using the history of a page to keep its opic across crawls,
> but even without the history we have seen that it works quite well.
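The two rules quoted above (uniform initialization plus the per-link
update) can be sketched as a small in-memory score table. This is only
an illustration under that assumption; the class and method names below
are invented, not from Nutch or the paper.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal in-memory sketch of the two OPIC rules quoted above.
// All names here are illustrative, not Nutch APIs.
public class OpicTable {
    private final Map<String, Double> opic = new HashMap<>();

    // "for all initial pages i: opic[i] = 1;"
    public void addPage(String url) {
        opic.putIfAbsent(url, 1.0);
    }

    // "opic[destination] += opic[source] / outdegree[source];"
    public void onLinkFound(String source, String destination, int outdegree) {
        double share = opic.getOrDefault(source, 1.0) / outdegree;
        opic.merge(destination, share, Double::sum);
    }

    public double score(String url) {
        return opic.getOrDefault(url, 0.0);
    }
}
```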
>
> In your case, what you do in org.apache.nutch.tools.FetchListTool is:
>
>     ...
>     String[] anchors = dbAnchors.getAnchors(page.getURL());
>     curScore.set(scoreByLinkCount ?
>         (float)Math.log(anchors.length+1) : page.getScore());
>     ...
>
> You need something different, because you will have to read the scores
> of the pages that are pointing to your page. You can do it by (a)
> keeping or reading the scores of the inlinks to each page, or (b)
> running the cycle over the source pages in the other order:
>
>     for each page P in the webdb:
>         for each outlink in page P:
>             opic[destination] += opic[P] / outdegree[P];
>
> Note that to make this more effective you must also update the 'opic'
> of the pages you already crawled, and I think you should avoid
> self-links.
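One pass of the cycle in (b), including the self-link caveat, might be
sketched as follows. A Map-based toy graph stands in for the webdb here;
everything in this snippet is an assumption for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of one pass of the webdb cycle described above, skipping
// self-links. The Map-based graph is a stand-in for Nutch's webdb.
public class OpicPass {
    public static Map<String, Double> onePass(Map<String, List<String>> outlinks,
                                              Map<String, Double> opic) {
        // Start from the previous scores so already-crawled pages
        // also get updated.
        Map<String, Double> next = new HashMap<>(opic);
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            String p = e.getKey();
            List<String> dests = e.getValue();
            if (dests.isEmpty()) continue;
            double share = opic.getOrDefault(p, 1.0) / dests.size();
            for (String dest : dests) {
                if (dest.equals(p)) continue; // avoid self-links
                next.merge(dest, share, Double::sum);
            }
        }
        return next;
    }
}
```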
>
> The 'opic' scores will also be statistically distributed according to
> a power-law, so it's sensible to use log(opic) when combining this
> with other scores that have a different distribution, such as text
> similarity.
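A combination along those lines might look like the snippet below. The
equal 0.5 weights are an arbitrary assumption for illustration; the mail
only suggests taking log(opic), not any particular weighting.

```java
// Illustrative combination of a power-law-distributed 'opic' score with
// a text-similarity score via log(opic). The 0.5 weights are arbitrary
// assumptions, not from the original mail.
public class ScoreCombiner {
    public static float combine(float opic, float textSimilarity) {
        // log(opic + 1) keeps the result finite when opic == 0
        return 0.5f * (float) Math.log(opic + 1.0f) + 0.5f * textSimilarity;
    }
}
```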
>
> =======================================================================
>
> I hope this is useful for you.
>
> All the best,
>
> --
> ChaTo = Carlos Alberto-Alejandro CASTILLO-Ocaranza, PhD