You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@nutch.apache.org by Mike Smith <mi...@gmail.com> on 2006/12/15 03:11:27 UTC

pagerank implementation

Hi,

I remember nutch 0.7 used to have a scoring method similar to Google
Pagerank by analysing links globally. But, since that was computationally
intensive it was replaced by OPIC scoring. Since, Nutch is completely moved
to Map/Red structure now, is it worth it to port that scoring into
Map/Reduce? Does anybody have any idea of which scoring is more reliable or
closer to Google's pagerank. I know OPIC is an incremental scoring
and google pagerank is a global ranking, but I guess map/reduce will help to
solve previous computational complexity!?

Thanks,
Mike.

Re: pagerank implementation

Posted by Andrzej Bialecki <ab...@getopt.org>.
Mike Smith wrote:
> Hi,
>
> I remember nutch 0.7 used to have a scoring method similar to Google
> Pagerank by analysing links globally. But, since that was computationally
> intensive it was replaced by OPIC scoring. Since, Nutch is completely 
> moved
> to Map/Red structure now, is it worth it to port that scoring into
> Map/Reduce? Does anybody have any idea of which scoring is more 
> reliable or
> closer to Google's pagerank. I know OPIC is an incremental scoring
> and google pagerank is a global ranking, but I guess map/reduce will 
> help to
> solve previous computational complexity!?

Mike,

It would be certainly interesting to implement PageRank scoring using 
map-reduce. Current implementation of OPIC ranking in Nutch is probably 
subtly broken (or at least it's not certain how well it corresponds to 
the method described in the original paper).

I don't think a straightforward PR computation is feasible, even with 
map-reduce - but there are many faster, approximate methods of 
computation described in the literature. Please run a search for 
"pagerank" on Citeseer (http://citeseer.ist.psu.edu/), and also look at 
the proceedings of the last 3 WWW conferences (http://www2006.org/, 
http://www2005.org/, http://www2004.org/).

There was also an option in Nutch 0.7 of approximating PR by considering 
only the first level of linkage, i.e. to calculate the score based only 
on the the number of directly outgoing and incoming links (without 
propagating the score through the graph). This should be trivial to 
implement as a scoring plugin.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com