Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2008/06/14 19:59:45 UTC

[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605091#action_12605091 ] 

Andrzej Bialecki  commented on NUTCH-635:
-----------------------------------------

This patch looks great! A few comments:

* in OutlinksDb.reduce() you use a simple assignment, mostRecent = next. This doesn't work as expected, because the Hadoop iterator reuses the same single instance of Outlinks under the hood, so if you keep a reference to it, its value will mysteriously change under your feet as you call values.next(). This should be replaced with a deep copy (or clone) of the instance, either through a dedicated method of Outlinks or WritableUtils.clone().
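
A minimal sketch of the fix, using the old mapred API (untested; getTimestamp() is just an illustrative accessor, and conf is the JobConf saved in configure()):

    public void reduce(Text key, Iterator<Outlinks> values,
        OutputCollector<Text, Outlinks> output, Reporter reporter)
        throws IOException {
      Outlinks mostRecent = null;
      while (values.hasNext()) {
        Outlinks next = values.next();
        if (mostRecent == null || next.getTimestamp() > mostRecent.getTimestamp()) {
          // clone instead of aliasing - the iterator reuses a single instance
          mostRecent = (Outlinks) WritableUtils.clone(next, conf);
        }
      }
      if (mostRecent != null) {
        output.collect(key, mostRecent);
      }
    }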

* you should avoid spurious whitespace changes to existing classes; they make the patch harder to read (e.g. in Outlink.java)

* in Outlinks.write() I think there's a bug - you write out System.currentTimeMillis() instead of this.timestamp. Is this intentional?
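
If it's not, the fix would be a one-liner in write(DataOutput out), assuming timestamp is a long field:

    out.writeLong(this.timestamp); // not System.currentTimeMillis()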

* in LinkAnalysis.Counter.map(), since you output constant values, you should avoid creating new instances on every call and use a pair of static instances instead.
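
Something along these lines (the key/value types and names are illustrative, I didn't check the actual signature):

    // the emitted key/value never change, so allocate them once
    private static final Text KEY = new Text("count");
    private static final LongWritable ONE = new LongWritable(1);

    public void map(Text key, Writable value,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      output.collect(KEY, ONE);
    }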

* by the way, in an implementation of a similar algorithm I used Hadoop Counters to count the totals; this way you avoid storing magic numbers in the db itself (although you still need to preserve the value somewhere, so I'd create an additional file for it ... well, perhaps not so elegant either, after all ;) ).
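
For illustration, with the old mapred API (the enum name is made up):

    // visible to both the tasks and the driver:
    public enum LinkCounters { TOTAL_NODES }

    // inside map() or reduce():
    reporter.incrCounter(LinkCounters.TOTAL_NODES, 1);

    // in the driver, after the job completes:
    RunningJob running = JobClient.runJob(conf);
    long totalNodes = running.getCounters().getCounter(LinkCounters.TOTAL_NODES);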

* LinkAnalysis.Analyzer.reduce() - you should retrieve config parameters in configure(JobConf), otherwise you pay the price of reading floats from the Configuration on every call (which involves repeated string parsing via Float.parseFloat()). Also, the HashPartitioner should be created once. This is really a general comment on the whole patch - it creates a lot of objects unnecessarily. We can optimize it now or later, whichever you prefer.
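
Roughly like this (the field and property names are only examples):

    private float dampingFactor;
    private HashPartitioner<Text, Writable> partitioner;

    public void configure(JobConf job) {
      // parse the float once per task instead of once per reduce() call,
      // and create the partitioner a single time
      dampingFactor = job.getFloat("linkanalysis.damping.factor", 0.85f);
      partitioner = new HashPartitioner<Text, Writable>();
    }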

I haven't gone into the algorithm itself yet, so no useful comments on that part ... but I have a dataset of ~4 million pages I can test it on.



> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch which simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations.  This tool is meant to replace the current scoring system in Nutch, whose scores increase exponentially, with one that converges.  It also includes a tool to create an outlinkdb.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.