Posted to dev@nutch.apache.org by "Dennis Kubes (JIRA)" <ji...@apache.org> on 2008/06/12 20:12:45 UTC

[jira] Created: (NUTCH-635) LinkAnalysis Tool for Nutch

LinkAnalysis Tool for Nutch
---------------------------

                 Key: NUTCH-635
                 URL: https://issues.apache.org/jira/browse/NUTCH-635
             Project: Nutch
          Issue Type: New Feature
    Affects Versions: 1.0.0
         Environment: All
            Reporter: Dennis Kubes
            Assignee: Dennis Kubes
             Fix For: 1.0.0


This is a basic PageRank-type link analysis tool for Nutch which simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations.  This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores.  Also includes a tool to create an outlinkdb.
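
For readers unfamiliar with the idea, here is a minimal sketch of a damped PageRank-style iteration over outlink lists. It is the textbook formulation, not the algorithm in the attached patches; the class name, the tiny example graph, and the 0.85 damping factor are all assumptions made for illustration. Scores always sum to 1, which is the sense in which such a scheme converges rather than growing without bound.

```java
import java.util.Arrays;

// Illustrative only: a generic damped PageRank iteration, not the NUTCH-635 code.
final class PageRankSketch {

    /**
     * outlinks[i] lists the pages that page i links to.
     * Returns the scores after a fixed number of iterations.
     */
    static double[] rank(int[][] outlinks, int iterations, double damping) {
        int n = outlinks.length;
        double[] score = new double[n];
        Arrays.fill(score, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // score held by pages with no outlinks

            for (int i = 0; i < n; i++) {
                if (outlinks[i].length == 0) {
                    dangling += score[i];
                    continue;
                }
                // Each page splits its score evenly among its outlinks.
                double share = score[i] / outlinks[i].length;
                for (int target : outlinks[i]) {
                    next[target] += share;
                }
            }

            for (int i = 0; i < n; i++) {
                // Dangling score is spread evenly so the total stays at 1.
                next[i] = (1.0 - damping) / n + damping * (next[i] + dangling / n);
            }
            score = next;
        }
        return score;
    }

    public static void main(String[] args) {
        // Hypothetical 4-page graph: 0 -> 1,2   1 -> 2   2 -> 0   3 -> 2
        int[][] graph = { {1, 2}, {2}, {0}, {2} };
        System.out.println(Arrays.toString(rank(graph, 50, 0.85)));
    }
}
```

In this toy graph page 2 has the most inlinks and ends up with the highest score, while page 3, which nothing links to, keeps only the baseline (1 - damping) / n.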

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623504#action_12623504 ] 

Andrzej Bialecki  commented on NUTCH-635:
-----------------------------------------

Dennis, please split this patch into the link analysis and indexing parts, and move the part related to the new indexing framework to a separate issue, so that we deal only with the link analysis patch here. Thank you!

>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch


[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment: NUTCH-635-6-20080725.patch

Finished link analysis and indexer framework along with tools.

>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch


[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630478#action_12630478 ] 

Dennis Kubes commented on NUTCH-635:
------------------------------------

Ooops, yeah crud.  I must have switched them.  I have a little cleanup to do on the indexing one, then I will repost.  Sorry about that.

>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-8-20080818.patch


[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657686#action_12657686 ] 

Hudson commented on NUTCH-635:
------------------------------

Integrated in Nutch-trunk #667 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/667/])
    

>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch


[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620079#action_12620079 ] 

Dennis Kubes commented on NUTCH-635:
------------------------------------

> # some crucial javadoc is missing, such as the comments on class level (at least), especially if they are cmd-line utilities or classes that support a major functionality.

Yup.  Still going through and doing the javadoc.  I will have all that done before any final commit.

> # perhaps we don't need a separate Node db, this information can be added directly to the CrawlDb, which could save us the trouble with running the ScoreUpdater.

The crawldb can get huge as you know.  It could be updated into the crawldb but then we are stuck using the crawldb everywhere we currently use the nodedb, which is a lot of places both in the analysis and in the indexing.  The way it currently is works much faster and allows us to at a glance see scores and number of links per url using the NodeReader tool.

> linkType should be byte, not int - this saves 3 bytes on each entry.

Done
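
The three-byte saving quoted above can be checked with java.io alone. This is a small sketch, not code from the patch: the class and method names are made up for illustration, and it only measures how many bytes DataOutputStream emits for an int versus a byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative only: compares the serialized size of an int field and a byte field.
final class LinkTypeSizeSketch {

    /** Number of bytes written when the link type is serialized as an int. */
    static int sizeAsInt(int linkType) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(linkType);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    /** Number of bytes written when the link type is serialized as a byte. */
    static int sizeAsByte(byte linkType) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeByte(linkType);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        System.out.println("int:  " + sizeAsInt(1) + " bytes");
        System.out.println("byte: " + sizeAsByte((byte) 1) + " bytes");
    }
}
```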

> Loops.Route.readFields(): I think it's better to use Text.readString() instead of DataInput.readUTF(). Or for that matter, replace the plain Strings with Text, since many times in other places in Loops you need to create a Text object anyway, out of one of Route's fields.

This has been fixed in a more recent patch
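
One practical difference that the review does not spell out, so treat it as a side note rather than the reviewer's stated reason: DataOutput.writeUTF() stores the encoded length in two bytes, so it rejects any string whose encoded form exceeds 65535 bytes. The sketch below demonstrates that limit with the JDK only; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Illustrative only: shows the 65535-byte limit of DataOutput.writeUTF().
final class WriteUtfLimitSketch {

    /** Returns true if writeUTF() accepts an ASCII string of the given length. */
    static boolean writeUtfAccepts(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'a'); // 'a' encodes to exactly one byte
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(new String(chars));
            return true;
        } catch (UTFDataFormatException e) {
            return false; // encoded form is longer than 65535 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeUtfAccepts(65535));
        System.out.println(writeUtfAccepts(65536));
    }
}
```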

> I don't understand why clearScore is set to 0.00001f. What's with the magic number?

Leftovers.  This has been fixed to 0.0f

> ReprUrlFixer should go into tools.compat

Done

> the new indexing framework: I like the added flexibility, but the cost for that seems high. Previously we only had to run a single map-red job to create an index, now we have to run at least 6 jobs, each with a large dataset. I vote for splitting the patch and creating a separate issue for this framework, so that we can discuss it further.

I agree.  This patch is getting big and the indexing stuff should go into a separate issue.  I will create one.  Also I have reworked the indexer to allow for field filters.  I will post the new patch on the new issue.  

I agree that it is more jobs but I don't see a way around that.  And the new analysis is also more jobs.  I am not afraid of running more jobs on the system as that can be automated.  I am afraid of not having the flexibility that I need and the ability to apply a type of analysis.  The current indexer locks in the databases that can be used and we need more flexibility than that, not just in what is indexed but also in how it is indexed.  With this approach we can create fields from any MR job and then integrate and index all of those fields.  New fields and analysis scores can be added without changing the indexing code.  The newer patch also creates an extension point for field filters that allow manipulation of the fields and document in the index once the fields are aggregated together.  This allows a great deal of flexibility in indexing fields, aggregates and manipulating document boosts, and in taking other actions such as blacklisting.  Again I will post the new patch soon.


>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch


[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment:     (was: NUTCH-635-8-20080818.patch)

>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch


[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment: NUTCH-635-3-20080614.patch

Stable patch that fixes some of the issues commented on and mentioned previously.  This patch converges well on a dataset of over 100K pages and handles reciprocal linking.  As of yet link farms don't seem to be a problem but we shall see.

>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch


[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Patch Info: [Patch Available]

>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-8-20080818.patch


[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment: NUTCH-635-5-20080620.patch

Refactored patch that removes network calls using MapFile.Readers and better simulates a row matrix through inverting and merging inlink scores.  This patch works in the general sort-merge-process structure of MapReduce and as such should be significantly faster.  The previous jobs were taking far too long to process on a large dataset.  This patch includes the link analysis tool, a tool for updating the crawl db with a new score and clearing scores of urls with no score, an outlink database tool, a new inlink database tool that will keep inlinks consistent with outlinks, and a new scoring plugin which replaces the opic plugin.

The order of tool runs should now be: Inject, Generate, Fetch, UpdateDb, OutlinkDb, InlinkDb, LinkAnalysis, ScoreUpdater, Indexer

>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch


[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605102#action_12605102 ] 

Dennis Kubes commented on NUTCH-635:
------------------------------------

Andrzej Bialecki Wrote:

    *  in OutlinksDb.reduce() you use a simple assignment mostRecent = next. This doesn't work as expected, because Hadoop iterator reuses the same single instance of Outlinks under the hood, so if you keep a reference to it its value will mysteriously change under your feet as you call values.next(). This should be replaced with a deep copy (or clone) of the instance, either through a dedicated method of Outlinks or WritableUtils.copy().

Fixed this.  Thanks.  I knew it happened for writables but wasn't aware that it was implemented the same way in the iterators.
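
The pitfall is easy to reproduce without Hadoop. The sketch below is an assumption-laden stand-in, not the patch code: Record plays the role of Outlinks, and reusingIterator() imitates the behaviour described above by refilling one shared object on every next() call. Keeping a reference returns whatever was read last; taking a copy returns the real maximum.

```java
import java.util.Iterator;

// Illustrative only: simulates an iterator that reuses one mutable instance.
final class ReusedInstanceSketch {

    /** Hypothetical stand-in for a mutable Writable such as Outlinks. */
    static final class Record {
        long timestamp;

        Record() {}

        /** Copy constructor: the "deep copy" for this one-field record. */
        Record(Record other) {
            this.timestamp = other.timestamp;
        }
    }

    /** Iterator that refills and returns the same Record object every time. */
    static Iterator<Record> reusingIterator(final long[] timestamps) {
        final Record shared = new Record();
        return new Iterator<Record>() {
            private int pos = 0;

            public boolean hasNext() {
                return pos < timestamps.length;
            }

            public Record next() {
                shared.timestamp = timestamps[pos++];
                return shared;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    /** Buggy: mostRecent aliases the shared object, so it always holds the last value read. */
    static long mostRecentByReference(long[] timestamps) {
        Iterator<Record> values = reusingIterator(timestamps);
        Record mostRecent = null;
        while (values.hasNext()) {
            Record next = values.next();
            if (mostRecent == null || next.timestamp > mostRecent.timestamp) {
                mostRecent = next;
            }
        }
        return mostRecent.timestamp;
    }

    /** Fixed: copies the record before keeping it. */
    static long mostRecentByCopy(long[] timestamps) {
        Iterator<Record> values = reusingIterator(timestamps);
        Record mostRecent = null;
        while (values.hasNext()) {
            Record next = values.next();
            if (mostRecent == null || next.timestamp > mostRecent.timestamp) {
                mostRecent = new Record(next);
            }
        }
        return mostRecent.timestamp;
    }

    public static void main(String[] args) {
        long[] timestamps = { 30L, 10L, 20L };
        System.out.println("by reference: " + mostRecentByReference(timestamps));
        System.out.println("by copy:      " + mostRecentByCopy(timestamps));
    }
}
```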

    * you should avoid spurious whitespace changes to existing classes, this makes the reading more difficult ... (e.g. Outlink.java)

That was a mistake, fixed it.

    * in Outlinks.write() I think there's a bug - you write out System.currentTimeMillis() instead of this.timestamp, is this intentional?

Nope, that was a bug from an earlier version of it.  Fixed.
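
A round-trip check is one way to catch this kind of bug. The sketch below uses a hypothetical Stamped class in place of Outlinks and plain java.io streams in place of Hadoop's serialization; the point is only that write() must emit the stored field so that readFields() restores the same value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative only: a write()/readFields() pair that preserves its timestamp.
final class TimestampRoundTripSketch {

    /** Hypothetical stand-in for a Writable that carries a timestamp. */
    static final class Stamped {
        long timestamp;

        void write(DataOutput out) throws IOException {
            // Write the stored field, not System.currentTimeMillis().
            out.writeLong(this.timestamp);
        }

        void readFields(DataInput in) throws IOException {
            this.timestamp = in.readLong();
        }
    }

    /** Serializes a record with the given timestamp and returns the value read back. */
    static long roundTrip(long timestamp) {
        try {
            Stamped original = new Stamped();
            original.timestamp = timestamp;

            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));

            Stamped restored = new Stamped();
            restored.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return restored.timestamp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(1213300000000L));
    }
}
```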

    * in LinkAnalysis.Counter.map() , since you output static values, you should avoid creating new instances and use a pair of static instances.

    * by the way, in an implementation of similar algo I used Hadoop Counters to count the totals, this way you avoid storing magic numbers in the db itself (although you still need to preserve them somewhere, so I'd create an additional file with this value ... well, perhaps not so elegant either after all ).

This is really just a temp file.  I count the urls, write the count to a file using a single reduce task, then read it back in the update method of LinkAnalysis and pass it into the jobs through conf.  Once it is read I delete the file.

    * LinkAnalysis.Analyzer.reduce() - you should retrieve config parameters in configure(Job), otherwise you pay the price of getting floats from Configuration (which involves repeated creation of Float via Float.parseFloat()). Also, HashPartitioner should be created once. Well, this is a general comment to this patch - it creates a lot of objects unnecessarily. We can optimize it now or later, whatever you prefer.

I think a bit of both.  I fixed the HashPartitioner one.  My intention with this first version is to get a workable tool that converges the score and to provide workarounds for the common types of link spam such as reciprocal links and link farms / tightly knit communities.  Once it is working we can always optimize the speed later.  That being said the current version is faster than I thought it would be.  The current patch does converge and it handles reciprocal links and some cases of link farms, but it is currently being overinfluenced by link loops of three or more sites.  Once I have that taken care of I will post a new patch.
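
The configure-once suggestion can be sketched without Hadoop. Everything here is hypothetical: a Map stands in for the job Configuration, the property name example.damping.factor is invented, and the formula in reduce() is a placeholder rather than the patch's scoring. What the sketch does show is the pattern: parse the setting once in configure() and reuse the cached float on every reduce() call.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: parse configuration once, then reuse the cached value.
final class ConfigureOnceSketch {

    /** Hypothetical property name, not a real Nutch setting. */
    static final String DAMPING_KEY = "example.damping.factor";

    static final class Analyzer {
        private float dampingFactor;
        int parseCount = 0; // how many times the setting was parsed

        /** Called once per task: the only place the string is parsed. */
        void configure(Map<String, String> conf) {
            String raw = conf.getOrDefault(DAMPING_KEY, "0.85");
            dampingFactor = Float.parseFloat(raw);
            parseCount++;
        }

        /** Called once per key: uses the cached float, no parsing here. */
        float reduce(float inlinkScoreSum) {
            return (1.0f - dampingFactor) + dampingFactor * inlinkScoreSum;
        }
    }

    /** Builds an Analyzer configured with the given damping factor. */
    static Analyzer newAnalyzer(String damping) {
        Map<String, String> conf = new HashMap<>();
        conf.put(DAMPING_KEY, damping);
        Analyzer analyzer = new Analyzer();
        analyzer.configure(conf);
        return analyzer;
    }

    public static void main(String[] args) {
        Analyzer analyzer = newAnalyzer("0.5");
        System.out.println(analyzer.reduce(2.0f));
        System.out.println(analyzer.reduce(4.0f));
        System.out.println("parsed " + analyzer.parseCount + " time(s)");
    }
}
```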


>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch


[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619941#action_12619941 ] 

Andrzej Bialecki  commented on NUTCH-635:
-----------------------------------------

A few comments to the latest patch:

* some crucial javadoc is missing, such as the comments on class level (at least), especially if they are cmd-line utilities or classes that support a major functionality.
* perhaps we don't need a separate Node db, this information can be added directly to the CrawlDb, which could save us the trouble with running the ScoreUpdater.
* minor thing, but in many classes you use a repeating pattern of creating instances of List, HashSet, ObjWritable, etc, etc inside the map()/reduce() methods, while they should be created once and reused.
* LinkDatum:
** linkType should be byte, not int - this saves 3 bytes on each entry.
* LinkRank:
** I wonder if we couldn't skip the Counter job, and instead collect the total number of links via Hadoop job counters. I.e. define counters in Mapper/Reducer of the analysis job, and then after the job is done you can retrieve them from a RunningJob instance. We could then maintain this value on each update of the db in a well-known location, as you do this already, except we could skip this additional runCounter(..) job ...
* Loops:
** Loops.Route.readFields(): I think it's better to use Text.readString() instead of DataInput.readUTF(). Or for that matter, replace the plain Strings with Text, since many times in other places in Loops you need to create a Text object anyway, out of one of Route's fields.
* LinkUpdater:
** I don't understand why clearScore is set to 0.00001f. What's with the magic number?
* ReprUrlFixer should go into tools.compat
* ResolveUrls uses ReprUrlFixer log, it should use its own. Besides, this tool is not relevant to this patch, so I think it should be submitted separately.
* the new indexing framework: I like the added flexibility, but the cost for that seems high. Previously we only had to run a single map-red job to create an index, now we have to run at least 6 jobs, each with a large dataset. I vote for splitting the patch and creating a separate issue for this framework, so that we can discuss it further.


>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch


[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605091#action_12605091 ] 

Andrzej Bialecki  commented on NUTCH-635:
-----------------------------------------

This patch looks great! A few comments:

* in OutlinksDb.reduce() you use a simple assignment mostRecent = next. This doesn't work as expected, because Hadoop iterator reuses the same single instance of Outlinks under the hood, so if you keep a reference to it its value will mysteriously change under your feet as you call values.next(). This should be replaced with a deep copy (or clone) of the instance, either through a dedicated method of Outlinks or WritableUtils.copy().

* you should avoid spurious whitespace changes to existing classes, this makes the reading more difficult ... (e.g. Outlink.java)

* in Outlinks.write() I think there's a bug - you write out System.currentTimeMillis() instead of this.timestamp, is this intentional?

* in LinkAnalysis.Counter.map() , since you output static values, you should avoid creating new instances and use a pair of static instances.

* by the way, in an implementation of similar algo I used Hadoop Counters to count the totals, this way you avoid storing magic numbers in the db itself (although you still need to preserve them somewhere, so I'd create an additional file with this value ... well, perhaps not so elegant either after all ;) ).

* LinkAnalysis.Analyzer.reduce() - you should retrieve config parameters in configure(Job), otherwise you pay the price of getting floats from Configuration (which involves repeated creation of Float via Float.parseFloat()). Also, HashPartitioner should be created once. Well, this is a general comment to this patch - it creates a lot of objects unnecessarily. We can optimize it now or later, whatever you prefer.

I didn't go into the algorithm itself yet to give any useful comments ... But I have a dataset of ~4mln pages I can test it on.



>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch


[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment: NUTCH-635-2-20080613.patch

Updated patch.  Contains a score updater for crawl db.  A scoring filter to work with the link analysis tool.  Updated the LinkAnalysis tool to handle reciprocal links, links from the same domain/subdomains, rank sinks, and link loops.  Also included a display tool to view inlinks/outlinks and scores for a given url.  Should be ready for large scale testing.  Tested on a dataset of 25K pages and the results were promising.

>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch


[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630268#action_12630268 ] 

Doğacan Güney commented on NUTCH-635:
-------------------------------------

I have skimmed through the last patches in this one and NUTCH-646. But I am confused. Are the patches swapped? This one here seems to be about indexing, while NUTCH-646 has loops and link analysis and web graphs :)

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-8-20080818.patch
>
>
> This is a basic pagerank type link analysis tool for nutch which simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations.  This tool is mean to replace the current scoring system in nutch with a system that converges instead of exponentially increasing scores.  Also includes a tool to create an outlinkdb.



Re: NTCH-635 LinkAnalysis Tool for Nutch

Posted by Pradeep Pujari <pr...@macys.com>.
Where is the document? Can you please send it to me?

Thanks/Regards
Pradeep Pujari

"Eric J. Christeson" <Eric.Christeson@ndsu.edu> wrote on 02/12/2009 04:05 PM:

To: nutch-dev@lucene.apache.org
Subject: NTCH-635 LinkAnalysis Tool for Nutch
Please respond to: nutch-dev@lucene.apache.org

I went through org.apache.nutch.scoring.webgraph.*, found all the
config settings I could, threw them into nutch-default.xml, and tried
to document them.  Who wants the patches?

Eric
--
Eric J. Christeson
<Er...@ndsu.edu>
Enterprise Computing and Infrastructure    (701) 231-8693 (Voice)
North Dakota State University





NTCH-635 LinkAnalysis Tool for Nutch

Posted by "Eric J. Christeson" <Er...@ndsu.edu>.
I went through org.apache.nutch.scoring.webgraph.*, found all the
config settings I could, threw them into nutch-default.xml, and tried
to document them.  Who wants the patches?

Eric
--
Eric J. Christeson                                  
<Er...@ndsu.edu>
Enterprise Computing and Infrastructure    (701) 231-8693 (Voice)
North Dakota State University


[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633644#action_12633644 ] 

Doğacan Güney commented on NUTCH-635:
-------------------------------------

Sorry for the late review.... 

The patch looks great, and since it is very self-contained, I see no reason not to commit it immediately.

Some notes:

  - Can we also commit a small (5K-6K nodes maybe) test graph, so that future changes can be tested against it?
  - There are many WritableUtils.clone calls in the code. I don't think that they are necessary.
  - Instead of ObjectWritable, I would suggest using NutchWritable. NutchWritable is lighter.
  - There are a couple of new warnings. Mostly with unused JobConf-s and with OptionBuilder. 
  - It may be a good idea to create some plugins for the webgraph package to give users some control over which
    outlinks to filter and which to keep (obviously for later).
  - Can you explain your score formula?

{code}

      // calculate linkRank score formula
      float linkRankScore = (1 - this.dampingFactor)
        + (this.dampingFactor * totalInlinkScore);

{code}
 
      I may be mistaken, but you seem to cover only the case where the random surfer clicks a link on a page, and not the case where he types a new URL to start over. Also, why do you add 0.15 (the default value) to every score?
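
For reference, here is a minimal sketch of how a formula of this shape behaves over a few iterations. It assumes that totalInlinkScore is the sum, over all inlinking pages, of each page's score divided by its outlink count; the excerpt above does not show how it is computed. The class name, the adjacency-array input, and the starting score of 1.0 are illustrative assumptions and are not taken from the patch. In the unnormalized formulation published by Brin and Page, the (1 - dampingFactor) term is the one that models the surfer starting over at a new page, so a page with no inlinks settles at 1 - dampingFactor (0.15 when the damping factor is 0.85).

```java
import java.util.Arrays;

/** Illustrative sketch only; not the code from the NUTCH-635 patch. */
final class LinkRankSketch {

  /**
   * Runs a fixed number of iterations of
   * score = (1 - dampingFactor) + dampingFactor * totalInlinkScore.
   * outlinks[i] lists the nodes that node i links to (hypothetical input format).
   */
  static float[] iterate(int[][] outlinks, float dampingFactor, int iterations) {
    int n = outlinks.length;
    float[] scores = new float[n];
    Arrays.fill(scores, 1.0f); // assumed starting score for every node

    for (int iter = 0; iter < iterations; iter++) {
      // Each node splits its current score evenly among its outlinks.
      float[] totalInlinkScore = new float[n];
      for (int from = 0; from < n; from++) {
        if (outlinks[from].length == 0) {
          continue; // a node without outlinks passes nothing on in this sketch
        }
        float share = scores[from] / outlinks[from].length;
        for (int to : outlinks[from]) {
          totalInlinkScore[to] += share;
        }
      }

      // Apply the formula quoted above to every node.
      float[] next = new float[n];
      for (int node = 0; node < n; node++) {
        next[node] = (1 - dampingFactor) + (dampingFactor * totalInlinkScore[node]);
      }
      scores = next;
    }
    return scores;
  }

  public static void main(String[] args) {
    // Three pages linking in a ring: 0 -> 1 -> 2 -> 0.
    int[][] ring = {{1}, {2}, {0}};
    System.out.println(Arrays.toString(iterate(ring, 0.85f, 10)));
  }
}
```

Note that this sketch simply drops the score of pages without outlinks; handling such rank sinks is a separate design decision.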

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-8-20080818.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.



[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment: NUTCH-635-7-20080808.patch

Final patch, includes comments, change suggestions, the new scoring and link analysis tools, and the new indexing framework.

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.



[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605142#action_12605142 ] 

Dennis Kubes commented on NUTCH-635:
------------------------------------

Andrzej Bialecki wrote:

> One more question: you said the algorithm converges, but do you have a reference set of values from this dataset, calculated using some other pagerank impl? It would be worthwhile to make sure that the values are indeed the PageRank, as described, and not yet another subtle variation such as our OPIC

I was doing it low-tech.  If you turn on debug logging (warning: the output is large) and use grep, you can see the scores converge after a few iterations ;)
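
The same kind of check can also be done programmatically. Below is a minimal sketch, assuming the scores from two successive iterations are available as arrays indexed by node. The class name, method names, and tolerance parameter are hypothetical and are not part of the patch.

```java
/** Illustrative sketch only; names and signatures are assumptions. */
final class ConvergenceCheck {

  /** Returns the largest absolute per-node change between two score vectors. */
  static float maxDelta(float[] previous, float[] current) {
    if (previous.length != current.length) {
      throw new IllegalArgumentException("score vectors must have the same length");
    }
    float max = 0f;
    for (int i = 0; i < previous.length; i++) {
      max = Math.max(max, Math.abs(current[i] - previous[i]));
    }
    return max;
  }

  /** True when no node's score moved by the given tolerance or more. */
  static boolean hasConverged(float[] previous, float[] current, float tolerance) {
    return maxDelta(previous, current) < tolerance;
  }

  public static void main(String[] args) {
    float[] before = {1.0f, 2.0f};
    float[] after = {1.5f, 2.0f};
    System.out.println("max delta: " + maxDelta(before, after));
    System.out.println("converged: " + hasConverged(before, after, 1e-3f));
  }
}
```

Logging the value of maxDelta once per iteration gives a single number to watch instead of one line per URL.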

> There are a few Java packages for computing PageRank, we could adapt one of those to serve as a baseline:
> 
> http://law.dsi.unimi.it/
> http://webla.sourceforge.net/javadocs/pt/tumba/links/PageRank.html

I agree it would be a good comparison.  Strictly speaking, though, it is not just PageRank.  There are optimizations for multiple links from a given domain, penalties for very few inlinks, and a minimum score value, all of which can be changed through the configuration.  Beyond that, it follows the original PageRank algorithm closely.

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.



[jira] Closed: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes closed NUTCH-635.
------------------------------


> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.



[jira] Resolved: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-635.
--------------------------------

    Resolution: Fixed

Committed with revision 723441

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.



[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment: NUTCH-635-1-20080612.patch

Basic patch.  It doesn't include unit tests, but it has been tested.  Includes the LinkAnalysis tool and the Outlink tool.  Still needs to handle cases such as teleportation and rank sinks.  But here it is as a first pass for people to see.

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.



[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment:     (was: NUTCH-635-7-20080808.patch)

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.



[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment: NUTCH-635-9-20081126.patch

Updated final patch for new link analysis framework.  I am also going to write up some documentation on the wiki for how this new process works.

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.



[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment: NUTCH-635-8-20080818.patch

Breaks out the new indexing framework into its own patch, NUTCH-646.  Moves the ResolveURLs tool into its own patch.  Makes the patch Java 5 compatible.

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-8-20080818.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.



[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment: NUTCH-635-4-20080615.patch

Adds normalization for many links from a single domain and a penalty threshold for very few inlinks.  Also adds the ability to adjust the boost written into the index to compensate for front-end query boosts.

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.



[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment: NUTCH-635-7-20080808.patch

Final patch.  Includes comment and code change suggestions.  Includes new scoring, link analysis, and indexing frameworks and tools.

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.



[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605139#action_12605139 ] 

Andrzej Bialecki  commented on NUTCH-635:
-----------------------------------------

One more question: you said the algorithm converges, but do you have a reference set of values from this dataset, calculated using some other pagerank impl? It would be worthwhile to make sure that the values are indeed the PageRank, as described, and not yet another subtle variation such as our OPIC ;)

There are a few Java packages for computing PageRank; we could adapt one of those to serve as a baseline:

http://law.dsi.unimi.it/
http://webla.sourceforge.net/javadocs/pt/tumba/links/PageRank.html
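
One way such a comparison could work is sketched below. Two implementations may scale their scores differently, so the sketch first normalizes each score vector to sum to 1 and then reports the largest per-page difference. The class and method names are hypothetical, and the sketch assumes both vectors list the same pages in the same order. It does not use either of the packages linked above.

```java
/** Illustrative sketch only; not tied to any specific PageRank package. */
final class BaselineComparison {

  /** Scales the scores so that they sum to 1. */
  static double[] normalize(double[] scores) {
    double sum = 0.0;
    for (double s : scores) {
      sum += s;
    }
    if (sum == 0.0) {
      throw new IllegalArgumentException("scores must not sum to zero");
    }
    double[] result = new double[scores.length];
    for (int i = 0; i < scores.length; i++) {
      result[i] = scores[i] / sum;
    }
    return result;
  }

  /** Largest per-page difference after normalizing both score vectors. */
  static double maxNormalizedDifference(double[] candidate, double[] baseline) {
    if (candidate.length != baseline.length) {
      throw new IllegalArgumentException("score vectors must have the same length");
    }
    double[] a = normalize(candidate);
    double[] b = normalize(baseline);
    double max = 0.0;
    for (int i = 0; i < a.length; i++) {
      max = Math.max(max, Math.abs(a[i] - b[i]));
    }
    return max;
  }

  public static void main(String[] args) {
    double[] candidate = {1.0, 2.0, 1.0};  // made-up example values
    double[] baseline = {0.25, 0.5, 0.25}; // same ranking on a different scale
    System.out.println(maxNormalizedDifference(candidate, baseline));
  }
}
```

A small difference would suggest the two implementations agree up to scaling; a large one would point to a variation in the algorithm rather than just in the units.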


> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.
