You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by "Nilesh Chakraborty (JIRA)" <ji...@apache.org> on 2014/01/28 18:07:40 UTC

[jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce

    [ https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884332#comment-13884332 ] 

Nilesh Chakraborty commented on MAHOUT-742:
-------------------------------------------

Hey [~ssc], I've been working on an implementation of large-scale blockwise matrix-vector multiplication using a single MapReduce job. The current algorithms and implementations need two MapReduce jobs for blockwise multiplication (or any sparse mat-vec mult where the vector isn't stored in memory). I will be using it to implement PageRank. I'll benchmark my implementation against the state-of-the-art in MapReduce-based PageRank - Pegasus (they've contributed the Pegasus code to https://github.com/intel-hadoop/HiBench).

If my version turns out to be faster, I'll be writing the code for algorithms like SVD and Lanczos algorithm (http://en.wikipedia.org/wiki/Lanczos_algorithm) too.

Do you think these can make for a useful contribution to Mahout? I need to keep that in mind before I go forward with coding.

> Pagerank implementation in Map/Reduce
> -------------------------------------
>
>                 Key: MAHOUT-742
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-742
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Graph
>    Affects Versions: 0.6
>            Reporter: Christoph Nagel
>            Assignee: Sebastian Schelter
>             Fix For: 0.6
>
>         Attachments: MAHOUT-742.patch
>
>
> Hi,
> my name is Christoph Nagel. I'm student on technical university Berlin and participating on the course of Isabel Drost and Sebastian Schelter.
> My work is to implement the pagerank-algorithm, where the pagerank-vector fits in memory.
> For the computation I used the naive algorithm shown in the book 'Mining of Massive Datasets' from Rajaraman & Ullman (http://www-scf.usc.edu/~csci572/2012Spring/UllmanMiningMassiveDataSets.pdf).
> Matrix- and vector-multiplication are done with mahout methods.
> Most work is the transformation the input graph, which has to consists of a nodes- and edges file.
> Format of nodes file: <node>\n
> Format of edges file: <startNode>\t<endNode>\n
> Therefore I created the following classes:
> * LineIndexer: assigns each line an index
> * EdgesToIndex: indexes the nodes of the edges
> * EdgesIndexToTransitionMatrix: creates the transition matrix
> * Pagerank: computes PR from transition matrix
> * JoinNodesWithPagerank: creates the joined output
> * PagerankExampleJob: does the complete job
> Each class has a test (not PagerankExampleJob) and I took the example of the book for evaluating.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Re: [jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce

Posted by Sebastian Schelter <ss...@apache.org>.

Hi Nilesh,

We've had an implementation of PageRank, which was removed a few 
releases ago. I don't think PageRank would be a valuable contribution, 
because I don't see a MapReduce based implementation being able to 
compete with systems such as Apache Giraph that also run on a standard 
Hadoop cluster.

If you wanted to work on a block-based version of our matrix 
multiplication code (that is used in some algorithms such as our 
existing Lanczos implementation afaik) that would be a very valuable 
contribution, however.

@list Any other opinions on that?

Best,
Sebastian

On 01/28/2014 06:07 PM, Nilesh Chakraborty (JIRA) wrote:
>
>      [ https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884332#comment-13884332 ]
>
> Nilesh Chakraborty commented on MAHOUT-742:
> -------------------------------------------
>
> Hey [~ssc], I've been working on an implementation of large-scale blockwise matrix-vector multiplication using a single MapReduce job. The current algorithms and implementations need two MapReduce jobs for blockwise multiplication (or any sparse mat-vec mult where the vector isn't stored in memory). I will be using it to implement PageRank. I'll benchmark my implementation against the state-of-the-art in MapReduce-based PageRank - Pegasus (they've contributed the Pegasus code to https://github.com/intel-hadoop/HiBench).
>
> If my version turns out to be faster, I'll be writing the code for algorithms like SVD and Lanczos algorithm (http://en.wikipedia.org/wiki/Lanczos_algorithm) too.
>
> Do you think these can make for a useful contribution to Mahout? I need to keep that in mind before I go forward with coding.
>
>> Pagerank implementation in Map/Reduce
>> -------------------------------------
>>
>>                  Key: MAHOUT-742
>>                  URL: https://issues.apache.org/jira/browse/MAHOUT-742
>>              Project: Mahout
>>           Issue Type: New Feature
>>           Components: Graph
>>     Affects Versions: 0.6
>>             Reporter: Christoph Nagel
>>             Assignee: Sebastian Schelter
>>              Fix For: 0.6
>>
>>          Attachments: MAHOUT-742.patch
>>
>>
>> Hi,
>> my name is Christoph Nagel. I'm student on technical university Berlin and participating on the course of Isabel Drost and Sebastian Schelter.
>> My work is to implement the pagerank-algorithm, where the pagerank-vector fits in memory.
>> For the computation I used the naive algorithm shown in the book 'Mining of Massive Datasets' from Rajaraman & Ullman (http://www-scf.usc.edu/~csci572/2012Spring/UllmanMiningMassiveDataSets.pdf).
>> Matrix- and vector-multiplication are done with mahout methods.
>> Most work is the transformation the input graph, which has to consists of a nodes- and edges file.
>> Format of nodes file: <node>\n
>> Format of edges file: <startNode>\t<endNode>\n
>> Therefore I created the following classes:
>> * LineIndexer: assigns each line an index
>> * EdgesToIndex: indexes the nodes of the edges
>> * EdgesIndexToTransitionMatrix: creates the transition matrix
>> * Pagerank: computes PR from transition matrix
>> * JoinNodesWithPagerank: creates the joined output
>> * PagerankExampleJob: does the complete job
>> Each class has a test (not PagerankExampleJob) and I took the example of the book for evaluating.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.1.5#6160)
>