Posted to dev@nutch.apache.org by Apurv Verma <da...@gmail.com> on 2012/03/24 13:55:43 UTC

GSoC2012 Idea: Integrating Nutch With Hama

Hi,
 Would the Nutch community be interested in integrating Nutch with Hama?
Apache Hama is a Bulk Synchronous Parallel (BSP) computing framework built on
top of HDFS and well suited to graph algorithms.
Nutch currently runs on the MapReduce paradigm. If the community is
interested, I would like to take this up as a GSoC project.

--
thanks and regards,

Apurv Verma
B. Tech.(CSE)
IIT- Ropar

Re: GSoC2012 Idea: Integrating Nutch With Hama

Posted by Apurv Verma <da...@gmail.com>.
Hi,
 Here is what I think; please correct me if I am wrong.


   1. Since Nutch is at its core a web crawler, it must perform some form of
   breadth-first search (BFS) over the link graph. In local mode a simple BFS
   would do, but in deploy mode we need a distributed version of it.
   In the current version of Nutch, I assume this is implemented as a
   MapReduce program. My suggestion is to implement it as a BSP program using
   Hama.

   Advantages:
   BSP is a model naturally suited to graph algorithms; please see [0] and
   [1]. IMO we should see a performance improvement with Hama.
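
To make the idea concrete, here is a minimal sketch of BFS written in a BSP
style, simulated in plain Python. It is only an illustration under stated
assumptions: it does not use Hama's API and is not how Nutch's crawl is
actually written. The names `bsp_bfs`, `links`, and `owner` are hypothetical,
the link graph is an in-memory dict instead of fetched pages, and the workers
run one after another inside a loop rather than in parallel.

```python
from collections import defaultdict


def bsp_bfs(links, seeds, num_workers=2):
    """Simulate BFS over a link graph as a sequence of BSP supersteps.

    links: dict mapping a URL to the list of URLs it links to (toy stand-in
           for fetched and parsed pages).
    seeds: iterable of start URLs.
    Returns a dict mapping each reached URL to the superstep (BFS depth)
    at which it was first visited.
    """
    # Toy deterministic partitioner: decides which worker owns a URL.
    def owner(url):
        return sum(map(ord, url)) % num_workers

    # In a real BSP job each worker would hold only its own partition of
    # this map; a single dict is equivalent here because a URL is only
    # ever handled by its owner.
    depth = {}

    # Messages waiting to be delivered to each worker.
    inbox = defaultdict(list)
    for url in seeds:
        inbox[owner(url)].append(url)

    superstep = 0
    while any(inbox.values()):
        outbox = defaultdict(list)
        # Local computation phase (conceptually parallel across workers).
        for worker in range(num_workers):
            for url in inbox[worker]:
                if url in depth:
                    continue  # already visited in an earlier superstep
                depth[url] = superstep
                # Communication phase: send each outlink to its owner.
                for target in links.get(url, ()):
                    outbox[owner(target)].append(target)
        # Barrier synchronization: messages become visible only in the
        # next superstep.
        inbox = outbox
        superstep += 1
    return depth


if __name__ == "__main__":
    toy_links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
    print(bsp_bfs(toy_links, ["a"]))
```

The point of the sketch is that one BFS level maps onto one superstep, with
the frontier passed along as messages between barriers. In MapReduce, each
level would typically be a separate job.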



[0]
http://www.slideshare.net/chodakowski/processing-graphrelational-data-with-mapreduce-and-bulk-synchronous-parallel
[1]
http://www.slideshare.net/udanax/apache-hama-an-introduction-tobulk-synchronization-parallel-on-hadoop-2699426

--
thanks and regards,

Apurv Verma
B. Tech.(CSE)
IIT- Ropar

On Sun, Mar 25, 2012 at 2:50 AM, Mathijs Homminga <
mathijs.homminga@kalooga.com> wrote:

> This is interesting. Can you elaborate a bit more on this? In what way do
> you think Nutch could benefit from an implementation in Hama?
>
> Mathijs Homminga

Re: GSoC2012 Idea: Integrating Nutch With Hama

Posted by Mathijs Homminga <ma...@kalooga.com>.
This is interesting. Can you elaborate a bit more on this? In what way do you think Nutch could benefit from an implementation in Hama?

Mathijs Homminga
