Posted to user@hama.apache.org by Leonidas Fegaras <fe...@cse.uta.edu> on 2012/10/11 17:15:40 UTC

Implementing Hadoop map-reduce on Hama

I have seen some emails in this mailing list asking questions such as: "I
have an X algorithm running on Hadoop map-reduce. Is it suitable for
Hama?" I think it would be great if we had a good implementation of the
Hadoop map-reduce classes on Hama. Other distributed main-memory systems,
such as M3R (http://vldb.org/pvldb/vol5/p1736_avrahamshinnar_vldb2012.pdf)
and Spark, have already done so.
It is actually easier than you might think. I have done something similar
for my query system, MRQL. What we need is to reimplement
org.apache.hadoop.mapreduce.Job to execute one superstep for each
map-reduce job. Then a Hadoop map-reduce program that contains complex
workflows and/or loops of map-reduce jobs would need only minor changes to
run on Hama as a single BSPJob. Obviously, to implement map-reduce in
Hama, the mapper output can be shuffled to the reducers by key, sending
each pair as a message to the peer selected by hashing the key (in Java,
key.hashCode()):

peer.getPeerName(key.hashCode() % peer.getNumPeers())
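
A minimal sketch of this map-side shuffle, assuming Hama's BSPPeer API
and a hypothetical KeyValueMessage Writable that wraps one (key, value)
pair:

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Map superstep: run the mapper over this peer's split, then route every
// output pair to the peer that owns the key's hash bucket.
void shuffleMapOutput(BSPPeer<?, ?, ?, ?, KeyValueMessage> peer,
                      Iterable<Map.Entry<Text, IntWritable>> mapOutput)
    throws IOException, SyncException, InterruptedException {
  for (Map.Entry<Text, IntWritable> pair : mapOutput) {
    // the mask keeps the bucket index non-negative for negative hash codes
    int bucket = (pair.getKey().hashCode() & Integer.MAX_VALUE)
                 % peer.getNumPeers();
    peer.send(peer.getPeerName(bucket),
              new KeyValueMessage(pair.getKey(), pair.getValue()));
  }
  peer.sync(); // barrier: every pair is delivered before the reduce superstep
}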
Then the reducer superstep groups the data by key in memory and applies
the reducer method. To handle input and intermediate data, we can use a
mapping from path_name to (count, vector) at each node: path_name is the
path of some input or intermediate HDFS file, vector contains the
partition of that file assigned to the node, and count is the maximum
number of times the vector can be scanned (after count scans, the vector
is garbage-collected). The special case count=1 can be implemented using a
stream (a Java inner class that implements a stream Iterator). Given that
a map-reduce Job output is rarely accessed more than once, translating
most map-reduce jobs to Hama will not require keeping any data in memory
beyond what the map-reduce jobs themselves use. One exception is graph
data, which needs to persist in memory across all jobs (then count =
Integer.MAX_VALUE).
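
A minimal sketch of that per-node registry (all names here are
illustrative, not an existing API):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Maps an HDFS path name to this node's partition of the file ("vector")
// plus the number of scans it has left ("count").
class PartitionCache<T> {
  private final Map<String, Entry<T>> cache = new HashMap<String, Entry<T>>();

  private static final class Entry<T> {
    final List<T> vector; // this node's data partition of the file
    int count;            // scans remaining before the vector is dropped
    Entry(List<T> vector, int count) { this.vector = vector; this.count = count; }
  }

  void put(String pathName, List<T> vector, int count) {
    cache.put(pathName, new Entry<T>(vector, count));
  }

  // Each scan decrements count; at zero the vector becomes unreachable
  // and is garbage-collected. count = Integer.MAX_VALUE pins graph data
  // in memory across all jobs.
  Iterable<T> scan(String pathName) {
    Entry<T> e = cache.get(pathName);
    if (e == null) throw new NoSuchElementException(pathName);
    if (--e.count == 0) cache.remove(pathName);
    return e.vector;
  }
}

The count=1 case could equally be the stream Iterator mentioned above, so
single-scan data is never materialized at all.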
Based on my experience with MRQL, the implementation of these ideas
may need up to 1K lines of Java code. Let me know if you are interested.
Leonidas Fegaras


Re: Implementing Hadoop map-reduce on Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.
Hot! I want to discuss only the in-memory part; I have heard a lot about
in-memory approaches.

In fact, the data-loading phase will be duplicated in every BSP job that
requires in-memory processing. I don't like the idea of implementing
MapReduce on top of BSP, but I think we can consider a novel model.

On Fri, Oct 12, 2012 at 4:03 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> OK. Since this is already work in progress by Apurv and it's not a high
> priority for the Hama team, I will not pursue it any further.
> Leonidas
>
> On Oct 11, 2012, at 12:57 PM, Thomas Jungblut wrote:
>
>> [...]



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Implementing Hadoop map-reduce on Hama

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.
OK. Since this is already work in progress by Apurv and it's not a high
priority for the Hama team, I will not pursue it any further.
Leonidas


On Oct 11, 2012, at 12:57 PM, Thomas Jungblut wrote:

> [...]


Re: Implementing Hadoop map-reduce on Hama

Posted by Thomas Jungblut <th...@gmail.com>.
Thanks, you two, for bringing up this discussion.

Personally, I have a very strong opinion on this: I think building a
MapReduce solution on top of BSP is useless. We have had nearly ten years
of development in that paradigm, and it has grown and specialized very
much. You can express MapReduce in BSP, that's totally fine. But that does
not mean that every MapReduce algorithm is automagically efficient on BSP.
There has been (and still is) a lot of development on the MapReduce
engine, and you can't keep up with that from a more abstract paradigm.

But of course there are things at which MapReduce is inefficient
(iterative jobs, grouping, no explicit output caching). Grouping,
actually, is the main part of reducing, yet it is solved inefficiently in
Hadoop: you are forced to sort, and (if I recall your paper correctly)
that is also a drawback that led you to implement MRQL on BSP, because
grouping by hash is in several cases much faster and sometimes also more
efficient. It's funny, because the original paper [1] suggests they
included sort merely as a nice feature for building an inverted index and
doing binary search on the tokens. So it's more of a nice side effect than
the real design of the system.
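
To make the contrast concrete, here is a one-pass hash-grouping sketch
(generic types are illustrative; Hadoop's reduce phase instead sorts all
map output by key before grouping):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One-pass hash grouping: expected O(n), versus O(n log n) for sort-based
// grouping, at the cost of holding all groups in memory.
final class HashGrouper {
  static <K, V> Map<K, List<V>> groupByHash(Iterable<Map.Entry<K, V>> pairs) {
    Map<K, List<V>> groups = new HashMap<K, List<V>>();
    for (Map.Entry<K, V> pair : pairs) {
      List<V> group = groups.get(pair.getKey());
      if (group == null) {
        group = new ArrayList<V>();
        groups.put(pair.getKey(), group);
      }
      group.add(pair.getValue());
    }
    return groups;
  }
}

The trade-off is that the groups must fit in memory, which is exactly the
in-memory question raised elsewhere in this thread.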

All in all, this does not mean I am not interested in providing such
functionality in Hama, but I'm sure we should invest our time more
carefully in features that bring value to the users (improving message
scalability, improving performance, providing more examples and
algorithms, giving talks and presentations) rather than coding a
half-baked solution that is easily outperformed by normal MapReduce. It
was never my intention to "kill" Hadoop by developing Hama, but to improve
certain use cases that cannot be done efficiently in MapReduce. So if it's
just 1K lines and not a half-baked solution, feel free to contribute your
stuff.

[1] http://research.google.com/archive/mapreduce.html

Re: Implementing Hadoop map-reduce on Hama

Posted by Apurv Verma <da...@gmail.com>.
Yup, I get your point; for repetitive MR jobs it would be better. On a
side note, I wanted to ask: would a BSP implementation of PageRank be
faster than an MR implementation? Please correct me if I am wrong, but I
think of it this way: since PageRank is a graph problem and the nodes need
to communicate messages, BSP should be a better model for implementing
PageRank than MR. I picture something like the sketch below.
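
A toy sketch of that intuition, assuming one graph vertex per BSP peer so
that a bare DoubleWritable message suffices (the neighbors field is an
assumed, preloaded input, and Hama's real graph package works
differently):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Toy PageRank: one vertex per peer, a fixed 30 iterations, damping 0.85.
public class ToyPageRank extends
    BSP<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> {

  // peer names of this vertex's out-neighbors (assumed loaded elsewhere)
  private String[] neighbors;

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {
    double rank = 1.0 / peer.getNumPeers();    // uniform initial rank
    for (int i = 0; i < 30; i++) {
      for (String n : neighbors) {             // scatter rank mass
        peer.send(n, new DoubleWritable(rank / neighbors.length));
      }
      peer.sync();                             // one superstep per iteration
      double sum = 0.0;
      DoubleWritable msg;
      while ((msg = peer.getCurrentMessage()) != null) {
        sum += msg.get();                      // gather contributions
      }
      rank = 0.15 / peer.getNumPeers() + 0.85 * sum;
    }
    peer.write(new Text(peer.getPeerName()), new DoubleWritable(rank));
  }
}

In MR, each of those 30 iterations would be a separate job with an HDFS
round trip; here every iteration is just one superstep over in-memory
state.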

But I get your point: for complex workflows and iterative MR jobs, the
BSP implementation with streaming would be better. I had not thought about
streaming in my implementation, so I will now need to include a hook for
input coming directly from a previous job.

Please feel free to go ahead with your idea, and let me know if I can help
in any way.

--
Regards,
Apurv Verma





On Thu, Oct 11, 2012 at 10:07 PM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:

> [...]

Re: Implementing Hadoop map-reduce on Hama

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.
Hi Apurv,
Allow me to disagree. For complex workflows and repetitive map-reduce
jobs, such as PageRank, a Hama implementation of map-reduce would be far
superior. I haven't looked at your code yet, but I hope it does not use
HDFS or memory for the I/O of each map-reduce job: it must use streaming
to pass results across map-reduce jobs. If you are currently doing this,
there is no point for me to step in. But I really think that this would be
very useful in practice, not just for demonstration.
Best regards,
Leonidas Fegaras

On Oct 11, 2012, at 11:04 AM, Apurv Verma wrote:

> [...]


Re: Implementing Hadoop map-reduce on Hama

Posted by Apurv Verma <da...@gmail.com>.
Hey Leonidas,
Glad that you are thinking about it. IMO it would be good to have such
functionality for demonstration purposes, but only for demonstration
purposes: it's best to let Hadoop do what it's best at and Hama do what
it's best at. ;) Suraj and I tried our hands at this in the past; please
see [0] and [1]. I was also working on such a module in my GitHub account,
but I couldn't find much time. Basically, it's mostly the way you
expressed it in your last mail. Do you have a GitHub account? I have
already done some work on GitHub; I could share it with you and divide the
work, if you want.

[0]
http://code.google.com/p/anahad/source/browse/trunk/src/main/java/org/anahata/bsp/WordCount.java
     My POC of the WordCount example on Hama. Super dirty and not generic,
but it works.

[1] https://github.com/ssmenon/hama
Suraj's in-memory implementation.


Let me know what you think.

--
Regards,
Apurv Verma





On Thu, Oct 11, 2012 at 8:45 PM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:

> [...]