Posted to common-dev@hadoop.apache.org by "@dataElGrande" <ma...@gmail.com> on 2012/05/15 20:01:58 UTC

Re: Hadoop - Distributed sorting

Check out Pentaho's how-tos for working with Hadoop, NoSQL, or anything else
big-data related: http://wiki.pentaho.com/display/BAD/How+To%27s


madhu_sushmi wrote:
> 
> Hi,
> I need to implement distributed sorting using Hadoop. I am quite new to
> Hadoop and I am getting confused. If I want to implement merge sort, what
> should my map and reduce be doing? Should all the sorting happen on the
> reduce side?
> 
> Please help. This is an urgent requirement. Please guide me.
> 
> Thanks,
> Madhu
> 

-- 
View this message in context: http://old.nabble.com/Hadoop---Distributed-sorting-tp32876784p33849704.html
Sent from the Hadoop core-dev mailing list archive at Nabble.com.

Re: Hadoop - Distributed sorting

Posted by samir das mohapatra <sa...@gmail.com>.

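Hi,
  Steps to do this:
 1) Map: emits each number as the key (with a count of 1 as the value).
 2) Combiner: aggregates the counts locally, over one map task's chunk of
    the dataset.
 3) Reducer: aggregates the counts globally, over all chunks
    --------------> output is sorted, because Hadoop sorts map output by
    key before it reaches the reducer.

Note: set the combiner and the reducer to the same class. With more than
one reducer, each output file is sorted on its own; use a single reducer
(or a total-order partitioner) if you need one globally sorted result.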
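
The steps above can be sketched in plain Java, assuming the counting
approach from the example in this message. This is only a simulation of
the three phases on in-memory arrays: it uses no Hadoop classes, and the
names CountingSortSketch, mapAndCombine, reduce, expand and sortChunks are
made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of map -> combine -> reduce; not Hadoop API code.
class CountingSortSketch {

    // "Map" + "Combiner" for one chunk: emit (number, 1) and sum counts locally.
    static Map<Integer, Integer> mapAndCombine(int[] chunk) {
        Map<Integer, Integer> local = new TreeMap<>();
        for (int n : chunk) {
            local.merge(n, 1, Integer::sum);
        }
        return local;
    }

    // "Reducer": sum the partial counts of all chunks; TreeMap keeps keys ordered.
    static TreeMap<Integer, Integer> reduce(List<Map<Integer, Integer>> partials) {
        TreeMap<Integer, Integer> total = new TreeMap<>();
        for (Map<Integer, Integer> partial : partials) {
            for (Map.Entry<Integer, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    // Expand the <Integer, Count> pairs back into the fully sorted sequence.
    static List<Integer> expand(TreeMap<Integer, Integer> counts) {
        List<Integer> sorted = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                sorted.add(e.getKey());
            }
        }
        return sorted;
    }

    // Run all three phases over the given chunks (one chunk per input file).
    static List<Integer> sortChunks(int[][] chunks) {
        List<Map<Integer, Integer>> partials = new ArrayList<>();
        for (int[] chunk : chunks) {
            partials.add(mapAndCombine(chunk));
        }
        return expand(reduce(partials));
    }

    public static void main(String[] args) {
        int[][] chunks = { {150, 101, 150}, {200, 100}, {101, 150} };
        System.out.println(sortChunks(chunks)); // [100, 101, 101, 150, 150, 150, 200]
    }
}
```

In a real job the framework does the shuffle and the sort by key between
the map and reduce phases, so the reducer would only need to sum the
counts it receives.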

Example:
  Let us assume that our data set (integers) is constrained to the range
100 to 200, and that we have 5 files, each containing 1000 random integers
in that range (so a total of 5000 integers). We read each file into a Map,
and then in the Reduce phase we produce a final Map which contains the
count of every integer. Now if we sort the keys of the final Map and output
them into a list data structure in the form of <Integer, Count>, then we
have sorted all the data (see figure below).

Aside: In Java, you don’t even have to come up with the data structure that
I am talking about. If you just use a TreeMap
<http://java.sun.com/javase/6/docs/api/index.html?java/util/TreeMap.html>
in the final Reduce phase, then all the keys (i.e. the data) are already
sorted, as long as the key type (e.g. String, Integer, etc.) implements the
Comparable
<http://java.sun.com/javase/6/docs/api/index.html?java/lang/Comparable.html>
interface. (Hadoop <http://hadoop.apache.org/> has something similar called
WritableComparable
<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/WritableComparable.html>,
and I am using a TreeMap that takes Strings as keys in Reducer
<http://code.google.com/p/dalalstreet/source/browse/trunk/MapReduce/src/org/karticks/mapreduce/Reducer.java>.)
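
One thing worth checking when picking the TreeMap key type is which
ordering its Comparable implementation gives. The sketch below is only an
illustration (the class and method names are hypothetical and it is not
taken from the linked Reducer): String keys sort lexicographically, which
matches numeric order for the three-digit values 100 to 200 used in the
example, but not for numbers with differing digit counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: compares the key order of String and Integer TreeMaps.
class TreeMapKeyOrderSketch {

    // Count numbers under String keys; iteration follows String.compareTo.
    static List<String> sortedStringKeys(int[] numbers) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (int n : numbers) {
            counts.merge(Integer.toString(n), 1, Integer::sum);
        }
        return new ArrayList<>(counts.keySet());
    }

    // Count numbers under Integer keys; iteration follows numeric order.
    static List<Integer> sortedIntegerKeys(int[] numbers) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int n : numbers) {
            counts.merge(n, 1, Integer::sum);
        }
        return new ArrayList<>(counts.keySet());
    }

    public static void main(String[] args) {
        int[] inRange = {200, 100, 150}; // all three digits, as in the example
        int[] wider = {200, 9, 1000};    // differing digit counts

        System.out.println(sortedStringKeys(inRange));  // [100, 150, 200]
        System.out.println(sortedIntegerKeys(inRange)); // [100, 150, 200]
        System.out.println(sortedStringKeys(wider));    // [1000, 200, 9]
        System.out.println(sortedIntegerKeys(wider));   // [9, 200, 1000]
    }
}
```

So String keys are fine for the 100 to 200 data set, while a numeric key
type would be the safer choice for arbitrary integers.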

Thanks
   Samir