Posted to common-user@hadoop.apache.org by Matthew John <tm...@gmail.com> on 2011/01/25 12:46:26 UTC

Map->Reduce->Reduce

Hi all,


I am working on a MapReduce program that processes BytesWritable data.
Currently I run two MapReduce jobs consecutively to get the final
output:

Input  ----(MapReduce1)---> Intermediate ----(MapReduce2)---> Output

Here I am running MapReduce2 only to sort the intermediate data using a
key comparator.

I want to cut the number of MapReduce jobs down to one, and I have worked
out the logic to do that. The only problem is that my logic needs a sort
on the Reduce output to produce the final output. The flow looks like
this:

Input ----(MapReduce1)----> Output (not sorted)

I want to know if it's possible to attach one more Reduce stage to the
dataflow, so that the framework's inherent sort runs before the second
reduce call. It would look like:

Input --(Map)---> MapOutput ---(Reduce1)-->Output (not sorted) ---(Reduce2 -
for which Reduce 1 acts as a Mapper)---> Output

Please let me know if there is some way to sort the output without
invoking a separate MapReduce job just for the sake of sorting it.

Thanks,
Matthew

Re: Command Line Arguments for Client

Posted by Harsh J <qw...@gmail.com>.

Hey,

On Wed, Feb 23, 2011 at 6:22 AM, C.V.Krishnakumar Iyer
<f2...@gmail.com> wrote:
> Hi,
>
> Could anyone tell me how to set JVM command-line arguments (like -Xmx and -Xms) for the client (not for the map/reduce tasks) from the command that is usually used to launch the job?

You can set the HADOOP_CLIENT_OPTS environment variable to apply
additional JVM options to client-side commands only (fs, jar, etc.).
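
As a rough way to confirm that the options actually reached the client
JVM, a tiny class like the one below could be launched through the same
"hadoop jar" command. This is only a sketch: the class name HeapCheck, the
jar name, and the heap sizes in the comments are placeholders, not
anything shipped with Hadoop.

```java
// Hypothetical helper, not part of Hadoop: reports the heap limits of the
// JVM it runs in.
//
// Assumed usage (jar name and heap sizes are placeholders):
//   export HADOOP_CLIENT_OPTS="-Xmx512m -Xms256m"
//   hadoop jar heapcheck.jar HeapCheck
class HeapCheck {
    // Maximum heap the JVM will attempt to use (reflects -Xmx), in MiB.
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Heap currently reserved by the JVM (starts near -Xms), in MiB.
    public static long committedHeapMiB() {
        return Runtime.getRuntime().totalMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB): " + maxHeapMiB());
        System.out.println("committed heap (MiB): " + committedHeapMiB());
    }
}
```

The reported values are usually a little below the requested sizes,
because the JVM does not count every memory region toward these figures.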

-- 
Harsh J
www.harshj.com

Command Line Arguments for Client

Posted by "C.V.Krishnakumar Iyer" <f2...@gmail.com>.

Hi,

Could anyone tell me how to set JVM command-line arguments (like -Xmx and -Xms) for the client (not for the map/reduce tasks) from the command that is usually used to launch the job?

Thanks,
Krishnakumar

Re: Map->Reduce->Reduce

Posted by madhu phatak <ph...@gmail.com>.

The reducer receives its <Key,Value> pairs sorted by key. If you can
generate keys in the required sort order, you can do the processing in a
single MapReduce job.
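
To illustrate the idea, here is a plain-Java stand-in for the shuffle
step. It uses no Hadoop classes; the class and method names are made up
for this sketch, and a TreeMap plays the role of the framework's sort.
The point is only that whatever key the map side emits, together with the
comparator, decides the order in which a reducer sees its input.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the shuffle: no Hadoop classes are used here.
class SortedShuffleSketch {
    // Groups (key, value) pairs by key and returns them in comparator
    // order, which is how a single reducer sees its input.
    public static <K, V> TreeMap<K, List<V>> shuffle(
            List<Map.Entry<K, V>> mapOutput, Comparator<K> order) {
        TreeMap<K, List<V>> grouped = new TreeMap<>(order);
        for (Map.Entry<K, V> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The map side emits the desired sort key as the output key.
        List<Map.Entry<Integer, String>> mapOutput = new ArrayList<>();
        mapOutput.add(new AbstractMap.SimpleEntry<>(30, "c"));
        mapOutput.add(new AbstractMap.SimpleEntry<>(10, "a"));
        mapOutput.add(new AbstractMap.SimpleEntry<>(30, "d"));

        // The "reducer" then iterates over keys in sorted order.
        Comparator<Integer> order = Comparator.naturalOrder();
        shuffle(mapOutput, order).forEach(
                (key, values) -> System.out.println(key + " -> " + values));
    }
}
```

Note that this only models one reducer. With several reducers, each
reducer's output is sorted within its own partition, so a globally sorted
result also depends on how keys are partitioned.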

On Tue, Jan 25, 2011 at 6:21 PM, Harsh J <qw...@gmail.com> wrote:

> Vanilla Hadoop does not support this without the intermediate I/O
> cost. You can check out the Hadoop Online Project at
> http://code.google.com/p/hop, as that does support letting a Reducer's
> output go directly to the next job's mapper (as in, a pipeline).
>
> On the topic of pipelining, also check out what's being done in Plume
> (based on Google's FlumeJava): http://github.com/tdunning/Plume
>
> --
> Harsh J
> www.harshj.com
>

Re: Map->Reduce->Reduce

Posted by Harsh J <qw...@gmail.com>.

Vanilla Hadoop does not support this without the intermediate I/O
cost. You can check out the Hadoop Online Project at
http://code.google.com/p/hop, as that does support letting a Reducer's
output go directly to the next job's mapper (as in, a pipeline).

On the topic of pipelining, also check out what's being done in Plume
(based on Google's FlumeJava): http://github.com/tdunning/Plume


-- 
Harsh J
www.harshj.com