Posted to common-user@hadoop.apache.org by Tarandeep Singh <ta...@gmail.com> on 2008/02/04 21:34:11 UTC

More than one map-reduce tasks in one program ?

Hi,

I am working on a problem: processing log files and counting the number
of times each keyword occurs, much like the word count program that
comes with the Hadoop examples. In addition, I need to post-process the
result, for example to identify the top 10 most frequently occurring
keywords, or keywords with an increasing or decreasing trend.

One way to do this is to break the problem into 2 parts and solve each
one using a separate program (the 2nd program reads the output of the
first). Is there a way I can do this in one program?

Thanks,
Tarandeep

Re: More than one map-reduce tasks in one program ?

Posted by Billy <sa...@pearsonwholesale.com>.

Since MR jobs work mostly from local disk, I am not sure there would be
much, if any, advantage to processing both in one job.

Billy

Re: More than one map-reduce tasks in one program ?

Posted by Tim Wintle <ti...@teamrubber.com>.

Sounds like you're using Hadoop Streaming - if you are using Java or
Jython (I believe) you can easily write multiple map-reduce stages into
one application.
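
To illustrate the shape of "multiple stages in one application", here is a
minimal sketch in plain Java. It does not use the Hadoop API: the class
KeywordPipeline and its methods countKeywords and topN are hypothetical
names, and both stages run in memory on ordinary collections. The point is
only that a single driver can run the counting stage and then hand its
output straight to the post-processing stage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for two chained stages; not Hadoop code.
final class KeywordPipeline {

    // Stage 1: count how often each whitespace-separated keyword occurs.
    static Map<String, Integer> countKeywords(List<String> logLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : logLines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Stage 2: read stage 1's output and keep the n most frequent keywords.
    // Ties are broken alphabetically so the result is deterministic.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    // The "one program": stage 2 consumes stage 1's result directly.
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("error timeout error", "login error");
        for (Map.Entry<String, Integer> e : topN(countKeywords(lines), 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In a real Hadoop program the two methods would become two jobs submitted
one after the other from the same driver, with the second job's input set
to the first job's output directory.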

Unless the output data set is huge, though, I would personally dump the
statistically relevant results of your first MapReduce job into MySQL or
some other database, and then do the analysis using SQL. It should
respond very quickly unless you are doing serious analysis in the
frequency domain, in which case I would run a second MapReduce job.

Tim


Re: More than one map-reduce tasks in one program ?

Posted by Amar Kamat <am...@yahoo-inc.com>.

Tarandeep Singh wrote:
> Hi,
>
> I am working on a problem - process log files and count the number of
> times all keywords occur - a kinda word count program that comes with
> Hadoop examples. In addition to that I need to do post processing of
> the result, like identify the top 10 most frequently occurring
> keywords or keywords with increasing / decreasing trend.
>
>   
This requires filtering on the map side and also on the reduce side.
Consider the example of finding the top (most frequent) keywords. You
have to write your own sorting class which, apart from sorting the
complete data, returns the top few elements and discards the rest. What
you return from the sort is what the reducer sees, so each map emits
the top elements it has seen. Now have one reducer and apply the same
logic again, where the sorter returns the top few elements instead of
everything. It's my wild guess that this should work. Let us know if it
does not.
Amar
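
The top-few idea described above can be sketched in plain Java, without
the Hadoop API. The class TopKMerge and its methods topK and globalTopK
are hypothetical names. The sketch assumes each keyword's total count
lives in exactly one partition (for example, the partitions are the
output files of an earlier counting job); if a keyword's count were
split across partitions, local filtering could discard a keyword that
belongs in the global top k.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch of "filter at the map side, then again
// in a single reducer"; not Hadoop code.
final class TopKMerge {

    // Sort by descending count (ties broken by keyword) and keep the first k.
    static List<Map.Entry<String, Integer>> topK(
            Collection<Map.Entry<String, Integer>> entries, int k) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(entries);
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // Each partition plays the role of one map task and reports only its
    // local top k; the final call plays the role of the single reducer
    // and applies the same filter to the surviving candidates.
    static List<Map.Entry<String, Integer>> globalTopK(
            List<Map<String, Integer>> partitions, int k) {
        List<Map.Entry<String, Integer>> candidates = new ArrayList<>();
        for (Map<String, Integer> partition : partitions) {
            candidates.addAll(topK(partition.entrySet(), k));
        }
        return topK(candidates, k);
    }
}
```

With p partitions, the single reducer sees at most p * k candidates
rather than every keyword, which is what makes one reducer practical.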
> One way to do this is to break the problem into 2 parts - solve each
> one of them using a separate program (2nd program reads the output of
> first). Is there a way I can do this in one program ?
>
> Thanks,
> Tarandeep
>