Posted to common-user@hadoop.apache.org by Tarandeep Singh <ta...@gmail.com> on 2008/02/04 21:34:11 UTC
More than one map-reduce task in one program?
Hi,
I am working on a problem: process log files and count the number of
times each keyword occurs, much like the word count program that comes
with the Hadoop examples. In addition to that, I need to post-process
the result, e.g. identify the 10 most frequently occurring keywords, or
keywords with an increasing/decreasing trend.
One way to do this is to break the problem into two parts and solve each
with a separate program (the second program reads the output of the
first). Is there a way I can do this in one program?
Thanks,
Tarandeep
Re: More than one map-reduce task in one program?
Posted by Billy <sa...@pearsonwholesale.com>.
Since MR jobs work mostly from local disk, I'm not sure there would be
much, if any, advantage to processing both in one job.
Billy
Re: More than one map-reduce task in one program?
Posted by Tim Wintle <ti...@teamrubber.com>.
Sounds like you're using Hadoop Streaming. If you are using Java or
Jython (I believe), you can easily write multiple map-reduce stages in
one application.
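The two chained stages can be sketched as a toy simulation in plain Python (no Hadoop required). The function names and the in-memory "shuffle" dict are illustrative only; in Java the equivalent would be two jobs run back to back in one main(), with the second job's input path pointing at the first job's output directory.

```python
# Toy simulation of two chained map-reduce stages in one program.
# Stage 1 is the keyword count; stage 2 reads stage 1's output and
# keeps only the 10 most frequent keywords.  All names are illustrative.

def map1(line):
    for word in line.split():
        yield word, 1

def reduce1(word, counts):
    yield word, sum(counts)

def map2(word, count):
    # Route everything to a single key so one reducer sees it all.
    yield None, (count, word)

def reduce2(_, pairs):
    for count, word in sorted(pairs, reverse=True)[:10]:
        yield word, count

def run(lines):
    # An in-memory dict stands in for the shuffle between map and reduce.
    grouped = {}
    for line in lines:
        for k, v in map1(line):
            grouped.setdefault(k, []).append(v)
    stage1 = [kv for k, vs in grouped.items() for kv in reduce1(k, vs)]
    pairs = [p for k, v in stage1 for _, p in map2(k, v)]
    return list(reduce2(None, pairs))
```

The point is only that nothing stops one program from feeding the output of its first map-reduce stage into a second one.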
Unless the output data set is huge, though, I would personally dump the
statistically relevant results of your first MapReduce job into MySQL or
some other database, and then do the analysis in SQL. It should respond
very quickly unless you are doing serious analysis in the frequency
domain, in which case I would run a second MapReduce job.
Tim
Re: More than one map-reduce task in one program?
Posted by Amar Kamat <am...@yahoo-inc.com>.
This requires filtering on the map side and also on the reduce side.
Consider the example of finding the top (most frequent) keywords. You
would write your own sorting class which, apart from sorting the
complete data, returns the top few elements and discards the rest. What
you return from the sort is what the reducer sees, so each map emits
only the top elements it has seen. Then use a single reducer and apply
the same logic, where the sorter returns the top few elements instead
of everything. It's my wild guess that this should work; let us know if
it does not.
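The map-side filtering can be sketched in plain Python (the class and method names here are hypothetical; in a Java Mapper you would accumulate counts in map() and emit the local top N from close()):

```python
import heapq

class TopNMapper:
    """Emits only the N most frequent keywords this map task has seen,
    rather than every keyword, so the single reducer receives at most
    N pairs per map and applies the same top-N logic to merge them."""

    def __init__(self, n=10):
        self.n = n
        self.counts = {}

    def map(self, keyword):
        self.counts[keyword] = self.counts.get(keyword, 0) + 1

    def close(self):
        # Discard everything but this map task's local top N.
        return heapq.nlargest(self.n, self.counts.items(),
                              key=lambda kv: kv[1])
```

Note this is a heuristic: a keyword that falls just below the top N on every map could still be in the global top N yet never reach the reducer, which may be why the approach is offered as a guess rather than a guarantee.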
Amar