Posted to mapreduce-user@hadoop.apache.org by Arko Provo Mukherjee <ar...@gmail.com> on 2011/06/13 23:46:53 UTC

Programming Multiple rounds of mapreduce

Hello,

I am trying to write a program where I need to write multiple rounds of map
and reduce.

The output of each round of map-reduce must be fed in as the input to the
next round.

Can anyone please guide me to any link / material that can teach me how I
can achieve this?

Thanks a lot in advance!

Thanks & regards
Arko

Re: Programming Multiple rounds of mapreduce

Posted by Sean Owen <sr...@gmail.com>.
You could have a look at the MapReduce pipelines in Apache Mahout
(http://mahout.apache.org). See for instance
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob. This shows how
most of Mahout constructs and runs a series of rounds of MapReduce to
accomplish a task. Each job feeds into one or more of the later
rounds. It is at least an example of getting it done in straight
Hadoop -- though workflow systems like Oozie et al are probably the
kinds of things you want to look at now.

On Mon, Jun 13, 2011 at 10:46 PM, Arko Provo Mukherjee
<ar...@gmail.com> wrote:
> Hello,
>
> I am trying to write a program where I need to write multiple rounds of map
> and reduce.
>
> The output of the last round of map-reduce must be fed into the input of the
> next round.
>
> Can anyone please guide me to any link / material that can teach me as to
> how I can achieve this.
>
> Thanks a lot in advance!
>
> Thanks & regards
> Arko
>

Re: Programming Multiple rounds of mapreduce

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
Thanks Matt,

Arko, if you plan to use Oozie, you can have a simple coordinator job that
does this, for example (the following schedules a WF every 5 mins that
consumes the output produced by the previous run; you just have to have the
initial data)

Thxs.

Alejandro

----
<coordinator-app name="coord-1" frequency="${coord:minutes(5)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <controls>
    <concurrency>1</concurrency>
  </controls>

  <datasets>
    <dataset name="data" frequency="${coord:minutes(5)}"
             initial-instance="${start}" timezone="UTC">
      <uri-template>${nameNode}/user/${coord:user()}/examples/${dataRoot}/${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}</uri-template>
    </dataset>
  </datasets>

  <input-events>
    <data-in name="input" dataset="data">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>

  <output-events>
    <data-out name="output" dataset="data">
      <instance>${coord:current(1)}</instance>
    </data-out>
  </output-events>

  <action>
    <workflow>
      <app-path>${nameNode}/user/${coord:user()}/examples/apps/subwf-1</app-path>
      <configuration>
        <property>
          <name>jobTracker</name>
          <value>${jobTracker}</value>
        </property>
        <property>
          <name>nameNode</name>
          <value>${nameNode}</value>
        </property>
        <property>
          <name>queueName</name>
          <value>${queueName}</value>
        </property>
        <property>
          <name>examplesRoot</name>
          <value>${examplesRoot}</value>
        </property>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('input')}</value>
        </property>
        <property>
          <name>outputDir</name>
          <value>${coord:dataOut('output')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
------

On Mon, Jun 13, 2011 at 3:01 PM, GOEKE, MATTHEW (AG/1000) <
matthew.goeke@monsanto.com> wrote:

> If you know for certain that it needs to be split into multiple work units
> I would suggest looking into Oozie. Easy to install, light weight, low
> learning curve... for my purposes it's been very helpful so far. I am also
> fairly certain you can chain multiple job confs into the same run but I have
> not actually tried that therefore I can't promise it is easy or possible.
>
> http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/
>
> If you are not running CDH3u0 then you can also get the tarball and
> documentation directly here:
> https://ccp.cloudera.com/display/SUPPORT/CDH3+Downloadable+Tarballs
>
> Matt
>
> -----Original Message-----
> From: Marcos Ortiz [mailto:mlortiz@uci.cu]
> Sent: Monday, June 13, 2011 4:57 PM
> To: mapreduce-user@hadoop.apache.org
> Cc: Arko Provo Mukherjee
> Subject: Re: Programming Multiple rounds of mapreduce
>
> Well, you can define a job for each round and then, you can define the
> running workflow based in your implementation and to chain your jobs
>
> On 6/13/2011 5:46 PM, Arko Provo Mukherjee wrote:
> > Hello,
> >
> > I am trying to write a program where I need to write multiple rounds
> > of map and reduce.
> >
> > The output of the last round of map-reduce must be fed into the input
> > of the next round.
> >
> > Can anyone please guide me to any link / material that can teach me as
> > to how I can achieve this.
> >
> > Thanks a lot in advance!
> >
> > Thanks & regards
> > Arko
>
> --
> Marcos Luís Ortíz Valmaseda
>  Software Engineer (UCI)
>  http://marcosluis2186.posterous.com
>  http://twitter.com/marcosluis2186
>

RE: Programming Multiple rounds of mapreduce

Posted by "GOEKE, MATTHEW (AG/1000)" <ma...@monsanto.com>.
If you know for certain that it needs to be split into multiple work units I would suggest looking into Oozie. Easy to install, lightweight, low learning curve... for my purposes it's been very helpful so far. I am also fairly certain you can chain multiple job confs into the same run, but I have not actually tried that, so I can't promise it is easy or possible.

http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/

If you are not running CDH3u0 then you can also get the tarball and documentation directly here:
https://ccp.cloudera.com/display/SUPPORT/CDH3+Downloadable+Tarballs

Matt

-----Original Message-----
From: Marcos Ortiz [mailto:mlortiz@uci.cu] 
Sent: Monday, June 13, 2011 4:57 PM
To: mapreduce-user@hadoop.apache.org
Cc: Arko Provo Mukherjee
Subject: Re: Programming Multiple rounds of mapreduce

Well, you can define a job for each round and then define the
running workflow based on your implementation to chain your jobs.

On 6/13/2011 5:46 PM, Arko Provo Mukherjee wrote:
> Hello,
>
> I am trying to write a program where I need to write multiple rounds 
> of map and reduce.
>
> The output of the last round of map-reduce must be fed into the input 
> of the next round.
>
> Can anyone please guide me to any link / material that can teach me as 
> to how I can achieve this.
>
> Thanks a lot in advance!
>
> Thanks & regards
> Arko

-- 
Marcos Luís Ortíz Valmaseda
  Software Engineer (UCI)
  http://marcosluis2186.posterous.com
  http://twitter.com/marcosluis2186
   


Re: Programming Multiple rounds of mapreduce

Posted by Marcos Ortiz <ml...@uci.cu>.
Well, you can define a job for each round and then define the
running workflow based on your implementation to chain your jobs.
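
For instance, one way to express that kind of workflow in plain Hadoop is
JobControl/ControlledJob from org.apache.hadoop.mapreduce.lib.jobcontrol.
A rough sketch (the two Job objects are assumed to be fully configured
elsewhere, with job2's input path pointing at job1's output path):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class ChainedRounds {

  // job1 and job2 are assumed to be fully configured Job objects,
  // with job2's input path set to job1's output path.
  public static void runChained(Job job1, Job job2) throws Exception {
    ControlledJob round1 = new ControlledJob(job1.getConfiguration());
    round1.setJob(job1);
    ControlledJob round2 = new ControlledJob(job2.getConfiguration());
    round2.setJob(job2);
    round2.addDependingJob(round1);  // round 2 starts only after round 1 succeeds

    JobControl control = new JobControl("multi-round");
    control.addJob(round1);
    control.addJob(round2);

    Thread runner = new Thread(control);  // JobControl implements Runnable
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);                 // poll until both rounds are done
    }
    control.stop();
  }
}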

On 6/13/2011 5:46 PM, Arko Provo Mukherjee wrote:
> Hello,
>
> I am trying to write a program where I need to write multiple rounds 
> of map and reduce.
>
> The output of the last round of map-reduce must be fed into the input 
> of the next round.
>
> Can anyone please guide me to any link / material that can teach me as 
> to how I can achieve this.
>
> Thanks a lot in advance!
>
> Thanks & regards
> Arko

-- 
Marcos Luís Ortíz Valmaseda
  Software Engineer (UCI)
  http://marcosluis2186.posterous.com
  http://twitter.com/marcosluis2186
   


Re: Programming Multiple rounds of mapreduce

Posted by Bibek Paudel <et...@gmail.com>.
Hi,

On Mon, Jun 13, 2011 at 11:46 PM, Arko Provo Mukherjee
<ar...@gmail.com> wrote:
> Hello,
>
> I am trying to write a program where I need to write multiple rounds of map
> and reduce.
>
> The output of the last round of map-reduce must be fed into the input of the
> next round.
>
> Can anyone please guide me to any link / material that can teach me as to
> how I can achieve this.
>

The way I do it is:

create job1
job1 <-- feed all the configuration parameters (including the input and
output paths) to this job
run job1

create job2
job2 <-- feed all config params (output of job1 as input, another path
as output)
run job2

....
so on.

I think this is the recommended way of running multiple rounds of MR in Hadoop.
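
A minimal sketch of that pattern with the org.apache.hadoop.mapreduce API
might look like the following (the mapper/reducer classes and the output
directories are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoRoundDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Round 1: reads the initial input, writes to an intermediate directory.
    Job job1 = new Job(conf, "round 1");
    job1.setJarByClass(TwoRoundDriver.class);
    job1.setMapperClass(FirstMapper.class);    // placeholder mapper
    job1.setReducerClass(FirstReducer.class);  // placeholder reducer
    FileInputFormat.addInputPath(job1, new Path(args[0]));
    FileOutputFormat.setOutputPath(job1, new Path(args[1] + "/round1"));
    if (!job1.waitForCompletion(true)) {
      System.exit(1);  // stop if round 1 failed
    }

    // Round 2: its input is exactly the output of round 1.
    Job job2 = new Job(conf, "round 2");
    job2.setJarByClass(TwoRoundDriver.class);
    job2.setMapperClass(SecondMapper.class);    // placeholder mapper
    job2.setReducerClass(SecondReducer.class);  // placeholder reducer
    FileInputFormat.addInputPath(job2, new Path(args[1] + "/round1"));
    FileOutputFormat.setOutputPath(job2, new Path(args[1] + "/final"));
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
  }
}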

-b

Re: Programming Multiple rounds of mapreduce

Posted by Moustafa Gaber <mo...@gmail.com>.
Actually, HaLoop is a new framework on top of Hadoop which targets the problem
of transitive closure algorithms. This type of algorithm consists of rounds of
Hadoop jobs, so I think it may contain some useful examples for you.
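
If you end up hand-rolling such an iterative computation in plain Hadoop
instead, the usual pattern is a driver loop that keeps feeding each round's
output into the next round until a counter reports that nothing new was
produced. A rough sketch (the mapper/reducer classes and the counter names
are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);  // initial data

    for (int round = 0; ; round++) {
      Path output = new Path(args[1] + "/round-" + round);

      Job job = new Job(conf, "round " + round);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(ClosureMapper.class);    // placeholder mapper
      job.setReducerClass(ClosureReducer.class);  // placeholder reducer
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);

      if (!job.waitForCompletion(true)) {
        System.exit(1);  // stop if this round failed
      }

      // The reducer is assumed to increment this counter whenever it emits
      // a fact that was not already present in its input.
      long newFacts = job.getCounters()
          .findCounter("closure", "NEW_FACTS").getValue();
      if (newFacts == 0) {
        break;           // fixpoint reached, stop iterating
      }
      input = output;    // this round's output feeds the next round
    }
  }
}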

On Mon, Jun 13, 2011 at 6:39 PM, Arko Provo Mukherjee <
arkoprovomukherjee@gmail.com> wrote:

> Hello,
>
> Thanks everyone for your responses.
>
> I am new to Hadoop, so this was a lot of new information for me. I will
> surely go though all of these.
>
> However, I was actually hoping that someone could point me to some example
> codes where multiple rounds of map-reduce has been used.
>
> Please let me know if anyone has any such examples as they are the best way
> to learn for me :-)
>
> Thanks much!
> Cheers
> Arko
>
>
>
>
> On Mon, Jun 13, 2011 at 5:30 PM, Moustafa Gaber <mo...@gmail.com>wrote:
>
>> I think HaLoop is a framework which can answer your question:
>> http://code.google.com/p/haloop/
>>
>>
>> On Mon, Jun 13, 2011 at 5:46 PM, Arko Provo Mukherjee <
>> arkoprovomukherjee@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am trying to write a program where I need to write multiple rounds of
>>> map and reduce.
>>>
>>> The output of the last round of map-reduce must be fed into the input of
>>> the next round.
>>>
>>> Can anyone please guide me to any link / material that can teach me as to
>>> how I can achieve this.
>>>
>>> Thanks a lot in advance!
>>>
>>> Thanks & regards
>>> Arko
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Mostafa Ead
>>
>>
>


-- 
Best Regards,
Mostafa Ead

Re: Programming Multiple rounds of mapreduce

Posted by Arko Provo Mukherjee <ar...@gmail.com>.
Hello,

Thanks everyone for your responses.

I am new to Hadoop, so this was a lot of new information for me. I will
surely go through all of these.

However, I was actually hoping that someone could point me to some example
code where multiple rounds of map-reduce have been used.

Please let me know if anyone has any such examples as they are the best way
to learn for me :-)

Thanks much!
Cheers
Arko



On Mon, Jun 13, 2011 at 5:30 PM, Moustafa Gaber <mo...@gmail.com>wrote:

> I think HaLoop is a framework which can answer your question:
> http://code.google.com/p/haloop/
>
>
> On Mon, Jun 13, 2011 at 5:46 PM, Arko Provo Mukherjee <
> arkoprovomukherjee@gmail.com> wrote:
>
>> Hello,
>>
>> I am trying to write a program where I need to write multiple rounds of
>> map and reduce.
>>
>> The output of the last round of map-reduce must be fed into the input of
>> the next round.
>>
>> Can anyone please guide me to any link / material that can teach me as to
>> how I can achieve this.
>>
>> Thanks a lot in advance!
>>
>> Thanks & regards
>> Arko
>>
>
>
>
> --
> Best Regards,
> Mostafa Ead
>
>

Re: Programming Multiple rounds of mapreduce

Posted by Moustafa Gaber <mo...@gmail.com>.
I think HaLoop is a framework which can answer your question:
http://code.google.com/p/haloop/

On Mon, Jun 13, 2011 at 5:46 PM, Arko Provo Mukherjee <
arkoprovomukherjee@gmail.com> wrote:

> Hello,
>
> I am trying to write a program where I need to write multiple rounds of map
> and reduce.
>
> The output of the last round of map-reduce must be fed into the input of
> the next round.
>
> Can anyone please guide me to any link / material that can teach me as to
> how I can achieve this.
>
> Thanks a lot in advance!
>
> Thanks & regards
> Arko
>



-- 
Best Regards,
Mostafa Ead