Posted to common-user@hadoop.apache.org by Billy Pearson <sa...@pearsonwholesale.com> on 2008/06/14 22:31:46 UTC

Ec2 and MR Job question

I have a question someone may have answered here before, but I cannot find
the answer.

Assume I have a cluster of servers hosting a large amount of data,
and I want to run a large job where the maps take a lot of CPU power to run
and the reduces take only a small amount of CPU.
I want to run the maps on a group of EC2 servers and run the reduces on the
local cluster of 10 machines.

The problem I am seeing is the map outputs: if I run the maps on EC2, they
are stored locally on the instance.
What I am looking to do is have the map output files stored in HDFS so I can
kill the EC2 instances, since I do not need them for the reduces.

The only way I can think to do this is to run two jobs: one mapper-only job
that stores its output in HDFS, and then a second job to run the reduces
from the map outputs stored in HDFS.

Is there a way to make the mappers store their final output in HDFS?



Re: Ec2 and MR Job question

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
I understand how to run it as two jobs. My only question is:
is there a way to make the mappers store their final output in HDFS,
so I can kill the EC2 machines without waiting for the reduce stage to end?

Billy



"Chris K Wensel" <ch...@wensel.net> wrote in 
message news:645B7D70-359A-4AF8-9A0B-1EC84A9CBF82@wensel.net...
> Well, to answer your last question first: just set the number of reducers
> to zero.
>
> But you can't just run reducers without mappers (as far as I know, having
> never tried), so your local job will need to run identity mappers in
> order to feed your reducers.
> http://hadoop.apache.org/core/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
>
> ckw
>
> On Jun 14, 2008, at 1:31 PM, Billy Pearson wrote:
>
>> I have a question someone may have answered here before, but I cannot
>> find the answer.
>>
>> Assume I have a cluster of servers hosting a large amount of data,
>> and I want to run a large job where the maps take a lot of CPU power to
>> run and the reduces take only a small amount of CPU.
>> I want to run the maps on a group of EC2 servers and run the reduces on
>> the local cluster of 10 machines.
>>
>> The problem I am seeing is the map outputs: if I run the maps on EC2,
>> they are stored locally on the instance.
>> What I am looking to do is have the map output files stored in HDFS so I
>> can kill the EC2 instances, since I do not need them for the reduces.
>>
>> The only way I can think to do this is to run two jobs: one mapper-only
>> job that stores its output in HDFS, and then a second job to run the
>> reduces from the map outputs stored in HDFS.
>>
>> Is there a way to make the mappers store their final output in HDFS?
>>
>
> --
> Chris K Wensel
> chris@wensel.net
> http://chris.wensel.net/
> http://www.cascading.org/



Re: Ec2 and MR Job question

Posted by Chris K Wensel <ch...@wensel.net>.
Well, to answer your last question first: just set the number of reducers
to zero.

But you can't just run reducers without mappers (as far as I know,
having never tried), so your local job will need to run identity
mappers in order to feed your reducers.
http://hadoop.apache.org/core/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
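
For illustration, here is a rough sketch of that two-job split against the
0.17-era org.apache.hadoop.mapred API (HeavyMapper, CheapReducer, and the
paths are hypothetical stand-ins, and on 0.16 the paths are set with
JobConf.setInputPath/setOutputPath instead):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class TwoStageJob {

  // Hypothetical stand-in for the CPU-heavy map work.
  public static class HeavyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(new Text(key.toString()), value); // expensive work here
    }
  }

  // Hypothetical stand-in for the cheap reduce work.
  public static class CheapReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        out.collect(key, values.next());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Stage 1: map-only job, run while the EC2 tasktrackers are attached.
    // With zero reducers each map's output is written straight to the
    // output path in HDFS instead of to local disk on the instance.
    JobConf mapJob = new JobConf(TwoStageJob.class);
    mapJob.setJobName("cpu-heavy maps");
    mapJob.setMapperClass(HeavyMapper.class);
    mapJob.setNumReduceTasks(0);
    mapJob.setOutputKeyClass(Text.class);
    mapJob.setOutputValueClass(Text.class);
    mapJob.setOutputFormat(SequenceFileOutputFormat.class);
    FileInputFormat.setInputPaths(mapJob, new Path("/data/input"));
    FileOutputFormat.setOutputPath(mapJob, new Path("/data/map-output"));
    JobClient.runJob(mapJob);
    // ...the EC2 instances can be killed at this point...

    // Stage 2: run on the local cluster. Identity mappers just pass the
    // stage-1 records through to the real reducers.
    JobConf reduceJob = new JobConf(TwoStageJob.class);
    reduceJob.setJobName("cheap reduces");
    reduceJob.setMapperClass(IdentityMapper.class);
    reduceJob.setReducerClass(CheapReducer.class);
    reduceJob.setOutputKeyClass(Text.class);
    reduceJob.setOutputValueClass(Text.class);
    reduceJob.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(reduceJob, new Path("/data/map-output"));
    FileOutputFormat.setOutputPath(reduceJob, new Path("/data/final-output"));
    JobClient.runJob(reduceJob);
  }
}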

ckw

On Jun 14, 2008, at 1:31 PM, Billy Pearson wrote:

> I have a question someone may have answered here before, but I cannot
> find the answer.
>
> Assume I have a cluster of servers hosting a large amount of data,
> and I want to run a large job where the maps take a lot of CPU power
> to run and the reduces take only a small amount of CPU.
> I want to run the maps on a group of EC2 servers and run the reduces
> on the local cluster of 10 machines.
>
> The problem I am seeing is the map outputs: if I run the maps on EC2,
> they are stored locally on the instance.
> What I am looking to do is have the map output files stored in HDFS
> so I can kill the EC2 instances, since I do not need them for the
> reduces.
>
> The only way I can think to do this is to run two jobs: one mapper-only
> job that stores its output in HDFS, and then a second job to run the
> reduces from the map outputs stored in HDFS.
>
> Is there a way to make the mappers store their final output in HDFS?
>

--
Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/






Re: Ec2 and MR Job question

Posted by Chanchal James <ch...@gmail.com>.
Hi Billy, when I tested Hadoop on an EC2 machine, I didn't come across the
hostname problem, probably because I changed the hostname to a public FQDN.
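
For example, a startup script along those lines can pull the public DNS name
from the EC2 instance metadata service (the well-known 169.254.169.254
address) and set it as the machine's hostname before the tasktracker starts;
the Hadoop install path below is a hypothetical stand-in:

#!/bin/sh
# Set this instance's hostname to its public DNS name, fetched from the
# EC2 instance metadata service, so the tasktracker reports a name that
# reduce tasks outside EC2 can resolve.
PUBLIC_HOSTNAME=`curl -s http://169.254.169.254/latest/meta-data/public-hostname`
hostname "$PUBLIC_HOSTNAME"
# /usr/local/hadoop is a hypothetical install location.
/usr/local/hadoop/bin/hadoop-daemon.sh start tasktracker

Depending on your Hadoop version, setting the slave.host.name property in
hadoop-site.xml to the public name may achieve the same thing without
changing the machine's hostname.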


On Sat, Jun 14, 2008 at 10:09 PM, Billy Pearson <sa...@pearsonwholesale.com>
wrote:

> My second question is about the EC2 machines: has anyone solved the
> hostname problem in an automated way?
>
> For example, if I launch an EC2 server to run a tasktracker, the hostname
> reported back to my local cluster is its internal address,
> and the local reduce tasks cannot access the map files on the EC2 machine
> using the default hostname.
> I get an error:
> WARN org.apache.hadoop.mapred.ReduceTask: java.net.UnknownHostException:
> domU-12-31-39-00-A4-05.compute-1.internal
>
> <question>
> Is there an automated way to start a tasktracker on an EC2 machine with
> it using the public hostname, so the local tasks can get the maps from
> the EC2 machines? For example, something like
> bin/hadoop-daemon.sh start tasktracker host=
> ec2-xx-xx-xx-xx.z-2.compute-1.amazonaws.com
>
> that I can run to start just the tasktracker with the correct hostname.
> </question>
>
> What I am trying to do is build a custom AMI image that I can just launch
> when I need to add extra CPU power to my cluster, and automatically start
> the tasktracker via a shell script run at startup.
>
> Billy
>

Re: Ec2 and MR Job question

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
My second question is about the EC2 machines: has anyone solved the hostname
problem in an automated way?

For example, if I launch an EC2 server to run a tasktracker, the hostname
reported back to my local cluster is its internal address,
and the local reduce tasks cannot access the map files on the EC2 machine
using the default hostname.
I get an error:
WARN org.apache.hadoop.mapred.ReduceTask: java.net.UnknownHostException: 
domU-12-31-39-00-A4-05.compute-1.internal

<question>
Is there an automated way to start a tasktracker on an EC2 machine with it
using the public hostname, so the local tasks can get the maps from the EC2
machines? For example, something like
bin/hadoop-daemon.sh start tasktracker
host=ec2-xx-xx-xx-xx.z-2.compute-1.amazonaws.com

that I can run to start just the tasktracker with the correct hostname.
</question>

What I am trying to do is build a custom AMI image that I can just launch
when I need to add extra CPU power to my cluster, and automatically start
the tasktracker via a shell script run at startup.

Billy


"Billy Pearson" <sa...@pearsonwholesale.com> 
wrote in message news:g319ri$eqi$1@ger.gmane.org...
>I have a question someone may have answered here before, but I cannot find
>the answer.
>
> Assume I have a cluster of servers hosting a large amount of data,
> and I want to run a large job where the maps take a lot of CPU power to
> run and the reduces take only a small amount of CPU.
> I want to run the maps on a group of EC2 servers and run the reduces on
> the local cluster of 10 machines.
>
> The problem I am seeing is the map outputs: if I run the maps on EC2, they
> are stored locally on the instance.
> What I am looking to do is have the map output files stored in HDFS so I
> can kill the EC2 instances, since I do not need them for the reduces.
>
> The only way I can think to do this is to run two jobs: one mapper-only
> job that stores its output in HDFS, and then a second job to run the
> reduces from the map outputs stored in HDFS.
>
> Is there a way to make the mappers store their final output in HDFS?
>
>