Posted to common-user@hadoop.apache.org by John Clarke <cl...@gmail.com> on 2009/11/10 15:13:23 UTC

Automate EC2 cluster termination

Hi,

I use EC2 to run my Hadoop jobs using Cloudera's 0.18.3 AMI. It works great
but I want to automate it a bit more.

I want to be able to:
- start the cluster
- copy data from S3 to the DFS
- run the job
- copy the result data from DFS to S3
- verify it all copied OK
- shut down the cluster.
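
Roughly, the wrapper I have in mind looks like this (untested outline; the
cluster name, buckets and jar are placeholders, and I'm assuming the
hadoop-ec2 launch/terminate commands from the Cloudera scripts I mention
below):

    #!/bin/sh
    # Untested outline -- cluster name, buckets and jar are placeholders.
    set -e

    hadoop-ec2 launch-cluster my-cluster 10

    # These steps would run on the master (e.g. via ssh):
    hadoop distcp s3n://my-bucket/input/ input/
    hadoop jar my-job.jar input/ output/
    hadoop distcp output/ s3n://my-bucket/results/

    # Crude check that everything copied: compare file counts.
    test "$(hadoop fs -ls output/ | wc -l)" = \
         "$(hadoop fs -ls s3n://my-bucket/results/ | wc -l)"

    hadoop-ec2 terminate-cluster my-cluster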


I guess the hardest part is reliably detecting when a job is complete. I've
seen solutions that provide a time-based shutdown, but they are not suitable
as our jobs vary in duration.

Has anyone made a script that does this already? I'm using the Cloudera
python scripts to start/terminate my cluster.

Thanks,
John

Re: Automate EC2 cluster termination

Posted by John Clarke <cl...@gmail.com>.
Hi Edmund,

I'll look into what you suggested. Yes, I'm aware it's possible to use S3
directly, but I had problems getting it working - I must try again.

cheers
John


Re: Automate EC2 cluster termination

Posted by Edmund Kohlwey <ek...@gmail.com>.
You should be able to detect the status of the job in your Java main()
method. Either call job.waitForCompletion() and, once the job finishes
running, check job.isSuccessful(); or, if you want to, write a custom
"watcher" thread that polls the job status manually, which lets you, for
instance, launch several jobs and wait for them all to return. Either way
you will be polling the job tracker, but I think the overhead is pretty
minimal.
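
In sketch form, something like this (untested; this is the 0.18-era
JobClient/RunningJob API, and the job name, paths and job setup are
placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class JobRunner {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JobRunner.class);
        conf.setJobName("my-job");  // placeholder
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // ... setMapperClass/setReducerClass, key/value types, etc. ...

        // submitJob() returns immediately, so several jobs could be
        // launched and watched from here; for a single job this is
        // equivalent to JobClient.runJob(conf).
        RunningJob job = new JobClient(conf).submitJob(conf);
        job.waitForCompletion();  // polls the JobTracker until done

        // The exit code tells the calling script whether it is safe to
        // copy the results out and tear the cluster down.
        System.exit(job.isSuccessful() ? 0 : 1);
      }
    }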

I'm not sure it's necessary to copy data from S3 to DFS, by the way (unless
you have a performance reason to do so... and even then, since you're not
really guaranteed much locality on EC2, you probably won't see a huge
difference). You should probably just set the default file system to S3.
See http://wiki.apache.org/hadoop/AmazonS3 .
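
For example, something along these lines in hadoop-site.xml (the bucket
and keys are placeholders; s3n:// is the "native" S3 filesystem described
on that wiki page):

    <property>
      <name>fs.default.name</name>
      <value>s3n://my-bucket</value>
    </property>
    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_SECRET_KEY</value>
    </property>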


Re: Automate EC2 cluster termination

Posted by John Clarke <cl...@gmail.com>.
I've never used Amazon Elastic MapReduce as we are trying to minimise costs,
but if I can't find a good way to solve my problem then I might reconsider.

cheers,
John



Re: Automate EC2 cluster termination

Posted by "Hitchcock, Andrew" <an...@amazon.com>.
Hi John,

Have you considered Amazon Elastic MapReduce? (Disclaimer: I work on Elastic MapReduce)

http://aws.amazon.com/elasticmapreduce/

It waits for your job to finish and then automatically shuts down the cluster. With a simple command like:

 elastic-mapreduce --create --num-instances 10 --jar s3://mybucket/my.jar --args s3://mybucket/input/,s3://mybucket/output/

It will automatically create a cluster, run your jar, and then shut everything down. Elastic MapReduce costs a little bit more than just plain EC2, but if it prevents your cluster from running longer than necessary, you might save money.

Andrew

