Posted to common-user@hadoop.apache.org by tim robertson <ti...@gmail.com> on 2008/06/28 11:24:54 UTC

Hadoop on EC2 + S3 - best practice?

Hi all,
I have data in a file (150 million lines, roughly 100 GB) and several
MapReduce classes for my processing (custom index generation).

Can someone please confirm that the following is the best way to run on EC2
and S3 (both of which I am new to)?

1) load my 100 GB file into S3
2) create a class that loads the file from S3, uses it as input to the
MapReduce job (S3 not used during processing), and saves the output back to S3
3) create an AMI with Hadoop, its dependencies, and my Jar file (loading the
S3 input and the MR code) - I will base this on the public Hadoop AMI, I guess
4) run using the standard scripts (see the sketch below)
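
A rough sketch of how those four steps might look on the command line, purely
as an illustration: the bucket name, jar name, cluster name and size below are
made up, the S3 credentials are assumed to be set in hadoop-site.xml, and the
exact hadoop-ec2 commands depend on the contrib/ec2 scripts shipped with your
Hadoop version.

  # 1) upload the input through the S3 block filesystem (s3:// URL)
  bin/hadoop fs -put /local/data/input.txt s3://my-bucket/input/input.txt

  # 2+3) launch a cluster from the public Hadoop AMI and log in to the master
  bin/hadoop-ec2 launch-cluster my-cluster 10
  bin/hadoop-ec2 login my-cluster

  # 4) on the master: pull the input into HDFS, run the job, push results back
  bin/hadoop distcp s3://my-bucket/input /user/root/input
  bin/hadoop jar my-index-job.jar /user/root/input /user/root/output
  bin/hadoop distcp /user/root/output s3://my-bucket/output

  bin/hadoop-ec2 terminate-cluster my-cluster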

Is this best practice?
I assume this is pretty common... is there a better way, where I can submit
my Jar at runtime and just pass in the S3 URLs for the input and output
files?

If not, does anyone have an example that takes input from S3 and also writes
output to S3?

Thanks for any advice or suggestions on the best way to run this.

Tim

Re: Hadoop on EC2 + S3 - best practice?

Posted by tim robertson <ti...@gmail.com>.
Hi Tom,
Thanks for the reply, and after posting I found your blogs and followed your
instructions - thanks

There were a couple of gotchas:
1) My <secret> had a / in it and the escaping did not work (see the config
sketch after this list).
2) I copied to the root directory of the S3 bucket and could not manage to
get it out again using distcp, so I had to blow it away and do another copy
up.
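
One way around the slash-in-the-secret problem, assuming the standard S3
filesystem properties rather than whatever was actually configured here, is to
keep the credentials out of the s3:// URL entirely and put them in
hadoop-site.xml, so no URL escaping is needed:

  <!-- hadoop-site.xml: credentials kept out of the s3:// URL, so a secret
       key containing "/" needs no escaping (values are placeholders) -->
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>

With these set, paths can be written as plain s3://my-bucket/path URLs.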

It was nice to get it running in the end, and I blogged my experience:
http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html
(I thank you at the bottom ;o)

Thanks,

Tim




On Tue, Jul 1, 2008 at 6:35 PM, Tom White <to...@gmail.com> wrote:

> Hi Tim,
>
> The steps you outline look about right. Because your file is >5GB you
> will need to use the S3 block file system, which has an s3 URL. (See
> http://wiki.apache.org/hadoop/AmazonS3) You shouldn't have to build
> your own AMI unless you have dependencies that can't be submitted as
> part of the MapReduce job.
>
> To read and write to S3 you can just use s3 URLs. Otherwise you can
> use distcp to copy between S3 and HDFS before and after running your
> job. This article I wrote has some more tips:
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873
>
> Hope that helps,
>
> Tom
>

Re: Hadoop on EC2 + S3 - best practice?

Posted by Tom White <to...@gmail.com>.
Hi Tim,

The steps you outline look about right. Because your file is >5GB you
will need to use the S3 block file system, which has an s3 URL. (See
http://wiki.apache.org/hadoop/AmazonS3) You shouldn't have to build
your own AMI unless you have dependencies that can't be submitted as
part of the MapReduce job.
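
One common way to ship dependencies with the job instead of baking them into
an AMI (a general Hadoop convention, not something described in this thread)
is to bundle them inside the job jar under a lib/ directory; jars found there
are added to the task classpath:

  my-index-job.jar            (hypothetical job jar)
  |- META-INF/MANIFEST.MF     (Main-Class pointing at the job driver)
  |- com/example/...          (the MapReduce classes)
  |- lib/
     |- dependency-a.jar      (added to the task classpath at run time)
     |- dependency-b.jar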

To read and write to S3 you can just use s3 URLs. Otherwise you can
use distcp to copy between S3 and HDFS before and after running your
job. This article I wrote has some more tips:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873
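
For illustration only (the bucket name, paths, namenode address, and jar name
are invented), the two approaches might look like:

  # read and write S3 directly by passing s3:// URLs to the job
  bin/hadoop jar my-index-job.jar s3://my-bucket/input s3://my-bucket/output

  # or stage through HDFS with distcp before and after the job
  bin/hadoop distcp s3://my-bucket/input hdfs://namenode:9000/input
  bin/hadoop jar my-index-job.jar hdfs://namenode:9000/input hdfs://namenode:9000/output
  bin/hadoop distcp hdfs://namenode:9000/output s3://my-bucket/output

(Credentials are assumed to be set via fs.s3.awsAccessKeyId and
fs.s3.awsSecretAccessKey, or embedded in the URL as
s3://ID:SECRET@my-bucket/...)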

Hope that helps,

Tom
