Posted to common-user@hadoop.apache.org by Malcolm Matalka <mm...@millennialmedia.com> on 2009/03/11 14:30:52 UTC

Persistent HDFS On EC2

If this is not the correct place to ask Hadoop + EC2 questions please
let me know.

I am trying to get a handle on how to use Hadoop on EC2 before
committing any money to it.  My question is: how do I maintain a
persistent HDFS between restarts of instances?  Most of the tutorials I
have found involve the cluster being wiped once all the instances are
shut down, but in my particular case I will be feeding the output of the
previous day's run as the input of the current day's run, and this data
will get large over time.  I see I can use S3 as the file system; would
I just create an EBS volume for each instance?  What are my options?

 

Thanks
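
For reference, Hadoop of this era can use S3 directly as the default
filesystem; a minimal sketch of the relevant entries inside the
<configuration> element of hadoop-site.xml (the bucket name and keys are
placeholders):

    <!-- use Hadoop's block-based S3 filesystem in place of HDFS -->
    <property>
      <name>fs.default.name</name>
      <value>s3://your-bucket</value>
    </property>
    <property>
      <name>fs.s3.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3.awsSecretAccessKey</name>
      <value>YOUR_SECRET_KEY</value>
    </property>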

 


Re: Persistent HDFS On EC2

Posted by Steve Loughran <st...@apache.org>.
Kris Jirapinyo wrote:
> Why would you lose the locality of storage-per-machine if one EBS volume is
> mounted to each machine instance?  When that machine goes down, you can just
> restart the instance and re-mount the exact same volume.  I've tried this
> idea before successfully on a 10 node cluster on EC2, and didn't see any
> adverse performance effects--


I was thinking more of the S3 filesystem, which is remote-ish and has
measurable write times.

> and actually amazon claims that EBS I/O should
> be even better than the instance stores.  

Assuming the transient filesystems are virtual disks (and not physical
disks that get scrubbed, formatted and mounted on every VM
instantiation), and also assuming that EBS disks are on a SAN in the
same datacentre, this is probably true. Disk I/O performance in virtual
disks is currently pretty slow, as you are navigating through more than
one filesystem and potentially seeking a lot, even on something that
appears unfragmented at the VM level.
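
A crude way to check this on a given instance is to time large
sequential writes against the instance store and an EBS mount; a sketch
(the mount points are assumptions, and dd measures only streaming
throughput, not the seek-heavy case):

    # compare raw sequential write speed: /mnt is assumed to be the
    # instance store, /ebs an attached EBS volume
    for d in /mnt /ebs; do
      echo "== $d =="
      # write 1 GB; conv=fsync flushes before dd reports its rate
      dd if=/dev/zero of=$d/ddtest bs=1M count=1024 conv=fsync 2>&1 | tail -1
      rm -f $d/ddtest
    done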




> The only concerns I see are that
> you need to pay for EBS storage regardless of whether you use that storage
> or not.  So, if you have 10 EBS volumes of 1 TB each, and you're just
> starting out with your cluster so you're using only 50GB on each EBS volume
> so far for the month, you'd still have to pay for 10TB worth of EBS volumes,
> and that could be a hefty price for each month.  Also, currently EBS needs
> to be created in the same availability zone as your instances, so you need
> to make sure that they are created correctly, as there is no direct
> migration of EBS to different availability zones.

View EBS as renting space in a SAN and it starts to make sense.


-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

RE: Persistent HDFS On EC2

Posted by Malcolm Matalka <mm...@millennialmedia.com>.
Haha, good to know I might be a guinea pig!

-----Original Message-----
From: Kris Jirapinyo [mailto:kris.jirapinyo@biz360.com] 
Sent: Wednesday, March 11, 2009 15:59
To: core-user@hadoop.apache.org
Subject: Re: Persistent HDFS On EC2

That was also the starting point for my experiment (Tom White's article).
Note that the most painful part about this setup is probably writing and
testing the scripts that will enable this to happen (and also customizing
your EC2 images).  It would be interesting to see someone else try it.



Re: Persistent HDFS On EC2

Posted by Kris Jirapinyo <kr...@biz360.com>.
That was also the starting point for my experiment (Tom White's article).
Note that the most painful part about this setup is probably writing and
testing the scripts that will enable this to happen (and also customizing
your EC2 images).  It would be interesting to see someone else try it.
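
As a rough illustration of the kind of script involved, a per-node boot
step might re-attach and mount that node's volume before starting the
datanode; a minimal sketch, assuming the 2009-era EC2 API tools, a fixed
volume-to-node mapping, and placeholder IDs:

    #!/bin/sh
    # re-attach this node's EBS volume after an instance (re)start,
    # mount it, and bring the datanode up on top of it
    VOL=vol-xxxxxxxx    # this node's volume, from a fixed mapping
    INST=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    ec2-attach-volume $VOL -i $INST -d /dev/sdf
    # wait for the device to appear before mounting
    while [ ! -e /dev/sdf ]; do sleep 2; done
    mkdir -p /ebs && mount /dev/sdf /ebs
    "$HADOOP_HOME"/bin/hadoop-daemon.sh start datanode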


On Wed, Mar 11, 2009 at 12:04 PM, Adam Rose <ad...@tubemogul.com> wrote:

> Tom White wrote a great blog post about some options here:
>
> http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html

Re: Persistent HDFS On EC2

Posted by Adam Rose <ad...@tubemogul.com>.
Tom White wrote a great blog post about some options here:

http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html

plus an Amazon article:

http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873&categoryID=112

Regards,

- Adam



Re: Persistent HDFS On EC2

Posted by Kris Jirapinyo <kr...@biz360.com>.
Why would you lose the locality of storage-per-machine if one EBS volume is
mounted to each machine instance?  When that machine goes down, you can just
restart the instance and re-mount the exact same volume.  I've tried this
idea before successfully on a 10-node cluster on EC2, and didn't see any
adverse performance effects--and actually Amazon claims that EBS I/O should
be even better than the instance stores.  The only concerns I see are that
you need to pay for EBS storage regardless of whether you use that storage
or not.  So, if you have 10 EBS volumes of 1 TB each, and you're just
starting out with your cluster so you're using only 50 GB on each EBS volume
so far for the month, you'd still have to pay for 10 TB worth of EBS volumes,
and that could be a hefty price each month.  Also, currently EBS needs
to be created in the same availability zone as your instances, so you need
to make sure that they are created correctly, as there is no direct
migration of EBS to different availability zones.
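
For concreteness, the setup described here comes down to creating each
volume in the same availability zone as the instances and attaching one
volume per node; a sketch using the 2009-era EC2 command-line API tools
(the size, zone, IDs, and device name are all placeholders):

    # create a 1 TB volume (size is in GiB) in the instances' zone
    ec2-create-volume -s 1024 -z us-east-1a
    # attach the vol-xxxxxxxx it returns to a node as /dev/sdf
    ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdf
    # on the node: make a filesystem the FIRST time only (it erases data),
    # then mount it where dfs.data.dir in hadoop-site.xml points
    mkfs.ext3 /dev/sdf
    mkdir -p /ebs && mount /dev/sdf /ebs

On the cost point, EBS bills per provisioned GB-month (around $0.10 per
GB-month at the time, plus per-request I/O charges), so ten 1 TB volumes
come to roughly 10 x 1024 GB x $0.10, about $1,000 a month, whether 50 GB
or the full terabyte of each volume is actually in use.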


On Wed, Mar 11, 2009 at 6:39 AM, Steve Loughran <st...@apache.org> wrote:

>  EBS would cost you more; you'd lose the locality of storage-per-machine.
>
> If you stick the output of some runs back into S3 then the next jobs have
> no locality and higher startup overhead to pull the data down, but you
> don't pay for that download (just the time it takes).

RE: Persistent HDFS On EC2

Posted by Malcolm Matalka <mm...@millennialmedia.com>.
I am estimating that all of the data I will need to run the job will be
~2 terabytes.  Is that too large a data set to be copying from S3 every
startup?
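
For a sense of scale, the usual tool for the bulk copy is distcp, which
runs the S3-to-HDFS transfer as a parallel MapReduce job across the
cluster; a sketch (the bucket, paths, and credentials are placeholders,
and s3n:// is the "native" S3 filesystem):

    # pull the previous day's output from S3 into HDFS in parallel
    hadoop distcp \
      s3n://ACCESS_KEY:SECRET_KEY@your-bucket/output/2009-03-10 \
      hdfs://namenode:9000/input/2009-03-11

Back of the envelope: if each of ten nodes sustains on the order of
25-50 MB/s from S3 (an assumption; real rates vary with instance type
and load), the aggregate is 250-500 MB/s, and 2 TB moves in roughly one
to two hours, which bounds whether copy-on-startup is tolerable as a
daily cost.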


-----Original Message-----
From: Steve Loughran [mailto:stevel@apache.org] 
Sent: Wednesday, March 11, 2009 9:39
To: core-user@hadoop.apache.org
Subject: Re: Persistent HDFS On EC2


EBS would cost you more; you'd lose the locality of storage-per-machine.

If you stick the output of some runs back into S3 then the next jobs
have no locality and higher startup overhead to pull the data down, but
you don't pay for that download (just the time it takes).

Re: Persistent HDFS On EC2

Posted by Steve Loughran <st...@apache.org>.
Malcolm Matalka wrote:

> I see I can use S3 as the file system; would
> I just create an EBS volume for each instance?  What are my options?

EBS would cost you more; you'd lose the locality of storage-per-machine.

If you stick the output of some runs back into S3 then the next jobs
have no locality and higher startup overhead to pull the data down, but
you don't pay for that download (just the time it takes).