Posted to common-user@hadoop.apache.org by Sandeep Reddy P <sa...@gmail.com> on 2012/05/31 20:41:53 UTC

Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

Hi,
We are getting 100 TB of data; with a replication factor of 3, that comes to
300 TB. We are planning to use Hadoop with 65 nodes. We want to know which
option is better in terms of hardware: physical machines or deploying Hadoop
on EC2. Is there any document that supports the use of physical machines?
Hardware specs: 2 quad-core CPUs, 32 GB RAM, 12 x 1 TB hard drives, and 10 Gb
Ethernet switches, at a cost of $10k per machine. Would it be cheaper to use
EC2? Will there be any performance issues?
-- 
Thanks,
sandeep

Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

Posted by Andrew Pawloski <pa...@temple.edu>.
>
> Thanks for the reply, Mathias.
> The actual data is 100 TB, so I think we need to host 100 TB on AWS.
>

It's also worth noting that, besides storage costs, simply moving 100 TB to
AWS is not a trivial task. Import/Export (http://aws.amazon.com/importexport/)
has a limit of 16 TB, although they seem like they might be flexible for
larger volumes.
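
To put a rough number on "not a trivial task", here is a minimal Python
sketch of how long a network upload of 100 TB might take. The link speeds and
the 0.8 efficiency factor are my own assumptions for illustration, not
figures from this thread, and transfer_days is just a hypothetical helper
name.

```python
def transfer_days(terabytes, link_gbps, efficiency=0.8):
    """Days needed to push `terabytes` over a link of `link_gbps`.

    `efficiency` is an assumed fraction of the nominal line rate that is
    actually achieved; it is a guess, not a measured value.
    """
    bits = terabytes * 1e12 * 8                 # decimal TB -> bits
    usable_bps = link_gbps * 1e9 * efficiency   # usable bits per second
    return bits / usable_bps / 86400            # seconds -> days


# Hypothetical uplink speeds, chosen only to show the order of magnitude.
for gbps in (0.1, 1.0, 10.0):
    print(f"{gbps:>5} Gbit/s: {transfer_days(100, gbps):.1f} days")
```

Even under optimistic assumptions the upload runs for days, which is why
shipping disks comes up at all.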


Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

Posted by Sandeep Reddy P <sa...@gmail.com>.
Thanks for the reply, Mathias.
The actual data is 100 TB, so I think we need to host 100 TB on AWS. Do we
get replication on AWS as well?
We are looking for a comparison of performance curves and issues between
physical machines and AWS.




-- 
Thanks,
sandeep

Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

Posted by Shi Yu <sh...@uchicago.edu>.
We once calculated the cost of using EC2 to train our machine
learning model with the EM algorithm (assuming we did everything in
one shot, which is almost impossible). The cost for each model
came to 10,000 US dollars. The cost per node per hour seems cheap,
but once it scales up (multiplied by the number of nodes and the
number of hours required for model training), the total is still
quite shocking.
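
The multiplication described above can be sketched in a few lines of
Python. The node count, hours, and hourly rate below are made-up
placeholders, not the numbers behind our estimate, and
cluster_job_cost_usd is a hypothetical helper name.

```python
def cluster_job_cost_usd(nodes, hours, usd_per_node_hour):
    """Total on-demand cost: every node is billed for every hour."""
    return nodes * hours * usd_per_node_hour


# Illustrative placeholders only (assumed, not taken from this thread).
nodes = 100
hours = 200
rate = 0.50  # USD per node-hour

total = cluster_job_cost_usd(nodes, hours, rate)
print(f"{nodes} nodes x {hours} h x ${rate:.2f}/h = ${total:,.0f}")
```

A rate that looks negligible per node-hour turns into a five-figure
bill once both the node count and the run time are in the hundreds.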

Shi

Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

Posted by Edward Capriolo <ed...@gmail.com>.
We were actually in an Amazon vs. host-it-yourself debate with someone,
which prompted us to do some calculations:

http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/myth_busters_ops_editition_is

We calculated the cost of storage alone for 300 TB on EC2 as 585K a month!

The cloud people hate hearing facts like this with staggering dollar
values. They also do not like hearing that a $35-a-month physical
server at Joe's datacenter is much better than an equivalent cloud
machine.

http://blog.carlmercier.com/2012/01/05/ec2-is-basically-one-big-ripoff/

When you bring up these facts, the go-to move is to go buzzword, with
phrases like "cost of system admin", "elastic", and "up-front initial
costs".

I will say that Amazon's EMR service is pretty cool and there is
something to it, but the cost of storage and good performance is off
the scale for me.



Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

Posted by Mathias Herberts <ma...@gmail.com>.
Correct me if I'm wrong, but storage alone for 300 TB on AWS will cost
roughly 300000 GB * 0.10 USD per GB per month * 12 months = 360000 USD
per annum.

We operate a cluster of 112 nodes offering 800+ TB of raw HDFS
capacity, and the CAPEX was less than 700k USD. If you ask me, there is
no comparison possible if you have the datacenter space to host your
machines.
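
For anyone who wants to redo the arithmetic, here is a minimal Python
sketch using only the figures quoted in this message. Decimal units
(1 TB = 1000 GB) are assumed to match the 300000 GB figure, the function
names are hypothetical, and hosting, power, and staff costs are
deliberately left out.

```python
GB_PER_TB = 1000  # decimal units, matching the 300000 GB figure


def yearly_storage_cost_usd(terabytes, usd_per_gb_month):
    """Yearly storage bill at a flat per-GB monthly price."""
    return terabytes * GB_PER_TB * usd_per_gb_month * 12


def capex_per_raw_tb_usd(capex_usd, raw_tb):
    """One-off hardware cost per TB of raw capacity."""
    return capex_usd / raw_tb


aws_yearly = yearly_storage_cost_usd(300, 0.10)
# "less than 700k" and "800+ TB" make this an upper bound per raw TB.
own_per_tb = capex_per_raw_tb_usd(700_000, 800)

print(f"AWS storage for 300 TB: about {aws_yearly:,.0f} USD per year")
print(f"Own cluster CAPEX: at most {own_per_tb:,.0f} USD per raw TB, once")
```

The comparison is rough on purpose: one side is a recurring bill, the
other a one-off purchase that still needs a place to run.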

Do you really need 10GbE? We're quite happy with 1GbE with no oversubscription.

Mathias.

Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

Posted by Jane Wayne <ja...@gmail.com>.
Sandeep,

How are you guys moving 100 TB into the AWS cloud? Are you using S3 or
EBS? If you are using S3, it does not work like HDFS. Although data is
replicated in S3 (I believe within an availability zone), it is not
the same as HDFS replication. You lose Hadoop's data locality
optimization when you use S3, which runs counter to the "sending
code to data" paradigm of MapReduce. Mind you, once you lose data
locality, traffic in and out of S3 means costs incurred as well.

I hear that to get PBs worth of data into AWS, it is not uncommon to
drive a truck with your data on some physical storage device (in fact,
Amazon will help you do this).

Please update us, this is an interesting problem.

Thanks,
