Posted to common-user@hadoop.apache.org by Adarsh Sharma <ad...@orkash.com> on 2011/01/17 05:11:54 UTC

Why Hadoop is slow in Cloud

Dear all,

Yesterday I ran a comparison between *Hadoop on Standalone Servers* 
and *Hadoop in the Cloud*.

I set up a Hadoop cluster of 4 nodes ( standalone machines ) in 
which one node acts as master ( Namenode, Jobtracker ) and the remaining 
nodes act as slaves ( Datanodes, Tasktrackers ).
On the other hand, for testing Hadoop in the *Cloud* ( Eucalyptus ), I made 
one standalone machine the *Hadoop Master* and configured the slaves 
on VMs in the cloud.

I am confused by the stats obtained from the testing. What I 
concluded is that the VMs give about half the performance of the 
Standalone Servers.

I expected some slowdown, but never at this level. Is this genuine, 
or could there be a configuration problem?

I am using a 1 Gb/s (10-1000 Mb/s) LAN for the VM machines and 100 Mb/s for 
the Standalone Servers.

Please have a look at the results and, if interested, comment on them.



Thanks & Regards

Adarsh Sharma

Re: Why Hadoop is slow in Cloud

Posted by Steve Loughran <st...@apache.org>.
On 20/01/11 23:24, Marc Farnum Rendino wrote:
> On Wed, Jan 19, 2011 at 2:50 PM, Edward Capriolo<ed...@gmail.com>  wrote:
>> As for virtualization, paravirtualization, emulation... (whatever -ulization)
>
> Wow; that's a really big category.
>
>> There are always a lot of variables, but the net result is always
>> less. It may be 2% 10% or 15%, but it is always less.
>
> If it's less of something I don't care about, it's not a factor (for me).
>
> On the other hand, if I'm paying less and getting more of what I DO
> care about, I'd rather go with that.
>
> It's about the cost/benefit *ratio*.

There's also perf vs storage. On a big cluster, you could add a second 
Nehalem CPU and maybe get 10-15% boost on throughput, or for the same 
capex and opex add 10% new servers, which at scale means many more TB of 
storage and the compute to go with it. The decision rests with the team 
and their problems.
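To put rough numbers on the trade-off Steve describes (every figure below is a hypothetical illustration, not from this thread):

```python
# Hypothetical sketch: the same budget buys either a second CPU per node
# (~10-15% more throughput) or ~10% more whole servers, which add both
# compute and storage.
servers = 100
cpu_boost = 0.125                  # midpoint of the 10-15% estimate
extra_nodes = servers // 10        # 10% more servers for the same spend
storage_per_node_tb = 12           # assumed disk per node

throughput_cpu_option = servers * (1 + cpu_boost)   # in "node units"
throughput_node_option = servers + extra_nodes
extra_storage_tb = extra_nodes * storage_per_node_tb

# Comparable throughput either way, but the extra nodes also add
# raw storage capacity (120 TB in this made-up example).
```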

Re: Why Hadoop is slow in Cloud

Posted by Marc Farnum Rendino <mv...@gmail.com>.
On Wed, Jan 19, 2011 at 2:50 PM, Edward Capriolo <ed...@gmail.com> wrote:
> As for virtualization, paravirtualization, emulation... (whatever -ulization)

Wow; that's a really big category.

> There are always a lot of variables, but the net result is always
> less. It may be 2% 10% or 15%, but it is always less.

If it's less of something I don't care about, it's not a factor (for me).

On the other hand, if I'm paying less and getting more of what I DO
care about, I'd rather go with that.

It's about the cost/benefit *ratio*.

Re: Why Hadoop is slow in Cloud

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Jan 19, 2011 at 1:32 PM, Marc Farnum Rendino <mv...@gmail.com> wrote:
> On Tue, Jan 18, 2011 at 8:59 AM, Adarsh Sharma <ad...@orkash.com> wrote:
>> I want to know *AT WHAT COST* it comes.
>> 10-15% is tolerable, but at this rate it needs some work.
>>
>> As Steve rightly suggests, I am doing some CPU-bound testing to find
>> the exact stats.
>
> Yep; you've got to test your own workflow to see how it's affected by
> your conditions - lots of variables.
>
> BTW: For AWS (Amazon) there are significant differences in I/O, for
> different instance types; if I recall correctly, for best I/O, start
> no lower than m1.large. And the three storage types (instance, EBS,
> and S3) have different characteristics as well; I'd start with EBS,
> though I haven't worked much with S3 yet, and that does offer some
> benefits.
>
As for virtualization, paravirtualization, emulation... (whatever -ulization):
there are always a lot of variables, but the net result is always
less. It may be 2%, 10%, or 15%, but it is always less. Take a $50,000
server: such a solution takes 10% of the performance right off the top.
There goes $5,000.00 of performance right out the window. I never thought
throwing away performance was acceptable (I was born without a silver
SSD in my crib). Plus, some people even pay for virtualization
software (vendors will remain nameless); truly paying for less.
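Edward's arithmetic, spelled out (the server price and overhead are his illustrative figures, not measurements):

```python
server_cost = 50_000     # Edward's example server price, in dollars
overhead = 0.10          # assumed flat 10% virtualization overhead

wasted = server_cost * overhead
# 10% off the top of a $50,000 server is $5,000 of capacity
# you paid for but never get to use.
```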

Re: Why Hadoop is slow in Cloud

Posted by Marc Farnum Rendino <mv...@gmail.com>.
On Tue, Jan 18, 2011 at 8:59 AM, Adarsh Sharma <ad...@orkash.com> wrote:
> I want to know *AT WHAT COST* it comes.
> 10-15% is tolerable, but at this rate it needs some work.
>
> As Steve rightly suggests, I am doing some CPU-bound testing to find
> the exact stats.

Yep; you've got to test your own workflow to see how it's affected by
your conditions - lots of variables.

BTW: For AWS (Amazon) there are significant differences in I/O, for
different instance types; if I recall correctly, for best I/O, start
no lower than m1.large. And the three storage types (instance, EBS,
and S3) have different characteristics as well; I'd start with EBS,
though I haven't worked much with S3 yet, and that does offer some
benefits.

Re: Why Hadoop is slow in Cloud

Posted by Steve Loughran <st...@apache.org>.
On 21/01/11 09:20, Evert Lammerts wrote:
>> Even with the performance hit, there are still benefits to running
>> Hadoop this way:
>>    -as you only consume/pay for the CPU time you use, if you are only
>> running batch jobs it's lower cost than having a Hadoop cluster that
>> is under-used.
>>
>>    -if your data is stored in the cloud infrastructure, then you need to
>> data mine it in VMs, unless you want to take the time and money hit of
>> moving it out, and have somewhere to store it.
>>
>> -if the infrastructure lets you, you can lock down the cluster so it is
>> secure.
>>
>> Where a physical cluster is good is that it is a very low cost way of
>> storing data, provided you can analyse it with Hadoop, and provided you
>> can keep that cluster busy most of the time, either with Hadoop work or
>> other scheduled work. If your cluster is idle for computation, you are
>> still paying the capital and (reduced) electricity costs, so the cost
>> of
>> storage and what compute you do effectively increases.
>
> Agreed, but this has little to do with Hadoop as a middleware and more to do
> with the benefits of virtualized vs physical infrastructure. I agree that it
> is convenient to use HDFS as a DFS to keep your data local to your VMs, but
> you could choose other DFS's as well.

We don't use HDFS; we bring up VMs close to where the data persists.

http://www.slideshare.net/steve_l/high-availability-hadoop

>
> The major benefit of Hadoop is its data-locality principle, and this is what
> you give up when you move to the cloud. Regardless of whether you store your
> data within your VM or on a NAS, it *will* have to travel over a line. As
> soon as that happens you lose the benefit of data-locality and are left with
> MapReduce as a way for parallel computing. And in that case you could use
> less restrictive software, like maybe PBS. You could even install HOD on
> your virtual cluster, if you'd like the possibility of MapReduce.

We don't suffer locality hits so much, but you do pay for the extra 
infrastructure costs of a more agile datacentre, and if you go for 
redundancy in hardware over replication, you have fewer places to run 
your code.

Even on EC2, which doesn't let you tell its VM placer which datasets you 
want to work with so that it can use that in its placement decisions, 
you do get locality once the data is in the datanodes.

>
> Adarsh, there are probably results around of more generic benchmark tools
> (Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should
> give you a better idea of the penalties of virtualization. (Our experience
> with a number of technologies on our OpenNebula cloud is, like Steve points
> out, that you mainly pay for disk I/O performance.)

-would be interesting to see anything you can publish there...

>
> I think a decision to go with either cloud or physical infrastructure should
> be based on the frequency, intensity and types of computation you expect on
> the short term (that should include operations dealing with data), and the
> way you think these parameters will develop on a mid-long term. And then
> compare the prices of a physical cluster that meets those demands (make sure
> to include power and operations) and the investment you would otherwise need
> to make in Cloud.

+1


RE: Why Hadoop is slow in Cloud

Posted by Evert Lammerts <Ev...@sara.nl>.
> Even with the performance hit, there are still benefits to running
> Hadoop this way:
>   -as you only consume/pay for the CPU time you use, if you are only
> running batch jobs it's lower cost than having a Hadoop cluster that
> is under-used.
> 
>   -if your data is stored in the cloud infrastructure, then you need to
> data mine it in VMs, unless you want to take the time and money hit of
> moving it out, and have somewhere to store it.
> 
> -if the infrastructure lets you, you can lock down the cluster so it is
> secure.
> 
> Where a physical cluster is good is that it is a very low cost way of
> storing data, provided you can analyse it with Hadoop, and provided you
> can keep that cluster busy most of the time, either with Hadoop work or
> other scheduled work. If your cluster is idle for computation, you are
> still paying the capital and (reduced) electricity costs, so the cost
> of
> storage and what compute you do effectively increases.

Agreed, but this has little to do with Hadoop as a middleware and more to do
with the benefits of virtualized vs physical infrastructure. I agree that it
is convenient to use HDFS as a DFS to keep your data local to your VMs, but
you could choose other DFS's as well.

The major benefit of Hadoop is its data-locality principle, and this is what
you give up when you move to the cloud. Regardless of whether you store your
data within your VM or on a NAS, it *will* have to travel over a line. As
soon as that happens you lose the benefit of data-locality and are left with
MapReduce as a way for parallel computing. And in that case you could use
less restrictive software, like maybe PBS. You could even install HOD on
your virtual cluster, if you'd like the possibility of MapReduce.

Adarsh, there are probably results around of more generic benchmark tools
(Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should
give you a better idea of the penalties of virtualization. (Our experience
with a number of technologies on our OpenNebula cloud is, like Steve points
out, that you mainly pay for disk I/O performance.)
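For a quick-and-dirty sanity check before reaching for a full benchmark suite, a crude sequential-write test run both inside and outside the VM can already show the disk I/O gap. A rough sketch (page-cache effects and fsync behaviour make the numbers indicative only; the tools Evert names are far more rigorous):

```python
import os
import time
import tempfile

def disk_write_throughput(size_mb=64, block_kb=1024):
    """Crude sequential-write benchmark: returns MB/s for size_mb of data,
    fsync'd at the end so the data actually reaches the (virtual) disk."""
    block = b"\0" * (block_kb * 1024)
    blocks = size_mb * 1024 // block_kb
    fd, path = tempfile.mkstemp()
    try:
        start = time.time()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        elapsed = max(time.time() - start, 1e-9)
        return size_mb / elapsed
    finally:
        os.remove(path)
```

Run the same function on the physical host and inside a VM; the ratio of the two results is a first approximation of the virtualization disk penalty being discussed here.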

I think a decision to go with either cloud or physical infrastructure should
be based on the frequency, intensity and types of computation you expect on
the short term (that should include operations dealing with data), and the
way you think these parameters will develop on a mid-long term. And then
compare the prices of a physical cluster that meets those demands (make sure
to include power and operations) and the investment you would otherwise need
to make in Cloud.
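That comparison might be sketched as follows (every price and parameter below is a made-up placeholder, to be replaced with real quotes for your own hardware and provider):

```python
def physical_annual_cost(capex, lifetime_years, power_ops_per_year):
    """Amortized yearly cost of owning and running a cluster."""
    return capex / lifetime_years + power_ops_per_year

def cloud_annual_cost(instance_hour_price, instances, hours_per_year):
    """Yearly cost of renting equivalent capacity on demand."""
    return instance_hour_price * instances * hours_per_year

# Hypothetical figures:
physical = physical_annual_cost(capex=120_000, lifetime_years=3,
                                power_ops_per_year=15_000)   # 55,000/yr

# The cloud option depends heavily on how many hours you actually run:
always_on = cloud_annual_cost(0.40, 20, 8760)   # 24x7: 70,080/yr
batch_only = cloud_annual_cost(0.40, 20, 2000)  # batch jobs: 16,000/yr
```

Under these placeholder numbers the cloud loses for an always-busy cluster and wins for occasional batch work, which matches the intuition expressed elsewhere in this thread.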

Re: Could not obtain block

Posted by madhu phatak <ph...@gmail.com>.
Maybe some datanode is down in the cluster... check the datanode logs of the
nodes in the cluster.

On Thu, Jan 20, 2011 at 3:43 PM, Cavus,M.,Fa. Post Direkt <
M.Cavus@postdirekt.de> wrote:

> Hi,
> I ran the wordcount example on my hadoop cluster and got a "Could not
> obtain block" exception. Does anyone know what the problem is? If I run
> this program locally, it works fine.
>
> I do this:
>
> root@master bin]# ./hadoop jar ../hadoop-0.20.2-examples.jar wordcount point/start-all.sh s/start-all.sh
> 11/01/20 11:57:56 INFO input.FileInputFormat: Total input paths to process : 1
> 11/01/20 11:57:57 INFO mapred.JobClient: Running job: job_201101201036_0002
> 11/01/20 11:57:58 INFO mapred.JobClient:  map 0% reduce 0%
> 11/01/20 11:58:16 INFO mapred.JobClient: Task Id : attempt_201101201036_0002_m_000000_0, Status : FAILED
> java.io.IOException: Could not obtain block: blk_7716960257524845873_1708 file=/user/root/point/start-all.sh
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
>         at java.io.DataInputStream.read(DataInputStream.java:83)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>         at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>         at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 11/01/20 11:58:33 INFO mapred.JobClient: Task Id : attempt_201101201036_0002_m_000000_1, Status : FAILED
> java.io.IOException: Could not obtain block: blk_7716960257524845873_1708 file=/user/root/point/start-all.sh
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
>         at java.io.DataInputStream.read(DataInputStream.java:83)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>         at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>         at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 11/01/20 11:58:48 INFO mapred.JobClient: Task Id : attempt_201101201036_0002_m_000000_2, Status : FAILED
> java.io.IOException: Could not obtain block: blk_7716960257524845873_1708 file=/user/root/point/start-all.sh
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
>         at java.io.DataInputStream.read(DataInputStream.java:83)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>         at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>         at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 11/01/20 11:59:06 INFO mapred.JobClient: Job complete: job_201101201036_0002
> 11/01/20 11:59:06 INFO mapred.JobClient: Counters: 2
> 11/01/20 11:59:06 INFO mapred.JobClient:   Job Counters
> 11/01/20 11:59:06 INFO mapred.JobClient:     Launched map tasks=4
> 11/01/20 11:59:06 INFO mapred.JobClient:     Failed map tasks=1
>
> Regards
> Musa Cavus
>

Could not obtain block

Posted by "Cavus,M.,Fa. Post Direkt" <M....@postdirekt.de>.
Hi,
I ran the wordcount example on my hadoop cluster and got a "Could not
obtain block" exception. Does anyone know what the problem is? If I run
this program locally, it works fine.

I do this:

root@master bin]# ./hadoop jar ../hadoop-0.20.2-examples.jar wordcount point/start-all.sh s/start-all.sh
11/01/20 11:57:56 INFO input.FileInputFormat: Total input paths to process : 1
11/01/20 11:57:57 INFO mapred.JobClient: Running job: job_201101201036_0002
11/01/20 11:57:58 INFO mapred.JobClient:  map 0% reduce 0%
11/01/20 11:58:16 INFO mapred.JobClient: Task Id : attempt_201101201036_0002_m_000000_0, Status : FAILED
java.io.IOException: Could not obtain block: blk_7716960257524845873_1708 file=/user/root/point/start-all.sh
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

11/01/20 11:58:33 INFO mapred.JobClient: Task Id : attempt_201101201036_0002_m_000000_1, Status : FAILED
java.io.IOException: Could not obtain block: blk_7716960257524845873_1708 file=/user/root/point/start-all.sh
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

11/01/20 11:58:48 INFO mapred.JobClient: Task Id : attempt_201101201036_0002_m_000000_2, Status : FAILED
java.io.IOException: Could not obtain block: blk_7716960257524845873_1708 file=/user/root/point/start-all.sh
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

11/01/20 11:59:06 INFO mapred.JobClient: Job complete: job_201101201036_0002
11/01/20 11:59:06 INFO mapred.JobClient: Counters: 2
11/01/20 11:59:06 INFO mapred.JobClient:   Job Counters
11/01/20 11:59:06 INFO mapred.JobClient:     Launched map tasks=4
11/01/20 11:59:06 INFO mapred.JobClient:     Failed map tasks=1


Regards
Musa Cavus

Re: Why Hadoop is slow in Cloud

Posted by Steve Loughran <st...@apache.org>.
On 18/01/11 13:59, Adarsh Sharma wrote:
> Marc Farnum Rendino wrote:
>> Virtualization != Emulation
>>
>> Yes, virtualization does have its own costs (as does running directly
>> on hardware) - depending on the specifics of both the virtualization
>> *and* the task at hand.
> Absolutely right, and for this I perform the initial testing.
>
> I want to know *AT WHAT COST* it comes.
> 10-15% is tolerable, but at this rate it needs some work.

Even with the performance hit, there are still benefits to running Hadoop 
this way:
  -as you only consume/pay for the CPU time you use, if you are only running 
batch jobs it's lower cost than having a Hadoop cluster that is under-used.

  -if your data is stored in the cloud infrastructure, then you need to 
data mine it in VMs, unless you want to take the time and money hit of 
moving it out, and have somewhere to store it.

-if the infrastructure lets you, you can lock down the cluster so it is 
secure.

Where a physical cluster is good is that it is a very low cost way of 
storing data, provided you can analyse it with Hadoop, and provided you 
can keep that cluster busy most of the time, either with Hadoop work or 
other scheduled work. If your cluster is idle for computation, you are 
still paying the capital and (reduced) electricity costs, so the cost of 
storage and what compute you do effectively increases.
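The under-used-cluster point can be made concrete (the annual cost figure below is hypothetical):

```python
def effective_cost_per_used_hour(annual_cost, utilization):
    """Idle time inflates the effective price of the compute you do run.
    utilization: fraction of the year (8760 h) the cluster does work."""
    return annual_cost / (8760 * utilization)

# Hypothetical cluster costing 50,000/year to own and power:
busy = effective_cost_per_used_hour(50_000, 1.0)    # fully used
idle = effective_cost_per_used_hour(50_000, 0.25)   # busy 25% of the time
# At 25% utilization, every compute hour you actually run
# effectively costs four times as much as on a fully busy cluster.
```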

-steve

Re: Why Hadoop is slow in Cloud

Posted by Adarsh Sharma <ad...@orkash.com>.
Marc Farnum Rendino wrote:
> Virtualization != Emulation
>
> Yes, virtualization does have its own costs (as does running directly
> on hardware) - depending on the specifics of both the virtualization
> *and* the task at hand.
>   
Absolutely right, and that is why I performed the initial testing.

I want to know *AT WHAT COST* it comes.
10-15% is tolerable, but at this rate it needs some work.

As Steve rightly suggests, I am doing some CPU-bound testing to find 
the exact stats.

I will let you know after the work.

> If my task (in the general sense) is CPU bound, it doesn't matter (to
> me) if the virtualization has a disk I/O penalty.
>   

But is it possible to do some tuning in the workflow of the VMs 
to improve performance?

> If on the other hand, my task is limited by a disk I/O penalty, I'll
> weigh that into the *total* cost/benefit, and virtualization may not -
> or may still - be an advantageous choice.
>
>   
Some reasons for the slowness would be highly helpful. Any guidance is appreciated.

> Context is king.
>
>   
Thanks & best Regards

Adarsh Sharma

> On Mon, Jan 17, 2011 at 10:41 AM, Edward Capriolo <ed...@gmail.com> wrote:
>   
>> Everything you emulate you cut X% performance right off the top...
>>     


Re: Why Hadoop is slow in Cloud

Posted by Marc Farnum Rendino <mv...@gmail.com>.
Virtualization != Emulation

Yes, virtualization does have its own costs (as does running directly
on hardware) - depending on the specifics of both the virtualization
*and* the task at hand.

If my task (in the general sense) is CPU bound, it doesn't matter (to
me) if the virtualization has a disk I/O penalty.

If on the other hand, my task is limited by a disk I/O penalty, I'll
weigh that into the *total* cost/benefit, and virtualization may not -
or may still - be an advantageous choice.

Context is king.

On Mon, Jan 17, 2011 at 10:41 AM, Edward Capriolo <ed...@gmail.com> wrote:
> Everything you emulate you cut X% performance right off the top...

Re: Why Hadoop is slow in Cloud

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Jan 17, 2011 at 6:08 AM, Steve Loughran <st...@apache.org> wrote:
> On 17/01/11 04:11, Adarsh Sharma wrote:
>>
>> Dear all,
>>
>> Yesterday I performed a kind of testing between *Hadoop in Standalone
>> Servers* & *Hadoop in Cloud.
>>
>> *I establish a Hadoop cluster of 4 nodes ( Standalone Machines ) in
>> which one node act as Master ( Namenode , Jobtracker ) and the remaining
>> nodes act as slaves ( Datanodes, Tasktracker ).
>> On the other hand, for testing Hadoop in *Cloud* ( Euclayptus ), I made
>> one Standalone Machine as *Hadoop Master* and the slaves are configured
>> on the VM's in Cloud.
>>
>> I am confused about the stats obtained after the testing. What I
>> concluded that the VM are giving half peformance as compared with
>> Standalone Servers.
>
> Interesting stats, nothing that massively surprises me, especially as your
> benchmarks are very much streaming through datasets. If you were doing
> something more CPU intensive (graph work, for example), things wouldn't look
> so bad
>
> I've done stuff in this area.
> http://www.slideshare.net/steve_l/farming-hadoop-inthecloud
>
>
>
>>
>> I am expected some slow down but at this level I never expect. Would
>> this is genuine or there may be some configuration problem.
>>
>> I am using 1 GB (10-1000mb/s) LAN in VM machines and 100mb/s in
>> Standalone Servers.
>>
>> Please have a look on the results and if interested comment on it.
>>
>
>
> The big killer here is File IO, with today's HDD controllers and virtual
> filesystems, disk IO is way underpowered compared to physical disk IO.
> Networking is reduced (but improving), and CPU can be pretty good, but disk
> is bad.
>
>
> Why?
>
> 1.  Every access to a block in the VM is turned into virtual disk controller
> operations which are then interpreted by the VDC and turned into
> reads/writes in the virtual disk drive
>
> 2. which is turned into seeks, reads and writes in the physical hardware.
>
> Some workarounds
>
> -allocate physical disks for the HDFS filesystem, for the duration of the
> VMs.
>
> -have the local hosts serve up a bit of their filesystem on a fast protocol
> (like NFS), and have every VM mount the local physical NFS filestore as
> their hadoop data dirs.
>
>

Q: "Why is my Nintendo emulator slow on an 800 MHz computer made 10
years after the Nintendo?"
A: Emulation.

Everything you emulate cuts X% performance right off the top.

Emulation is great when you want to run Mac on Windows, or FreeBSD on
Linux, or Nintendo on Linux. However, most people would do better with
technologies that use kernel-level isolation, such as Linux containers,
Solaris Zones, Linux-VServer (my favorite) http://linux-vserver.org/,
User Mode Linux, or similar technologies that ISOLATE rather than
EMULATE.

Sorry, list, I feel I rant about this bi-annually. I have just always
been so shocked at how many people get lured into cloud and
virtualized solutions by promises of "better management" and "near
native performance".

Re: Why Hadoop is slow in Cloud

Posted by Steve Loughran <st...@apache.org>.
On 17/01/11 04:11, Adarsh Sharma wrote:
> Dear all,
>
> Yesterday I performed a kind of testing between *Hadoop in Standalone
> Servers* & *Hadoop in Cloud.
>
> *I establish a Hadoop cluster of 4 nodes ( Standalone Machines ) in
> which one node act as Master ( Namenode , Jobtracker ) and the remaining
> nodes act as slaves ( Datanodes, Tasktracker ).
> On the other hand, for testing Hadoop in *Cloud* ( Euclayptus ), I made
> one Standalone Machine as *Hadoop Master* and the slaves are configured
> on the VM's in Cloud.
>
> I am confused about the stats obtained after the testing. What I
> concluded that the VM are giving half peformance as compared with
> Standalone Servers.

Interesting stats, nothing that massively surprises me, especially as 
your benchmarks are very much streaming through datasets. If you were 
doing something more CPU intensive (graph work, for example), things 
wouldn't look so bad

I've done stuff in this area.
http://www.slideshare.net/steve_l/farming-hadoop-inthecloud



>
> I am expected some slow down but at this level I never expect. Would
> this is genuine or there may be some configuration problem.
>
> I am using 1 GB (10-1000mb/s) LAN in VM machines and 100mb/s in
> Standalone Servers.
>
> Please have a look on the results and if interested comment on it.
>


The big killer here is file IO: with today's HDD controllers and virtual 
filesystems, disk IO is way underpowered compared to physical disk IO. 
Networking is reduced (but improving), and CPU can be pretty good, but 
disk is bad.


Why?

1.  Every access to a block in the VM is turned into virtual disk 
controller operations which are then interpreted by the VDC and turned 
into reads/writes in the virtual disk drive

2. which is turned into seeks, reads and writes in the physical hardware.

Some workarounds

-allocate physical disks for the HDFS filesystem, for the duration of 
the VMs.

-have the local hosts serve up a bit of their filesystem on a fast 
protocol (like NFS), and have every VM mount the local physical NFS 
filestore as their hadoop data dirs.
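For the NFS workaround, the wiring might look roughly like this (the mount point and export path are hypothetical; `dfs.data.dir` is the standard datanode storage property in Hadoop 0.20):

```xml
<!-- hdfs-site.xml inside each VM: the datanode's data directory
     is really an NFS mount of the physical host's local disk,
     e.g. after: mount host:/export/hadoop /mnt/host-disk -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/host-disk/hadoop/data</value>
</property>
```

The trade-off, as Steve notes, is that writes now cross NFS instead of a virtual disk controller; whether that is faster depends on the protocol and the hypervisor's disk path.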