Posted to common-user@hadoop.apache.org by Adarsh Sharma <ad...@orkash.com> on 2011/01/17 05:11:54 UTC
Why Hadoop is slow in Cloud
Dear all,
Yesterday I ran a comparison test between *Hadoop on Standalone
Servers* and *Hadoop in the Cloud*.
I set up a Hadoop cluster of 4 nodes (standalone machines) in
which one node acts as master (NameNode, JobTracker) and the remaining
nodes act as slaves (DataNodes, TaskTrackers).
For testing Hadoop in the *Cloud* (Eucalyptus), on the other hand, I made
one standalone machine the *Hadoop Master* and configured the slaves
on VMs in the cloud.
I am confused by the stats obtained from the testing. What I
concluded is that the VMs give half the performance of the
standalone servers.
I expected some slowdown, but never at this level. Is
this genuine, or might there be some configuration problem?
I am using a 1 Gb (10-1000 Mb/s) LAN for the VMs and 100 Mb/s for the
standalone servers.
Please have a look at the results and comment on them if interested.
Thanks & Regards
Adarsh Sharma
Re: Why Hadoop is slow in Cloud
Posted by Steve Loughran <st...@apache.org>.
On 20/01/11 23:24, Marc Farnum Rendino wrote:
> On Wed, Jan 19, 2011 at 2:50 PM, Edward Capriolo<ed...@gmail.com> wrote:
>> As for virtualization,paravirtualization,emulation.....(whatever ulization)
>
> Wow; that's a really big category.
>
>> There are always a lot of variables, but the net result is always
>> less. It may be 2% 10% or 15%, but it is always less.
>
> If it's less of something I don't care about, it's not a factor (for me).
>
> On the other hand, if I'm paying less and getting more of what I DO
> care about, I'd rather go with that.
>
> It's about the cost/benefit *ratio*.
There's also perf vs storage. On a big cluster, you could add a second
Nehalem CPU and maybe get a 10-15% boost in throughput, or for the same
capex and opex add 10% more servers, which at scale means many more TB of
storage and the compute to go with it. The decision rests with the team
and their problems.
Re: Why Hadoop is slow in Cloud
Posted by Marc Farnum Rendino <mv...@gmail.com>.
On Wed, Jan 19, 2011 at 2:50 PM, Edward Capriolo <ed...@gmail.com> wrote:
> As for virtualization,paravirtualization,emulation.....(whatever ulization)
Wow; that's a really big category.
> There are always a lot of variables, but the net result is always
> less. It may be 2% 10% or 15%, but it is always less.
If it's less of something I don't care about, it's not a factor (for me).
On the other hand, if I'm paying less and getting more of what I DO
care about, I'd rather go with that.
It's about the cost/benefit *ratio*.
Re: Why Hadoop is slow in Cloud
Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Jan 19, 2011 at 1:32 PM, Marc Farnum Rendino <mv...@gmail.com> wrote:
> On Tue, Jan 18, 2011 at 8:59 AM, Adarsh Sharma <ad...@orkash.com> wrote:
>> I want to know *AT WHAT COST* it comes.
>> 10-15% is tolerable, but at this rate it needs some work.
>>
>> As Steve rightly suggests, I am doing some CPU-bound testing to get the
>> exact stats.
>
> Yep; you've got to test your own workflow to see how it's affected by
> your conditions - lots of variables.
>
> BTW: For AWS (Amazon) there are significant differences in I/O, for
> different instance types; if I recall correctly, for best I/O, start
> no lower than m1.large. And the three storage types (instance, EBS,
> and S3) have different characteristics as well; I'd start with EBS,
> though I haven't worked much with S3 yet, and that does offer some
> benefits.
>
As for virtualization, paravirtualization, emulation... (whatever ulization):
there are always a lot of variables, but the net result is always
less. It may be 2%, 10% or 15%, but it is always less. Take a $50,000 server:
such a solution takes 10% of the performance right off the top. There
goes $5,000.00 of performance right out the window. I never thought
throwing away performance was acceptable (I was born without a silver
SSD in my crib). Plus, some people even pay for virtualization
software (vendors will remain nameless). Truly paying for less.
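Written out as a quick sketch, the arithmetic looks like this. The $50,000 price and the 2%, 10% and 15% figures are the ones from the message; the function name is made up for illustration, and treating lost performance as a straight share of the purchase price is the message's own simplification.

```python
def lost_value(server_price, overhead_fraction):
    """Dollar value of the capacity consumed by virtualization overhead."""
    return server_price * overhead_fraction

# Overheads mentioned in the message, applied to the $50,000 server.
for overhead in (0.02, 0.10, 0.15):
    print(f"{overhead:.0%} overhead on a $50,000 server: "
          f"${lost_value(50_000, overhead):,.2f}")
```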
Re: Why Hadoop is slow in Cloud
Posted by Marc Farnum Rendino <mv...@gmail.com>.
On Tue, Jan 18, 2011 at 8:59 AM, Adarsh Sharma <ad...@orkash.com> wrote:
> I want to know *AT WHAT COST* it comes.
> 10-15% is tolerable, but at this rate it needs some work.
>
> As Steve rightly suggests, I am doing some CPU-bound testing to get the
> exact stats.
Yep; you've got to test your own workflow to see how it's affected by
your conditions - lots of variables.
BTW: For AWS (Amazon) there are significant differences in I/O, for
different instance types; if I recall correctly, for best I/O, start
no lower than m1.large. And the three storage types (instance, EBS,
and S3) have different characteristics as well; I'd start with EBS,
though I haven't worked much with S3 yet, and that does offer some
benefits.
Re: Why Hadoop is slow in Cloud
Posted by Steve Loughran <st...@apache.org>.
On 21/01/11 09:20, Evert Lammerts wrote:
>> Even with performance hit, there are still benefits running Hadoop this
>> way
>> -as you only consume/pay for CPU time you use, if you are only
>> running
>> batch jobs, its lower cost than having a hadoop cluster that is under-
>> used.
>>
>> -if your data is stored in the cloud infrastructure, then you need to
>> data mine it in VMs, unless you want to take the time and money hit of
>> moving it out, and have somewhere to store it.
>>
>> -if the infrastructure lets you, you can lock down the cluster so it is
>> secure.
>>
>> Where a physical cluster is good is that it is a very low cost way of
>> storing data, provided you can analyse it with Hadoop, and provided you
>> can keep that cluster busy most of the time, either with Hadoop work or
>> other scheduled work. If your cluster is idle for computation, you are
>> still paying the capital and (reduced) electricity costs, so the cost
>> of
>> storage and what compute you do effectively increases.
>
> Agreed, but this has little to do with Hadoop as a middleware and more to do
> with the benefits of virtualized vs physical infrastructure. I agree that it
> is convenient to use HDFS as a DFS to keep your data local to your VMs, but
> you could choose other DFS's as well.
We don't use HDFS; we bring up VMs close to where the data persists.
http://www.slideshare.net/steve_l/high-availability-hadoop
>
> The major benefit of Hadoop is its data-locality principle, and this is what
> you give up when you move to the cloud. Regardless of whether you store your
> data within your VM or on a NAS, it *will* have to travel over a line. As
> soon as that happens you lose the benefit of data-locality and are left with
> MapReduce as a way for parallel computing. And in that case you could use
> less restrictive software, like maybe PBS. You could even install HOD on
> your virtual cluster, if you'd like the possibility of MapReduce.
We don't suffer locality hits so much, but you do pay for the extra
infrastructure costs of a more agile datacentre, and if you go for
redundancy in hardware over replication, you have fewer places to run
your code.
Even on EC2, which doesn't let you tell it what datasets you want to
play with for its VM placer to use in its decisions, once data is in the
datanodes you do get locality.
>
> Adarsh, there are probably results around of more generic benchmark tools
> (Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should
> give you a better idea of the penalties of virtualization. (Our experience
> with a number of technologies on our OpenNebula cloud is, like Steve points
> out, that you mainly pay for disk I/O performance.)
-would be interesting to see anything you can publish there...
>
> I think a decision to go with either cloud or physical infrastructure should
> be based on the frequency, intensity and types of computation you expect on
> the short term (that should include operations dealing with data), and the
> way you think these parameters will develop on a mid-long term. And then
> compare the prices of a physical cluster that meets those demands (make sure
> to include power and operations) and the investment you would otherwise need
> to make in Cloud.
+1
RE: Why Hadoop is slow in Cloud
Posted by Evert Lammerts <Ev...@sara.nl>.
> Even with performance hit, there are still benefits running Hadoop this
> way
> -as you only consume/pay for CPU time you use, if you are only
> running
> batch jobs, its lower cost than having a hadoop cluster that is under-
> used.
>
> -if your data is stored in the cloud infrastructure, then you need to
> data mine it in VMs, unless you want to take the time and money hit of
> moving it out, and have somewhere to store it.
>
> -if the infrastructure lets you, you can lock down the cluster so it is
> secure.
>
> Where a physical cluster is good is that it is a very low cost way of
> storing data, provided you can analyse it with Hadoop, and provided you
> can keep that cluster busy most of the time, either with Hadoop work or
> other scheduled work. If your cluster is idle for computation, you are
> still paying the capital and (reduced) electricity costs, so the cost
> of
> storage and what compute you do effectively increases.
Agreed, but this has little to do with Hadoop as a middleware and more to do
with the benefits of virtualized vs physical infrastructure. I agree that it
is convenient to use HDFS as a DFS to keep your data local to your VMs, but
you could choose other DFS's as well.
The major benefit of Hadoop is its data-locality principle, and this is what
you give up when you move to the cloud. Regardless of whether you store your
data within your VM or on a NAS, it *will* have to travel over a line. As
soon as that happens you lose the benefit of data-locality and are left with
MapReduce as a way of doing parallel computing. And in that case you could use
less restrictive software, like maybe PBS. You could even install HOD on
your virtual cluster, if you'd like the possibility of MapReduce.
Adarsh, there are probably results around from more generic benchmark tools
(Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should
give you a better idea of the penalties of virtualization. (Our experience
with a number of technologies on our OpenNebula cloud is, as Steve points
out, that you mainly pay in disk I/O performance.)
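For a first rough number before reaching for a full benchmark suite, a sequential-write timing like the sketch below can be run on both a VM and a physical node. This is only an illustration, not one of the tools named above; the file size, block size and function name are arbitrary choices, and caching and the underlying storage will influence the result.

```python
import os
import tempfile
import time

def sequential_write_mb_per_s(directory, total_mb=64, block_kb=1024):
    """Write about total_mb of zeros to a temp file in `directory`, fsync it,
    and return the observed throughput in MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    written_mb = blocks * block_kb / 1024
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return written_mb / elapsed

if __name__ == "__main__":
    rate = sequential_write_mb_per_s(tempfile.gettempdir())
    print(f"sequential write: {rate:.1f} MB/s")
```

Running the same script against the directory that holds the Hadoop data on each kind of node gives a like-for-like figure, though a single run is noisy and several runs should be averaged.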
I think a decision to go with either cloud or physical infrastructure should
be based on the frequency, intensity and types of computation you expect in
the short term (that should include operations dealing with data), and the
way you think these parameters will develop in the mid to long term. Then
compare the price of a physical cluster that meets those demands (make sure
to include power and operations) with the investment you would otherwise need
to make in the cloud.
Re: Could not obtain block
Posted by madhu phatak <ph...@gmail.com>.
Maybe some datanode in the cluster is down. Check the datanode logs on the nodes
in the cluster.
On Thu, Jan 20, 2011 at 3:43 PM, Cavus,M.,Fa. Post Direkt <
M.Cavus@postdirekt.de> wrote:
> Hi,
> I process the wordcount example on my hadoop cluster and get a Could not
> obtain block Exception. Did any one know what is the problem? If I start
> this program in my local than processed it good.
>
> Regards
> Musa Cavus
>
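One way to act on that suggestion is to ask the NameNode which datanodes are live and where the blocks of the affected file are. The sketch below only assembles the two commands; `hadoop dfsadmin -report` and `hadoop fsck` are standard commands in the 0.20 line, while the helper name is made up and the launcher path will differ per install. The file path is the one from the exception reported in this thread.

```python
def hdfs_health_commands(hdfs_path, hadoop_bin="hadoop"):
    """Return argv lists for a datanode report and a block-level fsck of hdfs_path."""
    return [
        [hadoop_bin, "dfsadmin", "-report"],
        [hadoop_bin, "fsck", hdfs_path, "-files", "-blocks", "-locations"],
    ]

# File path taken from the exception reported in this thread.
for argv in hdfs_health_commands("/user/root/point/start-all.sh"):
    print(" ".join(argv))
```

The report should show how many datanodes the NameNode considers live or dead, and the fsck output lists each block of the file with the datanodes expected to hold a replica.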
Could not obtain block
Posted by "Cavus,M.,Fa. Post Direkt" <M....@postdirekt.de>.
Hi,
I am running the wordcount example on my Hadoop cluster and get a "Could not
obtain block" exception. Does anyone know what the problem is? If I run
this program locally, it works fine.
This is what I do:
[root@master bin]# ./hadoop jar ../hadoop-0.20.2-examples.jar wordcount point/start-all.sh s/start-all.sh
11/01/20 11:57:56 INFO input.FileInputFormat: Total input paths to process : 1
11/01/20 11:57:57 INFO mapred.JobClient: Running job: job_201101201036_0002
11/01/20 11:57:58 INFO mapred.JobClient: map 0% reduce 0%
11/01/20 11:58:16 INFO mapred.JobClient: Task Id : attempt_201101201036_0002_m_000000_0, Status : FAILED
java.io.IOException: Could not obtain block: blk_7716960257524845873_1708 file=/user/root/point/start-all.sh
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
11/01/20 11:58:33 INFO mapred.JobClient: Task Id : attempt_201101201036_0002_m_000000_1, Status : FAILED
java.io.IOException: Could not obtain block: blk_7716960257524845873_1708 file=/user/root/point/start-all.sh
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
11/01/20 11:58:48 INFO mapred.JobClient: Task Id : attempt_201101201036_0002_m_000000_2, Status : FAILED
java.io.IOException: Could not obtain block: blk_7716960257524845873_1708 file=/user/root/point/start-all.sh
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
11/01/20 11:59:06 INFO mapred.JobClient: Job complete: job_201101201036_0002
11/01/20 11:59:06 INFO mapred.JobClient: Counters: 2
11/01/20 11:59:06 INFO mapred.JobClient: Job Counters
11/01/20 11:59:06 INFO mapred.JobClient: Launched map tasks=4
11/01/20 11:59:06 INFO mapred.JobClient: Failed map tasks=1
Regards
Musa Cavus
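When the same exception repeats across task attempts, it can help to reduce the client output to the distinct blocks and files involved. A small sketch that does this for output shaped like the log in this message; the regular expression is fitted to these lines only (it tolerates a line break between the message and the block id), and the function name is made up.

```python
import re

# Matches "Could not obtain block: blk_<id>_<genstamp> file=<path>",
# allowing any whitespace (including a line break) after the colon.
PATTERN = re.compile(r"Could not obtain block:\s+(blk_-?\d+_\d+)\s+file=(\S+)")

def failed_blocks(log_text):
    """Return a dict mapping (block, file) to the number of times it was reported."""
    counts = {}
    for block, path in PATTERN.findall(log_text):
        counts[(block, path)] = counts.get((block, path), 0) + 1
    return counts

sample = """java.io.IOException: Could not obtain block:
blk_7716960257524845873_1708 file=/user/root/point/start-all.sh
java.io.IOException: Could not obtain block:
blk_7716960257524845873_1708 file=/user/root/point/start-all.sh
"""
for (block, path), n in failed_blocks(sample).items():
    print(f"{block} in {path}: reported {n} time(s)")
```

If one block fails on every attempt, as it does here, that suggests looking first at the datanodes that should hold that block rather than at the job itself.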
Re: Why Hadoop is slow in Cloud
Posted by Steve Loughran <st...@apache.org>.
On 18/01/11 13:59, Adarsh Sharma wrote:
> Marc Farnum Rendino wrote:
>> Virtualization != Emulation
>>
>> Yes, virtualization does have its own costs (as does running directly
>> on hardware) - depending on the specifics of both the virtualization
>> *and* the task at hand.
> Absolutely right, and that is why I did the initial testing.
>
> I want to know *AT WHAT COST* it comes.
> 10-15% is tolerable, but at this rate it needs some work.
Even with the performance hit, there are still benefits to running Hadoop this way:
-as you only consume/pay for the CPU time you use, if you are only running
batch jobs, it's lower cost than having a Hadoop cluster that is under-used.
-if your data is stored in the cloud infrastructure, then you need to
data mine it in VMs, unless you want to take the time and money hit of
moving it out, and have somewhere to store it.
-if the infrastructure lets you, you can lock down the cluster so it is
secure.
Where a physical cluster is good is that it is a very low cost way of
storing data, provided you can analyse it with Hadoop, and provided you
can keep that cluster busy most of the time, either with Hadoop work or
other scheduled work. If your cluster is idle for computation, you are
still paying the capital and (reduced) electricity costs, so the cost of
storage and what compute you do effectively increases.
-steve
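The trade-off described in this message can be put into rough numbers. In the sketch below every figure is a made-up placeholder, not a measurement or a real price: an owned node costs the same per hour whether busy or idle, so its cost per useful hour rises as utilization falls, while a rented VM is billed only while used but takes longer to do the same work.

```python
def owned_cost_per_useful_hour(hourly_cost, utilization):
    """Owned hardware: fixed hourly cost spread over the fraction of time it is busy."""
    return hourly_cost / utilization

def rented_cost_per_useful_hour(hourly_price, slowdown):
    """Rented VM: pay only while running, but a job takes `slowdown` times longer."""
    return hourly_price * slowdown

def break_even_utilization(owned_hourly_cost, rented_hourly_price, slowdown):
    """Utilization below which renting is cheaper than owning."""
    return owned_hourly_cost / (rented_hourly_price * slowdown)

# Hypothetical numbers: $0.20/h owned node, $0.40/h VM, jobs run 2x slower on the VM.
print(break_even_utilization(0.20, 0.40, 2.0))  # prints 0.25
```

With these placeholder prices, the owned cluster only wins if it is kept busy more than a quarter of the time; real power, operations and storage costs would shift that point.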
Re: Why Hadoop is slow in Cloud
Posted by Adarsh Sharma <ad...@orkash.com>.
Marc Farnum Rendino wrote:
> Virtualization != Emulation
>
> Yes, virtualization does have its own costs (as does running directly
> on hardware) - depending on the specifics of both the virtualization
> *and* the task at hand.
>
Absolutely right, and that is why I did the initial testing.
I want to know *AT WHAT COST* it comes.
10-15% is tolerable, but at this rate it needs some work.
As Steve rightly suggests, I am doing some CPU-bound testing to get
the exact stats.
I will let you know after the work.
> If my task (in the general sense) is CPU bound, it doesn't matter (to
> me) if the virtualization has a disk I/O penalty.
>
But is it possible to tune the workflow of the VMs
to gain some performance, or not?
> If on the other hand, my task is limited by a disk I/O penalty, I'll
> weigh that into the *total* cost/benefit, and virtualization may not -
> or may still - be an advantageous choice.
>
>
Some reasons for the slowness would be very helpful. Any guidance is appreciated.
> Context is king.
>
>
Thanks & best Regards
Adarsh Sharma
> On Mon, Jan 17, 2011 at 10:41 AM, Edward Capriolo <ed...@gmail.com> wrote:
>
>> Everything you emulate you cut X% performance right off the top...
>>
Re: Why Hadoop is slow in Cloud
Posted by Marc Farnum Rendino <mv...@gmail.com>.
Virtualization != Emulation
Yes, virtualization does have its own costs (as does running directly
on hardware) - depending on the specifics of both the virtualization
*and* the task at hand.
If my task (in the general sense) is CPU bound, it doesn't matter (to
me) if the virtualization has a disk I/O penalty.
If on the other hand, my task is limited by a disk I/O penalty, I'll
weigh that into the *total* cost/benefit, and virtualization may not -
or may still - be an advantageous choice.
Context is king.
On Mon, Jan 17, 2011 at 10:41 AM, Edward Capriolo <ed...@gmail.com> wrote:
> Everything you emulate you cut X% performance right off the top...
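A toy model of the CPU-bound versus I/O-bound point: split a job's bare-metal run time into a CPU share and a disk I/O share and apply a separate virtualization penalty to each. The penalty factors and job mixes below are invented for illustration, not measurements.

```python
def virtualized_runtime(cpu_fraction, cpu_penalty, io_penalty):
    """Relative run time on a VM (1.0 = bare metal) for a job whose bare-metal
    time is cpu_fraction CPU work and the rest disk I/O."""
    io_fraction = 1.0 - cpu_fraction
    return cpu_fraction * cpu_penalty + io_fraction * io_penalty

# Assumed penalties: CPU work 5% slower, disk I/O 3x slower.
for name, cpu_fraction in (("CPU-bound job", 0.95), ("I/O-bound job", 0.20)):
    print(f"{name}: {virtualized_runtime(cpu_fraction, 1.05, 3.0):.2f}x bare-metal time")
```

The same disk penalty barely registers for the first job and dominates the second, which is why the answer depends on the workload.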
Re: Why Hadoop is slow in Cloud
Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Jan 17, 2011 at 6:08 AM, Steve Loughran <st...@apache.org> wrote:
> On 17/01/11 04:11, Adarsh Sharma wrote:
>>
>> Dear all,
>>
>> Yesterday I ran a comparison test between *Hadoop on Standalone
>> Servers* and *Hadoop in the Cloud*.
>>
>> I set up a Hadoop cluster of 4 nodes (standalone machines) in
>> which one node acts as master (NameNode, JobTracker) and the remaining
>> nodes act as slaves (DataNodes, TaskTrackers).
>> For testing Hadoop in the *Cloud* (Eucalyptus), on the other hand, I made
>> one standalone machine the *Hadoop Master* and configured the slaves
>> on VMs in the cloud.
>>
>> I am confused by the stats obtained from the testing. What I
>> concluded is that the VMs give half the performance of the
>> standalone servers.
>
> Interesting stats, nothing that massively surprises me, especially as your
> benchmarks are very much streaming through datasets. If you were doing
> something more CPU intensive (graph work, for example), things wouldn't look
> so bad
>
> I've done stuff in this area.
> http://www.slideshare.net/steve_l/farming-hadoop-inthecloud
>
>
>
>>
>> I expected some slowdown, but never at this level. Is
>> this genuine, or might there be some configuration problem?
>>
>> I am using a 1 Gb (10-1000 Mb/s) LAN for the VMs and 100 Mb/s for the
>> standalone servers.
>>
>> Please have a look at the results and comment on them if interested.
>>
>
>
> The big killer here is File IO, with today's HDD controllers and virtual
> filesystems, disk IO is way underpowered compared to physical disk IO.
> Networking is reduced (but improving), and CPU can be pretty good, but disk
> is bad.
>
>
> Why?
>
> 1. Every access to a block in the VM is turned into virtual disk controller
> operations which are then interpreted by the VDC and turned into
> reads/writes in the virtual disk drive
>
> 2. which is turned into seeks, reads and writes in the physical hardware.
>
> Some workarounds
>
> -allocate physical disks for the HDFS filesystem, for the duration of the
> VMs.
>
> -have the local hosts serve up a bit of their filesystem on a fast protocol
> (like NFS), and have every VM mount the local physical NFS filestore as
> their hadoop data dirs.
>
>
Q: "Why is my Nintendo emulator slow on an 800 MHz computer made 10
years after Nintendo?"
A: Emulation
Everything you emulate, you cut X% performance right off the top.
Emulation is great when you want to run Mac on Windows or FreeBSD on
Linux or Nintendo on Linux. However, most people would do better with
technologies that use kernel-level isolation such as Linux containers,
Solaris Zones, Linux VServer (my favorite) http://linux-vserver.org/,
User Mode Linux or similar technologies that ISOLATE rather than
EMULATE.
Sorry, list, I feel I rant about this bi-annually. I have just always
been so shocked about how many people get lured into cloud and
virtualized solutions for "better management" and "near native
performance".
Re: Why Hadoop is slow in Cloud
Posted by Steve Loughran <st...@apache.org>.
On 17/01/11 04:11, Adarsh Sharma wrote:
> Dear all,
>
> Yesterday I ran a comparison test between *Hadoop on Standalone
> Servers* and *Hadoop in the Cloud*.
>
> I set up a Hadoop cluster of 4 nodes (standalone machines) in
> which one node acts as master (NameNode, JobTracker) and the remaining
> nodes act as slaves (DataNodes, TaskTrackers).
> For testing Hadoop in the *Cloud* (Eucalyptus), on the other hand, I made
> one standalone machine the *Hadoop Master* and configured the slaves
> on VMs in the cloud.
>
> I am confused by the stats obtained from the testing. What I
> concluded is that the VMs give half the performance of the
> standalone servers.
Interesting stats, nothing that massively surprises me, especially as
your benchmarks are very much streaming through datasets. If you were
doing something more CPU intensive (graph work, for example), things
wouldn't look so bad.
I've done stuff in this area.
http://www.slideshare.net/steve_l/farming-hadoop-inthecloud
>
> I expected some slowdown, but never at this level. Is
> this genuine, or might there be some configuration problem?
>
> I am using a 1 Gb (10-1000 Mb/s) LAN for the VMs and 100 Mb/s for the
> standalone servers.
>
> Please have a look at the results and comment on them if interested.
>
The big killer here is file IO: with today's HDD controllers and virtual
filesystems, disk IO in a VM is way underpowered compared to physical disk IO.
Networking is reduced (but improving), and CPU can be pretty good, but
disk is bad.
Why?
1. Every access to a block in the VM is turned into virtual disk
controller operations, which are then interpreted by the VDC and turned
into reads/writes in the virtual disk drive,
2. which are turned into seeks, reads and writes in the physical hardware.
Some workarounds:
-allocate physical disks for the HDFS filesystem, for the duration of
the VMs.
-have the local hosts serve up a bit of their filesystem on a fast
protocol (like NFS), and have every VM mount the local physical NFS
filestore as their Hadoop data dirs.
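With either workaround, the datanode then has to be pointed at the mounted storage. A minimal sketch that writes such a configuration fragment, assuming the Hadoop 0.20 property name `dfs.data.dir` and purely hypothetical mount points (`/mnt/disk1/hdfs/data`, `/mnt/disk2/hdfs/data`); both need adjusting to the actual install.

```python
import xml.etree.ElementTree as ET

def hdfs_site_xml(data_dirs):
    """Build an hdfs-site.xml document that sets dfs.data.dir to the given directories."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "dfs.data.dir"
    # The property takes a comma-separated list of local directories.
    ET.SubElement(prop, "value").text = ",".join(data_dirs)
    return ET.tostring(configuration, encoding="unicode")

# Hypothetical mount points for the physical disks or the NFS export.
print(hdfs_site_xml(["/mnt/disk1/hdfs/data", "/mnt/disk2/hdfs/data"]))
```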