Posted to common-user@hadoop.apache.org by Vadim Zaliva <kr...@gmail.com> on 2009/03/11 18:15:29 UTC

tuning performance

Hi!

I have a question about fine-tuning Hadoop performance on 8-core machines.
I have 2 machines I am testing. One is 8-core Xeon and another is 8-core
Opteron. 16Gb RAM each. They both run mapreduce and dfs nodes. Currently
I've set up each of them to run 32 map and 8 reduce tasks.
Also, HADOOP_HEAPSIZE=2048.

I see the CPU is underutilized. Is there a guideline for how I can find the optimal
number of tasks and memory settings for this kind of hardware?

Also, since we are going to buy more machines like this, I need to decide
whether to buy Xeons or Opterons. Any advice on that?

Sincerely,
Vadim

P.S. I am using Hadoop 0.19 and java version "1.6.0_12":
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)

Re: tuning performance

Posted by jason hadoop <ja...@gmail.com>.
For a simple test, set the replication on your entire cluster to 6:

    hadoop dfs -setrep -R -w 6 /

This will triple your disk usage and probably take a while, but then you are
guaranteed that all data is local.

You can also get a rough idea of how many map tasks were data-local from the
'Data-local map tasks' total field in the Job Counters.



-setrep [-R] [-w] <rep> <path/file>:  Set the replication level of a file.
        The -R flag requests a recursive change of replication level
        for an entire tree.
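
Concretely, a rough sketch of that test from the command line (flags as in the
0.19 dfs/fsck tools; "/" here just means the whole cluster):

    # Raise replication to 6 everywhere and wait for it to complete;
    # with the current factor of 2 this roughly triples disk usage.
    hadoop dfs -setrep -R -w 6 /

    # Spot-check that blocks really ended up with 6 replicas.
    hadoop fsck / -files -blocks -locations | head

    # After a test job, compare 'Data-local map tasks' against the total
    # launched map tasks in the job counters (JobTracker web UI).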


On Thu, Mar 12, 2009 at 7:13 PM, Vadim Zaliva <kr...@gmail.com> wrote:

> The machines have 4 disks each, striped.
> However, I do not see the disks being a bottleneck. Monitoring system activity
> shows that the CPU is utilized 2-70% and disk usage is moderate, while network
> activity seems to be quite high. In this particular cluster we have 6 machines
> and the replication factor is 2. I was wondering if increasing the replication
> factor would help, so there is a better chance that a data block is available
> locally.
>
> Sincerely,
> Vadim
>
>
> On Thu, Mar 12, 2009 at 13:27, Aaron Kimball <aa...@cloudera.com> wrote:
> > Xeon vs. Opteron is likely not going to be a major factor. More important
> > than this is the number of disks you have per machine. Task performance is
> > proportional to both the number of CPUs and the number of disks.
> >
> > You are probably using way too many tasks. Adding more tasks/node isn't
> > necessarily going to increase utilization if they're waiting on data from
> > the disks; you'll just increase the IO pressure and probably make things
> > seek more. You've configured up to 40 simultaneous tasks/node, to run on 8
> > cores. I'd lower that to maybe 8 maps and 6 reduces. You might get better
> > usage since they don't compete for IO as much. The correct value for you
> > might lie somewhere in between, too, depending on your workload.
> >
> > How many HDDs does your machine have? You should stripe dfs.data.dir and
> > mapred.local.dir across all of them. Having a 2:1 core:disk (and thus a 2:1
> > map task:disk) ratio is probably a good starting point, so for eight cores,
> > you should have at least 4 disks.
> >
> > - Aaron
> >
> > On Wed, Mar 11, 2009 at 10:15 AM, Vadim Zaliva <kr...@gmail.com>
> wrote:
> >
> >> Hi!
> >>
> >> I have a question about fine-tuning Hadoop performance on 8-core
> >> machines.
> >> I have 2 machines I am testing. One is 8-core Xeon and another is 8-core
> >> Opteron. 16Gb RAM each. They both run mapreduce and dfs nodes. Currently
> >> I've set up each of them to run 32 map and 8 reduce tasks.
> >> Also, HADOOP_HEAPSIZE=2048.
> >>
> >> I see the CPU is underutilized. Is there a guideline for how I can find the
> >> optimal number of tasks and memory settings for this kind of hardware?
> >>
> >> Also, since we are going to buy more machines like this, I need to decide
> >> whether to buy Xeons or Opterons. Any advice on that?
> >>
> >> Sincerely,
> >> Vadim
> >>
> >> P.S. I am using Hadoop 0.19 and java version "1.6.0_12":
> >> Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
> >> Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
> >>
> >
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: tuning performance

Posted by Scott Carey <sc...@richrelevance.com>.
Yes, I am referring to HDFS taking multiple mount points and automatically round-robining block allocation across them.
A single file block will only exist on a single disk, but the extra speed you can get with RAID-0 within a block can't be used effectively by most mappers or reducers anyway.  Perhaps an identity mapper can read faster than a single disk - but certainly not if the content is compressed.

RAID-0 may be more useful for local temp space.

In effect, you can say that HDFS data nodes already do RAID-0, but with a very large block size, and where failure of a disk reduces the redundancy minimally and temporarily.
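
To make it concrete, a quick sketch of the datanode side (the mount points below
are placeholders for your own layout):

    # conf/hadoop-site.xml on each datanode -- one directory per physical disk;
    # the datanode round-robins new blocks across the listed directories:
    #   dfs.data.dir = /mnt/d0/dfs/data,/mnt/d1/dfs/data,/mnt/d2/dfs/data,/mnt/d3/dfs/data
    # Quick way to see what a node is currently configured with:
    grep -A 1 dfs.data.dir conf/hadoop-site.xml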

For reference, today's Intel / AMD CPUs can usually decompress a gzip stream at less than 30MB/sec of compressed input (50MB to 100MB of uncompressed data output per second).


On 3/14/09 1:53 AM, "Vadim Zaliva" <kr...@gmail.com> wrote:

Scott,

Thanks for the interesting information. By JBOD, I assume you mean just listing
multiple partition mount points in hadoop config?

Vadim

On Fri, Mar 13, 2009 at 12:48, Scott Carey <sc...@richrelevance.com> wrote:
> On 3/13/09 11:56 AM, "Allen Wittenauer" <aw...@yahoo-inc.com> wrote:
>
> On 3/13/09 11:25 AM, "Vadim Zaliva" <kr...@gmail.com> wrote:
>
>>>    When you stripe you automatically make every disk in the system have the
>>> same speed as the slowest disk.  In our experience, systems are more likely
>>> to have a 'slow' disk than a dead one.... and detecting that is really
>>> really hard.  In a distributed system, that multiplier effect can have
>>> significant consequences on the whole grid's performance.
>>
>> All disks are the same, so there is no speed difference.
>
>    There will be when they start to fail. :)
>
>
>
> This has been discussed before:
> http://www.nabble.com/RAID-vs.-JBOD-td21404366.html
>
> JBOD is going to be better; the only benefit of RAID-0 is slightly easier management in the hadoop config, but it is harder to manage at the OS level.
> When a single JBOD drive dies, you only lose that set of data.  The datanode goes down but a restart brings back up the parts that still exist.  Then you can leave it be while the replacement is procured... With RAID-0 the whole node is down until you get the new drive and recreate the RAID.
>
> With JBOD, don't forget to set the Linux readahead for the drives to a decent level (you'll gain up to 25% more sequential read throughput depending on your kernel version): blockdev --setra 8192 /dev/<device>.  I also see good gains by using xfs instead of ext3.  For a big shocker, check out the difference in time to delete a bunch of large files with ext3 (long time) versus xfs (almost instant).
>
> For the newer drives, they can do about 120MB/sec at the front of the drive when tuned (xfs, readahead >4096) and the back of the drive is 60MB/sec.  If you are not going to use 100% of the drive for HDFS, use this knowledge and place the partitions appropriately.  The last 20% or so of the drive is a lot slower than the front 60%.  Here is a typical sequential transfer rate chart for a SATA drive as a function of LBA:
> http://www.tomshardware.com/reviews/Seagate-Barracuda-1.5-TB,2032-5.html
> (graphs are about 3/4 of the way down the page before the comments).
>


Re: tuning performance

Posted by Vadim Zaliva <kr...@gmail.com>.
Scott,

Thanks for the interesting information. By JBOD, I assume you mean just listing
multiple partition mount points in hadoop config?

Vadim

On Fri, Mar 13, 2009 at 12:48, Scott Carey <sc...@richrelevance.com> wrote:
> On 3/13/09 11:56 AM, "Allen Wittenauer" <aw...@yahoo-inc.com> wrote:
>
> On 3/13/09 11:25 AM, "Vadim Zaliva" <kr...@gmail.com> wrote:
>
>>>    When you stripe you automatically make every disk in the system have the
>>> same speed as the slowest disk.  In our experience, systems are more likely
>>> to have a 'slow' disk than a dead one.... and detecting that is really
>>> really hard.  In a distributed system, that multiplier effect can have
>>> significant consequences on the whole grid's performance.
>>
>> All disks are the same, so there is no speed difference.
>
>    There will be when they start to fail. :)
>
>
>
> This has been discussed before:
> http://www.nabble.com/RAID-vs.-JBOD-td21404366.html
>
> JBOD is going to be better; the only benefit of RAID-0 is slightly easier management in the hadoop config, but it is harder to manage at the OS level.
> When a single JBOD drive dies, you only lose that set of data.  The datanode goes down but a restart brings back up the parts that still exist.  Then you can leave it be while the replacement is procured... With RAID-0 the whole node is down until you get the new drive and recreate the RAID.
>
> With JBOD, don't forget to set the Linux readahead for the drives to a decent level (you'll gain up to 25% more sequential read throughput depending on your kernel version): blockdev --setra 8192 /dev/<device>.  I also see good gains by using xfs instead of ext3.  For a big shocker, check out the difference in time to delete a bunch of large files with ext3 (long time) versus xfs (almost instant).
>
> For the newer drives, they can do about 120MB/sec at the front of the drive when tuned (xfs, readahead >4096) and the back of the drive is 60MB/sec.  If you are not going to use 100% of the drive for HDFS, use this knowledge and place the partitions appropriately.  The last 20% or so of the drive is a lot slower than the front 60%.  Here is a typical sequential transfer rate chart for a SATA drive as a function of LBA:
> http://www.tomshardware.com/reviews/Seagate-Barracuda-1.5-TB,2032-5.html
> (graphs are about 3/4 of the way down the page before the comments).
>

Re: tuning performance

Posted by Scott Carey <sc...@richrelevance.com>.
On 3/13/09 11:56 AM, "Allen Wittenauer" <aw...@yahoo-inc.com> wrote:

On 3/13/09 11:25 AM, "Vadim Zaliva" <kr...@gmail.com> wrote:

>>    When you stripe you automatically make every disk in the system have the
>> same speed as the slowest disk.  In our experience, systems are more likely
>> to have a 'slow' disk than a dead one.... and detecting that is really
>> really hard.  In a distributed system, that multiplier effect can have
>> significant consequences on the whole grid's performance.
>
> All disks are the same, so there is no speed difference.

    There will be when they start to fail. :)



This has been discussed before:
http://www.nabble.com/RAID-vs.-JBOD-td21404366.html

JBOD is going to be better; the only benefit of RAID-0 is slightly easier management in the hadoop config, but it is harder to manage at the OS level.
When a single JBOD drive dies, you only lose that set of data.  The datanode goes down but a restart brings back up the parts that still exist.  Then you can leave it be while the replacement is procured... With RAID-0 the whole node is down until you get the new drive and recreate the RAID.

With JBOD, don't forget to set the Linux readahead for the drives to a decent level (you'll gain up to 25% more sequential read throughput depending on your kernel version): blockdev --setra 8192 /dev/<device>.  I also see good gains by using xfs instead of ext3.  For a big shocker, check out the difference in time to delete a bunch of large files with ext3 (long time) versus xfs (almost instant).
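
In shell terms, roughly (device name, partition and mount point below are
placeholders for your own setup):

    # Bump readahead to 8192 sectors (~4MB) on each data disk.
    blockdev --setra 8192 /dev/sdb
    blockdev --getra /dev/sdb        # verify it took effect

    # Format as xfs instead of ext3 and mount it as one of the JBOD data dirs.
    mkfs.xfs /dev/sdb1
    mkdir -p /mnt/d1
    mount /dev/sdb1 /mnt/d1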

For the newer drives, they can do about 120MB/sec at the front of the drive when tuned (xfs, readahead >4096) and the back of the drive is 60MB/sec.  If you are not going to use 100% of the drive for HDFS, use this knowledge and place the partitions appropriately.  The last 20% or so of the drive is a lot slower than the front 60%.  Here is a typical sequential transfer rate chart for a SATA drive as a function of LBA:
http://www.tomshardware.com/reviews/Seagate-Barracuda-1.5-TB,2032-5.html
(graphs are about 3/4 of the way down the page before the comments).

Re: tuning performance

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.


On 3/13/09 11:25 AM, "Vadim Zaliva" <kr...@gmail.com> wrote:

>>    When you stripe you automatically make every disk in the system have the
>> same speed as the slowest disk.  In our experience, systems are more likely
>> to have a 'slow' disk than a dead one.... and detecting that is really
>> really hard.  In a distributed system, that multiplier effect can have
>> significant consequences on the whole grid's performance.
> 
> All disks are the same, so there is no speed difference.

    There will be when they start to fail. :)



Re: tuning performance

Posted by Vadim Zaliva <kr...@gmail.com>.
>    When you stripe you automatically make every disk in the system have the
> same speed as the slowest disk.  In our experience, systems are more likely
> to have a 'slow' disk than a dead one.... and detecting that is really
> really hard.  In a distributed system, that multiplier effect can have
> significant consequences on the whole grid's performance.

All disks are the same, so there is no speed difference. Striping can speed up
disk reads (4x more heads reading at the same time).

>    If you actually have all 40 task slots going, the system is likely
> spending quite a bit of time paging...

Interestingly, the swap usage is 0. It looks like my tasks are not
utilizing all of the allotted heap.

>    We generally use cores/2-1 for map and reduce slots.  This leaves some
> cores and memory for the OS to use for monitoring, data node, etc. So 8
> cores=3 maps, 3 reduces per node.  Going above that has been done for
> extremely lightweight processes, but you're more likely headed for heartache
> if you aren't careful.

I will try to reduce the number of tasks per your recommendation to see if
it makes any significant difference.

Thank you for all the suggestions.

Vadim

Re: tuning performance

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.


On 3/12/09 7:13 PM, "Vadim Zaliva" <kr...@gmail.com> wrote:

> The machines have 4 disks each, striped.
> However, I do not see the disks being a bottleneck.

    When you stripe you automatically make every disk in the system have the
same speed as the slowest disk.  In our experience, systems are more likely
to have a 'slow' disk than a dead one.... and detecting that is really
really hard.  In a distributed system, that multiplier effect can have
significant consequences on the whole grid's performance.

    Also:

>>> 16Gb RAM each. They both run mapreduce and dfs nodes. Currently
>>> I've set up each of them to run 32 map and 8 reduce tasks.
>>> Also, HADOOP_HEAPSIZE=2048.

    IIRC, HEAPSIZE is in MB, sooo...

        32*2048=65536
        8*2048=16384

        81920MB of VM *just in heap* ...

>>> I see the CPU is underutilized. Is there a guideline for how I can find the
>>> optimal number of tasks and memory settings for this kind of hardware?

    If you actually have all 40 task slots going, the system is likely
spending quite a bit of time paging...

    We generally use cores/2-1 for map and reduce slots.  This leaves some
cores and memory for the OS to use for monitoring, data node, etc. So 8
cores=3 maps, 3 reduces per node.  Going above that has been done for
extremely lightweight processes, but you're more likely headed for heartache
if you aren't careful.
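
    As a rough starting point for an 8-core / 16GB box like yours (the numbers
below are illustrative, not a recommendation for every workload):

    # conf/hadoop-site.xml -- slot counts per the cores/2-1 rule of thumb:
    #   mapred.tasktracker.map.tasks.maximum    = 3
    #   mapred.tasktracker.reduce.tasks.maximum = 3
    # Per-task JVM heap comes from mapred.child.java.opts, e.g.:
    #   mapred.child.java.opts = -Xmx1024m
    # conf/hadoop-env.sh -- HADOOP_HEAPSIZE (in MB) sizes the Hadoop daemons:
    export HADOOP_HEAPSIZE=1024
    # Worst case for task heap is then (3 + 3) * 1024 = 6144MB, leaving most of
    # the 16GB for the datanode, tasktracker and the OS page cache.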


Re: tuning performance

Posted by Vadim Zaliva <kr...@gmail.com>.
The machines have 4 disks each, striped.
However, I do not see the disks being a bottleneck. Monitoring system activity
shows that the CPU is utilized 2-70% and disk usage is moderate, while network
activity seems to be quite high. In this particular cluster we have 6 machines
and the replication factor is 2. I was wondering if increasing the replication
factor would help, so there is a better chance that a data block is available
locally.

Sincerely,
Vadim


On Thu, Mar 12, 2009 at 13:27, Aaron Kimball <aa...@cloudera.com> wrote:
> Xeon vs. Opteron is likely not going to be a major factor. More important
> than this is the number of disks you have per machine. Task performance is
> proportional to both the number of CPUs and the number of disks.
>
> You are probably using way too many tasks. Adding more tasks/node isn't
> necessarily going to increase utilization if they're waiting on data from
> the disks; you'll just increase the IO pressure and probably make things
> seek more. You've configured up to 40 simultaneous tasks/node, to run on 8
> cores. I'd lower that to maybe 8 maps and 6 reduces. You might get better
> usage since they don't compete for IO as much. The correct value for you
> might lie somewhere in between, too, depending on your workload.
>
> How many HDDs does your machine have? You should stripe dfs.data.dir and
> mapred.local.dir across all of them. Having a 2:1 core:disk (and thus a 2:1
> map task:disk) ratio is probably a good starting point, so for eight cores,
> you should have at least 4 disks.
>
> - Aaron
>
> On Wed, Mar 11, 2009 at 10:15 AM, Vadim Zaliva <kr...@gmail.com> wrote:
>
>> Hi!
>>
>> I have a question about fine-tuning Hadoop performance on 8-core
>> machines.
>> I have 2 machines I am testing. One is 8-core Xeon and another is 8-core
>> Opteron. 16Gb RAM each. They both run mapreduce and dfs nodes. Currently
>> I've set up each of them to run 32 map and 8 reduce tasks.
>> Also, HADOOP_HEAPSIZE=2048.
>>
>> I see the CPU is underutilized. Is there a guideline for how I can find the
>> optimal number of tasks and memory settings for this kind of hardware?
>>
>> Also, since we are going to buy more machines like this, I need to decide
>> whether to buy Xeons or Opterons. Any advice on that?
>>
>> Sincerely,
>> Vadim
>>
>> P.S. I am using Hadoop 0.19 and java version "1.6.0_12":
>> Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
>> Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
>>
>

Re: tuning performance

Posted by Aaron Kimball <aa...@cloudera.com>.
Xeon vs. Opteron is likely not going to be a major factor. More important
than this is the number of disks you have per machine. Task performance is
proportional to both the number of CPUs and the number of disks.

You are probably using way too many tasks. Adding more tasks/node isn't
necessarily going to increase utilization if they're waiting on data from
the disks; you'll just increase the IO pressure and probably make things
seek more. You've configured up to 40 simultaneous tasks/node, to run on 8
cores. I'd lower that to maybe 8 maps and 6 reduces. You might get better
usage since they don't compete for IO as much. The correct value for you
might lie somewhere in between, too, depending on your workload.

How many HDDs does your machine have? You should stripe dfs.data.dir and
mapred.local.dir across all of them. Having a 2:1 core:disk (and thus a 2:1
map task:disk) ratio is probably a good starting point, so for eight cores,
you should have at least 4 disks.
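
For example, with four disks mounted at /mnt/d0 through /mnt/d3 (hypothetical
paths), each property gets a comma-separated list with one entry per disk:

    # Create the per-disk directories, then point conf/hadoop-site.xml at them.
    mkdir -p /mnt/d{0,1,2,3}/dfs/data /mnt/d{0,1,2,3}/mapred/local
    #   dfs.data.dir     = /mnt/d0/dfs/data,/mnt/d1/dfs/data,/mnt/d2/dfs/data,/mnt/d3/dfs/data
    #   mapred.local.dir = /mnt/d0/mapred/local,/mnt/d1/mapred/local,/mnt/d2/mapred/local,/mnt/d3/mapred/local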

- Aaron

On Wed, Mar 11, 2009 at 10:15 AM, Vadim Zaliva <kr...@gmail.com> wrote:

> Hi!
>
> I have a question about fine-tuning Hadoop performance on 8-core
> machines.
> I have 2 machines I am testing. One is 8-core Xeon and another is 8-core
> Opteron. 16Gb RAM each. They both run mapreduce and dfs nodes. Currently
> I've set up each of them to run 32 map and 8 reduce tasks.
> Also, HADOOP_HEAPSIZE=2048.
>
> I see the CPU is underutilized. Is there a guideline for how I can find the
> optimal number of tasks and memory settings for this kind of hardware?
>
> Also, since we are going to buy more machines like this, I need to decide
> whether to buy Xeons or Opterons. Any advice on that?
>
> Sincerely,
> Vadim
>
> P.S. I am using Hadoop 0.19 and java version "1.6.0_12":
> Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
>