Posted to hdfs-user@hadoop.apache.org by Patai Sangbutsarakum <si...@gmail.com> on 2012/10/12 19:46:25 UTC

Spindle per Cores

I have been reading around about hardware recommendations for a Hadoop cluster.
One of them recommends a 1:1 ratio of spindles to cores.

Intel CPUs come with Hyper-Threading, which doubles the number of cores on
one physical CPU, e.g. 8 cores with Hyper-Threading becomes 16, which is
where we start when calculating the number of task slots per node.

When it comes to spindles, I strongly believe I should count the 8 physical
cores and pick 8 disks in order to get the 1:1 ratio.

Please suggest
Patai
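
A minimal sketch of the arithmetic being asked about, with hypothetical
numbers (the replies below suggest counting the 1:1 rule against physical
cores rather than Hyper-Threaded ones):

    # Hypothetical node: dual quad-core sockets with Hyper-Threading enabled.
    physical_cores = 8
    hw_threads = physical_cores * 2      # what the OS reports with HT on

    # The 1:1 spindle-per-core rule, counted against physical cores.
    disks = physical_cores

    # A common MRv1-era starting point for slots: roughly one per hardware
    # thread, minus a little headroom for the DataNode/TaskTracker daemons.
    task_slots = hw_threads - 2

    print(f"disks={disks}, task slots={task_slots}, "
          f"spindles per physical core={disks / physical_cores:.1f}")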

Re: Spindle per Cores

Posted by Ted Dunning <td...@maprtech.com>.
I think that this rule of thumb is to prevent people from configuring 2-disk
clusters with 16 cores or 48-disk machines with 4 cores.  Both
configurations could make sense in narrow applications, but both would most
probably be sub-optimal.

Within narrow bands, I doubt you will see huge changes.  I like to be able to:

a) saturate disk I/O, which requires some CPU and a good controller.
Different distros vary a lot here.

b) have enough memory per slot.  Lots of people go cheap on this and they
wind up hamstringing performance

c) make sure there is enough CPU left over for the application.  This is
hugely app dependent, obviously.
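
For point (b) above, a rough back-of-the-envelope check; all of the numbers
here are hypothetical and should be replaced with your own node's figures:

    # Memory available to each configured task slot on a worker node.
    node_ram_gb = 48          # total RAM in the node
    reserved_gb = 8           # OS, DataNode, TaskTracker, and buffer cache
    task_slots = 14           # configured map + reduce slots

    ram_per_slot_gb = (node_ram_gb - reserved_gb) / task_slots
    print(f"{ram_per_slot_gb:.1f} GB per slot")

    # If this comes out below the heap given to each task JVM (e.g. -Xmx2g),
    # the node is oversubscribed on memory before spindles even matter.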

On Fri, Oct 12, 2012 at 1:45 PM, Hank Cohen <ha...@altior.com> wrote:

> What empirical evidence is there for this rule of thumb?
> In other words, what tests or metrics would indicate an optimal
> spindle/core ratio and how dependent is this on the nature of the data and
> of the map/reduce computation?
>
> My understanding is that there are lots of clusters with more spindles
> than cores.  Specifically, typical 2U servers can hold 12 3.5" disk drives.
>  So lots of Hadoop clusters have dual 4 core processors and 12 spindles.
>  Would it be better to have 6 core processors if you are loading up the
> boxes with 12 disks?  And most importantly, how would one know that the mix
> was optimal?
>
> Hank Cohen
> Altior Inc.
>
> -----Original Message-----
> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
> Sent: Friday, October 12, 2012 10:46 AM
> To: user@hadoop.apache.org
> Subject: Spindle per Cores
>
> I have been reading around about hardware recommendations for a Hadoop cluster.
> One of them recommends a 1:1 ratio of spindles to cores.
>
> Intel CPUs come with Hyper-Threading, which doubles the number of cores on one
> physical CPU, e.g. 8 cores with Hyper-Threading becomes 16, which is where we
> start when calculating the number of task slots per node.
>
> When it comes to spindles, I strongly believe I should count the 8 physical
> cores and pick 8 disks in order to get the 1:1 ratio.
>
> Please suggest
> Patai
>
>
>

Re: Spindle per Cores

Posted by Patai Sangbutsarakum <Pa...@turn.com>.
Thanks everyone.
I think I got the idea. I would start to think in reverse: look at the budget we are willing to pay for the cluster, then get
the best performance we can while keeping the CPU:memory:spindle ratio.


From: ranjith raghunath <ra...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Fri, 12 Oct 2012 22:27:50 -0500
To: <us...@hadoop.apache.org>
Subject: Re: Spindle per Cores


Thanks Michael.

On Oct 12, 2012 9:59 PM, "Michael Segel" <mi...@hotmail.com> wrote:
I think what we are seeing is the ratio based on physical Xeon cores.
So hyper threading wouldn't make any change to  the actual ratio.
(1 disk per physical core, would be 1 disk per 2 virtual cores.)

Again YMMV and of course thanks to this guy Moore who decided to write some weird laws... the ratio could change over time as the CPUs become more efficient and faster.


On Oct 12, 2012, at 9:52 PM, ranjith raghunath <ra...@gmail.com> wrote:


Does hyperthreading affect this ratio?

On Oct 12, 2012 9:36 PM, "Michael Segel" <mi...@hotmail.com> wrote:
First, the obvious caveat... YMMV

Having said that.

The key here is to take a look across the various jobs that you will run. Some may be more CPU intensive, others more I/O intensive.

If you monitor these jobs via Ganglia, when you have too few spindles you should see the wait cpu rise on the machines in the cluster.  That is to say that you are putting an extra load on the systems because you're waiting for the disks to catch up.

If you increase the ratio of disks to CPU, you should see that load drop as you are not wasting CPU cycles.

Note that it's not just the number of spindles; the bus and the controller cards can also affect the throughput of disk I/O.

Now just IMHO, there was a discussion on some of the CPU recommendations. To a point, it doesn't matter that much. You want to maximize the bang for the buck you can get with your hardware purchase.

Use the ratio as a buying guide. Fewer than a ratio of 1 disk per core, and you're wasting the cpu that you bought.

Going higher than a ratio of 1, like 1.5, and you may be buying too many spindles and not see a performance gain that offsets your cost.

Search for a happy medium and don't sweat the maximum performance that you may get.

HTH
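
As a minimal stand-in for the Ganglia check described above, a small Python
sketch that samples CPU wait time on a single node. It assumes a Linux worker
with the psutil package installed (psutil is not something the thread
mentions; Ganglia or plain vmstat/iostat would give the same information):

    import psutil

    # Sample CPU time percentages ten times, one second apart.
    for _ in range(10):
        cpu = psutil.cpu_times_percent(interval=1)
        iowait = getattr(cpu, "iowait", 0.0)   # field only present on Linux
        print(f"user={cpu.user:5.1f}%  system={cpu.system:5.1f}%  "
              f"iowait={iowait:5.1f}%")

    # A persistently high iowait while jobs are running suggests too few
    # spindles (or a saturated controller/bus) for the CPU in the box.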

On Oct 12, 2012, at 4:19 PM, Jeffrey Buell <jb...@vmware.com> wrote:

> I've done some experiments along these lines.  I'm using high-performance
> 15K RPM SAS drives instead of the more usual SATA drives, which should
> reduce the number of drives I need.  I have dual 4-core processors at 3.6
> GHz.  These are more powerful than the average 4-core processor, which
> should increase the number of drives I need.  Assuming these 2 effects
> cancel, then my results should also apply to machines with SATA drives and
> average processors.  Using 8 drives (1-1) gets good performance for teragen
> and terasort.  Going to 12 drives (1.5 per core) increases terasort
> performance by 15%.  That might not seem like much compared to increasing
> the number of drives by 50%, but a better comparison is that 4 extra drives
> increased the cost of each machine by only about 12%, so the extra drives
> are (barely) worth it.  If you're more time sensitive than cost sensitive,
> then they're definitely worth it.  The extra drives did not help teragen,
> apparently because both CPU and the internal storage controller were close
> to saturation.  So, of course everything depends on the app.  You're
> shooting for saturated CPUs and disk bandwidth.  Check that the CPU is not
> saturated (after checking Hadoop tuning and optimizing the number of
> tasks).  Check that you have enough memory for more tasks with room
> leftover for a large buffer cache.  Use 10 GbE networking or make sure the
> network has enough headroom.  Check the storage controller can handle more
> bandwidth.  If all are true (that is, no other bottlenecks), consider
> adding more drives.
>
> Jeff
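
A quick worked version of the trade-off Jeff describes above, using the
numbers from his message (going from 8 to 12 drives: roughly +15% terasort
throughput for about +12% per-machine cost); the comparison itself is just
arithmetic:

    # Relative performance per dollar, normalised to the 8-drive baseline.
    baseline_cost, baseline_perf = 1.00, 1.00
    upgraded_cost, upgraded_perf = 1.12, 1.15    # 12-drive configuration

    print(f" 8 drives: {baseline_perf / baseline_cost:.3f} perf per cost unit")
    print(f"12 drives: {upgraded_perf / upgraded_cost:.3f} perf per cost unit")

    # ~1.027 vs 1.000: marginally better per dollar, matching the
    # "barely worth it" conclusion, unless wall-clock time matters more
    # than hardware cost.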
>
>> -----Original Message-----
>> From: Hank Cohen [mailto:hank.cohen@altior.com]
>> Sent: Friday, October 12, 2012 1:46 PM
>> To: user@hadoop.apache.org
>> Subject: RE: Spindle per Cores
>>
>> What empirical evidence is there for this rule of thumb?
>> In other words, what tests or metrics would indicate an optimal
>> spindle/core ratio and how dependent is this on the nature of the data
>> and of the map/reduce computation?
>>
>> My understanding is that there are lots of clusters with more spindles
>> than cores.  Specifically, typical 2U servers can hold 12 3.5" disk
>> drives.  So lots of Hadoop clusters have dual 4 core processors and 12
>> spindles.  Would it be better to have 6 core processors if you are
>> loading up the boxes with 12 disks?  And most importantly, how would
>> one know that the mix was optimal?
>>
>> Hank Cohen
>> Altior Inc.
>>
>> -----Original Message-----
>> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
>> Sent: Friday, October 12, 2012 10:46 AM
>> To: user@hadoop.apache.org
>> Subject: Spindle per Cores
>>
>> I have been reading around about hardware recommendations for a Hadoop
>> cluster.
>> One of them recommends a 1:1 ratio of spindles to cores.
>>
>> Intel CPUs come with Hyper-Threading, which doubles the number of cores on
>> one physical CPU, e.g. 8 cores with Hyper-Threading becomes 16, which is
>> where we start when calculating the number of task slots per node.
>>
>> When it comes to spindles, I strongly believe I should count the 8 physical
>> cores and pick 8 disks in order to get the 1:1 ratio.
>>
>> Please suggest
>> Patai
>>
>
>



Re: Spindle per Cores

Posted by ranjith raghunath <ra...@gmail.com>.
Thanks Michael.
On Oct 12, 2012 9:59 PM, "Michael Segel" <mi...@hotmail.com> wrote:

> I think what we are seeing is the ratio based on physical Xeon cores.
> So hyper threading wouldn't make any change to  the actual ratio.
> (1 disk per physical core, would be 1 disk per 2 virtual cores.)
>
> Again YMMV and of course thanks to this guy Moore who decided to write
> some weird laws... the ratio could change over time as the CPUs become more
> efficient and faster.
>
>
> On Oct 12, 2012, at 9:52 PM, ranjith raghunath <
> ranjith.raghunath1@gmail.com> wrote:
>
> Does hyperthreading affect this ratio?
> On Oct 12, 2012 9:36 PM, "Michael Segel" <mi...@hotmail.com>
> wrote:
>
>> First, the obvious caveat... YMMV
>>
>> Having said that.
>>
>> The key here is to take a look across the various jobs that you will run.
>> Some may be more CPU intensive, others more I/O intensive.
>>
>> If you monitor these jobs via Ganglia, when you have too few spindles you
>> should see the wait cpu rise on the machines in the cluster.  That is to
>> say that you are putting an extra load on the systems because you're
>> waiting for the disks to catch up.
>>
>> If you increase the ratio of disks to CPU, you should see that load drop
>> as you are not wasting CPU cycles.
>>
>> Note that it's not just the number of spindles; the bus and the
>> controller cards can also affect the throughput of disk I/O.
>>
>> Now just IMHO, there was a discussion on some of the CPU recommendations.
>> To a point, it doesn't matter that much. You want to maximize the bang for
>> the buck you can get with your hardware purchase.
>>
>> Use the ratio as a buying guide. Fewer than a ratio of 1 disk per core,
>> and you're wasting the cpu that you bought.
>>
>> Going higher than a ratio of 1, like 1.5, and you may be buying too many
>> spindles and not see a performance gain that offsets your cost.
>>
>> Search for a happy medium and don't sweat the maximum performance that
>> you may get.
>>
>> HTH
>>
>> On Oct 12, 2012, at 4:19 PM, Jeffrey Buell <jb...@vmware.com> wrote:
>>
>> > I've done some experiments along these lines.  I'm using
>> high-performance 15K RPM SAS drives instead of the more usual SATA drives,
>> which should reduce the number of drives I need.  I have dual 4-core
>> processors at 3.6 GHz.  These are more powerful than the average 4-core
>> processor, which should increase the number of drives I need.  Assuming
>> these 2 effects cancel, then my results should also apply to machines with
>> SATA drives and average processors.  Using 8 drives (1-1) gets good
>> performance for teragen and terasort.  Going to 12 drives (1.5 per core)
>> increases terasort performance by 15%.  That might not seem like much
>> compared to increasing the number of drives by 50%, but a better comparison
>> is that 4 extra drives increased the cost of each machine by only about
>> 12%, so the extra drives are (barely) worth it. If you're more time
>> sensitive than cost sensitive, then they're definitely worth it.  The extra
>> drives did not help teragen, apparently because both CPU and the internal
>> storage controller were close to saturation. So, of course everything
>> depends on the app.  You're shooting for saturated CPUs and disk bandwidth.
>>  Check that the CPU is not saturated (after checking Hadoop tuning and
>> optimizing the number of tasks). Check that you have enough memory for more
>> tasks with room leftover for a large buffer cache.  Use 10 GbE networking
>> or make sure the network has enough headroom.  Check the storage controller
>> can handle more bandwidth.  If all are true (that is, no other
>> bottlenecks), consider adding more drives.
>> >
>> > Jeff
>> >
>> >> -----Original Message-----
>> >> From: Hank Cohen [mailto:hank.cohen@altior.com]
>> >> Sent: Friday, October 12, 2012 1:46 PM
>> >> To: user@hadoop.apache.org
>> >> Subject: RE: Spindle per Cores
>> >>
>> >> What empirical evidence is there for this rule of thumb?
>> >> In other words, what tests or metrics would indicate an optimal
>> >> spindle/core ratio and how dependent is this on the nature of the data
>> >> and of the map/reduce computation?
>> >>
>> >> My understanding is that there are lots of clusters with more spindles
>> >> than cores.  Specifically, typical 2U servers can hold 12 3.5" disk
>> >> drives.  So lots of Hadoop clusters have dual 4 core processors and 12
>> >> spindles.  Would it be better to have 6 core processors if you are
>> >> loading up the boxes with 12 disks?  And most importantly, how would
>> >> one know that the mix was optimal?
>> >>
>> >> Hank Cohen
>> >> Altior Inc.
>> >>
>> >> -----Original Message-----
>> >> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
>> >> Sent: Friday, October 12, 2012 10:46 AM
>> >> To: user@hadoop.apache.org
>> >> Subject: Spindle per Cores
>> >>
>> >> I have been reading around about hardware recommendations for a Hadoop
>> >> cluster.
>> >> One of them recommends a 1:1 ratio of spindles to cores.
>> >>
>> >> Intel CPUs come with Hyper-Threading, which doubles the number of cores on
>> >> one physical CPU, e.g. 8 cores with Hyper-Threading becomes 16, which is
>> >> where we start when calculating the number of task slots per node.
>> >>
>> >> When it comes to spindles, I strongly believe I should count the 8 physical
>> >> cores and pick 8 disks in order to get the 1:1 ratio.
>> >>
>> >> Please suggest
>> >> Patai
>> >>
>> >
>> >
>>
>>
>

Re: Spindle per Cores

Posted by Michael Segel <mi...@hotmail.com>.
I think what we are seeing is the ratio based on physical Xeon cores. 
So hyper threading wouldn't make any change to  the actual ratio.
(1 disk per physical core, would be 1 disk per 2 virtual cores.) 

Again YMMV and of course thanks to this guy Moore who decided to write some weird laws... the ratio could change over time as the CPUs become more efficient and faster. 


On Oct 12, 2012, at 9:52 PM, ranjith raghunath <ra...@gmail.com> wrote:

> Does hyperthreading affect this ratio?
> 
> On Oct 12, 2012 9:36 PM, "Michael Segel" <mi...@hotmail.com> wrote:
> First, the obvious caveat... YMMV
> 
> Having said that.
> 
> The key here is to take a look across the various jobs that you will run. Some may be more CPU intensive, others more I/O intensive.
> 
> If you monitor these jobs via Ganglia, when you have too few spindles you should see the wait cpu rise on the machines in the cluster.  That is to say that you are putting an extra load on the systems because you're waiting for the disks to catch up.
> 
> If you increase the ratio of disks to CPU, you should see that load drop as you are not wasting CPU cycles.
> 
> Note that it's not just the number of spindles; the bus and the controller cards can also affect the throughput of disk I/O.
> 
> Now just IMHO, there was a discussion on some of the CPU recommendations. To a point, it doesn't matter that much. You want to maximize the bang for the buck you can get with your hardware purchase.
> 
> Use the ratio as a buying guide. Fewer than a ratio of 1 disk per core, and you're wasting the cpu that you bought.
> 
> Going higher than a ratio of 1, like 1.5, and you may be buying too many spindles and not see a performance gain that offsets your cost.
> 
> Search for a happy medium and don't sweat the maximum performance that you may get.
> 
> HTH
> 
> On Oct 12, 2012, at 4:19 PM, Jeffrey Buell <jb...@vmware.com> wrote:
> 
> > I've done some experiments along these lines.  I'm using high-performance 15K RPM SAS drives instead of the more usual SATA drives, which should reduce the number of drives I need.  I have dual 4-core processors at 3.6 GHz.  These are more powerful than the average 4-core processor, which should increase the number of drives I need.  Assuming these 2 effects cancel, then my results should also apply to machines with SATA drives and average processors.  Using 8 drives (1-1) gets good performance for teragen and terasort.  Going to 12 drives (1.5 per core) increases terasort performance by 15%.  That might not seem like much compared to increasing the number of drives by 50%, but a better comparison is that 4 extra drives increased the cost of each machine by only about 12%, so the extra drives are (barely) worth it. If you're more time sensitive than cost sensitive, then they're definitely worth it.  The extra drives did not help teragen, apparently because both CPU and the internal storage controller were close to saturation. So, of course everything depends on the app.  You're shooting for saturated CPUs and disk bandwidth.  Check that the CPU is not saturated (after checking Hadoop tuning and optimizing the number of tasks). Check that you have enough memory for more tasks with room leftover for a large buffer cache.  Use 10 GbE networking or make sure the network has enough headroom.  Check the storage controller can handle more bandwidth.  If all are true (that is, no other bottlenecks), consider adding more drives.
> >
> > Jeff
> >
> >> -----Original Message-----
> >> From: Hank Cohen [mailto:hank.cohen@altior.com]
> >> Sent: Friday, October 12, 2012 1:46 PM
> >> To: user@hadoop.apache.org
> >> Subject: RE: Spindle per Cores
> >>
> >> What empirical evidence is there for this rule of thumb?
> >> In other words, what tests or metrics would indicate an optimal
> >> spindle/core ratio and how dependent is this on the nature of the data
> >> and of the map/reduce computation?
> >>
> >> My understanding is that there are lots of clusters with more spindles
> >> than cores.  Specifically, typical 2U servers can hold 12 3.5" disk
> >> drives.  So lots of Hadoop clusters have dual 4 core processors and 12
> >> spindles.  Would it be better to have 6 core processors if you are
> >> loading up the boxes with 12 disks?  And most importantly, how would
> >> one know that the mix was optimal?
> >>
> >> Hank Cohen
> >> Altior Inc.
> >>
> >> -----Original Message-----
> >> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
> >> Sent: Friday, October 12, 2012 10:46 AM
> >> To: user@hadoop.apache.org
> >> Subject: Spindle per Cores
> >>
> >> I have been reading around about hardware recommendations for a Hadoop
> >> cluster.
> >> One of them recommends a 1:1 ratio of spindles to cores.
> >>
> >> Intel CPUs come with Hyper-Threading, which doubles the number of cores on
> >> one physical CPU, e.g. 8 cores with Hyper-Threading becomes 16, which is
> >> where we start when calculating the number of task slots per node.
> >>
> >> When it comes to spindles, I strongly believe I should count the 8 physical
> >> cores and pick 8 disks in order to get the 1:1 ratio.
> >>
> >> Please suggest
> >> Patai
> >>
> >
> >
> 


Re: Spindle per Cores

Posted by Michael Segel <mi...@hotmail.com>.
I think what we are seeing is the ratio based on physical Xeon cores. 
So hyper threading wouldn't make any change to  the actual ratio.
(1 disk per physical core, would be 1 disk per 2 virtual cores.) 

Again YMMV and of course thanks to this guy Moore who decided to write some weird laws... the ratio could change over time as the CPUs become more efficient and faster. 


On Oct 12, 2012, at 9:52 PM, ranjith raghunath <ra...@gmail.com> wrote:

> Does hypertheading affect this ratio?
> 
> On Oct 12, 2012 9:36 PM, "Michael Segel" <mi...@hotmail.com> wrote:
> First, the obvious caveat... YMMV
> 
> Having said that.
> 
> The key here is to take a look across the various jobs that you will run. Some may be more CPU intensive, others more I/O intensive.
> 
> If you monitor these jobs via Ganglia, when you have too few spindles you should see the wait cpu rise on the machines in the cluster.  That is to say that you are putting an extra load on the systems because you're waiting for the disks to catch up.
> 
> If you increase the ratio of disks to CPU, you should see that load drop as you are not wasting CPU cycles.
> 
> Note that its not just the number of spindles, but also the bus and the controller cards that can also affect the throughput of disk I/O.
> 
> Now just IMHO, there was a discussion on some of the CPU recommendations. To a point, it doesn't matter that much. You want to maximize the bang for the buck you can get w your hardware purchase.
> 
> Use the ratio as a buying guide. Fewer than a ratio of 1 disk per core, and you're wasting the cpu that you bought.
> 
> Going higher than a ratio of 1, like 1.5, and you may be buying too many spindles and not see a performance gain that offsets your cost.
> 
> Search for a happy medium and don't sweat the maximum performance that you may get.
> 
> HTH
> 
> On Oct 12, 2012, at 4:19 PM, Jeffrey Buell <jb...@vmware.com> wrote:
> 
> > I've done some experiments along these lines.  I'm using high-performance 15K RPM SAS drives instead of the more usual SATA drives, which should reduce the number of drives I need.  I have dual 4-core processors at 3.6 GHz.  These are more powerful than the average 4-core processor, which should increase the number of drives I need.  Assuming these 2 effects cancel, then my results should also apply to machines with SATA drives and average processors.  Using 8 drives (1-1) gets good performance for teragen and terasort.  Going to 12 drives (1.5 per core) increases terasort performance by 15%.  That might not seem like much compared to increasing the number of drives by 50%, but a better comparison is that 4 extra drives increased the cost of each machine by only about 12%, so the extra drives are (barely) worth it. If you're more time sensitive than cost sensitive, they they're definitely worth it.  The extra drives did not help teragen, apparently because both CPU and the internal storage controller were close to saturation. So, of course everything depends on the app.  You're shooting for saturated CPUs and disk bandwidth.  Check that the CPU is not saturated (after checking Hadoop tuning and optimizing the number of tasks). Check that you have enough memory for more tasks with room leftover for a large buffer cache.  Use 10 GbE networking or make sure the network has enough headroom.  Check the storage controller can handle more bandwidth.  If all are true (that is, no other bottlenecks), consider adding more drives.
> >
> > Jeff
> >
> >> -----Original Message-----
> >> From: Hank Cohen [mailto:hank.cohen@altior.com]
> >> Sent: Friday, October 12, 2012 1:46 PM
> >> To: user@hadoop.apache.org
> >> Subject: RE: Spindle per Cores
> >>
> >> What empirical evidence is there for this rule of thumb?
> >> In other words, what tests or metrics would indicate an optimal
> >> spindle/core ratio and how dependent is this on the nature of the data
> >> and of the map/reduce computation?
> >>
> >> My understanding is that there are lots of clusters with more spindles
> >> than cores.  Specifically, typical 2U servers can hold 12 3.5" disk
> >> drives.  So lots of Hadoop clusters have dual 4 core processors and 12
> >> spindles.  Would it be better to have 6 core processors if you are
> >> loading up the boxes with 12 disks?  And most importantly, how would
> >> one know that the mix was optimal?
> >>
> >> Hank Cohen
> >> Altior Inc.
> >>
> >> -----Original Message-----
> >> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
> >> Sent: Friday, October 12, 2012 10:46 AM
> >> To: user@hadoop.apache.org
> >> Subject: Spindle per Cores
> >>
> >> I have read around about the hardware recommendation for hadoop
> >> cluster.
> >> One of them is recommend 1:1 ratio between spindle per core.
> >>
> >> Intel CPU come with Hyperthread which will double the number cores on
> >> one physical CPU. eg. 8 cores with Hyperthread it because 16 which is
> >> where we start to calculate about number of task slots per node.
> >>
> >> Once it come to spindle, i strongly believe I should pick 8 cores and
> >> picks 8 disks in order to get 1:1 ratio.
> >>
> >> Please suggest
> >> Patai
> >>
> >
> >
> 


Re: Spindle per Cores

Posted by ranjith raghunath <ra...@gmail.com>.
Does hyperthreading affect this ratio?
On Oct 12, 2012 9:36 PM, "Michael Segel" <mi...@hotmail.com> wrote:

> First, the obvious caveat... YMMV
>
> Having said that.
>
> The key here is to take a look across the various jobs that you will run.
> Some may be more CPU intensive, others more I/O intensive.
>
> If you monitor these jobs via Ganglia, when you have too few spindles you
> should see the wait cpu rise on the machines in the cluster.  That is to
> say that you are putting an extra load on the systems because you're
> waiting for the disks to catch up.
>
> If you increase the ratio of disks to CPU, you should see that load drop
> as you are not wasting CPU cycles.
>
> Note that its not just the number of spindles, but also the bus and the
> controller cards that can also affect the throughput of disk I/O.
>
> Now just IMHO, there was a discussion on some of the CPU recommendations.
> To a point, it doesn't matter that much. You want to maximize the bang for
> the buck you can get w your hardware purchase.
>
> Use the ratio as a buying guide. Fewer than a ratio of 1 disk per core,
> and you're wasting the cpu that you bought.
>
> Going higher than a ratio of 1, like 1.5, and you may be buying too many
> spindles and not see a performance gain that offsets your cost.
>
> Search for a happy medium and don't sweat the maximum performance that you
> may get.
>
> HTH
>
> On Oct 12, 2012, at 4:19 PM, Jeffrey Buell <jb...@vmware.com> wrote:
>
> > I've done some experiments along these lines.  I'm using
> high-performance 15K RPM SAS drives instead of the more usual SATA drives,
> which should reduce the number of drives I need.  I have dual 4-core
> processors at 3.6 GHz.  These are more powerful than the average 4-core
> processor, which should increase the number of drives I need.  Assuming
> these 2 effects cancel, then my results should also apply to machines with
> SATA drives and average processors.  Using 8 drives (1-1) gets good
> performance for teragen and terasort.  Going to 12 drives (1.5 per core)
> increases terasort performance by 15%.  That might not seem like much
> compared to increasing the number of drives by 50%, but a better comparison
> is that 4 extra drives increased the cost of each machine by only about
> 12%, so the extra drives are (barely) worth it. If you're more time
> sensitive than cost sensitive, they they're definitely worth it.  The extra
> drives did not help teragen, apparently because both CPU and the internal
> storage controller were close to saturation. So, of course everything
> depends on the app.  You're shooting for saturated CPUs and disk bandwidth.
>  Check that the CPU is not saturated (after checking Hadoop tuning and
> optimizing the number of tasks). Check that you have enough memory for more
> tasks with room leftover for a large buffer cache.  Use 10 GbE networking
> or make sure the network has enough headroom.  Check the storage controller
> can handle more bandwidth.  If all are true (that is, no other
> bottlenecks), consider adding more drives.
> >
> > Jeff
> >
> >> -----Original Message-----
> >> From: Hank Cohen [mailto:hank.cohen@altior.com]
> >> Sent: Friday, October 12, 2012 1:46 PM
> >> To: user@hadoop.apache.org
> >> Subject: RE: Spindle per Cores
> >>
> >> What empirical evidence is there for this rule of thumb?
> >> In other words, what tests or metrics would indicate an optimal
> >> spindle/core ratio and how dependent is this on the nature of the data
> >> and of the map/reduce computation?
> >>
> >> My understanding is that there are lots of clusters with more spindles
> >> than cores.  Specifically, typical 2U servers can hold 12 3.5" disk
> >> drives.  So lots of Hadoop clusters have dual 4 core processors and 12
> >> spindles.  Would it be better to have 6 core processors if you are
> >> loading up the boxes with 12 disks?  And most importantly, how would
> >> one know that the mix was optimal?
> >>
> >> Hank Cohen
> >> Altior Inc.
> >>
> >> -----Original Message-----
> >> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
> >> Sent: Friday, October 12, 2012 10:46 AM
> >> To: user@hadoop.apache.org
> >> Subject: Spindle per Cores
> >>
> >> I have read around about the hardware recommendation for hadoop
> >> cluster.
> >> One of them is recommend 1:1 ratio between spindle per core.
> >>
> >> Intel CPU come with Hyperthread which will double the number cores on
> >> one physical CPU. eg. 8 cores with Hyperthread it because 16 which is
> >> where we start to calculate about number of task slots per node.
> >>
> >> Once it come to spindle, i strongly believe I should pick 8 cores and
> >> picks 8 disks in order to get 1:1 ratio.
> >>
> >> Please suggest
> >> Patai
> >>
> >
> >
>
>


Re: Spindle per Cores

Posted by Michael Segel <mi...@hotmail.com>.
First, the obvious caveat... YMMV

Having said that.

The key here is to take a look across the various jobs that you will run. Some may be more CPU intensive, others more I/O intensive.  

If you monitor these jobs via Ganglia, when you have too few spindles you should see the wait CPU (iowait) rise on the machines in the cluster. That is to say, you are putting extra load on the systems because you're waiting for the disks to catch up. 

If you increase the ratio of disks to CPU, you should see that load drop as you are not wasting CPU cycles. 
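If Ganglia isn't handy, a rough stand-in for the same signal on a single Linux node is to sample iowait from /proc/stat; this is only a sketch of the idea, not a replacement for cluster-wide monitoring:

    import time

    def cpu_times():
        # First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:]]

    def iowait_percent(interval=5):
        before = cpu_times()
        time.sleep(interval)
        after = cpu_times()
        deltas = [b - a for a, b in zip(before, after)]
        total = sum(deltas)
        return 100.0 * deltas[4] / total if total else 0.0  # field 4 = iowait

    print("iowait over the last 5s: %.1f%%" % iowait_percent())
    # Persistently high iowait while tasks are running suggests the spindles,
    # bus, or controller are the bottleneck rather than the CPUs.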

Note that it's not just the number of spindles; the bus and the controller cards can also affect the throughput of disk I/O. 

Now, just IMHO, there was a discussion on some of the CPU recommendations. To a point, it doesn't matter that much. You want to maximize the bang for the buck you can get with your hardware purchase. 

Use the ratio as a buying guide. Fewer than a ratio of 1 disk per core, and you're wasting the CPU that you bought. 

Going higher than a ratio of 1, say 1.5, and you may be buying too many spindles without seeing a performance gain that offsets the cost. 

Search for a happy medium and don't sweat the maximum performance that you may get. 

HTH

On Oct 12, 2012, at 4:19 PM, Jeffrey Buell <jb...@vmware.com> wrote:

> I've done some experiments along these lines.  I'm using high-performance 15K RPM SAS drives instead of the more usual SATA drives, which should reduce the number of drives I need.  I have dual 4-core processors at 3.6 GHz.  These are more powerful than the average 4-core processor, which should increase the number of drives I need.  Assuming these 2 effects cancel, then my results should also apply to machines with SATA drives and average processors.  Using 8 drives (1-1) gets good performance for teragen and terasort.  Going to 12 drives (1.5 per core) increases terasort performance by 15%.  That might not seem like much compared to increasing the number of drives by 50%, but a better comparison is that 4 extra drives increased the cost of each machine by only about 12%, so the extra drives are (barely) worth it. If you're more time sensitive than cost sensitive, they they're definitely worth it.  The extra drives did not help teragen, apparently because both CPU and the internal storage controller were close to saturation. So, of course everything depends on the app.  You're shooting for saturated CPUs and disk bandwidth.  Check that the CPU is not saturated (after checking Hadoop tuning and optimizing the number of tasks). Check that you have enough memory for more tasks with room leftover for a large buffer cache.  Use 10 GbE networking or make sure the network has enough headroom.  Check the storage controller can handle more bandwidth.  If all are true (that is, no other bottlenecks), consider adding more drives.
> 
> Jeff
> 
>> -----Original Message-----
>> From: Hank Cohen [mailto:hank.cohen@altior.com]
>> Sent: Friday, October 12, 2012 1:46 PM
>> To: user@hadoop.apache.org
>> Subject: RE: Spindle per Cores
>> 
>> What empirical evidence is there for this rule of thumb?
>> In other words, what tests or metrics would indicate an optimal
>> spindle/core ratio and how dependent is this on the nature of the data
>> and of the map/reduce computation?
>> 
>> My understanding is that there are lots of clusters with more spindles
>> than cores.  Specifically, typical 2U servers can hold 12 3.5" disk
>> drives.  So lots of Hadoop clusters have dual 4 core processors and 12
>> spindles.  Would it be better to have 6 core processors if you are
>> loading up the boxes with 12 disks?  And most importantly, how would
>> one know that the mix was optimal?
>> 
>> Hank Cohen
>> Altior Inc.
>> 
>> -----Original Message-----
>> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
>> Sent: Friday, October 12, 2012 10:46 AM
>> To: user@hadoop.apache.org
>> Subject: Spindle per Cores
>> 
>> I have read around about the hardware recommendation for hadoop
>> cluster.
>> One of them is recommend 1:1 ratio between spindle per core.
>> 
>> Intel CPU come with Hyperthread which will double the number cores on
>> one physical CPU. eg. 8 cores with Hyperthread it because 16 which is
>> where we start to calculate about number of task slots per node.
>> 
>> Once it come to spindle, i strongly believe I should pick 8 cores and
>> picks 8 disks in order to get 1:1 ratio.
>> 
>> Please suggest
>> Patai
>> 
> 
> 


RE: Spindle per Cores

Posted by Jeffrey Buell <jb...@vmware.com>.
I've done some experiments along these lines. I'm using high-performance 15K RPM SAS drives instead of the more usual SATA drives, which should reduce the number of drives I need. I have dual 4-core processors at 3.6 GHz. These are more powerful than the average 4-core processor, which should increase the number of drives I need. Assuming these 2 effects cancel, then my results should also apply to machines with SATA drives and average processors. Using 8 drives (1-1) gets good performance for teragen and terasort. Going to 12 drives (1.5 per core) increases terasort performance by 15%. That might not seem like much compared to increasing the number of drives by 50%, but a better comparison is that 4 extra drives increased the cost of each machine by only about 12%, so the extra drives are (barely) worth it. If you're more time sensitive than cost sensitive, then they're definitely worth it. The extra drives did not help teragen, apparently because both CPU and the internal storage controller were close to saturation. So, of course, everything depends on the app. You're shooting for saturated CPUs and disk bandwidth. Check that the CPU is not saturated (after checking Hadoop tuning and optimizing the number of tasks). Check that you have enough memory for more tasks with room left over for a large buffer cache. Use 10 GbE networking or make sure the network has enough headroom. Check that the storage controller can handle more bandwidth. If all are true (that is, no other bottlenecks), consider adding more drives.
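A quick back-of-the-envelope check of those numbers (the base machine price below is hypothetical; only the ratios from the tests above matter):

    # 8 -> 12 drives: ~15% faster terasort for ~12% more per machine.
    base_cost = 5000.0                # hypothetical cost of the 8-drive node ($)
    upgraded_cost = base_cost * 1.12  # 4 extra drives add ~12% to the price
    perf_gain = 1.15                  # terasort throughput improvement

    perf_per_dollar = perf_gain / (upgraded_cost / base_cost)
    print("perf per dollar change: %+.1f%%" % ((perf_per_dollar - 1.0) * 100))
    # ~ +2.7%: marginal on cost alone, clearly worth it when job latency
    # matters more than hardware spend.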

Jeff

> -----Original Message-----
> From: Hank Cohen [mailto:hank.cohen@altior.com]
> Sent: Friday, October 12, 2012 1:46 PM
> To: user@hadoop.apache.org
> Subject: RE: Spindle per Cores
> 
> What empirical evidence is there for this rule of thumb?
> In other words, what tests or metrics would indicate an optimal
> spindle/core ratio and how dependent is this on the nature of the data
> and of the map/reduce computation?
> 
> My understanding is that there are lots of clusters with more spindles
> than cores.  Specifically, typical 2U servers can hold 12 3.5" disk
> drives.  So lots of Hadoop clusters have dual 4 core processors and 12
> spindles.  Would it be better to have 6 core processors if you are
> loading up the boxes with 12 disks?  And most importantly, how would
> one know that the mix was optimal?
> 
> Hank Cohen
> Altior Inc.
> 
> -----Original Message-----
> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
> Sent: Friday, October 12, 2012 10:46 AM
> To: user@hadoop.apache.org
> Subject: Spindle per Cores
> 
> I have read around about the hardware recommendation for hadoop
> cluster.
> One of them is recommend 1:1 ratio between spindle per core.
> 
> Intel CPU come with Hyperthread which will double the number cores on
> one physical CPU. eg. 8 cores with Hyperthread it because 16 which is
> where we start to calculate about number of task slots per node.
> 
> Once it come to spindle, i strongly believe I should pick 8 cores and
> picks 8 disks in order to get 1:1 ratio.
> 
> Please suggest
> Patai
> 


Re: Spindle per Cores

Posted by Ted Dunning <td...@maprtech.com>.
I think that this rule of thumb is to prevent people configuring 2 disk
clusters with 16 cores or 48 disk machines with 4 cores.  Both
configurations could make sense in narrow applications, but both would most
probably be sub-optimal.

Within narrow bands, I doubt you will see huge changes.  I like to be able
to

a) be able to saturate disk I/O which requires some CPU and a good control.
 Different distros vary a lot here

b) have enough memory per slot.  Lots of people go cheap on this and they
wind up hamstringing performance

c) make sure there is enough CPU left over for the application.  This is
hugely app dependent, obviously.

On Fri, Oct 12, 2012 at 1:45 PM, Hank Cohen <ha...@altior.com> wrote:

> What empirical evidence is there for this rule of thumb?
> In other words, what tests or metrics would indicate an optimal
> spindle/core ratio and how dependent is this on the nature of the data and
> of the map/reduce computation?
>
> My understanding is that there are lots of clusters with more spindles
> than cores.  Specifically, typical 2U servers can hold 12 3.5" disk drives.
>  So lots of Hadoop clusters have dual 4 core processors and 12 spindles.
>  Would it be better to have 6 core processors if you are loading up the
> boxes with 12 disks?  And most importantly, how would one know that the mix
> was optimal?
>
> Hank Cohen
> Altior Inc.
>
> -----Original Message-----
> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
> Sent: Friday, October 12, 2012 10:46 AM
> To: user@hadoop.apache.org
> Subject: Spindle per Cores
>
> I have read around about the hardware recommendation for hadoop cluster.
> One of them is recommend 1:1 ratio between spindle per core.
>
> Intel CPU come with Hyperthread which will double the number cores on one
> physical CPU. eg. 8 cores with Hyperthread it because 16 which is where we
> start to calculate about number of task slots per node.
>
> Once it come to spindle, i strongly believe I should pick 8 cores and
> picks 8 disks in order to get 1:1 ratio.
>
> Please suggest
> Patai
>
>
>

RE: Spindle per Cores

Posted by Jeffrey Buell <jb...@vmware.com>.
I've done some experiments along these lines.  I'm using high-performance 15K RPM SAS drives instead of the more usual SATA drives, which should reduce the number of drives I need.  I have dual 4-core processors at 3.6 GHz.  These are more powerful than the average 4-core processor, which should increase the number of drives I need.  Assuming these 2 effects cancel, then my results should also apply to machines with SATA drives and average processors.  Using 8 drives (1-1) gets good performance for teragen and terasort.  Going to 12 drives (1.5 per core) increases terasort performance by 15%.  That might not seem like much compared to increasing the number of drives by 50%, but a better comparison is that 4 extra drives increased the cost of each machine by only about 12%, so the extra drives are (barely) worth it. If you're more time sensitive than cost sensitive, they they're definitely worth it.  The extra drives did not help teragen, apparently because both CPU and the internal storage controller were close to saturation. So, of course everything depends on the app.  You're shooting for saturated CPUs and disk bandwidth.  Check that the CPU is not saturated (after checking Hadoop tuning and optimizing the number of tasks). Check that you have enough memory for more tasks with room leftover for a large buffer cache.  Use 10 GbE networking or make sure the network has enough headroom.  Check the storage controller can handle more bandwidth.  If all are true (that is, no other bottlenecks), consider adding more drives.

Jeff

> -----Original Message-----
> From: Hank Cohen [mailto:hank.cohen@altior.com]
> Sent: Friday, October 12, 2012 1:46 PM
> To: user@hadoop.apache.org
> Subject: RE: Spindle per Cores
> 
> What empirical evidence is there for this rule of thumb?
> In other words, what tests or metrics would indicate an optimal
> spindle/core ratio and how dependent is this on the nature of the data
> and of the map/reduce computation?
> 
> My understanding is that there are lots of clusters with more spindles
> than cores.  Specifically, typical 2U servers can hold 12 3.5" disk
> drives.  So lots of Hadoop clusters have dual 4 core processors and 12
> spindles.  Would it be better to have 6 core processors if you are
> loading up the boxes with 12 disks?  And most importantly, how would
> one know that the mix was optimal?
> 
> Hank Cohen
> Altior Inc.
> 
> -----Original Message-----
> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
> Sent: Friday, October 12, 2012 10:46 AM
> To: user@hadoop.apache.org
> Subject: Spindle per Cores
> 
> I have read around about the hardware recommendation for hadoop
> cluster.
> One of them is recommend 1:1 ratio between spindle per core.
> 
> Intel CPU come with Hyperthread which will double the number cores on
> one physical CPU. eg. 8 cores with Hyperthread it because 16 which is
> where we start to calculate about number of task slots per node.
> 
> Once it come to spindle, i strongly believe I should pick 8 cores and
> picks 8 disks in order to get 1:1 ratio.
> 
> Please suggest
> Patai
> 


RE: Spindle per Cores

Posted by Jeffrey Buell <jb...@vmware.com>.
I've done some experiments along these lines.  I'm using high-performance 15K RPM SAS drives instead of the more usual SATA drives, which should reduce the number of drives I need.  I have dual 4-core processors at 3.6 GHz.  These are more powerful than the average 4-core processor, which should increase the number of drives I need.  Assuming these 2 effects cancel, then my results should also apply to machines with SATA drives and average processors.  Using 8 drives (1-1) gets good performance for teragen and terasort.  Going to 12 drives (1.5 per core) increases terasort performance by 15%.  That might not seem like much compared to increasing the number of drives by 50%, but a better comparison is that 4 extra drives increased the cost of each machine by only about 12%, so the extra drives are (barely) worth it. If you're more time sensitive than cost sensitive, they they're definitely worth it.  The extra drives did not help teragen, apparently because both CPU and the internal storage controller were close to saturation. So, of course everything depends on the app.  You're shooting for saturated CPUs and disk bandwidth.  Check that the CPU is not saturated (after checking Hadoop tuning and optimizing the number of tasks). Check that you have enough memory for more tasks with room leftover for a large buffer cache.  Use 10 GbE networking or make sure the network has enough headroom.  Check the storage controller can handle more bandwidth.  If all are true (that is, no other bottlenecks), consider adding more drives.

Jeff

> -----Original Message-----
> From: Hank Cohen [mailto:hank.cohen@altior.com]
> Sent: Friday, October 12, 2012 1:46 PM
> To: user@hadoop.apache.org
> Subject: RE: Spindle per Cores
> 
> What empirical evidence is there for this rule of thumb?
> In other words, what tests or metrics would indicate an optimal
> spindle/core ratio and how dependent is this on the nature of the data
> and of the map/reduce computation?
> 
> My understanding is that there are lots of clusters with more spindles
> than cores.  Specifically, typical 2U servers can hold 12 3.5" disk
> drives.  So lots of Hadoop clusters have dual 4 core processors and 12
> spindles.  Would it be better to have 6 core processors if you are
> loading up the boxes with 12 disks?  And most importantly, how would
> one know that the mix was optimal?
> 
> Hank Cohen
> Altior Inc.
> 
> -----Original Message-----
> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
> Sent: Friday, October 12, 2012 10:46 AM
> To: user@hadoop.apache.org
> Subject: Spindle per Cores
> 
> I have read around about the hardware recommendation for hadoop
> cluster.
> One of them is recommend 1:1 ratio between spindle per core.
> 
> Intel CPU come with Hyperthread which will double the number cores on
> one physical CPU. eg. 8 cores with Hyperthread it because 16 which is
> where we start to calculate about number of task slots per node.
> 
> Once it come to spindle, i strongly believe I should pick 8 cores and
> picks 8 disks in order to get 1:1 ratio.
> 
> Please suggest
> Patai
> 


RE: Spindle per Cores

Posted by Hank Cohen <ha...@altior.com>.
What empirical evidence is there for this rule of thumb?
In other words, what tests or metrics would indicate an optimal spindle/core ratio and how dependent is this on the nature of the data and of the map/reduce computation?

My understanding is that there are lots of clusters with more spindles than cores.  Specifically, typical 2U servers can hold 12 3.5" disk drives.  So lots of Hadoop clusters have dual 4 core processors and 12 spindles.  Would it be better to have 6 core processors if you are loading up the boxes with 12 disks?  And most importantly, how would one know that the mix was optimal?

Hank Cohen
Altior Inc.
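
One rough, empirical way to get at this (a Linux-only sketch, not taken from any of the posts here): sample /proc/stat and /proc/diskstats while a representative job such as terasort runs, and watch whether the CPUs or the data disks hit the wall first.  The disk device names below are assumptions; substitute your own data disks.

import time

DISKS = ["sdb", "sdc", "sdd"]   # hypothetical HDFS data disks; edit to match
INTERVAL = 5.0                  # seconds between samples

def cpu_counters():
    # First line of /proc/stat is the aggregate "cpu" row.
    vals = [int(x) for x in open("/proc/stat").readline().split()[1:]]
    return vals[3] + vals[4], sum(vals)    # (idle + iowait, total)

def disk_busy_ms():
    # The 13th field of each /proc/diskstats line is ms spent doing I/O.
    ticks = {}
    for line in open("/proc/diskstats"):
        f = line.split()
        if f[2] in DISKS:
            ticks[f[2]] = int(f[12])
    return ticks

idle0, total0 = cpu_counters()
busy0 = disk_busy_ms()
while True:                      # Ctrl-C to stop
    time.sleep(INTERVAL)
    idle1, total1 = cpu_counters()
    busy1 = disk_busy_ms()
    cpu_pct = 100.0 * (1 - float(idle1 - idle0) / (total1 - total0))
    disk_pct = "  ".join("%s %5.1f%%" % (d, (busy1[d] - busy0[d]) / (INTERVAL * 10.0))
                         for d in DISKS)
    print("cpu %5.1f%%  %s" % (cpu_pct, disk_pct))
    idle0, total0, busy0 = idle1, total1, busy1

If the disks sit near 100% busy while the CPUs still have headroom, more (or faster) spindles should help; if the CPUs saturate first, they won't, no matter how many disks the chassis holds.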

-----Original Message-----
From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com] 
Sent: Friday, October 12, 2012 10:46 AM
To: user@hadoop.apache.org
Subject: Spindle per Cores

I have read around about the hardware recommendation for hadoop cluster.
One of them is recommend 1:1 ratio between spindle per core.

Intel CPU come with Hyperthread which will double the number cores on one physical CPU. eg. 8 cores with Hyperthread it because 16 which is where we start to calculate about number of task slots per node.

Once it come to spindle, i strongly believe I should pick 8 cores and picks 8 disks in order to get 1:1 ratio.

Please suggest
Patai



Re: Spindle per Cores

Posted by Patai Sangbutsarakum <Pa...@turn.com>.
Thanks Ted,

Sorry, I was in too much of a rush when I typed that.  I was asking whether I should use 8 (not counting Hyper-Threading) or 16 (counting the Hyper-Threading threads) when working out the spindle-to-core ratio.


From: Ted Dunning <td...@maprtech.com>
Reply-To: <us...@hadoop.apache.org>
Date: Fri, 12 Oct 2012 12:11:56 -0700
To: <us...@hadoop.apache.org>
Subject: Re: Spindle per Cores

It depends on your distribution.  Some distributions are more efficient at driving spindles than others.

Ratios as high as 2 spindles per core are sometimes quite reasonable.

On Fri, Oct 12, 2012 at 10:46 AM, Patai Sangbutsarakum <si...@gmail.com> wrote:
I have read around about the hardware recommendation for hadoop cluster.
One of them is recommend 1:1 ratio between spindle per core.

Intel CPU come with Hyperthread which will double the number cores on
one physical CPU. eg. 8 cores with Hyperthread it because 16 which is
where we start to calculate about number of task slots per node.

Once it come to spindle, i strongly believe I should pick 8 cores and
picks 8 disks in order to get 1:1 ratio.

Please suggest
Patai


Re: Spindle per Cores

Posted by Ted Dunning <td...@maprtech.com>.
It depends on your distribution.  Some distributions are more efficient at
driving spindles than others.

Ratios as high as 2 spindles per core are sometimes quite reasonable.
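
For the 8-versus-16 question, here is how the ratio works out counted both ways (a small Python sketch).  Counting physical cores for the spindle ratio, and treating the extra hardware threads only as headroom when sizing task slots, is an assumption here, not a recommendation made in this thread.

# Spindle-to-core ratio counted both ways for a dual 4-core node with
# Hyper-Threading enabled.  Counting physical cores is an assumption here,
# not a recommendation made in this thread.
physical_cores = 8
hw_threads     = physical_cores * 2     # what the OS reports with HT enabled

for disks in (8, 12, 16):
    print("%2d disks: %.2f per physical core, %.2f per hardware thread"
          % (disks, disks / float(physical_cores), disks / float(hw_threads)))
# 8 disks gives the 1:1 rule of thumb against physical cores; 12 or 16 disks
# still falls within the "up to 2 spindles per core" range mentioned above
# when counted the same way.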

On Fri, Oct 12, 2012 at 10:46 AM, Patai Sangbutsarakum <
silvianhadoop@gmail.com> wrote:

> I have read around about the hardware recommendation for hadoop cluster.
> One of them is recommend 1:1 ratio between spindle per core.
>
> Intel CPU come with Hyperthread which will double the number cores on
> one physical CPU. eg. 8 cores with Hyperthread it because 16 which is
> where we start to calculate about number of task slots per node.
>
> Once it come to spindle, i strongly believe I should pick 8 cores and
> picks 8 disks in order to get 1:1 ratio.
>
> Please suggest
> Patai
>
