Posted to hdfs-user@hadoop.apache.org by Patai Sangbutsarakum <si...@gmail.com> on 2012/10/11 18:22:47 UTC

Why they recommend this (CPU) ?

Hello Hadoopers,

I was reading the hardware recommendation doc from Cloudera/HP, and
this is one of its recommendations about CPUs.

"To remove the bottleneck for CPU bound workloads, for the best
cost/performance tradeoff, we recommend buying 6 core processors with
faster clock speeds as opposed to buying 8 core processors."

Yes, my application is more CPU-heavy than IO-heavy, but don't we
need more cores --> more slots (yes, with enough memory) --> more
computation slots?
My gut feeling doesn't tell me that 6 cores with a faster clock would
deliver more horsepower than 8 cores. I guess that's because it says
"cost/performance".

What do you guys think?

Thanks in advance
Patai

Re: Why they recommend this (CPU) ?

Posted by Aaron Eng <ae...@maprtech.com>.
Without a doubt, there are many CPU-intensive workloads where the number of
CPU cycles consumed to process some amount of data is many times higher
than what would be considered relatively normal.  But at the same time,
there are many memory-intensive workloads and IO-bound workloads that are
common as well.  I've worked with companies that have been doing all 3 on a
single cluster, which is another point to be aware of.

Unless you are building a single-application, single-purpose cluster,
you'll probably have a mix of jobs with a mix of resource profiles.  So
designing a cluster so that your CPU-heavy job runs faster may mean you skimped
on spindles or disk speed, and when you want to run your new application
and do your mixed workload, you end up with a bottleneck on the IO side.

So keep in mind not just the profile of a specific workload, but the profile
of the work you want to support on the cluster in general.

On Thu, Oct 11, 2012 at 12:03 PM, Russell Jurney
<ru...@gmail.com> wrote:

> My own clusters are too temporary and virtual for me to notice. I haven't
> thought of clock speed as having mattered in a long time, so I'm curious
> what kind of use cases might benefit from faster cores. Is there a category
> in some way where this sweet spot for faster cores occurs?
>
> Russell Jurney http://datasyndrome.com
>
> On Oct 11, 2012, at 11:39 AM, Ted Dunning <td...@maprtech.com> wrote:
>
> You should measure your workload.  Your experience will vary dramatically
> with different computations.
>
> On Thu, Oct 11, 2012 at 10:56 AM, Russell Jurney <russell.jurney@gmail.com
> > wrote:
>
>> Anyone got data on this? This is interesting, and somewhat
>> counter-intuitive.
>>
>> Russell Jurney http://datasyndrome.com
>>
>> On Oct 11, 2012, at 10:47 AM, Jay Vyas <ja...@gmail.com> wrote:
>>
>> > Presumably, if you have a reasonable number of cores - speeding the
>> cores up will be better than forking a task into smaller and smaller chunks
>> - because at some point the overhead of multiple processes would be a
>> bottleneck - maybe due to streaming reads and writes?  I'm sure each and
>> every problem has a different sweet spot.
>>
>
>
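
Jay's overhead point can be sketched with a toy model (entirely made-up
numbers, Python): a fixed amount of work split across ever more tasks, where
every task pays a fixed startup cost (JVM launch, scheduling, shuffle setup).

    def wall_time(total_core_seconds, tasks, cores, overhead_s=15.0):
        # Crude model: work is split evenly across `tasks`, run `cores` at a
        # time, and every task pays a fixed startup overhead.
        waves = -(-tasks // cores)            # ceil(tasks / cores)
        return waves * (total_core_seconds / tasks + overhead_s)

    for tasks in (8, 16, 32, 64, 128, 256):
        print(tasks, "tasks ->", round(wall_time(3600.0, tasks, cores=8)), "s")

Past a point the fixed per-task cost dominates the useful work, which is
Jay's sweet-spot argument.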

Re: Why they recommend this (CPU) ?

Posted by Russell Jurney <ru...@gmail.com>.
Wow, thanks for an awesome reply, Steve!

On Friday, October 12, 2012, Steve Loughran wrote:

>
>
> On 11 October 2012 20:47, Goldstone, Robin J. <goldstone1@llnl.gov> wrote:
>
>>  Be sure you are comparing apples to apples.  The E5-2650 has a larger
>> cache than the E5-2640, a faster system bus, and can support faster (1600MHz
>> vs 1333MHz) DRAM, resulting in greater potential memory bandwidth.
>>
>>  http://ark.intel.com/compare/64590,64591
>>
>>
> mmm. There is more $L3, and in-CPU sync can be done better than over the
> inter-socket bus -you're also less vulnerable to NUMA memory allocation
> issues (*).
>
> There's another issue that drives these recommendations, namely the price
> curve that server parts follow over time, the Bill-of-Materials curve, aka
> the "BOM Curve". Most parts come in at one price, and that price drops over
> time as a function of volume parts shipped (covering the
> Non-Recurring Engineering (NRE) costs), improvements in yield and
> manufacturing quality in that specific process, etc., until it levels
> out at an actual selling price (ASP) to the people who make the boxes
> (Original Design Manufacturers == ODMs), where it tends to stay for the rest
> of that part's lifespan.
>
> DRAM, HDDs follow a fairly predictable exponential decay curve. You can
> look at the cost of a part and its history, determine the variables, and then
> come up with a prediction of how much it will cost at a time in the near
> future. It was these BOM curves that were key to Dell's business model -direct
> sales to customer meant they didn't need so much inventory and could
> actually get into a situation where they had the cash from the customer
> before the ODM had built the box, let alone been paid for it. There was a
> price: utter unpredictability of what DRAM and HDDs you were going to get.
> Server-side things have stabilised and all the tier-1 PC vendors qualify a
> set of DRAM and storage options, so they can source from multiple vendors,
> so eliminating a single vendor as a SPOF and allowing them to negotiate
> better on cost of parts -which again changes that BOM curve.
>
> This may seem strange but you should all know that the retail price of a
> laptop, flatscreen TV, etc comes down over time -what's not so obvious are
> the maths behind the changes in its price.
>
> One of the odd parts in this business is the CPU. There is a near-monopoly
> in supplies, and intel don't want their business at the flat bit of the
> curve. They need the money not just to keep their shareholders happy, but
> for the $B needed to build the next generation of Fabs and hence continue
> to keep their shareholders happy in future. Intel parts come in high when
> they initially ship, and stay at that price until the next time Intel
> change their price list, which is usually quarterly. The first price change
> is very steep, then the gradient d$/dT reduces, as it gets low enough that
> part drops off the price list never to be seen again, except maybe in
> embedded designs.
>
> What does that mean? It means you pay a lot for the top of the line x86
> CPUs, and unless you are 100% sure that you really need it, you may be
> better off investing your money in:
>  -more DRAM with better ECC (product placement: Chip-kill) and buffering:
> less swapping and the ability to run more reducers/node.
>  -more HDDs : more storage in same #of racks, assuming your site can take
> the weight.
>  -SFF HDDs : less storage but more IO bandwidth off the disks.
>  -SSD: faster storage
>  -GPUs: very good performance for algorithms you can recompile onto them
>  -support from Hortonworks to keep your Hadoop cluster going.
>  -10 GbE networking, or multiple bonded 1GbE
>  -more servers (this becomes more of a factor on larger clusters, where
> the cost savings of the less expensive parts scale up)
>  -paying the electricity bill.
>  -keeping the cost of building up a hadoop cluster down, so making it more
> affordable to store PB of data whose value will only appreciate over time.
>  -paying your ops team more money, keeping them happier and so increasing
> the probability they will field the 4am support crisis.
>
> That's why it isn't clear cut that 8 cores are better. It's not just a
> simple performance question -it's the opportunity cost of the price
> difference scaled up by the number of nodes. You do -as Ted pointed out-
> need to know what you actually want.
>
> Finally, as a basic "data science" exercise for the reader:
>
> 1. calculate the price curves of, say, a Dell laptop, and compare with the
> price curve of an apple laptop introduced with the same CPU and at the same
> time. Don't look at the absolute values -normalising them to a percentage
> is better to view.
> 2. Look at which one follows a soft gradient and which follows more of a
> step function.
> 3. add to the graph the intel pricing and see how that correlates with the
> ASP.
> 4. Determine from this which vendor has the best margins -not just at time
> of release, but over the lifespan of a product. Integration is a useful
> technique here. Bear in mind Apple's NRE costs on laptops are higher due to
> the better HW design, but also that the software development is funded from
> their sales alone.
> 5. Using this information, decide when is the best time to buy a dell or
> an apple laptop.
>
>
> I should make a blog post of this, "server prices: it's all down to the
> exponential decay equations of the individual parts"
>
> Steve "why yes, I have spent time in the PC industry" Loughran
>
>
>
> (*) If you don't know what NUMA is, do some research and think about
> its implications in heap allocation.
>
>
>
>>
>>   From: Patrick Angeles <patrick@cloudera.com>
>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Date: Thursday, October 11, 2012 12:36 PM
>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Subject: Re: Why they recommend this (CPU) ?
>>
>>   If you look at comparable Intel parts:
>>
>>  Intel E5-2640
>> 6 cores @ 2.5 Ghz
>> 95W - $885
>>
>>  Intel E5-2650
>> 8 cores @ 2.0 Ghz
>> 95W - $1107
>>
>>  So, for $400 more on a dual proc system -- which really isn't much --
>> you get 2 more cores for a 20% drop in speed. I can believe that for some
>> scenarios, the faster cores would fare better. Gzip compression is one that
>> comes to mind, where you are aggressively trading CPU for lower storage
>> volume and IO. An HBase cluster is another example.
>>
>> On Thu, Oct 11, 2012 at 3:03 PM, Russell Jurney <russell.jurney@gmail.com> wrote:
>>
>>>  My own clusters are too temporary and virtual for me to notice. I
>>> haven't thought of clock speed as having mattered in a long time, so I'm
>>> curious what kind of use cases might benefit from faster cores. Is there a
>>> category in some way where this sweet spot for faster cores occurs?
>>>
>>> Russell Jurney http://datasyndrome.com
>>>
>>> On Oct 11, 2012, at 11:39 AM, Ted Dunning <tdunning@maprtech.com> wrote:
>>>
>>>   You should measure your workload.  Your experience will vary
>>> dramatically with different computations.
>>>
>>> On Thu, Oct 11, 2012 at 10:56 AM, Russell Jurney <russell.jurney@gmail.com> wrote:
>>>
>>>> Anyone got data on this? This is interesting, and somewhat
>>>> counter-intuitive.
>>>>
>>>> Russell Jurney http://datasyndrome.com
>>>>
>>>> On Oct 11, 2012, at 10:47 AM, Jay Vyas <jayunit100@gmail.com> wrote:
>>>>
>>>> > Presumably, if you have a reasonable number of cores - speeding the
>>>> cores up will be better than forking a task into smaller and smaller chunks
>>>> - because at some point the overhead of multiple processes would be a
>>>> bottleneck - maybe due to streaming reads and writes?  I'm sure each and
>>>> every problem has a different sweet spot.
>>>>
>>>
>>>
>>
>

-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: Why they recommend this (CPU) ?

Posted by Steve Loughran <st...@hortonworks.com>.
On 11 October 2012 20:47, Goldstone, Robin J. <go...@llnl.gov> wrote:

>  Be sure you are comparing apples to apples.  The E5-2650 has a larger
> cache than the E5-2640, a faster system bus, and can support faster (1600MHz
> vs 1333MHz) DRAM, resulting in greater potential memory bandwidth.
>
>  http://ark.intel.com/compare/64590,64591
>
>
mmm. There is more $L3, and in-CPU sync can be done better than over the
inter-socket bus -you're also less vulnerable to NUMA memory allocation
issues (*).

There's another issue that drives these recommendations, namely the price
curve that server parts follow over time, the Bill-of-Materials curve, aka
the "BOM Curve". Most parts come in at one price, and that price drops over
time as a function of volume parts shipped (covering the
Non-Recurring Engineering (NRE) costs), improvements in yield and
manufacturing quality in that specific process, etc., until it levels
out at an actual selling price (ASP) to the people who make the boxes
(Original Design Manufacturers == ODMs), where it tends to stay for the rest
of that part's lifespan.

DRAM, HDDs follow a fairly predictable exponential decay curve. You can
look at the cost of a part and its history, determine the variables, and then
come up with a prediction of how much it will cost at a time in the near
future. It was these BOM curves that were key to Dell's business model -direct
sales to customer meant they didn't need so much inventory and could
actually get into a situation where they had the cash from the customer
before the ODM had built the box, let alone been paid for it. There was a
price: utter unpredictability of what DRAM and HDDs you were going to get.
Server-side things have stabilised and all the tier-1 PC vendors qualify a
set of DRAM and storage options, so they can source from multiple vendors,
so eliminating a single vendor as a SPOF and allowing them to negotiate
better on cost of parts -which again changes that BOM curve.
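
For illustration only, a minimal Python sketch of that kind of extrapolation,
with invented price points and a plain log-linear least-squares fit (it
ignores the ASP floor described above):

    import math

    # Hypothetical quarterly street prices (USD) for a DRAM kit -- not real data.
    quarters = [0, 1, 2, 3, 4, 5]
    prices = [400, 310, 245, 195, 160, 130]

    # Fit price ~ a * exp(-k * t) by least squares on log(price).
    n = len(quarters)
    t_mean = sum(quarters) / n
    y_mean = sum(math.log(p) for p in prices) / n
    k = -sum((t - t_mean) * (math.log(p) - y_mean)
             for t, p in zip(quarters, prices)) / sum((t - t_mean) ** 2
                                                      for t in quarters)
    a = math.exp(y_mean + k * t_mean)

    for t in (6, 7, 8):   # extrapolate a couple of quarters out
        print("quarter", t, "-> about", round(a * math.exp(-k * t)), "USD")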

This may seem strange but you should all know that the retail price of a
laptop, flatscreen TV, etc comes down over time -what's not so obvious are
the maths behind the changes in its price.

One of the odd parts in this business is the CPU. There is a near-monopoly
in supplies, and intel don't want their business at the flat bit of the
curve. They need the money not just to keep their shareholders happy, but
for the $B needed to build the next generation of Fabs and hence continue
to keep their shareholders happy in future. Intel parts come in high when
they initially ship, and stay at that price until the next time Intel
change their price list, which is usually quarterly. The first price change
is very steep, then the gradient d$/dT reduces, as it gets low enough that
part drops off the price list never to be seen again, except maybe in
embedded designs.

What does that mean? It means you pay a lot for the top of the line x86
CPUs, and unless you are 100% sure that you really need it, you may be
better off investing your money in:
 -more DRAM with better ECC (product placement: Chip-kill) and buffering:
less swapping and the ability to run more reducers/node (see the sketch after
this list).
 -more HDDs : more storage in same #of racks, assuming your site can take
the weight.
 -SFF HDDs : less storage but more IO bandwidth off the disks.
 -SSD: faster storage
 -GPUs: very good performance for algorithms you can recompile onto them
 -support from Hortonworks to keep your Hadoop cluster going.
 -10 GbE networking, or multiple bonded 1GbE
 -more servers (this becomes more of a factor on larger clusters, where the
cost savings of the less expensive parts scale up)
 -paying the electricity bill.
 -keeping the cost of building up a hadoop cluster down, so making it more
affordable to store PB of data whose value will only appreciate over time.
 -paying your ops team more money, keeping them happier and so increasing
the probability they will field the 4am support crisis.
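
To make the DRAM point above concrete, here is a back-of-the-envelope Python
sketch; the RAM and heap figures are assumptions, and the property names are
the usual MRv1 knobs:

    def max_slots(node_ram_gb, reserved_gb=4, heap_per_task_gb=2):
        # Reserve RAM for the OS, DataNode and TaskTracker, give each child JVM
        # (mapred.child.java.opts) a fixed heap, and count how many tasks fit;
        # the result bounds mapred.tasktracker.map.tasks.maximum plus
        # mapred.tasktracker.reduce.tasks.maximum.
        return int((node_ram_gb - reserved_gb) / heap_per_task_gb)

    for ram in (24, 48, 96):
        print(ram, "GB RAM ->", max_slots(ram), "task slots")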

That's why it isn't clear cut that 8 cores are better. It's not just a
simple performance question -it's the opportunity cost of the price
difference scaled up by the number of nodes. You do -as Ted pointed out-
need to know what you actually want.
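
As a rough worked example of that opportunity-cost arithmetic, using Patrick's
list prices from further down this thread (the cluster size here is
hypothetical, and aggregate GHz is only a crude proxy for throughput):

    parts = {                   # list price per socket, cores, clock in GHz
        "E5-2640": (885, 6, 2.5),
        "E5-2650": (1107, 8, 2.0),
    }
    nodes, sockets = 100, 2     # hypothetical cluster
    for name, (price, cores, ghz) in parts.items():
        agg = cores * ghz
        print("%s: %.1f GHz aggregate/socket, $%.0f per GHz, $%d of CPU cluster-wide"
              % (name, agg, price / agg, price * sockets * nodes))
    print("price difference across the cluster: $%d"
          % ((1107 - 885) * sockets * nodes))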

Finally, as a basic "data science" exercise for the reader:

1. calculate the price curves of, say, a Dell laptop, and compare with the
price curve of an apple laptop introduced with the same CPU and at the same
time. Don't look at the absolute values -normalising them to a percentage
is better to view.
2. Look at which one follows a soft gradient and which follows more of a
step function.
3. add to the graph the intel pricing and see how that correlates with the
ASP.
4. Determine from this which vendor has the best margins -not just at time
of release, but over the lifespan of a product. Integration is a useful
technique here. Bear in mind Apple's NRE costs on laptops are higher due to
the better HW design, but also that the software development is funded from
their sales alone.
5. Using this information, decide when is the best time to buy a dell or an
apple laptop.


I should make a blog post of this, "server prices: it's all down to the
exponential decay equations of the individual parts"

Steve "why yes, I have spent time in the PC industry" Loughran



(*) If you don't know what NUMA is, do some research and think about
its implications in heap allocation.



>
>   From: Patrick Angeles <pa...@cloudera.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Thursday, October 11, 2012 12:36 PM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Why they recommend this (CPU) ?
>
>   If you look at comparable Intel parts:
>
>  Intel E5-2640
> 6 cores @ 2.5 Ghz
> 95W - $885
>
>  Intel E5-2650
> 8 cores @ 2.0 Ghz
> 95W - $1107
>
>  So, for $400 more on a dual proc system -- which really isn't much --
> you get 2 more cores for a 20% drop in speed. I can believe that for some
> scenarios, the faster cores would fare better. Gzip compression is one that
> comes to mind, where you are aggressively trading CPU for lower storage
> volume and IO. An HBase cluster is another example.
>
> On Thu, Oct 11, 2012 at 3:03 PM, Russell Jurney <ru...@gmail.com> wrote:
>
>>  My own clusters are too temporary and virtual for me to notice. I
>> haven't thought of clock speed as having mattered in a long time, so I'm
>> curious what kind of use cases might benefit from faster cores. Is there a
>> category in some way where this sweet spot for faster cores occurs?
>>
>> Russell Jurney http://datasyndrome.com
>>
>> On Oct 11, 2012, at 11:39 AM, Ted Dunning <td...@maprtech.com> wrote:
>>
>>   You should measure your workload.  Your experience will vary
>> dramatically with different computations.
>>
>> On Thu, Oct 11, 2012 at 10:56 AM, Russell Jurney <
>> russell.jurney@gmail.com> wrote:
>>
>>> Anyone got data on this? This is interesting, and somewhat
>>> counter-intuitive.
>>>
>>> Russell Jurney http://datasyndrome.com
>>>
>>> On Oct 11, 2012, at 10:47 AM, Jay Vyas <ja...@gmail.com> wrote:
>>>
>>> > Presumably, if you have a reasonable number of cores - speeding the
>>> cores up will be better than forking a task into smaller and smaller chunks
>>> - because at some point the overhead of multiple processes would be a
>>> bottleneck - maybe due to streaming reads and writes?  I'm sure each and
>>> every problem has a different sweet spot.
>>>
>>
>>
>

Re: Why they recommend this (CPU) ?

Posted by Ted Dunning <td...@maprtech.com>.
Like I said, measure twice, cut once.
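
One cheap way to start measuring is to time the CPU-bound pieces in isolation,
e.g. the gzip case Patrick mentions below. A minimal Python sketch (random
input is a worst case for the compression ratio, but it still exercises the
codec's CPU cost; use your real data for a ratio estimate):

    import gzip
    import os
    import time

    block = os.urandom(64 * 1024 * 1024)   # stand-in for a 64 MB input block
    start = time.time()
    out = gzip.compress(block, compresslevel=6)
    elapsed = time.time() - start
    print("%.1f MB/s on one core, ratio %.2f"
          % (len(block) / 1e6 / elapsed, len(out) / float(len(block))))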

On Thu, Oct 11, 2012 at 12:47 PM, Goldstone, Robin J.
<go...@llnl.gov> wrote:

>  Be sure you are comparing apples to apples.  The E5-2650 has a larger
> cache than the E5-2640, a faster system bus, and can support faster (1600MHz
> vs 1333MHz) DRAM, resulting in greater potential memory bandwidth.
>
>  http://ark.intel.com/compare/64590,64591
>
>
>   From: Patrick Angeles <pa...@cloudera.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Thursday, October 11, 2012 12:36 PM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Why they recommend this (CPU) ?
>
>   If you look at comparable Intel parts:
>
>  Intel E5-2640
> 6 cores @ 2.5 Ghz
> 95W - $885
>
>  Intel E5-2650
> 8 cores @ 2.0 Ghz
> 95W - $1107
>
>  So, for $400 more on a dual proc system -- which really isn't much --
> you get 2 more cores for a 20% drop in speed. I can believe that for some
> scenarios, the faster cores would fare better. Gzip compression is one that
> comes to mind, where you are aggressively trading CPU for lower storage
> volume and IO. An HBase cluster is another example.
>
> On Thu, Oct 11, 2012 at 3:03 PM, Russell Jurney <ru...@gmail.com> wrote:
>
>>  My own clusters are too temporary and virtual for me to notice. I
>> haven't thought of clock speed as having mattered in a long time, so I'm
>> curious what kind of use cases might benefit from faster cores. Is there a
>> category in some way where this sweet spot for faster cores occurs?
>>
>> Russell Jurney http://datasyndrome.com
>>
>> On Oct 11, 2012, at 11:39 AM, Ted Dunning <td...@maprtech.com> wrote:
>>
>>   You should measure your workload.  Your experience will vary
>> dramatically with different computations.
>>
>> On Thu, Oct 11, 2012 at 10:56 AM, Russell Jurney <
>> russell.jurney@gmail.com> wrote:
>>
>>> Anyone got data on this? This is interesting, and somewhat
>>> counter-intuitive.
>>>
>>> Russell Jurney http://datasyndrome.com
>>>
>>> On Oct 11, 2012, at 10:47 AM, Jay Vyas <ja...@gmail.com> wrote:
>>>
>>> > Presumably, if you have a reasonable number of cores - speeding the
>>> cores up will be better than forking a task into smaller and smaller chunks
>>> - because at some point the overhead of multiple processes would be a
>>> bottleneck - maybe due to streaming reads and writes?  I'm sure each and
>>> every problem has a different sweet spot.
>>>
>>
>>
>

Re: Why they recommend this (CPU) ?

Posted by Steve Loughran <st...@hortonworks.com>.
On 11 October 2012 20:47, Goldstone, Robin J. <go...@llnl.gov> wrote:

>  Be sure you are comparing apples to apples.  The E5-2650 has a larger
> cache than the E5-2640, faster system bus and can support faster (1600MHz
> vs 1333MHz) DRAM resulting in greater potential memory bandwidth.
>
>  http://ark.intel.com/compare/64590,64591
>
>
mmm. There is more L3 cache, and in-CPU sync can be done better than over the
inter-socket bus -you're also less vulnerable to NUMA memory allocation
issues (*).

There's another issue that drives these recommendations, namely the price
curve that server parts follow over time, the Bill-of-Materials curve, aka
the "BOM Curve". Most parts come in at one price, and that price drops over
time as a function of volume parts shipped covering Non-Recurring
Engineering (NRE) costs, improvements in yield and manufacturing quality in
that specific process, etc., until it levels out at an actual selling price
(ASP) to the people who make the boxes (Original Design
Manufacturers == ODMs), where it tends to stay for the rest of that part's
lifespan.

DRAM and HDDs follow a fairly predictable exponential decay curve. You can
look at the cost of a part and its history, determine the variables and then
come up with a prediction of how much it will cost at a time in the near
future. It's these BOM curves that were key to Dell's business model -direct
sales to customers meant they didn't need so much inventory and could
actually get into a situation where they had the cash from the customer
before the ODM had built the box, let alone been paid for it. There was a
price: utter unpredictability of what DRAM and HDDs you were going to get.
Server-side things have stabilised and all the tier-1 PC vendors qualify a
set of DRAM and storage options, so they can source from multiple vendors,
eliminating a single vendor as a SPOF and allowing them to negotiate
better on cost of parts -which again changes that BOM curve.
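
As a very rough illustration of that "predictable exponential decay" idea,
here is a minimal Python sketch. The shape of the model comes from the
paragraph above; the launch price, floor price and observed data point are
invented placeholders, not real BOM figures:

import math

# Model: price(t) = floor + (launch - floor) * exp(-k * t), t in months.
# All numbers below are made-up placeholders for illustration only.

def fit_decay_rate(launch, floor, observed_price, observed_month):
    """Solve for k given a single observed price point."""
    return -math.log((observed_price - floor) / (launch - floor)) / observed_month

def predict(launch, floor, k, month):
    return floor + (launch - floor) * math.exp(-k * month)

# Hypothetical HDD: launches at $300, settles near $150, seen at $220
# after 6 months on the price list.
k = fit_decay_rate(300.0, 150.0, 220.0, 6)
for month in (0, 6, 12, 18, 24):
    print(month, "months:", round(predict(300.0, 150.0, k, month), 2))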

This may seem strange, but you should all know that the retail price of a
laptop, flatscreen TV, etc. comes down over time -what's not so obvious is
the maths behind the changes in its price.

One of the odd parts in this business is the CPU. There is a near-monopoly
in supplies, and Intel don't want their business at the flat bit of the
curve. They need the money not just to keep their shareholders happy, but
for the $B needed to build the next generation of fabs and hence continue
to keep their shareholders happy in future. Intel parts come in high when
they initially ship, and stay at that price until the next time Intel
change their price list, which is usually quarterly. The first price change
is very steep, then the gradient d$/dT reduces, and once it gets low enough
the part drops off the price list never to be seen again, except maybe in
embedded designs.

What does that mean? It means you pay a lot for the top of the line x86
CPUs, and unless you are 100% sure that you really need it, you may be
better off investing your money in:
 -more DRAM with better ECC (product placement: Chipkill) and buffering:
less swapping, ability to run more reducers/node.
 -more HDDs : more storage in same #of racks, assuming your site can take
the weight.
 -SFF HDDs : less storage but more IO bandwidth off the disks.
 -SSD: faster storage
 -GPUs: very good performance for algorithms you can recompile onto them
 -support from Hortonworks to keep your Hadoop cluster going.
 -10 GbE networking, or multiple bonded 1GbE
 -more servers (this becomes more of a factor on larger clusters, where the
cost savings of the less expensive parts scale up)
 -paying the electricity bill.
 -keeping the cost of building up a Hadoop cluster down, so making it more
affordable to store PB of data whose value will only appreciate over time.
 -paying your ops team more money, keeping them happier and so increasing
the probability they will field the 4am support crisis.

That's why it isn't clear cut that 8 cores are better. It's not just a
simple performance question -it's the opportunity cost of the price
difference scaled up by the number of nodes. You do -as Ted pointed out-
need to know what you actually want.

Finally, as a basic "data science" exercise for the reader:

1. Calculate the price curves of, say, a Dell laptop, and compare with the
price curve of an Apple laptop introduced with the same CPU and at the same
time. Don't look at the absolute values -normalising them to a percentage
of the launch price makes them easier to compare (a rough sketch of that
normalisation follows below).
2. Look at which one follows a soft gradient and which follows more of a
step function.
3. Add the Intel pricing to the graph and see how that correlates with the
ASP.
4. Determine from this which vendor has the best margins -not just at time
of release, but over the lifespan of a product. Integration is a useful
technique here. Bear in mind Apple's NRE costs on laptops are higher due to
the better HW design, and that the software development is funded from
their sales alone.
5. Using this information, decide when is the best time to buy a Dell or an
Apple laptop.
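
A rough sketch of step 1's normalisation, for what it's worth. The price
series below are invented placeholders, not real Dell or Apple prices -only
the method (divide each quarter's price by the launch price) is the point:

# Normalise each price history to a percentage of its launch price so the
# two curves can be compared directly. Numbers are illustrative only.
dell_prices  = [1000, 900, 820, 760, 720, 700]     # one sample per quarter
apple_prices = [1200, 1200, 1200, 1100, 1100, 1000]

def normalise(prices):
    launch = prices[0]
    return [round(100.0 * p / launch, 1) for p in prices]

print("Dell  %:", normalise(dell_prices))
print("Apple %:", normalise(apple_prices))
# A soft gradient shows up as a smooth slide in the percentages; a step
# function shows up as long flat runs followed by sudden drops.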


I should make a blog post of this, "server prices: it's all down to the
exponential decay equations of the individual parts"

Steve "why yes, I have spent time in the PC industry" Loughran



(*) If you don't know what NUMA is, do some research and think about
its implications for heap allocation.



>
>   From: Patrick Angeles <pa...@cloudera.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Thursday, October 11, 2012 12:36 PM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Why they recommend this (CPU) ?
>
>   If you look at comparable Intel parts:
>
>  Intel E5-2640
> 6 cores @ 2.5 Ghz
> 95W - $885
>
>  Intel E5-2650
> 8 cores @ 2.0 Ghz
> 95W - $1107
>
>  So, for $400 more on a dual proc system -- which really isn't much --
> you get 2 more cores for a 20% drop in speed. I can believe that for some
> scenarios, the faster cores would fare better. Gzip compression is one that
> comes to mind, where you are aggressively trading CPU for lower storage
> volume and IO. An HBase cluster is another example.
>
> On Thu, Oct 11, 2012 at 3:03 PM, Russell Jurney <ru...@gmail.com>wrote:
>
>>  My own clusters are too temporary and virtual for me to notice. I
>> haven't thought of clock speed as having mattered in a long time, so I'm
>> curious what kind of use cases might benefit from faster cores. Is there a
>> category in some way where this sweet spot for faster cores occurs?
>>
>> Russell Jurney http://datasyndrome.com
>>
>> On Oct 11, 2012, at 11:39 AM, Ted Dunning <td...@maprtech.com> wrote:
>>
>>   You should measure your workload.  Your experience will vary
>> dramatically with different computations.
>>
>> On Thu, Oct 11, 2012 at 10:56 AM, Russell Jurney <
>> russell.jurney@gmail.com> wrote:
>>
>>> Anyone got data on this? This is interesting, and somewhat
>>> counter-intuitive.
>>>
>>> Russell Jurney http://datasyndrome.com
>>>
>>> On Oct 11, 2012, at 10:47 AM, Jay Vyas <ja...@gmail.com> wrote:
>>>
>>> > Presumably, if you have a reasonable number of cores - speeding the
>>> cores up will be better than forking a task into smaller and smaller chunks
>>> - because at some point the overhead of multiple processes would be a
>>> bottleneck - maybe due to streaming reads and writes?  I'm sure each and
>>> every problem has a different sweet spot.
>>>
>>
>>
>

Re: Why they recommend this (CPU) ?

Posted by "Goldstone, Robin J." <go...@llnl.gov>.
Be sure you are comparing apples to apples.  The E5-2650 has a larger cache than the E5-2640, faster system bus and can support faster (1600MHz vs 1333MHz) DRAM resulting in greater potential memory bandwidth.

http://ark.intel.com/compare/64590,64591
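
For a sense of scale, the DRAM speed difference alone can be put into
numbers. A back-of-the-envelope sketch; the four-memory-channel and
8-bytes-per-transfer figures are my assumptions about a typical E5-2600
class board, not something stated in this thread:

# Peak theoretical memory bandwidth per socket.
# Assumptions (not from the post): 4 DDR3 channels, 8 bytes per transfer.
CHANNELS = 4
BYTES_PER_TRANSFER = 8

def peak_gb_per_s(mega_transfers_per_s):
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * CHANNELS / 1e9

for mt_s in (1333, 1600):
    print(mt_s, "MT/s ->", round(peak_gb_per_s(mt_s), 1), "GB/s peak")
# 1333 MT/s -> ~42.7 GB/s, 1600 MT/s -> ~51.2 GB/s: about 20% more headroom.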


From: Patrick Angeles <pa...@cloudera.com>>
Reply-To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Date: Thursday, October 11, 2012 12:36 PM
To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Why they recommend this (CPU) ?

If you look at comparable Intel parts:

Intel E5-2640
6 cores @ 2.5 Ghz
95W - $885

Intel E5-2650
8 cores @ 2.0 Ghz
95W - $1107

So, for $400 more on a dual proc system -- which really isn't much -- you get 2 more cores for a 20% drop in speed. I can believe that for some scenarios, the faster cores would fare better. Gzip compression is one that comes to mind, where you are aggressively trading CPU for lower storage volume and IO. An HBase cluster is another example.

On Thu, Oct 11, 2012 at 3:03 PM, Russell Jurney <ru...@gmail.com>> wrote:
My own clusters are too temporary and virtual for me to notice. I haven't thought of clock speed as having mattered in a long time, so I'm curious what kind of use cases might benefit from faster cores. Is there a category in some way where this sweet spot for faster cores occurs?

Russell Jurney http://datasyndrome.com

On Oct 11, 2012, at 11:39 AM, Ted Dunning <td...@maprtech.com>> wrote:

You should measure your workload.  Your experience will vary dramatically with different computations.

On Thu, Oct 11, 2012 at 10:56 AM, Russell Jurney <ru...@gmail.com>> wrote:
Anyone got data on this? This is interesting, and somewhat counter-intuitive.

Russell Jurney http://datasyndrome.com

On Oct 11, 2012, at 10:47 AM, Jay Vyas <ja...@gmail.com>> wrote:

> Presumably, if you have a reasonable number of cores - speeding the cores up will be better than forking a task into smaller and smaller chunks - because at some point the overhead of multiple processes would be a bottleneck - maybe due to streaming reads and writes?  I'm sure each and every problem has a different sweet spot.



Re: Why they recommend this (CPU) ?

Posted by Patrick Angeles <pa...@cloudera.com>.
If you look at comparable Intel parts:

Intel E5-2640
6 cores @ 2.5 Ghz
95W - $885

Intel E5-2650
8 cores @ 2.0 Ghz
95W - $1107

So, for $400 more on a dual proc system -- which really isn't much -- you
get 2 more cores for a 20% drop in speed. I can believe that for some
scenarios, the faster cores would fare better. Gzip compression is one that
comes to mind, where you are aggressively trading CPU for lower storage
volume and IO. An HBase cluster is another example.
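
As a quick sanity check on that trade-off, the arithmetic spelled out in a
few lines. Treating cores x GHz as a proxy for aggregate throughput is my
own simplification -it ignores cache size, memory speed and turbo- but the
prices and clocks are the ones quoted above:

# Crude per-socket comparison of the two parts quoted above.
# cores * GHz is only a rough proxy for aggregate throughput.
parts = {
    "E5-2640": {"cores": 6, "ghz": 2.5, "usd": 885},
    "E5-2650": {"cores": 8, "ghz": 2.0, "usd": 1107},
}

for name, p in parts.items():
    agg = p["cores"] * p["ghz"]          # aggregate core-GHz per socket
    print(name + ":", agg, "core-GHz, $%.2f per core-GHz" % (p["usd"] / agg))
# E5-2640: 15.0 core-GHz at ~$59 per core-GHz
# E5-2650: 16.0 core-GHz at ~$69 per core-GHz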

On Thu, Oct 11, 2012 at 3:03 PM, Russell Jurney <ru...@gmail.com>wrote:

> My own clusters are too temporary and virtual for me to notice. I haven't
> thought of clock speed as having mattered in a long time, so I'm curious
> what kind of use cases might benefit from faster cores. Is there a category
> in some way where this sweet spot for faster cores occurs?
>
> Russell Jurney http://datasyndrome.com
>
> On Oct 11, 2012, at 11:39 AM, Ted Dunning <td...@maprtech.com> wrote:
>
> You should measure your workload.  Your experience will vary dramatically
> with different computations.
>
> On Thu, Oct 11, 2012 at 10:56 AM, Russell Jurney <russell.jurney@gmail.com
> > wrote:
>
>> Anyone got data on this? This is interesting, and somewhat
>> counter-intuitive.
>>
>> Russell Jurney http://datasyndrome.com
>>
>> On Oct 11, 2012, at 10:47 AM, Jay Vyas <ja...@gmail.com> wrote:
>>
>> > Presumably, if you have a reasonable number of cores - speeding the
>> cores up will be better than forking a task into smaller and smaller chunks
>> - because at some point the overhead of multiple processes would be a
>> bottleneck - maybe due to streaming reads and writes?  I'm sure each and
>> every problem has a different sweet spot.
>>
>
>

Re: Why they recommend this (CPU) ?

Posted by Russell Jurney <ru...@gmail.com>.
My own clusters are too temporary and virtual for me to notice. I haven't
thought of clock speed as having mattered in a long time, so I'm curious
what kind of use cases might benefit from faster cores. Is there a category
in some way where this sweet spot for faster cores occurs?

Russell Jurney http://datasyndrome.com

On Oct 11, 2012, at 11:39 AM, Ted Dunning <td...@maprtech.com> wrote:

You should measure your workload.  Your experience will vary dramatically
with different computations.

On Thu, Oct 11, 2012 at 10:56 AM, Russell Jurney
<ru...@gmail.com>wrote:

> Anyone got data on this? This is interesting, and somewhat
> counter-intuitive.
>
> Russell Jurney http://datasyndrome.com
>
> On Oct 11, 2012, at 10:47 AM, Jay Vyas <ja...@gmail.com> wrote:
>
> > Presumably, if you have a reasonable number of cores - speeding the
> cores up will be better than forking a task into smaller and smaller chunks
> - because at some point the overhead of multiple processes would be a
> bottleneck - maybe due to streaming reads and writes?  I'm sure each and
> every problem has a different sweet spot.
>

Re: Why they recommend this (CPU) ?

Posted by Ted Dunning <td...@maprtech.com>.
You should measure your workload.  Your experience will vary dramatically
with different computations.

On Thu, Oct 11, 2012 at 10:56 AM, Russell Jurney
<ru...@gmail.com>wrote:

> Anyone got data on this? This is interesting, and somewhat
> counter-intuitive.
>
> Russell Jurney http://datasyndrome.com
>
> On Oct 11, 2012, at 10:47 AM, Jay Vyas <ja...@gmail.com> wrote:
>
> > Presumably, if you have a reasonable number of cores - speeding the
> cores up will be better than forking a task into smaller and smaller chunks
> - because at some point the overhead of multiple processes would be a
> bottleneck - maybe due to streaming reads and writes?  I'm sure each and
> every problem has a different sweet spot.
>

Re: Why they recommend this (CPU) ?

Posted by Russell Jurney <ru...@gmail.com>.
Anyone got data on this? This is interesting, and somewhat counter-intuitive.

Russell Jurney http://datasyndrome.com

On Oct 11, 2012, at 10:47 AM, Jay Vyas <ja...@gmail.com> wrote:

> Presumably, if you have a reasonable number of cores - speeding the cores up will be better than forking a task into smaller and smaller chunks - because at some point the overhead of multiple processes would be a bottleneck - maybe due to streaming reads and writes?  I'm sure each and every problem has a different sweet spot.

Re: Why they recommend this (CPU) ?

Posted by Jay Vyas <ja...@gmail.com>.
Presumably, if you have a reasonable number of cores - speeding the cores
up will be better than forking a task into smaller and smaller chunks -
because at some point the overhead of multiple processes would be a
bottleneck - maybe due to streaming reads and writes?  I'm sure each and
every problem has a different sweet spot.
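
One crude way to picture that sweet spot -a toy model only, with invented
numbers; real task overheads and data skew will move the optimum around:

# Fixed per-task overhead vs parallel speed-up, per node.
WORK_SECONDS = 600.0      # total CPU work in the job
OVERHEAD_PER_TASK = 5.0   # JVM start-up, scheduling, shuffle setup, ...
CORES = 8

def wall_clock(tasks):
    running = min(tasks, CORES)           # tasks that run concurrently
    waves = tasks / float(running)        # how many waves of tasks
    return waves * OVERHEAD_PER_TASK + WORK_SECONDS / running

for tasks in (1, 2, 4, 8, 16, 32, 64):
    print(tasks, "tasks ->", round(wall_clock(tasks), 1), "s")
# Wall clock falls until every core is busy, then the per-task overhead
# slowly claws the time back - the sweet spot depends on both numbers.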

RE: Start NN with dynamic property (option -D)

Posted by Guillaume Polaert <gp...@cyres.fr>.
That's good news, I'll try it soon.
Thank you :)

Guillaume Polaert 

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com]
Sent: Friday, 12 October 2012 17:21
To: user@hadoop.apache.org
Subject: Re: Start NN with dynamic property (option -D)

Hi Guillaume,

See https://issues.apache.org/jira/browse/HDFS-2580 - This improvement was recently added to trunk but is not in any Apache release I know of yet.

I instead make things like this work via the Configuration's ability to use System Properties (from the JVM) when substituting. So if your config files have <value>${foo}</value> and HADOOP_NAMENODE_OPTS has -Dfoo=bar, then the <value> passed in is bar.

Hope this helps!
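
In other words, something like the following (the variable name
nn.http.address is just a placeholder I picked for illustration; any name
works as long as the config file and the -D option agree):

In hdfs-site.xml:

  <property>
    <name>dfs.namenode.http-address</name>
    <value>${nn.http.address}</value>
  </property>

In hadoop-env.sh:

  export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -Dnn.http.address=0.0.0.0:50071"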

On Fri, Oct 12, 2012 at 5:13 PM, Guillaume Polaert <gp...@cyres.fr> wrote:
> Hello,
>
> I'm trying to start a NN using the -Dproperty=value functionality.
> I've modified hadoop-env.sh like this :
>    export HADOOP_NAMENODE_OPTS= ... -Ddfs.namenode.http-address=0.0.0.0:50071 ...
>
> and I've launched the daemon with hadoop-hdfs-namenode script.
>
> ps -efHww returns  ".../java -Dproc_namenode ... -Dhadoop.security.logger=INFO,RFAS -Ddfs.namenode.http-address=0.0.0.0:50071 -Dhdfs.audit.logger=INFO,NullAppender ..."
>
> But the property has always the same value : 50070 (default port).
>
> Is "-D-substitution" working for this use case ?
>
> Thanks, Guillaume
>



--
Harsh J

RE: Start NN with dynamic property (option -D)

Posted by Guillaume Polaert <gp...@cyres.fr>.
It's a good news, I'll try it soon.
Thanks you :)

Guillaume Polaert 

-----Message d'origine-----
De : Harsh J [mailto:harsh@cloudera.com] 
Envoyé : vendredi 12 octobre 2012 17:21
À : user@hadoop.apache.org
Objet : Re: Start NN with dynamic property (option -D)

Hi Guillaume,

See https://issues.apache.org/jira/browse/HDFS-2580 - This improvement was recently added to trunk but is not in any Apache release I know of yet.

I instead make things like this work via the Configuration's ability to use System Properties (from the JVM) when substituting. So if your config files have <value>${foo}</value> and HADOOP_NAMENODE_OPTS has -Dfoo=bar, then the <value> passed in is bar.

Hope this helps!

On Fri, Oct 12, 2012 at 5:13 PM, Guillaume Polaert <gp...@cyres.fr> wrote:
> Hello,
>
> I'm trying to start a NN using the -Dproperty=value functionality.
> I've modified hadoop-env.sh like this :
>    export HADOOP_NAMENODE_OPTS= ... -Ddfs.namenode.http-address=0.0.0.0:50071 ...
>
> and I've launched the daemon with hadoop-hdfs-namenode script.
>
> ps -efHww returns  ".../java -Dproc_namenode ... -Dhadoop.security.logger=INFO,RFAS -Ddfs.namenode.http-address=0.0.0.0:50071 -Dhdfs.audit.logger=INFO,NullAppender ..."
>
> But the property has always the same value : 50070 (default port).
>
> Is "-D-substitution" working for this use case ?
>
> Thanks, Guillaume
>



--
Harsh J

Re: Start NN with dynamic property (option -D)

Posted by Harsh J <ha...@cloudera.com>.
Hi Guillaume,

See https://issues.apache.org/jira/browse/HDFS-2580 - This improvement
was recently added to trunk but is not in any Apache release I know of
yet.

I instead make things like this work via the Configuration's ability
to use System Properties (from the JVM) when substituting. So if your
config files have <value>${foo}</value> and HADOOP_NAMENODE_OPTS has
-Dfoo=bar, then the <value> passed in is bar.
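
For example, something along these lines should do it (nn.http.port is
just a placeholder property name I'm using for illustration, not a
built-in Hadoop key):

In hdfs-site.xml:

  <property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:${nn.http.port}</value>
  </property>

In hadoop-env.sh:

  export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -Dnn.http.port=50071"

Configuration resolves ${nn.http.port} from the JVM system property when
the NameNode reads its config, so the web UI should come up on 50071.
Just make sure the property is always defined, otherwise the ${...}
token is left unsubstituted.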

Hope this helps!

On Fri, Oct 12, 2012 at 5:13 PM, Guillaume Polaert <gp...@cyres.fr> wrote:
> Hello,
>
> I'm trying to start a NN using the -Dproperty=value functionality.
> I've modified hadoop-env.sh like this :
>    export HADOOP_NAMENODE_OPTS= ... -Ddfs.namenode.http-address=0.0.0.0:50071 ...
>
> and I've launched the daemon with hadoop-hdfs-namenode script.
>
> ps -efHww returns  ".../java -Dproc_namenode ... -Dhadoop.security.logger=INFO,RFAS -Ddfs.namenode.http-address=0.0.0.0:50071 -Dhdfs.audit.logger=INFO,NullAppender ..."
>
> But the property has always the same value : 50070 (default port).
>
> Is "-D-substitution" working for this use case ?
>
> Thanks, Guillaume
>



-- 
Harsh J

Start NN with dynamic property (option -D)

Posted by Guillaume Polaert <gp...@cyres.fr>.
Hello,

I'm trying to start a NN using the -Dproperty=value functionality.
I've modified hadoop-env.sh like this:
   export HADOOP_NAMENODE_OPTS= ... -Ddfs.namenode.http-address=0.0.0.0:50071 ...

and I've launched the daemon with the hadoop-hdfs-namenode script.

ps -efHww returns ".../java -Dproc_namenode ... -Dhadoop.security.logger=INFO,RFAS -Ddfs.namenode.http-address=0.0.0.0:50071 -Dhdfs.audit.logger=INFO,NullAppender ..."

But the property still has the same value: 50070 (the default port).

Is "-D" substitution working for this use case?

Thanks, Guillaume

