Posted to user@spark.apache.org by Victor Tso-Guillen <vt...@paxata.com> on 2014/12/03 18:41:51 UTC

Re: heterogeneous cluster setup

I don't have a great answer for you. For us, we found a common divisor of
the available memory across the different hardware (not necessarily a whole
number of gigabytes) and used that as the amount of memory per worker, then
scaled the number of cores so that every core in the system has the same
amount of memory behind it. The quotient of the available memory and the
common divisor, hopefully a whole number to reduce waste, was the number of
workers we spun up on each machine. For example, if you have 64G, 30G, and
15G of available memory on your machines, the divisor could be 15G and you'd
have 4, 2, and 1 workers per machine respectively. Every worker on all the
machines would have the same number of cores, set to what you think is a
good value.
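
On a standalone cluster, this scheme maps onto the per-machine worker
settings in conf/spark-env.sh. Here is a rough sketch for the 64G/30G/15G
example above; the core count and exact memory values are illustrative, not
something to copy verbatim:

    # 64G machine: four workers of 15g each, same core count everywhere
    SPARK_WORKER_INSTANCES=4
    SPARK_WORKER_MEMORY=15g
    SPARK_WORKER_CORES=4

    # 30G machine: SPARK_WORKER_INSTANCES=2, same memory/cores per worker
    # 15G machine: SPARK_WORKER_INSTANCES=1, same memory/cores per worker

When SPARK_WORKER_INSTANCES is greater than one, the memory and core
settings apply to each worker instance, which is what keeps every worker in
the cluster the same size.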

Hope that helps.

On Wed, Dec 3, 2014 at 7:44 AM, <ka...@gmail.com> wrote:

> Hi Victor,
>
> I want to set up a heterogeneous stand-alone Spark cluster. I have hardware
> with different memory sizes and a varying number of cores per node. I could
> only get all the nodes active in the cluster when the memory per executor
> was set to the smallest memory available on any node, and likewise for the
> number of cores per executor. As of now, I configure one executor per node.
>
> Can you please suggest some path to set up a stand-alone heterogeneous
> cluster such that I can efficiently use the available hardware?
>
> Thank you
>
>
>
>
> _____________________________________
> Sent from http://apache-spark-user-list.1001560.n3.nabble.com
>
>

Re: heterogeneous cluster setup

Posted by Victor Tso-Guillen <vt...@paxata.com>.
To reiterate, it's very important for Spark's workers to have the same
memory available. Think about Spark uniformly chopping up your data and
distributing the work to the nodes. The algorithm is not designed to
consider that a worker has less memory available than some other worker.
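
On the application side, that uniformity means the executor size you request
at submit time only has to fit a single common worker size, instead of the
smallest machine in the cluster. Purely as a sketch, with the master URL,
memory value, and jar as placeholders:

    spark-submit --master spark://master:7077 \
      --executor-memory 14g \
      --total-executor-cores 16 \
      your-app.jar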

On Thu, Dec 4, 2014 at 12:11 AM, rapelly kartheek <ka...@gmail.com>
wrote:

>
> *It's very important for Spark's workers to have the same resources
> available*
>
> So, each worker should have the same amount of memory and the same number
> of cores. Heterogeneity of the cluster in the physical layout of the CPUs
> is understandable, but how about heterogeneity with respect to memory?
>
> On Thu, Dec 4, 2014 at 12:18 PM, Victor Tso-Guillen <vt...@paxata.com>
> wrote:
>
>> You'll have to decide which resource is more expensive in your
>> heterogeneous environment and optimize for the utilization of that. For
>> example, you may decide that memory is the only cost factor and you can
>> discount the number of cores. Then you could have 8GB on each worker, each
>> with four cores. Note that cores in Spark don't necessarily map to cores
>> on the machine; it's just a configuration setting for how many
>> simultaneous tasks that worker can work on.
>>
>> You are right that each executor gets the same amount of resources, and I
>> would add the same level of parallelism. Your heterogeneity is in the
>> physical layout of your cluster, not in how Spark treats the workers as
>> resources. It's very important for Spark's workers to have the same
>> resources available, because Spark needs to be able to generically divide
>> and conquer your data amongst all those workers.
>>
>> Hope that helps,
>> Victor
>>
>> On Wed, Dec 3, 2014 at 10:04 PM, rapelly kartheek <
>> kartheek.mbms@gmail.com> wrote:
>>
>>> Thank you so much for the valuable reply, Victor. That's a very clear
>>> solution; I understand it now.
>>>
>>> Right now I have nodes with:
>>> 16GB RAM, 4 cores; 8GB RAM, 4 cores; 8GB RAM, 2 cores. From my
>>> understanding, the division could be something like each executor having
>>> 2 cores and 6GB RAM. So, the ones with 16GB RAM and 4 cores can have two
>>> executors. Please let me know if my understanding is correct.
>>>
>>> But I am not able to see any heterogeneity in this setting, as each
>>> executor gets the same amount of resources. Can you please clarify this
>>> doubt?
>>>
>>> Regards
>>> Karthik
>>>
>>> On Wed, Dec 3, 2014 at 11:11 PM, Victor Tso-Guillen <vt...@paxata.com>
>>> wrote:
>>>
>>>> I don't have a great answer for you. For us, we found a common divisor of
>>>> the available memory across the different hardware (not necessarily a whole
>>>> number of gigabytes) and used that as the amount of memory per worker, then
>>>> scaled the number of cores so that every core in the system has the same
>>>> amount of memory behind it. The quotient of the available memory and the
>>>> common divisor, hopefully a whole number to reduce waste, was the number of
>>>> workers we spun up on each machine. For example, if you have 64G, 30G, and
>>>> 15G of available memory on your machines, the divisor could be 15G and you'd
>>>> have 4, 2, and 1 workers per machine respectively. Every worker on all the
>>>> machines would have the same number of cores, set to what you think is a
>>>> good value.
>>>>
>>>> Hope that helps.
>>>>
>>>> On Wed, Dec 3, 2014 at 7:44 AM, <ka...@gmail.com> wrote:
>>>>
>>>>> Hi Victor,
>>>>>
>>>>> I want to set up a heterogeneous stand-alone Spark cluster. I have
>>>>> hardware with different memory sizes and a varying number of cores per
>>>>> node. I could only get all the nodes active in the cluster when the
>>>>> memory per executor was set to the smallest memory available on any node,
>>>>> and likewise for the number of cores per executor. As of now, I configure
>>>>> one executor per node.
>>>>>
>>>>> Can you please suggest some path to set up a stand-alone heterogeneous
>>>>> cluster such that I can efficiently use the available hardware?
>>>>>
>>>>> Thank you
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _____________________________________
>>>>> Sent from http://apache-spark-user-list.1001560.n3.nabble.com
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: heterogeneous cluster setup

Posted by Victor Tso-Guillen <vt...@paxata.com>.
You'll have to decide which resource is more expensive in your heterogeneous
environment and optimize for the utilization of that. For example, you may
decide that memory is the only cost factor and you can discount the number
of cores. Then you could have 8GB on each worker, each with four cores. Note
that cores in Spark don't necessarily map to cores on the machine; it's just
a configuration setting for how many simultaneous tasks that worker can work
on.
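
As a tiny illustration (the numbers are made up): on a 4-core box you could
advertise more or fewer scheduling slots than the hardware has, and Spark
will simply run that many tasks concurrently on the worker.

    # spark-env.sh on a 4-core machine
    SPARK_WORKER_CORES=8   # 8 concurrent task slots on 4 physical cores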

You are right that each executor gets the same amount of resources, and I
would add the same level of parallelism. Your heterogeneity is in the
physical layout of your cluster, not in how Spark treats the workers as
resources. It's very important for Spark's workers to have the same
resources available, because Spark needs to be able to generically divide
and conquer your data amongst all those workers.
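
To make that concrete for the machines you listed (16GB/4 cores, 8GB/4
cores, 8GB/2 cores), a sketch along the lines of your own 6GB/2-core split
could look like this in conf/spark-env.sh; the values are illustrative only:

    # 16GB, 4-core machine: two identical workers
    SPARK_WORKER_INSTANCES=2
    SPARK_WORKER_MEMORY=6g
    SPARK_WORKER_CORES=2

    # each 8GB machine: one worker with the same per-worker settings
    # SPARK_WORKER_INSTANCES=1

The heterogeneity then lives only in how many identical workers each
physical box hosts, not in the size of any individual worker.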

Hope that helps,
Victor

On Wed, Dec 3, 2014 at 10:04 PM, rapelly kartheek <ka...@gmail.com>
wrote:

> Thank you so much for the valuable reply, Victor. That's a very clear
> solution; I understand it now.
>
> Right now I have nodes with:
> 16GB RAM, 4 cores; 8GB RAM, 4 cores; 8GB RAM, 2 cores. From my
> understanding, the division could be something like each executor having
> 2 cores and 6GB RAM. So, the ones with 16GB RAM and 4 cores can have two
> executors. Please let me know if my understanding is correct.
>
> But I am not able to see any heterogeneity in this setting, as each
> executor gets the same amount of resources. Can you please clarify this
> doubt?
>
> Regards
> Karthik
>
> On Wed, Dec 3, 2014 at 11:11 PM, Victor Tso-Guillen <vt...@paxata.com>
> wrote:
>
>> I don't have a great answer for you. For us, we found a common divisor of
>> the available memory across the different hardware (not necessarily a whole
>> number of gigabytes) and used that as the amount of memory per worker, then
>> scaled the number of cores so that every core in the system has the same
>> amount of memory behind it. The quotient of the available memory and the
>> common divisor, hopefully a whole number to reduce waste, was the number of
>> workers we spun up on each machine. For example, if you have 64G, 30G, and
>> 15G of available memory on your machines, the divisor could be 15G and you'd
>> have 4, 2, and 1 workers per machine respectively. Every worker on all the
>> machines would have the same number of cores, set to what you think is a
>> good value.
>>
>> Hope that helps.
>>
>> On Wed, Dec 3, 2014 at 7:44 AM, <ka...@gmail.com> wrote:
>>
>>> Hi Victor,
>>>
>>> I want to set up a heterogeneous stand-alone Spark cluster. I have
>>> hardware with different memory sizes and a varying number of cores per
>>> node. I could only get all the nodes active in the cluster when the memory
>>> per executor was set to the smallest memory available on any node, and
>>> likewise for the number of cores per executor. As of now, I configure one
>>> executor per node.
>>>
>>> Can you please suggest some path to set up a stand-alone heterogeneous
>>> cluster such that I can efficiently use the available hardware?
>>>
>>> Thank you
>>>
>>>
>>>
>>>
>>> _____________________________________
>>> Sent from http://apache-spark-user-list.1001560.n3.nabble.com
>>>
>>>
>>
>