Posted to user@spark.apache.org by Artemis User <ar...@dtechspace.com> on 2022/10/26 19:17:07 UTC

Dynamic Scaling without Kubernetes

Has anyone tried to make a Spark cluster dynamically scalable, i.e., 
automatically adding a new worker node to the cluster when no more 
executors are available as new jobs are submitted?  We need to keep the 
whole cluster on-prem and really lightweight, so standalone mode is 
preferred, and no K8s if possible.  Any suggestions?  Thanks in advance!
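
For reference, in standalone mode a new worker can be attached to a 
running master at any time. A minimal sketch, assuming a Spark 3.x 
distribution installed on the new node and a master reachable at 
spark-master:7077 (older releases name the script start-slave.sh; host 
name and resource values here are illustrative only):

    # run on the new node; cores/memory are what this worker offers
    $SPARK_HOME/sbin/start-worker.sh spark://spark-master:7077 \
        --cores 4 --memory 8g

The master registers the new worker automatically, and waiting 
applications can then be granted executors on it. The open question is 
what triggers this command.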

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Dynamic Scaling without Kubernetes

Posted by Mich Talebzadeh <mi...@gmail.com>.
I am a bit late on this.

Managed K8s offerings like GKE do not use YARN.

We now have the option of autoscaling with, say, Google Dataproc
<https://cloud.google.com/dataproc>, which scales a cluster automatically
once an autoscaling policy is enabled
<https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling#gcloud-command>.
There are limitations, for example around support for Spark Structured
Streaming.
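
A minimal sketch of such a policy (field names follow the Dataproc
autoscaling docs; the policy and cluster names and all values are
illustrative only):

    # policy.yaml
    workerConfig:
      minInstances: 2
      maxInstances: 10
    basicAlgorithm:
      cooldownPeriod: 4m
      yarnConfig:
        scaleUpFactor: 0.5
        scaleDownFactor: 1.0
        gracefulDecommissionTimeout: 1h

    # import the policy and attach it to an existing cluster
    gcloud dataproc autoscaling-policies import my-policy \
        --source=policy.yaml --region=us-central1
    gcloud dataproc clusters update my-cluster \
        --autoscaling-policy=my-policy --region=us-central1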


When I read the points raised above, the question that comes to mind is
where these extra resources would come from.


Recent tests have shown that, depending on the job, adding additional nodes
to the cluster incurs delays and occasional freezes of Spark jobs. However,
that may be an acceptable price to pay, as there is really no guarantee that
having an additional standby node is going to resolve the problem.


I therefore believe that deploying K8s through compute virtual clusters
would be the more prudent option.


HTH





On Wed, 26 Oct 2022 at 21:29, Artemis User <ar...@dtechspace.com> wrote:

> Wouldn't you need to run Spark on Hadoop in order to use YARN?  I believe
> that YARN only manages Hadoop nodes, not Spark workers directly.  Besides,
> from what I have read, you would need some extra plug-ins to get nodes
> managed dynamically.
>
> Our use case would be like this:
>
>    1. A Spark cluster is launched with some fixed number of initial nodes.
>    2. As the workload reaches maximum capacity (e.g., no executors are
>    available), a new job submission is rejected or has to wait in the queue.
>    3. A new worker node is then instantiated (e.g., a pre-configured
>    container hosting a worker node is created and started) to take the extra
>    workload so new jobs can be submitted.
>    4. Optional:  If some worker nodes have been idle for a while, they
>    can be stopped or removed from the cluster.
>
> I guess an external Spark monitor or manager would be needed to keep an
> eye on the workload of the cluster and on submission status, so that it
> can launch or remove nodes.  This shouldn't be difficult to build, and it
> would avoid dealing with a complex framework like k8s, which isn't really
> designed for small-scale, on-prem use of Spark and requires dedicated
> admin resources.
>
> On 10/26/22 3:20 PM, Holden Karau wrote:
>
> So Spark can dynamically scale on YARN, but standalone mode becomes a bit
> complicated — where do you envision Spark gets the extra resources from?
>
> On Wed, Oct 26, 2022 at 12:18 PM Artemis User <ar...@dtechspace.com>
> wrote:
>
>> Has anyone tried to make a Spark cluster dynamically scalable, i.e.,
>> automatically adding a new worker node to the cluster when no more
>> executors are available as new jobs are submitted?  We need to keep the
>> whole cluster on-prem and really lightweight, so standalone mode is
>> preferred, and no K8s if possible.  Any suggestions?  Thanks in advance!
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
>

Re: Dynamic Scaling without Kubernetes

Posted by Artemis User <ar...@dtechspace.com>.
Wouldn't you need to run Spark on Hadoop in order to use YARN?  I 
believe that YARN only manages Hadoop nodes, not Spark workers 
directly.  Besides, from what I have read, you would need some extra 
plug-ins to get nodes managed dynamically.

Our use case would be like this:

 1. A Spark cluster is launched with some fixed number of initial nodes.
 2. As the workload reaches maximum capacity (e.g., no executors are
    available), a new job submission is rejected or has to wait in the queue.
 3. A new worker node is then instantiated (e.g., a pre-configured
    container hosting a worker node is created and started) to take the
    extra workload so new jobs can be submitted.
 4. Optional:  If some worker nodes have been idle for a while, they can
    be stopped or removed from the cluster.

I guess an external Spark monitor or manager would be needed to keep an 
eye on the workload of the cluster and on submission status, so that it 
can launch or remove nodes.  This shouldn't be difficult to build, and it 
would avoid dealing with a complex framework like k8s, which isn't really 
designed for small-scale, on-prem use of Spark and requires dedicated 
admin resources.
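
As a rough illustration, such a monitor could poll the standalone 
master's JSON status page (the /json endpoint of the master web UI) and 
start pre-built worker containers when the cluster is saturated. A 
minimal sketch in Python, assuming Docker-hosted workers, a hypothetical 
my-spark-worker image, and the cores/coresused fields the master reports 
in recent Spark releases (host, image name, and thresholds are 
illustrative only):

    import subprocess
    import time

    import requests

    MASTER_JSON = "http://spark-master:8080/json/"  # master status endpoint
    WORKER_IMAGE = "my-spark-worker"  # hypothetical pre-configured image
    POLL_SECS = 30

    def cluster_state():
        """Fetch the master's status (workers, cores, coresused, ...)."""
        return requests.get(MASTER_JSON, timeout=10).json()

    def scale_up():
        """Step 3: start one more pre-configured worker container."""
        subprocess.run(["docker", "run", "-d", WORKER_IMAGE], check=True)

    while True:
        state = cluster_state()
        # Steps 2-3: if every core is in use, new jobs would have to queue,
        # so add a worker.  Step 4 (removing idle workers) would need
        # per-worker idle tracking and is omitted here.
        if state["coresused"] >= state["cores"]:
            scale_up()
        time.sleep(POLL_SECS)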


On 10/26/22 3:20 PM, Holden Karau wrote:
> So Spark can dynamically scale on YARN, but standalone mode becomes a 
> bit complicated — where do you envision Spark gets the extra resources 
> from?
>
> On Wed, Oct 26, 2022 at 12:18 PM Artemis User <ar...@dtechspace.com> 
> wrote:
>
>     Has anyone tried to make a Spark cluster dynamically scalable, i.e.,
>     automatically adding a new worker node to the cluster when no more
>     executors are available as new jobs are submitted?  We need to keep
>     the whole cluster on-prem and really lightweight, so standalone mode
>     is preferred, and no K8s if possible.  Any suggestions?  Thanks in
>     advance!
>
>     ---------------------------------------------------------------------
>     To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): 
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Dynamic Scaling without Kubernetes

Posted by Holden Karau <ho...@pigscanfly.ca>.
So Spark can dynamically scale on YARN, but standalone mode becomes a bit
complicated — where do you envision Spark gets the extra resources from?
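
For context, on YARN this is done with Spark's dynamic allocation, which
grows and shrinks the number of executors rather than worker nodes. A
sketch of a typical spark-defaults.conf, with illustrative values, assuming
the external shuffle service has been set up on the YARN NodeManagers:

    spark.dynamicAllocation.enabled        true
    spark.dynamicAllocation.minExecutors   1
    spark.dynamicAllocation.maxExecutors   20
    # preserves shuffle files when idle executors are removed
    spark.shuffle.service.enabled          true

This only scales within the nodes YARN already manages; adding whole worker
nodes is the part that standalone mode leaves open.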

On Wed, Oct 26, 2022 at 12:18 PM Artemis User <ar...@dtechspace.com>
wrote:

> Has anyone tried to make a Spark cluster dynamically scalable, i.e.,
> automatically adding a new worker node to the cluster when no more
> executors are available as new jobs are submitted?  We need to keep the
> whole cluster on-prem and really lightweight, so standalone mode is
> preferred, and no K8s if possible.  Any suggestions?  Thanks in advance!
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau