Posted to dev@spark.apache.org by Rob Vesse <rv...@dotnetrdf.org> on 2021/12/06 11:17:25 UTC

Re: Current Spark on K8S Scheduling

Mich

 

So there are several things potentially going on here.

 

The initial Spark Submit creates just the driver pod.  Once the driver pod starts, the various processes, including the Spark master, have to start up, and this takes a non-zero amount of time (hence the 42 second difference you see).  Once it has reached the point where it can actually start launching executors, it uses the K8S service account credentials of the pod to talk to the K8S API to start requesting executors.  Depending on the number of executors requested, the driver will ask for these in chunks, i.e. it will ask for N executors, wait briefly, and then ask for another N executors, where N is a controllable batch size (default 5) managed via the spark.kubernetes.allocation.batch.size property, and the delay between batches of executors (default 1s) is controlled by the spark.kubernetes.allocation.batch.delay property.  This is why you can see slight differences in the ages of the executor pods.  This batch-based allocation is done to avoid overwhelming the K8S API server and effectively DDoS’ing the cluster by asking for all the pods at once.
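For anyone who wants to play with these knobs, a minimal spark-submit sketch (the API server address, image, namespace, service account, executor count and example jar are all purely illustrative placeholders):

# spark.kubernetes.allocation.batch.size  -> how many executor pods are requested per batch
# spark.kubernetes.allocation.batch.delay -> pause between batches
$SPARK_HOME/bin/spark-submit \
  --master k8s://https://<k8s-api-server>:6443 \
  --deploy-mode cluster \
  --name my-spark-job \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=10 \
  --conf spark.kubernetes.allocation.batch.size=5 \
  --conf spark.kubernetes.allocation.batch.delay=1s \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<registry>/my-spark:3.2.0 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0.jar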

 

Now, as to executor pods coming up in a random order: this is due to the K8S scheduling behaviour and how busy the kubelet on a given node is.  The executor pods are going to be scheduled onto one or more physical (or virtual) nodes, and then the kubelet on each node is responsible for bringing those up using whatever underlying container runtime is used on your K8S cluster.  Executor pods can start faster or slower for various reasons, including things like whether a given node already has the necessary image pulled locally (and your job's configured image pull policy), how busy a given node in the cluster is with other pod operations, the responsiveness of the container runtime on a given node, etc.
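If you want to dig into why a particular executor pod was slower to come up, the scheduling and image-pull history is visible in the pod events (the pod name below is just an example taken from a listing like the one further down):

# which node did each pod land on?
kubectl get pods -n spark -o wide

# the Events section shows Scheduled, Pulling/Pulled image, Created and Started
kubectl describe pod randomdatabigquery-d3de0e7d7d8ea9f4-exec-3 -n spark

# or look at all recent events in the namespace in time order
kubectl get events -n spark --sort-by=.lastTimestamp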

 

As the executors start, they'll report back to the driver, and then the actual job will proceed, assuming sufficient executors were successfully launched.

 

Hope this helps,

 

Rob

 

From: Mich Talebzadeh <mi...@gmail.com>
Date: Friday, 3 December 2021 at 00:18
To: Yikun Jiang <yi...@gmail.com>
Cc: dev <de...@spark.apache.org>
Subject: Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

 

Some colleagues enquired about the driver pod starting before the executor pods. With the current k8s setup on Google Kubernetes Engine, the driver pod starts first, before the executor pods. This can be seen from the output of kubectl get pods -n spark here:

 

NAME                                         READY   STATUS              RESTARTS   AGE

randomdatabigquery-d3de0e7d7d8ea9f4-exec-1   0/1     ContainerCreating   0          1s

randomdatabigquery-d3de0e7d7d8ea9f4-exec-2   0/1     ContainerCreating   0          1s

randomdatabigquery-d3de0e7d7d8ea9f4-exec-3   1/1     Running             0          1s

randomdatabigquery-d3de0e7d7d8ea9f4-exec-4   0/1     ContainerCreating   0          1s

randomdatabigquery-d3de0e7d7d8ea9f4-exec-5   0/1     ContainerCreating   0          1s

randomdatabigquery-d3de0e7d7d8ea9f4-exec-6   0/1     ContainerCreating   0          0s

randomdatabigquery-d3de0e7d7d8ea9f4-exec-7   0/1     ContainerCreating   0          0s

randomdatabigquery-d3de0e7d7d8ea9f4-exec-8   0/1     Pending             0          0s

sparkbq-27747e7d7d8e0f55-driver              1/1     Running             0          42s

 

Note that the driver pod had been running for 42 seconds before the executor pods kicked in. The order of the executors seems to be random; for example, exec-3 was already running before exec-1, etc.
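One way to watch the ordering as it happens, rather than sampling it with repeated kubectl get pods calls, is something along these lines (purely a suggestion):

# stream pod status changes as the driver requests executors
kubectl get pods -n spark -w

# or, after the fact, list the pods in the order the API server created them
kubectl get pods -n spark --sort-by=.metadata.creationTimestamp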

 

HTH

 

 

   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 

 

 

 

On Thu, 2 Dec 2021 at 08:24, Mich Talebzadeh <mi...@gmail.com> wrote:

I am referring to this revised diagram below

 


My understanding is that it is the driver pod that creates the executors, which in turn are scheduled by the Kube Scheduler responsible for scheduling applications or containers on Nodes. So there should be a solid line from Driver Pod to Executor Pods and a dotted line from Volcano Scheduler to Executor Pod.
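As a side note, one quick way to confirm that the executor pods are indeed created (and owned) by the driver pod, independent of which scheduler places them, is to look at their ownerReferences (pod names are placeholders):

kubectl get pod <executor-pod-name> -n spark \
  -o jsonpath='{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}'
# expected output is the driver pod, e.g.  Pod/<driver-pod-name>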

 

Thanks

 

   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 

 

 

 

On Thu, 2 Dec 2021 at 01:51, Yikun Jiang <yi...@gmail.com> wrote:

> Thank you Yikun for the info, and thanks for inviting me to a meeting to discuss this.

> I appreciate your effort to put these together, and I agree that the purpose is to make Spark easy/flexible enough to support other K8s schedulers (not just for Volcano).

> As discussed, could you please help to abstract out the things in common and allow Spark to plug different implementations? I'd be happy to work with you guys on this issue.

 

Thanks for the support from the YuniKorn side.

 

As @weiwei mentioned, yesterday we had an initial meeting, which went well; we have reached an initial consensus.

We will also abstract out the common part to make the commonalities clear, and provide a way to allow a variety of schedulers to do custom extensions.


Regards,

Yikun

 

 

Weiwei Yang <ww...@apache.org> wrote on Thursday, 2 December 2021 at 02:00:

Thank you Yikun for the info, and thanks for inviting me to a meeting to discuss this.

I appreciate your effort to put these together, and I agree that the purpose is to make Spark easy/flexible enough to support other K8s schedulers (not just for Volcano).

As discussed, could you please help to abstract out the things in common and allow Spark to plug different implementations? I'd be happy to work with you guys on this issue.

 

 

On Tue, Nov 30, 2021 at 6:49 PM Yikun Jiang <yi...@gmail.com> wrote:

@Weiwei @Chenya

 

> Thanks for bringing this up. This is quite interesting; we definitely should participate more in the discussions.

Thanks for your reply, and welcome to the discussion; I think the input from YuniKorn is very important.

> The main thing here is, the Spark community should make Spark pluggable in order to support other schedulers, not just for Volcano. It looks like this proposal is pushing really hard for adopting PodGroup, which isn't part of K8s yet; that, to me, is problematic.

Definitely yes, we are on the same page.

I think we have the same goal: propose a general and reasonable mechanism to make spark on k8s with a custom scheduler more usable.

But for PodGroup, allow me to give a brief introduction:
- The PodGroup definition has been approved by Kubernetes officially in KEP-583. [1]
- It can be regarded as a general concept/standard in Kubernetes rather than a Volcano-specific concept; there are also other implementations of it, such as [2][3].
- Kubernetes recommends using CRDs to build this kind of extension. [4]
- Volcano, as one such extension, provides an interface to maintain the lifecycle of the PodGroup CRD and uses volcano-scheduler to do the scheduling (a small illustrative sketch follows the references below).

[1] https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/583-coscheduling

[2] https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling#podgroup
[3] https://github.com/kubernetes-sigs/kube-batch
[4] https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/
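For anyone who has not seen one, here is a rough sketch of what a Volcano PodGroup and the matching pod settings look like (the names, queue and resource numbers are purely illustrative; please check the Volcano docs for the exact, current API):

# a PodGroup asking the scheduler to wait until 6 pods plus the listed
# resources can all be scheduled together (gang scheduling)
cat <<'EOF' | kubectl apply -f -
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-job-pg
  namespace: spark
spec:
  minMember: 6
  minResources:
    cpu: "6"
    memory: 24Gi
  queue: default
EOF

# pods join the group via an annotation and use the Volcano scheduler:
#   metadata:
#     annotations:
#       scheduling.k8s.io/group-name: spark-job-pg
#   spec:
#     schedulerName: volcano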

 

Regards,

Yikun

 

 

Weiwei Yang <ww...@apache.org> wrote on Wednesday, 1 December 2021 at 05:57:

Hi Chenya

 

Thanks for bringing this up. This is quite interesting; we definitely should participate more in the discussions.

The main thing here is, the Spark community should make Spark pluggable in order to support other schedulers, not just for Volcano. It looks like this proposal is pushing really hard for adopting PodGroup, which isn't part of K8s yet; that, to me, is problematic.

 

On Tue, Nov 30, 2021 at 9:21 AM Prasad Paravatha <pr...@gmail.com> wrote:

This is a great feature/idea. 

I'd love to get involved in some form (testing and/or documentation). This could be my 1st contribution to Spark!

 

On Tue, Nov 30, 2021 at 10:46 PM John Zhuge <jz...@apache.org> wrote:

+1 Kudos to Yikun and the community for starting the discussion!

 

On Tue, Nov 30, 2021 at 8:47 AM Chenya Zhang <ch...@gmail.com> wrote:

Thanks folks for bringing up the topic of natively integrating Volcano and other alternative schedulers into Spark!

 

+Weiwei, Wilfred, Chaoran. We would love to contribute to the discussion as well. 

 

From our side, we have been using and improving on one alternative resource scheduler, Apache YuniKorn (https://yunikorn.apache.org/), for Spark on Kubernetes in production at Apple with solid results in the past year. It is capable of supporting Gang scheduling (similar to PodGroups), multi-tenant resource queues (similar to YARN), FIFO, and other handy features like bin packing to enable efficient autoscaling, etc. 

 

Natively integrating with Spark would provide more flexibility for users and reduce the extra cost and potential inconsistency of maintaining different layers of resource strategies. One interesting topic we hope to discuss more about is dynamic allocation, which would benefit from native coordination between Spark and resource schedulers in K8s & cloud environment for an optimal resource efficiency. 

 

 

On Tue, Nov 30, 2021 at 8:10 AM Holden Karau <ho...@pigscanfly.ca> wrote:

Thanks for putting this together, I’m really excited for us to add better batch scheduling integrations.

 

On Tue, Nov 30, 2021 at 12:46 AM Yikun Jiang <yi...@gmail.com> wrote:

Hey everyone,

I'd like to start a discussion on "Support Volcano/Alternative Schedulers Proposal".

This SPIP proposes to make Spark's k8s scheduling provide more YARN-like features (such as queues and minimum resources before scheduling jobs) that many folks want on Kubernetes.

The goal of this SPIP is to improve the current Spark k8s scheduler implementation, add the ability to do batch scheduling, and support Volcano as one of the implementations.

Design doc: https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg

JIRA: https://issues.apache.org/jira/browse/SPARK-36057

Some of the PRs:
Ability to create resources https://github.com/apache/spark/pull/34599

Add PodGroupFeatureStep: https://github.com/apache/spark/pull/34456


Regards,

Yikun

-- 

Twitter: https://twitter.com/holdenkarau

Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau


 

-- 

John Zhuge


 

-- 

Regards,
Prasad Paravatha


Re: Current Spark on K8S Scheduling

Posted by Saikat Kanjilal <sx...@hotmail.com>.
Hello All,
Coming into this discussion brand new, I am interested in this effort, but from a preemption and job-priority scheduling perspective: will the design of the proposed scheduler support address this aspect, namely that higher-priority jobs don't get trumped by a set of lower-priority jobs, and, on the flip side, that lower-priority jobs don't run into starvation where they are constantly preempted by higher-priority jobs? Please let me know if this is outside the scope of this effort; otherwise I would love to get involved, and if it's OK I can add comments to the design doc.
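For what it's worth, plain Kubernetes already expresses priority and preemption through PriorityClass objects, and Spark pods can pick those up via pod templates; a minimal sketch (the class name, value and template paths are purely illustrative, and how Volcano/YuniKorn honour or extend this is presumably exactly what the design doc would need to spell out):

cat <<'EOF' | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spark-batch-low
value: 1000
globalDefault: false
description: "Low-priority batch Spark jobs; may be preempted by higher-priority classes"
EOF

# referenced from a pod template passed to Spark via
#   --conf spark.kubernetes.driver.podTemplateFile=/path/to/driver-template.yaml
#   --conf spark.kubernetes.executor.podTemplateFile=/path/to/executor-template.yaml
# where the template contains:
#   apiVersion: v1
#   kind: Pod
#   spec:
#     priorityClassName: spark-batch-low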
Thanks


Re: Current Spark on K8S Scheduling

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Rob,

Thanks for the explanation. For the past few days, I have been thinking about how to show this Spark-on-k8s process graphically, so I built the following diagram for GKE (Google), but it should hold true for any other cluster (EKS, AKS, etc.).

[image: image.png]


   1. Processes 1-3 deal with the request to the kube-apiserver to schedule the creation of the driver pod; on completion, process 4 asks the kube-apiserver to schedule the creation of the executors. Processes 5 and 6 deal with the executors. From a practical point of view I don't think the order in which the executors are created matters. However, the crucial thing is that the number of executors is left to the one submitting the job to decide, through --conf spark.executor.instances=$NEXEC.
   2. I have a GKE cluster of three nodes, each an e2-standard-4 (4 vCPUs, 16 GB memory), so nothing special. However, the current model, I believe, is built on one pod, one container/executor. I set up the following parameters as well:

               --conf spark.kubernetes.allocation.batch.size=3 \
               --conf spark.kubernetes.allocation.batch.delay=1 \

$NEXEC = 6 in my configuration, so according to these settings, k8s will launch a batch of 3 first, followed by another batch of 3 a second later. That may well be the reason why the executors now come up in order?

k get pods -n spark
NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-a5dea67d90c4097a-exec-1   1/1     Running   0          50s
randomdatabigquery-a5dea67d90c4097a-exec-2   1/1     Running   0          50s
randomdatabigquery-a5dea67d90c4097a-exec-3   1/1     Running   0          50s
randomdatabigquery-a5dea67d90c4097a-exec-4   1/1     Running   0          50s
randomdatabigquery-a5dea67d90c4097a-exec-5   1/1     Running   0          50s
randomdatabigquery-a5dea67d90c4097a-exec-6   1/1     Running   0          50s
sparkbq-ad1eb27d90c366bd-driver              1/1     Running   0          93s

k get pods -n spark
NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-a5dea67d90c4097a-exec-1   1/1     Running   0          90s
randomdatabigquery-a5dea67d90c4097a-exec-2   1/1     Running   0          90s
randomdatabigquery-a5dea67d90c4097a-exec-3   1/1     Running   0          90s
randomdatabigquery-a5dea67d90c4097a-exec-4   1/1     Running   0          90s
randomdatabigquery-a5dea67d90c4097a-exec-5   1/1     Running   0          90s
randomdatabigquery-a5dea67d90c4097a-exec-6   1/1     Running   0          90s
sparkbq-ad1eb27d90c366bd-driver              1/1     Running   0          2m13s

However, is this something we ought to care about? Another point: do we care whether each node handles two pods with one executor each, or one pod with two executors in it? By definition, pods are the smallest, most basic deployable objects in Kubernetes; a Pod represents a single instance of a running process in the Kubernetes cluster, and it can contain one or more containers, such as Docker containers.
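In case it is useful, one way to see how the executor pods actually land across the nodes (namespace as in the listings above, otherwise purely illustrative):

# -o wide adds the NODE column; sorting by node name groups co-located pods together
kubectl get pods -n spark -o wide --sort-by=.spec.nodeName

As far as I know, Spark on k8s currently runs exactly one executor JVM per executor pod, so how the pods are spread across the nodes is down to the scheduler rather than to Spark itself.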


cheers



   view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



