Posted to user@spark.apache.org by purna pradeep <pu...@gmail.com> on 2018/08/15 11:45:10 UTC
spark driver pod stuck in Waiting: PodInitializing state in Kubernetes
I'm running a Spark 2.3 job on a Kubernetes cluster.
kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3",
GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean",
BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc",
Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3",
GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean",
BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc",
Platform:"linux/amd64"}
When I run spark-submit against the k8s master, the driver pod gets stuck in
the Waiting: PodInitializing state. I had to manually kill the driver pod and
submit a new job; then it works.
This happens when I submit the jobs almost in parallel, i.e., five jobs one
after the other in quick succession.
I'm running Spark jobs on 20 nodes, each with the configuration below.
I ran kubectl describe node on the node where the driver pod is running, and
this is what I got. I do see that resources are overcommitted, but I expected
the Kubernetes scheduler not to schedule pods onto a node whose resources are
overcommitted or which is in the Not Ready state. In this case the node is
Ready, but I observe the same behavior when a node is in the "Not Ready" state.
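For what it's worth, the scheduler's fit check considers only resource requests, never limits, so a node whose limits sum past capacity can still accept new pods. A simplified sketch of that decision (real scheduling also weighs taints, affinity, and pod count, which this ignores):

```python
# Simplified sketch of the Kubernetes scheduler's resource fit check:
# a pod fits if the sum of existing *requests* plus its own request stays
# within the node's allocatable resources. Limits are ignored, which is
# why a node can be overcommitted on limits yet still schedulable.

def fits(allocatable_milli_cpu: int, requested_milli_cpu: int,
         pod_request_milli_cpu: int) -> bool:
    """True if a pod requesting pod_request_milli_cpu fits on the node."""
    return requested_milli_cpu + pod_request_milli_cpu <= allocatable_milli_cpu

# The node below: 16 CPUs allocatable, 7050m already requested.
# A driver requesting 1 CPU (1000m) still fits, so it gets scheduled.
print(fits(16_000, 7050, 1000))    # True
# Only when requests themselves would exceed allocatable does scheduling stop.
print(fits(16_000, 15_500, 1000))  # False
```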
Name: **********
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=****
node-role.kubernetes.io/worker=true
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: <none>
CreationTimestamp: Tue, 31 Jul 2018 09:59:24 -0400
Conditions:
Type            Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
----            ------  -----------------                ------------------               ------                      -------
OutOfDisk       False   Tue, 14 Aug 2018 09:31:20 -0400  Tue, 31 Jul 2018 09:59:24 -0400  KubeletHasSufficientDisk    kubelet has sufficient disk space available
MemoryPressure  False   Tue, 14 Aug 2018 09:31:20 -0400  Tue, 31 Jul 2018 09:59:24 -0400  KubeletHasSufficientMemory  kubelet has sufficient memory available
DiskPressure    False   Tue, 14 Aug 2018 09:31:20 -0400  Tue, 31 Jul 2018 09:59:24 -0400  KubeletHasNoDiskPressure    kubelet has no disk pressure
Ready           True    Tue, 14 Aug 2018 09:31:20 -0400  Sat, 11 Aug 2018 00:41:27 -0400  KubeletReady                kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: *****
Hostname: ******
Capacity:
cpu: 16
memory: 125827288Ki
pods: 110
Allocatable:
cpu: 16
memory: 125724888Ki
pods: 110
System Info:
Machine ID: *************
System UUID: **************
Boot ID: 1493028d-0a80-4f2f-b0f1-48d9b8910e9f
Kernel Version: 4.4.0-1062-aws
OS Image: Ubuntu 16.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://Unknown
Kubelet Version: v1.8.3
Kube-Proxy Version: v1.8.3
PodCIDR: ******
ExternalID: **************
Non-terminated Pods: (11 in total)
Namespace    Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits
---------    ----                                                          ------------  ----------  ---------------  -------------
kube-system  calico-node-gj5mb                                             250m (1%)     0 (0%)      0 (0%)           0 (0%)
kube-system  kube-proxy-****************************************          100m (0%)     0 (0%)      0 (0%)           0 (0%)
kube-system  prometheus-prometheus-node-exporter-9cntq                     100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
logging      elasticsearch-elasticsearch-data-69df997486-gqcwg             400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
logging      fluentd-fluentd-elasticsearch-tj7nd                           200m (1%)     0 (0%)      612Mi (0%)       0 (0%)
rook         rook-agent-6jtzm                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)
rook         rook-ceph-osd-10-6-42-250.accel.aws-cardda.cb4good.com-gwb8j  0 (0%)        0 (0%)      0 (0%)           0 (0%)
spark        accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1     2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
spark        accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5  2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
spark        accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver  1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
spark        accelerate-testing-2-e8bd0607cc693bc8ae25cc6dc300b2c7-driver  1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
7050m (44%) 1200m (7%) 33410Mi (27%) 45874Mi (37%)
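The percentages in that summary are each figure divided by the node's allocatable resources; a quick check of the two request columns:

```python
# Reproduce the "Allocated resources" percentages from the node's
# allocatable figures above: 16 CPUs and 125724888Ki of memory.
allocatable_cpu_m = 16 * 1000             # 16 CPUs in millicores
allocatable_mem_mi = 125_724_888 / 1024   # Ki -> Mi

cpu_req_pct = 100 * 7050 / allocatable_cpu_m      # 7050m requested
mem_req_pct = 100 * 33_410 / allocatable_mem_mi   # 33410Mi requested

print(f"{cpu_req_pct:.0f}%")  # 44%
print(f"{mem_req_pct:.0f}%")  # 27%
```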
Events: <none>
kubectl describe pod gives the message below:
Name:
accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
Namespace: spark
Node: ****
Start Time: Mon, 13 Aug 2018 16:18:34 -0400
Labels:
launch-id=k8s-submit-service-cddf45ff-0d88-4681-af85-d8ed0359ce73
spark-app-selector=spark-63f536fd87f8457796802767922ef7d9
spark-role=driver
Annotations: spark-app-name=accelerate-testing-2
Status: Pending
IP:
Init Containers:
spark-init:
Container ID:
Image: ****:v2.3.0
Image ID:
Port: <none>
Args:
init
/etc/spark-init/spark-init.properties
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/etc/spark-init from spark-init-properties (rw)
/var/run/secrets/kubernetes.io/serviceaccount from
spark-token-mj86g (ro)
/var/spark-data/spark-files from download-files-volume (rw)
/var/spark-data/spark-jars from download-jars-volume (rw)
Containers:
spark-kubernetes-driver:
Container ID:
Image: ******:v2.3.0
Image ID:
Port: <none>
Args:
driver
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
memory: 2432Mi
Requests:
cpu: 1
memory: 2Gi
Environment:
SPARK_DRIVER_MEMORY: 2g
SPARK_DRIVER_CLASS: com.myclass
SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
SPARK_MOUNTED_CLASSPATH:
/var/spark-data/spark-jars/quantum-workflow-2.2.24.0-SNAPSHOT-assembly.jar:/var/spark-data/spark-jars/my.jar
SPARK_MOUNTED_FILES_DIR: /var/spark-data/spark-files
SPARK_JAVA_OPT_0: -Dspark.kubernetes.container.image=***
SPARK_JAVA_OPT_1:
-Dspark.jars=s3a://my/my.jar,s3a://my/my1.jar
SPARK_JAVA_OPT_2: -Dspark.submit.deployMode=cluster
SPARK_JAVA_OPT_3: -Dspark.driver.blockManager.port=7079
SPARK_JAVA_OPT_4: -Dspark.executor.memory=10g
SPARK_JAVA_OPT_5: -Dspark.app.id=spark-63f536fd87f8457796802767922ef7d9
SPARK_JAVA_OPT_6:
-Dspark.kubernetes.authenticate.driver.serviceAccountName=spark
SPARK_JAVA_OPT_7: -Dspark.master=k8s://https://kubernetes.default
SPARK_JAVA_OPT_8:
-Dspark.driver.host=spark-1534191513364-driver-svc.spark.svc
SPARK_JAVA_OPT_9: -Dspark.executor.cores=2
SPARK_JAVA_OPT_10:
-Dspark.kubernetes.executor.podNamePrefix=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba
SPARK_JAVA_OPT_11: -Dspark.driver.port=7078
SPARK_JAVA_OPT_12: -Dspark.kubernetes.namespace=spark
SPARK_JAVA_OPT_13: -Dspark.executor.memoryOverhead=2g
SPARK_JAVA_OPT_14:
-Dspark.kubernetes.initContainer.configMapName=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
SPARK_JAVA_OPT_15:
-Dspark.kubernetes.initContainer.configMapKey=spark-init.properties
SPARK_JAVA_OPT_16: -Dspark.executor.instances=10
SPARK_JAVA_OPT_17: -Dspark.memory.fraction=0.6
SPARK_JAVA_OPT_18: -Dspark.driver.memory=2g
SPARK_JAVA_OPT_19: -Dspark.kubernetes.driver.pod.name=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
SPARK_JAVA_OPT_20: -Dspark.app.name=accelerate-testing-2
SPARK_JAVA_OPT_21:
-Dspark.kubernetes.driver.label.launch-id=********
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from
spark-token-mj86g (ro)
/var/spark-data/spark-files from download-files-volume (rw)
/var/spark-data/spark-jars from download-jars-volume (rw)
Conditions:
Type Status
Initialized False
Ready False
PodScheduled True
Volumes:
spark-init-properties:
Type: ConfigMap (a volume populated by a ConfigMap)
Name:
accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
Optional: false
download-jars-volume:
Type: EmptyDir (a temporary directory that shares a pod's
lifetime)
Medium:
download-files-volume:
Type: EmptyDir (a temporary directory that shares a pod's
lifetime)
Medium:
spark-token-mj86g:
Type: Secret (a volume populated by a Secret)
SecretName: spark-token-mj86g
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: <none>
Events:
Type     Reason          Age                  From                                   Message
----     ------          ---                  ----                                   -------
Normal   SandboxChanged  44m (x518 over 18h)  kubelet, ****************************  Pod sandbox changed, it will be killed and re-created.
Warning  FailedSync      19s (x540 over 18h)  kubelet, ****************************  Error syncing pod
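Until the underlying race is fixed, one pragmatic production workaround is a watchdog that deletes driver pods stuck in PodInitializing past a grace period and resubmits the job. A sketch of the stuck-pod test (the function, field names, and 10-minute threshold are my assumptions for illustration, not a Spark or Kubernetes API):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical watchdog predicate: flag a driver pod as stuck when it has
# been Pending, with a container waiting on PodInitializing, for longer
# than a grace period. A cron job could delete such pods and resubmit.
def is_stuck(phase: str, waiting_reason: str, start_time: datetime,
             now: datetime, grace: timedelta = timedelta(minutes=10)) -> bool:
    return (phase == "Pending"
            and waiting_reason == "PodInitializing"
            and now - start_time > grace)

now = datetime(2018, 8, 14, 10, 0, tzinfo=timezone.utc)
started = datetime(2018, 8, 13, 20, 18, tzinfo=timezone.utc)  # ~14h earlier
print(is_stuck("Pending", "PodInitializing", started, now))  # True
print(is_stuck("Running", "", started, now))                 # False
```

The three inputs can be read from `kubectl get pod -o json`: `.status.phase`, `.status.containerStatuses[].state.waiting.reason`, and `.status.startTime`.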
Re: spark driver pod stuck in Waiting: PodInitializing state in Kubernetes
Posted by purna pradeep <pu...@gmail.com>.
Resurfacing the question to get more attention.
Hello,
> I'm running a Spark 2.3 job on a Kubernetes cluster.
>
> When I run spark-submit against the k8s master, the driver pod gets stuck
> in the Waiting: PodInitializing state. I had to manually kill the driver
> pod and submit a new job; then it works. How can this be handled in
> production?
>
> This happens when I submit the jobs almost in parallel, i.e., five jobs
> one after the other in quick succession.
This happens with executor pods as well:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25128
Re: spark driver pod stuck in Waiting: PodInitializing state in Kubernetes
Posted by purna pradeep <pu...@gmail.com>.
Hello,
> I'm running a Spark 2.3 job on a Kubernetes cluster.
>
> When I run spark-submit against the k8s master, the driver pod gets stuck
> in the Waiting: PodInitializing state. I had to manually kill the driver
> pod and submit a new job; then it works. How can this be handled in
> production?
>
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25128