Posted to user@spark.apache.org by purna pradeep <pu...@gmail.com> on 2018/08/15 11:45:10 UTC

spark driver pod stuck in Waiting: PodInitializing state in Kubernetes

I'm running a Spark 2.3 job on a Kubernetes cluster.

kubectl version

    Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}

    Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}



When I run spark-submit against the k8s master, the driver pod gets stuck in
the Waiting: PodInitializing state. In this case I have to manually kill the
driver pod and submit a new job, and then it works.
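
For context, the submit looks roughly like this (a sketch reconstructed from
the driver pod environment shown further down; the image, jar locations and
main class are masked or placeholder values), and killing the stuck driver is
just a kubectl delete:

    # sketch of the spark-submit invocation (placeholders for masked values)
    $SPARK_HOME/bin/spark-submit \
      --master k8s://https://kubernetes.default \
      --deploy-mode cluster \
      --name accelerate-testing-2 \
      --class com.myclass \
      --conf spark.kubernetes.namespace=spark \
      --conf spark.kubernetes.container.image=<spark-image>:v2.3.0 \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.driver.memory=2g \
      --conf spark.executor.instances=10 \
      --conf spark.executor.cores=2 \
      --conf spark.executor.memory=10g \
      --jars s3a://my/my1.jar \
      s3a://my/my.jar

    # workaround today: delete the stuck driver pod and resubmit
    kubectl delete pod <stuck-driver-pod> -n spark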


This happens when I submit jobs almost in parallel, i.e. 5 jobs submitted one
right after the other.
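
Roughly like this (a sketch; same image/class placeholders as the single
submit above, app names are placeholders too):

    # fire off 5 submissions back to back
    for i in 1 2 3 4 5; do
      $SPARK_HOME/bin/spark-submit \
        --master k8s://https://kubernetes.default \
        --deploy-mode cluster \
        --name "accelerate-testing-$i" \
        --class com.myclass \
        --conf spark.kubernetes.namespace=spark \
        --conf spark.kubernetes.container.image=<spark-image>:v2.3.0 \
        s3a://my/my.jar &
    done
    wait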

I'm running the Spark jobs on 20 nodes, each with the configuration below.

I ran kubectl describe node on the node where the driver pod is running; the
output is below. I do see that resources on the node are overcommitted, but I
expected the Kubernetes scheduler not to schedule a pod when a node's
resources are overcommitted or when the node is in a Not Ready state. In this
case the node is in the Ready state, but I observe the same behaviour when a
node is in the "Not Ready" state.
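
The node output below comes from kubectl describe node; something like the
following also makes it easy to see the allocation summary and any pods that
never left Pending (node name is a placeholder):

    # requests/limits summary for the node
    kubectl describe node <node-name> | grep -A 6 "Allocated resources"

    # pods in the spark namespace still stuck in Pending
    kubectl get pods -n spark --field-selector=status.phase=Pending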



    Name:               **********
    Roles:              worker
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/hostname=****
                        node-role.kubernetes.io/worker=true
    Annotations:        node.alpha.kubernetes.io/ttl=0
                        volumes.kubernetes.io/controller-managed-attach-detach=true
    Taints:             <none>
    CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400
    Conditions:
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
      OutOfDisk        False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
      MemoryPressure   False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure     False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
      Ready            True    Tue, 14 Aug 2018 09:31:20 -0400   Sat, 11 Aug 2018 00:41:27 -0400   KubeletReady                 kubelet is posting ready status. AppArmor enabled
    Addresses:
      InternalIP:  *****
      Hostname:    ******
    Capacity:
     cpu:     16
     memory:  125827288Ki
     pods:    110
    Allocatable:
     cpu:     16
     memory:  125724888Ki
     pods:    110
    System Info:
     Machine ID:                 *************
     System UUID:                **************
     Boot ID:                    1493028d-0a80-4f2f-b0f1-48d9b8910e9f
     Kernel Version:             4.4.0-1062-aws
     OS Image:                   Ubuntu 16.04.4 LTS
     Operating System:           linux
     Architecture:               amd64
     Container Runtime Version:  docker://Unknown
     Kubelet Version:            v1.8.3
     Kube-Proxy Version:         v1.8.3
    PodCIDR:                     ******
    ExternalID:                  **************
    Non-terminated Pods:         (11 in total)
      Namespace    Name                                                            CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ---------    ----                                                            ------------  ----------  ---------------  -------------
      kube-system  calico-node-gj5mb                                               250m (1%)     0 (0%)      0 (0%)           0 (0%)
      kube-system  kube-proxy-****************************************             100m (0%)     0 (0%)      0 (0%)           0 (0%)
      kube-system  prometheus-prometheus-node-exporter-9cntq                       100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
      logging      elasticsearch-elasticsearch-data-69df997486-gqcwg               400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
      logging      fluentd-fluentd-elasticsearch-tj7nd                             200m (1%)     0 (0%)      612Mi (0%)       0 (0%)
      rook         rook-agent-6jtzm                                                0 (0%)        0 (0%)      0 (0%)           0 (0%)
      rook         rook-ceph-osd-10-6-42-250.accel.aws-cardda.cb4good.com-gwb8j    0 (0%)        0 (0%)      0 (0%)           0 (0%)
      spark        accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1       2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
      spark        accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5    2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
      spark        accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver    1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
      spark        accelerate-testing-2-e8bd0607cc693bc8ae25cc6dc300b2c7-driver    1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ------------  ----------  ---------------  -------------
      7050m (44%)   1200m (7%)  33410Mi (27%)    45874Mi (37%)
    Events:         <none>


kubectl describe pod on the stuck driver gives the output below:

    Name:         accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
    Namespace:    spark
    Node:         ****
    Start Time:   Mon, 13 Aug 2018 16:18:34 -0400
    Labels:       launch-id=k8s-submit-service-cddf45ff-0d88-4681-af85-d8ed0359ce73
                  spark-app-selector=spark-63f536fd87f8457796802767922ef7d9
                  spark-role=driver
    Annotations:  spark-app-name=accelerate-testing-2
    Status:       Pending
    IP:
    Init Containers:
      spark-init:
        Container ID:
        Image:         ****:v2.3.0
        Image ID:
        Port:          <none>
        Args:
          init
          /etc/spark-init/spark-init.properties
        State:          Waiting
          Reason:       PodInitializing
        Ready:          False
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /etc/spark-init from spark-init-properties (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
          /var/spark-data/spark-files from download-files-volume (rw)
          /var/spark-data/spark-jars from download-jars-volume (rw)
    Containers:
      spark-kubernetes-driver:
        Container ID:
        Image:         ******:v2.3.0
        Image ID:
        Port:          <none>
        Args:
          driver
        State:          Waiting
          Reason:       PodInitializing
        Ready:          False
        Restart Count:  0
        Limits:
          memory:  2432Mi
        Requests:
          cpu:     1
          memory:  2Gi
        Environment:
          SPARK_DRIVER_MEMORY:        2g
          SPARK_DRIVER_CLASS:         com.myclass
          SPARK_DRIVER_BIND_ADDRESS:   (v1:status.podIP)
          SPARK_MOUNTED_CLASSPATH:    /var/spark-data/spark-jars/quantum-workflow-2.2.24.0-SNAPSHOT-assembly.jar:/var/spark-data/spark-jars/my.jar
          SPARK_MOUNTED_FILES_DIR:    /var/spark-data/spark-files
          SPARK_JAVA_OPT_0:           -Dspark.kubernetes.container.image=***
          SPARK_JAVA_OPT_1:           -Dspark.jars=s3a://my/my.jar,s3a://my/my1.jar
          SPARK_JAVA_OPT_2:           -Dspark.submit.deployMode=cluster
          SPARK_JAVA_OPT_3:           -Dspark.driver.blockManager.port=7079
          SPARK_JAVA_OPT_4:           -Dspark.executor.memory=10g
          SPARK_JAVA_OPT_5:           -Dspark.app.id=spark-63f536fd87f8457796802767922ef7d9
          SPARK_JAVA_OPT_6:           -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark
          SPARK_JAVA_OPT_7:           -Dspark.master=k8s://https://kubernetes.default
          SPARK_JAVA_OPT_8:           -Dspark.driver.host=spark-1534191513364-driver-svc.spark.svc
          SPARK_JAVA_OPT_9:           -Dspark.executor.cores=2
          SPARK_JAVA_OPT_10:          -Dspark.kubernetes.executor.podNamePrefix=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba
          SPARK_JAVA_OPT_11:          -Dspark.driver.port=7078
          SPARK_JAVA_OPT_12:          -Dspark.kubernetes.namespace=spark
          SPARK_JAVA_OPT_13:          -Dspark.executor.memoryOverhead=2g
          SPARK_JAVA_OPT_14:          -Dspark.kubernetes.initContainer.configMapName=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
          SPARK_JAVA_OPT_15:          -Dspark.kubernetes.initContainer.configMapKey=spark-init.properties
          SPARK_JAVA_OPT_16:          -Dspark.executor.instances=10
          SPARK_JAVA_OPT_17:          -Dspark.memory.fraction=0.6
          SPARK_JAVA_OPT_18:          -Dspark.driver.memory=2g
          SPARK_JAVA_OPT_19:          -Dspark.kubernetes.driver.pod.name=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
          SPARK_JAVA_OPT_20:          -Dspark.app.name=accelerate-testing-2
          SPARK_JAVA_OPT_21:          -Dspark.kubernetes.driver.label.launch-id=********
        Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
          /var/spark-data/spark-files from download-files-volume (rw)
          /var/spark-data/spark-jars from download-jars-volume (rw)
    Conditions:
      Type           Status
      Initialized    False
      Ready          False
      PodScheduled   True
    Volumes:
      spark-init-properties:
        Type:      ConfigMap (a volume populated by a ConfigMap)
        Name:      accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
        Optional:  false
      download-jars-volume:
        Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:
      download-files-volume:
        Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:
      spark-token-mj86g:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  spark-token-mj86g
        Optional:    false
    QoS Class:       Burstable
    Node-Selectors:  <none>
    Tolerations:     <none>
    Events:
      Type     Reason          Age                  From                                   Message
      ----     ------          ----                 ----                                   -------
      Normal   SandboxChanged  44m (x518 over 18h)  kubelet, ****************************  Pod sandbox changed, it will be killed and re-created.
      Warning  FailedSync      19s (x540 over 18h)  kubelet, ****************************  Error syncing pod
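
Is there anything beyond the init container's logs and the namespace events
that would be worth checking here? i.e. something along these lines:

    # logs from the spark-init init container of the stuck driver
    kubectl logs accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver -c spark-init -n spark

    # recent events in the spark namespace, oldest first
    kubectl get events -n spark --sort-by=.metadata.creationTimestamp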

Re: spark driver pod stuck in Waiting: PodInitializing state in Kubernetes

Posted by purna pradeep <pu...@gmail.com>.
Resurfacing the question to get more attention.

Hello,

> I'm running a Spark 2.3 job on a Kubernetes cluster (client v1.9.3, server
> v1.8.3).
>
> When I run spark-submit against the k8s master, the driver pod gets stuck
> in the Waiting: PodInitializing state. I have to manually kill the driver
> pod and submit a new job, and then it works. How can this be handled in
> production?
>
> This happens with executor pods as well.

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25128
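
A quick way to spot executors that get stuck the same way (a sketch; it
assumes the executor pods carry the spark-role=executor label, the way the
driver above carries spark-role=driver):

    # executor pods in the spark namespace and their current state
    kubectl get pods -n spark -l spark-role=executor

    # details for one stuck executor (placeholder name)
    kubectl describe pod <stuck-executor-pod> -n spark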

Re: spark driver pod stuck in Waiting: PodInitializing state in Kubernetes

Posted by purna pradeep <pu...@gmail.com>.
Hello,

I'm running a Spark 2.3 job on a Kubernetes cluster (client v1.9.3, server
v1.8.3).

When I run spark-submit against the k8s master, the driver pod gets stuck in
the Waiting: PodInitializing state. I have to manually kill the driver pod
and submit a new job, and then it works. How can this be handled in
production?

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25128


>
> This is happening if i submit the jobs almost parallel ie submit 5 jobs
> one after the other simultaneously.
>
> I'm running spark jobs on 20 nodes each having below configuration
>
> I tried kubectl describe node on the node where trhe driver pod is running
> this is what i got ,i do see there is overcommit on resources but i
> expected kubernetes scheduler not to schedule if resources in node are
> overcommitted or node is in Not Ready state ,in this case node is in Ready
> State but i observe same behaviour if node is in "Not Ready" state
>
>
>
>     Name:               **********
>
>     Roles:              worker
>
>     Labels:             beta.kubernetes.io/arch=amd64
>
>                         beta.kubernetes.io/os=linux
>
>                         kubernetes.io/hostname=****
>
>                         node-role.kubernetes.io/worker=true
>
>     Annotations:        node.alpha.kubernetes.io/ttl=0
>
>
> volumes.kubernetes.io/controller-managed-attach-detach=true
>
>     Taints:             <none>
>
>     CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400
>
>     Conditions:
>
>       Type             Status  LastHeartbeatTime
> LastTransitionTime                Reason                       Message
>
>       ----             ------  -----------------
> ------------------                ------                       -------
>
>       OutOfDisk        False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31
> Jul 2018 09:59:24 -0400   KubeletHasSufficientDisk     kubelet has
> sufficient disk space available
>
>       MemoryPressure   False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31
> Jul 2018 09:59:24 -0400   KubeletHasSufficientMemory   kubelet has
> sufficient memory available
>
>       DiskPressure     False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31
> Jul 2018 09:59:24 -0400   KubeletHasNoDiskPressure     kubelet has no disk
> pressure
>
>       Ready            True    Tue, 14 Aug 2018 09:31:20 -0400   Sat, 11
> Aug 2018 00:41:27 -0400   KubeletReady                 kubelet is posting
> ready status. AppArmor enabled
>
>     Addresses:
>
>       InternalIP:  *****
>
>       Hostname:    ******
>
>     Capacity:
>
>      cpu:     16
>
>      memory:  125827288Ki
>
>      pods:    110
>
>     Allocatable:
>
>      cpu:     16
>
>      memory:  125724888Ki
>
>      pods:    110
>
>     System Info:
>
>      Machine ID:                 *************
>
>      System UUID:                **************
>
>      Boot ID:                    1493028d-0a80-4f2f-b0f1-48d9b8910e9f
>
>      Kernel Version:             4.4.0-1062-aws
>
>      OS Image:                   Ubuntu 16.04.4 LTS
>
>      Operating System:           linux
>
>      Architecture:               amd64
>
>      Container Runtime Version:  docker://Unknown
>
>      Kubelet Version:            v1.8.3
>
>      Kube-Proxy Version:         v1.8.3
>
>     PodCIDR:                     ******
>
>     ExternalID:                  **************
>
>     Non-terminated Pods:         (11 in total)
>
>       Namespace                  Name
>                        CPU Requests  CPU Limits  Memory Requests  Memory
> Limits
>
>       ---------                  ----
>                        ------------  ----------  ---------------
>  -------------
>
>       kube-system                calico-node-gj5mb
>                       250m (1%)     0 (0%)      0 (0%)           0 (0%)
>
>       kube-system
>  kube-proxy-****************************************             100m (0%)
>     0 (0%)      0 (0%)           0 (0%)
>
>       kube-system                prometheus-prometheus-node-exporter-9cntq
>                       100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
>
>       logging
>  elasticsearch-elasticsearch-data-69df997486-gqcwg               400m (2%)
>     1 (6%)      8Gi (6%)         16Gi (13%)
>
>       logging                    fluentd-fluentd-elasticsearch-tj7nd
>                       200m (1%)     0 (0%)      612Mi (0%)       0 (0%)
>
>       rook                       rook-agent-6jtzm
>                        0 (0%)        0 (0%)      0 (0%)           0 (0%)
>
>       rook
> rook-ceph-osd-10-6-42-250.accel.aws-cardda.cb4good.com-gwb8j    0 (0%)
>    0 (0%)      0 (0%)           0 (0%)
>
>       spark
>  accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1       2 (12%)
>     0 (0%)      10Gi (8%)        12Gi (10%)
>
>       spark
>  accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5    2 (12%)
>     0 (0%)      10Gi (8%)        12Gi (10%)
>
>       spark
>  accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver    1 (6%)
>    0 (0%)      2Gi (1%)         2432Mi (1%)
>
>       spark
>  accelerate-testing-2-e8bd0607cc693bc8ae25cc6dc300b2c7-driver    1 (6%)
>    0 (0%)      2Gi (1%)         2432Mi (1%)
>
>     Allocated resources:
>
>       (Total limits may be over 100 percent, i.e., overcommitted.)
>
>       CPU Requests  CPU Limits  Memory Requests  Memory Limits
>
>       ------------  ----------  ---------------  -------------
>
>       7050m (44%)   1200m (7%)  33410Mi (27%)    45874Mi (37%)
>
>
>     Events:         <none>
>
>
> Kubectl describe pod gives below message
>
>     Name:
> accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
>     Namespace:    spark
>     Node:         ****
>     Start Time:   Mon, 13 Aug 2018 16:18:34 -0400
>     Labels:
> launch-id=k8s-submit-service-cddf45ff-0d88-4681-af85-d8ed0359ce73
>                   spark-app-selector=spark-63f536fd87f8457796802767922ef7d9
>                   spark-role=driver
>     Annotations:  spark-app-name=accelerate-testing-2
>     Status:       Pending
>     IP:
>     Init Containers:
>       spark-init:
>         Container ID:
>         Image:         ****:v2.3.0
>         Image ID:
>         Port:          <none>
>         Args:
>           init
>           /etc/spark-init/spark-init.properties
>         State:          Waiting
>           Reason:       PodInitializing
>         Ready:          False
>         Restart Count:  0
>         Environment:    <none>
>         Mounts:
>           /etc/spark-init from spark-init-properties (rw)
>           /var/run/secrets/kubernetes.io/serviceaccount from
> spark-token-mj86g (ro)
>           /var/spark-data/spark-files from download-files-volume (rw)
>           /var/spark-data/spark-jars from download-jars-volume (rw)
>     Containers:
>       spark-kubernetes-driver:
>         Container ID:
>         Image:         ******:v2.3.0
>         Image ID:
>         Port:          <none>
>         Args:
>           driver
>         State:          Waiting
>           Reason:       PodInitializing
>         Ready:          False
>         Restart Count:  0
>         Limits:
>           memory:  2432Mi
>         Requests:
>           cpu:     1
>           memory:  2Gi
>         Environment:
>           SPARK_DRIVER_MEMORY:        2g
>           SPARK_DRIVER_CLASS:         com.myclass
>           SPARK_DRIVER_BIND_ADDRESS:   (v1:status.podIP)
>           SPARK_MOUNTED_CLASSPATH:
>  /var/spark-data/spark-jars/quantum-workflow-2.2.24.0-SNAPSHOT-assembly.jar:/var/spark-data/spark-jars/my.jar
>           SPARK_MOUNTED_FILES_DIR:    /var/spark-data/spark-files
>           SPARK_JAVA_OPT_0:
> -Dspark.kubernetes.container.image=***
>           SPARK_JAVA_OPT_1:
> -Dspark.jars=s3a://my/my.jar,s3a://my/my1.jar
>           SPARK_JAVA_OPT_2:           -Dspark.submit.deployMode=cluster
>           SPARK_JAVA_OPT_3:           -Dspark.driver.blockManager.port=7079
>           SPARK_JAVA_OPT_4:           -Dspark.executor.memory=10g
>           SPARK_JAVA_OPT_5:           -Dspark.app.id
> =spark-63f536fd87f8457796802767922ef7d9
>           SPARK_JAVA_OPT_6:
> -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark
>           SPARK_JAVA_OPT_7:           -Dspark.master=k8s://
> https://kubernetes.default
>           SPARK_JAVA_OPT_8:
> -Dspark.driver.host=spark-1534191513364-driver-svc.spark.svc
>           SPARK_JAVA_OPT_9:           -Dspark.executor.cores=2
>           SPARK_JAVA_OPT_10:
>  -Dspark.kubernetes.executor.podNamePrefix=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba
>           SPARK_JAVA_OPT_11:          -Dspark.driver.port=7078
>           SPARK_JAVA_OPT_12:          -Dspark.kubernetes.namespace=spark
>           SPARK_JAVA_OPT_13:          -Dspark.executor.memoryOverhead=2g
>           SPARK_JAVA_OPT_14:
>  -Dspark.kubernetes.initContainer.configMapName=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
>           SPARK_JAVA_OPT_15:
>  -Dspark.kubernetes.initContainer.configMapKey=spark-init.properties
>           SPARK_JAVA_OPT_16:          -Dspark.executor.instances=10
>           SPARK_JAVA_OPT_17:          -Dspark.memory.fraction=0.6
>           SPARK_JAVA_OPT_18:          -Dspark.driver.memory=2g
>           SPARK_JAVA_OPT_19:          -Dspark.kubernetes.driver.pod.name
> =accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
>           SPARK_JAVA_OPT_20:          -Dspark.app.name
> =accelerate-testing-2
>           SPARK_JAVA_OPT_21:
>  -Dspark.kubernetes.driver.label.launch-id=********
>         Mounts:
>           /var/run/secrets/kubernetes.io/serviceaccount from
> spark-token-mj86g (ro)
>           /var/spark-data/spark-files from download-files-volume (rw)
>           /var/spark-data/spark-jars from download-jars-volume (rw)
>     Conditions:
>       Type           Status
>       Initialized    False
>       Ready          False
>       PodScheduled   True
>     Volumes:
>       spark-init-properties:
>         Type:      ConfigMap (a volume populated by a ConfigMap)
>         Name:
>  accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
>         Optional:  false
>       download-jars-volume:
>         Type:    EmptyDir (a temporary directory that shares a pod's
> lifetime)
>         Medium:
>       download-files-volume:
>         Type:    EmptyDir (a temporary directory that shares a pod's
> lifetime)
>         Medium:
>       spark-token-mj86g:
>         Type:        Secret (a volume populated by a Secret)
>         SecretName:  spark-token-mj86g
>         Optional:    false
>     QoS Class:       Burstable
>     Node-Selectors:  <none>
>     Tolerations:     <none>
>     Events:
>       Type     Reason          Age                  From
>                             Message
>       ----     ------          ----                 ----
>                             -------
>       Normal   SandboxChanged  44m (x518 over 18h)  kubelet,
> ****************************  Pod sandbox changed, it will be killed and
> re-created.
>       Warning  FailedSync      19s (x540 over 18h)  kubelet,
> ****************************  Error syncing pod
>
>