Posted to issues@spark.apache.org by "Will Zhang (JIRA)" <ji...@apache.org> on 2019/04/29 15:22:00 UTC

[jira] [Comment Edited] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod

    [ https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829337#comment-16829337 ] 

Will Zhang edited comment on SPARK-27574 at 4/29/19 3:21 PM:
-------------------------------------------------------------

Hi [~Udbhav Agrawal], the driver log is nothing special: the first container ran successfully and exited. The second failed because it checks the output file path and returns an error if it already exists. What I can see from the log is that the second container starts shortly after the first one exits. I attached the driver log files. Thank you.
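
As a hedged illustration of that failure mode (not our exact check), assuming the output lives on HDFS like the job's other paths: once the first container's successful run has created the output, an existence check in a restarted container trips, roughly like:

  # Sketch only; OUTPUT is a placeholder, not the job's actual output path.
  OUTPUT=hdfs://yq01-m12-ai2b-service02.yq01.xxxx.com:9000/path/to/output
  hdfs dfs -test -e "$OUTPUT" && { echo "output already exists"; exit 1; }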

Below is the output of kubectl describe pod; it only contains the second container ID:

Name:            com-xxxx-cloud-mf-trainer-submit-1555666719424-driver
Namespace:       default
Node:            yq01-m12-ai2b-service02.yq01.xxxx.com/10.155.197.12
Start Time:      Fri, 19 Apr 2019 17:38:40 +0800
Labels:          DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7
                 spark-app-selector=spark-4343fe80572c4240bd933246efd975da
                 spark-role=driver
Annotations:     <none>
Status:          Failed
IP:              10.244.12.106
Containers:
  spark-kubernetes-driver:
    Container ID:   docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3
    Image:          10.96.0.100:5000/spark:spark-2.4.0
    Image ID:       docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f
    Ports:          7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      com.xxxx.cloud.mf.trainer.Submit
      spark-internal
      --ak
      970f5e4c-7171-4c61-603e-f101b65a573b
      --tracking_server_url
      http://10.155.197.12:8080
      --graph
      hdfs://yq01-m12-ai2b-service02.yq01.xxxx.com:9000/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/meta/node1555661669082/graph.json
      --sk
      56305f9f-b755-4b42-4218-592555f5c4a8
      --mode
      train
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 19 Apr 2019 17:39:57 +0800
      Finished:     Fri, 19 Apr 2019 17:40:48 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  2432Mi
    Requests:
      cpu:     1
      memory:  2432Mi
    Environment:
      xxxx_KUBERNETES_LOG_ENDPOINT:         yq01-m12-ai2b-service02.yq01.xxxx.com:8070
      xxxx_KUBERNETES_LOG_FLUSH_FREQUENCY:  10s
      xxxx_KUBERNETES_LOG_PATH:             /project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/log/driver
      SPARK_DRIVER_BIND_ADDRESS:            (v1:status.podIP)
      SPARK_LOCAL_DIRS:                     /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f
      SPARK_CONF_DIR:                       /opt/spark/conf
    Mounts:
      /opt/spark/conf from spark-conf-volume (rw)
      /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f from spark-local-dir-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-q7drh (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  spark-local-dir-1:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  spark-conf-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      com-xxxx-cloud-mf-trainer-submit-1555666719424-driver-conf-map
    Optional:  false
  default-token-q7drh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-q7drh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
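
For anyone trying to reproduce this: kubectl describe only shows the live container, so the standard commands below (with this run's pod name) should surface the earlier attempt, assuming the kubelet kept the previous container's log:

  kubectl logs com-xxxx-cloud-mf-trainer-submit-1555666719424-driver --previous
  kubectl get pod com-xxxx-cloud-mf-trainer-submit-1555666719424-driver \
    -o jsonpath='{.spec.restartPolicy} {.status.containerStatuses[0].restartCount}'

The second command prints the restart policy next to the restart count; spark-submit normally creates the driver pod with restartPolicy Never, which is why a second container in the same pod is unexpected.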

 

 

 


> spark on kubernetes driver pod phase changed from running to pending and starts another container in pod
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27574
>                 URL: https://issues.apache.org/jira/browse/SPARK-27574
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.0
>         Environment: Kubernetes version (use kubectl version):
> v1.10.0
> OS (e.g: cat /etc/os-release):
> CentOS-7
> Kernel (e.g. uname -a):
> 4.17.11-1.el7.elrepo.x86_64
> Spark-2.4.0
>            Reporter: Will Zhang
>            Priority: Major
>         Attachments: driver-pod-logs.zip
>
>
> I'm using spark-on-kubernetes to submit Spark apps to Kubernetes.
> Most of the time it runs smoothly.
> But sometimes I see in the logs after submitting that the driver pod phase changes from Running back to Pending and another container starts in the pod, even though the first container exited successfully.
> I use the standard spark-submit to Kubernetes, like:
> /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster --class xxx ...
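> (For completeness, a hedged sketch of what the full command looks like; the class and trailing arguments above are redacted, so the values below are stand-ins, but the flags are the standard Spark 2.4 Kubernetes ones:
> /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit \
>   --master k8s://https://<apiserver-host>:<port> \
>   --deploy-mode cluster \
>   --name com-xxxx-cloud-mf-trainer-submit \
>   --class com.xxxx.cloud.mf.trainer.Submit \
>   --conf spark.kubernetes.container.image=10.96.0.100:5000/spark:spark-2.4.0 \
>   local:///path/to/app.jar <app-args>)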
>  
> The log is below:
>  
>  
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: N/A
> start time: N/A
> container images: N/A
> phase: Pending
> status: []
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: N/A
> container images: N/A
> phase: Pending
> status: []
> 2019-04-25 13:37:01 INFO Client:54 - Waiting for application com.xxxx.cloud.mf.trainer.Submit to finish...
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Pending
> status: [ContainerStatus(containerID=null, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:04 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Running
> status: [ContainerStatus(containerID=docker://120dbf8cb11cf8ef9b26cff3354e096a979beb35279de34be64b3c06e896b991, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2019-04-25T13:37:03Z, additionalProperties={}), additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:27 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Pending
> status: [ContainerStatus(containerID=null, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:29 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Running
> status: [ContainerStatus(containerID=docker://43753f5336c41eaec8cdcdfd271b34ac465de331aad2d612fe0c7ad1c3706aac, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2019-04-25T13:37:28Z, additionalProperties={}), additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:52 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Failed
> status: [ContainerStatus(containerID=docker://43753f5336c41eaec8cdcdfd271b34ac465de331aad2d612fe0c7ad1c3706aac, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://43753f5336c41eaec8cdcdfd271b34ac465de331aad2d612fe0c7ad1c3706aac, exitCode=1, finishedAt=Time(time=2019-04-25T13:37:48Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-04-25T13:37:28Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:52 INFO LoggingPodStatusWatcherImpl:54 - Container final statuses:
> Container name: spark-kubernetes-driver
>  Container image: 10.96.0.100:5000/spark:spark-2.4.0
>  Container state: Terminated
>  Exit code: 1
> 2019-04-25 13:37:52 INFO Client:54 - Application com.xxxx.cloud.mf.trainer.Submit finished.
> 2019-04-25 13:37:52 INFO ShutdownHookManager:54 - Shutdown hook called
> 2019-04-25 13:37:52 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-84727675-4ced-491c-8993-22e8f3539bf3
> bash-4.4#
>  
>  
> Please let me know if I miss anything.


