Posted to issues@spark.apache.org by "Will Zhang (JIRA)" <ji...@apache.org> on 2019/04/29 15:22:00 UTC
[jira] [Comment Edited] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod
[ https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829337#comment-16829337 ]
Will Zhang edited comment on SPARK-27574 at 4/29/19 3:21 PM:
-------------------------------------------------------------
Hi [~Udbhav Agrawal], the driver log is nothing special: the first container ran successfully and exited. The second failed because it checks the file path of the output and returns an error if it already exists. What I can see from the log is that the second container started shortly after the first one exited. I attached the driver log files. Thank you.
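As a side note, the failure mode described above (the app refusing to run when its output path already exists) can be sketched with a small, hypothetical check; `prepare_output_dir` and its `overwrite` flag are illustrative names, not code from the actual trainer, but an explicit overwrite option like this would make a re-run by a restarted container harmless:

```python
import os
import shutil
import tempfile

def prepare_output_dir(path, overwrite=False):
    """Fail fast if the output path already exists, unless overwrite is set.

    This mirrors the check described above: a restarted driver container
    sees the output left behind by the first (successful) run and errors
    out. Allowing an explicit overwrite makes the job safe to re-run.
    """
    if os.path.exists(path):
        if not overwrite:
            raise FileExistsError("output path already exists: %s" % path)
        shutil.rmtree(path)
    os.makedirs(path)
```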
Below is the output of kubectl describe pod; it only contains the second container ID:
Name:         com-xxxx-cloud-mf-trainer-submit-1555666719424-driver
Namespace:    default
Node:         yq01-m12-ai2b-service02.yq01.xxxx.com/10.155.197.12
Start Time:   Fri, 19 Apr 2019 17:38:40 +0800
Labels:       DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7
              spark-app-selector=spark-4343fe80572c4240bd933246efd975da
              spark-role=driver
Annotations:  <none>
Status:       Failed
IP:           10.244.12.106
Containers:
  spark-kubernetes-driver:
    Container ID:  docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3
    Image:         10.96.0.100:5000/spark:spark-2.4.0
    Image ID:      docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      com.xxxx.cloud.mf.trainer.Submit
      spark-internal
      --ak
      970f5e4c-7171-4c61-603e-f101b65a573b
      --tracking_server_url
      http://10.155.197.12:8080
      --graph
      hdfs://yq01-m12-ai2b-service02.yq01.xxxx.com:9000/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/meta/node1555661669082/graph.json
      --sk
      56305f9f-b755-4b42-4218-592555f5c4a8
      --mode
      train
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 19 Apr 2019 17:39:57 +0800
      Finished:     Fri, 19 Apr 2019 17:40:48 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  2432Mi
    Requests:
      cpu:     1
      memory:  2432Mi
    Environment:
      xxxx_KUBERNETES_LOG_ENDPOINT:         yq01-m12-ai2b-service02.yq01.xxxx.com:8070
      xxxx_KUBERNETES_LOG_FLUSH_FREQUENCY:  10s
      xxxx_KUBERNETES_LOG_PATH:             /project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/log/driver
      SPARK_DRIVER_BIND_ADDRESS:            (v1:status.podIP)
      SPARK_LOCAL_DIRS:                     /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f
      SPARK_CONF_DIR:                       /opt/spark/conf
    Mounts:
      /opt/spark/conf from spark-conf-volume (rw)
      /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f from spark-local-dir-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-q7drh (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  spark-local-dir-1:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  spark-conf-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      com-xxxx-cloud-mf-trainer-submit-1555666719424-driver-conf-map
    Optional:  false
  default-token-q7drh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-q7drh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
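For illustration, the Running -> Pending -> Running transition with a changing container ID (the symptom visible in the watcher log quoted in this issue) can be detected mechanically. This is a hypothetical helper operating on simplified (phase, container_id) snapshots, not code from Spark or Kubernetes:

```python
def detect_container_restarts(snapshots):
    """Given an ordered list of (phase, container_id) pod snapshots,
    report transitions where the pod regressed from Running back to
    Pending, and where a new container ID appeared afterwards --
    the behavior described in this issue.
    """
    events = []
    prev_phase, prev_id = None, None
    for phase, cid in snapshots:
        # Pod phase went backwards: Running -> Pending.
        if prev_phase == "Running" and phase == "Pending":
            events.append("regressed Running -> Pending")
        # A different container ID means a second container was started.
        if cid and prev_id and cid != prev_id:
            events.append("new container started: %s" % cid)
        prev_phase = phase
        if cid:
            prev_id = cid
    return events
```

Run against a sequence shaped like the attached log (Pending, Running with one ID, Pending again, Running with a new ID, Failed), it flags both the phase regression and the second container.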
> spark on kubernetes driver pod phase changed from running to pending and starts another container in pod
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-27574
> URL: https://issues.apache.org/jira/browse/SPARK-27574
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Environment: Kubernetes version (use kubectl version):
> v1.10.0
> OS (e.g: cat /etc/os-release):
> CentOS-7
> Kernel (e.g. uname -a):
> 4.17.11-1.el7.elrepo.x86_64
> Spark-2.4.0
> Reporter: Will Zhang
> Priority: Major
> Attachments: driver-pod-logs.zip
>
>
> I'm using spark-on-kubernetes to submit a Spark app to Kubernetes.
> Most of the time, it runs smoothly.
> But sometimes I see in the logs after submitting that the driver pod phase changed from Running to Pending and a second container started in the pod, even though the first container exited successfully.
> I use the standard spark-submit to Kubernetes, like:
> /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster --class xxx ...
>
> log is below:
>
>
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: N/A
> start time: N/A
> container images: N/A
> phase: Pending
> status: []
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: N/A
> container images: N/A
> phase: Pending
> status: []
> 2019-04-25 13:37:01 INFO Client:54 - Waiting for application com.xxxx.cloud.mf.trainer.Submit to finish...
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Pending
> status: [ContainerStatus(containerID=null, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:04 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Running
> status: [ContainerStatus(containerID=docker://120dbf8cb11cf8ef9b26cff3354e096a979beb35279de34be64b3c06e896b991, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2019-04-25T13:37:03Z, additionalProperties={}), additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:27 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Pending
> status: [ContainerStatus(containerID=null, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:29 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Running
> status: [ContainerStatus(containerID=docker://43753f5336c41eaec8cdcdfd271b34ac465de331aad2d612fe0c7ad1c3706aac, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2019-04-25T13:37:28Z, additionalProperties={}), additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:52 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
> pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01.xxxx.com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Failed
> status: [ContainerStatus(containerID=docker://43753f5336c41eaec8cdcdfd271b34ac465de331aad2d612fe0c7ad1c3706aac, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://43753f5336c41eaec8cdcdfd271b34ac465de331aad2d612fe0c7ad1c3706aac, exitCode=1, finishedAt=Time(time=2019-04-25T13:37:48Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-04-25T13:37:28Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:52 INFO LoggingPodStatusWatcherImpl:54 - Container final statuses:
> Container name: spark-kubernetes-driver
> Container image: 10.96.0.100:5000/spark:spark-2.4.0
> Container state: Terminated
> Exit code: 1
> 2019-04-25 13:37:52 INFO Client:54 - Application com.xxxx.cloud.mf.trainer.Submit finished.
> 2019-04-25 13:37:52 INFO ShutdownHookManager:54 - Shutdown hook called
> 2019-04-25 13:37:52 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-84727675-4ced-491c-8993-22e8f3539bf3
> bash-4.4#
>
>
> Please let me know if I missed anything.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org