Posted to issues@spark.apache.org by "Stavros Kontopoulos (JIRA)" <ji...@apache.org> on 2019/05/31 15:08:00 UTC
[jira] [Created] (SPARK-27900) Spark on K8s will not report container failure due to oom
Stavros Kontopoulos created SPARK-27900:
-------------------------------------------
Summary: Spark on K8s will not report container failure due to oom
Key: SPARK-27900
URL: https://issues.apache.org/jira/browse/SPARK-27900
Project: Spark
Issue Type: Bug
Components: Kubernetes
Affects Versions: 2.4.3, 3.0.0
Reporter: Stavros Kontopoulos
{quote}A driver is running
{quote}
spark-pi-driver 1/1 Running 0 1h
spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
with the following setup:
{quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "skonto/spark:k8s-3.0.0-sa"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
  arguments:
    - "1000000"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  nodeSelector:
    "spark": "autotune"
  driver:
    memory: "1g"
    labels:
      version: 2.4.0
    serviceAccount: spark-sa
  executor:
    instances: 2
    memory: "1g"
    labels:
      version: 2.4.0
{quote}
At some point the driver fails with an OutOfMemoryError, but its JVM process keeps running, so the driver and executor pods stay in the Running state:
19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KiB, free 110.0 MiB)
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1765.0 B, free 110.0 MiB)
19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 MiB)
19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1180
19/05/31 13:29:25 INFO DAGScheduler: Submitting 1000000 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 1000000 tasks
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
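The OutOfMemoryError in the trace is thrown in the "dag-scheduler-event-loop" thread, not in the main thread. As a general JVM behaviour, an uncaught error ends only the thread it occurred in; the process keeps running as long as other non-daemon threads are alive, which would be consistent with the container still being reported as Running. The following is a minimal sketch of that behaviour, not Spark code: the class name, the thread name suffix and the simulated error are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

public class OomThreadDemo {

    /**
     * Starts a worker thread that dies with an OutOfMemoryError and returns
     * the throwable seen by that thread's uncaught exception handler.
     */
    public static Throwable runFailingWorker() {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            // Simulated failure: thrown directly instead of exhausting the heap.
            throw new OutOfMemoryError("Java heap space (simulated)");
        }, "dag-scheduler-event-loop-demo"); // hypothetical thread name
        // Record the error instead of letting the default handler print it.
        worker.setUncaughtExceptionHandler((thread, error) -> seen.set(error));
        worker.start();
        try {
            worker.join(); // wait until the worker thread has terminated
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        Throwable error = runFailingWorker();
        // The worker is gone, but the JVM and the calling thread carry on.
        System.out.println("worker died with: " + error);
        System.out.println("main thread alive: " + Thread.currentThread().isAlive());
    }
}
```

A handler like the one above is also the place where a process could decide to terminate itself on such an error, so that the container exits and Kubernetes can observe the failure. Whether that is appropriate for the driver is a design decision this sketch does not settle.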
$ kubectl describe pod spark-pi2-driver -n spark
Name: spark-pi2-driver
Namespace: spark
Priority: 0
PriorityClassName: <none>
Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
Start Time: Fri, 31 May 2019 16:28:59 +0300
Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
spark-role=driver
sparkoperator.k8s.io/app-name=spark-pi2
sparkoperator.k8s.io/launched-by-spark-operator=true
sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
version=2.4.0
Annotations: <none>
Status: Running
IP: 10.12.103.4
Controlled By: SparkApplication/spark-pi2
Containers:
spark-kubernetes-driver:
Container ID: docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
Image: skonto/spark:k8s-3.0.0-sa
Image ID: docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
Ports: 7078/TCP, 7079/TCP, 4040/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Args:
driver
--properties-file
/opt/spark/conf/spark.properties
--class
org.apache.spark.examples.SparkPi
spark-internal
1000000
State: Running
Inside the container, the processes are in _interruptible sleep_ (STAT S):
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
287 0 185 S 2344 0% 3 0% sh
294 287 185 R 1536 0% 3 0% top
1 0 185 S 776 0% 0 0% /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file /opt/spark/conf/spark.prope
Liveness checks might be a workaround, but REST APIs may still respond if other threads in the JVM are still running, as in this case (I checked the Spark UI and it was still available).
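Since an HTTP-based check can pass while a critical thread is already dead, one alternative is a check that looks at the thread itself. The sketch below is hypothetical and not part of Spark or the operator: the class and method names are made up, and the thread name is taken from the stack trace above. It would have to run inside the driver JVM and expose its result to a probe in some way; that wiring is not shown.

```java
import java.util.Set;

public class ThreadLivenessCheck {

    /** Returns true if a live thread with exactly this name exists in the current JVM. */
    public static boolean isThreadAlive(String name) {
        // getAllStackTraces() returns a map keyed by the currently live threads.
        Set<Thread> liveThreads = Thread.getAllStackTraces().keySet();
        for (Thread t : liveThreads) {
            if (t.getName().equals(name) && t.isAlive()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the scheduler thread; it simply sleeps briefly and exits.
        Thread loop = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException ignored) {
                // Nothing to clean up in this demo.
            }
        }, "dag-scheduler-event-loop");
        loop.start();
        System.out.println("while running: " + isThreadAlive("dag-scheduler-event-loop"));
        loop.join();
        System.out.println("after exit:    " + isThreadAlive("dag-scheduler-event-loop"));
    }
}
```

Matching on a thread name is fragile, because names can change between versions, so this is meant only to show the kind of signal a REST-based check would miss.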