Posted to issues@spark.apache.org by "Kent (Jira)" <ji...@apache.org> on 2022/02/17 17:56:00 UTC

[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed

    [ https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494121#comment-17494121 ] 

Kent commented on SPARK-33349:
------------------------------

[~jkleckner] Just curious: when you see "too old resource" in the driver pod log, does your pod die off and get restarted?

Our driver pod just hangs and the logs stop moving...

The only way we have found to fix this is to manually kill the pod or restart the Spark job itself.

Surely this must be affecting many others as well?

Maybe a custom fix is to have a watcher pod that looks for "too old resource" in the driver pod log and, if found, kills that driver pod, which would hopefully keep the Spark job running...
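
A rough sketch of that watchdog idea, for what it's worth. This is only an illustration: the namespace, pod name, tail size, and polling interval below are made-up assumptions, and the actual kill loop is left commented out since it needs real cluster access:

```shell
#!/bin/sh
# Hypothetical watchdog sketch for the workaround described above.
# All names here are illustrative, not taken from any real deployment.

# detect_stale_watch: exit 0 if the given log text contains the
# fabric8 "too old resource version" error, non-zero otherwise.
detect_stale_watch() {
  printf '%s\n' "$1" | grep -q 'too old resource version'
}

# Example loop (requires kubectl access to the cluster; values assumed):
#   NAMESPACE=kafka2hdfs
#   DRIVER_POD=spark-kafka-streamer-test-driver
#   while true; do
#     logs=$(kubectl -n "$NAMESPACE" logs "$DRIVER_POD" --tail=500)
#     if detect_stale_watch "$logs"; then
#       # Deleting the pod lets the operator's restartPolicy (Always)
#       # bring the application back up.
#       kubectl -n "$NAMESPACE" delete pod "$DRIVER_POD"
#     fi
#     sleep 60
#   done
```

Note this only papers over the symptom; the real fix is for the watcher to re-list and re-watch when the API server reports the stale resource version.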

Thanks!

Kent

BTW I voted and will have others vote for this as well!


> ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
> ------------------------------------------------------------------
>
>                 Key: SPARK-33349
>                 URL: https://issues.apache.org/jira/browse/SPARK-33349
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.0.1, 3.0.2, 3.1.0
>            Reporter: Nicola Bova
>            Priority: Critical
>
> I launch my spark application with the [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator] with the following yaml file:
> {code:yaml}
> apiVersion: sparkoperator.k8s.io/v1beta2
> kind: SparkApplication
> metadata:
>   name: spark-kafka-streamer-test
>   namespace: kafka2hdfs
> spec:
>   type: Scala
>   mode: cluster
>   image: <my-repo>/spark:3.0.2-SNAPSHOT-2.12-0.1.0
>   imagePullPolicy: Always
>   timeToLiveSeconds: 259200
>   mainClass: path.to.my.class.KafkaStreamer
>   mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar
>   sparkVersion: 3.0.1
>   restartPolicy:
>     type: Always
>   sparkConf:
>     "spark.kafka.consumer.cache.capacity": "8192"
>     "spark.kubernetes.memoryOverheadFactor": "0.3"
>   deps:
>     jars:
>       - my
>       - jar
>       - list
>   hadoopConfigMap: hdfs-config
>   driver:
>     cores: 4
>     memory: 12g
>     labels:
>       version: 3.0.1
>     serviceAccount: default
>     javaOptions: "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties"
>   executor:
>     instances: 4
>     cores: 4
>     memory: 16g
>     labels:
>       version: 3.0.1
>     javaOptions: "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties"
> {code}
> I have tried both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the ["Restart the watcher when we receive a version changed from k8s"|https://github.com/apache/spark/pull/29533] patch applied.
> This is the driver log:
> {code}
> 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> ... // my app log, it's a structured streaming app reading from kafka and writing to hdfs
> 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 1574101276 (1574213896)
>  at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
>  at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
>  at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
>  at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
>  at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
>  at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
>  at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
>  at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>  at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> The error above appears after roughly 50 minutes of running.
> After this exception, no more logs are produced and the application hangs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org