Posted to issues@spark.apache.org by "Hoa Le (Jira)" <ji...@apache.org> on 2021/11/21 18:38:00 UTC

[jira] [Created] (SPARK-37432) Driver should keep a record of decommissioned executors

Hoa Le created SPARK-37432:
------------------------------

             Summary: Driver should keep a record of decommissioned executors
                 Key: SPARK-37432
                 URL: https://issues.apache.org/jira/browse/SPARK-37432
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.1.1
            Reporter: Hoa Le


Hi,

We are running Spark 3.1.1 on Kubernetes. After the application has been running for a while, we start getting the exceptions below:

On the driver:

 
{code:java}
2021-11-21 18:25:21,859 ERROR Failed to send RPC RPC 6827167497981418905 to /10.1.201.113:58354: java.nio.channels.ClosedChannelException (org.apache.spark.network.client.TransportClient) [rpc-server-4-1]
java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
2021-11-21 18:25:21,864 ERROR Failed to send RPC RPC 7618635518207296341 to /10.1.201.113:58354: java.nio.channels.ClosedChannelException (org.apache.spark.network.client.TransportClient) [rpc-server-4-1]
java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
2021-11-21 18:25:21,868 ERROR Failed to send RPC RPC 5040314884474308699 to /10.1.201.113:58354: java.nio.channels.ClosedChannelException (org.apache.spark.network.client.TransportClient) [rpc-server-4-1]
java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source) {code}
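
For illustration only, the exception class itself is java.nio's standard signal for writing to a connection that the other side has already torn down. The sketch below uses plain java.nio (not Spark's Netty transport) and hypothetical names, just to show what the driver is effectively doing: writing RPCs on a channel the decommissioned executor has already closed.

{code:scala}
import java.nio.ByteBuffer
import java.nio.channels.{ClosedChannelException, Pipe}

// Minimal illustration only: plain java.nio, not Spark's Netty transport.
object ClosedChannelDemo {
  def main(args: Array[String]): Unit = {
    val pipe = Pipe.open()
    val sink = pipe.sink()
    sink.close() // the peer side is gone, like the decommissioned executor

    try {
      sink.write(ByteBuffer.wrap("rpc payload".getBytes("UTF-8")))
    } catch {
      case e: ClosedChannelException =>
        // Same exception class as "Failed to send RPC ..." in the driver log.
        println(s"write failed: $e")
    }
  }
}
{code}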
 

 

On the dead executor (we have the logs exported to persistent storage):

 
{code:java}
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.run(Unknown Source)
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:83)
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.sleep(Native Method)
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,java.lang.InterruptedException: sleep interrupted
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,632 ERROR Error while waiting for block to migrate (org.apache.spark.storage.BlockManagerDecommissioner) [migrate-shuffles-1]"
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.run(Unknown Source)
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:83)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.sleep(Native Method)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,java.lang.InterruptedException: sleep interrupted
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,632 ERROR Error while waiting for block to migrate (org.apache.spark.storage.BlockManagerDecommissioner) [migrate-shuffles-2]"
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.run(Unknown Source)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:83)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.sleep(Native Method)
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,java.lang.InterruptedException: sleep interrupted
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,632 ERROR Error while waiting for block to migrate (org.apache.spark.storage.BlockManagerDecommissioner) [migrate-shuffles-0]"
2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,618 ERROR Executor self-exiting due to : Finished decommissioning (org.apache.spark.executor.CoarseGrainedExecutorBackend) [wait-for-blocks-to-migrate]"
2021-11-21T14:23:52.722Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:23:52,199 WARN NoSuchMethodException was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information (com.amazonaws.http.apache.utils.ApacheUtils) [dispatcher-Executor]"
2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: All illegal access operations will be denied in a future release
2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,"WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)"
2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: An illegal reflective access operation has occurred
2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,+ exec /usr/bin/tini -s -- /usr/local/openjdk-11/bin/java -Dlog4j.configuration=file:/opt/spark/log4j/log4j.properties -javaagent:/prometheus/jmx_prometheus_javaagent-0.16.1.jar=8090:/etc/metrics/conf/prometheus.yaml -Dspark.network.timeout=600s -Dspark.driver.port=7078 -Dspark.driver.blockManager.port=7079 -Xms14336m -Xmx14336m -cp '/opt/spark/conf::/opt/spark/jars/*:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@firehose-ingestion-job-4ace777d3ec0ca06-driver-svc.atlas-spark-apps.svc:7078 --executor-id 699 --cores 3 --app-id spark-be1c315d0c2d49df926455f6d04a50eb --hostname 10.1.201.113 --resourceProfileId 0
2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,"+ CMD=(${JAVA_HOME}/bin/java ""${SPARK_EXECUTOR_JAVA_OPTS[@]}"" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp ""$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP --resourceProfileId $SPARK_RESOURCE_PROFILE_ID)" {code}
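
The InterruptedException entries above appear to be the normal shutdown path rather than a separate failure: the executor logs "Finished decommissioning" at 14:25:47,618, and the migrate-shuffles-N threads (which sleep between migration attempts) are interrupted a few milliseconds later, which surfaces as "Error while waiting for block to migrate". The sketch below is a minimal model of that interrupt pattern, with illustrative names only, not Spark's actual BlockManagerDecommissioner code:

{code:scala}
import java.util.concurrent.{Executors, TimeUnit}

// Minimal model of the interrupt pattern seen in the executor log above.
object MigrationInterruptDemo {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(3)

    for (i <- 0 until 3) {
      pool.submit(new Runnable {
        override def run(): Unit =
          try {
            while (!Thread.currentThread().isInterrupted) {
              // ... a real worker would migrate one shuffle block here ...
              Thread.sleep(1000) // interrupted when the pool shuts down
            }
          } catch {
            case _: InterruptedException =>
              // Decommissioning finished: the interrupt is the shutdown
              // signal, even though it shows up in the log as an ERROR line.
              println(s"migrate-shuffles-$i: sleep interrupted, exiting")
          }
      })
    }

    Thread.sleep(1500)  // let the workers park in sleep()
    pool.shutdownNow()  // interrupts every worker, as on decommission
    pool.awaitTermination(5, TimeUnit.SECONDS)
  }
}
{code}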
 

 

The Spark executor pod was actually decommissioned, but it seems the driver still keeps sending tasks to that executor.

The attachment below shows the executor/task view for this app in the master UI:

!image-2021-11-21-12-33-56-840.png!

 

and executor pod 699 is no longer running:
{code:java}
$ kubectl get pods |grep firehose
firehose-ingestion-job-driver                             1/1     Running   0          23h
firehoseingestionjob-28d3d27d3ec15aaf-exec-869            1/1     Running   0          18m
firehoseingestionjob-28d3d27d3ec15aaf-exec-874            1/1     Running   0          18m {code}
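
What we would expect (and what the title of this issue asks for) is that the driver keeps a record of executors that have finished decommissioning and stops dispatching tasks or RPCs to them. The sketch below is only a rough illustration of that idea, using hypothetical names; it is not Spark's scheduler backend API:

{code:scala}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch of the requested behaviour; none of these names are
// real Spark APIs.
class DecommissionRegistry {
  private val decommissioned = ConcurrentHashMap.newKeySet[String]()

  // Called when the driver learns an executor has been decommissioned.
  def markDecommissioned(executorId: String): Unit = {
    decommissioned.add(executorId)
  }

  // Consulted before dispatching a task or sending an RPC, so the driver
  // never targets an endpoint whose channel is already closed.
  def isUsable(executorId: String): Boolean =
    !decommissioned.contains(executorId)
}

object DecommissionRegistryDemo {
  def main(args: Array[String]): Unit = {
    val registry = new DecommissionRegistry
    registry.markDecommissioned("699") // the pod that no longer exists

    val candidates = Seq("699", "869", "874")
    val usable = candidates.filter(registry.isUsable)
    println(s"dispatch only to executors: ${usable.mkString(", ")}")
  }
}
{code}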
 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
