Posted to issues@spark.apache.org by "Hoa Le (Jira)" <ji...@apache.org> on 2021/11/21 18:41:00 UTC

[jira] [Updated] (SPARK-37432) Driver keep a record of decommission executor

     [ https://issues.apache.org/jira/browse/SPARK-37432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoa Le updated SPARK-37432:
---------------------------
    Attachment: master_ui_executor_tab.png

> Driver keep a record of decommission executor
> ---------------------------------------------
>
>                 Key: SPARK-37432
>                 URL: https://issues.apache.org/jira/browse/SPARK-37432
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.1.1
>            Reporter: Hoa Le
>            Priority: Minor
>         Attachments: master_ui_executor_tab.png
>
>
> Hi,
> We are running Spark on Kubernetes with version 3.1.1. After the Spark application has been running for a while, we get the exception below:
> On the driver: 
>  
> {code:java}
> 2021-11-21 18:25:21,859 ERROR Failed to send RPC RPC 6827167497981418905 to /10.1.201.113:58354: java.nio.channels.ClosedChannelException (org.apache.spark.network.client.TransportClient) [rpc-server-4-1]
> java.nio.channels.ClosedChannelException
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
> 	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> 	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
> 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> 	at java.base/java.lang.Thread.run(Unknown Source)
> 2021-11-21 18:25:21,864 ERROR Failed to send RPC RPC 7618635518207296341 to /10.1.201.113:58354: java.nio.channels.ClosedChannelException (org.apache.spark.network.client.TransportClient) [rpc-server-4-1]
> java.nio.channels.ClosedChannelException
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
> 	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> 	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
> 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> 	at java.base/java.lang.Thread.run(Unknown Source)
> 2021-11-21 18:25:21,868 ERROR Failed to send RPC RPC 5040314884474308699 to /10.1.201.113:58354: java.nio.channels.ClosedChannelException (org.apache.spark.network.client.TransportClient) [rpc-server-4-1]
> java.nio.channels.ClosedChannelException
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
> 	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> 	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
> 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> 	at java.base/java.lang.Thread.run(Unknown Source) {code}
>  
>  
> On the dead executor (its logs were exported to persistent storage):
>  
> {code:java}
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.run(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:83)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.sleep(Native Method)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,java.lang.InterruptedException: sleep interrupted
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,632 ERROR Error while waiting for block to migrate (org.apache.spark.storage.BlockManagerDecommissioner) [migrate-shuffles-1]"
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.run(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:83)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.sleep(Native Method)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,java.lang.InterruptedException: sleep interrupted
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,632 ERROR Error while waiting for block to migrate (org.apache.spark.storage.BlockManagerDecommissioner) [migrate-shuffles-2]"
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.run(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:83)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.sleep(Native Method)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,java.lang.InterruptedException: sleep interrupted
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,632 ERROR Error while waiting for block to migrate (org.apache.spark.storage.BlockManagerDecommissioner) [migrate-shuffles-0]"
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,618 ERROR Executor self-exiting due to : Finished decommissioning (org.apache.spark.executor.CoarseGrainedExecutorBackend) [wait-for-blocks-to-migrate]"
> 2021-11-21T14:23:52.722Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:23:52,199 WARN NoSuchMethodException was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information (com.amazonaws.http.apache.utils.ApacheUtils) [dispatcher-Executor]"
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: All illegal access operations will be denied in a future release
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,"WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)"
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: An illegal reflective access operation has occurred
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,+ exec /usr/bin/tini -s -- /usr/local/openjdk-11/bin/java -Dlog4j.configuration=file:/opt/spark/log4j/log4j.properties -javaagent:/prometheus/jmx_prometheus_javaagent-0.16.1.jar=8090:/etc/metrics/conf/prometheus.yaml -Dspark.network.timeout=600s -Dspark.driver.port=7078 -Dspark.driver.blockManager.port=7079 -Xms14336m -Xmx14336m -cp '/opt/spark/conf::/opt/spark/jars/*:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@firehose-ingestion-job-4ace777d3ec0ca06-driver-svc.atlas-spark-apps.svc:7078 --executor-id 699 --cores 3 --app-id spark-be1c315d0c2d49df926455f6d04a50eb --hostname 10.1.201.113 --resourceProfileId 0
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,"+ CMD=(${JAVA_HOME}/bin/java ""${SPARK_EXECUTOR_JAVA_OPTS[@]}"" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp ""$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP --resourceProfileId $SPARK_RESOURCE_PROFILE_ID)" {code}
>  
>  
> The actual Spark executor pod was decommissioned, but it seems the driver still keeps sending tasks to the decommissioned executor.
> The attachment below shows the master UI, Executors tab, for the app:
> !image-2021-11-21-12-33-56-840.png!
>  
> and there is no executor pod 699 running:
> {code:java}
> $ kubectl get pods |grep firehose
> firehose-ingestion-job-driver                             1/1     Running   0          23h
> firehoseingestionjob-28d3d27d3ec15aaf-exec-869            1/1     Running   0          18m
> firehoseingestionjob-28d3d27d3ec15aaf-exec-874            1/1     Running   0          18m {code}
>  
>  

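[Editor's note] One way to cross-check the driver's view against Kubernetes is the driver's monitoring REST API (`GET /api/v1/applications/<app-id>/executors` on the driver UI port), whose `ExecutorSummary` entries carry an `isActive` flag. A minimal sketch of that comparison, using a hypothetical sample payload and pod IDs hard-coded from the `kubectl` output above rather than a live cluster:

```python
import json

# Hypothetical, trimmed-down sample of what the driver's REST API might
# return; a real ExecutorSummary has many more fields per executor.
sample = json.loads("""
[
  {"id": "driver", "isActive": true},
  {"id": "699",    "isActive": true},
  {"id": "869",    "isActive": true},
  {"id": "874",    "isActive": true}
]
""")

# Executor IDs whose pods actually exist, e.g. taken from the
# `kubectl get pods` listing (hard-coded here for illustration).
live_pods = {"869", "874"}

def stale_executors(executors, live_ids):
    """Executors the driver still reports as active but whose pods are gone."""
    return sorted(
        e["id"] for e in executors
        if e["isActive"] and e["id"] != "driver" and e["id"] not in live_ids
    )

print(stale_executors(sample, live_pods))  # -> ['699']
```

With the sample data this flags executor 699, matching the symptom reported above: the driver's record still lists a decommissioned executor as active while no corresponding pod exists.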


--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org