Posted to issues@spark.apache.org by "Dylan Patterson (Jira)" <ji...@apache.org> on 2021/03/15 23:12:00 UTC

[jira] [Commented] (SPARK-34753) Deadlock in executor RPC shutdown hook

    [ https://issues.apache.org/jira/browse/SPARK-34753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302090#comment-17302090 ] 

Dylan Patterson commented on SPARK-34753:
-----------------------------------------

Aside from fixing the underlying issue, it might be worth adding some sort of kill-switch timeout for the containers, since this deadlock leaks resources: the hung executors leave orphaned containers behind in Kubernetes.
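A minimal sketch of what such a kill-switch could look like (not Spark code; class and method names here are illustrative): a daemon thread armed at the start of shutdown that calls Runtime.halt() if the orderly shutdown does not complete within a grace period. halt() terminates the JVM immediately without running shutdown hooks, so it cannot be blocked by the deadlocked Dispatcher.stop.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog: arm() when shutdown begins, markComplete() when it
// finishes. If markComplete() never arrives, the JVM is halted hard.
public class ShutdownWatchdog {
    private final CountDownLatch done = new CountDownLatch(1);

    /** Arm the watchdog; call this as the shutdown sequence starts. */
    public void arm(long timeoutSeconds, int exitCode) {
        Thread t = new Thread(() -> {
            try {
                if (!done.await(timeoutSeconds, TimeUnit.SECONDS)) {
                    // Grace period expired: skip shutdown hooks and die now.
                    Runtime.getRuntime().halt(exitCode);
                }
            } catch (InterruptedException ignored) {
                // Interrupted watchdog simply stands down.
            }
        }, "shutdown-watchdog");
        t.setDaemon(true); // must not itself keep the JVM alive
        t.start();
    }

    /** Signal that orderly shutdown finished in time. */
    public void markComplete() {
        done.countDown();
    }
}
```

With something like this armed from the executor's shutdown hook, a deadlock in the RPC teardown would still let the process exit after the timeout, so Kubernetes could reap the container instead of leaving it orphaned.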

> Deadlock in executor RPC shutdown hook
> --------------------------------------
>
>                 Key: SPARK-34753
>                 URL: https://issues.apache.org/jira/browse/SPARK-34753
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.0.1
>         Environment: Not sure this is relevant but let me know and I can append
>            Reporter: Dylan Patterson
>            Priority: Major
>
> Ran into an issue where executors initiate the shutdown sequence and System.exit is called, but the Java process never dies, leaving orphaned containers in Kubernetes. Tracked it down to a deadlock in the RPC shutdown. See the thread dump:
> {code:java}
> "Thread-2" #26 prio=5 os_prio=0 tid=0x00007f6410231800 nid=0x2a2 waiting on condition [0x00007f63c3bf1000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x00000000c05a47b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>     at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
>     at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(Executors.java:675)
>     at org.apache.spark.rpc.netty.MessageLoop.stop(MessageLoop.scala:60)
>     at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1(Dispatcher.scala:190)
>     at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1$adapted(Dispatcher.scala:187)
>     at org.apache.spark.rpc.netty.Dispatcher$$Lambda$214/337533935.apply(Unknown Source)
>     at scala.collection.Iterator.foreach(Iterator.scala:941)
>     at scala.collection.Iterator.foreach$(Iterator.scala:941)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:187)
>     at org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:324)
>     at org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:302)
>     at org.apache.spark.SparkEnv.stop(SparkEnv.scala:96)
>     at org.apache.spark.executor.Executor.stop(Executor.scala:292)
>     at org.apache.spark.executor.Executor.$anonfun$new$2(Executor.scala:74)
>     at org.apache.spark.executor.Executor$$Lambda$317/1046854795.apply$mcV$sp(Unknown Source)
>     at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
>     at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>     at org.apache.spark.util.SparkShutdownHookManager$$Lambda$2192/1832515374.apply$mcV$sp(Unknown Source)
>     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
>     at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>     at org.apache.spark.util.SparkShutdownHookManager$$Lambda$2191/952019066.apply$mcV$sp(Unknown Source)
>     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>     at scala.util.Try$.apply(Try.scala:213)
>     at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>     at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>     at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
