Posted to issues@spark.apache.org by "Zongheng Yang (JIRA)" <ji...@apache.org> on 2014/08/07 09:35:13 UTC

[jira] [Commented] (SPARK-2865) Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish

    [ https://issues.apache.org/jira/browse/SPARK-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088964#comment-14088964 ] 

Zongheng Yang commented on SPARK-2865:
--------------------------------------

Just wanted to point out that I used the aforementioned GitHub patch in my experiment before it reached its final state. I haven't retried since it received new fixes and was merged into master.

> Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-2865
>                 URL: https://issues.apache.org/jira/browse/SPARK-2865
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.0.1, 1.1.0
>         Environment: 16-node EC2 r3.2xlarge cluster
>            Reporter: Zongheng Yang
>            Priority: Blocker
>
> In the application I tested, most of the 128 tasks would finish, but sometimes (pretty deterministically) either 1 or 3 tasks would just hang forever (> 5 hrs with no progress at all) with the following stack trace. There were no apparent failures in the UI, and the nodes where the stuck tasks were running showed no apparent memory/CPU/disk pressure.
> {noformat}
> "Executor task launch worker-0" daemon prio=10 tid=0x00007f32ec003800 nid=0xaac waiting on condition [0x00007f33f4428000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007f3e0d7198e8> (a scala.concurrent.impl.Promise$CompletionLatch)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
>         at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:107)
>         at org.apache.spark.network.ConnectionManager.sendMessageReliablySync(ConnectionManager.scala:832)
>         at org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:122)
>         at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:497)
>         at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:495)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:495)
>         at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:481)
>         at org.apache.spark.storage.BlockManager.get(BlockManager.scala:524)
>         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
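> The top frames show the executor thread parked in Await.result inside ConnectionManager.sendMessageReliablySync while BlockManager.doGetRemote fetches a block from another node. The following is a minimal standalone sketch of that blocking pattern, not Spark's actual code (the object and value names are made up): a thread waiting on a reply future that is never completed parks indefinitely, while a bounded wait at least surfaces the lost reply as a failure.
> {noformat}
> import scala.concurrent.{Await, Promise}
> import scala.concurrent.duration._
>
> // Illustrative names only (BlockedFetchSketch, reply); these are not Spark internals.
> object BlockedFetchSketch {
>   def main(args: Array[String]): Unit = {
>     // Stand-in for the ack future behind a synchronous remote fetch; it is never
>     // completed here, as if the remote connection silently died.
>     val reply = Promise[Array[Byte]]()
>
>     // Await.result(reply.future, Duration.Inf) would park this thread on the
>     // promise's CompletionLatch forever, i.e. the WAITING (parking) state above.
>     // A bounded wait turns the lost reply into a visible failure instead:
>     try {
>       Await.result(reply.future, 10.seconds)
>     } catch {
>       case _: java.util.concurrent.TimeoutException =>
>         println("remote block fetch timed out; the caller can retry or fail the task")
>     }
>   }
> }
> {noformat}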
> This behavior does *not* appear on 1.0 (reusing the same cluster), but appears on the master branch as of Aug 4, 2014 *and* 1.0.1. Further, I tried out [this patch|https://github.com/apache/spark/pull/1758], and it didn't fix the behavior.
> When this behavior happened, the driver printed out the following line repeatedly:
> {noformat}
> 14/08/04 23:32:42 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-6-74.us-west-1.compute.internal, 59408, 0) with no recent heart beats: 67331ms exceeds 45000ms
> {noformat}
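> The warning suggests why the reply may never arrive: the master stopped receiving heartbeats from that executor's BlockManager for 67331ms, crossed the 45000ms threshold, and removed it, so a task still blocked in a synchronous fetch against that peer is waiting for an ack that can no longer be delivered. As a hedged illustration only of how that threshold maps to configuration (the property name is the Spark 1.x one behind the 45000ms default; verify it against the version in use), the timeout can be raised via SparkConf, though this only works around the premature removal rather than the blocking wait itself:
> {noformat}
> import org.apache.spark.SparkConf
>
> // Sketch only: raise the threshold behind the "exceeds 45000ms" warning.
> // Property name per Spark 1.x; confirm for the version actually deployed.
> val conf = new SparkConf()
>   .setAppName("block-fetch-hang-workaround") // app name is illustrative
>   .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")
> {noformat}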


