Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/05/17 12:06:04 UTC

[jira] [Comment Edited] (SPARK-20784) Spark hangs forever

    [ https://issues.apache.org/jira/browse/SPARK-20784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013901#comment-16013901 ] 

Hyukjin Kwon edited comment on SPARK-20784 at 5/17/17 12:05 PM:
----------------------------------------------------------------

Could you update the JIRA title? It sounds as if Spark always hangs forever, no matter what it does.

I think we really need a self-contained reproducer for cases like this if at all possible, in particular when the problem happens randomly.

I have seen many such JIRAs stay open without any further action, because no one except the reporter can reproduce the issue and confirm it is fixed, even if it has already been fixed somewhere.
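
For what it's worth, a self-contained reproducer does not have to be big. Below is a rough sketch of the shape it could take, with dummy data; the names and sizes are made up, spark is the usual shell SparkSession, and I have no idea whether this particular snippet actually triggers the hang. It just mirrors the reported pattern: a broadcast-sized join plus repartition/sort/cache, driven from parallel Futures on the default scala.concurrent ForkJoinPool, which is where the attached stack trace is parked.

{code}
// Hypothetical reproducer sketch with dummy data (names illustrative).
// Both sides are small enough to trigger a broadcast hash join, matching
// the BroadcastHashJoinExec frames in the attached stack trace.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global // a ForkJoinPool
import scala.concurrent.duration.Duration
import spark.implicits._

val left  = spark.range(1000).select($"id", $"id".alias("key"))
val right = spark.range(1000).select($"id", $"id".alias("key"))

// Fire several identical jobs concurrently, as the reported app does
// with several users in parallel threads.
val jobs = (1 to 8).map { _ =>
  Future {
    left.joinWith(right, left("key") === right("key"))
      .repartition(4, $"_1.key")
      .sortWithinPartitions($"_1.key")
      .cache()
      .count()
  }
}
jobs.foreach(j => Await.result(j, Duration.Inf))
{code}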


was (Author: hyukjin.kwon):
Could you update the JIRA title? It sounds as if Spark always hangs forever, no matter what it does.

I think we really need a self-contained reproducer for cases like this if at all possible, in particular when the problem happens randomly. I have seen many such JIRAs stay open without any further action, because no one can reproduce the issue and confirm it is fixed, even if it is fixed somewhere in the future, unless you keep verifying it or fix it yourself.

> Spark hangs forever
> -------------------
>
>                 Key: SPARK-20784
>                 URL: https://issues.apache.org/jira/browse/SPARK-20784
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2
>            Reporter: Mathieu D
>
> Spark hangs and stops executing any jobs or tasks.
> The Web UI shows *0 active stages* and *0 active tasks* on the executors, although a driver thread is clearly working on or finishing a stage (see below).
> Our application runs several Spark contexts in parallel threads, one per user. Spark version 2.0.2, yarn-client mode.
> An extract of the thread stack is below.
> {noformat}
> "ForkJoinPool-1-worker-0" #107 daemon prio=5 os_prio=0 tid=0x00007fddf0005800 nid=0x484 waiting on condition [0x00007fddd0bf
> 6000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x000000078c232760> (a scala.concurrent.impl.Promise$CompletionLatch)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>         at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:212)
>         at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>         at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>         at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>         at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>         at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
>         at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
>         at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>         at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:30)
>         at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:62)
>         at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>         at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
>         at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
>         at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>         at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>         at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>         at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
>         at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>         at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>         at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>         at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>         at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:30)
>         at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
>         at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>         at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>         at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>         at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.produce(BroadcastHashJoinExec.scala:38)
>         at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>         at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>         at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>         at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>         at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:30)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:309)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:347)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>         at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:87)
>         at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:123)
>         at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:114)
>         at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>         at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:114)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>         at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>         at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:111)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>         at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:97)
>         at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:86)
>         at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:42)
>         at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:98)
>         at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65)
>         at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89)
>         at org.apache.spark.sql.Dataset.persist(Dataset.scala:2301)
>         at org.apache.spark.sql.Dataset.cache(Dataset.scala:2311)
>         at com.bluedme.woda.ng.matcher.StrictMatchStrategy.buildJoinMap(StrictMatchStrategy.scala:172)
>         at com.bluedme.woda.ng.matcher.Matcher$$anonfun$evaluateMatching$1$$anonfun$apply$5$$anonfun$apply$6.apply(Matcher.scala:137)
>         at com.bluedme.woda.ng.matcher.Matcher$$anonfun$evaluateMatching$1$$anonfun$apply$5$$anonfun$apply$6.apply(Matcher.scala:137)
>         at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>         at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>         at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
>         at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>         at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
>         at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
>         at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
> {noformat}
> jstack -F does not report any deadlock.
> This has already happened a couple of times, and it is related to this block of code in our app:
> {code}
> val matchedRowsDF = rfqDF.joinWith(rfsDF, rfqDF(m.rfqColName) === rfsDF(m.rfsColName))
>   .select($"_1.$id".alias("RFQ" + id), $"_2.$id".alias("RFS" + id))
>   .repartition(rfqIDS.numPartitions, $"RFQ$id")
>   .sortWithinPartitions($"RFQ$id")
>   .as[(Long, Long)]
>   .cache()
> {code}
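> An observation from the trace: the thread is parked in BroadcastExchangeExec.doExecuteBroadcast, awaiting the broadcast future from inside a scala.concurrent Future running on the default ForkJoinPool. As a possible workaround (sketch only, not confirmed to help), we may try moving our Spark actions onto a dedicated thread pool instead of the global ForkJoinPool, and disabling automatic broadcast joins so the plan falls back to a sort-merge join. The name sparkJobEc below is illustrative.
>
> {code}
> // Workaround sketch only, not a confirmed fix. "spark" is the usual
> // SparkSession; sparkJobEc is an illustrative name.
> import java.util.concurrent.Executors
> import scala.concurrent.{ExecutionContext, Future}
>
> // Run Spark actions on a dedicated pool instead of the default
> // scala.concurrent ForkJoinPool that the stack trace shows.
> implicit val sparkJobEc: ExecutionContext =
>   ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))
>
> // Blunt mitigation: disable automatic broadcast hash joins entirely,
> // so joins fall back to sort-merge instead of a broadcast exchange.
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
>
> val done = Future {
>   // the joinWith/repartition/cache block above would run here
>   spark.range(10).count()
> }
> {code}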



