Posted to issues@spark.apache.org by "Bharath Ravi Kumar (JIRA)" <ji...@apache.org> on 2014/06/25 18:52:24 UTC

[jira] [Comment Edited] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution

    [ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043730#comment-14043730 ] 

Bharath Ravi Kumar edited comment on SPARK-1112 at 6/25/14 4:51 PM:
--------------------------------------------------------------------

Can a clear workaround be specified for this bug, please? For those unable to upgrade to 1.0.1 or 1.1.0 in production, general instructions on the workaround are needed; otherwise this is a huge blocker for current production deployments (even on 1.0.0). For instance, running saveAsTextFile() on an RDD (~400 MB) causes execution to freeze, with the last log statements seen on the driver being:

14/06/25 16:38:55 INFO spark.SparkContext: Starting job: saveAsTextFile at Test.java:99
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Got job 6 (saveAsTextFile at Test.java:99) with 2 output partitions (allowLocal=false)
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Final stage: Stage 6(saveAsTextFile at Test.java:99)
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Missing parents: List()
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Submitting Stage 6 (MappedRDD[558] at saveAsTextFile at Test.java:99), which has no missing parents
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 6 (MappedRDD[558] at saveAsTextFile at Test.java:99)
14/06/25 16:38:55 INFO scheduler.TaskSchedulerImpl: Adding task set 6.0 with 2 tasks
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Starting task 6.0:0 as TID 5 on executor 1: somehost.corp (PROCESS_LOCAL)
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Serialized task 6.0:0 as 351777 bytes in 36 ms
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Starting task 6.0:1 as TID 6 on executor 0: someotherhost.corp (PROCESS_LOCAL)
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Serialized task 6.0:1 as 186453 bytes in 16 ms

The test setup for reproducing this issue has two slaves (24 GB each) running Spark standalone. The driver runs with -Xmx4g.
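
For reference, a minimal sketch of the kind of driver code in question (in the spirit of the Test.java shown in the log above; the class name, master URL and input/output paths are placeholders, not the actual test code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Test {
    public static void main(String[] args) {
        // Standalone cluster with two slaves; the driver JVM runs with -Xmx4g.
        SparkConf conf = new SparkConf()
                .setAppName("SPARK-1112-repro")
                .setMaster("spark://master:7077"); // placeholder master URL
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Roughly 400 MB of data in 2 partitions (placeholder input path).
        JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input", 2);

        // With spark.akka.frameSize raised above 10, this call freezes after
        // the two tasks are serialized and submitted (see the log above).
        lines.saveAsTextFile("hdfs:///tmp/output");

        sc.stop();
    }
}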

Thanks.


> When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-1112
>                 URL: https://issues.apache.org/jira/browse/SPARK-1112
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.9.0, 1.0.0
>            Reporter: Guillaume Pitel
>            Assignee: Xiangrui Meng
>            Priority: Blocker
>             Fix For: 1.0.1, 1.1.0
>
>
> When I set spark.akka.frameSize to something over 10, messages sent from the executors to the driver completely block execution if the message is bigger than 10 MiB and smaller than the frame size (if it is above the frame size, it is fine).
> The workaround is to set spark.akka.frameSize to 10. In that case, since 0.8.1, the BlockManager deals with the data to be sent, although this seems slower than a direct Akka message.
> The configuration seems to be correctly read (see actorSystemConfig.txt), so I don't see where the 10 MiB limit could come from.
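> A minimal sketch of that workaround (spark.akka.frameSize is the real configuration key; the application skeleton around it is only illustrative):
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaSparkContext;
>
> public class FrameSizeWorkaround {
>     public static void main(String[] args) {
>         // On 0.9.0 / 1.0.0, keep spark.akka.frameSize at 10 (the default, in MB):
>         // task results larger than the frame size are then shipped through the
>         // BlockManager instead of as a direct Akka message, which avoids the
>         // hang at the cost of some speed.
>         SparkConf conf = new SparkConf()
>                 .setAppName("frameSizeWorkaround") // placeholder app name
>                 .set("spark.akka.frameSize", "10");
>         JavaSparkContext sc = new JavaSparkContext(conf);
>         // ... job code ...
>         sc.stop();
>     }
> }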


