Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2015/07/17 18:41:05 UTC

[jira] [Comment Edited] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()

    [ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631556#comment-14631556 ] 

Xiangrui Meng edited comment on SPARK-9096 at 7/17/15 4:40 PM:
---------------------------------------------------------------

This is not a known issue. The previous issue was that `Vector.hashCode` was too expensive and slowed down Pyrolite serialization. We can update the hashCode implementation to use more nonzero entries when computing the hash code. The only downside is that we might have a large dense vector with most entries being zero, but that should be rare, because a sparse vector should be used in that case.
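
A minimal sketch of the idea in Java (the class name, the bound of 16, and the array-based signature are illustrative assumptions, not Spark's actual implementation): hash only the first few nonzero entries, so a mostly-zero dense vector still gets a well-distributed hash code at a bounded cost.

    // Hypothetical sketch, not Spark's actual Vector.hashCode: mix the index
    // and bit pattern of the first MAX_NONZEROS nonzero entries into the hash.
    public final class VectorHashing {
        private static final int MAX_NONZEROS = 16; // assumed bound, for illustration

        public static int hashCode(double[] values) {
            int result = 31 + values.length;
            int seen = 0;
            for (int i = 0; i < values.length && seen < MAX_NONZEROS; i++) {
                if (values[i] != 0.0) {
                    long bits = Double.doubleToLongBits(values[i]);
                    result = 31 * result + i;                             // mix in the index
                    result = 31 * result + (int) (bits ^ (bits >>> 32));  // and the value bits
                    seen++;
                }
            }
            return result;
        }
    }

Using more nonzero entries makes collisions between distinct vectors less likely, which matters here because hash collisions are exactly what funnels many records into a single task during a shuffle.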


> Unevenly distributed task loads after using JavaRDD.subtract()
> --------------------------------------------------------------
>
>                 Key: SPARK-9096
>                 URL: https://issues.apache.org/jira/browse/SPARK-9096
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0, 1.4.1
>            Reporter: Gisle Ytrestøl
>            Priority: Minor
>         Attachments: ReproduceBug.java, hanging-one-task.jpg, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz
>
>
> When using JavaRDD.subtract(), the tasks appear to be unevenly distributed in the following operations on the new JavaRDD created by "subtract". As a result, a few tasks process almost all the data, and these tasks take a long time to finish. 
> I've reproduced this bug in the attached Java file, which I submit with spark-submit. 
> The logs for 1.3.1 and 1.4.1 are attached. In 1.4.1, we see that a few tasks in the count job take a lot of time:
> 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID 4659) in 708 ms on 148.251.190.217 (1597/1600)
> 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID 4786) in 772 ms on 148.251.190.217 (1598/1600)
> 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID 4582) in 275019 ms on 148.251.190.217 (1599/1600)
> 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID 4430) in 407020 ms on 148.251.190.217 (1600/1600)
> 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
> 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at ReproduceBug.java:56) finished in 420.024 s
> 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at ReproduceBug.java:56, took 442.941395 s
> In comparison, all tasks are more or less equal in size when running the same application on Spark 1.3.1. Overall, the
> attached application (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1, and completes in roughly 30 seconds on Spark 1.3.1. 
> Spark 1.4.0 behaves similarly to Spark 1.4.1 with respect to this issue.
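
A minimal sketch of the reported pattern in Java (a hypothetical stand-in, not the attached ReproduceBug.java, which is not reproduced here): subtract one RDD from another, then count the result. With well-distributed keys such as Integers this runs evenly; the skew appears when the element type's hashCode collides heavily, as the comment above says the old Vector.hashCode did, because subtract() hash-partitions its inputs.

    // Hypothetical reproduction sketch (not the attached ReproduceBug.java).
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SubtractSkewSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("SubtractSkewSketch");
            JavaSparkContext sc = new JavaSparkContext(conf);

            List<Integer> big = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) big.add(i);
            List<Integer> small = new ArrayList<>();
            for (int i = 0; i < 1_000; i++) small.add(i);

            JavaRDD<Integer> left = sc.parallelize(big, 1600);
            JavaRDD<Integer> right = sc.parallelize(small, 16);

            // subtract() shuffles both inputs by the elements' hashCode, so
            // elements whose hash codes collide land in the same reduce task.
            // A collision-prone hashCode turns this count into a few very
            // slow tasks while the rest finish quickly, matching the log above.
            long remaining = left.subtract(right).count();
            System.out.println("remaining = " + remaining);

            sc.stop();
        }
    }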


