Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2014/11/05 22:53:34 UTC

[jira] [Commented] (SPARK-4092) Input metrics don't work for coalesce()'d RDD's

    [ https://issues.apache.org/jira/browse/SPARK-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199181#comment-14199181 ] 

Apache Spark commented on SPARK-4092:
-------------------------------------

User 'ksakellis' has created a pull request for this issue:
https://github.com/apache/spark/pull/3120

> Input metrics don't work for coalesce()'d RDD's
> -----------------------------------------------
>
>                 Key: SPARK-4092
>                 URL: https://issues.apache.org/jira/browse/SPARK-4092
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Kostas Sakellis
>            Priority: Critical
>
> In every case where we set input metrics (from both Hadoop and block storage), we currently assume that exactly one input partition is computed within the task. This assumption does not hold in the general case. The main example in the current API is coalesce(), but user-defined RDDs could also be affected.
> To handle the most general case, we would need to support the notion of a single task having multiple input sources. A more surgical and less general fix is to go to HadoopRDD and check whether inputMetrics of the same "type" are already defined for the task. If they are, merge in the new data rather than blowing away the old metrics.
> This wouldn't cover the case where, e.g., a single task has input from both on-disk and in-memory blocks. It _would_ cover the case where someone calls coalesce() on a HadoopRDD, which is more common.
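
A minimal sketch of the merge-rather-than-overwrite idea described above. This is illustrative only, not the actual change in pull request 3120; the InputMetrics shape, its field names, and the addInputMetrics helper are simplified assumptions.

    // Illustrative Scala sketch -- a simplified stand-in for Spark's real
    // TaskMetrics/InputMetrics classes, not the code from the actual fix.
    case class InputMetrics(readMethod: String, var bytesRead: Long)

    class TaskMetrics {
      private var inputMetrics: Option[InputMetrics] = None

      // Merge new input metrics into existing metrics of the same "type"
      // (read method) instead of replacing them, so a task that computes
      // several coalesced Hadoop partitions accumulates bytes across all
      // of them rather than reporting only the last one.
      def addInputMetrics(readMethod: String, bytesRead: Long): Unit = {
        inputMetrics match {
          case Some(existing) if existing.readMethod == readMethod =>
            existing.bytesRead += bytesRead
          case _ =>
            inputMetrics = Some(InputMetrics(readMethod, bytesRead))
        }
      }

      def currentInputMetrics: Option[InputMetrics] = inputMetrics
    }

Under these assumptions, a task that reads three coalesced 64 MB Hadoop splits would report a single Hadoop-typed InputMetrics with 192 MB bytesRead, instead of only the last split's 64 MB.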


