Posted to issues@spark.apache.org by "Kostas Sakellis (JIRA)" <ji...@apache.org> on 2014/10/30 23:47:33 UTC

[jira] [Commented] (SPARK-4092) Input metrics don't work for coalesce()'d RDD's

    [ https://issues.apache.org/jira/browse/SPARK-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190974#comment-14190974 ] 

Kostas Sakellis commented on SPARK-4092:
----------------------------------------

The current patch I'm working on does the simplest thing to address this issue: in the Hadoop RDDs and the cache manager, if the task already has input metrics with the same read method, it increments them instead of overwriting. This simple solution should handle the common case of:
{code}
sc.textFile(..).coalesce(5).collect()
{code}
It will also cover the situation where blocks are coming from the cache.
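
For concreteness, here is a minimal self-contained sketch of that merge rule. The types below are simplified stand-ins for Spark's TaskMetrics and InputMetrics; the names and fields are illustrative, not the actual internals:
{code}
// Simplified stand-ins for Spark's metrics types (illustrative only).
object ReadMethod extends Enumeration {
  val Memory, Disk, Hadoop, Network = Value
}

class InputMetrics(val readMethod: ReadMethod.Value, var bytesRead: Long)

class TaskMetrics {
  var inputMetrics: Option[InputMetrics] = None

  // If metrics with the same read method already exist on the task,
  // accumulate into them; otherwise keep the old overwrite behavior.
  def updateInputMetrics(incoming: InputMetrics): Unit = inputMetrics match {
    case Some(existing) if existing.readMethod == incoming.readMethod =>
      existing.bytesRead += incoming.bytesRead
    case _ =>
      inputMetrics = Some(incoming)
  }
}
{code}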

What it won't handle properly is when coalesce (or another RDD with similar properties) reads from multiple blocks with mixed read methods (memory vs. Hadoop). In that case, one input metric will overwrite the other. We have several options:
# We create a MIXED read method, and if we see input metrics from different methods we change to MIXED (see the sketch after this list). This loses some information because we no longer know where the blocks were read from.
# We store multiple InputMetrics per TaskContext. Up the stack (e.g. JsonProtocol) we send the array of InputMetrics to the caller. We then have to worry about backwards compatibility, so we can't just remove the single InputMetrics field; we might have to send back a MIXED metric and, in addition, the array for newer clients.
# We punt on this issue for now, since it can be argued that it is not common.
What are people's thoughts on this?
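
To make option 1 concrete, here is a hedged sketch of what a MIXED read method could look like. The Mixed value and the merge function are hypothetical, not committed API:
{code}
object MixedMetricsSketch {
  // Hypothetical MIXED read method for option 1 (not committed API).
  object ReadMethod extends Enumeration {
    val Memory, Disk, Hadoop, Network, Mixed = Value
  }

  case class InputMetrics(readMethod: ReadMethod.Value, bytesRead: Long)

  // Merging metrics from different sources collapses them into one MIXED
  // entry: the byte total stays correct, but the per-source split is lost.
  def merge(a: InputMetrics, b: InputMetrics): InputMetrics =
    if (a.readMethod == b.readMethod)
      InputMetrics(a.readMethod, a.bytesRead + b.bytesRead)
    else
      InputMetrics(ReadMethod.Mixed, a.bytesRead + b.bytesRead)
}
{code}
Merging across sources this way keeps the totals correct but permanently drops the per-source breakdown, which is exactly the information loss noted in option 1.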



> Input metrics don't work for coalesce()'d RDD's
> -----------------------------------------------
>
>                 Key: SPARK-4092
>                 URL: https://issues.apache.org/jira/browse/SPARK-4092
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Kostas Sakellis
>            Priority: Critical
>
> In every case where we set input metrics (from both Hadoop and block storage) we currently assume that exactly one input partition is computed within the task. This is not a correct assumption in the general case. The main example in the current API is coalesce(), but user-defined RDDs could also be affected.
> To deal with the most general case, we would need to support the notion of a single task having multiple input sources. A more surgical and less general fix is to simply go to HadoopRDD and check if there are already inputMetrics defined for the task with the same "type". If there are, then merge in the new data rather than blowing away the old one.
> This wouldn't cover the case where, e.g., a single task has input from both on-disk and in-memory blocks. It _would_ cover the case where someone calls coalesce on a HadoopRDD... which is more common.


