Posted to issues@spark.apache.org by "Andrew Ash (JIRA)" <ji...@apache.org> on 2014/11/14 09:55:33 UTC

[jira] [Commented] (SPARK-603) add simple Counter API

    [ https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212017#comment-14212017 ] 

Andrew Ash commented on SPARK-603:
----------------------------------

Hi Anonymous (not sure how you can read this but...),

I'm not sure how a Counter API could solve the delayed evaluation problem while being fundamentally different from the existing Accumulator feature.  You'd have to make reading a counter force evaluation of every RDD it's referenced from, and then you'd have issues when that RDD is recomputed later for its actual use, effectively evaluating it twice.
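
To make the laziness issue concrete: with the existing Accumulator, the value is only meaningful after an action has forced the stage to run.  A minimal sketch (assuming a SparkContext sc and some RDD of strings called rdd, purely for illustration):

{code}
val acc = sc.accumulator(0)
val nonEmpty = rdd.filter { s =>
  if (s.nonEmpty) true else { acc += 1; false }   // increments happen on the executors
}
println(acc.value)   // prints 0 -- the filter hasn't run yet, RDDs are lazy
nonEmpty.count()     // an action forces evaluation of the stage
println(acc.value)   // now reflects the records that were filtered out
{code}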

I can see some kind of API that offers a hook into the RDD evaluation DAG, but it would operate at the stage level rather than the operation level (multiple maps, for example, are pipelined into a single stage), so the available hook points wouldn't map 1:1 to RDD operations, which would be quite tricky.
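
For reference, the closest existing hook is the SparkListener interface, which fires per stage; a rough sketch of how counter reporting might piggyback on it (the listener API is real, the counter-printing part is speculative):

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    println("stage " + info.stageId + " (" + info.name + ") finished")
    // a Counter API would have to report its values here, once per stage,
    // not once per RDD operation, since operations are pipelined into stages
  }
})
{code}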

I don't see a way to implement what you propose with Counters without compromising a major part of the Spark API contract (RDD laziness), so I propose closing this.  Especially since I have no way to contact you; your information doesn't appear on the prior Atlassian JIRA either: https://spark-project.atlassian.net/browse/SPARK-603

[~rxin] are you ok with closing this?

> add simple Counter API
> ----------------------
>
>                 Key: SPARK-603
>                 URL: https://issues.apache.org/jira/browse/SPARK-603
>             Project: Spark
>          Issue Type: New Feature
>            Priority: Minor
>
> Users need a very simple way to create counters in their jobs.  Accumulators provide a way to do this, but are a little clunky, for two reasons:
> 1) the setup is a nuisance
> 2) w/ delayed evaluation, you don't know when it will actually run, so it's hard to look at the values
> consider this code:
> {code}
> def filterBogus(rdd:RDD[MyCustomClass], sc: SparkContext) = {
>   val filterCount = sc.accumulator(0)
>   val filtered = rdd.filter{r =>
>     if (isOK(r)) true else {filterCount += 1; false}
>   }
>   println("removed " + filterCount.value + " records)
>   filtered
> }
> {code}
> The println will always say 0 records were filtered, because it's printed before anything has actually run.  I could print out the value later on, but note that it would destroy the modularity of the method -- kinda ugly to return the accumulator just so that it can get printed later on.  (and of course, the caller in turn might not know when the filter is going to get applied, and would have to pass the accumulator up even further ...)
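> Something like this works, but the accumulator leaks into the method's signature (a sketch only, same setup as above):
> {code}
> def filterBogus(rdd: RDD[MyCustomClass], sc: SparkContext): (RDD[MyCustomClass], Accumulator[Int]) = {
>   val filterCount = sc.accumulator(0)
>   val filtered = rdd.filter{r =>
>     if (isOK(r)) true else {filterCount += 1; false}
>   }
>   (filtered, filterCount)   // caller must run an action before filterCount.value means anything
> }
> {code}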
> I'd like to have Counters which just automatically get printed out whenever a stage has been run, and also some API to get the values back.  I realize this is tricky b/c a stage can get re-computed, so maybe the counters should only be incremented once.
> Maybe a more general way to do this is to provide some callback for whenever an RDD is computed -- by default, you would just print the counters, but the user could replace it with a custom handler.
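> Roughly what I'm imagining, with purely made-up API names:
> {code}
> // hypothetical Counter API -- sc.counter / sc.getCounter don't exist, names invented to illustrate
> val filterCount = sc.counter("bogus records filtered")
> val filtered = rdd.filter{r =>
>   if (isOK(r)) true else {filterCount += 1; false}
> }
> // whenever a stage containing this filter completes, Spark would print:
> //   bogus records filtered: <count>
> // and the value could be fetched back via something like sc.getCounter("bogus records filtered")
> {code}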


