Posted to reviews@spark.apache.org by sryza <gi...@git.apache.org> on 2014/10/28 01:55:27 UTC

[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

GitHub user sryza opened a pull request:

    https://github.com/apache/spark/pull/2968

    SPARK-3179. Add task OutputMetrics.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/spark sandy-spark-3179

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2968.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2968
    
----
commit 5b0871bcaef350ee2e43991df21ec5636ef55203
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-10-21T16:45:16Z

    SPARK-3179

commit b08853bc2c7b4962aee75d8ce50a10d660898ac5
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-10-28T00:43:23Z

    Clarify InputMetrics stuff

commit 1a793e8927a570114295ee6912b239c85a0a8f28
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-10-28T00:48:12Z

    Stylistic and import cleanups

----




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-60700216
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22314/
    Test PASSed.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2968




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-62069792
  
      [Test build #23018 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23018/consoleFull) for   PR 2968 at commit [`92c772c`](https://github.com/apache/spark/commit/92c772cb641da3686f0545841e1e5655729499aa).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class OutputMetrics(writeMethod: DataWriteMethod.Value) `





[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-61347200
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22657/
    Test PASSed.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-62285457
  
      [Test build #23107 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23107/consoleFull) for   PR 2968 at commit [`dce4784`](https://github.com/apache/spark/commit/dce4784fd428bd3b0df55080a8a94b60a153b094).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class OutputMetrics(writeMethod: DataWriteMethod.Value) `





[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2968#discussion_r20047634
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -151,6 +147,42 @@ class SparkHadoopUtil extends Logging {
           }
         }
       }
    +
    +  /**
    +   * Returns a function that can be called to find Hadoop FileSystem bytes written. If
    +   * getFSBytesWrittenOnThreadCallback is called from thread r at time t, the returned callback will
    +   * return the bytes written on r since t.  Reflection is required because thread-level FileSystem
    +   * statistics are only available as of Hadoop 2.5 (see HADOOP-10688).
    +   * Returns None if the required method can't be found.
    +   */
    +  private[spark] def getFSBytesWrittenOnThreadCallback(path: Path, conf: Configuration)
    +    : Option[() => Long] = {
    +    try {
    +      val threadStats = getFileSystemThreadStatistics(path, conf)
    +      val getBytesReadMethod = getFileSystemThreadStatisticsMethod("getBytesWritten")
    --- End diff --
    
    getBytesWrittenMethod? And same with the below naming?
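
    For reference, a minimal sketch of what such a reflective lookup can look
    like (method and variable names here are assumptions for illustration, not
    the actual patch):

        import scala.collection.JavaConverters._
        import scala.util.Try
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}

        // Look up the per-thread statistics accessors reflectively so this still
        // compiles against Hadoop versions older than 2.5, and yield None when
        // the methods are absent (pre-HADOOP-10688).
        def bytesWrittenCallbackSketch(path: Path, conf: Configuration): Option[() => Long] = Try {
          val scheme = path.getFileSystem(conf).getUri.getScheme
          val stats = FileSystem.getAllStatistics.asScala.filter(_.getScheme == scheme)
          val getThreadStats = classOf[FileSystem.Statistics].getMethod("getThreadStatistics")
          val getBytesWritten = getThreadStats.getReturnType.getMethod("getBytesWritten")
          def written(): Long =
            stats.map(s => getBytesWritten.invoke(getThreadStats.invoke(s)).asInstanceOf[Long]).sum
          val baseline = written()    // bytes written on this thread at time t
          () => written() - baseline  // bytes written on this thread since t
        }.toOption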




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2968#discussion_r20055135
  
    --- Diff: core/src/test/scala/org/apache/spark/metrics/InputOutputMetricsSuite.scala ---
    @@ -73,4 +78,32 @@ class InputMetricsSuite extends FunSuite with SharedSparkContext {
         assert(taskBytesRead.length == 2)
         assert(taskBytesRead.sum >= file.length())
       }
    +
    +  test("output metrics when writing text file") {
    +    val fs = FileSystem.getLocal(new Configuration())
    +    val outPath = new Path(fs.getWorkingDirectory, "outdir")
    +
    +    if (SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback(outPath, fs.getConf).isDefined) {
    --- End diff --
    
    Yeah, I had this same thought, but I can't think of any good way to prevent this.  Note that the other tests in this class have the same problem. 




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-61948176
  
      [Test build #22997 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22997/consoleFull) for   PR 2968 at commit [`92c772c`](https://github.com/apache/spark/commit/92c772cb641da3686f0545841e1e5655729499aa).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2968#discussion_r20047725
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -56,6 +57,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
       with SparkHadoopMapReduceUtil
       with Serializable
     {
    +  import PairRDDFunctions._
    --- End diff --
    
    Is this just for the reference to RECORDS_BETWEEN_BYTES_WRITTEN_METRIC_UPDATES on line 1093?  If so, I think it would be better to reference it through PairRDDFunctions directly there; it makes the code a little easier to read not to have these blanket imports.
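
    A small sketch of the qualified reference being suggested (recordsWritten,
    bytesWrittenCallback, and outputMetrics are locals from the surrounding
    write loop in the diff; this is an illustration, not the actual patch):

        // Qualify the companion-object constant at its one use site instead of
        // adding a class-wide `import PairRDDFunctions._`:
        if (recordsWritten % PairRDDFunctions.RECORDS_BETWEEN_BYTES_WRITTEN_METRIC_UPDATES == 0) {
          bytesWrittenCallback.foreach { fn => outputMetrics.bytesWritten = fn() }
        }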




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-61957993
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22997/
    Test FAILed.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2968#discussion_r19930010
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala ---
    @@ -171,6 +177,18 @@ case class InputMetrics(readMethod: DataReadMethod.Value) {
     
     /**
      * :: DeveloperApi ::
    + * Metrics about writing output data.
    + */
    +@DeveloperApi
    +case class OutputMetrics() {
    +  /**
    +   * Total bytes written
    +   */
    +  var bytesWritten: Long = 0L
    --- End diff --
    
    Do you anticipate adding other things here later (like time blocked on writing output data)?  Or should this just be a long inside the metrics class (rather than having this whole OutputMetrics class)?
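
    A small sketch of the trade-off (the extra field is hypothetical, purely to
    illustrate why a dedicated class leaves room to grow without touching the
    TaskMetrics field itself):

        @DeveloperApi
        case class OutputMetrics() {
          /** Total bytes written. */
          var bytesWritten: Long = 0L
          /** Hypothetical later addition: time spent blocked writing output, in nanoseconds. */
          var writeTime: Long = 0L
        }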




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-61341309
  
    Updated patch rebases on master and adds output metrics to a couple of places in the UI where they were missing.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-62282883
  
      [Test build #23107 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23107/consoleFull) for   PR 2968 at commit [`dce4784`](https://github.com/apache/spark/commit/dce4784fd428bd3b0df55080a8a94b60a153b094).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-61347196
  
      [Test build #22657 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22657/consoleFull) for   PR 2968 at commit [`94b7c3f`](https://github.com/apache/spark/commit/94b7c3f37f2e149debdbfed8c23b384c3817efb4).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2968#discussion_r20048069
  
    --- Diff: core/src/test/scala/org/apache/spark/metrics/InputOutputMetricsSuite.scala ---
    @@ -73,4 +78,32 @@ class InputMetricsSuite extends FunSuite with SharedSparkContext {
         assert(taskBytesRead.length == 2)
         assert(taskBytesRead.sum >= file.length())
       }
    +
    +  test("output metrics when writing text file") {
    +    val fs = FileSystem.getLocal(new Configuration())
    +    val outPath = new Path(fs.getWorkingDirectory, "outdir")
    +
    +    if (SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback(outPath, fs.getConf).isDefined) {
    --- End diff --
    
    This test is a little hairy because it could break without anyone noticing (since I don't think Jenkins runs a new enough Hadoop version for this code path to be exercised?).  But I don't have a great solution -- you could mock out a bunch of the function calls, but then you're left not really testing anything.
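
    One possible mitigation, sketched here with ScalaTest's assume (an
    alternative idea, not what the patch does): cancel the test instead of
    silently passing when the callback is unavailable, so a too-old Hadoop
    version on the CI machine at least shows up in the test report.

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.spark.deploy.SparkHadoopUtil
        import org.scalatest.FunSuite

        class OutputMetricsGuardSketch extends FunSuite {
          test("output metrics when writing text file") {
            val fs = FileSystem.getLocal(new Configuration())
            val outPath = new Path(fs.getWorkingDirectory, "outdir")
            // assume() throws TestCanceledException when false, marking the test
            // as canceled rather than passed.
            assume(SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback(outPath, fs.getConf).isDefined)
            // ... rest of the test body as in the diff above ...
          }
        }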




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2968#discussion_r19931259
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -249,7 +248,7 @@ class HadoopRDD[K, V](
                 && bytesReadCallback.isDefined) {
               recordsSinceMetricsUpdate = 0
               val bytesReadFn = bytesReadCallback.get
    -          inputMetrics.bytesRead = bytesReadFn()
    +          context.taskMetrics.inputMetrics.get.bytesRead = bytesReadFn()
    --- End diff --
    
    This change is a follow-up from SPARK-2621 that @pwendell asked to include here.  Since there's no consensus on the best way to do this, and it's not really related to output metrics, I'm going to take it out.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-60700213
  
      [Test build #22314 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22314/consoleFull) for   PR 2968 at commit [`1a793e8`](https://github.com/apache/spark/commit/1a793e8927a570114295ee6912b239c85a0a8f28).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class OutputMetrics() `





[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-62238843
  
    Sorry for not noticing the two other things earlier!!




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-62285463
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23107/
    Test PASSed.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-61938133
  
    @sryza this mostly looks good and is super helpful!!  One more comment I discussed with @pwendell: I think it would be good to add an output type field to output metrics (similar to what we have for input metrics), which for now could only be Hadoop.  The motivation is that there's some annoying overhead associated with changing what we write out as JSON, and it seems like we're likely to want to add metrics for other output types (e.g., storing output data in the block manager) later on.
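
    The shape this suggestion ended up taking is visible in the test-bot
    summaries elsewhere in this thread (`case class OutputMetrics(writeMethod: DataWriteMethod.Value)`);
    a sketch of the pair, with the enumeration body assumed:

        @DeveloperApi
        object DataWriteMethod extends Enumeration {
          type DataWriteMethod = Value
          val Hadoop = Value  // the only output type supported initially
        }

        @DeveloperApi
        case class OutputMetrics(writeMethod: DataWriteMethod.Value) {
          /** Total bytes written. */
          var bytesWritten: Long = 0L
        }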




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-60695417
  
      [Test build #22314 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22314/consoleFull) for   PR 2968 at commit [`1a793e8`](https://github.com/apache/spark/commit/1a793e8927a570114295ee6912b239c85a0a8f28).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2968#discussion_r19930092
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -961,30 +962,52 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
         }
     
         val writeShard = (context: TaskContext, iter: Iterator[(K,V)]) => {
    +      val config = wrappedConf.value
           // Hadoop wants a 32-bit task attempt ID, so if ours is bigger than Int.MaxValue, roll it
           // around by taking a mod. We expect that no task will be attempted 2 billion times.
           val attemptNumber = (context.attemptId % Int.MaxValue).toInt
           /* "reduce task" <split #> <attempt # = spark task #> */
           val attemptId = newTaskAttemptID(jobtrackerID, stageId, isMap = false, context.partitionId,
             attemptNumber)
    -      val hadoopContext = newTaskAttemptContext(wrappedConf.value, attemptId)
    +      val hadoopContext = newTaskAttemptContext(config, attemptId)
           val format = outfmt.newInstance
           format match {
    -        case c: Configurable => c.setConf(wrappedConf.value)
    +        case c: Configurable => c.setConf(config)
             case _ => ()
           }
           val committer = format.getOutputCommitter(hadoopContext)
           committer.setupTask(hadoopContext)
    +
    +      val bytesWrittenCallback = Option(config.get("mapreduce.output.fileoutputformat.outputdir"))
    +        .map(new Path(_))
    +        .flatMap(SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback(_, config))
    +      val outputMetrics = new OutputMetrics()
    +      if (bytesWrittenCallback.isDefined) {
    +        context.taskMetrics.outputMetrics = Some(outputMetrics)
    +      }
    +
    --- End diff --
    
    Can you move this code and (separately) the code in lines 995 through 1004 into methods that can be called here and also by saveAsHadoopDataset to avoid the code duplication?
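
    A sketch of the refactor being requested (the helper name and signature are
    assumptions; the body mirrors the diff above): pull the callback/metrics
    setup into a small private method that both saveAsNewAPIHadoopDataset and
    saveAsHadoopDataset can call, and give the periodic bytes-written update
    the same treatment.

        private def initHadoopOutputMetrics(
            context: TaskContext,
            config: Configuration): (OutputMetrics, Option[() => Long]) = {
          val bytesWrittenCallback =
            Option(config.get("mapreduce.output.fileoutputformat.outputdir"))
              .map(new Path(_))
              .flatMap(SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback(_, config))
          // Only register the metrics on the task when we can actually fill them in.
          val outputMetrics = new OutputMetrics()
          if (bytesWrittenCallback.isDefined) {
            context.taskMetrics.outputMetrics = Some(outputMetrics)
          }
          (outputMetrics, bytesWrittenCallback)
        }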




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-61941116
  
    Thanks for taking a look, @kayousterhout.  I'll add in an output type.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-62058487
  
      [Test build #23018 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23018/consoleFull) for   PR 2968 at commit [`92c772c`](https://github.com/apache/spark/commit/92c772cb641da3686f0545841e1e5655729499aa).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-61957977
  
      [Test build #22997 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22997/consoleFull) for   PR 2968 at commit [`92c772c`](https://github.com/apache/spark/commit/92c772cb641da3686f0545841e1e5655729499aa).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class OutputMetrics(writeMethod: DataWriteMethod.Value) `





[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-62057912
  
    Jenkins, retest this please




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-62069802
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23018/
    Test PASSed.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-61341603
  
      [Test build #22657 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22657/consoleFull) for   PR 2968 at commit [`94b7c3f`](https://github.com/apache/spark/commit/94b7c3f37f2e149debdbfed8c23b384c3817efb4).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-61341347
  
    Verified that output metrics show up (including while tasks are running) on a pseudo-distributed cluster.




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2968#discussion_r19930031
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -249,7 +248,7 @@ class HadoopRDD[K, V](
                 && bytesReadCallback.isDefined) {
               recordsSinceMetricsUpdate = 0
               val bytesReadFn = bytesReadCallback.get
    -          inputMetrics.bytesRead = bytesReadFn()
    +          context.taskMetrics.inputMetrics.get.bytesRead = bytesReadFn()
    --- End diff --
    
    Why did you change this?  I think it makes the code slightly harder to read and understand, because it's not obvious, without reading all of the code around here, why inputMetrics will always be non-None.
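
    A sketch of keeping the earlier style (a fragment; names follow the diff;
    resolving the Option once up front makes the non-None assumption explicit
    in one place):

        // Resolve the Option once, where it is known to be populated:
        val inputMetrics = context.taskMetrics.inputMetrics.get
        // ... later, inside the read loop, the periodic update stays simple:
        inputMetrics.bytesRead = bytesReadFn()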




[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/2968#issuecomment-62346552
  
    Thanks Sandy! I've merged this into master and 1.2.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org