Posted to reviews@spark.apache.org by sryza <gi...@git.apache.org> on 2014/08/22 01:10:39 UTC

[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

GitHub user sryza opened a pull request:

    https://github.com/apache/spark/pull/2087

    SPARK-2621. Update task InputMetrics incrementally

    The patch takes advantage of an API provided in Hadoop 2.5 that allows getting accurate data on Hadoop FileSystem bytes read.  It eliminates the old method, which naively reports the split size as the input bytes.  An impact of this change is that input metrics go away when running against Hadoop versions earlier than 2.5.  I can add this back in, but my opinion is that no metrics are better than inaccurate metrics.
    
    This is difficult to write a test for because we don't usually build against a version of Hadoop that contains the function we need.  I've tested it manually on a pseudo-distributed cluster.
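    For context, a minimal sketch of the Hadoop 2.5+ API the patch builds on (HADOOP-10688 added per-thread FileSystem statistics; this only compiles against Hadoop 2.5 or later, which is why the patch itself reaches it through reflection):

        import org.apache.hadoop.fs.FileSystem
        import scala.collection.JavaConverters._

        // Sum of bytes read on the current thread across all FileSystem
        // instances registered for the given scheme (e.g. "hdfs" or "file").
        def bytesReadOnThisThread(scheme: String): Long =
          FileSystem.getAllStatistics.asScala
            .filter(_.getScheme == scheme)
            .map(_.getThreadStatistics.getBytesRead)
            .sum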


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/spark sandy-spark-2621

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2087.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2087
    
----
commit b5f4c6c5d0be646798bc8188f610b00fb4be83fa
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-07-22T20:42:28Z

    SPARK-2621. Update task InputMetrics incrementally

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-57425658
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21091/




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54780033
  
    It looks like all the core tests are passing, but there are some failures in streaming and SQL tests.  Have those been showing up elsewhere?




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54563140
  
    Hm, test this please




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-58975673
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/367/consoleFull) for   PR 2087 at commit [`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54915960
  
    I think we need some indication of the bytes being read from Hadoop. If this is our only current mechanism, then I think removing the code is not worth the behavioral regression.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18795734
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    --- End diff --
    
    Hey, so there are a couple of issues with the current approach:
    
    This does a bunch of reflective calls plus exception handling every time it is called, which will have a huge performance overhead. Also, this catch-all exception is sort of scary... what if there is a legitimate exception invoking this function in versions that support it? The user will never be able to find it.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-57421546
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21091/consoleFull) for   PR 2087 at commit [`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18796436
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -222,12 +221,33 @@ class HadoopRDD[K, V](
               case eof: EOFException =>
                 finished = true
             }
    +
    +        // Update bytes read metric every 32 records
    +        if (recordsSinceMetricsUpdate == 32 && bytesReadCallback.isDefined) {
    +          recordsSinceMetricsUpdate = 0
    +          inputMetrics.bytesRead = bytesReadCallback.get()
    --- End diff --
    
    is this a legal assignment? `inputMetrics.bytesRead` is of type `Long` and `bytesReadCallback` is of type `Option[() => Long]`...




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59885707
  
    Jenkins, retest this please.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18796711
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -222,12 +221,33 @@ class HadoopRDD[K, V](
               case eof: EOFException =>
                 finished = true
             }
    +
    +        // Update bytes read metric every 32 records
    +        if (recordsSinceMetricsUpdate == 32 && bytesReadCallback.isDefined) {
    --- End diff --
    
    Can you make this into a class constant?
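    A later diff in this thread shows the constant this became, `HadoopRDD.RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES`. A minimal sketch of that shape; the value shown is illustrative, not necessarily the one merged:

        object HadoopRDD {
          // Number of records to read between consecutive updates of the
          // bytes-read metric (illustrative value).
          private[spark] val RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES = 256
        }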




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18833578
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    +    val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
    +    val scheme = qualifiedPath.toUri().getScheme()
    +    val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
    +    try {
    +      val threadStats = stats.map(Utils.invoke(classOf[Statistics], _, "getThreadStatistics"))
    +      val statisticsDataClass =
    +        Class.forName("org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData")
    +      val getBytesReadMethod = statisticsDataClass.getDeclaredMethod("getBytesRead")
    +      val f = () => threadStats.map(getBytesReadMethod.invoke(_).asInstanceOf[Long]).sum
    +      val start = f()
    +      Some(() => f() - start)
    --- End diff --
    
    `getBytesRead` will give you the bytes since the thread was created.  As executors can put multiple tasks on the same thread, calling it without a delta could include bytes read from a previous task.  I'll comment / restructure this to make it a little more clear.
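    In other words, the callback snapshots the thread's cumulative counter at task start and reports only growth since then. A minimal sketch of that delta pattern, independent of the Hadoop specifics:

        // `readTotal` returns the thread's cumulative bytes-read counter; the
        // returned function reports only bytes read since this task started,
        // so a previous task on the same pooled thread doesn't pollute the
        // metric.
        def deltaCallback(readTotal: () => Long): () => Long = {
          val start = readTotal()
          () => readTotal() - start
        }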




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54859158
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19983/consoleFull) for   PR 2087 at commit [`0034292`](https://github.com/apache/spark/commit/00342924b0b3eba12861bf1cd524f0bfca2fbc4e).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18939723
  
    --- Diff: core/src/test/scala/org/apache/spark/metrics/InputMetricsSuite.scala ---
    @@ -0,0 +1,53 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.metrics
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.SharedSparkContext
    +import org.apache.spark.scheduler.{SparkListenerTaskEnd, SparkListener}
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import java.io.{FileWriter, PrintWriter, File}
    +
    +class InputMetricsSuite extends FunSuite with SharedSparkContext {
    +  test("input metrics when reading text file") {
    +    val file = new File(getClass.getSimpleName + ".txt")
    +    val pw = new PrintWriter(new FileWriter(file))
    +    pw.println("some stuff")
    +    pw.println("some other stuff")
    +    pw.println("yet more stuff")
    +    pw.println("too much stuff")
    +    pw.close()
    +    file.deleteOnExit()
    +
    +    val taskBytesRead = new ArrayBuffer[Long]()
    +    sc.addSparkListener(new SparkListener() {
    +      override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    +        taskBytesRead += taskEnd.taskMetrics.inputMetrics.get.bytesRead
    +      }
    +    })
    +    sc.textFile("file://" + file.getAbsolutePath, 2).count()
    +
    +    // Wait for task end events to come in
    +    Thread.sleep(100)
    --- End diff --
    
    Can you use the utility that waits until the listener bus is empty here?
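    The utility in question is, in Spark's test code, the listener-bus wait; replacing the sleep with something like the line below makes the test deterministic rather than timing-dependent (this assumes the test-visible `listenerBus` on SparkContext):

        // Block until every event posted to the listener bus has been
        // delivered to listeners (timeout in milliseconds).
        sc.listenerBus.waitUntilEmpty(500)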




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18796643
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    +    val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
    +    val scheme = qualifiedPath.toUri().getScheme()
    +    val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
    --- End diff --
    
    When will a specific FileSystem be present here? Is it definitely the case that there will exist a filesystem for the supplied scheme? It looks like this is based on a static cache within the FileSystem class. Does that get populated before this is invoked?




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54698426
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19865/consoleFull) for   PR 2087 at commit [`0034292`](https://github.com/apache/spark/commit/00342924b0b3eba12861bf1cd524f0bfca2fbc4e).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-60193683
  
    Oops, sorry about that.  Posted a new patch.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-60454724
  
    Anything else needed here?  Sorry to keep pestering - I have an output metrics patch that depends on this one, and I'm eager to post it.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2087




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18797454
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    +    val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
    +    val scheme = qualifiedPath.toUri().getScheme()
    +    val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
    +    try {
    +      val threadStats = stats.map(Utils.invoke(classOf[Statistics], _, "getThreadStatistics"))
    +      val statisticsDataClass =
    +        Class.forName("org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData")
    +      val getBytesReadMethod = statisticsDataClass.getDeclaredMethod("getBytesRead")
    +      val f = () => threadStats.map(getBytesReadMethod.invoke(_).asInstanceOf[Long]).sum
    +      val start = f()
    +      Some(() => f() - start)
    --- End diff --
    
    What is going on with the deltas here? The javadoc says it's the bytes read thus far, but if I look at the function here it seems like it will return a function that gives you the bytes read since the function was created.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-60458950
  
    Ah, sorry about that - I'm out until tomorrow morning but I can look then. I just wanted to test this locally with a few Hadoop versions to check it; otherwise this looks good. In the meantime feel free to send up the other patch so folks can review... and you can rebase that once this goes in.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18939658
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    +    val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
    +    val scheme = qualifiedPath.toUri().getScheme()
    +    val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
    +    try {
    +      val threadStats = stats.map(Utils.invoke(classOf[Statistics], _, "getThreadStatistics"))
    +      val statisticsDataClass =
    +        Class.forName("org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData")
    +      val getBytesReadMethod = statisticsDataClass.getDeclaredMethod("getBytesRead")
    +      val f = () => threadStats.map(getBytesReadMethod.invoke(_).asInstanceOf[Long]).sum
    +      val start = f()
    +      Some(() => f() - start)
    --- End diff --
    
    ah I see now - so should we call this function `getThreadLocalBytesRead` or something? It seems like it only makes sense semantically if the returned function is called from the same thread the function was generated in. That might also be worth documenting in the javadoc. The use of the phrase "so far" there also threw me off a bit - maybe there is a better phrase for that.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59318880
  
    Yeah you are totally right - the performance bit was not correct from my end. I added some more comments on this.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r19259861
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
    @@ -147,12 +150,37 @@ class NewHadoopRDD[K, V](
               throw new java.util.NoSuchElementException("End of stream")
             }
             havePair = false
    +
    +        // Update bytes read metric every few records
    +        if (recordsSinceMetricsUpdate == HadoopRDD.RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES
    +            && bytesReadCallback.isDefined) {
    +          recordsSinceMetricsUpdate = 0
    +          inputMetrics.bytesRead = bytesReadCallback.get()
    --- End diff --
    
    here too




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-58979294
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/367/consoleFull) for   PR 2087 at commit [`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54521510
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19793/consoleFull) for   PR 2087 at commit [`0034292`](https://github.com/apache/spark/commit/00342924b0b3eba12861bf1cd524f0bfca2fbc4e).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18832748
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -222,12 +221,33 @@ class HadoopRDD[K, V](
               case eof: EOFException =>
                 finished = true
             }
    +
    +        // Update bytes read metric every 32 records
    +        if (recordsSinceMetricsUpdate == 32 && bytesReadCallback.isDefined) {
    +          recordsSinceMetricsUpdate = 0
    +          inputMetrics.bytesRead = bytesReadCallback.get()
    --- End diff --
    
    This compiles for me and apparently gets interpreted as `(bytesReadCallback.get)()`.  Agreed that this looks weird.  Is there a phrasing you think would be best?
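    For reference, why it compiles: with an `Option[() => Long]`, Scala parses the trailing `()` as `.apply()` on the unwrapped function. A small self-contained illustration:

        val bytesReadCallback: Option[() => Long] = Some(() => 42L)

        bytesReadCallback.get()        // parsed as (bytesReadCallback.get).apply()
        bytesReadCallback.get.apply()  // the same call, spelled out
        bytesReadCallback.foreach(f => println(f()))  // safer: runs only if defined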




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r19113109
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    +    val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
    +    val scheme = qualifiedPath.toUri().getScheme()
    +    val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
    --- End diff --
    
    Right.  I assumed your "It looks like this is based on a static cache within the FileSystem class" comment was because you noticed this.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-55358462
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20199/consoleFull) for   PR 2087 at commit [`8bfaa24`](https://github.com/apache/spark/commit/8bfaa24c03262db846f61d50ed174200da527f82).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18796228
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    --- End diff --
    
    Actually - this will only be called once per partition... let me continue looking.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59970613
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21995/consoleFull) for   PR 2087 at commit [`74fc9bb`](https://github.com/apache/spark/commit/74fc9bb31081453f701f15d090ec6f0f988a9f2f).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-52998163
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19069/consoleFull) for   PR 2087 at commit [`32daf1f`](https://github.com/apache/spark/commit/32daf1fee117a54cad1e4dfa258c846aa507f9e5).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18832502
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    --- End diff --
    
    getInputBytesReadCallback only gets called once per task - to find the function and return it.  Are you worried about that per-task overhead?  We still have a single reflective call to actually invoke it when we want to populate the metric.  The internet has a few different opinions on the overhead of this.  Most likely it's only about twice the overhead of a direct function call, but I've also seen threads that say it's much more.  Either way, this was the root of my earlier wariness about having this call on the read path as opposed to doing it asynchronously in a separate thread.
    
    On the catch-all exception, you're right - will rework that part.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-57425652
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21091/consoleFull) for   PR 2087 at commit [`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-57236118
  
    > The current approach couples the updating of this metric with the heartbeats in a way that seems strange.
    
    The heartbeats (and task completion, which, my bad, I need to add in) are the only times when we use the value of this metric.  Is there an advantage to adding complexity to keep it more up to date than that?  We'd also be adding an extra branch on the read path, which I suppose might not be much compared with the crazy stuff Hadoop record readers do, but could still be a small perf hit.  Lastly, in the (rare) case where we're reading a single huge record, we wouldn't get incremental measurements within it.
    
    We use a similar approach for shuffleReadMetrics, aggregating it across readers right before sending it to the driver.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18795775
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    +    val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
    +    val scheme = qualifiedPath.toUri().getScheme()
    +    val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
    +    try {
    +      val threadStats = stats.map(Utils.invoke(classOf[Statistics], _, "getThreadStatistics"))
    --- End diff --
    
    some versions of Hadoop have a method called `getThreadData` that seems to return something very similar; is that the same as this, modulo the name? Or is it different?




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18795643
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
    @@ -147,12 +150,36 @@ class NewHadoopRDD[K, V](
               throw new java.util.NoSuchElementException("End of stream")
             }
             havePair = false
    +
    +        // Update bytes read metric every 32 records
    --- End diff --
    
    It would be better to pull this number into a constant and also probably make it much higher... 32 will have a performance implication since there are, e.g., thread locals and other things going on here.




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18832984
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    +    val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
    +    val scheme = qualifiedPath.toUri().getScheme()
    +    val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
    --- End diff --
    
    The code inside FileSystem is a little confusing about this, but `CACHE.get(...)` will actually create a new FileSystem object if none currently exists.
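    Concretely, the helper itself populates that cache before consulting the statistics, since its first line (quoted from the diff above) is:

        // Getting the FileSystem for the path creates and caches the instance
        // (and its Statistics entry) if none exists yet.
        val qualifiedPath = path.getFileSystem(conf).makeQualified(path)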




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-57421448
  
    The updated patch switches from the pull model to the push model as requested by @pwendell and adds a test.  I verified that the test succeeds against both Hadoop 2.2 and Hadoop 2.5 (which contains the new API).




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59892050
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21976/




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59057738
  
    > I.e. the Hadoop RDD should look up the entire function for the computing thread at the beginning, then it can invoke that function within the hot loop only.
    
    Commented inline above, but am I missing something about my implementation?  This is what I thought my code was doing.  Unless you mean that looking up the function per-task is too expensive?




[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18939560
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    --- End diff --
    
    I mean that the reflective lookups are themselves expensive, not calling Method.invoke. However, now I see that this is just returning a new function that calls Method.invoke, so it should be fine performance-wise.
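
    A minimal sketch of that pattern, assuming Hadoop 2.5's reflective entry points (`getThreadStatistics` on `FileSystem.Statistics`, and `getBytesRead` on its `StatisticsData` class); `makeBytesReadCallback` is an illustrative name, not the patch's actual identifier:

    ```
    import scala.collection.JavaConverters._
    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.fs.FileSystem.Statistics

    def makeBytesReadCallback(scheme: String): Option[() => Long] = {
      val stats = FileSystem.getAllStatistics.asScala.filter(_.getScheme == scheme)
      try {
        // All reflective lookups happen once, up front.
        val getThreadStatistics =
          classOf[Statistics].getDeclaredMethod("getThreadStatistics")
        val statsDataClass =
          Class.forName("org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData")
        val getBytesRead = statsDataClass.getDeclaredMethod("getBytesRead")
        // The returned closure only pays for Method.invoke per call.
        Some(() => stats.map { s =>
          getBytesRead.invoke(getThreadStatistics.invoke(s)).asInstanceOf[Long]
        }.sum)
      } catch {
        case _: NoSuchMethodException | _: ClassNotFoundException => None
      }
    }
    ```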


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59882086
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21971/consoleFull) for   PR 2087 at commit [`1ab662d`](https://github.com/apache/spark/commit/1ab662d8ae674407bfe0f8bbc14aedf1da60c030).
     * This patch merges cleanly.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-60193928
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22061/consoleFull) for   PR 2087 at commit [`23010b8`](https://github.com/apache/spark/commit/23010b850b28fccd9b33b0352c4bc2cb5f5dd45c).
     * This patch merges cleanly.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-60198186
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22061/consoleFull) for   PR 2087 at commit [`23010b8`](https://github.com/apache/spark/commit/23010b850b28fccd9b33b0352c4bc2cb5f5dd45c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-60528666
  
    Hey @sryza this looks good - I tested it locally and it worked. I stumbled a bit with the test because I was using coalesce() and these metrics don't work well with coalesced RDDs right now (which I forgot).
    
    I just have one small question about slightly simplifying access to the metrics variable. Let me know what you think... if it's possible to clean it quickly it would be nice to do it and then merge this. If that can't be simplified for some reason then let me know and we can just merge this.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59891135
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21971/


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-60193215
  
    @pwendell any further comments on this?


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r19259854
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -244,12 +243,35 @@ class HadoopRDD[K, V](
               case eof: EOFException =>
                 finished = true
             }
    +
    +        // Update bytes read metric every few records
    +        if (recordsSinceMetricsUpdate == HadoopRDD.RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES
    +            && bytesReadCallback.isDefined) {
    +          recordsSinceMetricsUpdate = 0
    +          val bytesReadFn = bytesReadCallback.get
    +          inputMetrics.bytesRead = bytesReadFn()
    +        } else {
    +          recordsSinceMetricsUpdate += 1
    +        }
             (key, value)
           }
     
           override def close() {
             try {
               reader.close()
    +          if (bytesReadCallback.isDefined) {
    +            inputMetrics.bytesRead = bytesReadCallback.get()
    --- End diff --
    
    can you change this invocation to use the correct style as well?


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-58988193
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/376/consoleFull) for   PR 2087 at commit [`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18939679
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -222,12 +221,33 @@ class HadoopRDD[K, V](
               case eof: EOFException =>
                 finished = true
             }
    +
    +        // Update bytes read metric every 32 records
    +        if (recordsSinceMetricsUpdate == 32 && bytesReadCallback.isDefined) {
    +          recordsSinceMetricsUpdate = 0
    +          inputMetrics.bytesRead = bytesReadCallback.get()
    --- End diff --
    
    One thought is to make it two statements:
    ```
    val bytesReadFn = bytesReadCallback.get
    inputMetrics.bytesRead = bytesReadFn()
    ```
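
    Equivalently, a form that sidesteps the ambiguous `.get()` entirely (a sketch, not the patch's code):

    ```
    // foreach runs the update only when the callback was resolved.
    bytesReadCallback.foreach { fn => inputMetrics.bytesRead = fn() }
    ```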


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r19259865
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
    @@ -147,12 +150,37 @@ class NewHadoopRDD[K, V](
               throw new java.util.NoSuchElementException("End of stream")
             }
             havePair = false
    +
    +        // Update bytes read metric every few records
    +        if (recordsSinceMetricsUpdate == HadoopRDD.RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES
    +            && bytesReadCallback.isDefined) {
    +          recordsSinceMetricsUpdate = 0
    +          inputMetrics.bytesRead = bytesReadCallback.get()
    +        } else {
    +          recordsSinceMetricsUpdate += 1
    +        }
    +
             (reader.getCurrentKey, reader.getCurrentValue)
           }
     
           private def close() {
             try {
               reader.close()
    +
    +          // Update metrics with final amount
    +          if (bytesReadCallback.isDefined) {
    +            inputMetrics.bytesRead = bytesReadCallback.get()
    --- End diff --
    
    here too


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r19387877
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -224,18 +223,18 @@ class HadoopRDD[K, V](
           val key: K = reader.createKey()
           val value: V = reader.createValue()
     
    -      // Set the task input metrics.
           val inputMetrics = new InputMetrics(DataReadMethod.Hadoop)
    -      try {
    -        /* bytesRead may not exactly equal the bytes read by a task: split boundaries aren't
    -         * always at record boundaries, so tasks may need to read into other splits to complete
    -         * a record. */
    -        inputMetrics.bytesRead = split.inputSplit.value.getLength()
    -      } catch {
    -        case e: java.io.IOException =>
    -          logWarning("Unable to get input size to set InputMetrics for task", e)
    +      // Find a function that will return the FileSystem bytes read by this thread.
    +      val bytesReadCallback = if (split.inputSplit.value.isInstanceOf[FileSplit]) {
    +        SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
    +          split.inputSplit.value.asInstanceOf[FileSplit].getPath, jobConf)
    +      } else {
    +        None
    +      }
    +      if (bytesReadCallback.isDefined) {
    +        context.taskMetrics.inputMetrics = Some(inputMetrics)
           }
    -      context.taskMetrics.inputMetrics = Some(inputMetrics)
    --- End diff --
    
    My thinking was that, if we aren't able to get the in-progress metrics callback, we don't want to set the InputMetrics until the end of the task.  If we were to set the input metrics at the beginning of the task, in-progress metrics would always show its bytes read as 0, instead of not at all, which is confusing.
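
    A condensed sketch of that ordering (illustrative, not the patch verbatim): publish the metrics object up front only when incremental updates are possible; otherwise attach it once, with its final value, when the iterator closes:

    ```
    val inputMetrics = new InputMetrics(DataReadMethod.Hadoop)
    // Publish early only if we can keep the value current mid-task.
    if (bytesReadCallback.isDefined) {
      context.taskMetrics.inputMetrics = Some(inputMetrics)
    }

    def close(): Unit = {
      reader.close()
      bytesReadCallback match {
        case Some(fn) =>
          inputMetrics.bytesRead = fn()  // exact thread-level value
        case None =>
          // Fallback: attach split-size metrics only now, so in-progress
          // views never show a misleading 0 bytes read.
          inputMetrics.bytesRead = split.inputSplit.value.getLength()
          context.taskMetrics.inputMetrics = Some(inputMetrics)
      }
    }
    ```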


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-53005751
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19069/consoleFull) for   PR 2087 at commit [`32daf1f`](https://github.com/apache/spark/commit/32daf1fee117a54cad1e4dfa258c846aa507f9e5).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-56307027
  
    @aarondav @sryza Did you consider using reader.getPos() to get the correct metrics for older versions of Hadoop (as in here: https://github.com/kayousterhout/spark-1/blob/0028f0de92d79fd4df01d69b9fefdc51d3489c54/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L170)?  That only fixes half the problem because I think it's only available in the old Hadoop API (not the new one), but I think this is the way that MapReduce correctly sets the input bytes.
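
    For reference, the old `mapred` API does expose the reader position directly, so a pre-2.5 fallback could look roughly like this (a sketch; the subtraction assumes a FileSplit read sequentially from its start, and getPos() is only byte-oriented for file-based input formats):

    ```
    import org.apache.hadoop.mapred.{FileSplit, RecordReader}

    // Approximate bytes read so far from the reader's position in the file.
    def bytesReadFromPos[K, V](reader: RecordReader[K, V], split: FileSplit): Long =
      reader.getPos() - split.getStart()
    ```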


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-57244182
  
    Yeah so I just prefer keeping the TaskMetrics/InputMetrics as simple as possible rather than having callback registration and other state in them. The simplest possible interface is that they are just structs and people update their values. This keeps all of the logic around this thread-based Hadoop instrumentation local to the HadoopRDD itself, so the interface is much narrower between the components.
    
    Overall, I'm gauging the complexity based on how complicated the interfaces are, not on the complexity of the internal implementations.
    
    If we have a single large record this might be an issue. But we already make other assumptions that record sizes are fairly small, for instance they must fit easily in memory so they can't be large.
    
    By keeping the interactions between the components simpler, this will be easier to test also. Right now there are no unit tests for this and because the interfaces are complex it might be difficult to test as-is.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54807921
  
    FWIW, I think mostly-accurate metrics are much better than no metrics in this case. The read/write bytes are very useful from Hadoop FSes, and Hadoop <2.5 is still very much widespread (Spark's default).


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54519238
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19790/consoleFull) for   PR 2087 at commit [`0a743c0`](https://github.com/apache/spark/commit/0a743c00307c721b74343eceb715925101fc5c66).
     * This patch **does not** merge cleanly!


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59891129
  
    **[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21971/consoleFull)**     for PR 2087 at commit [`1ab662d`](https://github.com/apache/spark/commit/1ab662d8ae674407bfe0f8bbc14aedf1da60c030)     after a configured wait of `120m`.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54851789
  
    retest this please


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59891774
  
    Jenkins, retest this please.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18939525
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    +    val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
    +    val scheme = qualifiedPath.toUri().getScheme()
    +    val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
    --- End diff --
    
    Does this call `CACHE.get` transitively somehow?


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-60198189
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22061/


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59885695
  
    Cool, updated patch addresses the comments.  It looks like the failure was caused by a failed fetch from git.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-60193547
  
    Just had some minor style comments - there were four cases which used the confusing invocation style but you only changed one of them.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r19382961
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -224,18 +223,18 @@ class HadoopRDD[K, V](
           val key: K = reader.createKey()
           val value: V = reader.createValue()
     
    -      // Set the task input metrics.
           val inputMetrics = new InputMetrics(DataReadMethod.Hadoop)
    -      try {
    -        /* bytesRead may not exactly equal the bytes read by a task: split boundaries aren't
    -         * always at record boundaries, so tasks may need to read into other splits to complete
    -         * a record. */
    -        inputMetrics.bytesRead = split.inputSplit.value.getLength()
    -      } catch {
    -        case e: java.io.IOException =>
    -          logWarning("Unable to get input size to set InputMetrics for task", e)
    +      // Find a function that will return the FileSystem bytes read by this thread.
    +      val bytesReadCallback = if (split.inputSplit.value.isInstanceOf[FileSplit]) {
    +        SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
    +          split.inputSplit.value.asInstanceOf[FileSplit].getPath, jobConf)
    +      } else {
    +        None
    +      }
    +      if (bytesReadCallback.isDefined) {
    +        context.taskMetrics.inputMetrics = Some(inputMetrics)
           }
    -      context.taskMetrics.inputMetrics = Some(inputMetrics)
    --- End diff --
    
    Is there a reason why we can't just set the input metrics (this line) here in both cases? In the case where we can't get the read callback, we just do this exact thing further down. Is there a reason why you can't just make the assignment here and just update the `bytesRead` below?
    
    In fact, would it be possible to just do this at the beginning of the function:
    ```
    context.taskMetrics.inputMetrics = Some(new InputMetrics(DataReadMethod.Hadoop))
    ```
    and not have a `val inputMetrics` here? Then we could just always access `context.taskMetrics` in the code and we'd just have one code path for accessing this.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-57229133
  
    Hey @sryza so it seems like there are two things going on here. One is adding incremental update and the other is changing the way we deal with tracking read bytes for Hadoop RDD's. For the incremental updates, could we just make bytes read an atomic long and update it directly inside of the `compute` functions - this seems simpler than using callbacks? For instance, what if we just update the bytes read every N records by reading from the thread local information.
    
    The current approach couples the updating of this metric with the heartbeats in a way that seems strange. In fact, is `updateBytesRead` ever called here if the heartbeats are disabled or are very long? And don't we need to call `updateBytesRead` once the task finishes... for instance, more bytes could have been read after the most recent heartbeat was sent, right? With an approach that updates every N records and then once more when the entire partition is computed, it would be easier to reason about the order of updates.
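
    A sketch of that shape (the constant is hypothetical here; the later patch names it `HadoopRDD.RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES`): count records as the iterator produces them, refresh the metric every N of them, and refresh once more when the partition finishes:

    ```
    val updatePeriod = 32  // refresh the metric every N records
    var recordsSinceUpdate = 0

    def onRecord(): Unit = {
      recordsSinceUpdate += 1
      if (recordsSinceUpdate >= updatePeriod) {
        recordsSinceUpdate = 0
        bytesReadCallback.foreach(fn => inputMetrics.bytesRead = fn())
      }
    }

    def onClose(): Unit = {
      // Catch bytes read after the last periodic refresh.
      bytesReadCallback.foreach(fn => inputMetrics.bytesRead = fn())
    }
    ```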


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-55355339
  
    Updated patch includes a fallback to the split size.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54867619
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19983/consoleFull) for   PR 2087 at commit [`0034292`](https://github.com/apache/spark/commit/00342924b0b3eba12861bf1cd524f0bfca2fbc4e).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `    logDebug("isMulticlass = " + metadata.isMulticlass)`
      * `    logDebug("isMulticlass = " + metadata.isMulticlass)`



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59892045
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21976/consoleFull) for   PR 2087 at commit [`1ab662d`](https://github.com/apache/spark/commit/1ab662d8ae674407bfe0f8bbc14aedf1da60c030).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r19419188
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -224,18 +223,18 @@ class HadoopRDD[K, V](
           val key: K = reader.createKey()
           val value: V = reader.createValue()
     
    -      // Set the task input metrics.
           val inputMetrics = new InputMetrics(DataReadMethod.Hadoop)
    -      try {
    -        /* bytesRead may not exactly equal the bytes read by a task: split boundaries aren't
    -         * always at record boundaries, so tasks may need to read into other splits to complete
    -         * a record. */
    -        inputMetrics.bytesRead = split.inputSplit.value.getLength()
    -      } catch {
    -        case e: java.io.IOException =>
    -          logWarning("Unable to get input size to set InputMetrics for task", e)
    +      // Find a function that will return the FileSystem bytes read by this thread.
    +      val bytesReadCallback = if (split.inputSplit.value.isInstanceOf[FileSplit]) {
    +        SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
    +          split.inputSplit.value.asInstanceOf[FileSplit].getPath, jobConf)
    +      } else {
    +        None
    +      }
    +      if (bytesReadCallback.isDefined) {
    +        context.taskMetrics.inputMetrics = Some(inputMetrics)
           }
    -      context.taskMetrics.inputMetrics = Some(inputMetrics)
    --- End diff --
    
    Gotcha - maybe we can add a comment about this later.



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-55355492
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20199/consoleFull) for   PR 2087 at commit [`8bfaa24`](https://github.com/apache/spark/commit/8bfaa24c03262db846f61d50ed174200da527f82).
     * This patch merges cleanly.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59959931
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21995/consoleFull) for   PR 2087 at commit [`74fc9bb`](https://github.com/apache/spark/commit/74fc9bb31081453f701f15d090ec6f0f988a9f2f).
     * This patch merges cleanly.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54695093
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19865/consoleFull) for   PR 2087 at commit [`0034292`](https://github.com/apache/spark/commit/00342924b0b3eba12861bf1cd524f0bfca2fbc4e).
     * This patch merges cleanly.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-58960161
  
    Hey Sandy, I had a couple of questions about behavior and assumptions from Hadoop. The current approach does a lot of reflection every time we invoke this statistics function, which is very expensive. Also, there is some "catch all" exception handling that could bite us. Two things that would help:
    
    1. Determine a single time up-front whether we will try to compute these advanced statistics:
    ```
      private val statsClass = "org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData"
      private val statsFunction = "getThreadStatistics"
    
      /** Whether to attempt accessing per-thread statistics from Hadoop */
      private[spark] val hasAdvancedStatistics =
        Try(Class.forName(statsClass).getDeclaredMethod(statsFunction)).isSuccess
    ```
    
    And then remove the exception blocks elsewhere. I.e. if we detect advanced statistics are available and then there is a failure getting them, we should throw an exception.
    
    2. Perform as much reflection as possible off the critical path. I.e. the Hadoop RDD should look up the entire function for the computing thread at the beginning, then it can invoke that function within the hot loop only.
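
    Hypothetical usage of the flag from point 1 (the callback factory `lookupThreadBytesReadFn` is an assumed helper, not an existing API): branch once per task instead of wrapping the hot loop in catch-all exception handling:

    ```
    // Decide the code path once per task; no exception handling per record.
    val bytesReadCallback: Option[() => Long] =
      if (SparkHadoopUtil.get.hasAdvancedStatistics) {
        Some(lookupThreadBytesReadFn(path, jobConf))  // hypothetical helper
      } else {
        None
      }
    ```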



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54563520
  
    Jenkins, test this please.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-58984415
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/376/consoleFull) for   PR 2087 at commit [`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
     * This patch merges cleanly.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-56461580
  
    MapReduce doesn't use getPos, but it does look like it might be helpful in some situations.  One caveat is that pos only means # bytes for file input formats.  For example, for DBInputFormat, it means the number of records. 
    
    If we choose to use getPos for pre-2.5 Hadoop, my preference would be to make that change in a separate patch.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18831556
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    +    val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
    +    val scheme = qualifiedPath.toUri().getScheme()
    +    val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
    +    try {
    +      val threadStats = stats.map(Utils.invoke(classOf[Statistics], _, "getThreadStatistics"))
    --- End diff --
    
    getThreadData is basically the same as getThreadStatistics, but when we made the method public, we changed its name to make it a little bit more descriptive.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-57260156
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21013/consoleFull) for   PR 2087 at commit [`a5486af`](https://github.com/apache/spark/commit/a5486af4c6750980eba8065344676ba64f2a8ad5).
     * This patch merges cleanly.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-54852759
  
    Just to make sure it's clear, the issue isn't only that we can be a few bytes off when we're reading outside of split boundaries, but that it'll look like we read the full split even when the job is only a take or a query with a limit.
    
    If you're comfortable with that, I don't mind adding it back in.
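
    To make the distortion concrete (hypothetical path and sizes): with the split-size fallback, a job like

    ```
    sc.textFile("hdfs:///logs/part-00000").take(1)
    ```

    reads only the first record, yet bytesRead would report the entire split, e.g. the full 128 MB.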


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59882564
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21969/


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59970626
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21995/


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59886278
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21976/consoleFull) for   PR 2087 at commit [`1ab662d`](https://github.com/apache/spark/commit/1ab662d8ae674407bfe0f8bbc14aedf1da60c030).
     * This patch merges cleanly.


[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-59959228
  
    Small change to make a method I added private

