You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/03 05:45:03 UTC

[GitHub] [spark] viirya opened a new pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

viirya opened a new pull request #31451:
URL: https://github.com/apache/spark/pull/31451


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   This patch proposes to add a few public API change to DS v2, to make DS v2 scan can report metrics to Spark.
   
   One public interface `CustomMetric` is added. The interface simply can return metric name, description and value.
   
   There are two public methods added to existing public interfaces.
   
   * `PartitionReader.getCustomMetrics()`: returns an array of CustomMetric. Here is where the actual metrics values are collected. Empty array by default.
   * `Scan.supportedCustomMetrics()`: returns an array of supported custom metrics with name and description. Empty array by default.
   
   The metric collection happens as following.
   
   * `BatchScanExec`, `MicroBatchScanExec`, `ContinuousScanExec` call its scan's `supportedCustomMetrics` method to know what custom SQL metrics should be added.
   * `BatchScanExec`, `MicroBatchScanExec`, `ContinuousScanExec` pass a callback function for updating metrics to the underlying RDD. When it completes data consumption, calling the callback function to update metrics from `PartitionReader`.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   This is related to #31398. In SPARK-34297, we want to add a couple of metrics when reading from Kafka in SS. We need some public API change in DS v2 to make it possible. This extracts only DS v2 change and make it general for DS v2 instead of micro-batch DS v2 API.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   Test with SS-specific metrics in #31398.
   
   Unit test WIP
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-773564010


   > I think that this could use some tests. It would be easier to add tests if it were split to add the interfaces, a batch implementation (with tests), and a stream implementation (with stream tests).
   
   Split to #31476 for add the interfaces.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820089390


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41968/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r603905580



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
##########
@@ -199,6 +199,9 @@ class SQLAppStatusListener(
 
   private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] = {
     val metricTypes = exec.metrics.map { m => (m.accumulatorId, m.metricType) }.toMap

Review comment:
       nit: this can be `val accumIds = exec.metrics.map(_. accumulatorId).toSet` now.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
##########
@@ -199,6 +199,9 @@ class SQLAppStatusListener(
 
   private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] = {
     val metricTypes = exec.metrics.map { m => (m.accumulatorId, m.metricType) }.toMap

Review comment:
       nit: this can be `val accumIds = exec.metrics.map(_.accumulatorId).toSet` now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r606951487



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
##########
@@ -51,7 +51,8 @@
   T get();
 
   /**
-   * Returns an array of custom task metrics. By default it returns empty array.
+   * Returns an array of custom task metrics. By default it returns empty array. Note that it is
+   * not recommended to put heavy logic in this method as it may affect reading performance.

Review comment:
       yea I think so




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820198663


   **[Test build #137391 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137391/testReport)** for PR 31451 at commit [`a50bf40`](https://github.com/apache/spark/commit/a50bf40194a98916c3fbeb1b864e6eb0fac7797e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810668931


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41320/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r570234278



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
##########
@@ -44,8 +44,19 @@ case class BatchScanExec(
 
   override lazy val readerFactory: PartitionReaderFactory = batch.createReaderFactory()
 
+  /**
+   * The callback function which is called when the output iterator of input RDD is consumed
+   * completely.
+   */
+  private def onOutputCompletion(reader: PartitionReader[_]) = {

Review comment:
       Do we have the plan to do other things on completion? Otherwise, I'd like to suggest naming `collectMetericsOnCompletion` for now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-772311968


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39399/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820013445


   Now history server doesn't fail when reading the log of custom metric. History server doesn't actually show any metrics, I found. I also checked current master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808816298


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41184/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810668931


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41320/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-772277403


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39399/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808870494


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41193/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r570444688



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala
##########
@@ -32,8 +32,13 @@ import org.apache.spark.util.Utils
 
 trait DataSourceV2ScanExecBase extends LeafExecNode {
 
-  override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+  override lazy val metrics = {

Review comment:
       Sounds good to me. I can work on it  in follow up PRs.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
##########
@@ -44,8 +44,19 @@ case class BatchScanExec(
 
   override lazy val readerFactory: PartitionReaderFactory = batch.createReaderFactory()
 
+  /**
+   * The callback function which is called when the output iterator of input RDD is consumed
+   * completely.
+   */
+  private def onOutputCompletion(reader: PartitionReader[_]) = {
+    reader.getCustomMetrics.foreach { metric =>
+      longMetric(metric.getName) += metric.getValue

Review comment:
       Good idea. I will do it.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
##########
@@ -44,8 +44,19 @@ case class BatchScanExec(
 
   override lazy val readerFactory: PartitionReaderFactory = batch.createReaderFactory()
 
+  /**
+   * The callback function which is called when the output iterator of input RDD is consumed
+   * completely.
+   */
+  private def onOutputCompletion(reader: PartitionReader[_]) = {

Review comment:
       sounds good to me.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ContinuousScanExec.scala
##########
@@ -47,6 +47,16 @@ case class ContinuousScanExec(
     stream.createContinuousReaderFactory()
   }
 
+  /**
+   * The callback function which is called when the output iterator of input RDD is consumed
+   * completely.
+   */
+  private def onOutputCompletion(reader: PartitionReader[_]) = {
+    reader.getCustomMetrics.foreach { metric =>
+      longMetric(metric.getName) += metric.getValue
+    }

Review comment:
       +1 will add in later implementation PRs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-823028148


   **[Test build #137676 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137676/testReport)** for PR 31451 at commit [`06eb9c7`](https://github.com/apache/spark/commit/06eb9c79a3fdd807ec08540deb2234939396325a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820200164


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137391/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820015932


   **[Test build #137391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137391/testReport)** for PR 31451 at commit [`a50bf40`](https://github.com/apache/spark/commit/a50bf40194a98916c3fbeb1b864e6eb0fac7797e).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810663397


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41320/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808822796


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41185/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810554421


   > Thanks for the explanation. This sounds like a change from the API discussed in #31476. IIUC, before, the expectation was that `PartitionReader#currentMetricsValues()` is called after the partition is read. Now, the expectation is that `PartitionReader#currentMetricsValues()` is called for every row we iterate through in the reader. Such expectation should be documented clearly in the API, for implementors of custom metrics.
   
   I don't see we have documented in the API the exact time where `currentMetricsValues` will be called. This is implementation detail. If you worry about the implementation of `currentMetricsValues` will do something taking time. We can add a note to the API suggesting not to do heavy login in it. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820238460


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41980/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820200164


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137391/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-772349211


   **[Test build #134811 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134811/testReport)** for PR 31451 at commit [`4d941c7`](https://github.com/apache/spark/commit/4d941c7bb606ebf7d8a76b7b3e3d6fd6a3a00b95).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r603903106



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousDataSourceRDD.scala
##########
@@ -88,8 +90,12 @@ class ContinuousDataSourceRDD(
       partition.queueReader
     }
 
+    val partitionReader = readerForPartition.getPartitionReader()
     new NextIterator[InternalRow] {
       override def getNext(): InternalRow = {
+        partitionReader.currentMetricsValues.foreach { metric =>

Review comment:
       ditto




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808821706


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41185/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808890389


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136609/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808822412


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41185/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-818542369


   > @viirya any updates?
   
   Sorry for late. I will update in this week. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820015932


   **[Test build #137391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137391/testReport)** for PR 31451 at commit [`a50bf40`](https://github.com/apache/spark/commit/a50bf40194a98916c3fbeb1b864e6eb0fac7797e).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820089390


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41968/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808822796


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41185/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820772686


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137431/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808871371


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41193/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808817388


   **[Test build #136602 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136602/testReport)** for PR 31451 at commit [`d3bf283`](https://github.com/apache/spark/commit/d3bf283c1f0e0e7dfb54243c6d8bfc15e30082b6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808817388


   **[Test build #136602 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136602/testReport)** for PR 31451 at commit [`d3bf283`](https://github.com/apache/spark/commit/d3bf283c1f0e0e7dfb54243c6d8bfc15e30082b6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820310479


   > History server doesn't actually show any metrics, I found. I also checked current master.
   
   @gengliangwang @rednaxelafx is this true? SQL metrics is very important to analyze SQL queries. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r606111670



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
##########
@@ -51,7 +51,8 @@
   T get();
 
   /**
-   * Returns an array of custom task metrics. By default it returns empty array.
+   * Returns an array of custom task metrics. By default it returns empty array. Note that it is
+   * not recommended to put heavy logic in this method as it may affect reading performance.

Review comment:
       So sounds like a feasible approach is to call `currentMetricsValues` and update metrics per some rows and also update it at the end of the task?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820391096


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137404/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808865639


   **[Test build #136611 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136611/testReport)** for PR 31451 at commit [`918c90d`](https://github.com/apache/spark/commit/918c90da4c62419f57674cbee707628e0c65f8a6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808860917


   **[Test build #136609 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136609/testReport)** for PR 31451 at commit [`aedb965`](https://github.com/apache/spark/commit/aedb965a2b7cef2027bf7857fa4b7c10f4e8d571).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-823076827


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42204/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820757279


   **[Test build #137431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137431/testReport)** for PR 31451 at commit [`0f38782`](https://github.com/apache/spark/commit/0f38782bfeae0f55bb662780d66153270b5ac40d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808839684


   **[Test build #136602 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136602/testReport)** for PR 31451 at commit [`d3bf283`](https://github.com/apache/spark/commit/d3bf283c1f0e0e7dfb54243c6d8bfc15e30082b6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810807243


   @viirya how about the history server? I'm a bit worried about the event log with v2 metrics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808816938


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41184/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r615887650



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.metric
+
+import org.apache.spark.sql.connector.CustomMetric
+
+object CustomMetrics {
+  private[spark] val V2_CUSTOM = "v2Custom"
+
+  /**
+   * Given a class name, builds and returns a metric type for a V2 custom metric class
+   * `CustomMetric`.
+   */
+  def buildV2CustomMetricTypeName(customMetric: CustomMetric): String = {
+    s"${V2_CUSTOM}_${customMetric.getClass.getCanonicalName}"
+  }
+
+  /**
+   * Given a V2 custom metric type name, this method parses it and returns the corresponding
+   * `CustomMetric` class name.
+   */
+  def parseV2CustomMetricType(metricType: String): Option[String] = {
+    val className = metricType.stripPrefix(s"${V2_CUSTOM}_")

Review comment:
       nit: to make the code more readable
   ```
   if (metricType.startsWith(s"${V2_CUSTOM}_")) {
     Some(metricType.drop(V2_CUSTOM.length + 1))
   } else {
     None
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r605101273



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala
##########
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,
+    val aggregateMethod: (Array[Long], Array[Long]) => String)

Review comment:
       For now the only reason to have default there, is to avoid changing test code. This doesn't look like strong reason. For personal preference, I can understand you may prefer to set default there. However, for code structure I don't think a `SQLMetricInfo` or `SQLPlanMetric` should know how to aggregate metric by default. If we actually need these classes to initiate metric aggregation in the future, we may change them. I think we should not adding default values just because of an unknown future usage.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r616167606



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala
##########
@@ -811,6 +814,42 @@ class SQLAppStatusListenerSuite extends SharedSparkSession with JsonTestUtils
     listener.onOtherEvent(SparkListenerSQLExecutionEnd(
       executionId, System.currentTimeMillis()))
   }
+
+
+  test("SPARK-34338: Report metrics from Datasource v2 scan") {
+    val statusStore = spark.sharedState.statusStore
+    val oldCount = statusStore.executionsList().size
+
+    val schema = new StructType().add("i", "int").add("j", "int")
+    val physicalPlan = BatchScanExec(schema.toAttributes, new CustomMetricScanBuilder())

Review comment:
       moved.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r605093466



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala
##########
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,
+    val aggregateMethod: (Array[Long], Array[Long]) => String)

Review comment:
       Yes, precisely, `SQLMetricInfo` objects are created from `SQLMetric` objects and `SQLPlanMetric` objects are created from `SQLMetricInfo` objects. Most of the time `SQLMetric` objects just need the default aggregateMethod. Hence it is natural for `SQLMetricInfo` and `SQLPlanMetric` to have this same default.
   A default is just that; in the non-test parts that this PR needed to touch, we have to explicitly set the aggregateMethod from the `SQLMetric`, because they have to apply to all `SQLMetric`s. But in other cases, for the test parts now, or for future uses that we don't know about now, the default may be applicable.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810849844


   @viirya I think this is hard to fix. The history server is a different JVM and may not have the data source implementation classes loaded, we can't run the aggregate code in the history server.
   
   One idea is to provide some builtin implementations of `CustomMetric`, so that it's in the Spark codebase and will be available in the history server JVM as well. In the event logs we can record the class name, and the history server can instantiate the class and call the aggregate method.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r605096457



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
##########
@@ -51,7 +51,8 @@
   T get();
 
   /**
-   * Returns an array of custom task metrics. By default it returns empty array.
+   * Returns an array of custom task metrics. By default it returns empty array. Note that it is
+   * not recommended to put heavy logic in this method as it may affect reading performance.

Review comment:
       My interest is in the batch scan use case. It is true that, under the hood, a particular DS v2 may well need to track some particular metric per row, but there are definitely cases where a metric does not need to be updated per row; in those cases, simply reporting the metric value at the end of the read would be cheaper.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808865903


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41191/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604334593



##########
File path: sql/core/src/test/scala/org/apache/spark/status/api/v1/sql/SqlResourceSuite.scala
##########
@@ -41,31 +42,32 @@ object SqlResourceSuite {
 
   val nodeIdAndWSCGIdMap: Map[Long, Option[Long]] = Map(1L -> Some(1L))
 
+  val defaultAggregateMethod = SQLMetrics.defaultAggregateMethod("")
   val filterNode = new SparkPlanGraphNode(1, FILTER, "",
-    metrics = Seq(SQLPlanMetric(NUMBER_OF_OUTPUT_ROWS, 1, "")))
+    metrics = Seq(SQLPlanMetric(NUMBER_OF_OUTPUT_ROWS, 1, "", defaultAggregateMethod)))

Review comment:
       The default aggregate method is defined in `SQLMetrics`. For code structure, it is far from the UI class `SQLPlanMetric` and also conceptionally I don't think we should have a default aggregate method at `SQLPlanMetric`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r570437357



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala
##########
@@ -32,8 +32,13 @@ import org.apache.spark.util.Utils
 
 trait DataSourceV2ScanExecBase extends LeafExecNode {
 
-  override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+  override lazy val metrics = {

Review comment:
       Looks like updating `context.taskMetrics().outputMetrics` is just in our branch. That just uses the Hadoop FS metrics collection that we use elsewhere, so it isn't metrics from the source as we want to support in this PR.
   
   I think it would be good to follow up and support metrics on the output side. It doesn't need to be done here, but metrics are really useful.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808865903


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41191/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808865170


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41191/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808871367


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41193/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r616012193



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
##########
@@ -86,6 +92,9 @@ private class PartitionIterator[T](reader: PartitionReader[T]) extends Iterator[
     if (!hasNext) {
       throw QueryExecutionErrors.endOfStreamError()
     }
+    reader.currentMetricsValues.foreach { metric =>
+      customMetrics(metric.name()).set(metric.value())

Review comment:
       We can throw a user-friendly message if it doesn't.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810647889


   **[Test build #136738 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136738/testReport)** for PR 31451 at commit [`d5d8678`](https://github.com/apache/spark/commit/d5d867880ebb57c49ac422251ba50bbabf1159d1).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-822804878






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810480565


   > > Can you please explain why you changed from passing a completion function to `DataSourceRDD` to passing a `Map[String, SQLMetric]`? What is the benefit?
   > 
   > It is more fit to the current approach we use `SQLMetric`. This approach looks more clear to me. We update custom metrics during consuming the data now, instead of at the completion of data consuming.
   
   Thanks for the explanation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r614238442



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala
##########
@@ -818,3 +818,51 @@ class ReportStatisticsDataSource extends SimpleWritableDataSource {
     }
   }
 }
+
+
+class SimpleCustomMetric extends CustomMetric {
+  override def name(): String = "custom_metric"
+  override def description(): String = "a simple custom metric"
+  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String = {
+    s"custom_metric: ${taskMetrics.mkString(", ")}"
+  }
+}
+
+// The followings are for custom metrics of V2 data source.
+object CustomMetricReaderFactory extends PartitionReaderFactory {
+  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
+    val RangeInputPartition(start, end) = partition
+    new PartitionReader[InternalRow] {
+      private var current = start - 1
+
+      override def next(): Boolean = {
+        current += 1
+        current < end
+      }
+
+      override def get(): InternalRow = InternalRow(current, -current)
+
+      override def close(): Unit = {}
+
+      override def currentMetricsValues(): Array[CustomTaskMetric] = {
+        val metric = new CustomTaskMetric {
+          override def name(): String = "custom_metric"
+          override def value(): Long = 12345
+        }
+        Array(metric)
+      }
+    }
+  }
+}
+
+class CustomMetricScanBuilder extends SimpleScanBuilder {

Review comment:
       Ah, I should not put `CustomMetricDataSourceV2` here. It is a simple DS v2 I use to test and capture screenshot. I added it by mistake.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820198345


   **[Test build #137404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137404/testReport)** for PR 31451 at commit [`b5be5ba`](https://github.com/apache/spark/commit/b5be5ba2abfc5022ff8c6b72e598c9361fbb1323).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810631527


   **[Test build #136737 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136737/testReport)** for PR 31451 at commit [`e8576ec`](https://github.com/apache/spark/commit/e8576ec04596ea962177a6a18818149e332a9b01).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810844094


   > @viirya how about the history server? I'm a bit worried about the event log with v2 metrics.
   
   Oh, there is a problem on ser/de `aggregateMethod`...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820588626


   **[Test build #752827513](https://github.com/viirya/spark-1/actions/runs/752827513)** for PR 31451 at commit [`0f38782`](https://github.com/viirya/spark-1/commit/0f38782bfeae0f55bb662780d66153270b5ac40d).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808811925


   **[Test build #136601 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136601/testReport)** for PR 31451 at commit [`4215ec0`](https://github.com/apache/spark/commit/4215ec0930082c34316c21f943a6395ebbeff72f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604332285



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
##########
@@ -199,6 +199,9 @@ class SQLAppStatusListener(
 
   private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] = {
     val metricTypes = exec.metrics.map { m => (m.accumulatorId, m.metricType) }.toMap

Review comment:
       Or you don't even need that; in below, where you used to do `taskMetrics.filter(m => metricTypes.contains(m._1))`, just do `taskMetrics.filter(m => metricAggregationMethods.contains(m._1))`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r614238442



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala
##########
@@ -818,3 +818,51 @@ class ReportStatisticsDataSource extends SimpleWritableDataSource {
     }
   }
 }
+
+
+class SimpleCustomMetric extends CustomMetric {
+  override def name(): String = "custom_metric"
+  override def description(): String = "a simple custom metric"
+  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String = {
+    s"custom_metric: ${taskMetrics.mkString(", ")}"
+  }
+}
+
+// The followings are for custom metrics of V2 data source.
+object CustomMetricReaderFactory extends PartitionReaderFactory {
+  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
+    val RangeInputPartition(start, end) = partition
+    new PartitionReader[InternalRow] {
+      private var current = start - 1
+
+      override def next(): Boolean = {
+        current += 1
+        current < end
+      }
+
+      override def get(): InternalRow = InternalRow(current, -current)
+
+      override def close(): Unit = {}
+
+      override def currentMetricsValues(): Array[CustomTaskMetric] = {
+        val metric = new CustomTaskMetric {
+          override def name(): String = "custom_metric"
+          override def value(): Long = 12345
+        }
+        Array(metric)
+      }
+    }
+  }
+}
+
+class CustomMetricScanBuilder extends SimpleScanBuilder {

Review comment:
       Ah, I should not put `CustomMetricDataSourceV2` here. It is a simple DS v2 I use to test and capture screenshot locally. I added it by mistake.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604669214



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala
##########
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,

Review comment:
       one idea is to encode the `CustomMetric` class name in the metric type: `v2_custom_com.class.name`. In the UI side we extract the class name, instantiate it, and aggregate the metrics.
   
   We can provide some builtin implementations (Scala objects or Java singleton). If the class name points to a builtin implementation, do not instantiate it and simply call the method of a Scala object.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604572473



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
##########
@@ -51,7 +51,8 @@
   T get();
 
   /**
-   * Returns an array of custom task metrics. By default it returns empty array.
+   * Returns an array of custom task metrics. By default it returns empty array. Note that it is
+   * not recommended to put heavy logic in this method as it may affect reading performance.

Review comment:
       Normally, this method should be very cheap, as it just returns the current metrics values that are being tracked. The implementation may track the metrics per-row or per-batch, but that's out of our control.
   
   That said, even if we call this method only at the end of partition read, the perf overhead can still be big if the data source tracks the metrics per-row under the hood.
   
   @wypoon does it answer your concern?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808871371


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41193/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820238412






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820615579


   **[Test build #137431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137431/testReport)** for PR 31451 at commit [`0f38782`](https://github.com/apache/spark/commit/0f38782bfeae0f55bb662780d66153270b5ac40d).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810664956


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41320/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808016273


   As #31476 was merged, I will update this to implement the API details in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810562358


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41319/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-823216179


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137676/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810799188


   @cloud-fan Captured a screenshot and attached in the description. The DS v2 uses the same custom metrics as I added in `SQLAppStatusListenerSuite`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-822813830


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42177/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-772292807


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39399/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808838118


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136601/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r569875341



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/CustomMetric.java
##########
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * A custom metric for {@link Scan}.

Review comment:
       `LongMetric` sounds okay to me. This basically follows `SQLMetric` to use long value.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r605995166



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
##########
@@ -51,7 +51,8 @@
   T get();
 
   /**
-   * Returns an array of custom task metrics. By default it returns empty array.
+   * Returns an array of custom task metrics. By default it returns empty array. Note that it is
+   * not recommended to put heavy logic in this method as it may affect reading performance.

Review comment:
       It sounds like the DS v2 should be prepared to compute and provide the metrics whenever `PartitionReader#currentMetricsValues()` is called. I think if we always call it at the end of the task (not precluding calling it more often), we should be able to get accurate metrics.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon edited a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
wypoon edited a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810480565


   > > Can you please explain why you changed from passing a completion function to `DataSourceRDD` to passing a `Map[String, SQLMetric]`? What is the benefit?
   > 
   > It is more fit to the current approach we use `SQLMetric`. This approach looks more clear to me. We update custom metrics during consuming the data now, instead of at the completion of data consuming.
   
   Thanks for the explanation. This sounds like a change from the API discussed in https://github.com/apache/spark/pull/31476. IIUC, before, the expectation was that `PartitionReader#currentMetricsValues()` is called after the partition is read. Now, the expectation is that `PartitionReader#currentMetricsValues()` is called for every row we iterate through in the reader. Such expectation should be documented clearly in the API, for implementors of custom metrics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810497805


   **[Test build #136737 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136737/testReport)** for PR 31451 at commit [`e8576ec`](https://github.com/apache/spark/commit/e8576ec04596ea962177a6a18818149e332a9b01).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810745261


   **[Test build #136738 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136738/testReport)** for PR 31451 at commit [`d5d8678`](https://github.com/apache/spark/commit/d5d867880ebb57c49ac422251ba50bbabf1159d1).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r613956852



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala
##########
@@ -818,3 +818,51 @@ class ReportStatisticsDataSource extends SimpleWritableDataSource {
     }
   }
 }
+
+
+class SimpleCustomMetric extends CustomMetric {
+  override def name(): String = "custom_metric"
+  override def description(): String = "a simple custom metric"
+  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String = {
+    s"custom_metric: ${taskMetrics.mkString(", ")}"
+  }
+}
+
+// The followings are for custom metrics of V2 data source.
+object CustomMetricReaderFactory extends PartitionReaderFactory {
+  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
+    val RangeInputPartition(start, end) = partition
+    new PartitionReader[InternalRow] {
+      private var current = start - 1
+
+      override def next(): Boolean = {
+        current += 1
+        current < end
+      }
+
+      override def get(): InternalRow = InternalRow(current, -current)
+
+      override def close(): Unit = {}
+
+      override def currentMetricsValues(): Array[CustomTaskMetric] = {
+        val metric = new CustomTaskMetric {
+          override def name(): String = "custom_metric"
+          override def value(): Long = 12345
+        }
+        Array(metric)
+      }
+    }
+  }
+}
+
+class CustomMetricScanBuilder extends SimpleScanBuilder {

Review comment:
       I'm a bit confused. Why the code here is not put together with `CustomMetricDataSourceV2`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810647889


   **[Test build #136738 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136738/testReport)** for PR 31451 at commit [`d5d8678`](https://github.com/apache/spark/commit/d5d867880ebb57c49ac422251ba50bbabf1159d1).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r569773752



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/CustomMetric.java
##########
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * A custom metric for {@link Scan}.
+ *
+ * @since 3.2.0
+ */
+@Evolving
+public interface CustomMetric {
+  /**
+   * Returns the name of custom metric.
+   */
+  String getName();

Review comment:
       I find that `get` is not very useful in method names. This doesn't fit with the conventions used in Scala (which omits get) and using `get` in an interface means that although a quick field retrieval is expected by Java convention, the actual implementation can do anything.
   
   I think it would be better to name these `name()`, `description()`, and `value()`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r570235588



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
##########
@@ -44,8 +44,19 @@ case class BatchScanExec(
 
   override lazy val readerFactory: PartitionReaderFactory = batch.createReaderFactory()
 
+  /**
+   * The callback function which is called when the output iterator of input RDD is consumed
+   * completely.
+   */
+  private def onOutputCompletion(reader: PartitionReader[_]) = {
+    reader.getCustomMetrics.foreach { metric =>
+      longMetric(metric.getName) += metric.getValue

Review comment:
       What if the metric doesn't exist in the SQLMetrics in case of users implemented mistakenly? Would it be better to throw an explicit error?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-772952749


   Thanks @rdblue for the review. If others are agreed on the interfaces, I can make one only for adding the interfaces and other two for batch and streaming implementations (with tests).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-818535587


   @viirya any updates?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820238460


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41980/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810562358


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41319/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-823076762






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604707280



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala
##########
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,

Review comment:
       Okay, seems this is only feasible solution. I will update this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r607182316



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
##########
@@ -51,7 +51,8 @@
   T get();
 
   /**
-   * Returns an array of custom task metrics. By default it returns empty array.
+   * Returns an array of custom task metrics. By default it returns empty array. Note that it is
+   * not recommended to put heavy logic in this method as it may affect reading performance.

Review comment:
       okay, sgtm.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810755166


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136738/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r570502410



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ContinuousScanExec.scala
##########
@@ -47,6 +47,16 @@ case class ContinuousScanExec(
     stream.createContinuousReaderFactory()
   }
 
+  /**
+   * The callback function which is called when the output iterator of input RDD is consumed
+   * completely.
+   */
+  private def onOutputCompletion(reader: PartitionReader[_]) = {
+    reader.getCustomMetrics.foreach { metric =>
+      longMetric(metric.getName) += metric.getValue
+    }

Review comment:
       +1 will add in later implementation PRs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604469020



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala
##########
@@ -87,9 +91,21 @@ object SQLMetrics {
   private val TIMING_METRIC = "timing"
   private val NS_TIMING_METRIC = "nsTiming"
   private val AVERAGE_METRIC = "average"
+  private val V2_CUSTOM = "v2Custom"

Review comment:
       The metrics is specially for DS v2 API. This is private one, so if we extend it to general custom metric later, we can change it if we want.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604253363



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
##########
@@ -86,6 +92,9 @@ private class PartitionIterator[T](reader: PartitionReader[T]) extends Iterator[
     if (!hasNext) {
       throw QueryExecutionErrors.endOfStreamError()
     }
+    reader.currentMetricsValues.foreach { metric =>
+      customMetrics(metric.name()) += metric.value()

Review comment:
       oh, good catch. Now it should be `=`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810477842


   > Can you please explain why you changed from passing a completion function to `DataSourceRDD` to passing a `Map[String, SQLMetric]`? What is the benefit?
   
   It is more fit to the current approach we use `SQLMetric`. This approach looks more clear to me. We update custom metrics during consuming the data now, instead of at the completion of data consuming. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604669214



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala
##########
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,

Review comment:
       one idea is to encode the `CustomMetric` class name in the metric type: `v2_custom_com.class.name`. In the UI side we extract the class name, instantiate it, and aggregate the metrics.
   
   We can provide some builtin implementations (Scala objects). If the class name points to a builtin implementation, do not instantiate it and simply call the method of a Scala object.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r569774028



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/CustomMetric.java
##########
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * A custom metric for {@link Scan}.

Review comment:
       I think this should note that the metrics is a `long`. What about calling it a `LongMetric`? Any plans to generalize it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r569778168



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ContinuousScanExec.scala
##########
@@ -47,6 +47,16 @@ case class ContinuousScanExec(
     stream.createContinuousReaderFactory()
   }
 
+  /**
+   * The callback function which is called when the output iterator of input RDD is consumed
+   * completely.
+   */
+  private def onOutputCompletion(reader: PartitionReader[_]) = {
+    reader.getCustomMetrics.foreach { metric =>
+      longMetric(metric.getName) += metric.getValue
+    }

Review comment:
       Should we also have a way to set task input metrics? Maybe a reserved metric name like `input-bytes` that will set `TaskContext.get().taskMetrics().inputMetrics.bytesRead`. Records read would be really useful as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-822913691


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137648/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r616379743



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
##########
@@ -199,7 +202,37 @@ class SQLAppStatusListener(
   }
 
   private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] = {
-    val metricTypes = exec.metrics.map { m => (m.accumulatorId, m.metricType) }.toMap
+    val accumIds = exec.metrics.map(_.accumulatorId).toSet
+
+    val metricAggregationMap = new mutable.HashMap[String, (Array[Long], Array[Long]) => String]()
+    val metricAggregationMethods = exec.metrics.map { m =>
+      val optClassName = CustomMetrics.parseV2CustomMetricType(m.metricType)
+      val metricAggMethod = optClassName.map { className =>
+        if (metricAggregationMap.contains(className)) {
+          metricAggregationMap(className)
+        } else {
+          // Try to initiate custom metric object
+          try {
+            val metric = Utils.loadExtensions(classOf[CustomMetric], Seq(className), conf).head
+            val method =
+              (metrics: Array[Long], _: Array[Long]) => metric.aggregateTaskMetrics(metrics)
+            metricAggregationMap.put(className, method)
+            method
+          } catch {
+            case NonFatal(_) =>
+              // Cannot initiaize custom metric object, we might be in history server that does

Review comment:
       Nit:  initialize




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r605392105



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
##########
@@ -51,7 +51,8 @@
   T get();
 
   /**
-   * Returns an array of custom task metrics. By default it returns empty array.
+   * Returns an array of custom task metrics. By default it returns empty array. Note that it is
+   * not recommended to put heavy logic in this method as it may affect reading performance.

Review comment:
       If the iterator is not consumed to the end, doesn't it mean we will produce inaccurate metrics if it stops in the middle of 100?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r613949556



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
##########
@@ -627,7 +627,8 @@ case class AdaptiveSparkPlanExec(
       // of the new plan nodes, so that it can track the valid accumulator updates later
       // and display SQL metrics correctly.
       val newMetrics = newSubPlans.flatMap { p =>
-        p.flatMap(_.metrics.values.map(m => SQLPlanMetric(m.name.get, m.id, m.metricType)))
+        p.flatMap(_.metrics.values.map(m =>
+          SQLPlanMetric(m.name.get, m.id, m.metricType)))

Review comment:
       unnecessary change?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820772686


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137431/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820172851


   **[Test build #751073828](https://github.com/viirya/spark-1/actions/runs/751073828)** for PR 31451 at commit [`b5be5ba`](https://github.com/viirya/spark-1/commit/b5be5ba2abfc5022ff8c6b72e598c9361fbb1323).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808811925


   **[Test build #136601 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136601/testReport)** for PR 31451 at commit [`4215ec0`](https://github.com/apache/spark/commit/4215ec0930082c34316c21f943a6395ebbeff72f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820379844


   **[Test build #137404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137404/testReport)** for PR 31451 at commit [`b5be5ba`](https://github.com/apache/spark/commit/b5be5ba2abfc5022ff8c6b72e598c9361fbb1323).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-822913691


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137648/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808835054


   **[Test build #136601 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136601/testReport)** for PR 31451 at commit [`4215ec0`](https://github.com/apache/spark/commit/4215ec0930082c34316c21f943a6395ebbeff72f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-823216179


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137676/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-773564010


   > I think that this could use some tests. It would be easier to add tests if it were split to add the interfaces, a batch implementation (with tests), and a stream implementation (with stream tests).
   
   Split to #31476 for add the interfaces.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r616034064



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.metric
+
+import org.apache.spark.sql.connector.CustomMetric
+
+object CustomMetrics {
+  private[spark] val V2_CUSTOM = "v2Custom"
+
+  /**
+   * Given a class name, builds and returns a metric type for a V2 custom metric class
+   * `CustomMetric`.
+   */
+  def buildV2CustomMetricTypeName(customMetric: CustomMetric): String = {
+    s"${V2_CUSTOM}_${customMetric.getClass.getCanonicalName}"
+  }
+
+  /**
+   * Given a V2 custom metric type name, this method parses it and returns the corresponding
+   * `CustomMetric` class name.
+   */
+  def parseV2CustomMetricType(metricType: String): Option[String] = {
+    val className = metricType.stripPrefix(s"${V2_CUSTOM}_")
+
+    if (className == metricType) {
+      None
+    } else {
+      Some(className)
+    }
+  }
+}
+
+/**
+ * Built-in `CustomMetric` that sums up metric values.
+ */
+class CustomSumMetric extends CustomMetric {
+  override def name(): String = "CustomSumMetric"
+
+  override def description(): String = "Sum up CustomMetric"
+
+  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String = {
+    taskMetrics.sum.toString
+  }
+}
+
+/**
+ * Built-in `CustomMetric` that computes average of metric values.
+ */
+class CustomAvgMetric extends CustomMetric {
+  override def name(): String = "CustomAvgMetric"
+
+  override def description(): String = "Average CustomMetric"
+
+  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String = {
+    val average = if (taskMetrics.isEmpty) {
+      0L
+    } else {
+      taskMetrics.sum / taskMetrics.length

Review comment:
       No special reason. We can use Double to show numbers after decimal point.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r605413557



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
##########
@@ -51,7 +51,8 @@
   T get();
 
   /**
-   * Returns an array of custom task metrics. By default it returns empty array.
+   * Returns an array of custom task metrics. By default it returns empty array. Note that it is
+   * not recommended to put heavy logic in this method as it may affect reading performance.

Review comment:
       That's a good point. For whole stage codegen with limit, we may stop earlier before consuming the entire iterator.
   
   If we want to do per 100 rows update, we also need to do one more update at the end of the task.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808865899


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41191/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-823346873


   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r605134483



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala
##########
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,
+    val aggregateMethod: (Array[Long], Array[Long]) => String)

Review comment:
       I understand your point. It is just a suggestion. Existing unit tests do not exercise custom metrics, which is a new feature. In my opinion, being able to add this new support in a way that does not necessitate any code change in tests is actually a plus; it shows that the implementation is well isolated (which is, of course, not always possible). New tests that test the new feature would obviously exercise the new parts of the code. That is just my opinion. At the end of the day, this is a minor point.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r616012789



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.metric
+
+import org.apache.spark.sql.connector.CustomMetric
+
+object CustomMetrics {
+  private[spark] val V2_CUSTOM = "v2Custom"
+
+  /**
+   * Given a class name, builds and returns a metric type for a V2 custom metric class
+   * `CustomMetric`.
+   */
+  def buildV2CustomMetricTypeName(customMetric: CustomMetric): String = {
+    s"${V2_CUSTOM}_${customMetric.getClass.getCanonicalName}"
+  }
+
+  /**
+   * Given a V2 custom metric type name, this method parses it and returns the corresponding
+   * `CustomMetric` class name.
+   */
+  def parseV2CustomMetricType(metricType: String): Option[String] = {
+    val className = metricType.stripPrefix(s"${V2_CUSTOM}_")
+
+    if (className == metricType) {
+      None
+    } else {
+      Some(className)
+    }
+  }
+}
+
+/**
+ * Built-in `CustomMetric` that sums up metric values.
+ */
+class CustomSumMetric extends CustomMetric {
+  override def name(): String = "CustomSumMetric"
+
+  override def description(): String = "Sum up CustomMetric"
+
+  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String = {
+    taskMetrics.sum.toString
+  }
+}
+
+/**
+ * Built-in `CustomMetric` that computes average of metric values.
+ */
+class CustomAvgMetric extends CustomMetric {
+  override def name(): String = "CustomAvgMetric"
+
+  override def description(): String = "Average CustomMetric"
+
+  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String = {
+    val average = if (taskMetrics.isEmpty) {
+      0L
+    } else {
+      taskMetrics.sum / taskMetrics.length

Review comment:
       Why choosing the `Long` type instead of the `Double` type here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-823076827


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42204/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820651259


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42008/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-823442034


   Great! Thank you, @viirya and all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r614246538



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala
##########
@@ -818,3 +818,51 @@ class ReportStatisticsDataSource extends SimpleWritableDataSource {
     }
   }
 }
+
+
+class SimpleCustomMetric extends CustomMetric {
+  override def name(): String = "custom_metric"
+  override def description(): String = "a simple custom metric"
+  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String = {
+    s"custom_metric: ${taskMetrics.mkString(", ")}"
+  }
+}
+
+// The followings are for custom metrics of V2 data source.
+object CustomMetricReaderFactory extends PartitionReaderFactory {
+  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
+    val RangeInputPartition(start, end) = partition
+    new PartitionReader[InternalRow] {
+      private var current = start - 1
+
+      override def next(): Boolean = {
+        current += 1
+        current < end
+      }
+
+      override def get(): InternalRow = InternalRow(current, -current)
+
+      override def close(): Unit = {}
+
+      override def currentMetricsValues(): Array[CustomTaskMetric] = {
+        val metric = new CustomTaskMetric {
+          override def name(): String = "custom_metric"
+          override def value(): Long = 12345
+        }
+        Array(metric)
+      }
+    }
+  }
+}
+
+class CustomMetricScanBuilder extends SimpleScanBuilder {

Review comment:
       Removed.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
##########
@@ -86,6 +92,9 @@ private class PartitionIterator[T](reader: PartitionReader[T]) extends Iterator[
     if (!hasNext) {
       throw QueryExecutionErrors.endOfStreamError()
     }
+    reader.currentMetricsValues.foreach { metric =>

Review comment:
       yea, will do "update metrics per some rows" in a followup.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808884077


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136611/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604669872



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala
##########
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,

Review comment:
       We should also skip the metrics if we can't instantiate the class. The code may be run in the history server which doesn't have that class.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-772311968


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39399/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810042025


   Looks good. Let's add some UTs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820040127


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41968/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r606951648



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
##########
@@ -51,7 +51,8 @@
   T get();
 
   /**
-   * Returns an array of custom task metrics. By default it returns empty array.
+   * Returns an array of custom task metrics. By default it returns empty array. Note that it is
+   * not recommended to put heavy logic in this method as it may affect reading performance.

Review comment:
       We can probably add the "update metrics per some rows" in a followup as it's an improvement.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon edited a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
wypoon edited a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810480565


   > > Can you please explain why you changed from passing a completion function to `DataSourceRDD` to passing a `Map[String, SQLMetric]`? What is the benefit?
   > 
   > It is more fit to the current approach we use `SQLMetric`. This approach looks more clear to me. We update custom metrics during consuming the data now, instead of at the completion of data consuming.
   
   Thanks for the explanation. This sounds like a change from the API discussed in https://github.com/apache/spark/pull/31476. IIUC, before, the expectation was that `PartitionReader#currentMetricsValues()` is called after the partition is read. Now, the expectation is that `PartitionReader#currentMetricsValues()` is called in every iteration through the reader. Such expectation should be documented clearly in the API, for implementors of custom metrics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810758921


   @viirya Can we write a simple DS v2 with metrics and try it locally? Then we can get some screenshots of the web UI, and also verify the history server.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-822813830


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42177/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-822782815


   **[Test build #137648 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137648/testReport)** for PR 31451 at commit [`50ed317`](https://github.com/apache/spark/commit/50ed317c2a3fde4bdc1fd74a337d75b2fc85eb1c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r614246711



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
##########
@@ -256,7 +289,8 @@ class SQLAppStatusListener(
     }
 
     val aggregatedMetrics = allMetrics.map { case (id, values) =>
-      id -> SQLMetrics.stringValue(metricTypes(id), values, maxMetricsFromAllStages.getOrElse(id,
+      val aggMethod = metricAggregationMethods(id)
+      id -> aggMethod(values, maxMetricsFromAllStages.getOrElse(id,
         Array.empty[Long]))

Review comment:
       merged.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820640591






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #31451:
URL: https://github.com/apache/spark/pull/31451


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810792543


   Okay. Let me have a simple test DS v2 locally and capture some screenshots of the web UI.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810632069


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136737/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r605374302



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
##########
@@ -51,7 +51,8 @@
   T get();
 
   /**
-   * Returns an array of custom task metrics. By default it returns empty array.
+   * Returns an array of custom task metrics. By default it returns empty array. Note that it is
+   * not recommended to put heavy logic in this method as it may affect reading performance.

Review comment:
       Since we can't control how frequently the underlying data source tracks the metrics, this is all about how frequently we want to update the SQL metrics.
   
   Since the actual metrics update is sent to the driver via heartbeat, it's not very useful if we update the metrics per-row. How about we do it like per 100 rows? Only update once may be problematic if the task runs for a long time.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810854755


   For a `CustomMetric` defined outside Spark, even we record the class name, we still cannot instantiate the class.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-772851822


   I think that this could use some tests. It would be easier to add tests if it were split to add the interfaces, a batch implementation (with tests), and a stream implementation (with stream tests).
   
   Thanks for working on this, @viirya!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r570234278



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
##########
@@ -44,8 +44,19 @@ case class BatchScanExec(
 
   override lazy val readerFactory: PartitionReaderFactory = batch.createReaderFactory()
 
+  /**
+   * The callback function which is called when the output iterator of input RDD is consumed
+   * completely.
+   */
+  private def onOutputCompletion(reader: PartitionReader[_]) = {

Review comment:
       Do we have the plan to do other things on completion? Otherwise, I'd like to suggest naming `collectMetericsOnCompletion` for now.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
##########
@@ -44,8 +44,19 @@ case class BatchScanExec(
 
   override lazy val readerFactory: PartitionReaderFactory = batch.createReaderFactory()
 
+  /**
+   * The callback function which is called when the output iterator of input RDD is consumed
+   * completely.
+   */
+  private def onOutputCompletion(reader: PartitionReader[_]) = {
+    reader.getCustomMetrics.foreach { metric =>
+      longMetric(metric.getName) += metric.getValue

Review comment:
       What if the metric doesn't exist in the SQLMetrics in case of users implemented mistakenly? Would it be better to throw an explicit error?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r613950376



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
##########
@@ -86,6 +92,9 @@ private class PartitionIterator[T](reader: PartitionReader[T]) extends Iterator[
     if (!hasNext) {
       throw QueryExecutionErrors.endOfStreamError()
     }
+    reader.currentMetricsValues.foreach { metric =>

Review comment:
       it's still per-row update?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820198345


   **[Test build #137404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137404/testReport)** for PR 31451 at commit [`b5be5ba`](https://github.com/apache/spark/commit/b5be5ba2abfc5022ff8c6b72e598c9361fbb1323).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r613955498



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
##########
@@ -256,7 +289,8 @@ class SQLAppStatusListener(
     }
 
     val aggregatedMetrics = allMetrics.map { case (id, values) =>
-      id -> SQLMetrics.stringValue(metricTypes(id), values, maxMetricsFromAllStages.getOrElse(id,
+      val aggMethod = metricAggregationMethods(id)
+      id -> aggMethod(values, maxMetricsFromAllStages.getOrElse(id,
         Array.empty[Long]))

Review comment:
       nit: this can be merged to the line above.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810557383


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41319/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808860917


   **[Test build #136609 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136609/testReport)** for PR 31451 at commit [`aedb965`](https://github.com/apache/spark/commit/aedb965a2b7cef2027bf7857fa4b7c10f4e8d571).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r570502082



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
##########
@@ -44,8 +44,19 @@ case class BatchScanExec(
 
   override lazy val readerFactory: PartitionReaderFactory = batch.createReaderFactory()
 
+  /**
+   * The callback function which is called when the output iterator of input RDD is consumed
+   * completely.
+   */
+  private def onOutputCompletion(reader: PartitionReader[_]) = {

Review comment:
       sounds good to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604318143



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala
##########
@@ -87,9 +91,21 @@ object SQLMetrics {
   private val TIMING_METRIC = "timing"
   private val NS_TIMING_METRIC = "nsTiming"
   private val AVERAGE_METRIC = "average"
+  private val V2_CUSTOM = "v2Custom"

Review comment:
       How about just `CUSTOM = "custom"`?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala
##########
@@ -33,7 +34,10 @@ import org.apache.spark.util.{AccumulatorContext, AccumulatorV2, Utils}
  * the executor side are automatically propagated and shown in the SQL UI through metrics. Updates
  * on the driver side must be explicitly posted using [[SQLMetrics.postDriverMetricUpdates()]].
  */
-class SQLMetric(val metricType: String, initValue: Long = 0L) extends AccumulatorV2[Long, Long] {
+class SQLMetric(
+    val metricType: String,
+    initValue: Long = 0L,
+    val aggregateMethod: (Array[Long], Array[Long]) => String) extends AccumulatorV2[Long, Long] {

Review comment:
       You could add a default for aggregateMethod by adding another constructor:
   ```
     def this(metricType: String, initValue: Long = 0L) =
       this(metricType, initValue, SQLMetrics.stringValue(metricType, _, _))
   ```
   I think you may need to remove the default for initValue in the primary constructor first.
   With this, you don't need to modify the existing `create*Metric` methods in `SQLMetrics`. You don't even need to define `defaultAggregateMethod`, since we can inline it.

##########
File path: sql/core/src/test/scala/org/apache/spark/status/api/v1/sql/SqlResourceSuite.scala
##########
@@ -41,31 +42,32 @@ object SqlResourceSuite {
 
   val nodeIdAndWSCGIdMap: Map[Long, Option[Long]] = Map(1L -> Some(1L))
 
+  val defaultAggregateMethod = SQLMetrics.defaultAggregateMethod("")
   val filterNode = new SparkPlanGraphNode(1, FILTER, "",
-    metrics = Seq(SQLPlanMetric(NUMBER_OF_OUTPUT_ROWS, 1, "")))
+    metrics = Seq(SQLPlanMetric(NUMBER_OF_OUTPUT_ROWS, 1, "", defaultAggregateMethod)))

Review comment:
       Changes in this class won't be needed if you add the default for the aggregateMethod.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusStore.scala
##########
@@ -146,4 +146,5 @@ class SparkPlanGraphNodeWrapper(
 case class SQLPlanMetric(
     name: String,
     accumulatorId: Long,
-    metricType: String)
+    metricType: String,
+    aggregateMethod: (Array[Long], Array[Long]) => String)

Review comment:
       You could add a default for aggregateMethod by creating a companion object for the case class:
   ```
   object SQLPlanMetric {
     def apply(name: String, accumulatorId: Long, metricType: String): SQLPlanMetric =
       this(name, accumulatorId, metricType, SQLMetrics.stringValue(metricType, _, _))
   }
   ```
   Then you don't need to modify existing tests that do `SQLPlanMetric(name, accumulatorId, metricType)`.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/ui/MetricsAggregationBenchmark.scala
##########
@@ -60,7 +60,8 @@ object MetricsAggregationBenchmark extends BenchmarkBase {
     val store = new SQLAppStatusStore(kvstore, Some(listener))
 
     val metrics = (0 until numMetrics).map { i =>
-      new SQLMetricInfo(s"metric$i", i.toLong, "average")
+      new SQLMetricInfo(s"metric$i", i.toLong, "average",
+        SQLMetrics.defaultAggregateMethod("average"))

Review comment:
       This change won't be needed if you add the default for the aggregateMethod.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala
##########
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,
+    val aggregateMethod: (Array[Long], Array[Long]) => String)

Review comment:
       You could add a default for aggregateMethod by adding another constructor:
   ```
     def this(name: String, accumulatorId: Long, metricType: String) =
       this(name, accumulatorId, metricType, SQLMetrics.stringValue(metricType, _, _))
   ```
   Then existing tests that do `new SQLMetricInfo(name, accumuldatorId, metricType)` do not need to be changed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r570501983



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
##########
@@ -44,8 +44,19 @@ case class BatchScanExec(
 
   override lazy val readerFactory: PartitionReaderFactory = batch.createReaderFactory()
 
+  /**
+   * The callback function which is called when the output iterator of input RDD is consumed
+   * completely.
+   */
+  private def onOutputCompletion(reader: PartitionReader[_]) = {
+    reader.getCustomMetrics.foreach { metric =>
+      longMetric(metric.getName) += metric.getValue

Review comment:
       Good idea. I will do it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-823028148


   **[Test build #137676 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137676/testReport)** for PR 31451 at commit [`06eb9c7`](https://github.com/apache/spark/commit/06eb9c79a3fdd807ec08540deb2234939396325a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r569874733



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala
##########
@@ -32,8 +32,13 @@ import org.apache.spark.util.Utils
 
 trait DataSourceV2ScanExecBase extends LeafExecNode {
 
-  override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+  override lazy val metrics = {

Review comment:
       I looked at a few V2 write nodes, but seems we don't have any SQL metrics there. I guess we don't provide metrics for writes generally now?
   
   If there is interest to see metrics in writes, I think it is okay to work on it later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r614241642



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
##########
@@ -627,7 +627,8 @@ case class AdaptiveSparkPlanExec(
       // of the new plan nodes, so that it can track the valid accumulator updates later
       // and display SQL metrics correctly.
       val newMetrics = newSubPlans.flatMap { p =>
-        p.flatMap(_.metrics.values.map(m => SQLPlanMetric(m.name.get, m.id, m.metricType)))
+        p.flatMap(_.metrics.values.map(m =>
+          SQLPlanMetric(m.name.get, m.id, m.metricType)))

Review comment:
       reverted.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820391096


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137404/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810632069


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136737/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604680021



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala
##########
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,

Review comment:
       If so, we cannot call the custom aggregate methods on custom metrics when replaying the events, right?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808865639


   **[Test build #136611 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136611/testReport)** for PR 31451 at commit [`918c90d`](https://github.com/apache/spark/commit/918c90da4c62419f57674cbee707628e0c65f8a6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-772350514


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134811/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-772259785


   **[Test build #134811 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134811/testReport)** for PR 31451 at commit [`4d941c7`](https://github.com/apache/spark/commit/4d941c7bb606ebf7d8a76b7b3e3d6fd6a3a00b95).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r569775072



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala
##########
@@ -32,8 +32,13 @@ import org.apache.spark.util.Utils
 
 trait DataSourceV2ScanExecBase extends LeafExecNode {
 
-  override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+  override lazy val metrics = {

Review comment:
       Any plans to similarly support metrics in writes?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r569874733



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala
##########
@@ -32,8 +32,13 @@ import org.apache.spark.util.Utils
 
 trait DataSourceV2ScanExecBase extends LeafExecNode {
 
-  override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+  override lazy val metrics = {

Review comment:
       I looked at a few V2 write nodes, but seems we don't have any SQL metrics there (even number of output rows). I guess we don't provide metrics for writes generally now?
   
   If there is interest to see metrics in writes, I think it is okay to work on it later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r569875512



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/CustomMetric.java
##########
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * A custom metric for {@link Scan}.
+ *
+ * @since 3.2.0
+ */
+@Evolving
+public interface CustomMetric {
+  /**
+   * Returns the name of custom metric.
+   */
+  String getName();

Review comment:
       Okay. Let me update it in next commit.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-772259785


   **[Test build #134811 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134811/testReport)** for PR 31451 at commit [`4d941c7`](https://github.com/apache/spark/commit/4d941c7bb606ebf7d8a76b7b3e3d6fd6a3a00b95).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-819972857


   **[Test build #750412274](https://github.com/viirya/spark-1/actions/runs/750412274)** for PR 31451 at commit [`a50bf40`](https://github.com/viirya/spark-1/commit/a50bf40194a98916c3fbeb1b864e6eb0fac7797e).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r615890361



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala
##########
@@ -811,6 +814,42 @@ class SQLAppStatusListenerSuite extends SharedSparkSession with JsonTestUtils
     listener.onOtherEvent(SparkListenerSQLExecutionEnd(
       executionId, System.currentTimeMillis()))
   }
+
+
+  test("SPARK-34338: Report metrics from Datasource v2 scan") {
+    val statusStore = spark.sharedState.statusStore
+    val oldCount = statusStore.executionsList().size
+
+    val schema = new StructType().add("i", "int").add("j", "int")
+    val physicalPlan = BatchScanExec(schema.toAttributes, new CustomMetricScanBuilder())

Review comment:
       If `CustomMetricScanBuilder` is only used here, shall we define it in this file?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808883966


   **[Test build #136611 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136611/testReport)** for PR 31451 at commit [`918c90d`](https://github.com/apache/spark/commit/918c90da4c62419f57674cbee707628e0c65f8a6).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604683543



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala
##########
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,

Review comment:
       Yes, but that's inevitable. Hopefully, most people will use the built-in `CustomMetric` implementations, so that the history server can support it as well. The worse case is people need to include the custom data source jar when starting the history server JVM.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820081492


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41968/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820615579


   **[Test build #137431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137431/testReport)** for PR 31451 at commit [`0f38782`](https://github.com/apache/spark/commit/0f38782bfeae0f55bb662780d66153270b5ac40d).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r570437357



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala
##########
@@ -32,8 +32,13 @@ import org.apache.spark.util.Utils
 
 trait DataSourceV2ScanExecBase extends LeafExecNode {
 
-  override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+  override lazy val metrics = {

Review comment:
       Looks like updating `context.taskMetrics().outputMetrics` is just in our branch. That just uses the Hadoop FS metrics collection that we use elsewhere, so it isn't metrics from the source as we want to support in this PR.
   
   I think it would be good to follow up and support metrics on the output side. It doesn't need to be done here, but metrics are really useful.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r570444688



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala
##########
@@ -32,8 +32,13 @@ import org.apache.spark.util.Utils
 
 trait DataSourceV2ScanExecBase extends LeafExecNode {
 
-  override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+  override lazy val metrics = {

Review comment:
       Sounds good to me. I can work on it  in follow up PRs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810755166


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136738/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808842689


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136602/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808886782


   **[Test build #136609 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136609/testReport)** for PR 31451 at commit [`aedb965`](https://github.com/apache/spark/commit/aedb965a2b7cef2027bf7857fa4b7c10f4e8d571).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808838118


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136601/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808816938


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41184/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r616010120



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
##########
@@ -86,6 +92,9 @@ private class PartitionIterator[T](reader: PartitionReader[T]) extends Iterator[
     if (!hasNext) {
       throw QueryExecutionErrors.endOfStreamError()
     }
+    reader.currentMetricsValues.foreach { metric =>
+      customMetrics(metric.name()).set(metric.value())

Review comment:
       Do we need to check whether `customMetrics` contains `metric.name()`?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.metric
+
+import org.apache.spark.sql.connector.CustomMetric
+
+object CustomMetrics {
+  private[spark] val V2_CUSTOM = "v2Custom"
+
+  /**
+   * Given a class name, builds and returns a metric type for a V2 custom metric class
+   * `CustomMetric`.
+   */
+  def buildV2CustomMetricTypeName(customMetric: CustomMetric): String = {
+    s"${V2_CUSTOM}_${customMetric.getClass.getCanonicalName}"
+  }
+
+  /**
+   * Given a V2 custom metric type name, this method parses it and returns the corresponding
+   * `CustomMetric` class name.
+   */
+  def parseV2CustomMetricType(metricType: String): Option[String] = {
+    val className = metricType.stripPrefix(s"${V2_CUSTOM}_")

Review comment:
       +1




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-823196846


   **[Test build #137676 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137676/testReport)** for PR 31451 at commit [`06eb9c7`](https://github.com/apache/spark/commit/06eb9c79a3fdd807ec08540deb2234939396325a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-822782815


   **[Test build #137648 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137648/testReport)** for PR 31451 at commit [`50ed317`](https://github.com/apache/spark/commit/50ed317c2a3fde4bdc1fd74a337d75b2fc85eb1c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya edited a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810554421


   > Thanks for the explanation. This sounds like a change from the API discussed in #31476. IIUC, before, the expectation was that `PartitionReader#currentMetricsValues()` is called after the partition is read. Now, the expectation is that `PartitionReader#currentMetricsValues()` is called for every row we iterate through in the reader. Such expectation should be documented clearly in the API, for implementors of custom metrics.
   
   I don't see we have documented in the API the exact time where `currentMetricsValues` will be called. This is implementation detail. If you worry about the implementation of `currentMetricsValues` will do something taking time. We can add a note to the API suggesting not to do heavy logic in it. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-820651259


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42008/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-810497805


   **[Test build #136737 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136737/testReport)** for PR 31451 at commit [`e8576ec`](https://github.com/apache/spark/commit/e8576ec04596ea962177a6a18818149e332a9b01).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-772350514


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134811/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808890389


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136609/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808842689


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136602/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r604339191



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala
##########
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,
+    val aggregateMethod: (Array[Long], Array[Long]) => String)

Review comment:
       The only benefit is to not change existing tests. It is pretty minor to me. Except for test code, these objects for the classes like `SQLMetricInfo` are created from `SQLMetric` and we always need to assign the aggregate method from a `SQLMetric`.  It doesn't make sense to have default aggregate method in these classes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-822905148


   **[Test build #137648 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137648/testReport)** for PR 31451 at commit [`50ed317`](https://github.com/apache/spark/commit/50ed317c2a3fde4bdc1fd74a337d75b2fc85eb1c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-808884077


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136611/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31451:
URL: https://github.com/apache/spark/pull/31451#issuecomment-822912498


   @gengliangwang I addressed above comments. Please take another look. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31451: [WIP][SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r603902380



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
##########
@@ -86,6 +92,9 @@ private class PartitionIterator[T](reader: PartitionReader[T]) extends Iterator[
     if (!hasNext) {
       throw QueryExecutionErrors.endOfStreamError()
     }
+    reader.currentMetricsValues.foreach { metric =>
+      customMetrics(metric.name()) += metric.value()

Review comment:
       shall we call `+=` or `=`? `reader` keeps alive and `reader.currentMetricsValues` will keep increasing IIUC.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #31451: [SPARK-34338][SQL] Report metrics from Datasource v2 scan

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #31451:
URL: https://github.com/apache/spark/pull/31451#discussion_r605139486



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
##########
@@ -51,7 +51,8 @@
   T get();
 
   /**
-   * Returns an array of custom task metrics. By default it returns empty array.
+   * Returns an array of custom task metrics. By default it returns empty array. Note that it is
+   * not recommended to put heavy logic in this method as it may affect reading performance.

Review comment:
       I think it is worth spelling out when `PartitionReader#currentMetricsValues()` is called on the DS v2, and how the values are used (in this case, they are used to set the long in the `SQLMetric`, not added to it).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org