You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/08 05:28:52 UTC

[GitHub] [spark] bozhang2820 opened a new pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

bozhang2820 opened a new pull request #35432:
URL: https://github.com/apache/spark/pull/35432


   ### What changes were proposed in this pull request?
   This PR changes to update `InputMetrics` in `DataSourceRDD` with `TaskCompletionListener`, instead of relying on `MetricsIterator.hasNext`.
   
   ### Why are the changes needed?
   This is to fix the bug that `InputMetrics.bytesRead` is not updated when there is still data in the datasource but the output is limited. 
   
   ### Does this PR introduce _any_ user-facing change?
   Users will see more accurate `InputMetrics.bytesRead` with this change.
   
   ### How was this patch tested?
   Added unit test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35432:
URL: https://github.com/apache/spark/pull/35432#discussion_r801680822



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
##########
@@ -244,4 +244,33 @@ class DataSourceV2ScanExecRedactionSuite extends DataSourceScanRedactionTest {
       }
     }
   }
+
+  test("SPARK-37585: test input metrics for DSV2 with output limits") {

Review comment:
       I should have pointed it out earlier... It's weird to put metrics test in this suite, how about `FileBasedDataSourceSuite`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35432:
URL: https://github.com/apache/spark/pull/35432#issuecomment-1033802293


   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35432:
URL: https://github.com/apache/spark/pull/35432#discussion_r802297297



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
##########
@@ -244,4 +244,33 @@ class DataSourceV2ScanExecRedactionSuite extends DataSourceScanRedactionTest {
       }
     }
   }
+
+  test("SPARK-37585: test input metrics for DSV2 with output limits") {

Review comment:
       yea let's move them together




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35432:
URL: https://github.com/apache/spark/pull/35432#issuecomment-1032660821


   also cc @viirya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35432:
URL: https://github.com/apache/spark/pull/35432#discussion_r802389005



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
##########
@@ -244,4 +244,33 @@ class DataSourceV2ScanExecRedactionSuite extends DataSourceScanRedactionTest {
       }
     }
   }
+
+  test("SPARK-37585: test input metrics for DSV2 with output limits") {

Review comment:
       oh i see. yea, okay to move that together.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35432:
URL: https://github.com/apache/spark/pull/35432#discussion_r802254113



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
##########
@@ -244,4 +244,33 @@ class DataSourceV2ScanExecRedactionSuite extends DataSourceScanRedactionTest {
       }
     }
   }
+
+  test("SPARK-37585: test input metrics for DSV2 with output limits") {

Review comment:
       don't you have only one new test? any other related tests?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35432:
URL: https://github.com/apache/spark/pull/35432#discussion_r802255256



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
##########
@@ -70,6 +70,7 @@ class DataSourceRDD(
       // In case of early stopping before consuming the entire iterator,
       // we need to do one more metric update at the end of the task.
       CustomMetrics.updateMetrics(reader.currentMetricsValues, customMetrics)
+      iter.forceUpdateMetrics()

Review comment:
       looks okay.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bozhang2820 commented on a change in pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
bozhang2820 commented on a change in pull request #35432:
URL: https://github.com/apache/spark/pull/35432#discussion_r802237425



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
##########
@@ -244,4 +244,33 @@ class DataSourceV2ScanExecRedactionSuite extends DataSourceScanRedactionTest {
       }
     }
   }
+
+  test("SPARK-37585: test input metrics for DSV2 with output limits") {

Review comment:
       Sure, but should I move the other related tests together in this PR as well?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
##########
@@ -126,14 +127,9 @@ private class MetricsHandler extends Logging with Serializable {
 private abstract class MetricsIterator[I](iter: Iterator[I]) extends Iterator[I] {
   protected val metricsHandler = new MetricsHandler
 
-  override def hasNext: Boolean = {
-    if (iter.hasNext) {
-      true
-    } else {
-      metricsHandler.updateMetrics(0, force = true)
-      false
-    }
-  }
+  override def hasNext: Boolean = iter.hasNext

Review comment:
       Sure.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #35432:
URL: https://github.com/apache/spark/pull/35432


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bozhang2820 commented on pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
bozhang2820 commented on pull request #35432:
URL: https://github.com/apache/spark/pull/35432#issuecomment-1032373673


   CC @cloud-fan for a review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bozhang2820 commented on a change in pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
bozhang2820 commented on a change in pull request #35432:
URL: https://github.com/apache/spark/pull/35432#discussion_r802261967



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
##########
@@ -244,4 +244,33 @@ class DataSourceV2ScanExecRedactionSuite extends DataSourceScanRedactionTest {
       }
     }
   }
+
+  test("SPARK-37585: test input metrics for DSV2 with output limits") {

Review comment:
       I meant the other test related to InputMetrics here. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #35432: [SPARK-37585][SQL] Update InputMetric in DataSourceRDD with TaskCompletionListener

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35432:
URL: https://github.com/apache/spark/pull/35432#discussion_r801678659



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
##########
@@ -126,14 +127,9 @@ private class MetricsHandler extends Logging with Serializable {
 private abstract class MetricsIterator[I](iter: Iterator[I]) extends Iterator[I] {
   protected val metricsHandler = new MetricsHandler
 
-  override def hasNext: Boolean = {
-    if (iter.hasNext) {
-      true
-    } else {
-      metricsHandler.updateMetrics(0, force = true)
-      false
-    }
-  }
+  override def hasNext: Boolean = iter.hasNext

Review comment:
       Can we keep updating metrics at the end of the iterator? It's still useful in cases like `RDD.coalesce`, where a task consumes multiple iterators for coalesced partitions, and we can update metrics earlier here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org