Posted to commits@spark.apache.org by we...@apache.org on 2022/09/19 12:53:35 UTC
[spark] branch master updated: [SPARK-40456][SQL] PartitionIterator.hasNext should be cheap to call repeatedly

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 3b9e5cf662c [SPARK-40456][SQL] PartitionIterator.hasNext should be cheap to call repeatedly
3b9e5cf662c is described below

commit 3b9e5cf662cd90cb3d64bd3abc57e0be26367631
Author: Wenchen Fan <we...@databricks.com>
AuthorDate: Mon Sep 19 20:52:49 2022 +0800

    [SPARK-40456][SQL] PartitionIterator.hasNext should be cheap to call repeatedly
    
    ### What changes were proposed in this pull request?
    
    This PR caches the result of `PartitionReader.next` in `PartitionIterator`, so that its `hasNext` method is cheap to call repeatedly.
    
    ### Why are the changes needed?
    
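    Potential performance improvement. `PartitionReader.next` can be expensive in some v2 sources, and it is legal to call `Iterator.hasNext` repeatedly. Before this change, every `hasNext` call made after the reader was exhausted invoked `reader.next()` again.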
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #37900 from cloud-fan/minor.
    
    Authored-by: Wenchen Fan <we...@databricks.com>
    Signed-off-by: Wenchen Fan <we...@databricks.com>
---
 .../apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala   | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
index 09c8756ca01..67e77a97865 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
@@ -111,12 +111,14 @@ private class PartitionIterator[T](
     reader: PartitionReader[T],
     customMetrics: Map[String, SQLMetric]) extends Iterator[T] {
   private[this] var valuePrepared = false
+  private[this] var hasMoreInput = true
 
   private var numRow = 0L
 
   override def hasNext: Boolean = {
-    if (!valuePrepared) {
-      valuePrepared = reader.next()
+    if (!valuePrepared && hasMoreInput) {
+      hasMoreInput = reader.next()
+      valuePrepared = hasMoreInput
     }
     valuePrepared
   }
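
The effect of the patched `hasNext` can be illustrated with a small standalone sketch. It does not use Spark: `CountingReader` is a hypothetical stand-in for a v2 `PartitionReader` that counts calls to `next()`, and `CachingIterator` copies the `hasNext` logic from the diff above. The `next()` method of the iterator is not part of the diff, so the version here is an assumption about how a prepared value is consumed.

```scala
// Hypothetical stand-in for a v2 PartitionReader; it counts next() calls
// so the cost of repeated hasNext calls becomes visible.
final class CountingReader[T](rows: Seq[T]) {
  private var idx = -1
  var nextCalls = 0

  // Advances to the next row; returns false once the input is exhausted.
  def next(): Boolean = {
    nextCalls += 1
    idx += 1
    idx < rows.length
  }

  def get(): T = rows(idx)
}

// Same hasNext logic as the patched PartitionIterator in the diff.
final class CachingIterator[T](reader: CountingReader[T]) extends Iterator[T] {
  private var valuePrepared = false
  private var hasMoreInput = true

  override def hasNext: Boolean = {
    // Ask the reader only if no value is pending and it has not yet
    // reported end of input.
    if (!valuePrepared && hasMoreInput) {
      hasMoreInput = reader.next()
      valuePrepared = hasMoreInput
    }
    valuePrepared
  }

  // Assumed shape of next(): consume the prepared value, then reset the flag.
  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("end of stream")
    valuePrepared = false
    reader.get()
  }
}

val reader = new CountingReader(Seq("a", "b"))
val it = new CachingIterator(reader)
val rows = it.toList                        // drains the iterator
val callsAfterDrain = reader.nextCalls      // two rows plus one end-of-input call
val repeated = Seq.fill(5)(it.hasNext)      // extra hasNext calls after exhaustion
```

With the pre-patch logic, each of the five extra `hasNext` calls would reach `reader.next()` again; with the `hasMoreInput` flag, the call count stays where it was once the reader first reported end of input.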

