You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/11/03 13:34:10 UTC

[GitHub] [spark] viirya commented on a change in pull request #30221: [SPARK-33314][SQL] Avoid dropping rows in Avro reader

viirya commented on a change in pull request #30221:
URL: https://github.com/apache/spark/pull/30221#discussion_r515794920



##########
File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala
##########
@@ -171,9 +171,15 @@ private[sql] object AvroUtils extends Logging {
     protected val stopPosition: Long
 
     private[this] var completed = false
+    private[this] var interveningNext = true
+    private[this] var prevHasNextRow = false
     private[this] var currentRow: Option[InternalRow] = None
 
     def hasNextRow: Boolean = {
+      if (!interveningNext) {
+        // until a row is consumed, return previous result of hasNextRow
+        return prevHasNextRow
+      }

Review comment:
       Can't we just reset `currentRow` in  `nextRow` and check `currentRow.isDefined` here? 

##########
File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala
##########
@@ -171,9 +171,15 @@ private[sql] object AvroUtils extends Logging {
     protected val stopPosition: Long
 
     private[this] var completed = false
+    private[this] var interveningNext = true
+    private[this] var prevHasNextRow = false
     private[this] var currentRow: Option[InternalRow] = None
 
     def hasNextRow: Boolean = {
+      if (!interveningNext) {
+        // until a row is consumed, return previous result of hasNextRow
+        return prevHasNextRow
+      }

Review comment:
       From the API semantics perspective, shouldn't `nextRow` return the next row? It looks okay if `hasNextRow` has been called multiple times before `nextRow` is called. But it sounds weird that `nextRow` will be called with the same row. As a fix this looks fine, but the API, if it is called like that way, sounds a weird design, in particular it is documented as iterator-like interface.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org