You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/22 14:07:59 UTC

[GitHub] [spark] ankurdave opened a new pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

ankurdave opened a new pull request #34369:
URL: https://github.com/apache/spark/pull/34369


   ### What changes were proposed in this pull request?
   
   The previous PR https://github.com/apache/spark/pull/34245 assumed task completion listeners are registered bottom-up. `ParquetFileFormat#buildReaderWithPartitionValues()` violates this assumption by registering a task completion listener to close its output iterator lazily. Since task completion listeners are executed in reverse order of registration, this listener always runs before other listeners. When the downstream operator contains a Python UDF and the off-heap vectorized reader is enabled, this results in a use-after-free that causes a segfault.
   
   The fix is to close the output iterator using FileScanRDD's task completion listener.
   
   ### Why are the changes needed?
   
   Without this PR, the Python tests introduced in https://github.com/apache/spark/pull/34245 are flaky ([see details in thread](https://github.com/apache/spark/pull/34245#issuecomment-948713545)). They intermittently fail with a segfault.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Repeatedly ran one of the Python tests introduced in https://github.com/apache/spark/pull/34245 using the commands below. Previously, the test was flaky and failed after about 50 runs. With this PR, the test has not failed after 200+ runs.
   
   ```sh
   ./build/sbt -Phive clean package && ./build/sbt test:compile
   seq 1000 | parallel -j 8 --halt now,fail=1 'echo {#}; python/run-tests --testnames pyspark.sql.tests.test_udf'
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949667648


   **[Test build #144538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144538/testReport)** for PR 34369 at commit [`29f5e77`](https://github.com/apache/spark/commit/29f5e77f7964ee4380b2eee288f54d431dcfee74).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734822511



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -85,6 +85,17 @@ class FileScanRDD(
       private[this] var currentFile: PartitionedFile = null
       private[this] var currentIterator: Iterator[Object] = null
 
+      private def resetCurrentIterator(): Unit = {
+        currentIterator match {
+          case iter: NextIterator[_] =>
+            iter.closeIfNeeded()
+          case iter: Closeable =>
+            iter.close()
+          case _ => // do nothing

Review comment:
       There are currently two cases aside from null:
   - OrcFileFormat produces an ordinary non-Closeable Iterator due to unwrapOrcStructs.
   - The user can create a FileScanRDD with an arbitrary readFunction that does not return a Closeable Iterator.
   
   It would be ideal if we could disallow these cases and require the iterator to be Closeable, but it seems that would require changing public APIs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949951007


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144544/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734822511



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -85,6 +85,17 @@ class FileScanRDD(
       private[this] var currentFile: PartitionedFile = null
       private[this] var currentIterator: Iterator[Object] = null
 
+      private def resetCurrentIterator(): Unit = {
+        currentIterator match {
+          case iter: NextIterator[_] =>
+            iter.closeIfNeeded()
+          case iter: Closeable =>
+            iter.close()
+          case _ => // do nothing

Review comment:
       There are currently two cases aside from null:
   - OrcFileFormat produces an ordinary non-Closeable Iterator due to unwrapOrcStructs().
   - The user can create a FileScanRDD with an arbitrary readFunction that does not return a Closeable Iterator.
   
   It would be ideal if we could disallow these cases and require the iterator to be Closeable, but it seems that would require changing public APIs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734831796



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -128,15 +139,21 @@ class FileScanRDD(
           // Sets InputFileBlockHolder for the file block's information
           InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
 
+          resetCurrentIterator()
           if (ignoreMissingFiles || ignoreCorruptFiles) {
             currentIterator = new NextIterator[Object] {
               // The readFunction may read some bytes before consuming the iterator, e.g.,
-              // vectorized Parquet reader. Here we use lazy val to delay the creation of
-              // iterator so that we will throw exception in `getNext`.
-              private lazy val internalIter = readCurrentFile()
+              // vectorized Parquet reader. Here we use a lazily initialized variable to delay the
+              // creation of iterator so that we will throw exception in `getNext`.
+              private var internalIter: Iterator[InternalRow] = null

Review comment:
       hm why is this change necessary?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949708540


   **[Test build #144538 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144538/testReport)** for PR 34369 at commit [`29f5e77`](https://github.com/apache/spark/commit/29f5e77f7964ee4380b2eee288f54d431dcfee74).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734840380



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -128,15 +139,21 @@ class FileScanRDD(
           // Sets InputFileBlockHolder for the file block's information
           InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
 
+          resetCurrentIterator()
           if (ignoreMissingFiles || ignoreCorruptFiles) {
             currentIterator = new NextIterator[Object] {
               // The readFunction may read some bytes before consuming the iterator, e.g.,
-              // vectorized Parquet reader. Here we use lazy val to delay the creation of
-              // iterator so that we will throw exception in `getNext`.
-              private lazy val internalIter = readCurrentFile()
+              // vectorized Parquet reader. Here we use a lazily initialized variable to delay the
+              // creation of iterator so that we will throw exception in `getNext`.
+              private var internalIter: Iterator[InternalRow] = null

Review comment:
       If the downstream operator never pulls any rows from the iterator, then the first time we access internalIter will be when close() is called. If internalIter is a lazy val, this will trigger a call to readCurrentFile(), which is unnecessary and may throw. Changing internalIter from a lazy val to a var lets us avoid this unnecessary call.
   
   Several unit tests fail without this change, including AvroV1Suite.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949949305


   **[Test build #144544 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144544/testReport)** for PR 34369 at commit [`559224d`](https://github.com/apache/spark/commit/559224d9148fb3fc9a743908a68e2e4f8121d04a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949855217


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144543/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949846632


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49015/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949760297


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49011/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949737747


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49009/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734840380



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -128,15 +139,21 @@ class FileScanRDD(
           // Sets InputFileBlockHolder for the file block's information
           InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
 
+          resetCurrentIterator()
           if (ignoreMissingFiles || ignoreCorruptFiles) {
             currentIterator = new NextIterator[Object] {
               // The readFunction may read some bytes before consuming the iterator, e.g.,
-              // vectorized Parquet reader. Here we use lazy val to delay the creation of
-              // iterator so that we will throw exception in `getNext`.
-              private lazy val internalIter = readCurrentFile()
+              // vectorized Parquet reader. Here we use a lazily initialized variable to delay the
+              // creation of iterator so that we will throw exception in `getNext`.
+              private var internalIter: Iterator[InternalRow] = null

Review comment:
       If the downstream operator never pulls any rows from the iterator, then the first time we access `internalIter` will be when `close()` is called. If `internalIter` is a `lazy val`, this will trigger a call to `readCurrentFile()`, which is unnecessary and may throw. Changing `internalIter` from a `lazy val` to a `var` lets us avoid this unnecessary call.
   
   Several unit tests fail without this change, including `AvroV1Suite`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734587904



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -327,18 +327,31 @@ class ParquetFileFormat
           int96RebaseMode.toString,
           enableOffHeapColumnVector && taskContext.isDefined,
           capacity)
+        // SPARK-37089: We cannot register a task completion listener to close this iterator here
+        // because downstream exec nodes have already registered their listeners. Since listeners
+        // are executed in reverse order of registration, a listener registered here would close the
+        // iterator while downstream exec nodes are still running. When off-heap column vectors are
+        // enabled, this can cause a use-after-free bug leading to a segfault.
+        //
+        // Instead, we use FileScanRDD's task completion listener to close this iterator.
         val iter = new RecordReaderIterator(vectorizedReader)
-        // SPARK-23457 Register a task completion listener before `initialization`.
-        taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close()))
-        vectorizedReader.initialize(split, hadoopAttemptContext)
-        logDebug(s"Appending $partitionSchema ${file.partitionValues}")
-        vectorizedReader.initBatch(partitionSchema, file.partitionValues)
-        if (returningBatch) {
-          vectorizedReader.enableReturningBatches()
-        }
+        try {
+          vectorizedReader.initialize(split, hadoopAttemptContext)
+          logDebug(s"Appending $partitionSchema ${file.partitionValues}")
+          vectorizedReader.initBatch(partitionSchema, file.partitionValues)
+          if (returningBatch) {
+            vectorizedReader.enableReturningBatches()
+          }
 
-        // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
-        iter.asInstanceOf[Iterator[InternalRow]]
+          // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
+          iter.asInstanceOf[Iterator[InternalRow]]
+        } catch {
+          case e: Throwable =>
+            // SPARK-23457: In case there is an exception in initialization, close the iterator to

Review comment:
       shall we let the caller `FileScanRDD` close the iterator when hitting errors?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734608343



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -327,18 +327,31 @@ class ParquetFileFormat
           int96RebaseMode.toString,
           enableOffHeapColumnVector && taskContext.isDefined,
           capacity)
+        // SPARK-37089: We cannot register a task completion listener to close this iterator here
+        // because downstream exec nodes have already registered their listeners. Since listeners
+        // are executed in reverse order of registration, a listener registered here would close the
+        // iterator while downstream exec nodes are still running. When off-heap column vectors are
+        // enabled, this can cause a use-after-free bug leading to a segfault.
+        //
+        // Instead, we use FileScanRDD's task completion listener to close this iterator.
         val iter = new RecordReaderIterator(vectorizedReader)
-        // SPARK-23457 Register a task completion listener before `initialization`.
-        taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close()))
-        vectorizedReader.initialize(split, hadoopAttemptContext)
-        logDebug(s"Appending $partitionSchema ${file.partitionValues}")
-        vectorizedReader.initBatch(partitionSchema, file.partitionValues)
-        if (returningBatch) {
-          vectorizedReader.enableReturningBatches()
-        }
+        try {
+          vectorizedReader.initialize(split, hadoopAttemptContext)
+          logDebug(s"Appending $partitionSchema ${file.partitionValues}")
+          vectorizedReader.initBatch(partitionSchema, file.partitionValues)
+          if (returningBatch) {
+            vectorizedReader.enableReturningBatches()
+          }
 
-        // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
-        iter.asInstanceOf[Iterator[InternalRow]]
+          // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
+          iter.asInstanceOf[Iterator[InternalRow]]
+        } catch {
+          case e: Throwable =>
+            // SPARK-23457: In case there is an exception in initialization, close the iterator to

Review comment:
       In general, I think FileScanRDD does close the iterator when hitting exceptions, because it uses a task completion listener to do so. The only case where it will not close the iterator is when the exception prevents FileScanRDD from getting a reference to the iterator, as is the case here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734608343



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -327,18 +327,31 @@ class ParquetFileFormat
           int96RebaseMode.toString,
           enableOffHeapColumnVector && taskContext.isDefined,
           capacity)
+        // SPARK-37089: We cannot register a task completion listener to close this iterator here
+        // because downstream exec nodes have already registered their listeners. Since listeners
+        // are executed in reverse order of registration, a listener registered here would close the
+        // iterator while downstream exec nodes are still running. When off-heap column vectors are
+        // enabled, this can cause a use-after-free bug leading to a segfault.
+        //
+        // Instead, we use FileScanRDD's task completion listener to close this iterator.
         val iter = new RecordReaderIterator(vectorizedReader)
-        // SPARK-23457 Register a task completion listener before `initialization`.
-        taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close()))
-        vectorizedReader.initialize(split, hadoopAttemptContext)
-        logDebug(s"Appending $partitionSchema ${file.partitionValues}")
-        vectorizedReader.initBatch(partitionSchema, file.partitionValues)
-        if (returningBatch) {
-          vectorizedReader.enableReturningBatches()
-        }
+        try {
+          vectorizedReader.initialize(split, hadoopAttemptContext)
+          logDebug(s"Appending $partitionSchema ${file.partitionValues}")
+          vectorizedReader.initBatch(partitionSchema, file.partitionValues)
+          if (returningBatch) {
+            vectorizedReader.enableReturningBatches()
+          }
 
-        // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
-        iter.asInstanceOf[Iterator[InternalRow]]
+          // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
+          iter.asInstanceOf[Iterator[InternalRow]]
+        } catch {
+          case e: Throwable =>
+            // SPARK-23457: In case there is an exception in initialization, close the iterator to

Review comment:
       In general, I think FileScanRDD does close the iterator when hitting exceptions, because it [uses a task completion listener](https://github.com/apache/spark/pull/34369/files#diff-b7b097f8cec6a7ae6640f9ecd6d4ac14ed304ad7c8db802a3ec2c0983535e157R208) to do so. The only case where it will not close the iterator is when the exception prevents FileScanRDD from getting a reference to the iterator, as is the case here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734808494



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -85,6 +85,17 @@ class FileScanRDD(
       private[this] var currentFile: PartitionedFile = null
       private[this] var currentIterator: Iterator[Object] = null
 
+      private def resetCurrentIterator(): Unit = {
+        currentIterator match {
+          case iter: NextIterator[_] =>
+            iter.closeIfNeeded()
+          case iter: Closeable =>
+            iter.close()
+          case _ => // do nothing

Review comment:
       When does this happen?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734842173



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -128,15 +139,21 @@ class FileScanRDD(
           // Sets InputFileBlockHolder for the file block's information
           InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
 
+          resetCurrentIterator()
           if (ignoreMissingFiles || ignoreCorruptFiles) {
             currentIterator = new NextIterator[Object] {
               // The readFunction may read some bytes before consuming the iterator, e.g.,
-              // vectorized Parquet reader. Here we use lazy val to delay the creation of
-              // iterator so that we will throw exception in `getNext`.
-              private lazy val internalIter = readCurrentFile()
+              // vectorized Parquet reader. Here we use a lazily initialized variable to delay the
+              // creation of iterator so that we will throw exception in `getNext`.
+              private var internalIter: Iterator[InternalRow] = null

Review comment:
       Got it. Thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949708869


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144538/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949766425


   **[Test build #144544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144544/testReport)** for PR 34369 at commit [`559224d`](https://github.com/apache/spark/commit/559224d9148fb3fc9a743908a68e2e4f8121d04a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949760350


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49011/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ankurdave commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
ankurdave commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949667146


   cc @cloud-fan @HyukjinKwon @dongjoon-hyun @mswit-databricks 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949846666


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49015/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734840380



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -128,15 +139,21 @@ class FileScanRDD(
           // Sets InputFileBlockHolder for the file block's information
           InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
 
+          resetCurrentIterator()
           if (ignoreMissingFiles || ignoreCorruptFiles) {
             currentIterator = new NextIterator[Object] {
               // The readFunction may read some bytes before consuming the iterator, e.g.,
-              // vectorized Parquet reader. Here we use lazy val to delay the creation of
-              // iterator so that we will throw exception in `getNext`.
-              private lazy val internalIter = readCurrentFile()
+              // vectorized Parquet reader. Here we use a lazily initialized variable to delay the
+              // creation of iterator so that we will throw exception in `getNext`.
+              private var internalIter: Iterator[InternalRow] = null

Review comment:
       If the downstream operator never pulls any rows from the iterator, then the first time we access `internalIter` will be when `close()` is called. If `internalIter` is a `lazy val`, this will trigger a call to `readCurrentFile()`, which is unnecessary and may throw. Changing `internalIter` from a `lazy val` to a `var` lets us avoid this unnecessary call.
   
   Several tests fail without this change, including `AvroV1Suite`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949854562


   **[Test build #144543 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144543/testReport)** for PR 34369 at commit [`c09cf5e`](https://github.com/apache/spark/commit/c09cf5e0beeeafd14e0df0d9d32c283c47262a0f).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949760350


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49011/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949756172


   **[Test build #144543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144543/testReport)** for PR 34369 at commit [`c09cf5e`](https://github.com/apache/spark/commit/c09cf5e0beeeafd14e0df0d9d32c283c47262a0f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #34369:
URL: https://github.com/apache/spark/pull/34369


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949916003


   cc @sunchao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949771646


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49009/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949846666


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49015/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949756172


   **[Test build #144543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144543/testReport)** for PR 34369 at commit [`c09cf5e`](https://github.com/apache/spark/commit/c09cf5e0beeeafd14e0df0d9d32c283c47262a0f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949855217


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144543/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949951007


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144544/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949805594


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49015/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949800896


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49009/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949800896


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49009/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734808494



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -85,6 +85,17 @@ class FileScanRDD(
       private[this] var currentFile: PartitionedFile = null
       private[this] var currentIterator: Iterator[Object] = null
 
+      private def resetCurrentIterator(): Unit = {
+        currentIterator match {
+          case iter: NextIterator[_] =>
+            iter.closeIfNeeded()
+          case iter: Closeable =>
+            iter.close()
+          case _ => // do nothing

Review comment:
       When does this happen? Only when `currentIterator` is `null`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949914403


   Thank you for pinging me, @ankurdave .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949708869


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144538/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734609634



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -327,18 +327,31 @@ class ParquetFileFormat
           int96RebaseMode.toString,
           enableOffHeapColumnVector && taskContext.isDefined,
           capacity)
+        // SPARK-37089: We cannot register a task completion listener to close this iterator here
+        // because downstream exec nodes have already registered their listeners. Since listeners
+        // are executed in reverse order of registration, a listener registered here would close the
+        // iterator while downstream exec nodes are still running. When off-heap column vectors are
+        // enabled, this can cause a use-after-free bug leading to a segfault.
+        //
+        // Instead, we use FileScanRDD's task completion listener to close this iterator.
         val iter = new RecordReaderIterator(vectorizedReader)
-        // SPARK-23457 Register a task completion listener before `initialization`.
-        taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close()))
-        vectorizedReader.initialize(split, hadoopAttemptContext)
-        logDebug(s"Appending $partitionSchema ${file.partitionValues}")
-        vectorizedReader.initBatch(partitionSchema, file.partitionValues)
-        if (returningBatch) {
-          vectorizedReader.enableReturningBatches()
-        }
+        try {
+          vectorizedReader.initialize(split, hadoopAttemptContext)
+          logDebug(s"Appending $partitionSchema ${file.partitionValues}")
+          vectorizedReader.initBatch(partitionSchema, file.partitionValues)
+          if (returningBatch) {
+            vectorizedReader.enableReturningBatches()
+          }
 
-        // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
-        iter.asInstanceOf[Iterator[InternalRow]]
+          // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
+          iter.asInstanceOf[Iterator[InternalRow]]
+        } catch {
+          case e: Throwable =>
+            // SPARK-23457: In case there is an exception in initialization, close the iterator to

Review comment:
       ah I see!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949667648


   **[Test build #144538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144538/testReport)** for PR 34369 at commit [`29f5e77`](https://github.com/apache/spark/commit/29f5e77f7964ee4380b2eee288f54d431dcfee74).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949766425


   **[Test build #144544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144544/testReport)** for PR 34369 at commit [`559224d`](https://github.com/apache/spark/commit/559224d9148fb3fc9a743908a68e2e4f8121d04a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-950437867


   Merged to master and branch-3.2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-950437867


   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org