You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/22 14:07:59 UTC
[GitHub] [spark] ankurdave opened a new pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
ankurdave opened a new pull request #34369:
URL: https://github.com/apache/spark/pull/34369
### What changes were proposed in this pull request?
The previous PR https://github.com/apache/spark/pull/34245 assumed task completion listeners are registered bottom-up. `ParquetFileFormat#buildReaderWithPartitionValues()` violates this assumption by registering a task completion listener to close its output iterator lazily. Since task completion listeners are executed in reverse order of registration, this listener always runs before other listeners. When the downstream operator contains a Python UDF and the off-heap vectorized reader is enabled, this results in a use-after-free that causes a segfault.
The fix is to close the output iterator using FileScanRDD's task completion listener.
### Why are the changes needed?
Without this PR, the Python tests introduced in https://github.com/apache/spark/pull/34245 are flaky ([see details in thread](https://github.com/apache/spark/pull/34245#issuecomment-948713545)). They intermittently fail with a segfault.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Repeatedly ran one of the Python tests introduced in https://github.com/apache/spark/pull/34245 using the commands below. Previously, the test was flaky and failed after about 50 runs. With this PR, the test has not failed after 200+ runs.
```sh
./build/sbt -Phive clean package && ./build/sbt test:compile
seq 1000 | parallel -j 8 --halt now,fail=1 'echo {#}; python/run-tests --testnames pyspark.sql.tests.test_udf'
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949667648
**[Test build #144538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144538/testReport)** for PR 34369 at commit [`29f5e77`](https://github.com/apache/spark/commit/29f5e77f7964ee4380b2eee288f54d431dcfee74).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734822511
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -85,6 +85,17 @@ class FileScanRDD(
private[this] var currentFile: PartitionedFile = null
private[this] var currentIterator: Iterator[Object] = null
+ private def resetCurrentIterator(): Unit = {
+ currentIterator match {
+ case iter: NextIterator[_] =>
+ iter.closeIfNeeded()
+ case iter: Closeable =>
+ iter.close()
+ case _ => // do nothing
Review comment:
There are currently two cases aside from null:
- OrcFileFormat produces an ordinary non-Closeable Iterator due to unwrapOrcStructs.
- The user can create a FileScanRDD with an arbitrary readFunction that does not return a Closeable Iterator.
It would be ideal if we could disallow these cases and require the iterator to be Closeable, but it seems that would require changing public APIs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949951007
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144544/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734822511
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -85,6 +85,17 @@ class FileScanRDD(
private[this] var currentFile: PartitionedFile = null
private[this] var currentIterator: Iterator[Object] = null
+ private def resetCurrentIterator(): Unit = {
+ currentIterator match {
+ case iter: NextIterator[_] =>
+ iter.closeIfNeeded()
+ case iter: Closeable =>
+ iter.close()
+ case _ => // do nothing
Review comment:
There are currently two cases aside from null:
- OrcFileFormat produces an ordinary non-Closeable Iterator due to unwrapOrcStructs().
- The user can create a FileScanRDD with an arbitrary readFunction that does not return a Closeable Iterator.
It would be ideal if we could disallow these cases and require the iterator to be Closeable, but it seems that would require changing public APIs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] sunchao commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734831796
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -128,15 +139,21 @@ class FileScanRDD(
// Sets InputFileBlockHolder for the file block's information
InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
+ resetCurrentIterator()
if (ignoreMissingFiles || ignoreCorruptFiles) {
currentIterator = new NextIterator[Object] {
// The readFunction may read some bytes before consuming the iterator, e.g.,
- // vectorized Parquet reader. Here we use lazy val to delay the creation of
- // iterator so that we will throw exception in `getNext`.
- private lazy val internalIter = readCurrentFile()
+ // vectorized Parquet reader. Here we use a lazily initialized variable to delay the
+ // creation of iterator so that we will throw exception in `getNext`.
+ private var internalIter: Iterator[InternalRow] = null
Review comment:
hm why is this change necessary?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949708540
**[Test build #144538 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144538/testReport)** for PR 34369 at commit [`29f5e77`](https://github.com/apache/spark/commit/29f5e77f7964ee4380b2eee288f54d431dcfee74).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734840380
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -128,15 +139,21 @@ class FileScanRDD(
// Sets InputFileBlockHolder for the file block's information
InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
+ resetCurrentIterator()
if (ignoreMissingFiles || ignoreCorruptFiles) {
currentIterator = new NextIterator[Object] {
// The readFunction may read some bytes before consuming the iterator, e.g.,
- // vectorized Parquet reader. Here we use lazy val to delay the creation of
- // iterator so that we will throw exception in `getNext`.
- private lazy val internalIter = readCurrentFile()
+ // vectorized Parquet reader. Here we use a lazily initialized variable to delay the
+ // creation of iterator so that we will throw exception in `getNext`.
+ private var internalIter: Iterator[InternalRow] = null
Review comment:
If the downstream operator never pulls any rows from the iterator, then the first time we access internalIter will be when close() is called. If internalIter is a lazy val, this will trigger a call to readCurrentFile(), which is unnecessary and may throw. Changing internalIter from a lazy val to a var lets us avoid this unnecessary call.
Several unit tests fail without this change, including AvroV1Suite.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949949305
**[Test build #144544 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144544/testReport)** for PR 34369 at commit [`559224d`](https://github.com/apache/spark/commit/559224d9148fb3fc9a743908a68e2e4f8121d04a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949855217
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144543/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949846632
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49015/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949760297
Kubernetes integration test unable to build dist.
exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49011/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949737747
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49009/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734840380
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -128,15 +139,21 @@ class FileScanRDD(
// Sets InputFileBlockHolder for the file block's information
InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
+ resetCurrentIterator()
if (ignoreMissingFiles || ignoreCorruptFiles) {
currentIterator = new NextIterator[Object] {
// The readFunction may read some bytes before consuming the iterator, e.g.,
- // vectorized Parquet reader. Here we use lazy val to delay the creation of
- // iterator so that we will throw exception in `getNext`.
- private lazy val internalIter = readCurrentFile()
+ // vectorized Parquet reader. Here we use a lazily initialized variable to delay the
+ // creation of iterator so that we will throw exception in `getNext`.
+ private var internalIter: Iterator[InternalRow] = null
Review comment:
If the downstream operator never pulls any rows from the iterator, then the first time we access `internalIter` will be when `close()` is called. If `internalIter` is a `lazy val`, this will trigger a call to `readCurrentFile()`, which is unnecessary and may throw. Changing `internalIter` from a `lazy val` to a `var` lets us avoid this unnecessary call.
Several unit tests fail without this change, including `AvroV1Suite`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734587904
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -327,18 +327,31 @@ class ParquetFileFormat
int96RebaseMode.toString,
enableOffHeapColumnVector && taskContext.isDefined,
capacity)
+ // SPARK-37089: We cannot register a task completion listener to close this iterator here
+ // because downstream exec nodes have already registered their listeners. Since listeners
+ // are executed in reverse order of registration, a listener registered here would close the
+ // iterator while downstream exec nodes are still running. When off-heap column vectors are
+ // enabled, this can cause a use-after-free bug leading to a segfault.
+ //
+ // Instead, we use FileScanRDD's task completion listener to close this iterator.
val iter = new RecordReaderIterator(vectorizedReader)
- // SPARK-23457 Register a task completion listener before `initialization`.
- taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close()))
- vectorizedReader.initialize(split, hadoopAttemptContext)
- logDebug(s"Appending $partitionSchema ${file.partitionValues}")
- vectorizedReader.initBatch(partitionSchema, file.partitionValues)
- if (returningBatch) {
- vectorizedReader.enableReturningBatches()
- }
+ try {
+ vectorizedReader.initialize(split, hadoopAttemptContext)
+ logDebug(s"Appending $partitionSchema ${file.partitionValues}")
+ vectorizedReader.initBatch(partitionSchema, file.partitionValues)
+ if (returningBatch) {
+ vectorizedReader.enableReturningBatches()
+ }
- // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
- iter.asInstanceOf[Iterator[InternalRow]]
+ // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
+ iter.asInstanceOf[Iterator[InternalRow]]
+ } catch {
+ case e: Throwable =>
+ // SPARK-23457: In case there is an exception in initialization, close the iterator to
Review comment:
shall we let the caller `FileScanRDD` close the iterator when hitting errors?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734608343
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -327,18 +327,31 @@ class ParquetFileFormat
int96RebaseMode.toString,
enableOffHeapColumnVector && taskContext.isDefined,
capacity)
+ // SPARK-37089: We cannot register a task completion listener to close this iterator here
+ // because downstream exec nodes have already registered their listeners. Since listeners
+ // are executed in reverse order of registration, a listener registered here would close the
+ // iterator while downstream exec nodes are still running. When off-heap column vectors are
+ // enabled, this can cause a use-after-free bug leading to a segfault.
+ //
+ // Instead, we use FileScanRDD's task completion listener to close this iterator.
val iter = new RecordReaderIterator(vectorizedReader)
- // SPARK-23457 Register a task completion listener before `initialization`.
- taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close()))
- vectorizedReader.initialize(split, hadoopAttemptContext)
- logDebug(s"Appending $partitionSchema ${file.partitionValues}")
- vectorizedReader.initBatch(partitionSchema, file.partitionValues)
- if (returningBatch) {
- vectorizedReader.enableReturningBatches()
- }
+ try {
+ vectorizedReader.initialize(split, hadoopAttemptContext)
+ logDebug(s"Appending $partitionSchema ${file.partitionValues}")
+ vectorizedReader.initBatch(partitionSchema, file.partitionValues)
+ if (returningBatch) {
+ vectorizedReader.enableReturningBatches()
+ }
- // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
- iter.asInstanceOf[Iterator[InternalRow]]
+ // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
+ iter.asInstanceOf[Iterator[InternalRow]]
+ } catch {
+ case e: Throwable =>
+ // SPARK-23457: In case there is an exception in initialization, close the iterator to
Review comment:
In general, I think FileScanRDD does close the iterator when hitting exceptions, because it uses a task completion listener to do so. The only case where it will not close the iterator is when the exception prevents FileScanRDD from getting a reference to the iterator, as is the case here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734608343
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -327,18 +327,31 @@ class ParquetFileFormat
int96RebaseMode.toString,
enableOffHeapColumnVector && taskContext.isDefined,
capacity)
+ // SPARK-37089: We cannot register a task completion listener to close this iterator here
+ // because downstream exec nodes have already registered their listeners. Since listeners
+ // are executed in reverse order of registration, a listener registered here would close the
+ // iterator while downstream exec nodes are still running. When off-heap column vectors are
+ // enabled, this can cause a use-after-free bug leading to a segfault.
+ //
+ // Instead, we use FileScanRDD's task completion listener to close this iterator.
val iter = new RecordReaderIterator(vectorizedReader)
- // SPARK-23457 Register a task completion listener before `initialization`.
- taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close()))
- vectorizedReader.initialize(split, hadoopAttemptContext)
- logDebug(s"Appending $partitionSchema ${file.partitionValues}")
- vectorizedReader.initBatch(partitionSchema, file.partitionValues)
- if (returningBatch) {
- vectorizedReader.enableReturningBatches()
- }
+ try {
+ vectorizedReader.initialize(split, hadoopAttemptContext)
+ logDebug(s"Appending $partitionSchema ${file.partitionValues}")
+ vectorizedReader.initBatch(partitionSchema, file.partitionValues)
+ if (returningBatch) {
+ vectorizedReader.enableReturningBatches()
+ }
- // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
- iter.asInstanceOf[Iterator[InternalRow]]
+ // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
+ iter.asInstanceOf[Iterator[InternalRow]]
+ } catch {
+ case e: Throwable =>
+ // SPARK-23457: In case there is an exception in initialization, close the iterator to
Review comment:
In general, I think FileScanRDD does close the iterator when hitting exceptions, because it [uses a task completion listener](https://github.com/apache/spark/pull/34369/files#diff-b7b097f8cec6a7ae6640f9ecd6d4ac14ed304ad7c8db802a3ec2c0983535e157R208) to do so. The only case where it will not close the iterator is when the exception prevents FileScanRDD from getting a reference to the iterator, as is the case here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734808494
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -85,6 +85,17 @@ class FileScanRDD(
private[this] var currentFile: PartitionedFile = null
private[this] var currentIterator: Iterator[Object] = null
+ private def resetCurrentIterator(): Unit = {
+ currentIterator match {
+ case iter: NextIterator[_] =>
+ iter.closeIfNeeded()
+ case iter: Closeable =>
+ iter.close()
+ case _ => // do nothing
Review comment:
When does this happen?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] sunchao commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734842173
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -128,15 +139,21 @@ class FileScanRDD(
// Sets InputFileBlockHolder for the file block's information
InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
+ resetCurrentIterator()
if (ignoreMissingFiles || ignoreCorruptFiles) {
currentIterator = new NextIterator[Object] {
// The readFunction may read some bytes before consuming the iterator, e.g.,
- // vectorized Parquet reader. Here we use lazy val to delay the creation of
- // iterator so that we will throw exception in `getNext`.
- private lazy val internalIter = readCurrentFile()
+ // vectorized Parquet reader. Here we use a lazily initialized variable to delay the
+ // creation of iterator so that we will throw exception in `getNext`.
+ private var internalIter: Iterator[InternalRow] = null
Review comment:
Got it. Thanks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949708869
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144538/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949766425
**[Test build #144544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144544/testReport)** for PR 34369 at commit [`559224d`](https://github.com/apache/spark/commit/559224d9148fb3fc9a743908a68e2e4f8121d04a).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949760350
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49011/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] ankurdave commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
ankurdave commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949667146
cc @cloud-fan @HyukjinKwon @dongjoon-hyun @mswit-databricks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949846666
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49015/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734840380
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -128,15 +139,21 @@ class FileScanRDD(
// Sets InputFileBlockHolder for the file block's information
InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
+ resetCurrentIterator()
if (ignoreMissingFiles || ignoreCorruptFiles) {
currentIterator = new NextIterator[Object] {
// The readFunction may read some bytes before consuming the iterator, e.g.,
- // vectorized Parquet reader. Here we use lazy val to delay the creation of
- // iterator so that we will throw exception in `getNext`.
- private lazy val internalIter = readCurrentFile()
+ // vectorized Parquet reader. Here we use a lazily initialized variable to delay the
+ // creation of iterator so that we will throw exception in `getNext`.
+ private var internalIter: Iterator[InternalRow] = null
Review comment:
If the downstream operator never pulls any rows from the iterator, then the first time we access `internalIter` will be when `close()` is called. If `internalIter` is a `lazy val`, this will trigger a call to `readCurrentFile()`, which is unnecessary and may throw. Changing `internalIter` from a `lazy val` to a `var` lets us avoid this unnecessary call.
Several tests fail without this change, including `AvroV1Suite`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949854562
**[Test build #144543 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144543/testReport)** for PR 34369 at commit [`c09cf5e`](https://github.com/apache/spark/commit/c09cf5e0beeeafd14e0df0d9d32c283c47262a0f).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949760350
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49011/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949756172
**[Test build #144543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144543/testReport)** for PR 34369 at commit [`c09cf5e`](https://github.com/apache/spark/commit/c09cf5e0beeeafd14e0df0d9d32c283c47262a0f).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #34369:
URL: https://github.com/apache/spark/pull/34369
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949916003
cc @sunchao
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949771646
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49009/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949846666
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49015/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949756172
**[Test build #144543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144543/testReport)** for PR 34369 at commit [`c09cf5e`](https://github.com/apache/spark/commit/c09cf5e0beeeafd14e0df0d9d32c283c47262a0f).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949855217
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144543/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949951007
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144544/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949805594
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49015/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949800896
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49009/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949800896
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49009/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734808494
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##########
@@ -85,6 +85,17 @@ class FileScanRDD(
private[this] var currentFile: PartitionedFile = null
private[this] var currentIterator: Iterator[Object] = null
+ private def resetCurrentIterator(): Unit = {
+ currentIterator match {
+ case iter: NextIterator[_] =>
+ iter.closeIfNeeded()
+ case iter: Closeable =>
+ iter.close()
+ case _ => // do nothing
Review comment:
When does this happen? Only when `currentIterator` is `null`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949914403
Thank you for pinging me, @ankurdave .
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949708869
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144538/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734609634
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -327,18 +327,31 @@ class ParquetFileFormat
int96RebaseMode.toString,
enableOffHeapColumnVector && taskContext.isDefined,
capacity)
+ // SPARK-37089: We cannot register a task completion listener to close this iterator here
+ // because downstream exec nodes have already registered their listeners. Since listeners
+ // are executed in reverse order of registration, a listener registered here would close the
+ // iterator while downstream exec nodes are still running. When off-heap column vectors are
+ // enabled, this can cause a use-after-free bug leading to a segfault.
+ //
+ // Instead, we use FileScanRDD's task completion listener to close this iterator.
val iter = new RecordReaderIterator(vectorizedReader)
- // SPARK-23457 Register a task completion listener before `initialization`.
- taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close()))
- vectorizedReader.initialize(split, hadoopAttemptContext)
- logDebug(s"Appending $partitionSchema ${file.partitionValues}")
- vectorizedReader.initBatch(partitionSchema, file.partitionValues)
- if (returningBatch) {
- vectorizedReader.enableReturningBatches()
- }
+ try {
+ vectorizedReader.initialize(split, hadoopAttemptContext)
+ logDebug(s"Appending $partitionSchema ${file.partitionValues}")
+ vectorizedReader.initBatch(partitionSchema, file.partitionValues)
+ if (returningBatch) {
+ vectorizedReader.enableReturningBatches()
+ }
- // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
- iter.asInstanceOf[Iterator[InternalRow]]
+ // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
+ iter.asInstanceOf[Iterator[InternalRow]]
+ } catch {
+ case e: Throwable =>
+ // SPARK-23457: In case there is an exception in initialization, close the iterator to
Review comment:
ah I see!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949667648
**[Test build #144538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144538/testReport)** for PR 34369 at commit [`29f5e77`](https://github.com/apache/spark/commit/29f5e77f7964ee4380b2eee288f54d431dcfee74).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949766425
**[Test build #144544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144544/testReport)** for PR 34369 at commit [`559224d`](https://github.com/apache/spark/commit/559224d9148fb3fc9a743908a68e2e4f8121d04a).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon edited a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-950437867
Merged to master and branch-3.2.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-950437867
Merged to master.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org