Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/12 11:15:22 UTC

[GitHub] [arrow-datafusion] yjshen commented on pull request #1990: WIP: Finer-grained parallelism for Parquet Scan

yjshen commented on pull request #1990:
URL: https://github.com/apache/arrow-datafusion/pull/1990#issuecomment-1065864478


   Hi @tustvold, the filter is based on the midpoint position of each row group. It was recently introduced in the parquet crate in https://github.com/apache/arrow-rs/commit/2bca71e322fcab6c6d93a47ef71638a617e29f6c. The midpoint filtering is modeled after [ParquetSplit](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L67-L91) and [MetadataConverter](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1241-L1292) in parquet-mr.
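
   To illustrate the midpoint rule, here is a minimal sketch rather than the actual parquet crate code. `RowGroupInfo` and `row_groups_in_range` are hypothetical names, and the assumption is that a row group is selected when its midpoint (start offset plus half its compressed size) falls in the half-open byte range `[range_start, range_end)`:

   ```rust
   /// Hypothetical stand-in for row-group metadata (not the parquet crate's type):
   /// the byte offset where the row group starts and its total compressed size.
   #[derive(Debug, Clone, Copy, PartialEq)]
   struct RowGroupInfo {
       start: u64,
       compressed_size: u64,
   }

   /// Returns the indices of the row groups whose midpoint lies in
   /// the half-open byte range `[range_start, range_end)`.
   fn row_groups_in_range(
       groups: &[RowGroupInfo],
       range_start: u64,
       range_end: u64,
   ) -> Vec<usize> {
       groups
           .iter()
           .enumerate()
           .filter(|(_, g)| {
               // Midpoint of the row group in file byte offsets.
               let mid = g.start + g.compressed_size / 2;
               mid >= range_start && mid < range_end
           })
           .map(|(i, _)| i)
           .collect()
   }
   ```

   Because every midpoint lies in exactly one of a set of non-overlapping ranges that cover the file, each row group is read by exactly one partition, even when it straddles a range boundary.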
   
   Row-group-level parallelism for Parquet is already used in MapReduce and Spark. In Spark, [`splitFiles`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala#L26-L45) generates task partitions based on the partition size settings, and it may assign different parts of a large Parquet file to different partitions.
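
   As a rough sketch of that kind of size-based splitting (not Spark's code; `split_file_ranges` and `max_split_bytes` are hypothetical names), a file can be cut into consecutive `(offset, length)` byte ranges of at most `max_split_bytes` each:

   ```rust
   /// Splits a file of `file_len` bytes into consecutive `(offset, length)`
   /// byte ranges, each at most `max_split_bytes` long.
   /// Hypothetical helper for illustration only.
   fn split_file_ranges(file_len: u64, max_split_bytes: u64) -> Vec<(u64, u64)> {
       assert!(max_split_bytes > 0, "max_split_bytes must be positive");
       let mut ranges = Vec::new();
       let mut offset = 0u64;
       while offset < file_len {
           // The last range may be shorter than max_split_bytes.
           let len = max_split_bytes.min(file_len - offset);
           ranges.push((offset, len));
           offset += len;
       }
       ranges
   }
   ```

   Each resulting range can then be handed to a separate scan task, which reads only the row groups that a range-based filter assigns to it.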
   
   This PR is still WIP, since only the physical plan changes are implemented so far. We translate the Spark physical plan into a DataFusion physical plan so that it runs natively in DataFusion: https://github.com/blaze-init/spark-blaze-extension/blob/master/src/main/scala/org/apache/spark/sql/blaze/plan/NativeParquetScanExec.scala#L57-L63

