Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/28 16:34:32 UTC

[GitHub] [spark] Ngone51 opened a new pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

URL: https://github.com/apache/spark/pull/27031

### What changes were proposed in this pull request?

In `ParquetUtils.splitFiles()`, I changed the sorting logic as follows:
1) if `spark.sql.parquet.mergeSchema=false`, don't sort any files.
2) if `spark.sql.parquet.mergeSchema=true`:
2.1) if `spark.sql.parquet.respectSummaryFiles=false`, sort all files
2.2) if `spark.sql.parquet.respectSummaryFiles=true`, sort only the metadata files
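
The rules above can be sketched as a small standalone function. This is only an illustration, not the code in the patch: `SortDecisionSketch` and `orderFiles` are hypothetical names, and plain path strings stand in for Hadoop `FileStatus` objects.

```scala
object SortDecisionSketch {
  // Hypothetical helper: returns (dataFiles, summaryFiles), sorted only where
  // the two settings require a deterministic order.
  def orderFiles(
      dataFiles: Seq[String],
      summaryFiles: Seq[String],
      mergeSchema: Boolean,
      respectSummaryFiles: Boolean): (Seq[String], Seq[String]) = {
    if (!mergeSchema) {
      // 1) No schema merging: file order is irrelevant, so skip sorting.
      (dataFiles, summaryFiles)
    } else if (respectSummaryFiles) {
      // 2.2) Summary files are respected: sort only the metadata files.
      (dataFiles, summaryFiles.sorted)
    } else {
      // 2.1) Summary files are not respected: sort all files.
      (dataFiles.sorted, summaryFiles.sorted)
    }
  }
}
```

With `mergeSchema=false` both lists come back in their input order, which is where the saving for large file listings would come from.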

### Why are the changes needed?

According to [SPARK-11500](https://issues.apache.org/jira/browse/SPARK-11500), the order of files only matters when schema merging is needed. So we could avoid an unnecessary sort when we don't merge schemas, especially when there are plenty of files.

Another minor improvement is to group metadata files and data files first, which avoids traversing all files three times while constructing `FileTypes`.
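
A minimal sketch of that grouping idea, assuming plain path strings instead of `FileStatus` and a simplified stand-in for `FileTypes`. The names `GroupFilesSketch` and `groupFiles` are hypothetical. The file names `_metadata` and `_common_metadata` are the Parquet summary file names.

```scala
object GroupFilesSketch {
  // Simplified stand-in for the real FileTypes, holding path strings only.
  final case class FileTypes(
      data: Seq[String],
      metadata: Seq[String],
      commonMetadata: Seq[String])

  // Last path segment, e.g. "/t/p=1/_metadata" -> "_metadata".
  private def fileName(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  def groupFiles(paths: Seq[String]): FileTypes = {
    // One pass over all files: separate summary files from data files.
    val (summary, data) = paths.partition { p =>
      val name = fileName(p)
      name == "_metadata" || name == "_common_metadata"
    }
    // Second pass only over the (usually small) set of summary files.
    val (metadata, commonMetadata) =
      summary.partition(p => fileName(p) == "_metadata")
    FileTypes(data, metadata, commonMetadata)
  }
}
```

Compared with filtering the full listing once per category, this walks the full listing once and then only the summary subset. `partition` keeps the relative order of elements, so any sorting can still be applied afterwards to whichever group needs it.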

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Partially covered by SPARK-11500. I also wanted to test the metadata file part, but I couldn't find any metadata files when writing Parquet files...