You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/28 16:34:32 UTC

[GitHub] [spark] Ngone51 opened a new pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Ngone51 opened a new pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031
 
 
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   In `ParquetUtils.splitFiles()` I changed the logic to:
   1) if `spark.sql.parquet.mergeSchema=false`, then don't sort any files.
   2) if `spark.sql.parquet.mergeSchema=true`,
      2.1) if `spark.sql.parquet.respectSummaryFiles=false`, then sort all files
      2.1) if `spark.sql.parquet.respectSummaryFiles=true`, then only sort metadata files
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   According to [SPARK-11500](https://issues.apache.org/jira/browse/SPARK-11500), files' order only matters when schema merged is need. So, we cloud avoid unnecessary sort when we don't merge schema, especially when there're plenty of files.
   
   Another minor improvement is grouping metadata files and data files firstly to avoid traversing all file for three times while constructing `FileTypes`.
   
   
   ### Does this PR introduce any user-facing change?
   <!--
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If no, write 'No'.
   -->
   
   No.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   Partially covered by SPARK-11500. And I also want to test metadata file part, but I don't find any metadata files while using write parquet files...

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569431403
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569448727
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115889/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569431282
 
 
   **[Test build #115889 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115889/testReport)** for PR 27031 at commit [`085d9d4`](https://github.com/apache/spark/commit/085d9d48402a28cb228ea14ea5d88a8ca0ed933e).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361915141
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   It was actually @liancheng's advice (wow 5 years ago). Sorting it will gives you a deterministic result. Relying on the native listing order doesn't look a great idea. From a cursory look, HDFS's vs Java's `File.list` (by `RawLocalFileSystem`) behaviours look inconsistent as an example.
   
   In addition, sorting itself won't cause a lot of performance penalty given most of performance penalty usually comes from listing up files, and given that the number of files should be, really, really high which isn't a pretty common case.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569448727
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115889/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361919592
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   IIUC, files's deterministic order only matters when we need to merge schema? If so, I think this PR doesn't introduce any behavior change on files' order. We'll still get a deterministic order when we need it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361915141
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   It was actually @liancheng's advice (wow 5 years ago). Sorting it will gives you a deterministic result. Relying on the native listing order doesn't look a great idea. From a cursory look, HDFS's vs Java's `File.list` (by `RawLocalFileSystem`) behaviours look inconsistent as an example.
   
   In addition, sorting itself won't cause a lot of performance penalty given most of performance penalty comes from listing up files, and given that the number of files should be, really, really high which isn't a pretty common case.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569431405
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20679/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361925962
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   Yes, that issue was specific to the ordering; however, it can cause another behaviour change: we will pick one file to read the schema, right?
   
   It is possible to put the complete parquet file in the top in a directory, and other files are partial. For instance,
   
   ```scala
   scala> spark.read.parquet("test/a.snappy.parquet")
   res1: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
   
   scala> spark.read.parquet("test/b.snappy.parquet")
   res2: org.apache.spark.sql.DataFrame = [b: bigint]
   
   scala> spark.read.parquet("test").show()
   +----+---+
   |   a|  b|
   +----+---+
   |   0|  0|
   |   1|  1|
   |   2|  2|
   |null|  0|
   |null|  1|
   |null|  2|
   +----+---+
   ```
   
   If the order is switched:
   
   ```scala
   scala> spark.read.parquet("test").show()
   +---+
   |  b|
   +---+
   |  0|
   |  1|
   |  2|
   |  0|
   |  1|
   |  2|
   +---+
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569431282
 
 
   **[Test build #115889 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115889/testReport)** for PR 27031 at commit [`085d9d4`](https://github.com/apache/spark/commit/085d9d48402a28cb228ea14ea5d88a8ca0ed933e).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569435394
 
 
   cc @cloud-fan @HyukjinKwon 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569448725
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361934666
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   That's true but users always find awesome ways to use or abuse :-). True, this case `mergeSchema` have better to be set `true` but many people want to avoid in production because `mergeSchema` is super expensive as it needs to touch every file.
   
   In non performance sensitive codes, we better keep the codes simple and more readable anyway.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361929417
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   Thanks for providing a specific case. But I think that user should really need to use `mergeSchema=true` option for such case if there's no guarantee that the completed parquet file should always be on top of a directory.
   
   Besides, I believe that there's assuming that schemas should be consistent if `mergeSchema=false`. And user should be responsible for the result schema if not.
   
   >  In either case, we fall back to any of the first part-file, and just assume all schemas are consistent.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361915141
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   It was actually @liancheng's advice (wow 5 years ago). Sorting it will gives you a deterministic result. Relying on the native listing order doesn't look a great idea. From a cursory look, HDFS's vs Java's `File.list` (by `RawLocalFileSystem`) behaviours look inconsistent as an example.
   
   In addition, sorting itself won't cause a lot of performance penalty given most of performance penalty usually comes from listing up files, and given that the number of files should be, really, really high to cause some performance penalty which isn't a pretty common case.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569431403
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361934666
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   That's true but users always find awesome ways to use. True, this case `mergeSchema` have better to be set `true` but many people want to avoid in production because `mergeSchema` is super expensive as it needs to touch every file.
   
   In non performance sensitive codes, we better keep the codes simple and more readable anyway.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361920840
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   And thanks for providing performance penalty knowledge around here. I agree that if there' no obviously performance improvement, this may not be such necessary change.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361917663
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   Also, IIRC, `sortBy` uses Timsort under the hood which gives you linear time complexity when the input is already sorted, which I believe are in most cases here.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361915141
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   It was actually @liancheng's advice (wow 5 years ago). Sorting it will gives you a deterministic result. Relying on the native listing order doesn't look a great idea. From a cursory look, HDFS's vs Java's `File.list` (by `RawLocalFileSystem `) behaviours look inconsistent as an example.
   
   In addition, sorting itself won't cause a lot of performance penalty given most of performance penalty comes from listing up files, and given that the number of files should be, really, really high which isn't a pretty common case.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-570176896
 
 
   @Ngone51 for non-trivial changes like this, we should make sure there are significant benefits. I agree with @HyukjinKwon and don't think sorting the files can be a bottleneck.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#discussion_r361925962
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala
 ##########
 @@ -112,15 +114,26 @@ object ParquetUtils {
       metadata: Seq[FileStatus],
       commonMetadata: Seq[FileStatus])
 
-  private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
-    val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
 Review comment:
   Yes, that issue was specific to the ordering; however, it can cause a behaviour change: we will pick one file to read the schema, right?
   
   It is possible to put the complete parquet file in the top in a directory, and other files are partial. For instance,
   
   ```scala
   scala> spark.read.parquet("test/a.snappy.parquet")
   res1: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
   
   scala> spark.read.parquet("test/b.snappy.parquet")
   res2: org.apache.spark.sql.DataFrame = [b: bigint]
   
   scala> spark.read.parquet("test").show()
   +----+---+
   |   a|  b|
   +----+---+
   |   0|  0|
   |   1|  1|
   |   2|  2|
   |null|  0|
   |null|  1|
   |null|  2|
   +----+---+
   ```
   
   If the order is switched:
   
   ```scala
   scala> spark.read.parquet("test").show()
   +---+
   |  b|
   +---+
   |  0|
   |  1|
   |  2|
   |  0|
   |  1|
   |  2|
   +---+
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang closed pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
gengliangwang closed pull request #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569448725
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569431405
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20679/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-570419904
 
 
   +1 with @HyukjinKwon .
   I am closing this one. 
   @Ngone51 You can reopen this one if you have more thoughts.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27031: [SPARK-30373][SQL] Avoid unnecessary sort for ParquetUtils.splitFiles
URL: https://github.com/apache/spark/pull/27031#issuecomment-569448605
 
 
   **[Test build #115889 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115889/testReport)** for PR 27031 at commit [`085d9d4`](https://github.com/apache/spark/commit/085d9d48402a28cb228ea14ea5d88a8ca0ed933e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org