You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2015/11/06 08:51:48 UTC

[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/9517

    [SPARK-11500][SQL] Not deterministic order of columns when using merging schemas.

    https://issues.apache.org/jira/browse/SPARK-11500
    
    As filed in SPARK-11500, if merging schemas is enabled, the order of files to touch is a matter which might affect the ordering of the output columns. 
    
    This was mostly because of the use of `Set` and `Map` so I replaced them to `LinkedHashSet` and `LinkedHashMap` to keep the insertion order.
    
    Also, reducing order is set left, and replaced the order of `filesToTouch` from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` to  `needMerged ++ metadataStatuses ++ commonMetadataStatuses` in order to touch the part-files first which always have the schema in footers whereas the others might not exist.
    
    One nit is, If merging schemas is enabled, but when multiple files are given, there is no guarantee of the output order, since there might not be a summary file for the first file, which ends up putting ahead the columns of the other files. 
    
    However, I thought this should be okay since disabling merging schemas means (assumes) all the files have the same schemas.
    
    In addition, in the test code for this, I only checked the names of fields.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-11500

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9517.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9517
    
----
commit b0e6ce2729f584a9f95996707f60eb650c2a58b9
Author: hyukjinkwon <gu...@gmail.com>
Date:   2015-11-06T07:38:26Z

    [SPARK-11500][SQL] Not deterministic order of columns when using merging schemas.

commit 08fc91ca8d21902677e78f0adb3b36769f2cba51
Author: hyukjinkwon <gu...@gmail.com>
Date:   2015-11-06T07:38:55Z

    [SPARK-11500][SQL] Add a test to check the deterministic order.

commit bcf72d3ca308f9a69993803d9c8939696c915b07
Author: hyukjinkwon <gu...@gmail.com>
Date:   2015-11-06T07:40:17Z

    [SPARK-11500][SQL] Remove trailing newline.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155309503
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45489/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155019607
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154531328
  
    Build finished. No test results found.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154906427
  
    **[Test build #45324 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45324/consoleFull)** for PR 9517 at commit [`bcf72d3`](https://github.com/apache/spark/commit/bcf72d3ca308f9a69993803d9c8939696c915b07).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9517#discussion_r44247087
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala ---
    @@ -155,4 +155,22 @@ class ParquetHadoopFsRelationSuite extends HadoopFsRelationTest {
           assert(physicalPlan.collect { case p: execution.Filter => p }.length === 1)
         }
       }
    +
    +  test("SPARK-11500: Not deterministic order of columns when using merging schemas.") {
    +    import testImplicits._
    +    withSQLConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
    +      withTempPath { dir =>
    +        val pathOne = s"${dir.getCanonicalPath}/table1"
    +        Seq(1, 1).zipWithIndex.toDF("a", "b").write.parquet(pathOne)
    +        val pathTwo = s"${dir.getCanonicalPath}/table2"
    +        Seq(1, 1).zipWithIndex.toDF("c", "b").write.parquet(pathTwo)
    +        val pathThree = s"${dir.getCanonicalPath}/table3"
    +        Seq(1, 1).zipWithIndex.toDF("d", "b").write.parquet(pathThree)
    --- End diff --
    
    Thanks for commands!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155281155
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155707374
  
    Also backported to branch-1.6.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155019577
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155281726
  
    **[Test build #45489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45489/consoleFull)** for PR 9517 at commit [`32dfb87`](https://github.com/apache/spark/commit/32dfb87ce36a093c54d4a3dfd39ccbc00c417af9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155053133
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154502425
  
    **[Test build #45235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45235/consoleFull)** for PR 9517 at commit [`bcf72d3`](https://github.com/apache/spark/commit/bcf72d3ca308f9a69993803d9c8939696c915b07).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154503748
  
    Build finished. No test results found.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154891112
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9517#discussion_r44245271
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
    @@ -461,13 +461,29 @@ private[sql] class ParquetRelation(
               // You should enable this configuration only if you are very sure that for the parquet
               // part-files to read there are corresponding summary files containing correct schema.
     
    +          // As filed in SPARK-11500, the order of files to touch is a matter, which might affect
    +          // the ordering of the output columns. There are several things to mention here.
    +          //
    +          //  1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
    +          //     the first part-file so that the columns of the first file show first.
    +          //
    +          //  2. If mergeRespectSummaries config is true, then there should be, at least,
    +          //     "_metadata"s for all given files. So, we can ensure the columns of the first file
    +          //     show first.
    +          //
    +          //  3. If shouldMergeSchemas is false, but when multiple files are given, there is
    +          //     no guarantee of the output order, since there might not be a summary file for the
    +          //     first file, which ends up putting ahead the columns of the other files. However,
    +          //     this should be okay since not enabling shouldMergeSchemas means (assumes) all the
    +          //     files have the same schemas.
    +
               val needMerged: Seq[FileStatus] =
                 if (mergeRespectSummaries) {
                   Seq()
                 } else {
                   dataStatuses
                 }
    -          (metadataStatuses ++ commonMetadataStatuses ++ needMerged).toSeq
    +          needMerged ++ metadataStatuses ++ commonMetadataStatuses
    --- End diff --
    
    Does HDFS guarantee that the result of `listStatus()` is always sorted? If not, we probably need to sort these `FileStatus`es.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154503753
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45233/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155020044
  
    In this commit, I added partitioned tables for the test and sorted the `FileStatus`es.
    
    There are several things to mention here.
    
    Firstly, now we do not need to change `Set` to `LinkedHashSet` and `Map` to `LinkedHashMap` for this issue since it manually sorts the `FileStatus`es. However, I left them as I though anyway the order of files better be in the order as they are retrieved. If that looks weird, I would like to get it back.
    
    Secondly, in any cases, the columns of the lexicographically first file shows first, which might be a matter for files starting/containing with numeric values. However, I left this as I though anyway it is deterministic.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154531343
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45235/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155280757
  
    I used `sortBy` instead of `sortWith`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155021593
  
    **[Test build #45359 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45359/consoleFull)** for PR 9517 at commit [`4f47063`](https://github.com/apache/spark/commit/4f4706352c84469503ae3c3388098458b570f62f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9517#discussion_r44244350
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala ---
    @@ -155,4 +155,22 @@ class ParquetHadoopFsRelationSuite extends HadoopFsRelationTest {
           assert(physicalPlan.collect { case p: execution.Filter => p }.length === 1)
         }
       }
    +
    +  test("SPARK-11500: Not deterministic order of columns when using merging schemas.") {
    +    import testImplicits._
    +    withSQLConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
    +      withTempPath { dir =>
    +        val pathOne = s"${dir.getCanonicalPath}/table1"
    +        Seq(1, 1).zipWithIndex.toDF("a", "b").write.parquet(pathOne)
    +        val pathTwo = s"${dir.getCanonicalPath}/table2"
    +        Seq(1, 1).zipWithIndex.toDF("c", "b").write.parquet(pathTwo)
    +        val pathThree = s"${dir.getCanonicalPath}/table3"
    +        Seq(1, 1).zipWithIndex.toDF("d", "b").write.parquet(pathThree)
    --- End diff --
    
    We should probably use a partitioned table here. Directories like `base/table1`, `base/table2`, and `base/table3` are not valid partition directory names, and loading `base` as a Parquet file should throw an exception. It's not expected that this test case can pass.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154502163
  
    Build triggered. sha1 is merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154498818
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155053016
  
    **[Test build #45359 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45359/consoleFull)** for PR 9517 at commit [`4f47063`](https://github.com/apache/spark/commit/4f4706352c84469503ae3c3388098458b570f62f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154338582
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154891106
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154906491
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155309501
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154890873
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154499906
  
    Build triggered. sha1 is merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154891629
  
    **[Test build #45324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45324/consoleFull)** for PR 9517 at commit [`bcf72d3`](https://github.com/apache/spark/commit/bcf72d3ca308f9a69993803d9c8939696c915b07).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155281161
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154501691
  
    add to whitelist 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9517


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155706699
  
    LGTM, merged to master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154338571
  
    cc @liancheng 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-155309294
  
    **[Test build #45489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45489/consoleFull)** for PR 9517 at commit [`32dfb87`](https://github.com/apache/spark/commit/32dfb87ce36a093c54d4a3dfd39ccbc00c417af9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154502199
  
    Build started sha1 is merged.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9517#issuecomment-154499940
  
    Build started sha1 is merged.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9517#discussion_r44247117
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
    @@ -461,13 +461,29 @@ private[sql] class ParquetRelation(
               // You should enable this configuration only if you are very sure that for the parquet
               // part-files to read there are corresponding summary files containing correct schema.
     
    +          // As filed in SPARK-11500, the order of files to touch is a matter, which might affect
    +          // the ordering of the output columns. There are several things to mention here.
    +          //
    +          //  1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
    +          //     the first part-file so that the columns of the first file show first.
    +          //
    +          //  2. If mergeRespectSummaries config is true, then there should be, at least,
    +          //     "_metadata"s for all given files. So, we can ensure the columns of the first file
    +          //     show first.
    +          //
    +          //  3. If shouldMergeSchemas is false, but when multiple files are given, there is
    +          //     no guarantee of the output order, since there might not be a summary file for the
    +          //     first file, which ends up putting ahead the columns of the other files. However,
    +          //     this should be okay since not enabling shouldMergeSchemas means (assumes) all the
    +          //     files have the same schemas.
    +
               val needMerged: Seq[FileStatus] =
                 if (mergeRespectSummaries) {
                   Seq()
                 } else {
                   dataStatuses
                 }
    -          (metadataStatuses ++ commonMetadataStatuses ++ needMerged).toSeq
    +          needMerged ++ metadataStatuses ++ commonMetadataStatuses
    --- End diff --
    
    Yes, I think I should sort them.
    It looks it is not really recommended just to use it as it is, although they looks sorted, assuming from [this link](http://lucene.472066.n3.nabble.com/FileSystem-contract-of-listStatus-td3475540.html). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org