You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by tejasapatil <gi...@git.apache.org> on 2018/01/09 16:14:18 UTC

[GitHub] spark pull request #20206: [SPARK-19256][SQL] Remove ordering enforcement fr...

GitHub user tejasapatil opened a pull request:

    https://github.com/apache/spark/pull/20206

    [SPARK-19256][SQL] Remove ordering enforcement from `FileFormatWriter` and let planner do that

    ## What changes were proposed in this pull request?
    
    Thks is as per discussion in https://github.com/apache/spark/pull/19483#issuecomment-336337516 . Enforcing `Sort` at right places in the tree is something that `EnsureRequirements` should take care of. This PR removes `SORT` node insertion done inside `FileFormatWriter`.
    
    ## How was this patch tested?
    
    - Existing unit tests
    - Looked at the query plan for bucketed insert. `Sort` was added in the plan.
    
    ```
    scala> hc.sql(" desc formatted test1  ").collect.foreach(println)
    .....
    [Num Buckets,8,]
    [Bucket Columns,[`j`, `k`],]
    [Sort Columns,[`j`, `k`],]
    
    
    scala> hc.sql(" EXPLAIN INSERT OVERWRITE TABLE test1 SELECT * FROM test2 ").collect.foreach(println)
    [== Physical Plan ==
    Execute InsertIntoHadoopFsRelationCommand InsertIntoHadoopFsRelationCommand file:/warehouse/test1, false, 8 buckets, bucket columns: [j, k], sort columns: [j, k], ...
    +- *Sort [pmod(hash(j#56, k#57, 42), 8) ASC NULLS FIRST, j#56 ASC NULLS FIRST, k#57 ASC NULLS FIRST], false, 0
       +- *FileScan orc default.test2[i#55,j#56,k#57] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/warehouse/test2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<i:int,j:int,k:string>]
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark refac_sort_req

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20206
    
----
commit 652dca2fa7cbdf73433e2ef8a2b91578c72d84fa
Author: Tejas Patil <te...@...>
Date:   2017-10-13T19:26:09Z

    [SPARK-19256][SQL] Move bucketing constraints out of `FileFormatWriter` into `RunnableCommand`

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    **[Test build #85936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85936/testReport)** for PR 20206 at commit [`8c91ff9`](https://github.com/apache/spark/commit/8c91ff9909fecaa36a79397e36edf64d83adac6b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2242/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    **[Test build #85946 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85946/testReport)** for PR 20206 at commit [`8c91ff9`](https://github.com/apache/spark/commit/8c91ff9909fecaa36a79397e36edf64d83adac6b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20206: [SPARK-19256][SQL] Remove ordering enforcement fr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20206#discussion_r161495402
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -156,40 +145,14 @@ object FileFormatWriter extends Logging {
           statsTrackers = statsTrackers
         )
     
    -    // We should first sort by partition columns, then bucket id, and finally sorting columns.
    -    val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
    -    // the sort order doesn't matter
    -    val actualOrdering = plan.outputOrdering.map(_.child)
    -    val orderingMatched = if (requiredOrdering.length > actualOrdering.length) {
    -      false
    -    } else {
    -      requiredOrdering.zip(actualOrdering).forall {
    -        case (requiredOrder, childOutputOrder) =>
    -          requiredOrder.semanticEquals(childOutputOrder)
    -      }
    -    }
    -
         SQLExecution.checkSQLExecutionId(sparkSession)
     
         // This call shouldn't be put into the `try` block below because it only initializes and
         // prepares the job, any exception thrown from here shouldn't cause abortJob() to be called.
         committer.setupJob(job)
     
         try {
    -      val rdd = if (orderingMatched) {
    -        plan.execute()
    -      } else {
    -        // SPARK-21165: the `requiredOrdering` is based on the attributes from analyzed plan, and
    -        // the physical plan may have different attribute ids due to optimizer removing some
    -        // aliases. Here we bind the expression ahead to avoid potential attribute ids mismatch.
    --- End diff --
    
    This concern is still valid, the `DataWritingCommand.requiredChildOrdering` is based on logical plan's output attribute ids, how can we safely apply it in `DataWritingCommandExec`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20206: [SPARK-19256][SQL] Remove ordering enforcement fr...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20206#discussion_r161575592
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -150,6 +152,10 @@ case class InsertIntoHadoopFsRelationCommand(
             }
           }
     
    +      val partitionSet = AttributeSet(partitionColumns)
    +      val dataColumns = query.output.filterNot(partitionSet.contains)
    --- End diff --
    
    +1, it should be `outputColumns ` here, which is the output columns of analyzed plan. See #20020 for details.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85936/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    **[Test build #85891 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85891/testReport)** for PR 20206 at commit [`1008b2e`](https://github.com/apache/spark/commit/1008b2efe8256fe95ae61ebba1ab673e0f397118).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Jenkins retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    **[Test build #85936 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85936/testReport)** for PR 20206 at commit [`8c91ff9`](https://github.com/apache/spark/commit/8c91ff9909fecaa36a79397e36edf64d83adac6b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    **[Test build #85859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85859/testReport)** for PR 20206 at commit [`652dca2`](https://github.com/apache/spark/commit/652dca2fa7cbdf73433e2ef8a2b91578c72d84fa).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85891/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20206: [SPARK-19256][SQL] Remove ordering enforcement fr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20206#discussion_r161496099
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -184,6 +190,43 @@ case class InsertIntoHadoopFsRelationCommand(
         Seq.empty[Row]
       }
     
    +  private def getBucketIdExpression(dataColumns: Seq[Attribute]): Option[Expression] = {
    +    bucketSpec.map { spec =>
    +      val bucketColumns = spec.bucketColumnNames.map(c => dataColumns.find(_.name == c).get)
    +      // Use `HashPartitioning.partitionIdExpression` as our bucket id expression, so that we can
    +      // guarantee the data distribution is same between shuffle and bucketed data source, which
    +      // enables us to only shuffle one side when join a bucketed table and a normal one.
    +      HashPartitioning(bucketColumns, spec.numBuckets).partitionIdExpression
    +    }
    +  }
    +
    +  /**
    +   * How is `requiredOrdering` determined ?
    +   *
    +   *     table type   |            requiredOrdering
    +   * -----------------+-------------------------------------------------
    +   *   normal table   |             partition columns
    --- End diff --
    
    nit: `non-bucketed table`, a partitioned table is not a normal table...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    **[Test build #85891 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85891/testReport)** for PR 20206 at commit [`1008b2e`](https://github.com/apache/spark/commit/1008b2efe8256fe95ae61ebba1ab673e0f397118).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Hi all, any update here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85946/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    cc @cloud-fan @gengliangwang for review


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20206: [SPARK-19256][SQL] Remove ordering enforcement fr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20206#discussion_r161496715
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -150,6 +152,10 @@ case class InsertIntoHadoopFsRelationCommand(
             }
           }
     
    +      val partitionSet = AttributeSet(partitionColumns)
    +      val dataColumns = query.output.filterNot(partitionSet.contains)
    --- End diff --
    
    We should use `outputColumns` instead of `query.output`, cc @gengliangwang 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85859/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    **[Test build #85859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85859/testReport)** for PR 20206 at commit [`652dca2`](https://github.com/apache/spark/commit/652dca2fa7cbdf73433e2ef8a2b91578c72d84fa).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20206
  
    **[Test build #85946 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85946/testReport)** for PR 20206 at commit [`8c91ff9`](https://github.com/apache/spark/commit/8c91ff9909fecaa36a79397e36edf64d83adac6b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org