You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by tejasapatil <gi...@git.apache.org> on 2018/01/09 16:14:18 UTC
[GitHub] spark pull request #20206: [SPARK-19256][SQL] Remove ordering enforcement fr...
GitHub user tejasapatil opened a pull request:
https://github.com/apache/spark/pull/20206
[SPARK-19256][SQL] Remove ordering enforcement from `FileFormatWriter` and let planner do that
## What changes were proposed in this pull request?
Thks is as per discussion in https://github.com/apache/spark/pull/19483#issuecomment-336337516 . Enforcing `Sort` at right places in the tree is something that `EnsureRequirements` should take care of. This PR removes `SORT` node insertion done inside `FileFormatWriter`.
## How was this patch tested?
- Existing unit tests
- Looked at the query plan for bucketed insert. `Sort` was added in the plan.
```
scala> hc.sql(" desc formatted test1 ").collect.foreach(println)
.....
[Num Buckets,8,]
[Bucket Columns,[`j`, `k`],]
[Sort Columns,[`j`, `k`],]
scala> hc.sql(" EXPLAIN INSERT OVERWRITE TABLE test1 SELECT * FROM test2 ").collect.foreach(println)
[== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand InsertIntoHadoopFsRelationCommand file:/warehouse/test1, false, 8 buckets, bucket columns: [j, k], sort columns: [j, k], ...
+- *Sort [pmod(hash(j#56, k#57, 42), 8) ASC NULLS FIRST, j#56 ASC NULLS FIRST, k#57 ASC NULLS FIRST], false, 0
+- *FileScan orc default.test2[i#55,j#56,k#57] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/warehouse/test2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<i:int,j:int,k:string>]
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tejasapatil/spark refac_sort_req
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20206.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20206
----
commit 652dca2fa7cbdf73433e2ef8a2b91578c72d84fa
Author: Tejas Patil <te...@...>
Date: 2017-10-13T19:26:09Z
[SPARK-19256][SQL] Move bucketing constraints out of `FileFormatWriter` into `RunnableCommand`
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20206
**[Test build #85936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85936/testReport)** for PR 20206 at commit [`8c91ff9`](https://github.com/apache/spark/commit/8c91ff9909fecaa36a79397e36edf64d83adac6b).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20206
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2242/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20206
**[Test build #85946 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85946/testReport)** for PR 20206 at commit [`8c91ff9`](https://github.com/apache/spark/commit/8c91ff9909fecaa36a79397e36edf64d83adac6b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20206: [SPARK-19256][SQL] Remove ordering enforcement fr...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/20206#discussion_r161495402
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
@@ -156,40 +145,14 @@ object FileFormatWriter extends Logging {
statsTrackers = statsTrackers
)
- // We should first sort by partition columns, then bucket id, and finally sorting columns.
- val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
- // the sort order doesn't matter
- val actualOrdering = plan.outputOrdering.map(_.child)
- val orderingMatched = if (requiredOrdering.length > actualOrdering.length) {
- false
- } else {
- requiredOrdering.zip(actualOrdering).forall {
- case (requiredOrder, childOutputOrder) =>
- requiredOrder.semanticEquals(childOutputOrder)
- }
- }
-
SQLExecution.checkSQLExecutionId(sparkSession)
// This call shouldn't be put into the `try` block below because it only initializes and
// prepares the job, any exception thrown from here shouldn't cause abortJob() to be called.
committer.setupJob(job)
try {
- val rdd = if (orderingMatched) {
- plan.execute()
- } else {
- // SPARK-21165: the `requiredOrdering` is based on the attributes from analyzed plan, and
- // the physical plan may have different attribute ids due to optimizer removing some
- // aliases. Here we bind the expression ahead to avoid potential attribute ids mismatch.
--- End diff --
This concern is still valid, the `DataWritingCommand.requiredChildOrdering` is based on logical plan's output attribute ids, how can we safely apply it in `DataWritingCommandExec`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20206
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20206: [SPARK-19256][SQL] Remove ordering enforcement fr...
Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:
https://github.com/apache/spark/pull/20206#discussion_r161575592
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -150,6 +152,10 @@ case class InsertIntoHadoopFsRelationCommand(
}
}
+ val partitionSet = AttributeSet(partitionColumns)
+ val dataColumns = query.output.filterNot(partitionSet.contains)
--- End diff --
+1, it should be `outputColumns ` here, which is the output columns of analyzed plan. See #20020 for details.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20206
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85936/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20206
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20206
**[Test build #85891 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85891/testReport)** for PR 20206 at commit [`1008b2e`](https://github.com/apache/spark/commit/1008b2efe8256fe95ae61ebba1ab673e0f397118).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/20206
Jenkins retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20206
**[Test build #85936 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85936/testReport)** for PR 20206 at commit [`8c91ff9`](https://github.com/apache/spark/commit/8c91ff9909fecaa36a79397e36edf64d83adac6b).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20206
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20206
**[Test build #85859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85859/testReport)** for PR 20206 at commit [`652dca2`](https://github.com/apache/spark/commit/652dca2fa7cbdf73433e2ef8a2b91578c72d84fa).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20206
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85891/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20206: [SPARK-19256][SQL] Remove ordering enforcement fr...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/20206#discussion_r161496099
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -184,6 +190,43 @@ case class InsertIntoHadoopFsRelationCommand(
Seq.empty[Row]
}
+ private def getBucketIdExpression(dataColumns: Seq[Attribute]): Option[Expression] = {
+ bucketSpec.map { spec =>
+ val bucketColumns = spec.bucketColumnNames.map(c => dataColumns.find(_.name == c).get)
+ // Use `HashPartitioning.partitionIdExpression` as our bucket id expression, so that we can
+ // guarantee the data distribution is same between shuffle and bucketed data source, which
+ // enables us to only shuffle one side when join a bucketed table and a normal one.
+ HashPartitioning(bucketColumns, spec.numBuckets).partitionIdExpression
+ }
+ }
+
+ /**
+ * How is `requiredOrdering` determined ?
+ *
+ * table type | requiredOrdering
+ * -----------------+-------------------------------------------------
+ * normal table | partition columns
--- End diff --
nit: `non-bucketed table`, a partitioned table is not a normal table...
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20206
**[Test build #85891 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85891/testReport)** for PR 20206 at commit [`1008b2e`](https://github.com/apache/spark/commit/1008b2efe8256fe95ae61ebba1ab673e0f397118).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20206
Hi all, any update here?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20206
Build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20206
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85946/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/20206
cc @cloud-fan @gengliangwang for review
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20206: [SPARK-19256][SQL] Remove ordering enforcement fr...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/20206#discussion_r161496715
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -150,6 +152,10 @@ case class InsertIntoHadoopFsRelationCommand(
}
}
+ val partitionSet = AttributeSet(partitionColumns)
+ val dataColumns = query.output.filterNot(partitionSet.contains)
--- End diff --
We should use `outputColumns` instead of `query.output`, cc @gengliangwang
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20206
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85859/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20206
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20206
**[Test build #85859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85859/testReport)** for PR 20206 at commit [`652dca2`](https://github.com/apache/spark/commit/652dca2fa7cbdf73433e2ef8a2b91578c72d84fa).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20206
**[Test build #85946 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85946/testReport)** for PR 20206 at commit [`8c91ff9`](https://github.com/apache/spark/commit/8c91ff9909fecaa36a79397e36edf64d83adac6b).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org