Posted to reviews@spark.apache.org by dilipbiswal <gi...@git.apache.org> on 2018/04/26 22:19:45 UTC

[GitHub] spark pull request #21174: [SPARK-24085] Query returns UnsupportedOperationE...

GitHub user dilipbiswal opened a pull request:

    https://github.com/apache/spark/pull/21174

    [SPARK-24085] Query returns UnsupportedOperationException when scalar subquery is present in partitioning expression

    ## What changes were proposed in this pull request?
    In this case, partition pruning happens before scalar subquery expressions are planned.
    Scalar subqueries are planned late in the cycle (after physical planning), in "PlanSubqueries", just before execution. Currently we try to evaluate the scalar subquery expression as part of partition pruning, and this fails because the expression implements Unevaluable.
    
    The fix excludes subquery expressions from the partition pruning computation. Another option would be to plan the subqueries before partition pruning. Since this is probably not a commonly occurring expression, I am opting for the simpler fix.
    
    Repro
    ``` SQL
    CREATE TABLE test_prc_bug (
    id_value string
    )
    partitioned by (id_type string)
    location '/tmp/test_prc_bug'
    stored as parquet;
    
    insert into test_prc_bug values ('1','a');
    insert into test_prc_bug values ('2','a');
    insert into test_prc_bug values ('3','b');
    insert into test_prc_bug values ('4','b');
    
    
    select * from test_prc_bug
    where id_type = (select 'b');
    ```
    ## How was this patch tested?
    Added tests to SubquerySuite and hive/SQLQuerySuite.
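
The failure mode and the fix described above can be illustrated with a small toy model. This is only a sketch: `PruningSketch`, `Expr`, `Column`, `ScalarSubquery`, `EqualTo`, `keyFiltersBefore`, `keyFiltersAfter` and `prune` are hypothetical stand-ins written for this illustration, not Spark's Catalyst classes, and the real rules do considerably more work.

```scala
// Toy model only: these types imitate the idea, they are NOT Spark's classes.
object PruningSketch {
  type Row = Map[String, String]

  sealed trait Expr {
    def references: Set[String]   // columns the expression reads
    def hasSubquery: Boolean      // true if a subquery appears anywhere inside
    def eval(row: Row): Any
  }

  final case class Column(name: String) extends Expr {
    def references: Set[String] = Set(name)
    def hasSubquery: Boolean = false
    def eval(row: Row): Any = row(name)
  }

  // Stand-in for a scalar subquery that has not been planned yet,
  // so it cannot be evaluated (analogous to an Unevaluable expression).
  final case class ScalarSubquery(sql: String) extends Expr {
    def references: Set[String] = Set.empty
    def hasSubquery: Boolean = true
    def eval(row: Row): Any =
      throw new UnsupportedOperationException(s"Cannot evaluate expression: ($sql)")
  }

  final case class EqualTo(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
    def hasSubquery: Boolean = left.hasSubquery || right.hasSubquery
    def eval(row: Row): Any = left.eval(row) == right.eval(row)
  }

  // Before the fix: any filter that only reads partition columns is used for pruning.
  def keyFiltersBefore(filters: Seq[Expr], partitionCols: Set[String]): Seq[Expr] =
    filters.filter(_.references.subsetOf(partitionCols))

  // After the fix: filters containing a subquery are skipped first.
  def keyFiltersAfter(filters: Seq[Expr], partitionCols: Set[String]): Seq[Expr] =
    filters
      .filterNot(_.hasSubquery)
      .filter(_.references.subsetOf(partitionCols))

  // Pruning evaluates every key filter against each partition's values.
  def prune(partitions: Seq[Row], keyFilters: Seq[Expr]): Seq[Row] =
    partitions.filter(p => keyFilters.forall(_.eval(p) == true))
}
```

In this model, the filter `id_type = (select 'b')` reads only the partition column, so the pre-fix selection hands it to `prune`, which then throws. With the post-fix selection the filter is left out of pruning, all partitions are scanned, and the predicate would be applied later as an ordinary row filter.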


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark spark-24085

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21174.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21174
    
----
commit 38c769274fca2931d0b0147e5e666b9cd7c99f59
Author: Dilip Biswal <db...@...>
Date:   2018-04-26T00:40:01Z

    [SPARK-24085] Query returns UnsupportedOperationException when scalar subquery is present in partitioning expression.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21174: [SPARK-24085][SQL] Query returns UnsupportedOpera...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21174#discussion_r184606904
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -955,4 +955,28 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
         // before the fix this would throw AnalysisException
         spark.range(10).where("(id,id) in (select id, null from range(3))").count
       }
    +
    +  test("SPARK-24085 scalar subquery in partitioning expression") {
    +    withTempPath { tempDir =>
    +      withTable("parquet_part") {
    +        sql(
    +          s"""
    +             |CREATE TABLE parquet_part (id_value string, id_type string)
    +             |USING PARQUET
    +             |OPTIONS (
    +             |  path '${tempDir.toURI}'
    +             |)
    +             |PARTITIONED BY (id_type)
    --- End diff --
    
    ```Scala
            Seq("1" -> "a", "2" -> "a", "3" -> "b", "4" -> "b")
              .toDF("id_value", "id_type")
              .write
              .mode(SaveMode.Overwrite)
              .partitionBy("id_type")
              .format("parquet")
              .saveAsTable("parquet_part")
    ```


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    One question: is there any risk of missing partition pruning with this fix?


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2712/
    Test PASSed.


---



[GitHub] spark issue #21174: [SPARK-24085] Query returns UnsupportedOperationExceptio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21174: [SPARK-24085] Query returns UnsupportedOperationExceptio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2707/
    Test PASSed.


---



[GitHub] spark issue #21174: [SPARK-24085] Query returns UnsupportedOperationExceptio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21174: [SPARK-24085] Query returns UnsupportedOperationExceptio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    **[Test build #89906 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89906/testReport)** for PR 21174 at commit [`38c7692`](https://github.com/apache/spark/commit/38c769274fca2931d0b0147e5e666b9cd7c99f59).


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    **[Test build #89918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89918/testReport)** for PR 21174 at commit [`e6e9397`](https://github.com/apache/spark/commit/e6e9397b42c1ad39d58d7b1c11f7cb152f019c82).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21174: [SPARK-24085] Query returns UnsupportedOperationExceptio...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Please add `[SQL]` to the title.


---



[GitHub] spark pull request #21174: [SPARK-24085][SQL] Query returns UnsupportedOpera...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21174#discussion_r184609040
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -76,7 +76,10 @@ object FileSourceStrategy extends Strategy with Logging {
               fsRelation.partitionSchema, fsRelation.sparkSession.sessionState.analyzer.resolver)
           val partitionSet = AttributeSet(partitionColumns)
           val partitionKeyFilters =
    -        ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
    +        ExpressionSet(normalizedFilters.
    +          filterNot(SubqueryExpression.hasSubquery(_)).
    +          filter(_.references.subsetOf(partitionSet)))
    --- End diff --
    
    @gatorsmile Will make the change.


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    LGTM 


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21174: [SPARK-24085] Query returns UnsupportedOperationExceptio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    **[Test build #89906 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89906/testReport)** for PR 21174 at commit [`38c7692`](https://github.com/apache/spark/commit/38c769274fca2931d0b0147e5e666b9cd7c99f59).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21174: [SPARK-24085] Query returns UnsupportedOperationExceptio...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    @maropu Thanks for your response. ORC has CONVERT_METASTORE_ORC set to false by default, so it's not converted to a file-based data source. If we set this to true, we see the same issue for ORC. I have added a test to cover this case.


---



[GitHub] spark pull request #21174: [SPARK-24085][SQL] Query returns UnsupportedOpera...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21174#discussion_r184607398
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -76,7 +76,10 @@ object FileSourceStrategy extends Strategy with Logging {
               fsRelation.partitionSchema, fsRelation.sparkSession.sessionState.analyzer.resolver)
           val partitionSet = AttributeSet(partitionColumns)
           val partitionKeyFilters =
    -        ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
    +        ExpressionSet(normalizedFilters.
    +          filterNot(SubqueryExpression.hasSubquery(_)).
    +          filter(_.references.subsetOf(partitionSet)))
    --- End diff --
    
    The same here


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Thanks! Merged to master/2.3


---



[GitHub] spark pull request #21174: [SPARK-24085][SQL] Query returns UnsupportedOpera...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21174#discussion_r184607164
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala ---
    @@ -55,7 +55,9 @@ private[sql] object PruneFileSourcePartitions extends Rule[LogicalPlan] {
               partitionSchema, sparkSession.sessionState.analyzer.resolver)
           val partitionSet = AttributeSet(partitionColumns)
           val partitionKeyFilters =
    -        ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
    +        ExpressionSet(normalizedFilters.
    +          filterNot(SubqueryExpression.hasSubquery(_)).
    +          filter(_.references.subsetOf(partitionSet)))
    --- End diff --
    
    ```Scala
            ExpressionSet(normalizedFilters
              .filterNot(SubqueryExpression.hasSubquery)
              .filter(_.references.subsetOf(partitionSet)))
    ```


---



[GitHub] spark pull request #21174: [SPARK-24085][SQL] Query returns UnsupportedOpera...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21174#discussion_r184609073
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -955,4 +955,28 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
         // before the fix this would throw AnalysisException
         spark.range(10).where("(id,id) in (select id, null from range(3))").count
       }
    +
    +  test("SPARK-24085 scalar subquery in partitioning expression") {
    +    withTempPath { tempDir =>
    +      withTable("parquet_part") {
    +        sql(
    +          s"""
    +             |CREATE TABLE parquet_part (id_value string, id_type string)
    +             |USING PARQUET
    +             |OPTIONS (
    +             |  path '${tempDir.toURI}'
    +             |)
    +             |PARTITIONED BY (id_type)
    --- End diff --
    
    @gatorsmile OK.


---



[GitHub] spark issue #21174: [SPARK-24085] Query returns UnsupportedOperationExceptio...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Why does ORC work fine?


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    @gatorsmile @maropu Thank you very much !!


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89918/
    Test PASSed.


---



[GitHub] spark issue #21174: [SPARK-24085] Query returns UnsupportedOperationExceptio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89906/
    Test PASSed.


---



[GitHub] spark pull request #21174: [SPARK-24085][SQL] Query returns UnsupportedOpera...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21174


---



[GitHub] spark pull request #21174: [SPARK-24085][SQL] Query returns UnsupportedOpera...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21174#discussion_r184609062
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala ---
    @@ -55,7 +55,9 @@ private[sql] object PruneFileSourcePartitions extends Rule[LogicalPlan] {
               partitionSchema, sparkSession.sessionState.analyzer.resolver)
           val partitionSet = AttributeSet(partitionColumns)
           val partitionKeyFilters =
    -        ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
    +        ExpressionSet(normalizedFilters.
    +          filterNot(SubqueryExpression.hasSubquery(_)).
    +          filter(_.references.subsetOf(partitionSet)))
    --- End diff --
    
    @gatorsmile Will make the change.


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    @gatorsmile Thanks a lot. Addressed the comments.


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    This fix is safe for us to backport to previous versions. To achieve better performance, we can start a separate job to execute these subqueries. : )


---



[GitHub] spark issue #21174: [SPARK-24085] Query returns UnsupportedOperationExceptio...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    Ah, ok. Thanks!


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    @maropu With the fix, if the query predicate contains a scalar subquery expression, that expression is not considered for partition pruning. For example, if the predicate is `part_key1 = (select ...) and part_key2 = 5`, then only the second part of the expression is considered for pruning purposes and the first part will be a regular filter.
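
The split described in this comment can be sketched as follows. This is a simplified illustration, not Spark code: `FilterSplitSketch`, `Conjunct` and `split` are hypothetical names, and each conjunct is reduced to its SQL text, the columns it references, and a flag saying whether it contains a subquery.

```scala
// Simplified illustration; the names below are hypothetical, not Spark APIs.
object FilterSplitSketch {
  // One conjunct of a WHERE clause, reduced to what matters for the split.
  final case class Conjunct(sql: String, references: Set[String], hasSubquery: Boolean)

  // Returns (conjuncts usable for partition pruning, conjuncts left as regular filters).
  // A conjunct qualifies for pruning only if it contains no subquery and
  // reads nothing but partition columns.
  def split(
      conjuncts: Seq[Conjunct],
      partitionCols: Set[String]): (Seq[Conjunct], Seq[Conjunct]) =
    conjuncts.partition { c =>
      !c.hasSubquery && c.references.subsetOf(partitionCols)
    }
}
```

Applied to the example predicate with partition columns `part_key1` and `part_key2`, `part_key2 = 5` lands in the pruning group, while `part_key1 = (select ...)` lands in the regular-filter group. That is why some pruning opportunity is given up for the subquery conjunct, but the query result stays correct.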


---



[GitHub] spark issue #21174: [SPARK-24085][SQL] Query returns UnsupportedOperationExc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21174
  
    **[Test build #89918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89918/testReport)** for PR 21174 at commit [`e6e9397`](https://github.com/apache/spark/commit/e6e9397b42c1ad39d58d7b1c11f7cb152f019c82).


---
