Posted to reviews@spark.apache.org by ajithme <gi...@git.apache.org> on 2018/08/30 03:44:59 UTC

[GitHub] spark pull request #22277: [SPARK-25276] Redundant constraints when using alias

GitHub user ajithme opened a pull request:

    https://github.com/apache/spark/pull/22277

    [SPARK-25276] Redundant constraints when using alias

    Attaching a test to reproduce the issue. The test fails with the following message:
    
      test("redundant constrains") {
        val tr = LocalRelation('a.int, 'b.string, 'c.int)
        val aliasedRelation = tr.where('a.attr > 10).select('a.as('x), 'b, 'b.as('y), 'a.as('z))
    
        verifyConstraints(aliasedRelation.analyze.constraints,
          ExpressionSet(Seq(resolveColumn(aliasedRelation.analyze, "x") > 10,
            IsNotNull(resolveColumn(aliasedRelation.analyze, "x")),
            resolveColumn(aliasedRelation.analyze, "b") <=> resolveColumn(aliasedRelation.analyze, "y"),
            resolveColumn(aliasedRelation.analyze, "z") <=>
              resolveColumn(aliasedRelation.analyze, "x"))))
      }
    
    == FAIL: Constraints do not match ===
    Found: isnotnull(z#5),(z#5 > 10),(x#3 > 10),(z#5 <=> x#3),(b#1 <=> y#4),isnotnull(x#3)
    Expected: (x#3 > 10),isnotnull(x#3),(b#1 <=> y#4),(z#5 <=> x#3)
    == Result ==
    Missing: N/A
    Found but not expected: isnotnull(z#5),(z#5 > 10)
    Here, since z has an EqualNullSafe comparison with x, having isnotnull(z#5) and (z#5 > 10) as well is redundant. If a query has a lot of aliases, this can cause significant overhead.
    
    So I suggest that at https://github.com/apache/spark/blob/v2.3.2-rc5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L254, instead of accumulating with ++= (addAll), we should just assign with =.
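
    For reference, a simplified sketch of that method and of the proposed change (paraphrased from LogicalPlan.scala; names and types are approximate, not the verbatim source):
    ```
    // Paraphrased sketch of UnaryNode.getAliasedConstraints in Spark 2.3.x.
    protected def getAliasedConstraints(projectList: Seq[NamedExpression]): Set[Expression] = {
      var allConstraints = child.constraints.asInstanceOf[Set[Expression]]
      projectList.foreach {
        case a @ Alias(e, _) =>
          // Current code: keeps every existing constraint AND adds a re-mapped
          // copy, so the set can roughly double per alias of the same expression.
          allConstraints ++= allConstraints.map(_ transform {
            case expr: Expression if expr.semanticEquals(e) => a.toAttribute
          })
          // Proposed in this PR: replace the accumulation above with assignment,
          //   allConstraints = allConstraints.map(...)
          allConstraints += EqualNullSafe(e, a.toAttribute)
        case _ => // not an alias: nothing to re-map
      }
      allConstraints -- child.constraints
    }
    ```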

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajithme/spark SPARK-25276

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22277.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22277
    
----
commit 5be1df1215299758d6e684ec9a5f6ba12693280e
Author: Ajith <aj...@...>
Date:   2018-08-30T03:41:36Z

    [SPARK-25276] Redundant constraints when using alias

----


---


[GitHub] spark issue #22277: [SPARK-25276] Redundant constraints when using alias

Posted by ajithme <gi...@git.apache.org>.
Github user ajithme commented on the issue:

    https://github.com/apache/spark/pull/22277
  
    @gatorsmile and @jiangxb1987 any inputs?


---


[GitHub] spark issue #22277: [SPARK-25276] Redundant constraints when using alias

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/22277
  
    You can have `select * from (select a, a as c from table1 where a > 10) t where a > c`
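
    Here the aliased constraints derived from the inner filter (e.g. (c > 10) and isnotnull(c)) are exactly what the outer predicate `a > c` relies on, so they are not merely redundant. A minimal sketch of the same query shape in the Catalyst test DSL (assuming ConstraintPropagationSuite-style imports):
    ```
    import org.apache.spark.sql.catalyst.dsl.expressions._
    import org.apache.spark.sql.catalyst.dsl.plans._
    import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

    val t = LocalRelation('a.int)
    // inner query: select a, a as c from table1 where a > 10
    val inner = t.where('a.attr > 10).select('a, 'a.as('c))

    // Besides (a <=> c), the analyzed plan's constraints should also carry the
    // re-mapped (c > 10) and isnotnull(c) for the outer filter to exploit.
    println(inner.analyze.constraints)
    ```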


---


[GitHub] spark issue #22277: [SPARK-25276] Redundant constraints when using alias

Posted by ajithme <gi...@git.apache.org>.
Github user ajithme commented on the issue:

    https://github.com/apache/spark/pull/22277
  
    Attaching a SQL file to reproduce the issue and to see the effect of the PR:
    [test.txt](https://github.com/apache/spark/files/2356468/test.txt)
    
    ### Without patch: 
    ```
    spark-2.3.1-bin-hadoop2.7/bin # ./spark-sql -f test.txt
    Time taken: 3.405 seconds
    Time taken: 0.373 seconds
    Time taken: 0.202 seconds
    Time taken: 0.024 seconds
    18/09/06 11:29:49 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table11 specified for non-external table:table11
    Time taken: 0.541 seconds
    18/09/06 11:29:49 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table22 specified for non-external table:table22
    Time taken: 0.115 seconds
    18/09/06 11:29:50 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table33 specified for non-external table:table33
    Time taken: 6.075 seconds
    18/09/06 11:31:38 ERROR SparkSQLDriver: Failed in [
    create table table44 as
    select a.*
    from
    (
    select
    (concat(
    case when a1 is null then '' else cast(a1 as string) end,'|~|',
    case when a2 is null then '' else cast(a2 as string) end,'|~|',
    case when a3 is null then '' else cast(a3 as string) end,'|~|',
    case when a4 is null then '' else cast(a4 as string) end,'|~|',
    case when a5 is null then '' else cast(a5 as string) end,'|~|',
    case when a6 is null then '' else cast(a6 as string) end,'|~|',
    case when a7 is null then '' else cast(a7 as string) end,'|~|',
    case when a8 is null then '' else cast(a8 as string) end,'|~|',
    case when a9 is null then '' else cast(a9 as string) end,'|~|',
    case when a10 is null then '' else cast(a10 as string) end,'|~|',
    case when a11 is null then '' else cast(a11 as string) end,'|~|',
    case when a12 is null then '' else cast(a12 as string) end,'|~|',
    case when a13 is null then '' else cast(a13 as string) end,'|~|',
    case when a14 is null then '' else cast(a14 as string) end,'|~|',
    case when a15 is null then '' else cast(a15 as string) end,'|~|',
    case when a16 is null then '' else cast(a16 as string) end,'|~|',
    case when a17 is null then '' else cast(a17 as string) end,'|~|',
    case when a18 is null then '' else cast(a18 as string) end,'|~|',
    case when a19 is null then '' else cast(a19 as string) end
    )) as KEY_ID ,
    case when a1 is null then '' else cast(a1 as string) end as a1,
    case when a2 is null then '' else cast(a2 as string) end as a2,
    case when a3 is null then '' else cast(a3 as string) end as a3,
    case when a4 is null then '' else cast(a4 as string) end as a4,
    case when a5 is null then '' else cast(a5 as string) end as a5,
    case when a6 is null then '' else cast(a6 as string) end as a6,
    case when a7 is null then '' else cast(a7 as string) end as a7,
    case when a8 is null then '' else cast(a8 as string) end as a8,
    case when a9 is null then '' else cast(a9 as string) end as a9,
    case when a10 is null then '' else cast(a10 as string) end as a10,
    case when a11 is null then '' else cast(a11 as string) end as a11,
    case when a12 is null then '' else cast(a12 as string) end as a12,
    case when a13 is null then '' else cast(a13 as string) end as a13,
    case when a14 is null then '' else cast(a14 as string) end as a14,
    case when a15 is null then '' else cast(a15 as string) end as a15,
    case when a16 is null then '' else cast(a16 as string) end as a16,
    case when a17 is null then '' else cast(a17 as string) end as a17,
    case when a18 is null then '' else cast(a18 as string) end as a18,
    case when a19 is null then '' else cast(a19 as string) end as a19
    from table22
    ) A
    left join table11 B ON A.KEY_ID = B.KEY_ID
    where b.KEY_ID is null]
    java.lang.OutOfMemoryError: GC overhead limit exceeded
            at java.lang.Class.copyConstructors(Class.java:3130)
            at java.lang.Class.getConstructors(Class.java:1651)
            at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:387)
            at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:385)
            at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
            at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
            at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:244)
            at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:190)
            at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
            at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
            at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
            at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
            at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
            at scala.collection.immutable.List.foreach(List.scala:381)
            at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
            at scala.collection.immutable.List.map(List.scala:285)
            at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:189)
            at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
            at org.apache.spark.sql.catalyst.expressions.ExpressionSet.add(ExpressionSet.scala:63)
            at org.apache.spark.sql.catalyst.expressions.ExpressionSet$$anonfun$$plus$plus$1.apply(ExpressionSet.scala:79)
            at org.apache.spark.sql.catalyst.expressions.ExpressionSet$$anonfun$$plus$plus$1.apply(ExpressionSet.scala:79)
            at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
            at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
            at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
            at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
            at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
            at org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus$plus(ExpressionSet.scala:79)
            at org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus$plus(ExpressionSet.scala:55)
            at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:254)
            at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:249)
            at scala.collection.immutable.List.foreach(List.scala:381)
            at org.apache.spark.sql.catalyst.plans.logical.UnaryNode.getAliasedConstraints(LogicalPlan.scala:249)
    ```  
    
    ### After applying patch: 
    ```
    spark-2.3.1-bin-hadoop2.7/bin # ./spark-sql -f test.txt
    Time taken: 3.469 seconds
    Time taken: 0.294 seconds
    Time taken: 0.223 seconds
    Time taken: 0.023 seconds
    18/09/06 11:33:08 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table11 specified for non-external table:table11
    Time taken: 0.546 seconds
    18/09/06 11:33:08 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table22 specified for non-external table:table22
    Time taken: 0.11 seconds
    18/09/06 11:33:10 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table33 specified for non-external table:table33
    Time taken: 6.258 seconds
    18/09/06 11:33:15 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table44 specified for non-external table:table44
    Time taken: 2.603 seconds
    ``` 
    
    As you can see, when the projection contains many aliases, computing the constraints causes significant overhead with the current code.
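
    A minimal sketch of the blow-up with the Catalyst test DSL (hypothetical toy relation, analogous to the attached test.txt but much smaller): each additional alias of the same filtered column re-maps all existing constraints while keeping the old ones, so the constraint set roughly doubles per alias.
    ```
    import org.apache.spark.sql.catalyst.dsl.expressions._
    import org.apache.spark.sql.catalyst.dsl.plans._
    import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

    val tr = LocalRelation('a.int)
    val few  = tr.where('a.attr > 10).select('a.as('c1), 'a.as('c2))
    val more = tr.where('a.attr > 10)
      .select('a.as('c1), 'a.as('c2), 'a.as('c3), 'a.as('c4))

    // The second count is much larger than the first; with the 19 aliased
    // columns in test.txt the set grows large enough to exhaust the heap.
    println(few.analyze.constraints.size)
    println(more.analyze.constraints.size)
    ```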


---


[GitHub] spark issue #22277: [SPARK-25276] Redundant constraints when using alias

Posted by ajithme <gi...@git.apache.org>.
Github user ajithme commented on the issue:

    https://github.com/apache/spark/pull/22277
  
    I see. But the code modified in this PR applies when an alias is part of the projection. The query you mention does not seem to hit the current alias logic in __org.apache.spark.sql.catalyst.plans.logical.UnaryNode#getAliasedConstraints__, since in the outer query __a__ and __c__ are not aliases but rather AttributeReferences.
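
    A minimal illustration of that distinction (hypothetical helper mirroring the pattern match inside getAliasedConstraints, not actual Spark code):
    ```
    import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference, NamedExpression}

    // Only Alias nodes contribute aliased constraints; bare AttributeReferences,
    // like a and c in the outer query above, fall through unchanged.
    def hitsAliasLogic(e: NamedExpression): Boolean = e match {
      case _: Alias => true
      case _ => false // AttributeReference and other NamedExpressions
    }
    ```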
    
    Do you mean we should cover the scenario where an alias is referenced in the filter as part of this PR?


---


[GitHub] spark issue #22277: [SPARK-25276] Redundant constraints when using alias

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/22277
  
    Thank you for your interest in this issue; however, I don't think the changes proposed in this PR are valid. Consider having another predicate like `a > z`: it is surely desired to infer a new constraint `z > z`. Please correct me if I'm wrong about this.


---


[GitHub] spark issue #22277: [SPARK-25276] Redundant constraints when using alias

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22277
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #22277: [SPARK-25276] Redundant constraints when using alias

Posted by ajithme <gi...@git.apache.org>.
Github user ajithme commented on the issue:

    https://github.com/apache/spark/pull/22277
  
    @jiangxb1987 Thank you for the feedback. A couple of points:
    1. If we introduce a predicate that refers to an alias (as you mentioned, a > z), it will throw an error:
    ```
    spark-sql> create table table1 (a int);
    18/09/05 13:00:28 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table1 specified for non-external table:table1
    Time taken: 0.152 seconds
    
    spark-sql> select a, a as c from table1 where a > 10 and a > c;
    18/09/05 13:01:04 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
    Error in query: cannot resolve '`c`' given input columns: [table1.a]; line 1 pos 50;
    'Project ['a, 'a AS c#6]
    +- 'Filter ((a#7 > 10) && (a#7 > 'c))
       +- SubqueryAlias table1
          +- HiveTableRelation `default`.`table1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#7]
    ```
    So I think a > z is an invalid scenario; please correct me if I am wrong.
    
    2. If we add a self-referring predicate like __a > a__ instead of __a > z__, the PR still produces a valid constraint list:
    ```
      (x#5 > x#5),(b#1 <=> y#6),(x#5 > 10),(z#7 <=> x#5),isnotnull(x#5)
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org