Posted to reviews@spark.apache.org by heary-cao <gi...@git.apache.org> on 2017/08/17 07:19:17 UTC

[GitHub] spark pull request #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case f...

GitHub user heary-cao opened a pull request:

    https://github.com/apache/spark/pull/18969

    [SPARK-21520][SQL][FOLLOW-UP]fix a special case for non-deterministic projects in optimizer

    ## What changes were proposed in this pull request?
    
    This is a follow-up of #18892 that fixes one more case.
    The optimizer already has a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic projects, so that the optimizer reads only the fields the user needs when a project is non-deterministic.
    For example, when the project list contains a non-deterministic function (the rand function), the optimizer generates the following executedPlan:
    ```
    *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
    +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 10000.0)) AS k#403L]
       +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table
    ```
    HiveTableScan reads all the fields from the table, but we only need `d004`. This hurts task performance.
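
    A minimal sketch of the pruning idea, assuming a toy expression model: `Attr`, `Rand`, `Alias` and `requiredColumns` are hypothetical names for illustration, not Catalyst classes and not the code in this patch. It only shows that the columns a scan has to produce are the union of the attributes the project list references, and that `rand` references none.

    ```scala
    object PruningSketch {
      // Toy expression model; hypothetical stand-ins, not Catalyst's real classes.
      sealed trait Expr {
        def references: Set[String]
        def deterministic: Boolean
      }
      case class Attr(name: String) extends Expr {
        def references: Set[String] = Set(name)
        def deterministic: Boolean = true
      }
      case class Rand(seed: Long) extends Expr {
        def references: Set[String] = Set.empty // rand reads no column
        def deterministic: Boolean = false
      }
      case class Alias(child: Expr, name: String) extends Expr {
        def references: Set[String] = child.references
        def deterministic: Boolean = child.deterministic
      }

      // Keep only the table columns that some project field actually references.
      def requiredColumns(projectList: Seq[Expr], tableColumns: Seq[String]): Seq[String] = {
        val needed = projectList.flatMap(_.references).toSet
        tableColumns.filter(needed.contains)
      }

      def main(args: Array[String]): Unit = {
        val table = Seq("c030", "d004", "d005", "d025")
        val project = Seq(
          Alias(Attr("d004"), "id"),
          Alias(Rand(8828525941469309371L), "k"))
        println(requiredColumns(project, table)) // List(d004)
      }
    }
    ```

    In this model the project from the plan above needs only `d004`, even though one of its fields is non-deterministic.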
    
    
    ## How was this patch tested?
    Covered by existing test cases, plus newly added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/heary-cao/spark followup-non-deterministic

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18969.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18969
    
----
commit e84425f16c868844f442ff5b7cd8aa7695a94038
Author: caoxuewen <ca...@zte.com.cn>
Date:   2017-08-17T07:12:43Z

    fix a special case for non-deterministic projects in optimizer

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case f...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18969#discussion_r134935923
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -24,6 +24,24 @@ import org.apache.spark.sql.catalyst.plans._
     import org.apache.spark.sql.catalyst.plans.logical._
     
     /**
    + * A pattern that matches any number of project if fields is deterministic
    + * or child is LeafNode of project on top of another relational operator.
    + */
    +object ProjectOperation extends PredicateHelper {
    +  type ReturnType = (Seq[NamedExpression], LogicalPlan)
    +
    +  def unapply(plan: LogicalPlan): Option[ReturnType] = plan match {
    +    case Project(fields, child) if fields.forall(_.deterministic) =>
    +      Some((fields, child))
    +
    +    case Project(fields, child: LeafNode) if !fields.forall(_.deterministic) =>
    --- End diff --
    
    Aha, :(
    I misunderstood you.
    Do you mean we should add a condition like the following?
    `case p @ Project(fields, child: LeafNode) if p.references.nonEmpty =>`


[GitHub] spark pull request #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case f...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18969#discussion_r134393570
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -24,6 +24,24 @@ import org.apache.spark.sql.catalyst.plans._
     import org.apache.spark.sql.catalyst.plans.logical._
     
     /**
    + * A pattern that matches any number of project if fields is deterministic
    + * or child is LeafNode of project on top of another relational operator.
    + */
    +object ProjectOperation extends PredicateHelper {
    +  type ReturnType = (Seq[NamedExpression], LogicalPlan)
    +
    +  def unapply(plan: LogicalPlan): Option[ReturnType] = plan match {
    +    case Project(fields, child) if fields.forall(_.deterministic) =>
    +      Some((fields, child))
    +
    +    case Project(fields, child: LeafNode) =>
    --- End diff --
    
    We should still consider whether the fields are non-deterministic. It makes sense only when the non-deterministic fields are not referencing any attribute. Thus, your use case is pretty rare.
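
    A minimal sketch of the condition described above, assuming a toy expression model: `Attr`, `Rand`, `Add` and `nonDeterministicFieldsAreSelfContained` are hypothetical names, not Catalyst's API. The check passes only when every non-deterministic field references no attribute.

    ```scala
    object NonDeterministicCheck {
      // Toy expression model; hypothetical stand-ins, not Catalyst's real classes.
      sealed trait Expr {
        def references: Set[String]
        def deterministic: Boolean
      }
      case class Attr(name: String) extends Expr {
        def references: Set[String] = Set(name)
        def deterministic: Boolean = true
      }
      case class Rand(seed: Long) extends Expr {
        def references: Set[String] = Set.empty
        def deterministic: Boolean = false
      }
      case class Add(left: Expr, right: Expr) extends Expr {
        def references: Set[String] = left.references ++ right.references
        def deterministic: Boolean = left.deterministic && right.deterministic
      }

      // True when every non-deterministic field references no attribute.
      def nonDeterministicFieldsAreSelfContained(fields: Seq[Expr]): Boolean =
        fields.filterNot(_.deterministic).forall(_.references.isEmpty)

      def main(args: Array[String]): Unit = {
        // id, rand(1): the only non-deterministic field reads no column.
        println(nonDeterministicFieldsAreSelfContained(Seq(Attr("id"), Rand(1L)))) // true
        // id + rand(1): the non-deterministic field reads column id.
        println(nonDeterministicFieldsAreSelfContained(Seq(Add(Attr("id"), Rand(1L))))) // false
      }
    }
    ```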


[GitHub] spark pull request #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case f...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao closed the pull request at:

    https://github.com/apache/spark/pull/18969


[GitHub] spark pull request #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case f...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18969#discussion_r134915918
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -24,6 +24,24 @@ import org.apache.spark.sql.catalyst.plans._
     import org.apache.spark.sql.catalyst.plans.logical._
     
     /**
    + * A pattern that matches any number of project if fields is deterministic
    + * or child is LeafNode of project on top of another relational operator.
    + */
    +object ProjectOperation extends PredicateHelper {
    +  type ReturnType = (Seq[NamedExpression], LogicalPlan)
    +
    +  def unapply(plan: LogicalPlan): Option[ReturnType] = plan match {
    +    case Project(fields, child) if fields.forall(_.deterministic) =>
    +      Some((fields, child))
    +
    +    case Project(fields, child: LeafNode) =>
    --- End diff --
    
    Hi, @gatorsmile .
    Could you review again?


[GitHub] spark pull request #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case f...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18969#discussion_r134932482
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -24,6 +24,24 @@ import org.apache.spark.sql.catalyst.plans._
     import org.apache.spark.sql.catalyst.plans.logical._
     
     /**
    + * A pattern that matches any number of project if fields is deterministic
    + * or child is LeafNode of project on top of another relational operator.
    + */
    +object ProjectOperation extends PredicateHelper {
    +  type ReturnType = (Seq[NamedExpression], LogicalPlan)
    +
    +  def unapply(plan: LogicalPlan): Option[ReturnType] = plan match {
    +    case Project(fields, child) if fields.forall(_.deterministic) =>
    +      Some((fields, child))
    +
    +    case Project(fields, child: LeafNode) if !fields.forall(_.deterministic) =>
    --- End diff --
    
    What does this mean? :)


[GitHub] spark pull request #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case f...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18969#discussion_r136531688
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -24,6 +24,24 @@ import org.apache.spark.sql.catalyst.plans._
     import org.apache.spark.sql.catalyst.plans.logical._
     
     /**
    + * A pattern that matches any number of project if fields is deterministic
    + * or child is LeafNode of project on top of another relational operator.
    + */
    +object ProjectOperation extends PredicateHelper {
    +  type ReturnType = (Seq[NamedExpression], LogicalPlan)
    +
    +  def unapply(plan: LogicalPlan): Option[ReturnType] = plan match {
    +    case Project(fields, child) if fields.forall(_.deterministic) =>
    +      Some((fields, child))
    +
    +    case Project(fields, child: LeafNode) if !fields.forall(_.deterministic) =>
    --- End diff --
    
    This might be a rare case, but it still comes up in many business scenarios, for example:
    1. Random grouping: add a random factor to each row of data before grouping.
    2. Filling a field with a random value, to simplify later calculations or to prevent anomalies when querying data.
    3. Data skew: the data are spread out using random values.
    Thanks.
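
    A minimal sketch of scenarios 1 and 3, using plain Scala collections rather than Spark: `SaltingSketch`, `saltedSum`, `buckets` and `seed` are hypothetical names chosen for illustration. Each row gets a random salt, partial sums are computed per (key, salt) group so that one hot key is spread over several groups, and the partial sums are then combined per key.

    ```scala
    import scala.util.Random

    object SaltingSketch {
      // Stage 1: group by (key, salt) so a hot key is spread over up to `buckets` groups.
      // Stage 2: drop the salt and combine the partial sums per key.
      def saltedSum(rows: Seq[(String, Long)], buckets: Int, seed: Long): Map[String, Long] = {
        val rng = new Random(seed)
        val salted = rows.map { case (key, value) => ((key, rng.nextInt(buckets)), value) }
        val partial = salted.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
        partial.groupBy(_._1._1).map { case (key, ps) => key -> ps.values.sum }
      }

      def main(args: Array[String]): Unit = {
        val rows = Seq.fill(1000)("hot" -> 1L) ++ Seq("cold" -> 5L)
        // The totals do not depend on the salt: hot sums to 1000 and cold to 5.
        println(saltedSum(rows, buckets = 10, seed = 42L))
      }
    }
    ```

    The random value here never reads a column, which is the same shape as the `rand`-based grouping key in the plan from the pull request description.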



[GitHub] spark pull request #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case f...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18969#discussion_r134404432
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -24,6 +24,24 @@ import org.apache.spark.sql.catalyst.plans._
     import org.apache.spark.sql.catalyst.plans.logical._
     
     /**
    + * A pattern that matches any number of project if fields is deterministic
    + * or child is LeafNode of project on top of another relational operator.
    + */
    +object ProjectOperation extends PredicateHelper {
    +  type ReturnType = (Seq[NamedExpression], LogicalPlan)
    +
    +  def unapply(plan: LogicalPlan): Option[ReturnType] = plan match {
    +    case Project(fields, child) if fields.forall(_.deterministic) =>
    +      Some((fields, child))
    +
    +    case Project(fields, child: LeafNode) =>
    --- End diff --
    
    Thanks,
    I have modified it.


[GitHub] spark issue #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case for non-...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18969
  
    Can one of the admins verify this patch?


[GitHub] spark pull request #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case f...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18969#discussion_r134937947
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -24,6 +24,24 @@ import org.apache.spark.sql.catalyst.plans._
     import org.apache.spark.sql.catalyst.plans.logical._
     
     /**
    + * A pattern that matches any number of project if fields is deterministic
    + * or child is LeafNode of project on top of another relational operator.
    + */
    +object ProjectOperation extends PredicateHelper {
    +  type ReturnType = (Seq[NamedExpression], LogicalPlan)
    +
    +  def unapply(plan: LogicalPlan): Option[ReturnType] = plan match {
    +    case Project(fields, child) if fields.forall(_.deterministic) =>
    +      Some((fields, child))
    +
    +    case Project(fields, child: LeafNode) if !fields.forall(_.deterministic) =>
    --- End diff --
    
    I think this might not be worth fixing. It only covers a rare case.


[GitHub] spark issue #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case for non-...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18969
  
    @heary-cao Maybe you can close this PR first? @jiangxb1987 will handle it in the previous PR.


[GitHub] spark issue #18969: [SPARK-21520][SQL][FOLLOW-UP]fix a special case for non-...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on the issue:

    https://github.com/apache/spark/pull/18969
  
    cc @gatorsmile 

