You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by xguo27 <gi...@git.apache.org> on 2016/01/27 01:40:17 UTC

[GitHub] spark pull request: [SPARK-12981][SQL] Fix Python UDF extraction f...

GitHub user xguo27 opened a pull request:

    https://github.com/apache/spark/pull/10935

    [SPARK-12981][SQL] Fix Python UDF extraction for aggregation.

    When Aggregate operator being applied ExtractPythonUDFs rule, it becomes a Project. This change fixes that and maintain Aggregate operator to the original type.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xguo27/spark SPARK-12981

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10935.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10935
    
----
commit d55146f6dd865bff9789a32641de6aa1678b912f
Author: Xiu Guo <xg...@gmail.com>
Date:   2016-01-26T22:35:50Z

    [SPARK-12981][SQL] Fix Python UDF extraction for aggregation.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12981][SQL] Fix Python UDF extraction f...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/10935#issuecomment-204658130
  
    @xguo27 Thanks for working on this. I think the root cause here is that we extract Python UDFs too early (in analyzer), EvaluatePython is an special logical plan, many rules have no knowledge of it, which will break many things. We should extract Python UDFs later, in end of optimizer, or physical plan, I will send an PR to fix that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12981][SQL] Fix Python UDF extraction f...

Posted by xguo27 <gi...@git.apache.org>.
Github user xguo27 closed the pull request at:

    https://github.com/apache/spark/pull/10935


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12981][SQL] Fix Python UDF extraction f...

Posted by xguo27 <gi...@git.apache.org>.
Github user xguo27 commented on the pull request:

    https://github.com/apache/spark/pull/10935#issuecomment-189430834
  
    @rxin Does this fix look good to you?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12981][SQL] Fix Python UDF extraction f...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10935#discussion_r54326619
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -64,11 +64,16 @@ private[spark] object ExtractPythonUDFs extends Rule[LogicalPlan] {
                 assert(evaluation != null, "Unable to evaluate PythonUDF.  Missing input attributes.")
     
                 // Trim away the new UDF value if it was only used for filtering or something.
    -            logical.Project(
    -              plan.output,
    -              plan.transformExpressions {
    -                case p: PythonUDF if p.fastEquals(udf) => evaluation.resultAttribute
    -              }.withNewChildren(newChildren))
    +            val transformed = plan.transformExpressions {
    +              case p: PythonUDF if p.fastEquals(udf) => evaluation.resultAttribute
    +            }.withNewChildren(newChildren)
    +
    +            if (plan.isInstanceOf[Aggregate]) {
    +              transformed
    +            }
    --- End diff --
    
    a style nit: put else on the same line as the previous }
    
    also can you add some comment explaining what's happening


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12981][SQL] Fix Python UDF extraction f...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10935#issuecomment-189562989
  
    cc @davies


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12981][SQL] Fix Python UDF extraction f...

Posted by xguo27 <gi...@git.apache.org>.
Github user xguo27 commented on the pull request:

    https://github.com/apache/spark/pull/10935#issuecomment-204840629
  
    Sure @davies . I will close this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12981][SQL] Fix Python UDF extraction f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10935#issuecomment-175314935
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12981][SQL] Fix Python UDF extraction f...

Posted by xguo27 <gi...@git.apache.org>.
Github user xguo27 commented on the pull request:

    https://github.com/apache/spark/pull/10935#issuecomment-190554648
  
    Using these two functionally equavalent code snippets:
    
    Scala
    ```
    val data = Seq((1, "1"), (2, "2"), (3, "2"), (1, "3")).toDF("a","b")
    val my_filter = sqlContext.udf.register("my_filter", (a:Int) => a==1)
    data.select(col("a")).distinct().filter(my_filter(col("a")))
    ```
    
    Python
    ```
    data = sqlContext.createDataFrame([(1, "1"), (2, "2"), (3, "2"), (1, "3")], ["a", "b"])
    my_filter = udf(lambda a: a == 1, BooleanType())
    data.select(col("a")).distinct().filter(my_filter(col("a")))
    ```
    
    The logical plan comes out `execute(aggregateCondition)` in here is as below:
    
    https://github.com/apache/spark/blob/916fc34f98dd731f607d9b3ed657bad6cc30df2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L801
    
    Scala
    ```
    Aggregate [a#8], [UDF(a#8) AS havingCondition#11]
    +- Project [a#8]
       +- Project [_1#6 AS a#8,_2#7 AS b#9]
          +- LocalRelation [_1#6,_2#7], [[1,1],[2,2],[3,2],[1,3]]
    ```
    
    Python
    ```
    Project [havingCondition#2]
    +- Aggregate [a#0L], [pythonUDF#3 AS havingCondition#2]
       +- EvaluatePython PythonUDF#<lambda>(a#0L), pythonUDF#3: boolean
          +- Project [a#0L]
             +- LogicalRDD [a#0L,b#1], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2
    ```
    We can see in Python's case, we inject an extra Project when `execute(aggregateCondition)`going through ExtractPythonUDFs, but ResolveAggregateFunctions expects an Aggregate here:
    
    https://github.com/apache/spark/blob/916fc34f98dd731f607d9b3ed657bad6cc30df2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L801-L805
    
    
    With this fix, the logical plan generated for Python UDFs does not construct a Project if it is an Aggregate, making it consistent with its Scala counterpart, which gives correct results for ResolveAggregateFunctions to consume:
    
    After fix, Python:
    ```
    Aggregate [a#0L], [pythonUDF#3 AS havingCondition#2]
    +- EvaluatePython PythonUDF#<lambda>(a#0L), pythonUDF#3: boolean
       +- Project [a#0L]
          +- LogicalRDD [a#0L,b#1], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org