Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/11/02 10:48:24 UTC

[GitHub] spark pull request #19642: [SPARK-22410][SQL] Remove unnecessary output from...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/19642

    [SPARK-22410][SQL] Remove unnecessary output from BatchEvalPython's children plans

    ## What changes were proposed in this pull request?
    
    When we insert `BatchEvalPython` for Python UDFs into a query plan, if its child has some outputs that are not used by the original parent node, `BatchEvalPython` will still take those outputs and save them into the queue. When the data for those outputs is big, it is easy to generate a big spill on disk.
    
    For example, the following reproduction code is from the JIRA ticket.
    
    ```python
    from pyspark.sql.functions import explode, udf
    from pyspark.sql.types import StringType
    
    lines_of_file = [ "this is a line" for x in range(10000) ]
    file_obj = [ "this_is_a_foldername/this_is_a_filename", lines_of_file ]
    data = [ file_obj for x in range(5) ]
    
    small_df = spark.sparkContext.parallelize(data).map(lambda x : (x[0], x[1])).toDF(["file", "lines"])
    exploded = small_df.select("file", explode("lines"))
    
    def split_key(s):
        return s.split("/")[1]
    
    split_key_udf = udf(split_key, StringType())
    
    with_filename = exploded.withColumn("filename", split_key_udf("file"))
    with_filename.explain(True)
    ```
    
    The physical plan before/after this change:
    
    Before:
    
    ```
    *Project [file#0, col#5, pythonUDF0#14 AS filename#9]
    +- BatchEvalPython [split_key(file#0)], [file#0, lines#1, col#5, pythonUDF0#14]
       +- Generate explode(lines#1), true, false, [col#5]
          +- Scan ExistingRDD[file#0,lines#1]
    
    ```
    
    After:
    
    ```
    *Project [file#0, col#5, pythonUDF0#14 AS filename#9]
    +- BatchEvalPython [split_key(file#0)], [col#5, file#0, pythonUDF0#14]
       +- *Project [col#5, file#0]
          +- Generate explode(lines#1), true, false, [col#5]
             +- Scan ExistingRDD[file#0,lines#1]
    ```
    
    Before this change, `lines#1` is a redundant input to `BatchEvalPython`. This patch removes it by adding a Project.
    
    ## How was this patch tested?
    
    Tested manually.
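
The pruning described in this pull request can be illustrated with a minimal sketch in plain Python. This is not the Scala change made in the patch; the function name `prune_child_output` and the use of plain strings for attributes are assumptions made only for illustration. The idea is that the child only needs to pass along the columns referenced by the original parent node plus the columns read by the Python UDFs.

```python
def prune_child_output(child_output, parent_references, udf_inputs):
    """Return the child columns that are still needed, in child order.

    Hypothetical helper: a column is kept if the original parent node
    references it or if a Python UDF reads it as input.
    """
    needed = set(parent_references) | set(udf_inputs)
    return [col for col in child_output if col in needed]


# Attribute names taken from the physical plans shown above.
child_output = ["file#0", "lines#1", "col#5"]             # output of Generate
parent_references = ["file#0", "col#5", "pythonUDF0#14"]  # used by the top Project
udf_inputs = ["file#0"]                                   # argument of split_key

kept = prune_child_output(child_output, parent_references, udf_inputs)
dropped = [col for col in child_output if col not in kept]

print(kept)     # ['file#0', 'col#5']
print(dropped)  # ['lines#1']
```

The sketch keeps the columns in the child's order, so the ordering can differ from the `Project [col#5, file#0]` in the plan above. The point is only that `lines#1` no longer reaches `BatchEvalPython`.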

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-22410

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19642.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19642
    
----
commit 4e0974dec907c0a74fca5701263001c2bab9c250
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-11-02T10:37:45Z

    Remove unnecessary output from BatchEvalPython's input.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83347/


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    thanks, merging to master!


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    FWIW, I asked the Jenkins admin about this. From a quick look at the log, the problem seems to be that the launch script is run by Python 2.6 on some Jenkins machines because the path was dropped. I added an explicit message for this case before in https://github.com/apache/spark/pull/19524. A quick observation is that these failures happen on workers 2, 6, and 7.
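
The kind of interpreter check mentioned in the comment above can be sketched as follows. This is only an illustration, not the code from the linked pull request; the function name `check_python_version`, the default minimum of 2.7, and the message wording are all assumptions.

```python
import sys


def check_python_version(version_info=None, minimum=(2, 7)):
    """Return an error message if the interpreter is too old, else None.

    Hypothetical helper: `version_info` defaults to the running
    interpreter's version, and only major.minor is compared.
    """
    if version_info is None:
        version_info = sys.version_info
    current = tuple(version_info[:2])
    if current < minimum:
        return "Python %d.%d+ is required, but found %d.%d" % (minimum + current)
    return None


# Check the interpreter that is running this script.
message = check_python_version()
if message is not None:
    print(message)
```

Printing such a message at the start of a launch script makes it clear from the build log alone when a worker picked up an older interpreter.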


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    retest this please


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Why is column pruning an execution detail? Actually, I feel it's weird to have the `ExtractPythonUDFs` rule applied to physical plans. Is there a particular reason?


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83333/


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    As for why `ExtractPythonUDFs` is applied to physical plans, I think it may be partly because `PythonFunction`, the wrapper of a Python function, also encapsulates core concepts like `Broadcast` and `Accumulator`, which are outside the scope of the Catalyst APIs.


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    How could this happen? Does the column pruning rule not work? Can you give a concrete example?


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Ah, I rushed to retrigger. The builds have somehow been failing unexpectedly across the board.


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    **[Test build #83333 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83333/testReport)** for PR 19642 at commit [`55f1701`](https://github.com/apache/spark/commit/55f1701ebe3239859579393cb09020d9fe273ae7).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Thanks @HyukjinKwon @cloud-fan for your review!


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    To have a logical python runner, we may need to change some of the logic for extracting python udfs. That may require considerably more change than this simple fix. If you prefer it, I can do it. If it is just for column pruning, I'd prefer not to do it, because this sounds more like an execution detail than part of the logical query plan. But I'd also like to hear others' ideas. @HyukjinKwon what do you think?


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    cc @cloud-fan @ueshin 


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Ah, this makes sense, then LGTM.
    
    One thing we can think more about: `Project` is not only an operator but also carries column pruning info. This may cause trouble if `Project` can't be whole-stage codegened and is run as a real operator (python runner).


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83332/


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Having a python runner operator also has advantages, such as working better with the optimizer, e.g. for column pruning. I am not against this idea. However, since it requires more change, I'd like to have more support before beginning the change.


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    I mean that having an individual python runner operator dictates the way we execute python udfs. Currently python udfs are just normal expressions. It seems to me that logically they are just expressions.
    
    Not sure why `ExtractPythonUDFs` is applied to physical plans.


---

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19642
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83366/


---
