Posted to reviews@spark.apache.org by e-dorigatti <gi...@git.apache.org> on 2018/06/12 09:47:25 UTC

[GitHub] spark pull request #21538: [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop ite...

GitHub user e-dorigatti opened a pull request:

    https://github.com/apache/spark/pull/21538

    [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from driver to executor

    SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts this change and fixes the `StopIteration` bug in the worker.
    
    The root of the problem is that when a user-supplied function raises a `StopIteration`, pyspark might silently stop processing data if the function is used in a for-loop. The solution is to catch `StopIteration` exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done with the `fail_on_stopiteration` wrapper, applied in different places depending on where the function is used:
     - In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
     - In SQL UDFs, the function is wrapped in the worker, since all processing happens there. The worker also needs the signature of the user function, which is lost when the function is wrapped, and passing this signature to the worker requires a not-so-nice hack.
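
To make the failure mode concrete, here is a minimal plain-Python sketch that does not use Spark. The wrapper borrows the name `fail_on_stopiteration` from the description above, but its body is an assumption for illustration and is not copied from pyspark; `apply_lazily` and `user_func` are hypothetical stand-ins for a processing stage and a user-supplied function.

```python
def fail_on_stopiteration(f):
    """Sketch: turn a StopIteration raised inside f into a RuntimeError."""
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except StopIteration as exc:
            # Assumed message; the real pyspark wording may differ.
            raise RuntimeError("StopIteration raised in user code", exc)
    return wrapper


def apply_lazily(func, rows):
    # Hypothetical stand-in for a lazy processing stage built on an iterator.
    return map(func, rows)


def user_func(x):
    # Hypothetical user function that (wrongly) lets StopIteration escape.
    if x == 2:
        raise StopIteration()
    return x * 10


# Unwrapped: the consumer treats the StopIteration as "no more data",
# so the rows from 2 onwards are dropped without any error.
silently_truncated = list(apply_lazily(user_func, [0, 1, 2, 3]))

# Wrapped: the same input now fails loudly instead of losing rows.
try:
    list(apply_lazily(fail_on_stopiteration(user_func), [0, 1, 2, 3]))
    raised = False
except RuntimeError:
    raised = True
```

In this sketch the unwrapped call returns only the rows processed before the exception, while the wrapped call raises, which is the behaviour the description asks for.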
    
    @HyukjinKwon 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/e-dorigatti/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21538.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21538
    
----
commit 762893682e2bb1e7c5b065eab33e472660cdb4fb
Author: e-dorigatti <em...@...>
Date:   2018-05-30T10:11:33Z

    [SPARK-23754][PYTHON] Re-raising StopIteration in client code
    
    Make sure that `StopIteration`s raised in users' code do not silently interrupt processing by spark, but are raised as exceptions to the users. The users' functions are wrapped in `safe_iter` (in `shuffle.py`), which re-raises `StopIteration`s as `RuntimeError`s
    
    Unit tests, making sure that the exceptions are indeed raised. I am not sure how to check whether a `Py4JJavaError` contains my exception, so I simply looked for the exception message in the java exception's `toString`. Can you propose a better way?
    
    This is my original work, licensed in the same way as spark
    
    Author: e-dorigatti <em...@gmail.com>
    
    Closes #21383 from e-dorigatti/fix_spark_23754.
    
    (cherry picked from commit 0ebb0c0d4dd3e192464dc5e0e6f01efa55b945ed)

commit e7db4688fba6ddd8168288c78d4106550211569b
Author: edorigatti <em...@...>
Date:   2018-06-12T07:49:04Z

    Merge remote-tracking branch 'upstream/branch-2.3' into branch-2.3

commit 217e730ec60e6b74fa12cf3e6ec6365be8c82aec
Author: edorigatti <em...@...>
Date:   2018-06-11T02:15:42Z

    [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from driver to executor
    
    SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts this change and fixes the `StopIteration` bug in the worker
    
    The root of the problem is that when a user-supplied function raises a `StopIteration`, pyspark might silently stop processing data if the function is used in a for-loop. The solution is to catch `StopIteration` exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done with the `fail_on_stopiteration` wrapper, applied in different places depending on where the function is used:
     - In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
     - In SQL UDFs, the function is wrapped in the worker, since all processing happens there. The worker also needs the signature of the user function, which is lost when the function is wrapped, and passing this signature to the worker requires a not-so-nice hack.
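
The remark about the lost signature can be reproduced with the standard library alone. This is a sketch under assumptions: `wrap` and `add` are hypothetical names, and `inspect.getfullargspec` (Python 3) stands in for whatever argspec lookup the UDF code actually performs.

```python
import inspect


def wrap(f):
    # Generic pass-through wrapper, similar in shape to an error-translating one.
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper


def add(a, b):
    # Hypothetical user function with two named positional arguments.
    return a + b


# The original function reports its named arguments.
original_args = inspect.getfullargspec(add).args

# The wrapper only declares *args and **kwargs, so the names are gone.
wrapped_spec = inspect.getfullargspec(wrap(add))
wrapped_args = wrapped_spec.args
wrapped_varargs = wrapped_spec.varargs
```

Wrapping on the worker side, after the original function has already been inspected, avoids having to carry the argspec alongside the wrapped function.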
    
    Same tests, plus tests for pandas UDFs
    
    Author: edorigatti <em...@gmail.com>
    
    Closes #21467 from e-dorigatti/fix_udf_hack.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by e-dorigatti <gi...@git.apache.org>.
Github user e-dorigatti commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    @HyukjinKwon thank you so much for your patience :)


---



[GitHub] spark pull request #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Mov...

Posted by e-dorigatti <gi...@git.apache.org>.
Github user e-dorigatti closed the pull request at:

    https://github.com/apache/spark/pull/21538


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    Merged to branch-2.3.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    **[Test build #91704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91704/testReport)** for PR 21538 at commit [`217e730`](https://github.com/apache/spark/commit/217e730ec60e6b74fa12cf3e6ec6365be8c82aec).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    **[Test build #91701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91701/testReport)** for PR 21538 at commit [`217e730`](https://github.com/apache/spark/commit/217e730ec60e6b74fa12cf3e6ec6365be8c82aec).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91716/
    Test PASSed.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    **[Test build #91704 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91704/testReport)** for PR 21538 at commit [`217e730`](https://github.com/apache/spark/commit/217e730ec60e6b74fa12cf3e6ec6365be8c82aec).


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by e-dorigatti <gi...@git.apache.org>.
Github user e-dorigatti commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    Seems like it skipped the pandas tests, for both python2.7 and pypy
    
    ```
    Will skip Pandas related features against Python executable  ...
    ```


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    add to whitelist


---



[GitHub] spark pull request #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Mov...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21538#discussion_r194806008
  
    --- Diff: python/pyspark/worker.py ---
    @@ -122,6 +123,10 @@ def read_single_udf(pickleSer, infile, eval_type):
             else:
                 row_func = chain(row_func, f)
     
    +    # make sure StopIteration's raised in the user code are not ignored
    +    # when they are processed in a for loop, raise them as RuntimeError's instead
    +    row_func = fail_on_stopiteration(row_func)
    --- End diff --
    
    @e-dorigatti, I think it's fine to name it `func`. Let's reduce the diff so that other backports cause fewer conflicts in the future.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    @e-dorigatti, this got merged into branch-2.3. Likewise, this also should be manually closed. Thanks for working on this.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    Yea, it's unfortunate .. we should fix and set up the Jenkins env too.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    **[Test build #91716 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91716/testReport)** for PR 21538 at commit [`612781a`](https://github.com/apache/spark/commit/612781a4be82de4759b5a3bd482a98687f5404ba).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    **[Test build #91701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91701/testReport)** for PR 21538 at commit [`217e730`](https://github.com/apache/spark/commit/217e730ec60e6b74fa12cf3e6ec6365be8c82aec).


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91704/
    Test PASSed.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    **[Test build #91716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91716/testReport)** for PR 21538 at commit [`612781a`](https://github.com/apache/spark/commit/612781a4be82de4759b5a3bd482a98687f5404ba).


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    @e-dorigatti Can you add `[BACKPORT-2.3]` in the PR title? Thanks.


---



[GitHub] spark issue #21538: [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91701/
    Test PASSed.


---
