Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/08/27 05:53:55 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #25593: [SPARK-27992][PYTHON][BRANCH-2.4] Allow Python to join with connection thread to propagate errors

URL: https://github.com/apache/spark/pull/25593
 
 
   ### What changes were proposed in this pull request?
   
   This PR proposes to backport https://github.com/apache/spark/pull/24834 with minimal changes; see https://github.com/apache/spark/pull/24834/commits/519926f908fe3f54cd149a24a7e645c3e69347a8.
   It was not backported earlier because it originally targeted only a usability improvement: a better exception message, achieved by propagating the exception from the JVM.
   
   However, that change also fixed another problem by accident (see [SPARK-28881](https://issues.apache.org/jira/browse/SPARK-28881)). The regression appears to have been introduced by https://github.com/apache/spark/pull/21546.
   
   The root cause appears to be that `runJob` with a `resultHandler`, as used here:
   
   https://github.com/apache/spark/blob/23bed0d3c08e03085d3f0c3a7d457eedd30bd67f/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3370-L3384
   
   can deliver partial output for a job that ultimately fails. A toy Python model of this mechanism follows below.
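   
   To illustrate why a per-partition result handler yields partial output, here is a toy Python sketch (illustrative only; the names and behavior are simplified assumptions, not Spark's actual `runJob` implementation):
   
   ```python
   received = []
   
   def run_job(partitions, compute, result_handler):
       # Simplified model of runJob with a resultHandler: the handler
       # fires as each partition completes, so results from earlier
       # partitions are already delivered when a later partition fails.
       for index, part in enumerate(partitions):
           result_handler(index, compute(part))
   
   def compute(part):
       if part == "bad":
           # Stand-in for e.g. exceeding spark.driver.maxResultSize.
           raise RuntimeError("task failed")
       return part.upper()
   
   try:
       run_job(["a", "b", "bad"], compute, lambda i, r: received.append(r))
   except RuntimeError:
       pass
   
   print(received)  # ['A', 'B'] -- partial output despite the failure
   ```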
   
   The JVM throws an exception, but because that exception is not propagated to the Python process, Python cannot tell whether the JVM job failed or succeeded (it only sees the socket close). The result looks like this:
   
   ```
   ./bin/pyspark --conf spark.driver.maxResultSize=1m
   ```
   ```python
   spark.conf.set("spark.sql.execution.arrow.enabled", True)
   spark.range(10000000).toPandas()
   ```
   ```
   Empty DataFrame
   Columns: [id]
   Index: []
   ```
   
   This PR lets the Python process catch exceptions from the JVM.
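   
   To make the fix concrete, here is a minimal Python sketch of the join-and-re-raise pattern it applies (hypothetical class and method names; in the actual fix, Python joins the JVM-side serving thread via `jsocket_auth_server.getResult()`, as seen in the traceback below):
   
   ```python
   import threading
   
   class ResultServer:
       # Sketch of the pattern: run the serving work on a thread,
       # record any exception, and re-raise it when the caller joins.
       def __init__(self, serve_fn):
           self._error = None
           self._thread = threading.Thread(target=self._run, args=(serve_fn,))
           self._thread.start()
   
       def _run(self, serve_fn):
           try:
               serve_fn()  # e.g. stream Arrow batches over a socket
           except Exception as error:
               self._error = error  # capture instead of failing silently
   
       def get_result(self):
           self._thread.join()  # wait for serving to finish
           if self._error is not None:
               raise self._error  # propagate the failure to the caller
   
   # Usage sketch:
   #   server = ResultServer(serve_fn=some_streaming_function)
   #   server.get_result()  # raises if serve_fn failed
   ```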
   
   
   ### Why are the changes needed?
   
   It returns incorrect data, and can potentially return partial results, when an exception happens on the JVM side. This is a regression; the same code works fine in Spark 2.3.3.
   
   ### Does this PR introduce any user-facing change?
   
   Yes. 
   
   ```
   ./bin/pyspark --conf spark.driver.maxResultSize=1m
   ```
   ```python
   spark.conf.set("spark.sql.execution.arrow.enabled", True)
   spark.range(10000000).toPandas()
   ```
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../pyspark/sql/dataframe.py", line 2122, in toPandas
       batches = self._collectAsArrow()
     File "/.../pyspark/sql/dataframe.py", line 2184, in _collectAsArrow
       jsocket_auth_server.getResult()  # Join serving thread and raise any exceptions
     File "/.../lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
     File "/.../pyspark/sql/utils.py", line 63, in deco
       return f(*a, **kw)
     File "/.../lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o42.getResult.
   : org.apache.spark.SparkException: Exception thrown in awaitResult:
       ...
   Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (6.5 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
   ```
   
   The repro above now throws an exception as expected.
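   
   Users who want to handle this failure explicitly can catch it, since it now surfaces as a `Py4JJavaError` (as in the traceback above). A minimal sketch, assuming a running `pyspark` shell session where `spark` is defined:
   
   ```python
   from py4j.protocol import Py4JJavaError
   
   try:
       pdf = spark.range(10000000).toPandas()
   except Py4JJavaError as error:
       # After this fix, the JVM-side failure (e.g. exceeding
       # spark.driver.maxResultSize) is raised here instead of
       # silently producing an empty DataFrame.
       print("Job failed on the JVM side:", error.java_exception)
   ```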
   
   ### How was this patch tested?
   
   Manually, as described above. A unit test will be added against SPARK-28881.
   
