
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/22 18:43:08 UTC

[GitHub] [spark] dvogelbacher commented on issue #24677: [SPARK-27805][PYTHON] Propagate SparkExceptions during toPandas with arrow enabled

URL: https://github.com/apache/spark/pull/24677#issuecomment-494884105

`collectAsArrowToPython` just returns the socket info from `PythonRDD.serveToStream("serve-Arrow")`. The exception occurs during `runJob`, which is called inside `serveToStream` and is therefore executed in a background thread. When the background thread encounters an exception, it closes the `OutputStream`.
The `ArrowStreamSerializer` in the Python process then assumes it has read all the batches, after which the `ArrowCollectSerializer` tries to read the batch order indices and throws an `EOFError`, because those were never written.

Also note that before https://github.com/apache/spark/pull/22275 (which introduced the batch order indices), this would not have resulted in any error on the Python side. We would have silently dropped some partitions. Now at least we get an error, but it is not a very helpful one.
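
To make the failure mode concrete, here is a minimal sketch in plain Python. It does not use Spark or Arrow: the length-prefixed framing, the zero-length end marker, and the names `serve`, `read_batches`, `load_without_order` and `load_with_order` are all assumptions made up for illustration. The only point it tries to show is that a reader which treats a closed stream as a normal end of data returns partial results silently, while a reader that expects trailing order indices fails with an `EOFError`.

```python
import io
import struct


def write_int(stream, value):
    """Write a 4-byte big-endian signed int."""
    stream.write(struct.pack("!i", value))


def read_int(stream):
    """Read a 4-byte big-endian signed int, raising EOFError on a short read."""
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError("stream ended before an int could be read")
    return struct.unpack("!i", data)[0]


def serve(batches, order, fail_after=None):
    """Hypothetical writer side: returns the bytes that reached the socket.

    If fail_after is set, the "job" fails before writing that batch and the
    stream is simply closed, so no end marker and no order indices follow.
    """
    out = io.BytesIO()
    for i, batch in enumerate(batches):
        if fail_after is not None and i == fail_after:
            return out.getvalue()  # simulated exception: stream closed early
        write_int(out, len(batch))
        out.write(batch)
    write_int(out, 0)  # toy end-of-batches marker (batches must be non-empty)
    write_int(out, len(order))
    for idx in order:
        write_int(out, idx)
    return out.getvalue()


def read_batches(stream):
    """Yield batches until the end marker OR until the stream just ends."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # a closed stream looks like a clean end of batches
        (length,) = struct.unpack("!i", header)
        if length == 0:
            return
        yield stream.read(length)


def load_without_order(stream):
    """Reader with no trailing indices: truncation goes unnoticed."""
    return list(read_batches(stream))


def load_with_order(stream):
    """Reader that expects trailing order indices after the batches."""
    batches = list(read_batches(stream))
    num = read_int(stream)  # raises EOFError if the writer died early
    order = [read_int(stream) for _ in range(num)]
    return [batches[i] for i in order]


# Batches in arrival order, plus the indices that restore partition order.
arrived = [b"p1", b"p0", b"p2"]
order = [1, 0, 2]
full = serve(arrived, order)
truncated = serve(arrived, order, fail_after=2)
```

With these toy inputs, `load_with_order(io.BytesIO(full))` returns the reordered batches, `load_without_order(io.BytesIO(truncated))` returns only the two batches that were written, and `load_with_order(io.BytesIO(truncated))` raises `EOFError`. In neither truncated case does the reader learn what actually went wrong on the writer side.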
