Posted to issues@spark.apache.org by "David Vogelbacher (JIRA)" <ji...@apache.org> on 2019/05/20 13:59:00 UTC

[jira] [Commented] (SPARK-27778) toPandas with arrow enabled fails for DF with no partitions

    [ https://issues.apache.org/jira/browse/SPARK-27778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843987#comment-16843987 ] 

David Vogelbacher commented on SPARK-27778:
-------------------------------------------

I will make a pr for this shortly.
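
In the meantime, a possible user-level workaround is to bypass the Arrow collect path when the DataFrame has no partitions. This is only a rough sketch, not the planned fix; {{to_pandas_safe}} is an illustrative helper name, not a Spark API:

{noformat}
import pandas as pd

def to_pandas_safe(df):
    # Hypothetical helper (not part of Spark): skip toPandas' Arrow path
    # when the DataFrame has no partitions, since that currently raises EOFError.
    if df.rdd.getNumPartitions() == 0:
        # Return an empty pandas DataFrame with the same column names.
        # Note: pandas dtypes are not derived from the Spark schema here.
        return pd.DataFrame(columns=df.columns)
    return df.toPandas()
{noformat}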

> toPandas with arrow enabled fails for DF with no partitions
> -----------------------------------------------------------
>
>                 Key: SPARK-27778
>                 URL: https://issues.apache.org/jira/browse/SPARK-27778
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: David Vogelbacher
>            Priority: Major
>
> Calling {{toPandas}} with {{spark.sql.execution.arrow.enabled: true}} fails for DataFrames with no partitions. The error is an {{EOFError}}. With {{spark.sql.execution.arrow.enabled: false}} the conversion succeeds.
> Repro (on current master branch):
> {noformat}
> >>> from pyspark.sql.types import *
> >>> schema = StructType([StructField("field1", StringType(), True)])
> >>> df = spark.createDataFrame(sc.emptyRDD(), schema)
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> >>> df.toPandas()
> /Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py:2162: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on failures in the middle of computation.
>   warnings.warn(msg)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 2143, in toPandas
>     batches = self._collectAsArrow()
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 2205, in _collectAsArrow
>     results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 210, in load_stream
>     num = read_int(stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 810, in read_int
>     raise EOFError
> EOFError
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> >>> df.toPandas()
> Empty DataFrame
> Columns: [field1]
> Index: []
> {noformat}


