Posted to issues@spark.apache.org by "Radhwane Chebaane (Jira)" <ji...@apache.org> on 2019/09/20 09:38:00 UTC

[jira] [Updated] (SPARK-29188) toPandas gets wrong dtypes when applied on empty DF

     [ https://issues.apache.org/jira/browse/SPARK-29188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radhwane Chebaane updated SPARK-29188:
--------------------------------------
    Summary: toPandas gets wrong dtypes when applied on empty DF  (was: toPandas get wrong dtypes when applied on empty DF)

> toPandas gets wrong dtypes when applied on empty DF
> ---------------------------------------------------
>
>                 Key: SPARK-29188
>                 URL: https://issues.apache.org/jira/browse/SPARK-29188
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.0, 2.4.4
>         Environment: >> uname -a
> Linux XXXXXXXXXXXXXXXX 4.14.104-95.84.amzn2.x86_64 #1 SMP Sat Mar 2 00:40:20 UTC 2019 x86_64 GNU/Linux
> >> python
> Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42)
> [GCC 7.3.0] on linux
> >> conda list
> ...
> openjdk   8.0.192   h1de35cc_1003    conda-forge
> pandas    0.25.1    py36h86efe34_0   conda-forge
> py4j      0.10.7    py_1             conda-forge
> pyspark   2.4.4     py_0             conda-forge
> ....
>            Reporter: Radhwane Chebaane
>            Priority: Major
>
> When toPandas is called on an empty DataFrame, all dtypes are set to `object`.
> {code:python}
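> from datetime import datetime
> # Build a two-row DataFrame only to obtain a schema, then apply that schema to an empty RDD.
> spark_df = spark.createDataFrame([(10, "Emy", datetime.today()), (11, "Bob", datetime.today())], ["age", "name", "date"])
> spark.createDataFrame(spark.sparkContext.emptyRDD(), schema=spark_df.schema).toPandas().dtypes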
> {code}
> Result: 
> {code:bash}
> age     object
> name    object
> date    object
> dtype: object
> {code}
>  
> However, the dtypes are correct when the whole DataFrame (or at least one row of it) is converted to pandas:
> {code:python}
> from datetime import datetime
> spark_df = spark.createDataFrame([(10, "Emy", datetime.today()), (11, "Bob", datetime.today())], ["age", "name", "date"])
> spark_df.limit(1).toPandas().dtypes
> {code}
> Result:
> {code:bash}
> age              int64
> name            object
> date    datetime64[ns]
> dtype: object
> {code}
>  
> Is this intended? Why does toPandas not rely on the Spark DataFrame schema?
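
Until the behaviour described above changes, one possible workaround is to cast the columns of the empty pandas result using the Spark schema. The following is only a minimal sketch: the mapping table and both helper names are hypothetical and not part of PySpark, the mapping covers just a few primitive types, and it assumes the type names returned by `DataType.simpleString()` (for example `bigint` for a Python `int` column).

```python
# Hypothetical mapping from Spark SQL type names (DataType.simpleString())
# to pandas dtype strings. Only a few primitive types are covered; any
# type not listed here (e.g. "string") is left as `object`.
SPARK_TO_PANDAS_DTYPE = {
    "tinyint": "int8",
    "smallint": "int16",
    "int": "int32",
    "bigint": "int64",
    "float": "float32",
    "double": "float64",
    "boolean": "bool",
    "timestamp": "datetime64[ns]",
}


def pandas_dtypes_from_fields(fields):
    """Return {column: pandas dtype} for (column, spark_type_name) pairs."""
    return {
        name: SPARK_TO_PANDAS_DTYPE[type_name]
        for name, type_name in fields
        if type_name in SPARK_TO_PANDAS_DTYPE
    }


def to_pandas_with_schema_dtypes(spark_df):
    """Call toPandas() and, if the result has no rows, cast by Spark schema.

    Non-empty results are returned unchanged, because casting columns that
    contain nulls to integer dtypes would raise an error.
    """
    pdf = spark_df.toPandas()
    if not pdf.empty:
        return pdf
    fields = [(f.name, f.dataType.simpleString()) for f in spark_df.schema.fields]
    return pdf.astype(pandas_dtypes_from_fields(fields))
```

With the schema from the report, `pandas_dtypes_from_fields([("age", "bigint"), ("name", "string"), ("date", "timestamp")])` maps `age` to `int64` and `date` to `datetime64[ns]`, and leaves `name` as `object`.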



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
