Posted to issues@spark.apache.org by "Radhwane Chebaane (Jira)" <ji...@apache.org> on 2019/09/20 09:38:00 UTC
[jira] [Updated] (SPARK-29188) toPandas gets wrong dtypes when applied on empty DF
[ https://issues.apache.org/jira/browse/SPARK-29188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Radhwane Chebaane updated SPARK-29188:
--------------------------------------
Summary: toPandas gets wrong dtypes when applied on empty DF (was: toPandas get wrong dtypes when applied on empty DF)
> toPandas gets wrong dtypes when applied on empty DF
> ---------------------------------------------------
>
> Key: SPARK-29188
> URL: https://issues.apache.org/jira/browse/SPARK-29188
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.0, 2.4.4
> Environment: >> uname -a
> Linux XXXXXXXXXXXXXXXX 4.14.104-95.84.amzn2.x86_64 #1 SMP Sat Mar 2 00:40:20 UTC 2019 x86_64 GNU/Linux
> >> python
> Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42)
> [GCC 7.3.0] on linux
> >> conda list
> ...
> openjdk 8.0.192 h1de35cc_1003 conda-forge
> pandas 0.25.1 py36h86efe34_0 conda-forge
> py4j 0.10.7 py_1 conda-forge
> pyspark 2.4.4 py_0 conda-forge
> ....
> Reporter: Radhwane Chebaane
> Priority: Major
>
> When toPandas is called on an empty DataFrame, every column's dtype is set to `object`.
> {code:python}
> from datetime import datetime
> spark_df = spark.createDataFrame([(10, "Emy", datetime.today()), (11, "Bob", datetime.today())], ["age", "name", "date"])
> # Build an empty DataFrame with the same schema, then convert it to pandas
> spark.createDataFrame(spark.sparkContext.emptyRDD(), schema=spark_df.schema).toPandas().dtypes
> {code}
> Result:
> {code:bash}
> age object
> name object
> date object
> dtype: object
> {code}
>
> However, the dtypes are correct when converting the whole DataFrame, or any non-empty subset of it (even a single row), to pandas:
> {code:python}
> spark_df = spark.createDataFrame([(10, "Emy", datetime.today() ), (11, "Bob", datetime.today())], ["age", "name", "date"])
> spark_df.limit(1).toPandas().dtypes
> {code}
> Result:
> {code:bash}
> age int64
> name object
> date datetime64[ns]
> dtype: object
> {code}
>
> Is this intended? Why does toPandas not rely on the Spark DataFrame schema?
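
A possible client-side workaround, sketched here as an illustration and not as the fix adopted by Spark: after calling toPandas on an empty result, cast the columns using the type names reported by the Spark DataFrame's `dtypes` attribute (a list of `(column, type_name)` pairs). The helper name `coerce_empty_dtypes` and the mapping `SPARK_TO_PANDAS_DTYPE` are hypothetical and cover only a few simple types. To keep the sketch runnable without a Spark session, the empty pandas frame and the type list are written out by hand to mirror the example above.

```python
import pandas as pd

# Hypothetical, partial mapping from Spark SQL type names (as reported by
# DataFrame.dtypes in PySpark) to pandas/NumPy dtypes. Types that are not
# listed, such as "string", are left untouched and remain `object`.
SPARK_TO_PANDAS_DTYPE = {
    "tinyint": "int8",
    "smallint": "int16",
    "int": "int32",
    "bigint": "int64",
    "float": "float32",
    "double": "float64",
    "boolean": "bool",
    "timestamp": "datetime64[ns]",
}


def coerce_empty_dtypes(pdf, spark_dtypes):
    """Cast the columns of an empty pandas DataFrame to schema-derived dtypes.

    `spark_dtypes` is a list of (column_name, spark_type_name) pairs.
    Non-empty frames are returned unchanged.
    """
    if len(pdf) > 0:
        return pdf
    target = {
        name: SPARK_TO_PANDAS_DTYPE[type_name]
        for name, type_name in spark_dtypes
        if type_name in SPARK_TO_PANDAS_DTYPE and name in pdf.columns
    }
    return pdf.astype(target)


# Stand-in for the empty toPandas() result: all columns are `object`.
empty_pdf = pd.DataFrame(columns=["age", "name", "date"], dtype=object)

# Stand-in for spark_df.dtypes from the example in the report.
spark_dtypes = [("age", "bigint"), ("name", "string"), ("date", "timestamp")]

fixed = coerce_empty_dtypes(empty_pdf, spark_dtypes)
print(fixed.dtypes)
```

With a live session, the same helper could be called as `coerce_empty_dtypes(df.toPandas(), df.dtypes)`. Nullable integer columns and nested types would need extra handling that this sketch does not attempt.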
--
This message was sent by Atlassian Jira
(v8.3.4#803005)