
Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/12/17 01:51:00 UTC

[jira] [Commented] (SPARK-30239) Creating a dataframe with Pandas rather than Numpy datatypes fails

    [ https://issues.apache.org/jira/browse/SPARK-30239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997782#comment-16997782 ] 

Hyukjin Kwon commented on SPARK-30239:
--------------------------------------

Can you share a self-contained reproducer?

> Creating a dataframe with Pandas rather than Numpy datatypes fails
> ------------------------------------------------------------------
>
>                 Key: SPARK-30239
>                 URL: https://issues.apache.org/jira/browse/SPARK-30239
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>         Environment: DataBricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 | Scala 2.11
>            Reporter: Philip Kahn
>            Priority: Minor
>
> It's possible to work with DataFrames in Pandas and convert them back to Spark DataFrames for processing; however, using Pandas extension datatypes like {{Int64}} ([https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html]) throws an error (that long / float can't be converted).
> Internally, this happens because {{np.nan}} is a float, and {{pd.Int64Dtype()}} allows only integers except for the single float value {{np.nan}}.
>  
> The current workaround is to keep the columns as floats and, after conversion to a Spark DataFrame, recast the column as {{LongType()}}. For example:
>  
> {{from pyspark.sql.types import LongType}}
> {{sdfC = spark.createDataFrame(kgridCLinked)}}
> {{sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))}}
>  
> However, this is awkward and redundant.
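
The workaround quoted above might be sketched roughly as follows. This is a minimal illustration, not the reporter's code: the helper names {{nullable_ints_to_float}} and {{to_spark_with_long}} and the sample data are hypothetical, and only the column name {{gridID}} comes from the issue. The Spark step is not exercised here, and how missing values come out after the cast to {{LongType()}} depends on Spark's conversion path, so that should be checked separately.

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float, which is why a column holding it
# cannot also be a plain NumPy integer column.
nan_is_float = isinstance(np.nan, float)


def nullable_ints_to_float(pdf):
    """Return a copy in which nullable Int64 columns are float64 (missing -> NaN)."""
    out = pdf.copy()
    for name in out.columns:
        if isinstance(out[name].dtype, pd.Int64Dtype):
            out[name] = out[name].astype("float64")
    return out


def to_spark_with_long(spark, pdf, column):
    """Hypothetical helper: build the Spark DataFrame, then recast `column` to LongType.

    `spark` is assumed to be an existing SparkSession; PySpark is imported
    lazily so the pandas part of this sketch runs without it.
    """
    from pyspark.sql.types import LongType

    sdf = spark.createDataFrame(nullable_ints_to_float(pdf))
    return sdf.withColumn(column, sdf[column].cast(LongType()))


# Illustrative data only: a nullable integer column with one missing value.
pdf = pd.DataFrame({"gridID": pd.array([1, 2, None], dtype="Int64")})
floats = nullable_ints_to_float(pdf)
```

With a live session, {{to_spark_with_long(spark, pdf, "gridID")}} would perform the same two steps as the snippet in the description.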



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
