Posted to issues@spark.apache.org by "Xiao Li (JIRA)" <ji...@apache.org> on 2018/02/01 18:33:00 UTC

[jira] [Updated] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

     [ https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-23290:
----------------------------
    Priority: Blocker  (was: Major)

> inadvertent change in handling of DateType when converting to pandas dataframe
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-23290
>                 URL: https://issues.apache.org/jira/browse/SPARK-23290
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Andre Menck
>            Priority: Blocker
>
> [This PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968] changed how `DateType` columns are returned to users when a DataFrame is converted to pandas (line 1968 in dataframe.py). This can cause client code to fail, as in the following example from a Python terminal:
> {code:python}
> >>> import datetime as dt
> >>> import pandas as pd
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 0    2015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> date    datetime64[ns]
> num              int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", line 2355, in apply
>     mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
>   File "<stdin>", line 1, in <lambda>
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> The example above shows both the old behavior (returning an "object" column) and the new behavior (returning a datetime64[ns] column). Since user code may rely on the old behavior, I'd suggest reverting this specific part of the change. Also note that the NOTE in the docstring of "_to_corrected_pandas_type" appears to be out of date: it describes the old behavior, not the current one.
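
Until the behavior is settled, client code that has to work with either column layout could normalize the column before using it. The following is only a sketch under that assumption: `to_date_objects` is a hypothetical helper name, not part of PySpark or pandas, and it uses the same string-based stand-in for the old behavior as the example above.

```python
import datetime as dt

import pandas as pd


def to_date_objects(series):
    """Return a Series of datetime.date values for either column layout.

    Hypothetical helper for illustration; not part of PySpark or pandas.
    """
    # pd.to_datetime accepts ISO date strings as well as datetime64 values,
    # so a single code path covers both the old-style and new-style column.
    return pd.to_datetime(series).dt.date


# Stand-in for the old behavior: a column holding date strings.
old_style = pd.Series(['2015-01-01'], name='date')
# Stand-in for the new behavior: a datetime64 column.
new_style = pd.to_datetime(pd.Series(['2015-01-01'], name='date'))

print(to_date_objects(old_style)[0])
print(to_date_objects(new_style)[0])
```

Both calls yield `datetime.date` values, so downstream code no longer depends on which dtype the conversion produced.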



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
