Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/10/28 00:37:00 UTC

[jira] [Commented] (SPARK-40934) pyspark.pandas.read_csv parses dates, but docs state otherwise

    [ https://issues.apache.org/jira/browse/SPARK-40934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625362#comment-17625362 ] 

Hyukjin Kwon commented on SPARK-40934:
--------------------------------------

[~soxofaan], are you interested in opening a PR for this?

> pyspark.pandas.read_csv parses dates, but docs state otherwise
> --------------------------------------------------------------
>
>                 Key: SPARK-40934
>                 URL: https://issues.apache.org/jira/browse/SPARK-40934
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 3.3.1
>            Reporter: Stefaan Lippens
>            Priority: Major
>
> From [https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_csv.html]:
> {quote}parse_dates:
> boolean or list of ints or names or list of lists or dict, default False.
> Currently only False is allowed.
> {quote}
> This documentation suggests that dates are never parsed, but in practice they appear to be always parsed, and the parsing cannot be disabled:
> {code:python}
> import pyspark.pandas
> df = pyspark.pandas.read_csv("data.csv", parse_dates=False)
> print(df)
> print(df.dtypes)
> {code}
> With this data in {{data.csv}}:
> {code:java}
> date,feature_index,band_0,band_1,band_2
> 2021-01-05T01:00:00.000+01:00,2,5.0,4.5,3.75
> 2021-01-05T01:00:00.000+01:00,0,5.0,1.0,2.25
> 2021-01-05T01:00:00.000+01:00,1,5.0,3.5,4.0
> 2021-01-15T01:00:00.000+01:00,2,15.0,4.5,3.75
> 2021-01-15T01:00:00.000+01:00,0,15.0,1.0,2.25
> {code}
> this gives:
> {code:java}
>                  date  feature_index  band_0  band_1  band_2
> 0 2021-01-05 01:00:00              2     5.0     4.5    3.75
> 1 2021-01-05 01:00:00              0     5.0     1.0    2.25
> 2 2021-01-05 01:00:00              1     5.0     3.5    4.00
> 3 2021-01-15 01:00:00              2    15.0     4.5    3.75
> 4 2021-01-15 01:00:00              0    15.0     1.0    2.25
> date             datetime64[ns]
> feature_index             int32
> band_0                  float64
> band_1                  float64
> band_2                  float64
> dtype: object
> {code}
> Notice that the dates are parsed: the {{date}} column has dtype {{datetime64[ns]}}.
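
For comparison, here is a minimal sketch of what parse_dates=False means in plain pandas, the API that pyspark.pandas is modeled on. It uses pandas, not pyspark.pandas, so it only illustrates the behavior the quoted documentation leads a reader to expect. It says nothing about how Spark infers column types. The helper read_sample is hypothetical, and the CSV text is copied from the report above.

```python
import io

import pandas as pd

# Sample data copied from the report above.
CSV_TEXT = """date,feature_index,band_0,band_1,band_2
2021-01-05T01:00:00.000+01:00,2,5.0,4.5,3.75
2021-01-05T01:00:00.000+01:00,0,5.0,1.0,2.25
2021-01-05T01:00:00.000+01:00,1,5.0,3.5,4.0
2021-01-15T01:00:00.000+01:00,2,15.0,4.5,3.75
2021-01-15T01:00:00.000+01:00,0,15.0,1.0,2.25
"""


def read_sample(parse_dates):
    """Read the sample with plain pandas (hypothetical helper, not pyspark.pandas)."""
    return pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=parse_dates)


# With parse_dates=False the date column is left as text.
raw = read_sample(parse_dates=False)

# Date parsing only happens when it is requested explicitly.
parsed = read_sample(parse_dates=["date"])

print(raw.dtypes)
print(parsed.dtypes)
```

In plain pandas the first frame keeps the original timestamp strings, and only the second frame has a datetime column. That is the contrast with the pyspark.pandas output shown in the report.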



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
