Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/10/28 00:37:00 UTC
[jira] [Commented] (SPARK-40934) pyspark.pandas.read_csv parses dates, but docs state otherwise
[ https://issues.apache.org/jira/browse/SPARK-40934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625362#comment-17625362 ]
Hyukjin Kwon commented on SPARK-40934:
--------------------------------------
[~soxofaan], are you interested in submitting a PR?
> pyspark.pandas.read_csv parses dates, but docs state otherwise
> --------------------------------------------------------------
>
> Key: SPARK-40934
> URL: https://issues.apache.org/jira/browse/SPARK-40934
> Project: Spark
> Issue Type: Bug
> Components: Pandas API on Spark
> Affects Versions: 3.3.1
> Reporter: Stefaan Lippens
> Priority: Major
>
> from [https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_csv.html] :
> {quote}parse_dates:
> boolean or list of ints or names or list of lists or dict, default False.
> Currently only False is allowed.
> {quote}
> This documentation suggests that dates are never parsed, but they appear to be always parsed, and this cannot be disabled:
> {code:python}
> import pyspark.pandas
> df = pyspark.pandas.read_csv("data.csv", parse_dates=False)
> print(df)
> print(df.dtypes)
> {code}
> with this data
> {code:java}
> date,feature_index,band_0,band_1,band_2
> 2021-01-05T01:00:00.000+01:00,2,5.0,4.5,3.75
> 2021-01-05T01:00:00.000+01:00,0,5.0,1.0,2.25
> 2021-01-05T01:00:00.000+01:00,1,5.0,3.5,4.0
> 2021-01-15T01:00:00.000+01:00,2,15.0,4.5,3.75
> 2021-01-15T01:00:00.000+01:00,0,15.0,1.0,2.25
> {code}
> gives
> {code:java}
>                  date  feature_index  band_0  band_1  band_2
> 0 2021-01-05 01:00:00              2     5.0     4.5    3.75
> 1 2021-01-05 01:00:00              0     5.0     1.0    2.25
> 2 2021-01-05 01:00:00              1     5.0     3.5    4.00
> 3 2021-01-15 01:00:00              2    15.0     4.5    3.75
> 4 2021-01-15 01:00:00              0    15.0     1.0    2.25
> date             datetime64[ns]
> feature_index             int32
> band_0                  float64
> band_1                  float64
> band_2                  float64
> dtype: object
> {code}
> Notice how the dates are parsed (e.g. dtype {{datetime64[ns]}} for the {{date}} column).
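
For comparison, here is a minimal sketch of what plain pandas does with the same kind of data. It assumes pandas is installed and calls pandas.read_csv rather than pyspark.pandas.read_csv, so it only illustrates the behavior the documentation wording suggests (the date column stays as text when parse_dates=False). It says nothing about how Spark implements the reader. The names CSV_TEXT, date_is_datetime and first_date are made up for this illustration.

```python
import io

import pandas as pd

# A few rows of the sample data from the report above.
CSV_TEXT = """date,feature_index,band_0,band_1,band_2
2021-01-05T01:00:00.000+01:00,2,5.0,4.5,3.75
2021-01-05T01:00:00.000+01:00,0,5.0,1.0,2.25
2021-01-15T01:00:00.000+01:00,2,15.0,4.5,3.75
"""

# Plain pandas (NOT pyspark.pandas): with parse_dates=False the date
# column is not converted to a datetime dtype.
df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=False)

# True only if the column was parsed into a datetime dtype.
date_is_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])

# The first value, kept as the literal string from the file.
first_date = df["date"].iloc[0]

print(df.dtypes)
print("date parsed as datetime:", date_is_datetime)
print("first date value:", repr(first_date))
```

Running the same two checks on the frame returned by pyspark.pandas.read_csv would be one way to confirm the discrepancy described in the report.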