Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/05/25 08:13:04 UTC
[jira] [Resolved] (SPARK-20878) Pyspark date string parsing erroneously treats 1 as 10
[ https://issues.apache.org/jira/browse/SPARK-20878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-20878.
----------------------------------
Resolution: Not A Problem
I believe this compares a string to a string: the type coercion rule casts the date column to a string. The target value should be explicitly cast to a date, as below:
{code}
>>> df.filter("date > cast('2017-2-01' as date)").show()
+------+----------+
| label| date|
+------+----------+
| Two|2017-02-02|
| Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+
{code}
The current comparison appears to cast both sides to strings:
{code}
>>> df.filter(df.date > '2017-02-1').explain()
== Physical Plan ==
*Filter (isnotnull(date#1) && (cast(date#1 as string) > 2017-02-1))
+- Scan ExistingRDD[label#0,date#1]
{code}
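As a side note, a plain-Python sketch (outside Spark, for illustration only) of why parsing the literal into a real date helps: {{strptime}} accepts non-zero-padded month/day fields and normalizes them, which is analogous to what the explicit {{cast(... as date)}} achieves:
{code}
from datetime import datetime

# strptime's %m and %d accept non-zero-padded values, so the
# literal '2017-2-01' normalizes to the date 2017-02-01 --
# an analogy for what cast('2017-2-01' as date) does in SQL.
d = datetime.strptime("2017-2-01", "%Y-%m-%d").date()
print(d)  # 2017-02-01
{code}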
I am resolving this. Please reopen this if I misunderstood.
> Pyspark date string parsing erroneously treats 1 as 10
> -------------------------------------------------------
>
> Key: SPARK-20878
> URL: https://issues.apache.org/jira/browse/SPARK-20878
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.2
> Reporter: Nick Lothian
>
> Pyspark date filter columns can take a String in format yyyy-mm-dd and correctly handle it. This doesn't appear to be documented anywhere (?) but is extremely useful.
> However, it silently converts the format yyyy-mm-d to yyyy-mm-d0 and yyyy-m-dd to yyyy-m0-dd.
> For example, 2017-02-1 will be treated as 2017-02-10, and 2017-2-01 as 2017-20-01 (which is invalid, but does not throw an error).
> This causes bugs that are very hard to discover.
> Test code:
> {code}
> from pyspark.sql.types import *
> from datetime import datetime
> schema = StructType([StructField("label", StringType(), True),
>                      StructField("date", DateType(), True)])
> data = [('One', datetime.strptime("2017/02/01", '%Y/%m/%d')),
> ('Two', datetime.strptime("2017/02/02", '%Y/%m/%d')),
> ('Ten', datetime.strptime("2017/02/10", '%Y/%m/%d')),
> ('Eleven', datetime.strptime("2017/02/11", '%Y/%m/%d'))]
> df = sqlContext.createDataFrame(data, schema)
> df.printSchema()
> print("All Data")
> df.show()
> print("Filter greater than 1 Feb (using 2017-02-1)")
> df.filter(df.date > '2017-02-1').show()
> print("Filter greater than 1 Feb (using 2017-02-01)")
> df.filter(df.date > '2017-02-01').show()
> print("Filter greater than 1 Feb (using 2017-2-01)")
> df.filter(df.date > '2017-2-01').show()
> {code}
> Output:
> {code}
> root
> |-- label: string (nullable = true)
> |-- date: date (nullable = true)
> All Data
> +------+----------+
> | label| date|
> +------+----------+
> | One|2017-02-01|
> | Two|2017-02-02|
> | Ten|2017-02-10|
> |Eleven|2017-02-11|
> +------+----------+
> Filter greater than 1 Feb (using 2017-02-1)
> +------+----------+
> | label| date|
> +------+----------+
> | Ten|2017-02-10|
> |Eleven|2017-02-11|
> +------+----------+
> Filter greater than 1 Feb (using 2017-02-01)
> +------+----------+
> | label| date|
> +------+----------+
> | Two|2017-02-02|
> | Ten|2017-02-10|
> |Eleven|2017-02-11|
> +------+----------+
> Filter greater than 1 Feb (using 2017-2-01)
> +-----+----+
> |label|date|
> +-----+----+
> +-----+----+
> {code}
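The outputs above follow directly from plain lexicographic string comparison; a quick standalone Python check (hypothetical, no Spark involved) reproduces all three results:
{code}
# The column's date values, as the strings Spark compares them as.
dates = ["2017-02-01", "2017-02-02", "2017-02-10", "2017-02-11"]

# '2017-02-1': at the day position '0' < '1', so only the
# 10th and 11th compare greater -- matching the first filter.
assert [d for d in dates if d > "2017-02-1"] == ["2017-02-10", "2017-02-11"]

# '2017-2-01': at the month position '0' < '2', so every row
# compares less -- matching the empty third result.
assert [d for d in dates if d > "2017-2-01"] == []
{code}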
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)