Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/05/25 08:13:04 UTC

[jira] [Resolved] (SPARK-20878) Pyspark date string parsing erroneously treats 1 as 10

     [ https://issues.apache.org/jira/browse/SPARK-20878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-20878.
----------------------------------
    Resolution: Not A Problem

I guess it compares a string to a string (the date column gets cast to a string by the type coercion rules). I think the comparison target should be cast to a date explicitly, as below:

{code}
>>> df.filter("date > cast('2017-2-01' as date)").show()
+------+----------+
| label|      date|
+------+----------+
|   Two|2017-02-02|
|   Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+
{code}
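
Equivalently, the cast can be done through the Column API instead of a SQL expression string. A minimal sketch, assuming the same df as in the test code below; this should produce the same three rows as the SQL cast above:

{code}
>>> from pyspark.sql.functions import lit
>>> # Cast the literal to a date so the comparison happens on dates, not strings
>>> df.filter(df.date > lit('2017-2-01').cast('date')).show()
+------+----------+
| label|      date|
+------+----------+
|   Two|2017-02-02|
|   Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+
{code}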

The current comparison looks like it casts both sides to strings:


{code}
>>> df.filter(df.date > '2017-02-1').explain()
== Physical Plan ==
*Filter (isnotnull(date#1) && (cast(date#1 as string) > 2017-02-1))
+- Scan ExistingRDD[label#0,date#1]
{code}
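
That explains the reported behavior: once both sides are strings, the comparison is plain lexicographic ordering rather than a date comparison. A quick illustration in bare Python (no Spark involved):

{code}
>>> # '2017-02-02' > '2017-02-1' compares '0' against '1' at the ninth
>>> # character, so days 01 through 09 sort *below* '2017-02-1' ...
>>> '2017-02-02' > '2017-02-1'
False
>>> # ... while '2017-02-10' is '2017-02-1' plus an extra character, so days
>>> # 10 and up sort above it, which is why 1 appears to be treated as 10
>>> '2017-02-10' > '2017-02-1'
True
>>> # And no 'yyyy-mm-dd' value exceeds '2017-2-01', because '0' < '2' at the
>>> # sixth character, so that filter returns no rows at all
>>> '2017-02-11' > '2017-2-01'
False
{code}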

I am resolving this. Please reopen this if I misunderstood.

> Pyspark date string parsing erroneously treats 1 as 10 
> -------------------------------------------------------
>
>                 Key: SPARK-20878
>                 URL: https://issues.apache.org/jira/browse/SPARK-20878
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.2
>            Reporter: Nick Lothian
>
> PySpark date columns can be filtered against a string in the format yyyy-mm-dd, and this is handled correctly. It doesn't appear to be documented anywhere (?) but is extremely useful. 
> However, it silently converts the format yyyy-mm-d to yyyy-mm-d0 and yyyy-m-dd to yyyy-m0-dd. 
> For example, 2017-02-1 will be treated as 2017-02-10, and 2017-2-01 as 2017-20-01 (which is invalid, but does not throw an error).
> This causes bugs that are very hard to discover.
> Test code:
> {code}
> from pyspark.sql.types import *
> from datetime import datetime
> schema = StructType([StructField("label", StringType(), True),
>                      StructField("date", DateType(), True)])
> data = [('One', datetime.strptime("2017/02/01", '%Y/%m/%d')), 
>         ('Two', datetime.strptime("2017/02/02", '%Y/%m/%d')), 
>         ('Ten', datetime.strptime("2017/02/10", '%Y/%m/%d')),
>         ('Eleven', datetime.strptime("2017/02/11", '%Y/%m/%d'))]
> df = sqlContext.createDataFrame(data, schema)
> df.printSchema()
> print("All Data")
> df.show()
> print("Filter greater than 1 Jan (using 2017-02-1)")
> df.filter(df.date > '2017-02-1').show()
> print("Filter greater than 1 Jan (using 2017-02-01)")
> df.filter(df.date > '2017-02-01').show()
> print("Filter greater than 1 Jan (using 2017-2-01)")
> df.filter(df.date > '2017-2-01').show()
> {code}
> Output:
> {code}
> root
>  |-- label: string (nullable = true)
>  |-- date: date (nullable = true)
> All Data
> +------+----------+
> | label|      date|
> +------+----------+
> |   One|2017-02-01|
> |   Two|2017-02-02|
> |   Ten|2017-02-10|
> |Eleven|2017-02-11|
> +------+----------+
> Filter greater than 1 Feb (using 2017-02-1)
> +------+----------+
> | label|      date|
> +------+----------+
> |   Ten|2017-02-10|
> |Eleven|2017-02-11|
> +------+----------+
> Filter greater than 1 Feb (using 2017-02-01)
> +------+----------+
> | label|      date|
> +------+----------+
> |   Two|2017-02-02|
> |   Ten|2017-02-10|
> |Eleven|2017-02-11|
> +------+----------+
> Filter greater than 1 Feb (using 2017-2-01)
> +-----+----+
> |label|date|
> +-----+----+
> +-----+----+
> {code}


