Posted to issues@spark.apache.org by "Nick Lothian (JIRA)" <ji...@apache.org> on 2017/05/25 05:31:04 UTC

[jira] [Created] (SPARK-20878) Pyspark date string parsing erroneously treats 1 as 10

Nick Lothian created SPARK-20878:
------------------------------------

             Summary: Pyspark date string parsing erroneously treats 1 as 10 
                 Key: SPARK-20878
                 URL: https://issues.apache.org/jira/browse/SPARK-20878
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.2
            Reporter: Nick Lothian


PySpark filters on date columns accept a string in the format yyyy-mm-dd and handle it correctly. This doesn't appear to be documented anywhere (?) but is extremely useful.

However, it silently converts the format yyyy-mm-d to yyyy-mm-d0 and yyyy-m-dd to yyyy-m0-dd. 

For example, 2017-02-1 will be treated as 2017-02-10, and 2017-2-01 as 2017-20-01 (which is invalid, but does not throw an error).

This causes very hard-to-discover bugs.

Test code:

{code}
from datetime import datetime

from pyspark.sql.types import DateType, StringType, StructField, StructType

# Assumes a pyspark shell session, where sqlContext is already defined.
schema = StructType([
    StructField("label", StringType(), True),
    StructField("date", DateType(), True),
])

# Four rows in February 2017: the 1st, 2nd, 10th and 11th.
data = [('One', datetime.strptime("2017/02/01", '%Y/%m/%d')),
        ('Two', datetime.strptime("2017/02/02", '%Y/%m/%d')),
        ('Ten', datetime.strptime("2017/02/10", '%Y/%m/%d')),
        ('Eleven', datetime.strptime("2017/02/11", '%Y/%m/%d'))]

df = sqlContext.createDataFrame(data, schema)
df.printSchema()

print("All Data")
df.show()

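# Single-digit day: expected to match 'Two', 'Ten' and 'Eleven'.
print("Filter greater than 1 Feb (using 2017-02-1)")
df.filter(df.date > '2017-02-1').show()

# Zero-padded day and month: the well-formed reference case.
print("Filter greater than 1 Feb (using 2017-02-01)")
df.filter(df.date > '2017-02-01').show()

# Single-digit month: expected to match 'Two', 'Ten' and 'Eleven'.
print("Filter greater than 1 Feb (using 2017-2-01)")
df.filter(df.date > '2017-2-01').show()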

{code}

Output:

{code}
root
 |-- label: string (nullable = true)
 |-- date: date (nullable = true)

All Data
+------+----------+
| label|      date|
+------+----------+
|   One|2017-02-01|
|   Two|2017-02-02|
|   Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+

Filter greater than 1 Feb (using 2017-02-1)
+------+----------+
| label|      date|
+------+----------+
|   Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+

Filter greater than 1 Feb (using 2017-02-01)
+------+----------+
| label|      date|
+------+----------+
|   Two|2017-02-02|
|   Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+

Filter greater than 1 Feb (using 2017-2-01)
+-----+----+
|label|date|
+-----+----+
+-----+----+
{code}
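
A possible workaround until this is fixed, offered only as a sketch: normalize the date string to zero-padded yyyy-mm-dd before passing it to the filter. The helper name normalize_date_string is hypothetical and not part of PySpark; it relies only on Python's standard datetime module, which accepts single-digit months and days for %m and %d and raises ValueError for an invalid month such as 20.

```python
from datetime import datetime


def normalize_date_string(value):
    """Return value as a zero-padded yyyy-mm-dd string.

    Hypothetical helper, not part of PySpark. Raises ValueError when
    the input is not a valid date (for example month 20).
    """
    # strptime parses both '2017-02-1' and '2017-2-01' as 1 Feb 2017;
    # isoformat() then writes the date back out zero-padded.
    return datetime.strptime(value, "%Y-%m-%d").date().isoformat()


if __name__ == "__main__":
    for raw in ("2017-02-1", "2017-2-01", "2017-02-01"):
        print(raw, "->", normalize_date_string(raw))
    # With the DataFrame from the test code above, the filter would become:
    #   df.filter(df.date > normalize_date_string('2017-02-1')).show()
```

With this in place, all three spellings used in the test code reach the filter as 2017-02-01, so the comparison no longer depends on how the string was padded.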



