Posted to issues@spark.apache.org by "Nick Lothian (JIRA)" <ji...@apache.org> on 2017/05/25 05:31:04 UTC
[jira] [Created] (SPARK-20878) Pyspark date string parsing erroneously treats 1 as 10
Nick Lothian created SPARK-20878:
------------------------------------
Summary: Pyspark date string parsing erroneously treats 1 as 10
Key: SPARK-20878
URL: https://issues.apache.org/jira/browse/SPARK-20878
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.0.2
Reporter: Nick Lothian
PySpark filters on date columns accept a string in the format yyyy-mm-dd and handle it correctly. This doesn't appear to be documented anywhere, but it is extremely useful.
However, a string in the format yyyy-mm-d is silently treated as yyyy-mm-d0, and yyyy-m-dd as yyyy-m0-dd.
For example, 2017-02-1 is treated as 2017-02-10, and 2017-2-01 as 2017-20-01 (which is invalid, but does not throw an error).
This causes very hard-to-discover bugs.
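Until this is resolved, one way to avoid the problem is to normalize date strings to zero-padded yyyy-mm-dd before passing them to a filter. The following is a minimal sketch, assuming the input uses "-" as the separator; `normalize_date_string` is a hypothetical helper written for this example and is not part of PySpark.

```python
from datetime import datetime


def normalize_date_string(value):
    """Return value as a zero-padded yyyy-mm-dd string.

    Hypothetical helper for illustration only; raises ValueError
    if value is not a valid date.
    """
    # strptime accepts single-digit months and days for %m and %d,
    # and isoformat() always writes them zero-padded.
    return datetime.strptime(value, "%Y-%m-%d").date().isoformat()


print(normalize_date_string("2017-02-1"))
print(normalize_date_string("2017-2-01"))
```

The result can then be used in place of the raw string, for example `df.filter(df.date > normalize_date_string('2017-02-1'))`. An invalid value such as 2017-20-01 raises a ValueError instead of silently matching nothing.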
Test code:
{code}
from pyspark.sql.types import *
from datetime import datetime
schema = StructType([StructField("label", StringType(), True),
                     StructField("date", DateType(), True)])
data = [('One', datetime.strptime("2017/02/01", '%Y/%m/%d')),
        ('Two', datetime.strptime("2017/02/02", '%Y/%m/%d')),
        ('Ten', datetime.strptime("2017/02/10", '%Y/%m/%d')),
        ('Eleven', datetime.strptime("2017/02/11", '%Y/%m/%d'))]
df = sqlContext.createDataFrame(data, schema)
df.printSchema()
print("All Data")
df.show()
print("Filter greater than 1 Feb (using 2017-02-1)")
df.filter(df.date > '2017-02-1').show()
print("Filter greater than 1 Feb (using 2017-02-01)")
df.filter(df.date > '2017-02-01').show()
print("Filter greater than 1 Feb (using 2017-2-01)")
df.filter(df.date > '2017-2-01').show()
{code}
Output:
{code}
root
 |-- label: string (nullable = true)
 |-- date: date (nullable = true)
All Data
+------+----------+
| label|      date|
+------+----------+
|   One|2017-02-01|
|   Two|2017-02-02|
|   Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+
Filter greater than 1 Feb (using 2017-02-1)
+------+----------+
| label|      date|
+------+----------+
|   Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+
Filter greater than 1 Feb (using 2017-02-01)
+------+----------+
| label|      date|
+------+----------+
|   Two|2017-02-02|
|   Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+
Filter greater than 1 Feb (using 2017-2-01)
+-----+----+
|label|date|
+-----+----+
+-----+----+
{code}
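The three results above are also what a character-by-character string comparison would give. One possible explanation (an assumption, not verified against the Spark source) is that the date column is compared with the literal as a string, rather than the literal being parsed as a date. The pure-Python sketch below does not use Spark; it only shows that lexicographic comparison selects the same rows. `greater_as_strings` is a hypothetical helper for this example.

```python
# The four dates from the test data, written as yyyy-mm-dd strings.
dates = ["2017-02-01", "2017-02-02", "2017-02-10", "2017-02-11"]


def greater_as_strings(values, bound):
    """Return the values that sort after bound lexicographically."""
    # Plain string comparison; no date parsing is involved.
    return [v for v in values if v > bound]


for bound in ["2017-02-1", "2017-02-01", "2017-2-01"]:
    print(bound, greater_as_strings(dates, bound))
```

If this is the cause, the literal is never rewritten to 2017-02-10 or 2017-20-01; it simply sorts between the zero-padded strings in a way that gives the same rows.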