Posted to issues@spark.apache.org by "Aleksandr Koriagin (JIRA)" <ji...@apache.org> on 2018/09/25 07:16:00 UTC
[jira] [Comment Edited] (SPARK-25467) Python date/datetime objects in dataframes increment by 1 day when converted to JSON
[ https://issues.apache.org/jira/browse/SPARK-25467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626902#comment-16626902 ]
Aleksandr Koriagin edited comment on SPARK-25467 at 9/25/18 7:15 AM:
---------------------------------------------------------------------
The issue can be reproduced on HDP (2.6.5.0-292) with Spark 2.3.0:
{code:python}
import datetime
from pyspark.sql import Row
date = datetime.date.fromordinal(1)
print date # >> '0001-01-01'
a = [Row(date=date)]
sqlContext.createDataFrame(a).toJSON().collect() # >> [u'{"date":"0001-01-03"}']
{code}
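The '0001-01-03' in the output is two days ahead of the date that was stored. Before digging into the Spark internals below, the encoding PySpark uses for dates can be sketched in plain Python. This is a paraphrase of {{DateType.toInternal}}/{{fromInternal}} from {{pyspark/sql/types.py}}; {{to_internal}} and {{from_internal}} are illustrative names here, not Spark APIs:

{code:python}
import datetime

# Spark's DateType ships dates to the JVM as a day count relative to
# the Unix epoch, 1970-01-01.
EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()  # 719163

def to_internal(d):
    # paraphrase of pyspark.sql.types.DateType.toInternal
    return d.toordinal() - EPOCH_ORDINAL

def from_internal(v):
    # paraphrase of pyspark.sql.types.DateType.fromInternal
    return datetime.date.fromordinal(v + EPOCH_ORDINAL)

print(to_internal(datetime.date.fromordinal(1)))  # -719162
print(from_internal(-719162))                     # 0001-01-01, round-trips fine
{code}

The Python side round-trips correctly, so the shift has to be introduced after the day count crosses over to the JVM.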
Here is the part of the code where the issue probably happens:
https://github.com/apache/spark/blob/7d8f5b62c57c9e2903edd305e8b9c5400652fdb0/python/pyspark/sql/session.py#L750
{code:python}
if isinstance(data, RDD):
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
else:
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
# ipdb> rdd.collect() --> [(-719162,)]
# See "sql.types.DateType#toInternal" for where the '-719162' value comes from:
# https://github.com/apache/spark/blob/7d8f5b62c57c9e2903edd305e8b9c5400652fdb0/python/pyspark/sql/types.py#L161
# The value is '-719162' because:
# datetime.date.fromordinal(1).toordinal() - datetime.datetime(1970, 1, 1).toordinal() = -719162
#
# ipdb> schema --> StructType(List(StructField(date,DateType,true)))
# ipdb> schema.json() --> '{"fields":[{"metadata":{},"name":"date","nullable":true,"type":"date"}],"type":"struct"}'
# Up to this point everything looks correct
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
# After the 'applySchemaToPythonRDD' transformation the value is incorrect: '0001-01-03'.
# ipdb> jdf.show()
# +----------+
# | date|
# +----------+
# |0001-01-03| <<-- should be '0001-01-01'
# +----------+
#
# The issue therefore seems to happen on the Java/Scala side:
# https://github.com/apache/spark/blob/2a0a8f753bbdc8c251f8e699c0808f35b94cfd20/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L734
df = DataFrame(jdf, self._wrapped)
df._schema = schema
return df
{code}
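A plausible explanation for the exact two-day shift is a calendar mismatch: Python's {{datetime}} uses the proleptic Gregorian calendar, while {{java.sql.Date}} (which Spark 2.x uses internally for {{DateType}}) falls back to the Julian calendar for dates before the 1582 Gregorian cutover, and in year 1 the two calendars differ by exactly two days. The sketch below reproduces the shift with plain Python and no Spark at all; the Julian Day Number conversion is a standard textbook algorithm (Richards), not Spark code:

{code:python}
import datetime

# Day count PySpark ships to the JVM (proleptic Gregorian, per Python's datetime):
internal = datetime.date.fromordinal(1).toordinal() - datetime.date(1970, 1, 1).toordinal()
print(internal)  # -719162

# The Julian Day Number of the Gregorian epoch 1970-01-01 is 2440588.
jdn = 2440588 + internal  # 1721426

def jdn_to_julian_calendar(jdn):
    # Convert a Julian Day Number to (year, month, day) in the *Julian*
    # calendar, using Richards' integer-arithmetic algorithm.
    c = jdn + 32082
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = d - 4800 + m // 10
    return year, month, day

print(jdn_to_julian_calendar(jdn))  # (1, 1, 3) -- the '0001-01-03' seen above
{code}

Interpreting the same day count in the Julian calendar yields 0001-01-03, which matches the corrupted output, so the bug is consistent with the JVM side labelling pre-1582 day counts with Julian-calendar dates.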
> Python date/datetime objects in dataframes increment by 1 day when converted to JSON
> ------------------------------------------------------------------------------------
>
> Key: SPARK-25467
> URL: https://issues.apache.org/jira/browse/SPARK-25467
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.3.1
> Environment: Spark 2.3.1
> Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:39:56)
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (build 1.8.0_181-b13)
> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> Centos 7 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 x86_64 x86_64 GNU/Linux
> Reporter: David V. Hill
> Priority: Major
>
> When a DataFrame contains datetime.date or datetime.datetime instances and toJSON() is called on it, the day is incremented in the JSON date representation.
> {code}
> # Create a Dataframe containing datetime.date instances, convert to JSON and display
> rows = [Row(cx=1, cy=2, dates=[datetime.date.fromordinal(1), datetime.date.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.date(1, 1, 1), datetime.date(1, 1, 2)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-03","0001-01-04"]}']
> # Issue also occurs with datetime.datetime instances
> rows = [Row(cx=1, cy=2, dates=[datetime.datetime.fromordinal(1), datetime.datetime.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.datetime(1, 1, 1, 0, 0, fold=1), datetime.datetime(1, 1, 2, 0, 0)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-02T23:50:36.000-06:00","0001-01-03T23:50:36.000-06:00"]}']
> {code}
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)