Posted to issues@spark.apache.org by "Aleksandr Koriagin (JIRA)" <ji...@apache.org> on 2018/09/25 07:16:00 UTC

[jira] [Comment Edited] (SPARK-25467) Python date/datetime objects in dataframes increment by 1 day when converted to JSON

    [ https://issues.apache.org/jira/browse/SPARK-25467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626902#comment-16626902 ] 

Aleksandr Koriagin edited comment on SPARK-25467 at 9/25/18 7:15 AM:
---------------------------------------------------------------------

Can be reproduced on HDP (2.6.5.0-292) with Spark 2.3.0:
{code:python}
import datetime
from pyspark.sql import Row

date = datetime.date.fromordinal(1)
print(date)  # >> '0001-01-01'

# 'sqlContext' is the SQLContext pre-defined in the PySpark shell
a = [Row(date=date)]
sqlContext.createDataFrame(a).toJSON().collect()  # >> [u'{"date":"0001-01-03"}']
{code}
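
For comparison, a plain collect() round-trips the value correctly (the reporter's output below shows the same), so the Python-side serialization itself looks fine; a minimal check, run in the same PySpark shell where 'sqlContext' is predefined:
{code:python}
import datetime
from pyspark.sql import Row

a = [Row(date=datetime.date.fromordinal(1))]
df = sqlContext.createDataFrame(a)

# Collecting back to Python returns the original date...
df.collect()  # >> [Row(date=datetime.date(1, 1, 1))]

# ...while the JSON path shifts this value by two days:
df.toJSON().collect()  # >> [u'{"date":"0001-01-03"}']
{code}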
Here is the part of the code where the issue probably happens:
https://github.com/apache/spark/blob/7d8f5b62c57c9e2903edd305e8b9c5400652fdb0/python/pyspark/sql/session.py#L750

{code:python}
if isinstance(data, RDD):
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
else:
    rdd, schema = self._createFromLocal(map(prepare, data), schema)

    # ipdb> rdd.collect() --> [(-719162,)]
    #   See "sql.types.DateType#toInternal" for the '-719162' value:
    #   https://github.com/apache/spark/blob/7d8f5b62c57c9e2903edd305e8b9c5400652fdb0/python/pyspark/sql/types.py#L161
    #   The value is '-719162' because:
    #   datetime.date.fromordinal(1).toordinal() - datetime.datetime(1970, 1, 1).toordinal() = -719162
    #
    # ipdb> schema        --> StructType(List(StructField(date,DateType,true)))
    # ipdb> schema.json() --> '{"fields":[{"metadata":{},"name":"date","nullable":true,"type":"date"}],"type":"struct"}'
    # Everything here still looks correct.

jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
    
    # After the 'applySchemaToPythonRDD' transformation the value is incorrect: '0001-01-03'.
    # ipdb> jdf.show()
    # +----------+
    # |      date|
    # +----------+
    # |0001-01-03|   <<-- should be '0001-01-01'
    # +----------+
    #
    # The issue seems to happen on the Java/Scala side:
    # https://github.com/apache/spark/blob/2a0a8f753bbdc8c251f8e699c0808f35b94cfd20/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L734

df = DataFrame(jdf, self._wrapped)
df._schema = schema
return df
{code}
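
To double-check the Python side in isolation, DateType's toInternal/fromInternal round-trip the value exactly; a minimal sketch (no Spark session needed, only the pyspark package on the path):
{code:python}
import datetime
from pyspark.sql.types import DateType

date = datetime.date.fromordinal(1)
dt = DateType()

# toInternal stores the date as days relative to the 1970-01-01 epoch:
internal = dt.toInternal(date)
print(internal)  # >> -719162

# Same arithmetic as in sql.types.DateType#toInternal:
print(date.toordinal() - datetime.datetime(1970, 1, 1).toordinal())  # >> -719162

# fromInternal recovers the original date, so the corruption
# must happen after the value crosses into the JVM:
print(dt.fromInternal(internal))  # >> 0001-01-01
{code}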



> Python date/datetime objects in dataframes increment by 1 day when converted to JSON
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-25467
>                 URL: https://issues.apache.org/jira/browse/SPARK-25467
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.3.1
>         Environment: Spark 2.3.1
> Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:39:56) 
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (build 1.8.0_181-b13)
> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> Centos 7 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 x86_64 x86_64 GNU/Linux
>            Reporter: David V. Hill
>            Priority: Major
>
> When a DataFrame contains datetime.date or datetime.datetime instances and toJSON() is called on it, the day is incremented in the JSON date representation.
> {code}
> import datetime
> from pyspark.sql import Row
>
> # Create a DataFrame containing datetime.date instances, convert to JSON and display
> # ('sqc' is a SQLContext)
> rows = [Row(cx=1, cy=2, dates=[datetime.date.fromordinal(1), datetime.date.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.date(1, 1, 1), datetime.date(1, 1, 2)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-03","0001-01-04"]}']
> # Issue also occurs with datetime.datetime instances
> rows = [Row(cx=1, cy=2, dates=[datetime.datetime.fromordinal(1), datetime.datetime.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.datetime(1, 1, 1, 0, 0, fold=1), datetime.datetime(1, 1, 2, 0, 0)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-02T23:50:36.000-06:00","0001-01-03T23:50:36.000-06:00"]}']
> {code}
>  
>  


