Posted to issues@spark.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2017/07/25 17:11:00 UTC

[jira] [Comment Edited] (SPARK-21375) Add date and timestamp support to ArrowConverters for toPandas() collection

    [ https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100371#comment-16100371 ] 

Wes McKinney edited comment on SPARK-21375 at 7/25/17 5:10 PM:
---------------------------------------------------------------

What is the summary of how you're handling the time zone issue? The way that Spark handles timestamps without time zones seems problematic to me. Is there a way to configure your Spark system to force UTC locale? Otherwise the same code could yield different answers in different locales on the same input data. 
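The locale dependence can be illustrated with a small stdlib-only sketch (illustrative only, not Spark code; Unix-only because it uses time.tzset): the same epoch value renders as different wall-clock times under different process time zones.

```python
import datetime
import os
import time

# Illustration only: the same instant (epoch second 0) renders as
# different wall-clock times depending on the process time zone.
os.environ["TZ"] = "America/Los_Angeles"
time.tzset()  # Unix-only
local_view = datetime.datetime.fromtimestamp(0)   # 1969-12-31 16:00:00

os.environ["TZ"] = "UTC"
time.tzset()
utc_view = datetime.datetime.fromtimestamp(0)     # 1970-01-01 00:00:00

assert local_view != utc_view  # same input data, different answer
```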

The way that Arrow handles this is by never consulting the system locale; instead it provides time zone naive and time zone aware timestamps (pandas does the same thing with its data):

* Time zone naive timestamps, where timezone = null in the Arrow metadata. The time components (day, hour, minute, etc.) are computed without considering the system locale, so it's as though the locale is UTC.

* Time zone aware timestamps, where the physical representation is internally normalized to UTC and time zone changes do not alter the underlying int64 timestamp values. Changing the time zone is therefore a metadata-only conversion.
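As a rough stdlib analogy (this is not the Arrow API; the names below are illustrative), the two cases look like:

```python
import datetime

micros = 1_000_000  # underlying int64 value: 1970-01-01 00:00:01 UTC

# Naive (timezone = null in the metadata): components computed as
# though the locale were UTC, ignoring the system locale entirely.
naive = datetime.datetime(1970, 1, 1) + datetime.timedelta(microseconds=micros)

# Aware: physical value normalized to UTC, with the zone as metadata.
aware_utc = datetime.datetime.fromtimestamp(micros / 1e6,
                                            tz=datetime.timezone.utc)

# Changing the time zone leaves the underlying instant (and hence the
# int64 value) untouched: a metadata-only conversion.
aware_ny = aware_utc.astimezone(
    datetime.timezone(datetime.timedelta(hours=-5)))
assert aware_utc == aware_ny                        # same instant
assert aware_utc.timestamp() == aware_ny.timestamp()
```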



> Add date and timestamp support to ArrowConverters for toPandas() collection
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-21375
>                 URL: https://issues.apache.org/jira/browse/SPARK-21375
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>
> Date and timestamp are not yet supported in DataFrame.toPandas() using ArrowConverters.  These are common types for data analysis used in both Spark and Pandas and should be supported.
> There is a discrepancy in how PySpark and Arrow internally store timestamps that have no time zone specified: PySpark takes a UTC timestamp and adjusts it to local time, while Arrow keeps it in UTC. Hopefully there is a clean way to resolve this.
> Spark internal storage spec:
> * *DateType* stored as days since the Unix epoch
> * *TimestampType* stored as microseconds since the Unix epoch
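The storage spec above can be sketched with plain Python arithmetic (a sketch of the encoding, not PySpark code; the helper names are illustrative):

```python
import datetime

EPOCH_DATE = datetime.date(1970, 1, 1)
EPOCH_TS = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)

def date_to_days(d: datetime.date) -> int:
    """DateType: days since the Unix epoch."""
    return (d - EPOCH_DATE).days

def timestamp_to_micros(ts: datetime.datetime) -> int:
    """TimestampType: microseconds since the Unix epoch."""
    return int((ts - EPOCH_TS).total_seconds() * 1_000_000)

assert date_to_days(datetime.date(1970, 1, 2)) == 1
assert timestamp_to_micros(
    datetime.datetime(1970, 1, 1, 0, 0, 1, tzinfo=datetime.timezone.utc)
) == 1_000_000
```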



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org