You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ueshin <gi...@git.apache.org> on 2017/08/14 03:37:36 UTC

[GitHub] spark pull request #18933: [SPARK-21722][SQL][PYTHON] Enable timezone-aware ...

GitHub user ueshin opened a pull request:

    https://github.com/apache/spark/pull/18933

    [SPARK-21722][SQL][PYTHON] Enable timezone-aware timestamp type when creating Pandas DataFrame.

    ## What changes were proposed in this pull request?
    
    Make Pandas DataFrame with timezone-aware timestamp type when converting `DataFrame` to Pandas DataFrame by `pyspark.sql.DataFrame.toPandas`.
    The session local timezone is used for the timezone.
    
    ## How was this patch tested?
    
    Added a test and existing tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-21722

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18933.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18933
    
----
commit 0f182d0f6d8e1eb3b92e7e0eb39f2616235c1368
Author: Takuya UESHIN <ue...@databricks.com>
Date:   2017-08-14T02:08:03Z

    Enable timezone-aware timestamp type when creating Pandas DataFrame.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18933#discussion_r145717813
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1760,6 +1760,17 @@ def toPandas(self):
     
                 for f, t in dtype.items():
                     pdf[f] = pdf[f].astype(t, copy=False)
    +
    +            if self.sql_ctx.getConf("spark.sql.execution.pandas.timeZoneAware", "false").lower() \
    --- End diff --
    
    I'd like to treat it as a bug and always respect the session local timezone.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    I was wondering what your thoughts were on what this conf should do for the case that Arrow was enabled and `spark.sql.execution.pandas.timeZoneAware` was `false`?  Would the time zone be converted to local time and removed, sort of the opposite way as done here when it is `true`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    **[Test build #80608 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80608/testReport)** for PR 18933 at commit [`0f182d0`](https://github.com/apache/spark/commit/0f182d0f6d8e1eb3b92e7e0eb39f2616235c1368).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18933#discussion_r133255672
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -912,6 +912,14 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val PANDAS_TIMEZONE_AWARE =
    --- End diff --
    
    Yes, I agree with this. There is also inconsistent behavior when bringing data into Spark because `TimestampType.toInternal` does a conversion with local time and not with session local timezone.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    **[Test build #80622 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80622/testReport)** for PR 18933 at commit [`7df7ac9`](https://github.com/apache/spark/commit/7df7ac941da56ee9ae894ada3ae30661fddd4b03).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18933#discussion_r133229705
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -912,6 +912,14 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val PANDAS_TIMEZONE_AWARE =
    --- End diff --
    
    There are other parts of the pyspark that doesn't use session local timezone. For instance, df.collect() and (maybe) python udf execution.
    
    I am worried that having those to be inconsistent (some use local timezone, some doesn't) and complex (one configuration for each of these functionality?)
    
    While it will be harder to fix, but how about we use one configuration to control the behavior of `df.toPandas()` and `df.collect()` and python udf regarding session local timezone?
     


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin closed the pull request at:

    https://github.com/apache/spark/pull/18933


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18933#discussion_r145888158
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1760,6 +1760,17 @@ def toPandas(self):
     
                 for f, t in dtype.items():
                     pdf[f] = pdf[f].astype(t, copy=False)
    +
    +            if self.sql_ctx.getConf("spark.sql.execution.pandas.timeZoneAware", "false").lower() \
    --- End diff --
    
    We still need a conf, even if it is a bug. This is just to avoid breaking any existing app. We can remove the conf in Spark 3.x.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    I opened https://issues.apache.org/jira/browse/SPARK-23314 @icexelloss 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    Ping. I ran into this exact issue with pandas_udf on a simple data set with a timestamp type column.
    As far as I can tell, there is no way to around this since pandas code is running deep inside pyspark and the only workaround is to make the column a string?
    
    @BryanCutler @ueshin @icexelloss @HyukjinKwon  any thought on how to fix this?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    I thought this is resolved. @felixcheung can you give an example of the issue you ran into?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80622/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18933#discussion_r133254340
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2507,6 +2507,37 @@ def test_to_pandas(self):
             self.assertEquals(types[2], np.bool)
             self.assertEquals(types[3], np.float32)
     
    +    @unittest.skipIf(not _have_pandas, "Pandas not installed")
    +    def test_to_pandas_timezone_aware(self):
    +        import pandas as pd
    +        from dateutil import tz
    +        tzlocal = tz.tzlocal()
    +        ts = datetime.datetime(1970, 1, 1)
    +        pdf = pd.DataFrame.from_records([[ts]], columns=['ts'])
    +
    +        self.spark.conf.set('spark.sql.session.timeZone', 'America/Los_Angeles')
    +
    +        schema = StructType().add("ts", TimestampType())
    +        df = self.spark.createDataFrame([(ts,)], schema)
    +
    +        pdf_naive = df.toPandas()
    +        self.assertEqual(pdf_naive['ts'][0].tzinfo, None)
    +        self.assertTrue(pdf_naive.equals(pdf))
    --- End diff --
    
    This is not really a test that `df.toPandas()` is time zone naive.  If that was true then you should be able to do
    ```
    df = self.spark.createDataFrame([(ts,)], schema)
    os.environ["TZ"] = "America/New_York"
    time.tzset()
    pdf_naive = df.toPandas()
    self.assertTrue(pdf_naive.equals(pdf))
    ``` 
    but this will fail because `toPandas()` does a conversion to local time, which is what the original data happens to be


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [SPARK-21722][SQL][PYTHON] Enable timezone-aware timesta...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    **[Test build #80608 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80608/testReport)** for PR 18933 at commit [`0f182d0`](https://github.com/apache/spark/commit/0f182d0f6d8e1eb3b92e7e0eb39f2616235c1368).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18933#discussion_r145888010
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -912,6 +912,14 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val PANDAS_TIMEZONE_AWARE =
    +    buildConf("spark.sql.execution.pandas.timeZoneAware")
    +      .internal()
    +      .doc("When true, make Pandas DataFrame with timezone-aware timestamp type when converting " +
    +        "by pyspark.sql.DataFrame.toPandas. The session local timezone is used for the timezone.")
    +      .booleanConf
    +      .createWithDefault(false)
    --- End diff --
    
    We can change the default to `true`, since we agree that this is a bug. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    I'd close this for now. We can open another pr if needed. We need another implementation anyway.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18933#discussion_r133209362
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2507,6 +2507,37 @@ def test_to_pandas(self):
             self.assertEquals(types[2], np.bool)
             self.assertEquals(types[3], np.float32)
     
    +    @unittest.skipIf(not _have_pandas, "Pandas not installed")
    +    def test_to_pandas_timezone_aware(self):
    +        import pandas as pd
    +        from dateutil import tz
    +        tzlocal = tz.tzlocal()
    +        ts = datetime.datetime(1970, 1, 1)
    +        pdf = pd.DataFrame.from_records([[ts]], columns=['ts'])
    +
    +        self.spark.conf.set('spark.sql.session.timeZone', 'America/Los_Angeles')
    +
    +        schema = StructType().add("ts", TimestampType())
    +        df = self.spark.createDataFrame([(ts,)], schema)
    +
    +        pdf_naive = df.toPandas()
    +        self.assertEqual(pdf_naive['ts'][0].tzinfo, None)
    +        self.assertTrue(pdf_naive.equals(pdf))
    +
    +        self.spark.conf.set('spark.sql.execution.pandas.timeZoneAware', 'true')
    +
    +        pdf_pst = df.toPandas()
    +        self.assertEqual(pdf_pst['ts'][0].tzinfo.zone, 'America/Los_Angeles')
    +        self.assertFalse(pdf_pst.equals(pdf))
    +
    +        pdf_pst_naive = pdf_pst.copy()
    +        pdf_pst_naive['ts'] = pdf_pst_naive['ts'].apply(
    +            lambda ts: ts.tz_convert(tzlocal).tz_localize(None))
    +        self.assertTrue(pdf_pst_naive.equals(pdf))
    +
    +        self.spark.conf.unset('spark.sql.execution.pandas.timeZoneAware')
    +        self.spark.conf.unset('spark.sql.session.timeZone')
    --- End diff --
    
    (Not a big deal but we could use `finally` just in case this test fails and other tests do not get affected by this test failure in the future)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    Hi @ueshin , I've been following SPARK-12297 PR https://github.com/apache/spark/pull/19250 that deals with some of the same issues as here.  I think they are proposing a conf that the user could set to make timestamps tz naive.  Do you think that might apply here or is it specific to SQL/Hive tables?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    **[Test build #80622 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80622/testReport)** for PR 18933 at commit [`7df7ac9`](https://github.com/apache/spark/commit/7df7ac941da56ee9ae894ada3ae30661fddd4b03).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18933
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80608/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org