You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ueshin <gi...@git.apache.org> on 2017/10/30 05:32:51 UTC

[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

GitHub user ueshin opened a pull request:

    https://github.com/apache/spark/pull/19607

    [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone

    ## What changes were proposed in this pull request?
    
    When converting Pandas DataFrame/Series from/to Spark DataFrame using `toPandas()` or pandas udfs, timestamp values behave to respect Python system timezone instead of session timezone.
    
    For example, let's say we use `"America/Los_Angeles"` as session timezone and have a timestamp value `"1970-01-01 00:00:01"` in the timezone. Btw, I'm in Japan so Python timezone would be `"Asia/Tokyo"`.
    
    The timestamp value from current `toPandas()` will be the following:
    
    ```
    >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts")
    >>> df.show()
    +-------------------+
    |                 ts|
    +-------------------+
    |1970-01-01 00:00:01|
    +-------------------+
    
    >>> df.toPandas()
                       ts
    0 1970-01-01 17:00:01
    ```
    
    As you can see, the value becomes `"1970-01-01 17:00:01"` because it respects Python timezone.
    As we discussed in #18664, we consider this behavior is a bug and the value should be `"1970-01-01 00:00:01"`.
    
    ## How was this patch tested?
    
    Added tests and existing tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-22395

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19607.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19607
    
----
commit 4735e5981ecf3a4bce50ce86f706e25830f4a801
Author: Takuya UESHIN <ue...@databricks.com>
Date:   2017-10-23T06:27:22Z

    Add a conf to make Pandas DataFrame respect session local timezone.

commit 1f85150dc5b26df21dca6bad2ef4eaec342c4400
Author: Takuya UESHIN <ue...@databricks.com>
Date:   2017-10-23T08:09:16Z

    Fix toPandas() behavior.

commit 5c08ecf247bfe7e14afcdef8eba1c25cb3b68634
Author: Takuya UESHIN <ue...@databricks.com>
Date:   2017-10-23T09:15:47Z

    Modify pandas UDFs to respect session timezone.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83295/testReport)** for PR 19607 at commit [`6872516`](https://github.com/apache/spark/commit/6872516e8cd9d7f81929c38708571c69a0af7883).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84204/testReport)** for PR 19607 at commit [`f92eae3`](https://github.com/apache/spark/commit/f92eae35767a766ad80ac576a67f521e365549c7).
     * This patch **fails due to an unknown error code, -9**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84013 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84013/testReport)** for PR 19607 at commit [`9c94f90`](https://github.com/apache/spark/commit/9c94f90a703daaf08887259c1757420477a95b94).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83474/testReport)** for PR 19607 at commit [`9101a3a`](https://github.com/apache/spark/commit/9101a3a12f17b5bd633756139eaa2cb3ee9bb33c).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83780/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83472/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84044 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84044/testReport)** for PR 19607 at commit [`9cfdde2`](https://github.com/apache/spark/commit/9cfdde2ce07d00779a9f6f8f5ab86cc442b7655b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r148601935
  
    --- Diff: python/pyspark/serializers.py ---
    @@ -274,12 +278,13 @@ def load_stream(self, stream):
             """
             Deserialize ArrowRecordBatches to an Arrow table and return as a list of pandas.Series.
             """
    -        from pyspark.sql.types import _check_dataframe_localize_timestamps
    +        from pyspark.sql.types import _check_dataframe_localize_timestamps, from_arrow_schema
             import pyarrow as pa
             reader = pa.open_stream(stream)
    +        schema = from_arrow_schema(reader.schema)
             for batch in reader:
                 # NOTE: changed from pa.Columns.to_pandas, timezone issue in conversion fixed in 0.7.1
    -            pdf = _check_dataframe_localize_timestamps(batch.to_pandas())
    +            pdf = _check_dataframe_localize_timestamps(batch.to_pandas(), schema, self._timezone)
    --- End diff --
    
    Oh, maybe I misunderstood the purpose of this conf "spark.sql.execution.pandas.respectSessionTimeZone".  If that is true then what is the behavior of Spark?
    
    1) convert timestamps in Pandas to remove the timezone and localize to SESSION_LOCAL_TIMEZONE
    
    2) show Pandas timestamps with SESSION_LOCAL_TIMEZONE set as the timezone
    
    It seems this change is doing (1), but what's wrong with doing (2)?  I think that would be a lot cleaner


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84205/testReport)** for PR 19607 at commit [`40a9735`](https://github.com/apache/spark/commit/40a9735b88deb85c08f618186daf9ed2152fc406).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,`
      * `class _ImageSchema(object):`
      * `    raise RuntimeError(\"Creating instance of _ImageSchema class is disallowed.\")`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r152174813
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1913,7 +1920,16 @@ def toPandas(self):
     
                 for f, t in dtype.items():
                     pdf[f] = pdf[f].astype(t, copy=False)
    -            return pdf
    +
    +            if timezone is None:
    +                return pdf
    +            else:
    +                from pyspark.sql.types import _check_series_convert_timestamps_local_tz
    +                for field in self.schema:
    +                    if isinstance(field.dataType, TimestampType):
    --- End diff --
    
    Sure, I'll add it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Jenkins, retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83593 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83593/testReport)** for PR 19607 at commit [`292678f`](https://github.com/apache/spark/commit/292678f3e47a4f5a20fd1af5da10e02cc4017882).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84242/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    I have no idea about the reason but seems like there is a difference of handling DST in `spark.createDataFrame(data, schema=schema)` between Jenkins and my local environments.
    
    The debug print by `df.show()` from the test [tests.py#L3487](https://github.com/ueshin/apache-spark/blob/9101a3a12f17b5bd633756139eaa2cb3ee9bb33c/python/pyspark/sql/tests.py#L3487) was:
    
    ```
    +---+-------------------+
    |idx|          timestamp|
    +---+-------------------+
    |  0|1969-01-01 01:01:01|
    |  1|2012-02-02 02:02:02|
    |  2|               null|
    |  3|2100-04-04 04:04:04|
    +---+-------------------+
    ```
    
    but in my local:
    
    ```
    +---+-------------------+
    |idx|          timestamp|
    +---+-------------------+
    |  0|1969-01-01 01:01:01|
    |  1|2012-02-02 02:02:02|
    |  2|               null|
    |  3|2100-04-04 05:04:04|
    +---+-------------------+
    ```
    
    Could you please let me know if I miss something or you have any ideas?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    My local is:
    
    ```
    +---+-------------------+
    |idx|          timestamp|
    +---+-------------------+
    |  0|1969-01-01 01:01:01|
    |  1|2012-02-02 02:02:02|
    |  2|               null|
    |  3|2100-04-04 05:04:04|
    +---+-------------------+
    ```
    
    too. Will try to take a look as well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r149001569
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1629,35 +1629,112 @@ def to_arrow_type(dt):
         return arrow_type
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def from_arrow_type(at):
    +    """ Convert pyarrow type to Spark data type.
    +    """
    +    # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
    +    import pyarrow as pa
    +    if at == pa.bool_():
    +        spark_type = BooleanType()
    +    elif at == pa.int8():
    +        spark_type = ByteType()
    +    elif at == pa.int16():
    +        spark_type = ShortType()
    +    elif at == pa.int32():
    +        spark_type = IntegerType()
    +    elif at == pa.int64():
    +        spark_type = LongType()
    +    elif at == pa.float32():
    +        spark_type = FloatType()
    +    elif at == pa.float64():
    +        spark_type = DoubleType()
    +    elif type(at) == pa.DecimalType:
    +        spark_type = DecimalType(precision=at.precision, scale=at.scale)
    +    elif at == pa.string():
    +        spark_type = StringType()
    +    elif at == pa.date32():
    +        spark_type = DateType()
    +    elif type(at) == pa.TimestampType:
    +        spark_type = TimestampType()
    +    else:
    +        raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
    +    return spark_type
    +
    +
    +def from_arrow_schema(arrow_schema):
    +    """ Convert schema from Arrow to Spark.
    +    """
    +    return StructType(
    +        [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
    +         for field in arrow_schema])
    +
    +
    +def _check_dataframe_localize_timestamps(pdf, schema, timezone):
         """
         Convert timezone aware timestamps to timezone-naive in local time
     
         :param pdf: pandas.DataFrame
         :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
         """
    -    from pandas.api.types import is_datetime64tz_dtype
    -    for column, series in pdf.iteritems():
    -        # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    -        if is_datetime64tz_dtype(series.dtype):
    -            pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +    import pandas as pd
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    +        tz = timezone or 'tzlocal()'
    +        for column, series in pdf.iteritems():
    +            if type(schema[str(column)].dataType) == TimestampType:
    +                # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                if is_datetime64tz_dtype(series.dtype):
    +                    pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
    +                elif is_datetime64_dtype(series.dtype) and timezone is not None:
    +                    # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
    +                    pdf[column] = series.apply(
    +                        lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
    +                        if ts is not pd.NaT else pd.NaT)
    +    except ImportError:
    --- End diff --
    
    Is this import error due to version difference of pandas?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83774 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83774/testReport)** for PR 19607 at commit [`8b1a4d8`](https://github.com/apache/spark/commit/8b1a4d8eb1c73d87eb3867fe4c1876cb9c48b2cf).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    @BryanCutler I'm fixing the behavior of `toPandas()` and pandas udfs as we discussed in #18664 but I guess we still need to support old Pandas as well.
    I tried to find out a workaround for old Pandas, but I haven't done yet.
    Do you have any ideas for the workaround? cc @wesm @HyukjinKwon @viirya 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83473 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83473/testReport)** for PR 19607 at commit [`ba3d6e3`](https://github.com/apache/spark/commit/ba3d6e3cf679e3db0a2e095f8cbe099ab4260532).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84210/testReport)** for PR 19607 at commit [`40a9735`](https://github.com/apache/spark/commit/40a9735b88deb85c08f618186daf9ed2152fc406).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,`
      * `class _ImageSchema(object):`
      * `    raise RuntimeError(\"Creating instance of _ImageSchema class is disallowed.\")`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83207/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    > I tried to find out a workaround for old Pandas, but I haven't done yet.
    
    I haven't looked at this closely yet but will definitely try to take a look and help soon together. I would appreciate it if the problem (or just symptoms, or just a pointer ..) can be given though if it is not too complex.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153386033
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1678,37 +1679,105 @@ def from_arrow_schema(arrow_schema):
              for field in arrow_schema])
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def _old_pandas_exception_message(e):
    +    """ Create an error message for importing old Pandas.
         """
    -    Convert timezone aware timestamps to timezone-naive in local time
    +    msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
    +    return "%s\n%s" % (_exception_message(e), msg)
    +
    +
    +def _check_dataframe_localize_timestamps(pdf, timezone):
    +    """
    +    Convert timezone aware timestamps to timezone-naive in the specified timezone or local timezone
     
         :param pdf: pandas.DataFrame
    -    :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
    +    :param timezone: the timezone to convert. if None then use local timezone
    +    :return pandas.DataFrame where any timezone aware columns have been converted to tz-naive
         """
    -    from pandas.api.types import is_datetime64tz_dtype
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype
    +    except ImportError as e:
    +        raise ImportError(_old_pandas_exception_message(e))
    +    tz = timezone or 'tzlocal()'
         for column, series in pdf.iteritems():
             # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
             if is_datetime64tz_dtype(series.dtype):
    -            pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +            pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
         return pdf
     
     
    -def _check_series_convert_timestamps_internal(s):
    +def _check_series_convert_timestamps_internal(s, timezone):
         """
    -    Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
    +    Convert a tz-naive timestamp in the specified timezone or local timezone to UTC normalized for
    +    Spark internal storage
    +
         :param s: a pandas.Series
    +    :param timezone: the timezone to convert. if None then use local timezone
         :return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
         """
    -    from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
    +    try:
    +        from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
    +    except ImportError as e:
    +        raise ImportError(_old_pandas_exception_message(e))
         # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
         if is_datetime64_dtype(s.dtype):
    -        return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
    +        tz = timezone or 'tzlocal()'
    +        return s.dt.tz_localize(tz).dt.tz_convert('UTC')
         elif is_datetime64tz_dtype(s.dtype):
             return s.dt.tz_convert('UTC')
         else:
             return s
     
     
    +def _check_series_convert_timestamps_localize(s, fromTimezone, toTimezone):
    --- End diff --
    
    Thanks, I'll update it. Maybe `toTimestamp` -> `to_timestamp` as well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83212/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83474/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153142283
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1678,37 +1679,105 @@ def from_arrow_schema(arrow_schema):
              for field in arrow_schema])
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def _old_pandas_exception_message(e):
    +    """ Create an error message for importing old Pandas.
         """
    -    Convert timezone aware timestamps to timezone-naive in local time
    +    msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
    +    return "%s\n%s" % (_exception_message(e), msg)
    +
    +
    +def _check_dataframe_localize_timestamps(pdf, timezone):
    +    """
    +    Convert timezone aware timestamps to timezone-naive in the specified timezone or local timezone
     
         :param pdf: pandas.DataFrame
    -    :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
    +    :param timezone: the timezone to convert. if None then use local timezone
    +    :return pandas.DataFrame where any timezone aware columns have been converted to tz-naive
         """
    -    from pandas.api.types import is_datetime64tz_dtype
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype
    +    except ImportError as e:
    +        raise ImportError(_old_pandas_exception_message(e))
    +    tz = timezone or 'tzlocal()'
         for column, series in pdf.iteritems():
             # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
             if is_datetime64tz_dtype(series.dtype):
    -            pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +            pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
         return pdf
     
     
    -def _check_series_convert_timestamps_internal(s):
    +def _check_series_convert_timestamps_internal(s, timezone):
         """
    -    Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
    +    Convert a tz-naive timestamp in the specified timezone or local timezone to UTC normalized for
    +    Spark internal storage
    +
         :param s: a pandas.Series
    +    :param timezone: the timezone to convert. if None then use local timezone
         :return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
         """
    -    from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
    +    try:
    +        from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
    +    except ImportError as e:
    +        raise ImportError(_old_pandas_exception_message(e))
         # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
         if is_datetime64_dtype(s.dtype):
    -        return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
    +        tz = timezone or 'tzlocal()'
    +        return s.dt.tz_localize(tz).dt.tz_convert('UTC')
         elif is_datetime64tz_dtype(s.dtype):
             return s.dt.tz_convert('UTC')
         else:
             return s
     
     
    +def _check_series_convert_timestamps_localize(s, fromTimezone, toTimezone):
    --- End diff --
    
    Nit: maybe `from_timezone` .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84070/testReport)** for PR 19607 at commit [`d741171`](https://github.com/apache/spark/commit/d7411717c7c8c7c7aba63e4e19ebdfadfa7ea0c0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83474 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83474/testReport)** for PR 19607 at commit [`9101a3a`](https://github.com/apache/spark/commit/9101a3a12f17b5bd633756139eaa2cb3ee9bb33c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83777 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83777/testReport)** for PR 19607 at commit [`8b1a4d8`](https://github.com/apache/spark/commit/8b1a4d8eb1c73d87eb3867fe4c1876cb9c48b2cf).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r148576678
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1629,35 +1629,121 @@ def to_arrow_type(dt):
         return arrow_type
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def to_arrow_schema(schema):
    +    """ Convert a schema from Spark to Arrow
    +    """
    +    import pyarrow as pa
    +    fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
    +              for field in schema]
    +    return pa.schema(fields)
    +
    +
    +def from_arrow_type(at):
    +    """ Convert pyarrow type to Spark data type.
    +    """
    +    # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
    +    import pyarrow as pa
    +    if at == pa.bool_():
    +        spark_type = BooleanType()
    +    elif at == pa.int8():
    +        spark_type = ByteType()
    +    elif at == pa.int16():
    +        spark_type = ShortType()
    +    elif at == pa.int32():
    +        spark_type = IntegerType()
    +    elif at == pa.int64():
    +        spark_type = LongType()
    +    elif at == pa.float32():
    +        spark_type = FloatType()
    +    elif at == pa.float64():
    +        spark_type = DoubleType()
    +    elif type(at) == pa.DecimalType:
    +        spark_type = DecimalType(precision=at.precision, scale=at.scale)
    +    elif at == pa.string():
    +        spark_type = StringType()
    +    elif at == pa.date32():
    +        spark_type = DateType()
    +    elif type(at) == pa.TimestampType:
    +        spark_type = TimestampType()
    +    else:
    +        raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
    +    return spark_type
    +
    +
    +def from_arrow_schema(arrow_schema):
    +    """ Convert schema from Arrow to Spark.
    +    """
    +    return StructType(
    +        [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
    +         for field in arrow_schema])
    +
    +
    +def _check_dataframe_localize_timestamps(pdf, schema, timezone):
         """
         Convert timezone aware timestamps to timezone-naive in local time
     
         :param pdf: pandas.DataFrame
         :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
         """
    -    from pandas.api.types import is_datetime64tz_dtype
    -    for column, series in pdf.iteritems():
    -        # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    -        if is_datetime64tz_dtype(series.dtype):
    -            pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +    import pandas as pd
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    +        tz = timezone or 'tzlocal()'
    +        for column, series in pdf.iteritems():
    +            if type(schema[str(column)].dataType) == TimestampType:
    +                # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                if is_datetime64tz_dtype(series.dtype):
    +                    pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
    +                elif is_datetime64_dtype(series.dtype) and timezone is not None:
    +                    # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
    +                    pdf[column] = series.apply(
    +                        lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
    +                        if ts is not pd.NaT else pd.NaT)
    +    except ImportError:
    +        from pandas.core.common import is_datetime64_dtype
    +        from pandas.tslib import _dateutil_tzlocal
    +        tzlocal = _dateutil_tzlocal()
    +        tz = timezone or tzlocal
    +        for column, series in pdf.iteritems():
    +            if type(schema[str(column)].dataType) == TimestampType:
    +                # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                if not is_datetime64_dtype(series.dtype):
    +                    # `series.dt.tz_convert(tzlocal).dt.tz_localize(None)` doesn't work properly.
    +                    pdf[column] = pd.Series([ts.tz_convert(tz).tz_localize(None)
    +                                             if ts is not pd.NaT else pd.NaT for ts in series])
    +                elif is_datetime64_dtype(series.dtype) and timezone is not None:
    +                    # `series.dt.tz_localize(tzlocal)` doesn't work properly.
    +                    pdf[column] = pd.Series(
    +                        [ts.tz_localize(tzlocal).tz_convert(tz).tz_localize(None)
    +                         if ts is not pd.NaT else pd.NaT for ts in series])
         return pdf
     
     
    -def _check_series_convert_timestamps_internal(s):
    +def _check_series_convert_timestamps_internal(s, timezone):
         """
         Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
         :param s: a pandas.Series
         :return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
         """
    -    from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
    -    # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    -    if is_datetime64_dtype(s.dtype):
    -        return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
    -    elif is_datetime64tz_dtype(s.dtype):
    -        return s.dt.tz_convert('UTC')
    -    else:
    -        return s
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    +        # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +        if is_datetime64_dtype(s.dtype):
    +            tz = timezone or 'tzlocal()'
    +            return s.dt.tz_localize(tz).dt.tz_convert('UTC')
    +        elif is_datetime64tz_dtype(s.dtype):
    +            return s.dt.tz_convert('UTC')
    +        else:
    +            return s
    +    except ImportError:
    --- End diff --
    
    Sure, let me look into it a little more and summarize what version we can support.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153107765
  
    --- Diff: python/setup.py ---
    @@ -201,7 +201,7 @@ def _supports_symlinks():
             extras_require={
                 'ml': ['numpy>=1.7'],
                 'mllib': ['numpy>=1.7'],
    -            'sql': ['pandas>=0.13.0']
    +            'sql': ['pandas>=0.19.2']
    --- End diff --
    
    Sure, I'll add it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83774 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83774/testReport)** for PR 19607 at commit [`8b1a4d8`](https://github.com/apache/spark/commit/8b1a4d8eb1c73d87eb3867fe4c1876cb9c48b2cf).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84204/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83212/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84210/testReport)** for PR 19607 at commit [`40a9735`](https://github.com/apache/spark/commit/40a9735b88deb85c08f618186daf9ed2152fc406).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83324/testReport)** for PR 19607 at commit [`1f096bf`](https://github.com/apache/spark/commit/1f096bf32f742945363cc7d9af978041ad77408b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83472/testReport)** for PR 19607 at commit [`ce07f39`](https://github.com/apache/spark/commit/ce07f39643ef9711419e7e11082e62c578d816f5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Yup it should be separate. I meant to file another JIRA while we are here if it is something we need to fix before we fix before forgetting. If `df.collect()` is not meant to be fixed, I think I should reread the discussion and _maybe_ resolve the R JIRA.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83592 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83592/testReport)** for PR 19607 at commit [`1e0f217`](https://github.com/apache/spark/commit/1e0f21715f5ad053b5a5677a129677d498946ea3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    AllI know is here:
    
    https://github.com/apache/spark/blob/aad2125475dcdeb4a0410392b6706511db17bac4/python/setup.py#L199-L205
    
    and few places in documentation:
    
    https://github.com/apache/spark/tree/c8d0aba198c0f593c2b6b656c23b3d0fb7ea98a2/python#python-requirements


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    > Do we need the config "spark.sql.execution.pandas.respectSessionTimeZone"?
    
    I think we don't need this in this case. If this discussion lasts without a conclusion, I think we should better open a discussion in the mailing list to deduplicate such discussion in the future. 
    
    > What version of Pandas should we support?
    
    I think this does not block this PR though. I don't have a strong opinion on this. Just FYI, there are some information that might help:
    
    Pandas 0.19.2 seems [released in December 24, 2016](https://pandas.pydata.org/pandas-docs/stable/release.html#pandas-0-19-2) 
    Spark 2.1.0 seems [released in December 28, 2016](https://spark.apache.org/news/index.html).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Jenkins, retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r148576656
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1629,35 +1629,121 @@ def to_arrow_type(dt):
         return arrow_type
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def to_arrow_schema(schema):
    --- End diff --
    
    Ah, currently it isn't used. I'll remove it for now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83211/testReport)** for PR 19607 at commit [`e28fc87`](https://github.com/apache/spark/commit/e28fc873cdbfd25189df35ce54c7c8ede9e1cc99).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Sorry, but does anyone remember how we are going to deal with `df.collect()` in PySpark? R fix should be more like `df.collect()`. It should be good to file a JIRA for `df.collect()` in PySpark too while we are here, if I haven't missed some discussion about it.
    
    Filed for R anyway - https://issues.apache.org/jira/browse/SPARK-22632.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Thanks. I was just worried if I missed any discussion somewhere and wanted to double check.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83475/testReport)** for PR 19607 at commit [`9101a3a`](https://github.com/apache/spark/commit/9101a3a12f17b5bd633756139eaa2cb3ee9bb33c).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Correct me if I'm wrong, but I think depending on what version of Pandas we support this PR could be simplified a little to only use `is_datetime64tz_dtype` that is in 0.19.2+.  So it might be a good idea to decide on a version here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83593/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83295/testReport)** for PR 19607 at commit [`6872516`](https://github.com/apache/spark/commit/6872516e8cd9d7f81929c38708571c69a0af7883).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83475/testReport)** for PR 19607 at commit [`9101a3a`](https://github.com/apache/spark/commit/9101a3a12f17b5bd633756139eaa2cb3ee9bb33c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84062/testReport)** for PR 19607 at commit [`3e23653`](https://github.com/apache/spark/commit/3e23653572b9d555f5479cdd58ac3e15c7f88c28).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84062/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153142413
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1678,37 +1679,105 @@ def from_arrow_schema(arrow_schema):
              for field in arrow_schema])
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def _old_pandas_exception_message(e):
    +    """ Create an error message for importing old Pandas.
         """
    -    Convert timezone aware timestamps to timezone-naive in local time
    +    msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
    +    return "%s\n%s" % (_exception_message(e), msg)
    +
    +
    +def _check_dataframe_localize_timestamps(pdf, timezone):
    +    """
    +    Convert timezone aware timestamps to timezone-naive in the specified timezone or local timezone
     
         :param pdf: pandas.DataFrame
    -    :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
    +    :param timezone: the timezone to convert. if None then use local timezone
    +    :return pandas.DataFrame where any timezone aware columns have been converted to tz-naive
         """
    -    from pandas.api.types import is_datetime64tz_dtype
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype
    +    except ImportError as e:
    +        raise ImportError(_old_pandas_exception_message(e))
    +    tz = timezone or 'tzlocal()'
         for column, series in pdf.iteritems():
             # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
             if is_datetime64tz_dtype(series.dtype):
    -            pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +            pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
         return pdf
     
     
    -def _check_series_convert_timestamps_internal(s):
    +def _check_series_convert_timestamps_internal(s, timezone):
         """
    -    Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
    +    Convert a tz-naive timestamp in the specified timezone or local timezone to UTC normalized for
    +    Spark internal storage
    +
         :param s: a pandas.Series
    +    :param timezone: the timezone to convert. if None then use local timezone
         :return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
         """
    -    from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
    +    try:
    +        from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
    +    except ImportError as e:
    +        raise ImportError(_old_pandas_exception_message(e))
         # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
         if is_datetime64_dtype(s.dtype):
    -        return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
    +        tz = timezone or 'tzlocal()'
    +        return s.dt.tz_localize(tz).dt.tz_convert('UTC')
         elif is_datetime64tz_dtype(s.dtype):
             return s.dt.tz_convert('UTC')
         else:
             return s
     
     
    +def _check_series_convert_timestamps_localize(s, fromTimezone, toTimezone):
    +    """
    +    Convert timestamp to timezone-naive in the specified timezone or local timezone
    +
    +    :param s: a pandas.Series
    +    :param fromTimezone: the timezone to convert from. if None then use local timezone
    +    :param toTimezone: the timezone to convert to. if None then use local timezone
    +    :return pandas.Series where if it is a timestamp, has been converted to tz-naive
    +    """
    +    try:
    +        import pandas as pd
    +        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    +    except ImportError as e:
    +        raise ImportError(_old_pandas_exception_message(e))
    +    fromTz = fromTimezone or 'tzlocal()'
    --- End diff --
    
    Ditto.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r149661483
  
    --- Diff: python/pyspark/sql/session.py ---
    @@ -557,7 +577,13 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
             except Exception:
                 has_pandas = False
             if has_pandas and isinstance(data, pandas.DataFrame):
    -            data, schema = self._convert_from_pandas(data, schema)
    +            if self.conf.get("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
    --- End diff --
    
    Do you mean we don't need the config and just fix the behavior? cc @gatorsmile 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Jenkins, retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83207/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19607


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r152646370
  
    --- Diff: python/setup.py ---
    @@ -201,7 +201,7 @@ def _supports_symlinks():
             extras_require={
                 'ml': ['numpy>=1.7'],
                 'mllib': ['numpy>=1.7'],
    -            'sql': ['pandas>=0.13.0']
    +            'sql': ['pandas>=0.19.2']
    --- End diff --
    
    Document this requirement and behavior changes in `Migration Guide`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83212/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84242/testReport)** for PR 19607 at commit [`9200f38`](https://github.com/apache/spark/commit/9200f38b6414255a5c60127aeeae517086ba108b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83578 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83578/testReport)** for PR 19607 at commit [`4adb073`](https://github.com/apache/spark/commit/4adb073f8d1454fbea0742a16b6d7662e063b37a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Unfortunately, `df.collect()` is out of scope of this pr. Its timestamp values will respect python system timezone.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83473/testReport)** for PR 19607 at commit [`ba3d6e3`](https://github.com/apache/spark/commit/ba3d6e3cf679e3db0a2e095f8cbe099ab4260532).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83320/testReport)** for PR 19607 at commit [`1f096bf`](https://github.com/apache/spark/commit/1f096bf32f742945363cc7d9af978041ad77408b).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Maybe we need at least 2 external libraries like `dateutil` and `pytz` to handle timezones properly, which Pandas uses, but I have no idea to handle timezones out of Pandas properly for now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r148540021
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1629,35 +1629,121 @@ def to_arrow_type(dt):
         return arrow_type
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def to_arrow_schema(schema):
    +    """ Convert a schema from Spark to Arrow
    +    """
    +    import pyarrow as pa
    +    fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
    +              for field in schema]
    +    return pa.schema(fields)
    +
    +
    +def from_arrow_type(at):
    +    """ Convert pyarrow type to Spark data type.
    +    """
    +    # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
    +    import pyarrow as pa
    +    if at == pa.bool_():
    +        spark_type = BooleanType()
    +    elif at == pa.int8():
    +        spark_type = ByteType()
    +    elif at == pa.int16():
    +        spark_type = ShortType()
    +    elif at == pa.int32():
    +        spark_type = IntegerType()
    +    elif at == pa.int64():
    +        spark_type = LongType()
    +    elif at == pa.float32():
    +        spark_type = FloatType()
    +    elif at == pa.float64():
    +        spark_type = DoubleType()
    +    elif type(at) == pa.DecimalType:
    +        spark_type = DecimalType(precision=at.precision, scale=at.scale)
    +    elif at == pa.string():
    +        spark_type = StringType()
    +    elif at == pa.date32():
    +        spark_type = DateType()
    +    elif type(at) == pa.TimestampType:
    +        spark_type = TimestampType()
    +    else:
    +        raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
    +    return spark_type
    +
    +
    +def from_arrow_schema(arrow_schema):
    +    """ Convert schema from Arrow to Spark.
    +    """
    +    return StructType(
    +        [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
    +         for field in arrow_schema])
    +
    +
    +def _check_dataframe_localize_timestamps(pdf, schema, timezone):
         """
         Convert timezone aware timestamps to timezone-naive in local time
     
         :param pdf: pandas.DataFrame
         :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
         """
    -    from pandas.api.types import is_datetime64tz_dtype
    -    for column, series in pdf.iteritems():
    -        # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    -        if is_datetime64tz_dtype(series.dtype):
    -            pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +    import pandas as pd
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    +        tz = timezone or 'tzlocal()'
    +        for column, series in pdf.iteritems():
    +            if type(schema[str(column)].dataType) == TimestampType:
    +                # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                if is_datetime64tz_dtype(series.dtype):
    +                    pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
    +                elif is_datetime64_dtype(series.dtype) and timezone is not None:
    +                    # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
    +                    pdf[column] = series.apply(
    +                        lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
    +                        if ts is not pd.NaT else pd.NaT)
    +    except ImportError:
    +        from pandas.core.common import is_datetime64_dtype
    +        from pandas.tslib import _dateutil_tzlocal
    +        tzlocal = _dateutil_tzlocal()
    +        tz = timezone or tzlocal
    +        for column, series in pdf.iteritems():
    +            if type(schema[str(column)].dataType) == TimestampType:
    +                # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                if not is_datetime64_dtype(series.dtype):
    +                    # `series.dt.tz_convert(tzlocal).dt.tz_localize(None)` doesn't work properly.
    +                    pdf[column] = pd.Series([ts.tz_convert(tz).tz_localize(None)
    +                                             if ts is not pd.NaT else pd.NaT for ts in series])
    +                elif is_datetime64_dtype(series.dtype) and timezone is not None:
    +                    # `series.dt.tz_localize(tzlocal)` doesn't work properly.
    +                    pdf[column] = pd.Series(
    +                        [ts.tz_localize(tzlocal).tz_convert(tz).tz_localize(None)
    +                         if ts is not pd.NaT else pd.NaT for ts in series])
         return pdf
     
     
    -def _check_series_convert_timestamps_internal(s):
    +def _check_series_convert_timestamps_internal(s, timezone):
         """
         Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
         :param s: a pandas.Series
         :return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
         """
    -    from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
    -    # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    -    if is_datetime64_dtype(s.dtype):
    -        return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
    -    elif is_datetime64tz_dtype(s.dtype):
    -        return s.dt.tz_convert('UTC')
    -    else:
    -        return s
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    +        # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +        if is_datetime64_dtype(s.dtype):
    +            tz = timezone or 'tzlocal()'
    +            return s.dt.tz_localize(tz).dt.tz_convert('UTC')
    +        elif is_datetime64tz_dtype(s.dtype):
    +            return s.dt.tz_convert('UTC')
    +        else:
    +            return s
    +    except ImportError:
    --- End diff --
    
    I think we should bump up pandas version if we can't find a workaround.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153386019
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -3683,6 +3808,47 @@ def check_records_per_batch(x):
                 else:
                     self.spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", orig_value)
     
    +    def test_vectorized_udf_timestamps_respect_session_timezone(self):
    +        from pyspark.sql.functions import pandas_udf, col
    +        from datetime import datetime
    +        import pandas as pd
    +        schema = StructType([
    +            StructField("idx", LongType(), True),
    +            StructField("timestamp", TimestampType(), True)])
    +        data = [(1, datetime(1969, 1, 1, 1, 1, 1)),
    +                (2, datetime(2012, 2, 2, 2, 2, 2)),
    +                (3, None),
    +                (4, datetime(2100, 3, 3, 3, 3, 3))]
    +        df = self.spark.createDataFrame(data, schema=schema)
    +
    +        f_timestamp_copy = pandas_udf(lambda ts: ts, TimestampType())
    +        internal_value = pandas_udf(
    +            lambda ts: ts.apply(lambda ts: ts.value if ts is not pd.NaT else None), LongType())
    +
    +        orig_tz = self.spark.conf.get("spark.sql.session.timeZone")
    +        try:
    +            timezone = "America/New_York"
    +            self.spark.conf.set("spark.sql.session.timeZone", timezone)
    +            self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "false")
    +            try:
    +                df_la = df.withColumn("tscopy", f_timestamp_copy(col("timestamp"))) \
    +                    .withColumn("internal_value", internal_value(col("timestamp")))
    +                result_la = df_la.select(col("idx"), col("internal_value")).collect()
    +                diff = 3 * 60 * 60 * 1000 * 1000 * 1000
    --- End diff --
    
    Yes, I'll add some comments.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153176050
  
    --- Diff: python/pyspark/sql/session.py ---
    @@ -444,11 +445,30 @@ def _get_numpy_record_dtype(self, rec):
                 record_type_list.append((str(col_names[i]), curr_type))
             return np.dtype(record_type_list) if has_rec_fix else None
     
    -    def _convert_from_pandas(self, pdf):
    +    def _convert_from_pandas(self, pdf, schema, timezone):
    --- End diff --
    
    Just an idea not blocking this PR. Probably, we have enough codes to make a separate Python file / class to put Pandas / Arrow stuff into one place.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83473/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84210/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    >Aside: Could someone start preparing a patch that uses Arrow 0.8.x (where a lot of issues that surfaced throughout this process have been fixed)?
    
    @wesm I could start on this maybe in the next day or two


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84204/testReport)** for PR 19607 at commit [`f92eae3`](https://github.com/apache/spark/commit/f92eae35767a766ad80ac576a67f521e365549c7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Jenkins, retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83592 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83592/testReport)** for PR 19607 at commit [`1e0f217`](https://github.com/apache/spark/commit/1e0f21715f5ad053b5a5677a129677d498946ea3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    @HyukjinKwon do you know where pyspark defines its python library requirements?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83586 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83586/testReport)** for PR 19607 at commit [`1e0f217`](https://github.com/apache/spark/commit/1e0f21715f5ad053b5a5677a129677d498946ea3).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r148541382
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -948,6 +948,14 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val PANDAS_RESPECT_SESSION_LOCAL_TIMEZONE =
    --- End diff --
    
    can we clean up the code more if we don't have this config?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83324/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153386047
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1678,37 +1679,105 @@ def from_arrow_schema(arrow_schema):
              for field in arrow_schema])
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def _old_pandas_exception_message(e):
    +    """ Create an error message for importing old Pandas.
         """
    -    Convert timezone aware timestamps to timezone-naive in local time
    +    msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
    +    return "%s\n%s" % (_exception_message(e), msg)
    +
    +
    +def _check_dataframe_localize_timestamps(pdf, timezone):
    +    """
    +    Convert timezone aware timestamps to timezone-naive in the specified timezone or local timezone
     
         :param pdf: pandas.DataFrame
    -    :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
    +    :param timezone: the timezone to convert. if None then use local timezone
    +    :return pandas.DataFrame where any timezone aware columns have been converted to tz-naive
         """
    -    from pandas.api.types import is_datetime64tz_dtype
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype
    +    except ImportError as e:
    +        raise ImportError(_old_pandas_exception_message(e))
    +    tz = timezone or 'tzlocal()'
         for column, series in pdf.iteritems():
             # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
             if is_datetime64tz_dtype(series.dtype):
    -            pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +            pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
         return pdf
     
     
    -def _check_series_convert_timestamps_internal(s):
    +def _check_series_convert_timestamps_internal(s, timezone):
         """
    -    Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
    +    Convert a tz-naive timestamp in the specified timezone or local timezone to UTC normalized for
    +    Spark internal storage
    +
         :param s: a pandas.Series
    +    :param timezone: the timezone to convert. if None then use local timezone
         :return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
         """
    -    from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
    +    try:
    +        from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
    +    except ImportError as e:
    +        raise ImportError(_old_pandas_exception_message(e))
         # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
         if is_datetime64_dtype(s.dtype):
    -        return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
    +        tz = timezone or 'tzlocal()'
    +        return s.dt.tz_localize(tz).dt.tz_convert('UTC')
         elif is_datetime64tz_dtype(s.dtype):
             return s.dt.tz_convert('UTC')
         else:
             return s
     
     
    +def _check_series_convert_timestamps_localize(s, fromTimezone, toTimezone):
    +    """
    +    Convert timestamp to timezone-naive in the specified timezone or local timezone
    +
    +    :param s: a pandas.Series
    +    :param fromTimezone: the timezone to convert from. if None then use local timezone
    +    :param toTimezone: the timezone to convert to. if None then use local timezone
    +    :return pandas.Series where if it is a timestamp, has been converted to tz-naive
    +    """
    +    try:
    +        import pandas as pd
    +        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    +    except ImportError as e:
    +        raise ImportError(_old_pandas_exception_message(e))
    +    fromTz = fromTimezone or 'tzlocal()'
    --- End diff --
    
    I'll update it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83286/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r152174935
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
    @@ -58,6 +59,11 @@ class ArrowPythonRunner(
     
           protected override def writeCommand(dataOut: DataOutputStream): Unit = {
             PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)
    +        if (respectTimeZone) {
    +          PythonRDD.writeUTF(timeZoneId, dataOut)
    --- End diff --
    
    Unfortunately, can't because the worker doesn't have sql conf.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r149055095
  
    --- Diff: python/pyspark/serializers.py ---
    @@ -274,12 +278,13 @@ def load_stream(self, stream):
             """
             Deserialize ArrowRecordBatches to an Arrow table and return as a list of pandas.Series.
             """
    -        from pyspark.sql.types import _check_dataframe_localize_timestamps
    +        from pyspark.sql.types import _check_dataframe_localize_timestamps, from_arrow_schema
             import pyarrow as pa
             reader = pa.open_stream(stream)
    +        schema = from_arrow_schema(reader.schema)
             for batch in reader:
                 # NOTE: changed from pa.Columns.to_pandas, timezone issue in conversion fixed in 0.7.1
    -            pdf = _check_dataframe_localize_timestamps(batch.to_pandas())
    +            pdf = _check_dataframe_localize_timestamps(batch.to_pandas(), schema, self._timezone)
    --- End diff --
    
    Yes, current implementation is doing (1). I'm not sure if we should hold the timezone. cc @cloud-fan @gatorsmile 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84062/testReport)** for PR 19607 at commit [`3e23653`](https://github.com/apache/spark/commit/3e23653572b9d555f5479cdd58ac3e15c7f88c28).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r152174929
  
    --- Diff: python/pyspark/worker.py ---
    @@ -150,7 +150,8 @@ def read_udfs(pickleSer, infile, eval_type):
     
         if eval_type == PythonEvalType.SQL_PANDAS_SCALAR_UDF \
            or eval_type == PythonEvalType.SQL_PANDAS_GROUP_MAP_UDF:
    -        ser = ArrowStreamPandasSerializer()
    +        timezone = utf8_deserializer.loads(infile)
    --- End diff --
    
    The timezone will be read by the following `ArrowStreamPandasSerializer`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Here - https://github.com/apache/spark/pull/19674#issuecomment-342336767 I checked it lately.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83286 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83286/testReport)** for PR 19607 at commit [`b1436b8`](https://github.com/apache/spark/commit/b1436b8e4876838efbcb38cf0bffa7ebcc7a5544).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r149005429
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1629,35 +1629,112 @@ def to_arrow_type(dt):
         return arrow_type
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def from_arrow_type(at):
    +    """ Convert pyarrow type to Spark data type.
    +    """
    +    # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
    +    import pyarrow as pa
    +    if at == pa.bool_():
    +        spark_type = BooleanType()
    +    elif at == pa.int8():
    +        spark_type = ByteType()
    +    elif at == pa.int16():
    +        spark_type = ShortType()
    +    elif at == pa.int32():
    +        spark_type = IntegerType()
    +    elif at == pa.int64():
    +        spark_type = LongType()
    +    elif at == pa.float32():
    +        spark_type = FloatType()
    +    elif at == pa.float64():
    +        spark_type = DoubleType()
    +    elif type(at) == pa.DecimalType:
    +        spark_type = DecimalType(precision=at.precision, scale=at.scale)
    +    elif at == pa.string():
    +        spark_type = StringType()
    +    elif at == pa.date32():
    +        spark_type = DateType()
    +    elif type(at) == pa.TimestampType:
    +        spark_type = TimestampType()
    +    else:
    +        raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
    +    return spark_type
    +
    +
    +def from_arrow_schema(arrow_schema):
    +    """ Convert schema from Arrow to Spark.
    +    """
    +    return StructType(
    +        [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
    +         for field in arrow_schema])
    +
    +
    +def _check_dataframe_localize_timestamps(pdf, schema, timezone):
         """
         Convert timezone aware timestamps to timezone-naive in local time
     
         :param pdf: pandas.DataFrame
         :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
         """
    -    from pandas.api.types import is_datetime64tz_dtype
    -    for column, series in pdf.iteritems():
    -        # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    -        if is_datetime64tz_dtype(series.dtype):
    -            pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +    import pandas as pd
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    +        tz = timezone or 'tzlocal()'
    +        for column, series in pdf.iteritems():
    +            if type(schema[str(column)].dataType) == TimestampType:
    +                # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                if is_datetime64tz_dtype(series.dtype):
    +                    pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
    +                elif is_datetime64_dtype(series.dtype) and timezone is not None:
    +                    # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
    +                    pdf[column] = series.apply(
    +                        lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
    +                        if ts is not pd.NaT else pd.NaT)
    +    except ImportError:
    --- End diff --
    
    Yes, Pandas <0.19(?) seems to not have `pandas.api.types` but `pandas.core.common` instead.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Yup, I was testing and trying to produce details. Let me describe this in the JIRA, not here :D.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153386007
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -3192,16 +3255,49 @@ def test_filtered_frame(self):
             self.assertEqual(pdf.columns[0], "i")
             self.assertTrue(pdf.empty)
     
    -    def test_createDataFrame_toggle(self):
    -        pdf = self.create_pandas_data_frame()
    +    def _createDataFrame_toggle(self, pdf, schema=None):
             self.spark.conf.set("spark.sql.execution.arrow.enabled", "false")
             try:
    -            df_no_arrow = self.spark.createDataFrame(pdf)
    +            df_no_arrow = self.spark.createDataFrame(pdf, schema=schema)
             finally:
                 self.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    -        df_arrow = self.spark.createDataFrame(pdf)
    +        df_arrow = self.spark.createDataFrame(pdf, schema=schema)
    +        return df_no_arrow, df_arrow
    +
    +    def test_createDataFrame_toggle(self):
    +        pdf = self.create_pandas_data_frame()
    +        df_no_arrow, df_arrow = self._createDataFrame_toggle(pdf, schema=self.schema)
             self.assertEquals(df_no_arrow.collect(), df_arrow.collect())
     
    +    def test_createDataFrame_respect_session_timezone(self):
    +        from datetime import timedelta
    +        pdf = self.create_pandas_data_frame()
    +        orig_tz = self.spark.conf.get("spark.sql.session.timeZone")
    +        try:
    +            timezone = "America/New_York"
    +            self.spark.conf.set("spark.sql.session.timeZone", timezone)
    +            self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "false")
    +            try:
    +                df_no_arrow_la, df_arrow_la = self._createDataFrame_toggle(pdf, schema=self.schema)
    +                result_la = df_no_arrow_la.collect()
    +                result_arrow_la = df_arrow_la.collect()
    +                self.assertEqual(result_la, result_arrow_la)
    +            finally:
    +                self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "true")
    +            df_no_arrow_ny, df_arrow_ny = self._createDataFrame_toggle(pdf, schema=self.schema)
    +            result_ny = df_no_arrow_ny.collect()
    +            result_arrow_ny = df_arrow_ny.collect()
    +            self.assertEqual(result_ny, result_arrow_ny)
    +
    +            self.assertNotEqual(result_ny, result_la)
    +
    +            result_la_corrected = [Row(**{k: v - timedelta(hours=3) if k == '7_timestamp_t' else v
    --- End diff --
    
    Yes, the 3 hours timedelta is the time difference.
    I'll add some comments.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r151999012
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
    @@ -58,6 +59,11 @@ class ArrowPythonRunner(
     
           protected override def writeCommand(dataOut: DataOutputStream): Unit = {
             PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)
    +        if (respectTimeZone) {
    +          PythonRDD.writeUTF(timeZoneId, dataOut)
    --- End diff --
    
    why do we need to send it? Can't python side read the timezone from sql conf?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84013/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    BTW what's the pandas version we set up in jenkins? I'm afraid we don't even have a test for old pandas versions.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153386064
  
    --- Diff: python/pyspark/sql/session.py ---
    @@ -444,11 +445,30 @@ def _get_numpy_record_dtype(self, rec):
                 record_type_list.append((str(col_names[i]), curr_type))
             return np.dtype(record_type_list) if has_rec_fix else None
     
    -    def _convert_from_pandas(self, pdf):
    +    def _convert_from_pandas(self, pdf, schema, timezone):
    --- End diff --
    
    Thanks, I agree with it but maybe I'll leave those as they are in this pr.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83777/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83294/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r151998797
  
    --- Diff: python/pyspark/worker.py ---
    @@ -150,7 +150,8 @@ def read_udfs(pickleSer, infile, eval_type):
     
         if eval_type == PythonEvalType.SQL_PANDAS_SCALAR_UDF \
            or eval_type == PythonEvalType.SQL_PANDAS_GROUP_MAP_UDF:
    -        ser = ArrowStreamPandasSerializer()
    +        timezone = utf8_deserializer.loads(infile)
    --- End diff --
    
    who will read this timezone?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83586/testReport)** for PR 19607 at commit [`1e0f217`](https://github.com/apache/spark/commit/1e0f21715f5ad053b5a5677a129677d498946ea3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r151998296
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1913,7 +1920,16 @@ def toPandas(self):
     
                 for f, t in dtype.items():
                     pdf[f] = pdf[f].astype(t, copy=False)
    -            return pdf
    +
    +            if timezone is None:
    +                return pdf
    +            else:
    +                from pyspark.sql.types import _check_series_convert_timestamps_local_tz
    +                for field in self.schema:
    +                    if isinstance(field.dataType, TimestampType):
    --- End diff --
    
    add a TODO for nested timestamp field?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Yea, that was my proposal. If anything is blocked by this, I think we should bump it up as, IMHO, technically the fixed version specification was not yet released and published.
    
    ^ cc @cloud-fan, @srowen and @viirya 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Jenkins, retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    For Py4J, as a hard required dependency, its src-zip is in Spark partly to support non-pip installation as well properly AFAIK. In this case, the place to update seems https://github.com/apache/spark/commit/c8d0aba198c0f593c2b6b656c23b3d0fb7ea98a2 although it includes few more changes by Py4J jar as well, again, AFAIK.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    @BryanCutler I guess the oldest version of Pandas is `0.13.0` currently according to #18403, cc @HyukjinKwon. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83777 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83777/testReport)** for PR 19607 at commit [`8b1a4d8`](https://github.com/apache/spark/commit/8b1a4d8eb1c73d87eb3867fe4c1876cb9c48b2cf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84013 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84013/testReport)** for PR 19607 at commit [`9c94f90`](https://github.com/apache/spark/commit/9c94f90a703daaf08887259c1757420477a95b94).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83207 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83207/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83322/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    FWIW, to me, I think it's possibly a breaking change but I believe it's worth given the recent cool changes. +0 for bumping up anyway. Will join in the mailing thread too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83211 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83211/testReport)** for PR 19607 at commit [`e28fc87`](https://github.com/apache/spark/commit/e28fc873cdbfd25189df35ce54c7c8ede9e1cc99).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    I'll ask dev list if we can drop support old Pandas (<0.19.2), or what version we should support if we can't.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153385979
  
    --- Diff: python/pyspark/sql/session.py ---
    @@ -444,11 +445,30 @@ def _get_numpy_record_dtype(self, rec):
                 record_type_list.append((str(col_names[i]), curr_type))
             return np.dtype(record_type_list) if has_rec_fix else None
     
    -    def _convert_from_pandas(self, pdf):
    +    def _convert_from_pandas(self, pdf, schema, timezone):
             """
              Convert a pandas.DataFrame to list of records that can be used to make a DataFrame
              :return list of records
             """
    +        if timezone is not None:
    +            from pyspark.sql.types import _check_series_convert_timestamps_tz_local
    +            copied = False
    +            if isinstance(schema, StructType):
    +                for field in schema:
    +                    # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                    if isinstance(field.dataType, TimestampType):
    +                        s = _check_series_convert_timestamps_tz_local(pdf[field.name], timezone)
    +                        if not copied and s is not pdf[field.name]:
    +                            pdf = pdf.copy()
    +                            copied = True
    --- End diff --
    
    Yes, it's to prevent the original one from being updated.
    I'll add some comments.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83205/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Jenkins, retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84205/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84048 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84048/testReport)** for PR 19607 at commit [`3db2bea`](https://github.com/apache/spark/commit/3db2bea22c6bd240c871cb4580fbbf0ee3b659b4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r152645369
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -997,6 +997,14 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val PANDAS_RESPECT_SESSION_LOCAL_TIMEZONE =
    +    buildConf("spark.sql.execution.pandas.respectSessionTimeZone")
    +      .internal()
    +      .doc("When true, make Pandas DataFrame with timestamp type respecting session local " +
    +        "timezone when converting to/from Pandas DataFrame.")
    --- End diff --
    
    Emphasize the conf will be deprecated?
    
    > When true, make Pandas DataFrame with timestamp type respecting session local timezone when converting to/from Pandas DataFrame. This configuration will be deprecated in the future releases.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by wesm <gi...@git.apache.org>.
Github user wesm commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Aside: Could someone start preparing a patch that uses Arrow 0.8.x (where a lot of issues that surfaced throughout this process have been fixed)? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    The remaining discussions in this pr are:
    
    - [ ] Do we need the config `"spark.sql.execution.pandas.respectSessionTimeZone"`?
      - @cloud-fan raised that we don't need the config. https://github.com/apache/spark/pull/19607#discussion_r149655172
    - [ ] What version of Pandas should we support?
      - Since old pandas needs a lot of workaround or doesn't support timestamp values properly, we need to bump up Pandas to support.
      - @wesm suggested that we should support only `0.19.2` or upper. https://github.com/apache/spark/pull/19607#issuecomment-342371522
    
    I'd really appreciate if you left comments on these and please let me know if I miss something.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84242 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84242/testReport)** for PR 19607 at commit [`9200f38`](https://github.com/apache/spark/commit/9200f38b6414255a5c60127aeeae517086ba108b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83324 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83324/testReport)** for PR 19607 at commit [`1f096bf`](https://github.com/apache/spark/commit/1f096bf32f742945363cc7d9af978041ad77408b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r149005432
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1629,35 +1629,112 @@ def to_arrow_type(dt):
         return arrow_type
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def from_arrow_type(at):
    +    """ Convert pyarrow type to Spark data type.
    +    """
    +    # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
    +    import pyarrow as pa
    +    if at == pa.bool_():
    +        spark_type = BooleanType()
    +    elif at == pa.int8():
    +        spark_type = ByteType()
    +    elif at == pa.int16():
    +        spark_type = ShortType()
    +    elif at == pa.int32():
    +        spark_type = IntegerType()
    +    elif at == pa.int64():
    +        spark_type = LongType()
    +    elif at == pa.float32():
    +        spark_type = FloatType()
    +    elif at == pa.float64():
    +        spark_type = DoubleType()
    +    elif type(at) == pa.DecimalType:
    +        spark_type = DecimalType(precision=at.precision, scale=at.scale)
    +    elif at == pa.string():
    +        spark_type = StringType()
    +    elif at == pa.date32():
    +        spark_type = DateType()
    +    elif type(at) == pa.TimestampType:
    +        spark_type = TimestampType()
    +    else:
    +        raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
    +    return spark_type
    +
    +
    +def from_arrow_schema(arrow_schema):
    +    """ Convert schema from Arrow to Spark.
    +    """
    +    return StructType(
    +        [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
    +         for field in arrow_schema])
    +
    +
    +def _check_dataframe_localize_timestamps(pdf, schema, timezone):
         """
         Convert timezone aware timestamps to timezone-naive in local time
     
         :param pdf: pandas.DataFrame
         :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
         """
    -    from pandas.api.types import is_datetime64tz_dtype
    -    for column, series in pdf.iteritems():
    -        # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    -        if is_datetime64tz_dtype(series.dtype):
    -            pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +    import pandas as pd
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    +        tz = timezone or 'tzlocal()'
    +        for column, series in pdf.iteritems():
    +            if type(schema[str(column)].dataType) == TimestampType:
    +                # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                if is_datetime64tz_dtype(series.dtype):
    +                    pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
    +                elif is_datetime64_dtype(series.dtype) and timezone is not None:
    +                    # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
    +                    pdf[column] = series.apply(
    +                        lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
    +                        if ts is not pd.NaT else pd.NaT)
    +    except ImportError:
    +        from pandas.core.common import is_datetime64_dtype
    +        from pandas.tslib import _dateutil_tzlocal
    +        tzlocal = _dateutil_tzlocal()
    +        tz = timezone or tzlocal
    +        for column, series in pdf.iteritems():
    +            if type(schema[str(column)].dataType) == TimestampType:
    +                # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                if not is_datetime64_dtype(series.dtype):
    --- End diff --
    
    Unfortunately, Pandas <0.17(?) seems to not have `is_datetime64tz_dtype`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r148538507
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1629,35 +1629,121 @@ def to_arrow_type(dt):
         return arrow_type
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def to_arrow_schema(schema):
    --- End diff --
    
    where do we use this method?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83774/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83472/testReport)** for PR 19607 at commit [`ce07f39`](https://github.com/apache/spark/commit/ce07f39643ef9711419e7e11082e62c578d816f5).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83280/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Hi @ueshin , what is the oldest version of Pandas that's required to support and what exactly wasn't working with it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r149655172
  
    --- Diff: python/pyspark/sql/session.py ---
    @@ -557,7 +577,13 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
             except Exception:
                 has_pandas = False
             if has_pandas and isinstance(data, pandas.DataFrame):
    -            data, schema = self._convert_from_pandas(data, schema)
    +            if self.conf.get("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
    --- End diff --
    
    I feel this is a weird config... I think it's acceptable to introduce behavior change during bug fix, like the type inference bug we fixed when converting pandas dataframe to pyspark dataframe.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    I guess we should look at R to see if it should behavior similarly? WDYT @HyukjinKwon ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84020/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83780 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83780/testReport)** for PR 19607 at commit [`e919ed5`](https://github.com/apache/spark/commit/e919ed55758f75733d56287d5a49326b1067a44c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84070/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    @wesm I'd like to know if you have any specific reason to recommend the version 0.19.2 or upper?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153167040
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -3683,6 +3808,47 @@ def check_records_per_batch(x):
                 else:
                     self.spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", orig_value)
     
    +    def test_vectorized_udf_timestamps_respect_session_timezone(self):
    +        from pyspark.sql.functions import pandas_udf, col
    +        from datetime import datetime
    +        import pandas as pd
    +        schema = StructType([
    +            StructField("idx", LongType(), True),
    +            StructField("timestamp", TimestampType(), True)])
    +        data = [(1, datetime(1969, 1, 1, 1, 1, 1)),
    +                (2, datetime(2012, 2, 2, 2, 2, 2)),
    +                (3, None),
    +                (4, datetime(2100, 3, 3, 3, 3, 3))]
    +        df = self.spark.createDataFrame(data, schema=schema)
    +
    +        f_timestamp_copy = pandas_udf(lambda ts: ts, TimestampType())
    +        internal_value = pandas_udf(
    +            lambda ts: ts.apply(lambda ts: ts.value if ts is not pd.NaT else None), LongType())
    +
    +        orig_tz = self.spark.conf.get("spark.sql.session.timeZone")
    +        try:
    +            timezone = "America/New_York"
    +            self.spark.conf.set("spark.sql.session.timeZone", timezone)
    +            self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "false")
    +            try:
    +                df_la = df.withColumn("tscopy", f_timestamp_copy(col("timestamp"))) \
    +                    .withColumn("internal_value", internal_value(col("timestamp")))
    +                result_la = df_la.select(col("idx"), col("internal_value")).collect()
    +                diff = 3 * 60 * 60 * 1000 * 1000 * 1000
    --- End diff --
    
    Here too. it took me a while to check where this 3 came from ..


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r152201452
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1678,37 +1678,105 @@ def from_arrow_schema(arrow_schema):
              for field in arrow_schema])
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def _old_pandas_exception_message(e):
    +    """ Create an error message for importing old Pandas.
         """
    -    Convert timezone aware timestamps to timezone-naive in local time
    +    msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
    +    return "%s\n%s" % (e.message, msg)
    --- End diff --
    
    Oops, thanks!
    We need to fix [dataframe.py#L1905](https://github.com/apache/spark/blob/3db2bea22c6bd240c871cb4580fbbf0ee3b659b4/python/pyspark/sql/dataframe.py#L1905) as well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84044 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84044/testReport)** for PR 19607 at commit [`9cfdde2`](https://github.com/apache/spark/commit/9cfdde2ce07d00779a9f6f8f5ab86cc442b7655b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    (Will give a shot to reproduce this within tomorrow and leave my output.)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r148576712
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -948,6 +948,14 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val PANDAS_RESPECT_SESSION_LOCAL_TIMEZONE =
    --- End diff --
    
    Sure, I'll try it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Ah, yes, I think so.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84048/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83780 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83780/testReport)** for PR 19607 at commit [`e919ed5`](https://github.com/apache/spark/commit/e919ed55758f75733d56287d5a49326b1067a44c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83593 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83593/testReport)** for PR 19607 at commit [`292678f`](https://github.com/apache/spark/commit/292678f3e47a4f5a20fd1af5da10e02cc4017882).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83578/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83280 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83280/testReport)** for PR 19607 at commit [`ee1a1c8`](https://github.com/apache/spark/commit/ee1a1c87e2a89974e4e299f4ad84e5831526d079).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83578 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83578/testReport)** for PR 19607 at commit [`4adb073`](https://github.com/apache/spark/commit/4adb073f8d1454fbea0742a16b6d7662e063b37a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r149002542
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1629,35 +1629,112 @@ def to_arrow_type(dt):
         return arrow_type
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def from_arrow_type(at):
    +    """ Convert pyarrow type to Spark data type.
    +    """
    +    # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
    +    import pyarrow as pa
    +    if at == pa.bool_():
    +        spark_type = BooleanType()
    +    elif at == pa.int8():
    +        spark_type = ByteType()
    +    elif at == pa.int16():
    +        spark_type = ShortType()
    +    elif at == pa.int32():
    +        spark_type = IntegerType()
    +    elif at == pa.int64():
    +        spark_type = LongType()
    +    elif at == pa.float32():
    +        spark_type = FloatType()
    +    elif at == pa.float64():
    +        spark_type = DoubleType()
    +    elif type(at) == pa.DecimalType:
    +        spark_type = DecimalType(precision=at.precision, scale=at.scale)
    +    elif at == pa.string():
    +        spark_type = StringType()
    +    elif at == pa.date32():
    +        spark_type = DateType()
    +    elif type(at) == pa.TimestampType:
    +        spark_type = TimestampType()
    +    else:
    +        raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
    +    return spark_type
    +
    +
    +def from_arrow_schema(arrow_schema):
    +    """ Convert schema from Arrow to Spark.
    +    """
    +    return StructType(
    +        [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
    +         for field in arrow_schema])
    +
    +
    +def _check_dataframe_localize_timestamps(pdf, schema, timezone):
         """
         Convert timezone aware timestamps to timezone-naive in local time
     
         :param pdf: pandas.DataFrame
         :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
         """
    -    from pandas.api.types import is_datetime64tz_dtype
    -    for column, series in pdf.iteritems():
    -        # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    -        if is_datetime64tz_dtype(series.dtype):
    -            pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +    import pandas as pd
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    +        tz = timezone or 'tzlocal()'
    +        for column, series in pdf.iteritems():
    +            if type(schema[str(column)].dataType) == TimestampType:
    +                # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                if is_datetime64tz_dtype(series.dtype):
    +                    pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
    +                elif is_datetime64_dtype(series.dtype) and timezone is not None:
    +                    # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
    +                    pdf[column] = series.apply(
    +                        lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
    +                        if ts is not pd.NaT else pd.NaT)
    +    except ImportError:
    +        from pandas.core.common import is_datetime64_dtype
    +        from pandas.tslib import _dateutil_tzlocal
    +        tzlocal = _dateutil_tzlocal()
    +        tz = timezone or tzlocal
    +        for column, series in pdf.iteritems():
    +            if type(schema[str(column)].dataType) == TimestampType:
    +                # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                if not is_datetime64_dtype(series.dtype):
    --- End diff --
    
    Can't we just use `is_datetime64tz_dtype` as above?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84020/testReport)** for PR 19607 at commit [`9c94f90`](https://github.com/apache/spark/commit/9c94f90a703daaf08887259c1757420477a95b94).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Yup, I think we should take a look between POSIXct / POSIXlt in R and timestamp within Spark too. Seems not respecting the session timezone in a quick look.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Seems there is no explicit objection for dropping it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r152199700
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1678,37 +1678,105 @@ def from_arrow_schema(arrow_schema):
              for field in arrow_schema])
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def _old_pandas_exception_message(e):
    +    """ Create an error message for importing old Pandas.
         """
    -    Convert timezone aware timestamps to timezone-naive in local time
    +    msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
    +    return "%s\n%s" % (e.message, msg)
    --- End diff --
    
    I am quite sure the `message` attribute does not exist in Python 3:
    
    ```python
    >>> try:
    ...     import foo
    ... except ImportError as e:
    ...     dir(e)
    ...
    ['__cause__', '__class__', '__context__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__suppress_context__', '__traceback__', 'args', 'msg', 'name', 'path', 'with_traceback']
    ```
    
    Probably, we should use `from pyspark.util import _exception_message` (https://github.com/apache/spark/pull/16845).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84070 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84070/testReport)** for PR 19607 at commit [`d741171`](https://github.com/apache/spark/commit/d7411717c7c8c7c7aba63e4e19ebdfadfa7ea0c0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83205/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    @wesm Thanks for the explanation!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83475/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83286 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83286/testReport)** for PR 19607 at commit [`b1436b8`](https://github.com/apache/spark/commit/b1436b8e4876838efbcb38cf0bffa7ebcc7a5544).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83280/testReport)** for PR 19607 at commit [`ee1a1c8`](https://github.com/apache/spark/commit/ee1a1c87e2a89974e4e299f4ad84e5831526d079).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153107748
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -997,6 +997,14 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val PANDAS_RESPECT_SESSION_LOCAL_TIMEZONE =
    +    buildConf("spark.sql.execution.pandas.respectSessionTimeZone")
    +      .internal()
    +      .doc("When true, make Pandas DataFrame with timestamp type respecting session local " +
    +        "timezone when converting to/from Pandas DataFrame.")
    --- End diff --
    
    Sure, I'll update it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84044/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84205/testReport)** for PR 19607 at commit [`40a9735`](https://github.com/apache/spark/commit/40a9735b88deb85c08f618186daf9ed2152fc406).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83320/testReport)** for PR 19607 at commit [`1f096bf`](https://github.com/apache/spark/commit/1f096bf32f742945363cc7d9af978041ad77408b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83211/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Hmm, I can't reproduce the current failure in my local, which seems to not be related to old Pandas because the failure is on Python 3.4 and Pandas 0.19 (I guess). I tried the combination in my local.
    Can anyone help me reproduce and fix it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83592/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r149582142
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1629,37 +1629,82 @@ def to_arrow_type(dt):
         return arrow_type
     
     
    -def _check_dataframe_localize_timestamps(pdf):
    +def _check_dataframe_localize_timestamps(pdf, timezone):
         """
    -    Convert timezone aware timestamps to timezone-naive in local time
    +    Convert timezone aware timestamps to timezone-naive in the specified timezone or local timezone
     
         :param pdf: pandas.DataFrame
    -    :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
    +    :param timezone: the timezone to convert. if None then use local timezone
    +    :return pandas.DataFrame where any timezone aware columns have been converted to tz-naive
         """
         from pandas.api.types import is_datetime64tz_dtype
    +    tz = timezone or 'tzlocal()'
         for column, series in pdf.iteritems():
             # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
             if is_datetime64tz_dtype(series.dtype):
    -            pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +            pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
         return pdf
     
     
    -def _check_series_convert_timestamps_internal(s):
    +def _check_series_convert_timestamps_internal(s, timezone):
         """
    -    Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
    +    Convert a tz-naive timestamp in the specified timezone or local timezone to UTC normalized for
    +    Spark internal storage
    +
         :param s: a pandas.Series
    +    :param timezone: the timezone to convert. if None then use local timezone
         :return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
         """
         from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
         # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
         if is_datetime64_dtype(s.dtype):
    -        return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
    +        tz = timezone or 'tzlocal()'
    +        return s.dt.tz_localize(tz).dt.tz_convert('UTC')
         elif is_datetime64tz_dtype(s.dtype):
             return s.dt.tz_convert('UTC')
         else:
             return s
     
     
    +def _check_series_convert_timestamps_localize(s, timezone):
    +    """
    +    Convert timestamp to timezone-naive in the specified timezone or local timezone
    +
    +    :param s: a pandas.Series
    +    :param timezone: the timezone to convert. if None then use local timezone
    +    :return pandas.Series where if it is a timestamp, has been converted to tz-naive
    +    """
    +    import pandas as pd
    +    try:
    +        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    +        tz = timezone or 'tzlocal()'
    +        # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +        if is_datetime64tz_dtype(s.dtype):
    +            return s.dt.tz_convert(tz).dt.tz_localize(None)
    +        elif is_datetime64_dtype(s.dtype) and timezone is not None:
    +            # `s.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
    +            return s.apply(lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
    +                           if ts is not pd.NaT else pd.NaT)
    +        else:
    +            return s
    +    except ImportError:
    --- End diff --
    
    We will be able to remove this block if we decided to support only Pandas >=0.19.2.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by wesm <gi...@git.apache.org>.
Github user wesm commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Well, old versions of pandas are not supported at all. There have also been a number of API evolutions around extension dtypes that make supporting pandas 0.18.x and lower challenging. We have dropped support already in pyarrow, it is a serious liability to accept the maintenance burden of supporting such an outdated version of the project. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Hm .. I see.  but is this something we should fix though ideally? I am asking this because I am checking R related codes now .. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r148598592
  
    --- Diff: python/pyspark/serializers.py ---
    @@ -274,12 +278,13 @@ def load_stream(self, stream):
             """
             Deserialize ArrowRecordBatches to an Arrow table and return as a list of pandas.Series.
             """
    -        from pyspark.sql.types import _check_dataframe_localize_timestamps
    +        from pyspark.sql.types import _check_dataframe_localize_timestamps, from_arrow_schema
             import pyarrow as pa
             reader = pa.open_stream(stream)
    +        schema = from_arrow_schema(reader.schema)
             for batch in reader:
                 # NOTE: changed from pa.Columns.to_pandas, timezone issue in conversion fixed in 0.7.1
    -            pdf = _check_dataframe_localize_timestamps(batch.to_pandas())
    +            pdf = _check_dataframe_localize_timestamps(batch.to_pandas(), schema, self._timezone)
    --- End diff --
    
    If `self._timezone` is not None, then it will be the SESSION_LOCAL_TIMEZONE and Arrow data will already have this timezone set so nothing needs to be done here right?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    LGTM, merging to master!
    
    Let's fix the R timestamp issue in a new ticket.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83586/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84020/testReport)** for PR 19607 at commit [`9c94f90`](https://github.com/apache/spark/commit/9c94f90a703daaf08887259c1757420477a95b94).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by wesm <gi...@git.apache.org>.
Github user wesm commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    @jreback @cpcloud could use some input on how to maximize cross-pandas version compatibility in this patch.
    
    I am not sure I would recommend that Spark support pandas versions lower than 0.19.2


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #84048 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84048/testReport)** for PR 19607 at commit [`3db2bea`](https://github.com/apache/spark/commit/3db2bea22c6bd240c871cb4580fbbf0ee3b659b4).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83295/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153168918
  
    --- Diff: python/pyspark/sql/session.py ---
    @@ -444,11 +445,30 @@ def _get_numpy_record_dtype(self, rec):
                 record_type_list.append((str(col_names[i]), curr_type))
             return np.dtype(record_type_list) if has_rec_fix else None
     
    -    def _convert_from_pandas(self, pdf):
    +    def _convert_from_pandas(self, pdf, schema, timezone):
             """
              Convert a pandas.DataFrame to list of records that can be used to make a DataFrame
              :return list of records
             """
    +        if timezone is not None:
    +            from pyspark.sql.types import _check_series_convert_timestamps_tz_local
    +            copied = False
    +            if isinstance(schema, StructType):
    +                for field in schema:
    +                    # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
    +                    if isinstance(field.dataType, TimestampType):
    +                        s = _check_series_convert_timestamps_tz_local(pdf[field.name], timezone)
    +                        if not copied and s is not pdf[field.name]:
    +                            pdf = pdf.copy()
    +                            copied = True
    --- End diff --
    
    Would you mind if I ask why we should copy here? Probably, some comments explaining it would be helpful. To be clear, Is it to prevent the original Pandas DataFrame being updated?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Well, I'd leave the config `spark.sql.execution.pandas.respectSessionTimeZone` as it is for now to be safe as @gatorsmile mentioned before.
    And as we discussed in dev list, I'll update this to remove workarounds for old Pandas but add some error messages saying we need 0.19.2 or upper.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19607#discussion_r153149080
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -3192,16 +3255,49 @@ def test_filtered_frame(self):
             self.assertEqual(pdf.columns[0], "i")
             self.assertTrue(pdf.empty)
     
    -    def test_createDataFrame_toggle(self):
    -        pdf = self.create_pandas_data_frame()
    +    def _createDataFrame_toggle(self, pdf, schema=None):
             self.spark.conf.set("spark.sql.execution.arrow.enabled", "false")
             try:
    -            df_no_arrow = self.spark.createDataFrame(pdf)
    +            df_no_arrow = self.spark.createDataFrame(pdf, schema=schema)
             finally:
                 self.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    -        df_arrow = self.spark.createDataFrame(pdf)
    +        df_arrow = self.spark.createDataFrame(pdf, schema=schema)
    +        return df_no_arrow, df_arrow
    +
    +    def test_createDataFrame_toggle(self):
    +        pdf = self.create_pandas_data_frame()
    +        df_no_arrow, df_arrow = self._createDataFrame_toggle(pdf, schema=self.schema)
             self.assertEquals(df_no_arrow.collect(), df_arrow.collect())
     
    +    def test_createDataFrame_respect_session_timezone(self):
    +        from datetime import timedelta
    +        pdf = self.create_pandas_data_frame()
    +        orig_tz = self.spark.conf.get("spark.sql.session.timeZone")
    +        try:
    +            timezone = "America/New_York"
    +            self.spark.conf.set("spark.sql.session.timeZone", timezone)
    +            self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "false")
    +            try:
    +                df_no_arrow_la, df_arrow_la = self._createDataFrame_toggle(pdf, schema=self.schema)
    +                result_la = df_no_arrow_la.collect()
    +                result_arrow_la = df_arrow_la.collect()
    +                self.assertEqual(result_la, result_arrow_la)
    +            finally:
    +                self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "true")
    +            df_no_arrow_ny, df_arrow_ny = self._createDataFrame_toggle(pdf, schema=self.schema)
    +            result_ny = df_no_arrow_ny.collect()
    +            result_arrow_ny = df_arrow_ny.collect()
    +            self.assertEqual(result_ny, result_arrow_ny)
    +
    +            self.assertNotEqual(result_ny, result_la)
    +
    +            result_la_corrected = [Row(**{k: v - timedelta(hours=3) if k == '7_timestamp_t' else v
    --- End diff --
    
    Small comments here would be helpful .. BTW, to be clear, this 3 hours timedelta is from America/Los_Angeles and America/New_York time difference?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    I doubt if we can still support pandas 0.13.0 now...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    **[Test build #83205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83205/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83320/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19607
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org