You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ueshin <gi...@git.apache.org> on 2017/10/30 05:32:51 UTC
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
GitHub user ueshin opened a pull request:
https://github.com/apache/spark/pull/19607
[SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone
## What changes were proposed in this pull request?
When converting Pandas DataFrame/Series from/to Spark DataFrame using `toPandas()` or pandas udfs, timestamp values behave to respect Python system timezone instead of session timezone.
For example, let's say we use `"America/Los_Angeles"` as session timezone and have a timestamp value `"1970-01-01 00:00:01"` in the timezone. Btw, I'm in Japan so Python timezone would be `"Asia/Tokyo"`.
The timestamp value from current `toPandas()` will be the following:
```
>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts")
>>> df.show()
+-------------------+
| ts|
+-------------------+
|1970-01-01 00:00:01|
+-------------------+
>>> df.toPandas()
ts
0 1970-01-01 17:00:01
```
As you can see, the value becomes `"1970-01-01 17:00:01"` because it respects Python timezone.
As we discussed in #18664, we consider this behavior is a bug and the value should be `"1970-01-01 00:00:01"`.
## How was this patch tested?
Added tests and existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ueshin/apache-spark issues/SPARK-22395
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19607.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19607
----
commit 4735e5981ecf3a4bce50ce86f706e25830f4a801
Author: Takuya UESHIN <ue...@databricks.com>
Date: 2017-10-23T06:27:22Z
Add a conf to make Pandas DataFrame respect session local timezone.
commit 1f85150dc5b26df21dca6bad2ef4eaec342c4400
Author: Takuya UESHIN <ue...@databricks.com>
Date: 2017-10-23T08:09:16Z
Fix toPandas() behavior.
commit 5c08ecf247bfe7e14afcdef8eba1c25cb3b68634
Author: Takuya UESHIN <ue...@databricks.com>
Date: 2017-10-23T09:15:47Z
Modify pandas UDFs to respect session timezone.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83295/testReport)** for PR 19607 at commit [`6872516`](https://github.com/apache/spark/commit/6872516e8cd9d7f81929c38708571c69a0af7883).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84204/testReport)** for PR 19607 at commit [`f92eae3`](https://github.com/apache/spark/commit/f92eae35767a766ad80ac576a67f521e365549c7).
* This patch **fails due to an unknown error code, -9**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84013 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84013/testReport)** for PR 19607 at commit [`9c94f90`](https://github.com/apache/spark/commit/9c94f90a703daaf08887259c1757420477a95b94).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83474/testReport)** for PR 19607 at commit [`9101a3a`](https://github.com/apache/spark/commit/9101a3a12f17b5bd633756139eaa2cb3ee9bb33c).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83780/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83472/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84044 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84044/testReport)** for PR 19607 at commit [`9cfdde2`](https://github.com/apache/spark/commit/9cfdde2ce07d00779a9f6f8f5ab86cc442b7655b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r148601935
--- Diff: python/pyspark/serializers.py ---
@@ -274,12 +278,13 @@ def load_stream(self, stream):
"""
Deserialize ArrowRecordBatches to an Arrow table and return as a list of pandas.Series.
"""
- from pyspark.sql.types import _check_dataframe_localize_timestamps
+ from pyspark.sql.types import _check_dataframe_localize_timestamps, from_arrow_schema
import pyarrow as pa
reader = pa.open_stream(stream)
+ schema = from_arrow_schema(reader.schema)
for batch in reader:
# NOTE: changed from pa.Columns.to_pandas, timezone issue in conversion fixed in 0.7.1
- pdf = _check_dataframe_localize_timestamps(batch.to_pandas())
+ pdf = _check_dataframe_localize_timestamps(batch.to_pandas(), schema, self._timezone)
--- End diff --
Oh, maybe I misunderstood the purpose of this conf "spark.sql.execution.pandas.respectSessionTimeZone". If that is true then what is the behavior of Spark?
1) convert timestamps in Pandas to remove the timezone and localize to SESSION_LOCAL_TIMEZONE
2) show Pandas timestamps with SESSION_LOCAL_TIMEZONE set as the timezone
It seems this change is doing (1), but what's wrong with doing (2)? I think that would be a lot cleaner
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84205/testReport)** for PR 19607 at commit [`40a9735`](https://github.com/apache/spark/commit/40a9735b88deb85c08f618186daf9ed2152fc406).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,`
* `class _ImageSchema(object):`
* ` raise RuntimeError(\"Creating instance of _ImageSchema class is disallowed.\")`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r152174813
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1913,7 +1920,16 @@ def toPandas(self):
for f, t in dtype.items():
pdf[f] = pdf[f].astype(t, copy=False)
- return pdf
+
+ if timezone is None:
+ return pdf
+ else:
+ from pyspark.sql.types import _check_series_convert_timestamps_local_tz
+ for field in self.schema:
+ if isinstance(field.dataType, TimestampType):
--- End diff --
Sure, I'll add it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
Jenkins, retest this please.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83593 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83593/testReport)** for PR 19607 at commit [`292678f`](https://github.com/apache/spark/commit/292678f3e47a4f5a20fd1af5da10e02cc4017882).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84242/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
I have no idea about the reason but seems like there is a difference of handling DST in `spark.createDataFrame(data, schema=schema)` between Jenkins and my local environments.
The debug print by `df.show()` from the test [tests.py#L3487](https://github.com/ueshin/apache-spark/blob/9101a3a12f17b5bd633756139eaa2cb3ee9bb33c/python/pyspark/sql/tests.py#L3487) was:
```
+---+-------------------+
|idx| timestamp|
+---+-------------------+
| 0|1969-01-01 01:01:01|
| 1|2012-02-02 02:02:02|
| 2| null|
| 3|2100-04-04 04:04:04|
+---+-------------------+
```
but in my local:
```
+---+-------------------+
|idx| timestamp|
+---+-------------------+
| 0|1969-01-01 01:01:01|
| 1|2012-02-02 02:02:02|
| 2| null|
| 3|2100-04-04 05:04:04|
+---+-------------------+
```
Could you please let me know if I miss something or you have any ideas?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
My local is:
```
+---+-------------------+
|idx| timestamp|
+---+-------------------+
| 0|1969-01-01 01:01:01|
| 1|2012-02-02 02:02:02|
| 2| null|
| 3|2100-04-04 05:04:04|
+---+-------------------+
```
too. Will try to take a look as well.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r149001569
--- Diff: python/pyspark/sql/types.py ---
@@ -1629,35 +1629,112 @@ def to_arrow_type(dt):
return arrow_type
-def _check_dataframe_localize_timestamps(pdf):
+def from_arrow_type(at):
+ """ Convert pyarrow type to Spark data type.
+ """
+ # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
+ import pyarrow as pa
+ if at == pa.bool_():
+ spark_type = BooleanType()
+ elif at == pa.int8():
+ spark_type = ByteType()
+ elif at == pa.int16():
+ spark_type = ShortType()
+ elif at == pa.int32():
+ spark_type = IntegerType()
+ elif at == pa.int64():
+ spark_type = LongType()
+ elif at == pa.float32():
+ spark_type = FloatType()
+ elif at == pa.float64():
+ spark_type = DoubleType()
+ elif type(at) == pa.DecimalType:
+ spark_type = DecimalType(precision=at.precision, scale=at.scale)
+ elif at == pa.string():
+ spark_type = StringType()
+ elif at == pa.date32():
+ spark_type = DateType()
+ elif type(at) == pa.TimestampType:
+ spark_type = TimestampType()
+ else:
+ raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
+ return spark_type
+
+
+def from_arrow_schema(arrow_schema):
+ """ Convert schema from Arrow to Spark.
+ """
+ return StructType(
+ [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
+ for field in arrow_schema])
+
+
+def _check_dataframe_localize_timestamps(pdf, schema, timezone):
"""
Convert timezone aware timestamps to timezone-naive in local time
:param pdf: pandas.DataFrame
:return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
"""
- from pandas.api.types import is_datetime64tz_dtype
- for column, series in pdf.iteritems():
- # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
- if is_datetime64tz_dtype(series.dtype):
- pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
+ import pandas as pd
+ try:
+ from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+ tz = timezone or 'tzlocal()'
+ for column, series in pdf.iteritems():
+ if type(schema[str(column)].dataType) == TimestampType:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if is_datetime64tz_dtype(series.dtype):
+ pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
+ elif is_datetime64_dtype(series.dtype) and timezone is not None:
+ # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
+ pdf[column] = series.apply(
+ lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
+ if ts is not pd.NaT else pd.NaT)
+ except ImportError:
--- End diff --
Is this import error due to version difference of pandas?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83774 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83774/testReport)** for PR 19607 at commit [`8b1a4d8`](https://github.com/apache/spark/commit/8b1a4d8eb1c73d87eb3867fe4c1876cb9c48b2cf).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
@BryanCutler I'm fixing the behavior of `toPandas()` and pandas udfs as we discussed in #18664 but I guess we still need to support old Pandas as well.
I tried to find out a workaround for old Pandas, but I haven't done yet.
Do you have any ideas for the workaround? cc @wesm @HyukjinKwon @viirya
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83473 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83473/testReport)** for PR 19607 at commit [`ba3d6e3`](https://github.com/apache/spark/commit/ba3d6e3cf679e3db0a2e095f8cbe099ab4260532).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84210/testReport)** for PR 19607 at commit [`40a9735`](https://github.com/apache/spark/commit/40a9735b88deb85c08f618186daf9ed2152fc406).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,`
* `class _ImageSchema(object):`
* ` raise RuntimeError(\"Creating instance of _ImageSchema class is disallowed.\")`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83207/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
> I tried to find out a workaround for old Pandas, but I haven't done yet.
I haven't looked at this closely yet but will definitely try to take a look and help soon together. I would appreciate it if the problem (or just symptoms, or just a pointer ..) can be given though if it is not too complex.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153386033
--- Diff: python/pyspark/sql/types.py ---
@@ -1678,37 +1679,105 @@ def from_arrow_schema(arrow_schema):
for field in arrow_schema])
-def _check_dataframe_localize_timestamps(pdf):
+def _old_pandas_exception_message(e):
+ """ Create an error message for importing old Pandas.
"""
- Convert timezone aware timestamps to timezone-naive in local time
+ msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
+ return "%s\n%s" % (_exception_message(e), msg)
+
+
+def _check_dataframe_localize_timestamps(pdf, timezone):
+ """
+ Convert timezone aware timestamps to timezone-naive in the specified timezone or local timezone
:param pdf: pandas.DataFrame
- :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
+ :param timezone: the timezone to convert. if None then use local timezone
+ :return pandas.DataFrame where any timezone aware columns have been converted to tz-naive
"""
- from pandas.api.types import is_datetime64tz_dtype
+ try:
+ from pandas.api.types import is_datetime64tz_dtype
+ except ImportError as e:
+ raise ImportError(_old_pandas_exception_message(e))
+ tz = timezone or 'tzlocal()'
for column, series in pdf.iteritems():
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
if is_datetime64tz_dtype(series.dtype):
- pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
+ pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
return pdf
-def _check_series_convert_timestamps_internal(s):
+def _check_series_convert_timestamps_internal(s, timezone):
"""
- Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
+ Convert a tz-naive timestamp in the specified timezone or local timezone to UTC normalized for
+ Spark internal storage
+
:param s: a pandas.Series
+ :param timezone: the timezone to convert. if None then use local timezone
:return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
"""
- from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
+ try:
+ from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
+ except ImportError as e:
+ raise ImportError(_old_pandas_exception_message(e))
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
if is_datetime64_dtype(s.dtype):
- return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
+ tz = timezone or 'tzlocal()'
+ return s.dt.tz_localize(tz).dt.tz_convert('UTC')
elif is_datetime64tz_dtype(s.dtype):
return s.dt.tz_convert('UTC')
else:
return s
+def _check_series_convert_timestamps_localize(s, fromTimezone, toTimezone):
--- End diff --
Thanks, I'll update it. Maybe `toTimestamp` -> `to_timestamp` as well.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83212/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
LGTM
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83474/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153142283
--- Diff: python/pyspark/sql/types.py ---
@@ -1678,37 +1679,105 @@ def from_arrow_schema(arrow_schema):
for field in arrow_schema])
-def _check_dataframe_localize_timestamps(pdf):
+def _old_pandas_exception_message(e):
+ """ Create an error message for importing old Pandas.
"""
- Convert timezone aware timestamps to timezone-naive in local time
+ msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
+ return "%s\n%s" % (_exception_message(e), msg)
+
+
+def _check_dataframe_localize_timestamps(pdf, timezone):
+ """
+ Convert timezone aware timestamps to timezone-naive in the specified timezone or local timezone
:param pdf: pandas.DataFrame
- :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
+ :param timezone: the timezone to convert. if None then use local timezone
+ :return pandas.DataFrame where any timezone aware columns have been converted to tz-naive
"""
- from pandas.api.types import is_datetime64tz_dtype
+ try:
+ from pandas.api.types import is_datetime64tz_dtype
+ except ImportError as e:
+ raise ImportError(_old_pandas_exception_message(e))
+ tz = timezone or 'tzlocal()'
for column, series in pdf.iteritems():
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
if is_datetime64tz_dtype(series.dtype):
- pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
+ pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
return pdf
-def _check_series_convert_timestamps_internal(s):
+def _check_series_convert_timestamps_internal(s, timezone):
"""
- Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
+ Convert a tz-naive timestamp in the specified timezone or local timezone to UTC normalized for
+ Spark internal storage
+
:param s: a pandas.Series
+ :param timezone: the timezone to convert. if None then use local timezone
:return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
"""
- from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
+ try:
+ from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
+ except ImportError as e:
+ raise ImportError(_old_pandas_exception_message(e))
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
if is_datetime64_dtype(s.dtype):
- return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
+ tz = timezone or 'tzlocal()'
+ return s.dt.tz_localize(tz).dt.tz_convert('UTC')
elif is_datetime64tz_dtype(s.dtype):
return s.dt.tz_convert('UTC')
else:
return s
+def _check_series_convert_timestamps_localize(s, fromTimezone, toTimezone):
--- End diff --
Nit: maybe `from_timezone` .
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84070/testReport)** for PR 19607 at commit [`d741171`](https://github.com/apache/spark/commit/d7411717c7c8c7c7aba63e4e19ebdfadfa7ea0c0).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83474 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83474/testReport)** for PR 19607 at commit [`9101a3a`](https://github.com/apache/spark/commit/9101a3a12f17b5bd633756139eaa2cb3ee9bb33c).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83777 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83777/testReport)** for PR 19607 at commit [`8b1a4d8`](https://github.com/apache/spark/commit/8b1a4d8eb1c73d87eb3867fe4c1876cb9c48b2cf).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r148576678
--- Diff: python/pyspark/sql/types.py ---
@@ -1629,35 +1629,121 @@ def to_arrow_type(dt):
return arrow_type
-def _check_dataframe_localize_timestamps(pdf):
+def to_arrow_schema(schema):
+ """ Convert a schema from Spark to Arrow
+ """
+ import pyarrow as pa
+ fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
+ for field in schema]
+ return pa.schema(fields)
+
+
+def from_arrow_type(at):
+ """ Convert pyarrow type to Spark data type.
+ """
+ # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
+ import pyarrow as pa
+ if at == pa.bool_():
+ spark_type = BooleanType()
+ elif at == pa.int8():
+ spark_type = ByteType()
+ elif at == pa.int16():
+ spark_type = ShortType()
+ elif at == pa.int32():
+ spark_type = IntegerType()
+ elif at == pa.int64():
+ spark_type = LongType()
+ elif at == pa.float32():
+ spark_type = FloatType()
+ elif at == pa.float64():
+ spark_type = DoubleType()
+ elif type(at) == pa.DecimalType:
+ spark_type = DecimalType(precision=at.precision, scale=at.scale)
+ elif at == pa.string():
+ spark_type = StringType()
+ elif at == pa.date32():
+ spark_type = DateType()
+ elif type(at) == pa.TimestampType:
+ spark_type = TimestampType()
+ else:
+ raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
+ return spark_type
+
+
+def from_arrow_schema(arrow_schema):
+ """ Convert schema from Arrow to Spark.
+ """
+ return StructType(
+ [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
+ for field in arrow_schema])
+
+
+def _check_dataframe_localize_timestamps(pdf, schema, timezone):
"""
Convert timezone aware timestamps to timezone-naive in local time
:param pdf: pandas.DataFrame
:return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
"""
- from pandas.api.types import is_datetime64tz_dtype
- for column, series in pdf.iteritems():
- # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
- if is_datetime64tz_dtype(series.dtype):
- pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
+ import pandas as pd
+ try:
+ from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+ tz = timezone or 'tzlocal()'
+ for column, series in pdf.iteritems():
+ if type(schema[str(column)].dataType) == TimestampType:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if is_datetime64tz_dtype(series.dtype):
+ pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
+ elif is_datetime64_dtype(series.dtype) and timezone is not None:
+ # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
+ pdf[column] = series.apply(
+ lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
+ if ts is not pd.NaT else pd.NaT)
+ except ImportError:
+ from pandas.core.common import is_datetime64_dtype
+ from pandas.tslib import _dateutil_tzlocal
+ tzlocal = _dateutil_tzlocal()
+ tz = timezone or tzlocal
+ for column, series in pdf.iteritems():
+ if type(schema[str(column)].dataType) == TimestampType:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if not is_datetime64_dtype(series.dtype):
+ # `series.dt.tz_convert(tzlocal).dt.tz_localize(None)` doesn't work properly.
+ pdf[column] = pd.Series([ts.tz_convert(tz).tz_localize(None)
+ if ts is not pd.NaT else pd.NaT for ts in series])
+ elif is_datetime64_dtype(series.dtype) and timezone is not None:
+ # `series.dt.tz_localize(tzlocal)` doesn't work properly.
+ pdf[column] = pd.Series(
+ [ts.tz_localize(tzlocal).tz_convert(tz).tz_localize(None)
+ if ts is not pd.NaT else pd.NaT for ts in series])
return pdf
-def _check_series_convert_timestamps_internal(s):
+def _check_series_convert_timestamps_internal(s, timezone):
"""
Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
:param s: a pandas.Series
:return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
"""
- from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
- # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
- if is_datetime64_dtype(s.dtype):
- return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
- elif is_datetime64tz_dtype(s.dtype):
- return s.dt.tz_convert('UTC')
- else:
- return s
+ try:
+ from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if is_datetime64_dtype(s.dtype):
+ tz = timezone or 'tzlocal()'
+ return s.dt.tz_localize(tz).dt.tz_convert('UTC')
+ elif is_datetime64tz_dtype(s.dtype):
+ return s.dt.tz_convert('UTC')
+ else:
+ return s
+ except ImportError:
--- End diff --
Sure, let me look into it a little more and summarize what version we can support.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153107765
--- Diff: python/setup.py ---
@@ -201,7 +201,7 @@ def _supports_symlinks():
extras_require={
'ml': ['numpy>=1.7'],
'mllib': ['numpy>=1.7'],
- 'sql': ['pandas>=0.13.0']
+ 'sql': ['pandas>=0.19.2']
--- End diff --
Sure, I'll add it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83774 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83774/testReport)** for PR 19607 at commit [`8b1a4d8`](https://github.com/apache/spark/commit/8b1a4d8eb1c73d87eb3867fe4c1876cb9c48b2cf).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84204/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83212/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84210/testReport)** for PR 19607 at commit [`40a9735`](https://github.com/apache/spark/commit/40a9735b88deb85c08f618186daf9ed2152fc406).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83324/testReport)** for PR 19607 at commit [`1f096bf`](https://github.com/apache/spark/commit/1f096bf32f742945363cc7d9af978041ad77408b).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83472/testReport)** for PR 19607 at commit [`ce07f39`](https://github.com/apache/spark/commit/ce07f39643ef9711419e7e11082e62c578d816f5).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
Yup it should be separate. I meant to file another JIRA while we are here if it is something we need to fix before we fix before forgetting. If `df.collect()` is not meant to be fixed, I think I should reread the discussion and _maybe_ resolve the R JIRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83592 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83592/testReport)** for PR 19607 at commit [`1e0f217`](https://github.com/apache/spark/commit/1e0f21715f5ad053b5a5677a129677d498946ea3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
AllI know is here:
https://github.com/apache/spark/blob/aad2125475dcdeb4a0410392b6706511db17bac4/python/setup.py#L199-L205
and few places in documentation:
https://github.com/apache/spark/tree/c8d0aba198c0f593c2b6b656c23b3d0fb7ea98a2/python#python-requirements
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
> Do we need the config "spark.sql.execution.pandas.respectSessionTimeZone"?
I think we don't need this in this case. If this discussion lasts without a conclusion, I think we should better open a discussion in the mailing list to deduplicate such discussion in the future.
> What version of Pandas should we support?
I think this does not block this PR though. I don't have a strong opinion on this. Just FYI, there are some information that might help:
Pandas 0.19.2 seems [released in December 24, 2016](https://pandas.pydata.org/pandas-docs/stable/release.html#pandas-0-19-2)
Spark 2.1.0 seems [released in December 28, 2016](https://spark.apache.org/news/index.html).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
Jenkins, retest this please.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r148576656
--- Diff: python/pyspark/sql/types.py ---
@@ -1629,35 +1629,121 @@ def to_arrow_type(dt):
return arrow_type
-def _check_dataframe_localize_timestamps(pdf):
+def to_arrow_schema(schema):
--- End diff --
Ah, currently it isn't used. I'll remove it for now.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83211/testReport)** for PR 19607 at commit [`e28fc87`](https://github.com/apache/spark/commit/e28fc873cdbfd25189df35ce54c7c8ede9e1cc99).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
Sorry, but does anyone remember how we are going to deal with `df.collect()` in PySpark? R fix should be more like `df.collect()`. It should be good to file a JIRA for `df.collect()` in PySpark too while we are here, if I haven't missed some discussion about it.
Filed for R anyway - https://issues.apache.org/jira/browse/SPARK-22632.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
Thanks. I was just worried if I missed any discussion somewhere and wanted to double check.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83475/testReport)** for PR 19607 at commit [`9101a3a`](https://github.com/apache/spark/commit/9101a3a12f17b5bd633756139eaa2cb3ee9bb33c).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/19607
Correct me if I'm wrong, but I think depending on what version of Pandas we support this PR could be simplified a little to only use `is_datetime64tz_dtype` that is in 0.19.2+. So it might be a good idea to decide on a version here.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83593/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83295/testReport)** for PR 19607 at commit [`6872516`](https://github.com/apache/spark/commit/6872516e8cd9d7f81929c38708571c69a0af7883).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83475/testReport)** for PR 19607 at commit [`9101a3a`](https://github.com/apache/spark/commit/9101a3a12f17b5bd633756139eaa2cb3ee9bb33c).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19607
retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84062/testReport)** for PR 19607 at commit [`3e23653`](https://github.com/apache/spark/commit/3e23653572b9d555f5479cdd58ac3e15c7f88c28).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84062/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153142413
--- Diff: python/pyspark/sql/types.py ---
@@ -1678,37 +1679,105 @@ def from_arrow_schema(arrow_schema):
for field in arrow_schema])
-def _check_dataframe_localize_timestamps(pdf):
+def _old_pandas_exception_message(e):
+ """ Create an error message for importing old Pandas.
"""
- Convert timezone aware timestamps to timezone-naive in local time
+ msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
+ return "%s\n%s" % (_exception_message(e), msg)
+
+
+def _check_dataframe_localize_timestamps(pdf, timezone):
+ """
+ Convert timezone aware timestamps to timezone-naive in the specified timezone or local timezone
:param pdf: pandas.DataFrame
- :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
+ :param timezone: the timezone to convert. if None then use local timezone
+ :return pandas.DataFrame where any timezone aware columns have been converted to tz-naive
"""
- from pandas.api.types import is_datetime64tz_dtype
+ try:
+ from pandas.api.types import is_datetime64tz_dtype
+ except ImportError as e:
+ raise ImportError(_old_pandas_exception_message(e))
+ tz = timezone or 'tzlocal()'
for column, series in pdf.iteritems():
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
if is_datetime64tz_dtype(series.dtype):
- pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
+ pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
return pdf
-def _check_series_convert_timestamps_internal(s):
+def _check_series_convert_timestamps_internal(s, timezone):
"""
- Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
+ Convert a tz-naive timestamp in the specified timezone or local timezone to UTC normalized for
+ Spark internal storage
+
:param s: a pandas.Series
+ :param timezone: the timezone to convert. if None then use local timezone
:return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
"""
- from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
+ try:
+ from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
+ except ImportError as e:
+ raise ImportError(_old_pandas_exception_message(e))
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
if is_datetime64_dtype(s.dtype):
- return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
+ tz = timezone or 'tzlocal()'
+ return s.dt.tz_localize(tz).dt.tz_convert('UTC')
elif is_datetime64tz_dtype(s.dtype):
return s.dt.tz_convert('UTC')
else:
return s
+def _check_series_convert_timestamps_localize(s, fromTimezone, toTimezone):
+ """
+ Convert timestamp to timezone-naive in the specified timezone or local timezone
+
+ :param s: a pandas.Series
+ :param fromTimezone: the timezone to convert from. if None then use local timezone
+ :param toTimezone: the timezone to convert to. if None then use local timezone
+ :return pandas.Series where if it is a timestamp, has been converted to tz-naive
+ """
+ try:
+ import pandas as pd
+ from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+ except ImportError as e:
+ raise ImportError(_old_pandas_exception_message(e))
+ fromTz = fromTimezone or 'tzlocal()'
--- End diff --
Ditto.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r149661483
--- Diff: python/pyspark/sql/session.py ---
@@ -557,7 +577,13 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
except Exception:
has_pandas = False
if has_pandas and isinstance(data, pandas.DataFrame):
- data, schema = self._convert_from_pandas(data, schema)
+ if self.conf.get("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
--- End diff --
Do you mean we don't need the config and just fix the behavior? cc @gatorsmile
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
Jenkins, retest this please.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83207/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19607
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r152646370
--- Diff: python/setup.py ---
@@ -201,7 +201,7 @@ def _supports_symlinks():
extras_require={
'ml': ['numpy>=1.7'],
'mllib': ['numpy>=1.7'],
- 'sql': ['pandas>=0.13.0']
+ 'sql': ['pandas>=0.19.2']
--- End diff --
Document this requirement and behavior changes in `Migration Guide`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83212/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84242/testReport)** for PR 19607 at commit [`9200f38`](https://github.com/apache/spark/commit/9200f38b6414255a5c60127aeeae517086ba108b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83578 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83578/testReport)** for PR 19607 at commit [`4adb073`](https://github.com/apache/spark/commit/4adb073f8d1454fbea0742a16b6d7662e063b37a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
Unfortunately, `df.collect()` is out of scope of this pr. Its timestamp values will respect python system timezone.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83473/testReport)** for PR 19607 at commit [`ba3d6e3`](https://github.com/apache/spark/commit/ba3d6e3cf679e3db0a2e095f8cbe099ab4260532).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83320/testReport)** for PR 19607 at commit [`1f096bf`](https://github.com/apache/spark/commit/1f096bf32f742945363cc7d9af978041ad77408b).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
Maybe we need at least 2 external libraries like `dateutil` and `pytz` to handle timezones properly, which Pandas uses, but I have no idea to handle timezones out of Pandas properly for now.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r148540021
--- Diff: python/pyspark/sql/types.py ---
@@ -1629,35 +1629,121 @@ def to_arrow_type(dt):
return arrow_type
-def _check_dataframe_localize_timestamps(pdf):
+def to_arrow_schema(schema):
+ """ Convert a schema from Spark to Arrow
+ """
+ import pyarrow as pa
+ fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
+ for field in schema]
+ return pa.schema(fields)
+
+
+def from_arrow_type(at):
+ """ Convert pyarrow type to Spark data type.
+ """
+ # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
+ import pyarrow as pa
+ if at == pa.bool_():
+ spark_type = BooleanType()
+ elif at == pa.int8():
+ spark_type = ByteType()
+ elif at == pa.int16():
+ spark_type = ShortType()
+ elif at == pa.int32():
+ spark_type = IntegerType()
+ elif at == pa.int64():
+ spark_type = LongType()
+ elif at == pa.float32():
+ spark_type = FloatType()
+ elif at == pa.float64():
+ spark_type = DoubleType()
+ elif type(at) == pa.DecimalType:
+ spark_type = DecimalType(precision=at.precision, scale=at.scale)
+ elif at == pa.string():
+ spark_type = StringType()
+ elif at == pa.date32():
+ spark_type = DateType()
+ elif type(at) == pa.TimestampType:
+ spark_type = TimestampType()
+ else:
+ raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
+ return spark_type
+
+
+def from_arrow_schema(arrow_schema):
+ """ Convert schema from Arrow to Spark.
+ """
+ return StructType(
+ [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
+ for field in arrow_schema])
+
+
+def _check_dataframe_localize_timestamps(pdf, schema, timezone):
"""
Convert timezone aware timestamps to timezone-naive in local time
:param pdf: pandas.DataFrame
:return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
"""
- from pandas.api.types import is_datetime64tz_dtype
- for column, series in pdf.iteritems():
- # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
- if is_datetime64tz_dtype(series.dtype):
- pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
+ import pandas as pd
+ try:
+ from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+ tz = timezone or 'tzlocal()'
+ for column, series in pdf.iteritems():
+ if type(schema[str(column)].dataType) == TimestampType:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if is_datetime64tz_dtype(series.dtype):
+ pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
+ elif is_datetime64_dtype(series.dtype) and timezone is not None:
+ # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
+ pdf[column] = series.apply(
+ lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
+ if ts is not pd.NaT else pd.NaT)
+ except ImportError:
+ from pandas.core.common import is_datetime64_dtype
+ from pandas.tslib import _dateutil_tzlocal
+ tzlocal = _dateutil_tzlocal()
+ tz = timezone or tzlocal
+ for column, series in pdf.iteritems():
+ if type(schema[str(column)].dataType) == TimestampType:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if not is_datetime64_dtype(series.dtype):
+ # `series.dt.tz_convert(tzlocal).dt.tz_localize(None)` doesn't work properly.
+ pdf[column] = pd.Series([ts.tz_convert(tz).tz_localize(None)
+ if ts is not pd.NaT else pd.NaT for ts in series])
+ elif is_datetime64_dtype(series.dtype) and timezone is not None:
+ # `series.dt.tz_localize(tzlocal)` doesn't work properly.
+ pdf[column] = pd.Series(
+ [ts.tz_localize(tzlocal).tz_convert(tz).tz_localize(None)
+ if ts is not pd.NaT else pd.NaT for ts in series])
return pdf
-def _check_series_convert_timestamps_internal(s):
+def _check_series_convert_timestamps_internal(s, timezone):
"""
Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
:param s: a pandas.Series
:return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
"""
- from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
- # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
- if is_datetime64_dtype(s.dtype):
- return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
- elif is_datetime64tz_dtype(s.dtype):
- return s.dt.tz_convert('UTC')
- else:
- return s
+ try:
+ from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if is_datetime64_dtype(s.dtype):
+ tz = timezone or 'tzlocal()'
+ return s.dt.tz_localize(tz).dt.tz_convert('UTC')
+ elif is_datetime64tz_dtype(s.dtype):
+ return s.dt.tz_convert('UTC')
+ else:
+ return s
+ except ImportError:
--- End diff --
I think we should bump up pandas version if we can't find a workaround.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153386019
--- Diff: python/pyspark/sql/tests.py ---
@@ -3683,6 +3808,47 @@ def check_records_per_batch(x):
else:
self.spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", orig_value)
+ def test_vectorized_udf_timestamps_respect_session_timezone(self):
+ from pyspark.sql.functions import pandas_udf, col
+ from datetime import datetime
+ import pandas as pd
+ schema = StructType([
+ StructField("idx", LongType(), True),
+ StructField("timestamp", TimestampType(), True)])
+ data = [(1, datetime(1969, 1, 1, 1, 1, 1)),
+ (2, datetime(2012, 2, 2, 2, 2, 2)),
+ (3, None),
+ (4, datetime(2100, 3, 3, 3, 3, 3))]
+ df = self.spark.createDataFrame(data, schema=schema)
+
+ f_timestamp_copy = pandas_udf(lambda ts: ts, TimestampType())
+ internal_value = pandas_udf(
+ lambda ts: ts.apply(lambda ts: ts.value if ts is not pd.NaT else None), LongType())
+
+ orig_tz = self.spark.conf.get("spark.sql.session.timeZone")
+ try:
+ timezone = "America/New_York"
+ self.spark.conf.set("spark.sql.session.timeZone", timezone)
+ self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "false")
+ try:
+ df_la = df.withColumn("tscopy", f_timestamp_copy(col("timestamp"))) \
+ .withColumn("internal_value", internal_value(col("timestamp")))
+ result_la = df_la.select(col("idx"), col("internal_value")).collect()
+ diff = 3 * 60 * 60 * 1000 * 1000 * 1000
--- End diff --
Yes, I'll add some comments.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153176050
--- Diff: python/pyspark/sql/session.py ---
@@ -444,11 +445,30 @@ def _get_numpy_record_dtype(self, rec):
record_type_list.append((str(col_names[i]), curr_type))
return np.dtype(record_type_list) if has_rec_fix else None
- def _convert_from_pandas(self, pdf):
+ def _convert_from_pandas(self, pdf, schema, timezone):
--- End diff --
Just an idea not blocking this PR. Probably, we have enough codes to make a separate Python file / class to put Pandas / Arrow stuff into one place.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83473/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84210/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/19607
>Aside: Could someone start preparing a patch that uses Arrow 0.8.x (where a lot of issues that surfaced throughout this process have been fixed)?
@wesm I could start on this maybe in the next day or two
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84204/testReport)** for PR 19607 at commit [`f92eae3`](https://github.com/apache/spark/commit/f92eae35767a766ad80ac576a67f521e365549c7).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
Jenkins, retest this please.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83592 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83592/testReport)** for PR 19607 at commit [`1e0f217`](https://github.com/apache/spark/commit/1e0f21715f5ad053b5a5677a129677d498946ea3).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19607
@HyukjinKwon do you know where pyspark defines its python library requirements?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83586 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83586/testReport)** for PR 19607 at commit [`1e0f217`](https://github.com/apache/spark/commit/1e0f21715f5ad053b5a5677a129677d498946ea3).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r148541382
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -948,6 +948,14 @@ object SQLConf {
.intConf
.createWithDefault(10000)
+ val PANDAS_RESPECT_SESSION_LOCAL_TIMEZONE =
--- End diff --
can we clean up the code more if we don't have this config?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83324/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153386047
--- Diff: python/pyspark/sql/types.py ---
@@ -1678,37 +1679,105 @@ def from_arrow_schema(arrow_schema):
for field in arrow_schema])
-def _check_dataframe_localize_timestamps(pdf):
+def _old_pandas_exception_message(e):
+ """ Create an error message for importing old Pandas.
"""
- Convert timezone aware timestamps to timezone-naive in local time
+ msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
+ return "%s\n%s" % (_exception_message(e), msg)
+
+
+def _check_dataframe_localize_timestamps(pdf, timezone):
+ """
+ Convert timezone aware timestamps to timezone-naive in the specified timezone or local timezone
:param pdf: pandas.DataFrame
- :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
+ :param timezone: the timezone to convert. if None then use local timezone
+ :return pandas.DataFrame where any timezone aware columns have been converted to tz-naive
"""
- from pandas.api.types import is_datetime64tz_dtype
+ try:
+ from pandas.api.types import is_datetime64tz_dtype
+ except ImportError as e:
+ raise ImportError(_old_pandas_exception_message(e))
+ tz = timezone or 'tzlocal()'
for column, series in pdf.iteritems():
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
if is_datetime64tz_dtype(series.dtype):
- pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
+ pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
return pdf
-def _check_series_convert_timestamps_internal(s):
+def _check_series_convert_timestamps_internal(s, timezone):
"""
- Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
+ Convert a tz-naive timestamp in the specified timezone or local timezone to UTC normalized for
+ Spark internal storage
+
:param s: a pandas.Series
+ :param timezone: the timezone to convert. if None then use local timezone
:return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
"""
- from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
+ try:
+ from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
+ except ImportError as e:
+ raise ImportError(_old_pandas_exception_message(e))
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
if is_datetime64_dtype(s.dtype):
- return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
+ tz = timezone or 'tzlocal()'
+ return s.dt.tz_localize(tz).dt.tz_convert('UTC')
elif is_datetime64tz_dtype(s.dtype):
return s.dt.tz_convert('UTC')
else:
return s
+def _check_series_convert_timestamps_localize(s, fromTimezone, toTimezone):
+ """
+ Convert timestamp to timezone-naive in the specified timezone or local timezone
+
+ :param s: a pandas.Series
+ :param fromTimezone: the timezone to convert from. if None then use local timezone
+ :param toTimezone: the timezone to convert to. if None then use local timezone
+ :return pandas.Series where if it is a timestamp, has been converted to tz-naive
+ """
+ try:
+ import pandas as pd
+ from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+ except ImportError as e:
+ raise ImportError(_old_pandas_exception_message(e))
+ fromTz = fromTimezone or 'tzlocal()'
--- End diff --
I'll update it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83286/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r152174935
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
@@ -58,6 +59,11 @@ class ArrowPythonRunner(
protected override def writeCommand(dataOut: DataOutputStream): Unit = {
PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)
+ if (respectTimeZone) {
+ PythonRDD.writeUTF(timeZoneId, dataOut)
--- End diff --
Unfortunately, can't because the worker doesn't have sql conf.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r149055095
--- Diff: python/pyspark/serializers.py ---
@@ -274,12 +278,13 @@ def load_stream(self, stream):
"""
Deserialize ArrowRecordBatches to an Arrow table and return as a list of pandas.Series.
"""
- from pyspark.sql.types import _check_dataframe_localize_timestamps
+ from pyspark.sql.types import _check_dataframe_localize_timestamps, from_arrow_schema
import pyarrow as pa
reader = pa.open_stream(stream)
+ schema = from_arrow_schema(reader.schema)
for batch in reader:
# NOTE: changed from pa.Columns.to_pandas, timezone issue in conversion fixed in 0.7.1
- pdf = _check_dataframe_localize_timestamps(batch.to_pandas())
+ pdf = _check_dataframe_localize_timestamps(batch.to_pandas(), schema, self._timezone)
--- End diff --
Yes, current implementation is doing (1). I'm not sure if we should hold the timezone. cc @cloud-fan @gatorsmile
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84062/testReport)** for PR 19607 at commit [`3e23653`](https://github.com/apache/spark/commit/3e23653572b9d555f5479cdd58ac3e15c7f88c28).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r152174929
--- Diff: python/pyspark/worker.py ---
@@ -150,7 +150,8 @@ def read_udfs(pickleSer, infile, eval_type):
if eval_type == PythonEvalType.SQL_PANDAS_SCALAR_UDF \
or eval_type == PythonEvalType.SQL_PANDAS_GROUP_MAP_UDF:
- ser = ArrowStreamPandasSerializer()
+ timezone = utf8_deserializer.loads(infile)
--- End diff --
The timezone will be read by the following `ArrowStreamPandasSerializer`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
Here - https://github.com/apache/spark/pull/19674#issuecomment-342336767 I checked it lately.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83286 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83286/testReport)** for PR 19607 at commit [`b1436b8`](https://github.com/apache/spark/commit/b1436b8e4876838efbcb38cf0bffa7ebcc7a5544).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r149005429
--- Diff: python/pyspark/sql/types.py ---
@@ -1629,35 +1629,112 @@ def to_arrow_type(dt):
return arrow_type
-def _check_dataframe_localize_timestamps(pdf):
+def from_arrow_type(at):
+ """ Convert pyarrow type to Spark data type.
+ """
+ # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
+ import pyarrow as pa
+ if at == pa.bool_():
+ spark_type = BooleanType()
+ elif at == pa.int8():
+ spark_type = ByteType()
+ elif at == pa.int16():
+ spark_type = ShortType()
+ elif at == pa.int32():
+ spark_type = IntegerType()
+ elif at == pa.int64():
+ spark_type = LongType()
+ elif at == pa.float32():
+ spark_type = FloatType()
+ elif at == pa.float64():
+ spark_type = DoubleType()
+ elif type(at) == pa.DecimalType:
+ spark_type = DecimalType(precision=at.precision, scale=at.scale)
+ elif at == pa.string():
+ spark_type = StringType()
+ elif at == pa.date32():
+ spark_type = DateType()
+ elif type(at) == pa.TimestampType:
+ spark_type = TimestampType()
+ else:
+ raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
+ return spark_type
+
+
+def from_arrow_schema(arrow_schema):
+ """ Convert schema from Arrow to Spark.
+ """
+ return StructType(
+ [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
+ for field in arrow_schema])
+
+
+def _check_dataframe_localize_timestamps(pdf, schema, timezone):
"""
Convert timezone aware timestamps to timezone-naive in local time
:param pdf: pandas.DataFrame
:return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
"""
- from pandas.api.types import is_datetime64tz_dtype
- for column, series in pdf.iteritems():
- # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
- if is_datetime64tz_dtype(series.dtype):
- pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
+ import pandas as pd
+ try:
+ from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+ tz = timezone or 'tzlocal()'
+ for column, series in pdf.iteritems():
+ if type(schema[str(column)].dataType) == TimestampType:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if is_datetime64tz_dtype(series.dtype):
+ pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
+ elif is_datetime64_dtype(series.dtype) and timezone is not None:
+ # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
+ pdf[column] = series.apply(
+ lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
+ if ts is not pd.NaT else pd.NaT)
+ except ImportError:
--- End diff --
Yes, Pandas <0.19(?) seems to not have `pandas.api.types` but `pandas.core.common` instead.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
Yup, I was testing and trying to produce details. Let me describe this in the JIRA, not here :D.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153386007
--- Diff: python/pyspark/sql/tests.py ---
@@ -3192,16 +3255,49 @@ def test_filtered_frame(self):
self.assertEqual(pdf.columns[0], "i")
self.assertTrue(pdf.empty)
- def test_createDataFrame_toggle(self):
- pdf = self.create_pandas_data_frame()
+ def _createDataFrame_toggle(self, pdf, schema=None):
self.spark.conf.set("spark.sql.execution.arrow.enabled", "false")
try:
- df_no_arrow = self.spark.createDataFrame(pdf)
+ df_no_arrow = self.spark.createDataFrame(pdf, schema=schema)
finally:
self.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
- df_arrow = self.spark.createDataFrame(pdf)
+ df_arrow = self.spark.createDataFrame(pdf, schema=schema)
+ return df_no_arrow, df_arrow
+
+ def test_createDataFrame_toggle(self):
+ pdf = self.create_pandas_data_frame()
+ df_no_arrow, df_arrow = self._createDataFrame_toggle(pdf, schema=self.schema)
self.assertEquals(df_no_arrow.collect(), df_arrow.collect())
+ def test_createDataFrame_respect_session_timezone(self):
+ from datetime import timedelta
+ pdf = self.create_pandas_data_frame()
+ orig_tz = self.spark.conf.get("spark.sql.session.timeZone")
+ try:
+ timezone = "America/New_York"
+ self.spark.conf.set("spark.sql.session.timeZone", timezone)
+ self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "false")
+ try:
+ df_no_arrow_la, df_arrow_la = self._createDataFrame_toggle(pdf, schema=self.schema)
+ result_la = df_no_arrow_la.collect()
+ result_arrow_la = df_arrow_la.collect()
+ self.assertEqual(result_la, result_arrow_la)
+ finally:
+ self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "true")
+ df_no_arrow_ny, df_arrow_ny = self._createDataFrame_toggle(pdf, schema=self.schema)
+ result_ny = df_no_arrow_ny.collect()
+ result_arrow_ny = df_arrow_ny.collect()
+ self.assertEqual(result_ny, result_arrow_ny)
+
+ self.assertNotEqual(result_ny, result_la)
+
+ result_la_corrected = [Row(**{k: v - timedelta(hours=3) if k == '7_timestamp_t' else v
--- End diff --
Yes, the 3 hours timedelta is the time difference.
I'll add some comments.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r151999012
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
@@ -58,6 +59,11 @@ class ArrowPythonRunner(
protected override def writeCommand(dataOut: DataOutputStream): Unit = {
PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)
+ if (respectTimeZone) {
+ PythonRDD.writeUTF(timeZoneId, dataOut)
--- End diff --
why do we need to send it? Can't python side read the timezone from sql conf?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84013/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19607
BTW what's the pandas version we set up in jenkins? I'm afraid we don't even have a test for old pandas versions.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153386064
--- Diff: python/pyspark/sql/session.py ---
@@ -444,11 +445,30 @@ def _get_numpy_record_dtype(self, rec):
record_type_list.append((str(col_names[i]), curr_type))
return np.dtype(record_type_list) if has_rec_fix else None
- def _convert_from_pandas(self, pdf):
+ def _convert_from_pandas(self, pdf, schema, timezone):
--- End diff --
Thanks, I agree with it but maybe I'll leave those as they are in this pr.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83777/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83294/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r151998797
--- Diff: python/pyspark/worker.py ---
@@ -150,7 +150,8 @@ def read_udfs(pickleSer, infile, eval_type):
if eval_type == PythonEvalType.SQL_PANDAS_SCALAR_UDF \
or eval_type == PythonEvalType.SQL_PANDAS_GROUP_MAP_UDF:
- ser = ArrowStreamPandasSerializer()
+ timezone = utf8_deserializer.loads(infile)
--- End diff --
who will read this timezone?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83586/testReport)** for PR 19607 at commit [`1e0f217`](https://github.com/apache/spark/commit/1e0f21715f5ad053b5a5677a129677d498946ea3).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r151998296
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1913,7 +1920,16 @@ def toPandas(self):
for f, t in dtype.items():
pdf[f] = pdf[f].astype(t, copy=False)
- return pdf
+
+ if timezone is None:
+ return pdf
+ else:
+ from pyspark.sql.types import _check_series_convert_timestamps_local_tz
+ for field in self.schema:
+ if isinstance(field.dataType, TimestampType):
--- End diff --
add a TODO for nested timestamp field?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
Yea, that was my proposal. If anything is blocked by this, I think we should bump it up as, IMHO, technically the fixed version specification was not yet released and published.
^ cc @cloud-fan, @srowen and @viirya
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
Jenkins, retest this please.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
For Py4J, as a hard required dependency, its src-zip is in Spark partly to support non-pip installation as well properly AFAIK. In this case, the place to update seems https://github.com/apache/spark/commit/c8d0aba198c0f593c2b6b656c23b3d0fb7ea98a2 although it includes few more changes by Py4J jar as well, again, AFAIK.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
@BryanCutler I guess the oldest version of Pandas is `0.13.0` currently according to #18403, cc @HyukjinKwon.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83777 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83777/testReport)** for PR 19607 at commit [`8b1a4d8`](https://github.com/apache/spark/commit/8b1a4d8eb1c73d87eb3867fe4c1876cb9c48b2cf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84013 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84013/testReport)** for PR 19607 at commit [`9c94f90`](https://github.com/apache/spark/commit/9c94f90a703daaf08887259c1757420477a95b94).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83207 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83207/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83322/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
FWIW, to me, I think it's possibly a breaking change but I believe it's worth given the recent cool changes. +0 for bumping up anyway. Will join in the mailing thread too.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83211 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83211/testReport)** for PR 19607 at commit [`e28fc87`](https://github.com/apache/spark/commit/e28fc873cdbfd25189df35ce54c7c8ede9e1cc99).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
I'll ask dev list if we can drop support old Pandas (<0.19.2), or what version we should support if we can't.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153385979
--- Diff: python/pyspark/sql/session.py ---
@@ -444,11 +445,30 @@ def _get_numpy_record_dtype(self, rec):
record_type_list.append((str(col_names[i]), curr_type))
return np.dtype(record_type_list) if has_rec_fix else None
- def _convert_from_pandas(self, pdf):
+ def _convert_from_pandas(self, pdf, schema, timezone):
"""
Convert a pandas.DataFrame to list of records that can be used to make a DataFrame
:return list of records
"""
+ if timezone is not None:
+ from pyspark.sql.types import _check_series_convert_timestamps_tz_local
+ copied = False
+ if isinstance(schema, StructType):
+ for field in schema:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if isinstance(field.dataType, TimestampType):
+ s = _check_series_convert_timestamps_tz_local(pdf[field.name], timezone)
+ if not copied and s is not pdf[field.name]:
+ pdf = pdf.copy()
+ copied = True
--- End diff --
Yes, it's to prevent the original one from being updated.
I'll add some comments.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83205/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
Jenkins, retest this please.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84205/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84048 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84048/testReport)** for PR 19607 at commit [`3db2bea`](https://github.com/apache/spark/commit/3db2bea22c6bd240c871cb4580fbbf0ee3b659b4).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r152645369
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -997,6 +997,14 @@ object SQLConf {
.intConf
.createWithDefault(10000)
+ val PANDAS_RESPECT_SESSION_LOCAL_TIMEZONE =
+ buildConf("spark.sql.execution.pandas.respectSessionTimeZone")
+ .internal()
+ .doc("When true, make Pandas DataFrame with timestamp type respecting session local " +
+ "timezone when converting to/from Pandas DataFrame.")
--- End diff --
Emphasize the conf will be deprecated?
> When true, make Pandas DataFrame with timestamp type respecting session local timezone when converting to/from Pandas DataFrame. This configuration will be deprecated in the future releases.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by wesm <gi...@git.apache.org>.
Github user wesm commented on the issue:
https://github.com/apache/spark/pull/19607
Aside: Could someone start preparing a patch that uses Arrow 0.8.x (where a lot of issues that surfaced throughout this process have been fixed)?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
The remaining discussions in this pr are:
- [ ] Do we need the config `"spark.sql.execution.pandas.respectSessionTimeZone"`?
- @cloud-fan raised that we don't need the config. https://github.com/apache/spark/pull/19607#discussion_r149655172
- [ ] What version of Pandas should we support?
- Since old pandas needs a lot of workaround or doesn't support timestamp values properly, we need to bump up Pandas to support.
- @wesm suggested that we should support only `0.19.2` or upper. https://github.com/apache/spark/pull/19607#issuecomment-342371522
I'd really appreciate if you left comments on these and please let me know if I miss something.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84242 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84242/testReport)** for PR 19607 at commit [`9200f38`](https://github.com/apache/spark/commit/9200f38b6414255a5c60127aeeae517086ba108b).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83324 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83324/testReport)** for PR 19607 at commit [`1f096bf`](https://github.com/apache/spark/commit/1f096bf32f742945363cc7d9af978041ad77408b).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r149005432
--- Diff: python/pyspark/sql/types.py ---
@@ -1629,35 +1629,112 @@ def to_arrow_type(dt):
return arrow_type
-def _check_dataframe_localize_timestamps(pdf):
+def from_arrow_type(at):
+ """ Convert pyarrow type to Spark data type.
+ """
+ # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
+ import pyarrow as pa
+ if at == pa.bool_():
+ spark_type = BooleanType()
+ elif at == pa.int8():
+ spark_type = ByteType()
+ elif at == pa.int16():
+ spark_type = ShortType()
+ elif at == pa.int32():
+ spark_type = IntegerType()
+ elif at == pa.int64():
+ spark_type = LongType()
+ elif at == pa.float32():
+ spark_type = FloatType()
+ elif at == pa.float64():
+ spark_type = DoubleType()
+ elif type(at) == pa.DecimalType:
+ spark_type = DecimalType(precision=at.precision, scale=at.scale)
+ elif at == pa.string():
+ spark_type = StringType()
+ elif at == pa.date32():
+ spark_type = DateType()
+ elif type(at) == pa.TimestampType:
+ spark_type = TimestampType()
+ else:
+ raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
+ return spark_type
+
+
+def from_arrow_schema(arrow_schema):
+ """ Convert schema from Arrow to Spark.
+ """
+ return StructType(
+ [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
+ for field in arrow_schema])
+
+
+def _check_dataframe_localize_timestamps(pdf, schema, timezone):
"""
Convert timezone aware timestamps to timezone-naive in local time
:param pdf: pandas.DataFrame
:return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
"""
- from pandas.api.types import is_datetime64tz_dtype
- for column, series in pdf.iteritems():
- # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
- if is_datetime64tz_dtype(series.dtype):
- pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
+ import pandas as pd
+ try:
+ from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+ tz = timezone or 'tzlocal()'
+ for column, series in pdf.iteritems():
+ if type(schema[str(column)].dataType) == TimestampType:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if is_datetime64tz_dtype(series.dtype):
+ pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
+ elif is_datetime64_dtype(series.dtype) and timezone is not None:
+ # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
+ pdf[column] = series.apply(
+ lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
+ if ts is not pd.NaT else pd.NaT)
+ except ImportError:
+ from pandas.core.common import is_datetime64_dtype
+ from pandas.tslib import _dateutil_tzlocal
+ tzlocal = _dateutil_tzlocal()
+ tz = timezone or tzlocal
+ for column, series in pdf.iteritems():
+ if type(schema[str(column)].dataType) == TimestampType:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if not is_datetime64_dtype(series.dtype):
--- End diff --
Unfortunately, Pandas <0.17(?) seems to not have `is_datetime64tz_dtype`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r148538507
--- Diff: python/pyspark/sql/types.py ---
@@ -1629,35 +1629,121 @@ def to_arrow_type(dt):
return arrow_type
-def _check_dataframe_localize_timestamps(pdf):
+def to_arrow_schema(schema):
--- End diff --
where do we use this method?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83774/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83472/testReport)** for PR 19607 at commit [`ce07f39`](https://github.com/apache/spark/commit/ce07f39643ef9711419e7e11082e62c578d816f5).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83280/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/19607
Hi @ueshin , what is the oldest version of Pandas that's required to support and what exactly wasn't working with it?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r149655172
--- Diff: python/pyspark/sql/session.py ---
@@ -557,7 +577,13 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
except Exception:
has_pandas = False
if has_pandas and isinstance(data, pandas.DataFrame):
- data, schema = self._convert_from_pandas(data, schema)
+ if self.conf.get("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
--- End diff --
I feel this is a weird config... I think it's acceptable to introduce behavior change during bug fix, like the type inference bug we fixed when converting pandas dataframe to pyspark dataframe.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/19607
I guess we should look at R to see if it should behavior similarly? WDYT @HyukjinKwon ?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84020/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83780 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83780/testReport)** for PR 19607 at commit [`e919ed5`](https://github.com/apache/spark/commit/e919ed55758f75733d56287d5a49326b1067a44c).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84070/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
@wesm I'd like to know if you have any specific reason to recommend the version 0.19.2 or upper?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153167040
--- Diff: python/pyspark/sql/tests.py ---
@@ -3683,6 +3808,47 @@ def check_records_per_batch(x):
else:
self.spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", orig_value)
+ def test_vectorized_udf_timestamps_respect_session_timezone(self):
+ from pyspark.sql.functions import pandas_udf, col
+ from datetime import datetime
+ import pandas as pd
+ schema = StructType([
+ StructField("idx", LongType(), True),
+ StructField("timestamp", TimestampType(), True)])
+ data = [(1, datetime(1969, 1, 1, 1, 1, 1)),
+ (2, datetime(2012, 2, 2, 2, 2, 2)),
+ (3, None),
+ (4, datetime(2100, 3, 3, 3, 3, 3))]
+ df = self.spark.createDataFrame(data, schema=schema)
+
+ f_timestamp_copy = pandas_udf(lambda ts: ts, TimestampType())
+ internal_value = pandas_udf(
+ lambda ts: ts.apply(lambda ts: ts.value if ts is not pd.NaT else None), LongType())
+
+ orig_tz = self.spark.conf.get("spark.sql.session.timeZone")
+ try:
+ timezone = "America/New_York"
+ self.spark.conf.set("spark.sql.session.timeZone", timezone)
+ self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "false")
+ try:
+ df_la = df.withColumn("tscopy", f_timestamp_copy(col("timestamp"))) \
+ .withColumn("internal_value", internal_value(col("timestamp")))
+ result_la = df_la.select(col("idx"), col("internal_value")).collect()
+ diff = 3 * 60 * 60 * 1000 * 1000 * 1000
--- End diff --
Here too. it took me a while to check where this 3 came from ..
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r152201452
--- Diff: python/pyspark/sql/types.py ---
@@ -1678,37 +1678,105 @@ def from_arrow_schema(arrow_schema):
for field in arrow_schema])
-def _check_dataframe_localize_timestamps(pdf):
+def _old_pandas_exception_message(e):
+ """ Create an error message for importing old Pandas.
"""
- Convert timezone aware timestamps to timezone-naive in local time
+ msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
+ return "%s\n%s" % (e.message, msg)
--- End diff --
Oops, thanks!
We need to fix [dataframe.py#L1905](https://github.com/apache/spark/blob/3db2bea22c6bd240c871cb4580fbbf0ee3b659b4/python/pyspark/sql/dataframe.py#L1905) as well.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84044 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84044/testReport)** for PR 19607 at commit [`9cfdde2`](https://github.com/apache/spark/commit/9cfdde2ce07d00779a9f6f8f5ab86cc442b7655b).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
(Will give a shot to reproduce this within tomorrow and leave my output.)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r148576712
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -948,6 +948,14 @@ object SQLConf {
.intConf
.createWithDefault(10000)
+ val PANDAS_RESPECT_SESSION_LOCAL_TIMEZONE =
--- End diff --
Sure, I'll try it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
Ah, yes, I think so.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84048/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83780 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83780/testReport)** for PR 19607 at commit [`e919ed5`](https://github.com/apache/spark/commit/e919ed55758f75733d56287d5a49326b1067a44c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83593 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83593/testReport)** for PR 19607 at commit [`292678f`](https://github.com/apache/spark/commit/292678f3e47a4f5a20fd1af5da10e02cc4017882).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83578/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83280 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83280/testReport)** for PR 19607 at commit [`ee1a1c8`](https://github.com/apache/spark/commit/ee1a1c87e2a89974e4e299f4ad84e5831526d079).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83578 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83578/testReport)** for PR 19607 at commit [`4adb073`](https://github.com/apache/spark/commit/4adb073f8d1454fbea0742a16b6d7662e063b37a).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r149002542
--- Diff: python/pyspark/sql/types.py ---
@@ -1629,35 +1629,112 @@ def to_arrow_type(dt):
return arrow_type
-def _check_dataframe_localize_timestamps(pdf):
+def from_arrow_type(at):
+ """ Convert pyarrow type to Spark data type.
+ """
+ # TODO: newer pyarrow has is_boolean(at) functions that would be better to check type
+ import pyarrow as pa
+ if at == pa.bool_():
+ spark_type = BooleanType()
+ elif at == pa.int8():
+ spark_type = ByteType()
+ elif at == pa.int16():
+ spark_type = ShortType()
+ elif at == pa.int32():
+ spark_type = IntegerType()
+ elif at == pa.int64():
+ spark_type = LongType()
+ elif at == pa.float32():
+ spark_type = FloatType()
+ elif at == pa.float64():
+ spark_type = DoubleType()
+ elif type(at) == pa.DecimalType:
+ spark_type = DecimalType(precision=at.precision, scale=at.scale)
+ elif at == pa.string():
+ spark_type = StringType()
+ elif at == pa.date32():
+ spark_type = DateType()
+ elif type(at) == pa.TimestampType:
+ spark_type = TimestampType()
+ else:
+ raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
+ return spark_type
+
+
+def from_arrow_schema(arrow_schema):
+ """ Convert schema from Arrow to Spark.
+ """
+ return StructType(
+ [StructField(field.name, from_arrow_type(field.type), nullable=field.nullable)
+ for field in arrow_schema])
+
+
+def _check_dataframe_localize_timestamps(pdf, schema, timezone):
"""
Convert timezone aware timestamps to timezone-naive in local time
:param pdf: pandas.DataFrame
:return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
"""
- from pandas.api.types import is_datetime64tz_dtype
- for column, series in pdf.iteritems():
- # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
- if is_datetime64tz_dtype(series.dtype):
- pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
+ import pandas as pd
+ try:
+ from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+ tz = timezone or 'tzlocal()'
+ for column, series in pdf.iteritems():
+ if type(schema[str(column)].dataType) == TimestampType:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if is_datetime64tz_dtype(series.dtype):
+ pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
+ elif is_datetime64_dtype(series.dtype) and timezone is not None:
+ # `series.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
+ pdf[column] = series.apply(
+ lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
+ if ts is not pd.NaT else pd.NaT)
+ except ImportError:
+ from pandas.core.common import is_datetime64_dtype
+ from pandas.tslib import _dateutil_tzlocal
+ tzlocal = _dateutil_tzlocal()
+ tz = timezone or tzlocal
+ for column, series in pdf.iteritems():
+ if type(schema[str(column)].dataType) == TimestampType:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if not is_datetime64_dtype(series.dtype):
--- End diff --
Can't we just use `is_datetime64tz_dtype` as above?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84020/testReport)** for PR 19607 at commit [`9c94f90`](https://github.com/apache/spark/commit/9c94f90a703daaf08887259c1757420477a95b94).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
Yup, I think we should take a look between POSIXct / POSIXlt in R and timestamp within Spark too. Seems not respecting the session timezone in a quick look.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
Seems there is no explicit objection for dropping it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r152199700
--- Diff: python/pyspark/sql/types.py ---
@@ -1678,37 +1678,105 @@ def from_arrow_schema(arrow_schema):
for field in arrow_schema])
-def _check_dataframe_localize_timestamps(pdf):
+def _old_pandas_exception_message(e):
+ """ Create an error message for importing old Pandas.
"""
- Convert timezone aware timestamps to timezone-naive in local time
+ msg = "note: Pandas (>=0.19.2) must be installed and available on calling Python process"
+ return "%s\n%s" % (e.message, msg)
--- End diff --
I am quite sure the `message` attribute does not exist in Python 3:
```python
>>> try:
... import foo
... except ImportError as e:
... dir(e)
...
['__cause__', '__class__', '__context__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__suppress_context__', '__traceback__', 'args', 'msg', 'name', 'path', 'with_traceback']
```
Probably, we should use `from pyspark.util import _exception_message` (https://github.com/apache/spark/pull/16845).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84070 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84070/testReport)** for PR 19607 at commit [`d741171`](https://github.com/apache/spark/commit/d7411717c7c8c7c7aba63e4e19ebdfadfa7ea0c0).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83205/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
@wesm Thanks for the explanation!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83475/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83286 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83286/testReport)** for PR 19607 at commit [`b1436b8`](https://github.com/apache/spark/commit/b1436b8e4876838efbcb38cf0bffa7ebcc7a5544).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83280/testReport)** for PR 19607 at commit [`ee1a1c8`](https://github.com/apache/spark/commit/ee1a1c87e2a89974e4e299f4ad84e5831526d079).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153107748
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -997,6 +997,14 @@ object SQLConf {
.intConf
.createWithDefault(10000)
+ val PANDAS_RESPECT_SESSION_LOCAL_TIMEZONE =
+ buildConf("spark.sql.execution.pandas.respectSessionTimeZone")
+ .internal()
+ .doc("When true, make Pandas DataFrame with timestamp type respecting session local " +
+ "timezone when converting to/from Pandas DataFrame.")
--- End diff --
Sure, I'll update it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84044/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84205/testReport)** for PR 19607 at commit [`40a9735`](https://github.com/apache/spark/commit/40a9735b88deb85c08f618186daf9ed2152fc406).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83320/testReport)** for PR 19607 at commit [`1f096bf`](https://github.com/apache/spark/commit/1f096bf32f742945363cc7d9af978041ad77408b).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83211/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
Hmm, I can't reproduce the current failure in my local, which seems to not be related to old Pandas because the failure is on Python 3.4 and Pandas 0.19 (I guess). I tried the combination in my local.
Can anyone help me reproduce and fix it?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83592/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r149582142
--- Diff: python/pyspark/sql/types.py ---
@@ -1629,37 +1629,82 @@ def to_arrow_type(dt):
return arrow_type
-def _check_dataframe_localize_timestamps(pdf):
+def _check_dataframe_localize_timestamps(pdf, timezone):
"""
- Convert timezone aware timestamps to timezone-naive in local time
+ Convert timezone aware timestamps to timezone-naive in the specified timezone or local timezone
:param pdf: pandas.DataFrame
- :return pandas.DataFrame where any timezone aware columns have be converted to tz-naive
+ :param timezone: the timezone to convert. if None then use local timezone
+ :return pandas.DataFrame where any timezone aware columns have been converted to tz-naive
"""
from pandas.api.types import is_datetime64tz_dtype
+ tz = timezone or 'tzlocal()'
for column, series in pdf.iteritems():
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
if is_datetime64tz_dtype(series.dtype):
- pdf[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None)
+ pdf[column] = series.dt.tz_convert(tz).dt.tz_localize(None)
return pdf
-def _check_series_convert_timestamps_internal(s):
+def _check_series_convert_timestamps_internal(s, timezone):
"""
- Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage
+ Convert a tz-naive timestamp in the specified timezone or local timezone to UTC normalized for
+ Spark internal storage
+
:param s: a pandas.Series
+ :param timezone: the timezone to convert. if None then use local timezone
:return pandas.Series where if it is a timestamp, has been UTC normalized without a time zone
"""
from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
if is_datetime64_dtype(s.dtype):
- return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
+ tz = timezone or 'tzlocal()'
+ return s.dt.tz_localize(tz).dt.tz_convert('UTC')
elif is_datetime64tz_dtype(s.dtype):
return s.dt.tz_convert('UTC')
else:
return s
+def _check_series_convert_timestamps_localize(s, timezone):
+ """
+ Convert timestamp to timezone-naive in the specified timezone or local timezone
+
+ :param s: a pandas.Series
+ :param timezone: the timezone to convert. if None then use local timezone
+ :return pandas.Series where if it is a timestamp, has been converted to tz-naive
+ """
+ import pandas as pd
+ try:
+ from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+ tz = timezone or 'tzlocal()'
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if is_datetime64tz_dtype(s.dtype):
+ return s.dt.tz_convert(tz).dt.tz_localize(None)
+ elif is_datetime64_dtype(s.dtype) and timezone is not None:
+ # `s.dt.tz_localize('tzlocal()')` doesn't work properly when including NaT.
+ return s.apply(lambda ts: ts.tz_localize('tzlocal()').tz_convert(tz).tz_localize(None)
+ if ts is not pd.NaT else pd.NaT)
+ else:
+ return s
+ except ImportError:
--- End diff --
We will be able to remove this block if we decided to support only Pandas >=0.19.2.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by wesm <gi...@git.apache.org>.
Github user wesm commented on the issue:
https://github.com/apache/spark/pull/19607
Well, old versions of pandas are not supported at all. There have also been a number of API evolutions around extension dtypes that make supporting pandas 0.18.x and lower challenging. We have dropped support already in pyarrow, it is a serious liability to accept the maintenance burden of supporting such an outdated version of the project.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
Hm .. I see. but is this something we should fix though ideally? I am asking this because I am checking R related codes now ..
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior ...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r148598592
--- Diff: python/pyspark/serializers.py ---
@@ -274,12 +278,13 @@ def load_stream(self, stream):
"""
Deserialize ArrowRecordBatches to an Arrow table and return as a list of pandas.Series.
"""
- from pyspark.sql.types import _check_dataframe_localize_timestamps
+ from pyspark.sql.types import _check_dataframe_localize_timestamps, from_arrow_schema
import pyarrow as pa
reader = pa.open_stream(stream)
+ schema = from_arrow_schema(reader.schema)
for batch in reader:
# NOTE: changed from pa.Columns.to_pandas, timezone issue in conversion fixed in 0.7.1
- pdf = _check_dataframe_localize_timestamps(batch.to_pandas())
+ pdf = _check_dataframe_localize_timestamps(batch.to_pandas(), schema, self._timezone)
--- End diff --
If `self._timezone` is not None, then it will be the SESSION_LOCAL_TIMEZONE and Arrow data will already have this timezone set so nothing needs to be done here right?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19607
LGTM, merging to master!
Let's fix the R timestamp issue in a new ticket.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83586/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84020/testReport)** for PR 19607 at commit [`9c94f90`](https://github.com/apache/spark/commit/9c94f90a703daaf08887259c1757420477a95b94).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by wesm <gi...@git.apache.org>.
Github user wesm commented on the issue:
https://github.com/apache/spark/pull/19607
@jreback @cpcloud could use some input on how to maximize cross-pandas version compatibility in this patch.
I am not sure I would recommend that Spark support pandas versions lower than 0.19.2
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #84048 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84048/testReport)** for PR 19607 at commit [`3db2bea`](https://github.com/apache/spark/commit/3db2bea22c6bd240c871cb4580fbbf0ee3b659b4).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83295/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153168918
--- Diff: python/pyspark/sql/session.py ---
@@ -444,11 +445,30 @@ def _get_numpy_record_dtype(self, rec):
record_type_list.append((str(col_names[i]), curr_type))
return np.dtype(record_type_list) if has_rec_fix else None
- def _convert_from_pandas(self, pdf):
+ def _convert_from_pandas(self, pdf, schema, timezone):
"""
Convert a pandas.DataFrame to list of records that can be used to make a DataFrame
:return list of records
"""
+ if timezone is not None:
+ from pyspark.sql.types import _check_series_convert_timestamps_tz_local
+ copied = False
+ if isinstance(schema, StructType):
+ for field in schema:
+ # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+ if isinstance(field.dataType, TimestampType):
+ s = _check_series_convert_timestamps_tz_local(pdf[field.name], timezone)
+ if not copied and s is not pdf[field.name]:
+ pdf = pdf.copy()
+ copied = True
--- End diff --
Would you mind if I ask why we should copy here? Probably, some comments explaining it would be helpful. To be clear, Is it to prevent the original Pandas DataFrame being updated?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19607
Well, I'd leave the config `spark.sql.execution.pandas.respectSessionTimeZone` as it is for now to be safe as @gatorsmile mentioned before.
And as we discussed in dev list, I'll update this to remove workarounds for old Pandas but add some error messages saying we need 0.19.2 or upper.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of ti...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153149080
--- Diff: python/pyspark/sql/tests.py ---
@@ -3192,16 +3255,49 @@ def test_filtered_frame(self):
self.assertEqual(pdf.columns[0], "i")
self.assertTrue(pdf.empty)
- def test_createDataFrame_toggle(self):
- pdf = self.create_pandas_data_frame()
+ def _createDataFrame_toggle(self, pdf, schema=None):
self.spark.conf.set("spark.sql.execution.arrow.enabled", "false")
try:
- df_no_arrow = self.spark.createDataFrame(pdf)
+ df_no_arrow = self.spark.createDataFrame(pdf, schema=schema)
finally:
self.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
- df_arrow = self.spark.createDataFrame(pdf)
+ df_arrow = self.spark.createDataFrame(pdf, schema=schema)
+ return df_no_arrow, df_arrow
+
+ def test_createDataFrame_toggle(self):
+ pdf = self.create_pandas_data_frame()
+ df_no_arrow, df_arrow = self._createDataFrame_toggle(pdf, schema=self.schema)
self.assertEquals(df_no_arrow.collect(), df_arrow.collect())
+ def test_createDataFrame_respect_session_timezone(self):
+ from datetime import timedelta
+ pdf = self.create_pandas_data_frame()
+ orig_tz = self.spark.conf.get("spark.sql.session.timeZone")
+ try:
+ timezone = "America/New_York"
+ self.spark.conf.set("spark.sql.session.timeZone", timezone)
+ self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "false")
+ try:
+ df_no_arrow_la, df_arrow_la = self._createDataFrame_toggle(pdf, schema=self.schema)
+ result_la = df_no_arrow_la.collect()
+ result_arrow_la = df_arrow_la.collect()
+ self.assertEqual(result_la, result_arrow_la)
+ finally:
+ self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "true")
+ df_no_arrow_ny, df_arrow_ny = self._createDataFrame_toggle(pdf, schema=self.schema)
+ result_ny = df_no_arrow_ny.collect()
+ result_arrow_ny = df_arrow_ny.collect()
+ self.assertEqual(result_ny, result_arrow_ny)
+
+ self.assertNotEqual(result_ny, result_la)
+
+ result_la_corrected = [Row(**{k: v - timedelta(hours=3) if k == '7_timestamp_t' else v
--- End diff --
Small comments here would be helpful .. BTW, to be clear, this 3 hours timedelta is from America/Los_Angeles and America/New_York time difference?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19607
I doubt if we can still support pandas 0.13.0 now...
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19607
retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19607
**[Test build #83205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83205/testReport)** for PR 19607 at commit [`5c08ecf`](https://github.com/apache/spark/commit/5c08ecf247bfe7e14afcdef8eba1c25cb3b68634).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83320/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19607: [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of time...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19607
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org