Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/01 09:35:39 UTC
[GitHub] [spark] HyukjinKwon opened a new pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
HyukjinKwon opened a new pull request #33876:
URL: https://github.com/apache/spark/pull/33876
### What changes were proposed in this pull request?
This PR proposes to implement `TimestampNTZType` support in PySpark's `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.
This PR is dependent on #33875.
### Why are the changes needed?
To complete `TimestampNTZType` support.
### Does this PR introduce _any_ user-facing change?
Yes.
- Users can now use `TimestampNTZType` in `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.
- If `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`, `SparkSession.createDataFrame` infers a `datetime` without a timezone as `TimestampNTZType` and a `datetime` with a timezone as `TimestampType`.
- If `TimestampType` and `TimestampNTZType` conflict while merging inferred schemas, `TimestampType` takes precedence.
- A `TimestampNTZType` value is treated internally as UTC (same as the JVM) and is not localized externally.
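The rules in the list above can be illustrated with a small standalone sketch. It uses only the Python standard library and is not the code from this PR: the function names (`infer_timestamp_type`, `merge_timestamp_types`, `ntz_to_micros`, `micros_to_ntz`) and the `prefer_ntz` flag are hypothetical, plain strings stand in for the Spark types, and the microseconds-since-epoch representation is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Plain strings stand in for the Spark type names; these are not PySpark classes.
TIMESTAMP = "TimestampType"
TIMESTAMP_NTZ = "TimestampNTZType"
EPOCH = datetime(1970, 1, 1)  # naive epoch, read as UTC


def infer_timestamp_type(value, prefer_ntz):
    """Hypothetical inference rule: only a naive datetime becomes NTZ,
    and only when the NTZ preference is switched on."""
    if prefer_ntz and value.tzinfo is None:
        return TIMESTAMP_NTZ
    return TIMESTAMP


def merge_timestamp_types(left, right):
    """Hypothetical merge rule: on conflict, TimestampType wins."""
    return left if left == right else TIMESTAMP


def ntz_to_micros(value):
    """Read the naive wall-clock value as UTC; no local-zone shift is applied."""
    return (value - EPOCH) // timedelta(microseconds=1)


def micros_to_ntz(micros):
    """Inverse of ntz_to_micros; the result stays naive (not localized)."""
    return EPOCH + timedelta(microseconds=micros)


naive = datetime(2021, 9, 1, 9, 35, 39)
aware = datetime(2021, 9, 1, 9, 35, 39, tzinfo=timezone.utc)

print(infer_timestamp_type(naive, prefer_ntz=True))
print(infer_timestamp_type(aware, prefer_ntz=True))
print(merge_timestamp_types(TIMESTAMP_NTZ, TIMESTAMP))
print(micros_to_ntz(ntz_to_micros(naive)) == naive)
```

Because the conversion never consults the local timezone, the round trip returns the same wall-clock value regardless of where the code runs.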
### How was this patch tested?
Tested manually, and unit tests were added.
Closes #33517
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910049323
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142902/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909173737
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47387/
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909896157
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47405/
[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909373882
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142884/
[GitHub] [spark] ueshin commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r699811451
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
"has been set to false.\n %s" % str(e))
warnings.warn(msg)
raise
- data = self._convert_from_pandas(data, schema, timezone)
+
+ should_localize = not self._is_timestamp_ntz_preferred()
Review comment:
I mean, I'm just wondering why it's called here.
It can be called earlier to reuse the result, or it can be called later in each function.
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
"has been set to false.\n %s" % str(e))
warnings.warn(msg)
raise
- data = self._convert_from_pandas(data, schema, timezone)
+
+ should_localize = not self._is_timestamp_ntz_preferred()
Review comment:
nit: we should pass the result of `self._is_timestamp_ntz_preferred()` to `self._create_from_pandas_with_arrow(...)` at line 321 as well to avoid multiple calls?
Otherwise, we can call it in `self._convert_from_pandas(...)` instead?
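The first option in the comment above (look the flag up once and pass it to both conversion paths) might look roughly like the following sketch. Everything here is a hypothetical stand-in rather than PySpark code: `SessionSketch`, `create`, `_convert_with_arrow`, `_convert_plain`, and the `arrow_fails` switch are invented for illustration, and only the name `_is_timestamp_ntz_preferred` comes from the diff. It assumes the lookup has some cost worth avoiding.

```python
class SessionSketch:
    """Hypothetical stand-in for the session object; not SparkSession."""

    def __init__(self, ntz_preferred):
        self._ntz_preferred = ntz_preferred
        self.lookups = 0  # counts how often the preference is queried

    def _is_timestamp_ntz_preferred(self):
        self.lookups += 1
        return self._ntz_preferred

    def _convert_with_arrow(self, rows, should_localize, arrow_fails):
        # Simulated Arrow path that may fail and trigger the fallback.
        if arrow_fails:
            raise RuntimeError("simulated Arrow failure")
        return [("arrow", row, should_localize) for row in rows]

    def _convert_plain(self, rows, should_localize):
        return [("plain", row, should_localize) for row in rows]

    def create(self, rows, arrow_fails=False):
        # Query the preference once, then hand the result to whichever
        # path ends up running, including the fallback.
        should_localize = not self._is_timestamp_ntz_preferred()
        try:
            return self._convert_with_arrow(rows, should_localize, arrow_fails)
        except RuntimeError:
            return self._convert_plain(rows, should_localize)


session = SessionSketch(ntz_preferred=True)
result = session.create([1, 2], arrow_fails=True)
print(result, session.lookups)
```

The alternative raised in the comment, calling the lookup inside each conversion function, keeps the signatures shorter at the price of one lookup per path taken.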
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909835531
**[Test build #142899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142899/testReport)** for PR 33876 at commit [`1c0f1bc`](https://github.com/apache/spark/commit/1c0f1bc78e964d8626ba0784252b3920d508d9e2).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r699816264
##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -167,6 +167,18 @@ def test_toPandas_arrow_toggle(self):
assert_frame_equal(expected, pdf)
assert_frame_equal(expected, pdf_arrow)
+ def test_create_data_frame_to_pandas_timestamp_ntz(self):
+ # SPARK-36608: Test TimestampNTZ in createDataFrame and toPandas
Review comment:
```suggestion
# SPARK-36626: Test TimestampNTZ in createDataFrame and toPandas
```
##########
File path: python/pyspark/sql/tests/test_pandas_udf.py
##########
@@ -239,6 +240,23 @@ def udf(column):
with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": False}):
df.withColumn('udf', udf('id')).collect()
+ def test_pandas_udf_timestamp_ntz(self):
+ # SPARK-36608: Test TimestampNTZ in pandas UDF
Review comment:
```suggestion
# SPARK-36626: Test TimestampNTZ in pandas UDF
```
##########
File path: python/pyspark/sql/tests/test_udf.py
##########
@@ -552,6 +553,23 @@ def __call__(self, x):
self.assertEqual(f, f_.func)
self.assertEqual(return_type, f_.returnType)
+ def test_udf_timestamp_ntz(self):
+ # SPARK-36608: Test TimestampNTZ in Python UDF
Review comment:
```suggestion
# SPARK-36626: Test TimestampNTZ in Python UDF
```
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909862278
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47402/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r699814247
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
"has been set to false.\n %s" % str(e))
warnings.warn(msg)
raise
- data = self._convert_from_pandas(data, schema, timezone)
+
+ should_localize = not self._is_timestamp_ntz_preferred()
Review comment:
I think I called it here to make `_convert_from_pandas` independent of `SparkSession`. `_convert_from_pandas` doesn't have any reference to `self` ... but it looks like I overthought it ... let me call it there.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910049323
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142902/
[GitHub] [spark] HyukjinKwon commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909125909
cc @gengliangwang @ueshin @BryanCutler FYI
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909169501
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47387/
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909867183
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47405/
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909363727
**[Test build #142884 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142884/testReport)** for PR 33876 at commit [`49840e6`](https://github.com/apache/spark/commit/49840e65bbd3ed0172aa61233de03cd23e7c357b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `class TimestampNTZType(AtomicType, metaclass=DataTypeSingleton):`
[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909173737
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47387/
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909855883
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47402/
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909137449
**[Test build #142884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142884/testReport)** for PR 33876 at commit [`49840e6`](https://github.com/apache/spark/commit/49840e65bbd3ed0172aa61233de03cd23e7c357b).
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909844050
**[Test build #142901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142901/testReport)** for PR 33876 at commit [`d01643a`](https://github.com/apache/spark/commit/d01643a4e50825e727814dc365cc26a09b39af95).
[GitHub] [spark] SparkQA removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909137449
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910221383
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142901/
[GitHub] [spark] HyukjinKwon edited a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909125909
cc @gengliangwang @ueshin @BryanCutler @viirya FYI
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910044527
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142899/
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909866340
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47404/
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909137449
[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910221383
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142901/
[GitHub] [spark] HyukjinKwon closed pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #33876:
URL: https://github.com/apache/spark/pull/33876
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909846444
**[Test build #142902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142902/testReport)** for PR 33876 at commit [`25aeb37`](https://github.com/apache/spark/commit/25aeb37c9efecbddfb7ef55ab0d17b39b499c852).
[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909173737
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909862305
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47402/
[GitHub] [spark] ueshin commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r699570070
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
"has been set to false.\n %s" % str(e))
warnings.warn(msg)
raise
- data = self._convert_from_pandas(data, schema, timezone)
+
+ should_localize = self._wrapped._conf.timestampType().typeName() == "timestamp"
Review comment:
We should have a util method or function for this?
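One way to read this suggestion, as a minimal sketch rather than the code that was merged: wrap the type-name comparison in a single predicate so that call sites do not repeat it. `FakeType`, `FakeConf`, `is_timestamp_ntz_preferred`, and `should_localize` are hypothetical stand-ins so the snippet runs without a Spark session; only the `timestampType().typeName()` call chain and the `"timestamp"` literal come from the diff above, and the `"timestamp_ntz"` type name is an assumption.

```python
class FakeType:
    """Hypothetical stand-in for a Spark SQL data type object."""

    def __init__(self, name):
        self._name = name

    def typeName(self):
        return self._name


class FakeConf:
    """Hypothetical stand-in for the session conf used in the diff."""

    def __init__(self, type_name):
        self._type = FakeType(type_name)

    def timestampType(self):
        return self._type


def is_timestamp_ntz_preferred(conf):
    # Hypothetical helper: True when the configured default timestamp type
    # is anything other than the timezone-aware "timestamp" type.
    return conf.timestampType().typeName() != "timestamp"


def should_localize(conf):
    # Localization only applies when the timezone-aware type is the default.
    return not is_timestamp_ntz_preferred(conf)
```

With a helper like this, each call site reduces to one readable call instead of a string comparison against the type name.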
##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -710,9 +718,10 @@ def test_to_pandas_from_mixed_dataframe(self):
CAST(col6 AS DOUBLE) AS double,
CAST(col7 AS BOOLEAN) AS boolean,
CAST(col8 AS STRING) AS string,
- timestamp_seconds(col9) AS timestamp
- FROM VALUES (1, 1, 1, 1, 1, 1, 1, 1, 1),
- (NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+ timestamp_seconds(col9) AS timestamp,
+ timestamp_seconds(col9) AS timestamp_ntz
Review comment:
nit: `col10` instead of `col9`? Or you could reuse `col9` without adding values to the `VALUES` clause?
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
"has been set to false.\n %s" % str(e))
warnings.warn(msg)
raise
- data = self._convert_from_pandas(data, schema, timezone)
+
+ should_localize = not self._is_timestamp_ntz_preferred()
Review comment:
I mean, I'm just wondering why it's called here.
It can be called earlier to reuse the result, or it can be called later in each function.
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
"has been set to false.\n %s" % str(e))
warnings.warn(msg)
raise
- data = self._convert_from_pandas(data, schema, timezone)
+
+ should_localize = not self._is_timestamp_ntz_preferred()
Review comment:
nit: we should pass the result of `self._is_timestamp_ntz_preferred()` to `self._create_from_pandas_with_arrow(...)` at line 321 as well to avoid multiple calls?
Otherwise, we can call it in `self._convert_from_pandas(...)` instead?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909173737
[GitHub] [spark] SparkQA removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909835531
**[Test build #142899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142899/testReport)** for PR 33876 at commit [`1c0f1bc`](https://github.com/apache/spark/commit/1c0f1bc78e964d8626ba0784252b3920d508d9e2).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909871168
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47404/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r699814247
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
"has been set to false.\n %s" % str(e))
warnings.warn(msg)
raise
- data = self._convert_from_pandas(data, schema, timezone)
+
+ should_localize = not self._is_timestamp_ntz_preferred()
Review comment:
I think I called it here to keep `_convert_from_pandas` independent of `SparkSession`, since `_convert_from_pandas` doesn't reference `self` ... but it looks like I overthought this ... let me call it there instead.
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909871133
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47404/
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910047641
**[Test build #142902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142902/testReport)** for PR 33876 at commit [`25aeb37`](https://github.com/apache/spark/commit/25aeb37c9efecbddfb7ef55ab0d17b39b499c852).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910206437
**[Test build #142901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142901/testReport)** for PR 33876 at commit [`d01643a`](https://github.com/apache/spark/commit/d01643a4e50825e727814dc365cc26a09b39af95).
* This patch **fails from timeout after a configured wait of `500m`**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909170841
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47387/
[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910032935
**[Test build #142899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142899/testReport)** for PR 33876 at commit [`1c0f1bc`](https://github.com/apache/spark/commit/1c0f1bc78e964d8626ba0784252b3920d508d9e2).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] HyukjinKwon commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909125909
cc @gengliangwang @ueshin @BryanCutler FYI
[GitHub] [spark] SparkQA removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909846444
**[Test build #142902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142902/testReport)** for PR 33876 at commit [`25aeb37`](https://github.com/apache/spark/commit/25aeb37c9efecbddfb7ef55ab0d17b39b499c852).
[GitHub] [spark] ueshin commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r700422224
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -369,7 +373,10 @@ def _convert_from_pandas(self, pdf, schema, timezone):
pdf[field.name] = s
else:
for column, series in pdf.iteritems():
- s = _check_series_convert_timestamps_tz_local(series, timezone)
+ s = series
+ should_localize = not self._is_timestamp_ntz_preferred()
Review comment:
This should be out of the loop?
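A minimal sketch of the suggested change, assuming the preference check is loop-invariant (it reads session configuration, which does not change per column). `Converter`, its call counter, and the dict-of-lists stand-in for a pandas DataFrame are hypothetical and exist only to make the difference observable; this is not the code that was merged.

```python
class Converter:
    """Hypothetical stand-in that counts how often the check is evaluated."""

    def __init__(self, ntz_preferred):
        self._ntz_preferred = ntz_preferred
        self.calls = 0

    def _is_timestamp_ntz_preferred(self):
        self.calls += 1
        return self._ntz_preferred

    def convert_in_loop(self, columns):
        # Shape of the reviewed diff: the check is re-evaluated per column.
        out = {}
        for name, values in columns.items():
            should_localize = not self._is_timestamp_ntz_preferred()
            out[name] = (values, should_localize)
        return out

    def convert_hoisted(self, columns):
        # Suggested shape: evaluate once, then reuse inside the loop.
        should_localize = not self._is_timestamp_ntz_preferred()
        return {name: (values, should_localize) for name, values in columns.items()}
```

Both methods produce the same result; the hoisted version evaluates the check once regardless of how many columns there are.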
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909373882
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142884/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909896177
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47405/
[GitHub] [spark] viirya commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r700473408
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -369,7 +373,10 @@ def _convert_from_pandas(self, pdf, schema, timezone):
pdf[field.name] = s
else:
for column, series in pdf.iteritems():
- s = _check_series_convert_timestamps_tz_local(series, timezone)
+ s = series
+ should_localize = not self._is_timestamp_ntz_preferred()
+ if should_localize and is_datetime64tz_dtype(s.dtype) and s.dt.tz is not None:
Review comment:
It seems we already handle the `is_datetime64tz_dtype` case in `_check_series_convert_timestamps_tz_local`? And for the non-`is_datetime64tz_dtype` case, we skip localization? Does that differ from the previous behavior?
[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909896177
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47405/
[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909862305
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47402/
[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909871168
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47404/
[GitHub] [spark] HyukjinKwon edited a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909125909
cc @gengliangwang @ueshin @BryanCutler @viirya FYI
[GitHub] [spark] SparkQA removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909844050
**[Test build #142901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142901/testReport)** for PR 33876 at commit [`d01643a`](https://github.com/apache/spark/commit/d01643a4e50825e727814dc365cc26a09b39af95).
[GitHub] [spark] HyukjinKwon commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-911179250
All related tests passed.
Merged to master.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r700695609
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -369,7 +373,10 @@ def _convert_from_pandas(self, pdf, schema, timezone):
pdf[field.name] = s
else:
for column, series in pdf.iteritems():
- s = _check_series_convert_timestamps_tz_local(series, timezone)
+ s = series
+ should_localize = not self._is_timestamp_ntz_preferred()
+ if should_localize and is_datetime64tz_dtype(s.dtype) and s.dt.tz is not None:
Review comment:
Yes, it is being handled there: https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/python/pyspark/sql/pandas/types.py#L284-L292
For dtypes that fail the `is_datetime64tz_dtype` check, we skip localization because the timezone is unavailable. For example, when a pandas Series holds datetimes with `object` dtype, the `.dt` accessor cannot be used at all:
```python
>>> import pandas as pd
>>> import datetime
>>> s = pd.Series([datetime.datetime.now()])
>>> s.astype("object").dt
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../pandas/core/generic.py", line 5487, in __getattr__
    return object.__getattribute__(self, name)
  File "/.../pandas/core/accessor.py", line 181, in __get__
    accessor_obj = self._accessor(obj)
  File "/.../pandas/core/indexes/accessors.py", line 506, in __new__
    raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values
```
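A minimal sketch of the dtype guard discussed above, to show which kinds of Series would reach the localization path. `needs_localization` is a hypothetical helper for illustration only, not part of PySpark. It assumes that an `isinstance` check against `pd.DatetimeTZDtype` is an acceptable stand-in for `is_datetime64tz_dtype`, which newer pandas releases deprecate.

```python
import datetime

import pandas as pd


def needs_localization(series: pd.Series) -> bool:
    """Hypothetical helper: True only for timezone-aware datetime64 Series."""
    # Only a DatetimeTZDtype carries timezone information at the dtype level.
    # Naive datetime64 and object-dtype Series both fall through to False.
    return isinstance(series.dtype, pd.DatetimeTZDtype)


naive = pd.Series([datetime.datetime(2021, 9, 1, 9, 35, 39)])
aware = naive.dt.tz_localize("UTC")
as_object = naive.astype("object")

for name, s in [("naive", naive), ("aware", aware), ("object", as_object)]:
    print(name, s.dtype, needs_localization(s))
```

Checking the dtype first means `.dt` is only ever touched on Series where it is known to exist, so the `object` case never reaches the accessor that raises `AttributeError`.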
[GitHub] [spark] HyukjinKwon edited a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909125909
cc @gengliangwang @ueshin @BryanCutler @viirya FYI
[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910044527
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142899/
[GitHub] [spark] SparkQA removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909137449
**[Test build #142884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142884/testReport)** for PR 33876 at commit [`49840e6`](https://github.com/apache/spark/commit/49840e65bbd3ed0172aa61233de03cd23e7c357b).