Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/01 09:35:39 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

HyukjinKwon opened a new pull request #33876:
URL: https://github.com/apache/spark/pull/33876


   ### What changes were proposed in this pull request?
   
   This PR proposes to implement `TimestampNTZType` support in PySpark's `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.
   
   This PR is dependent on #33875.
   
   ### Why are the changes needed?
   
   To complete `TimestampNTZType` support.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   
   - Users can now use `TimestampNTZType` in `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.
   
   - If `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`, PySpark infers a `datetime` without a timezone as `TimestampNTZType` in `SparkSession.createDataFrame`. A `datetime` with a timezone is inferred as `TimestampType`.
   
       - If `TimestampType` and `TimestampNTZType` conflict while merging the inferred schema, `TimestampType` takes precedence.
   
   - If the type is `TimestampNTZType`, the value is treated internally as UTC (same as the JVM), and localization is avoided externally.
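
The inference and precedence rules in the list above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: the helper names (`infer_timestamp_type`, `merge_timestamp_types`, `ntz_to_micros`, `micros_to_ntz`) are hypothetical, the types are represented as plain strings, and it is assumed that a naive `datetime` is still inferred as `TimestampType` when `TIMESTAMP_NTZ` is not configured.

```python
from datetime import datetime, timedelta, timezone

# Plain strings standing in for the PySpark types (illustration only).
TIMESTAMP = "TimestampType"
TIMESTAMP_NTZ = "TimestampNTZType"

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def infer_timestamp_type(value, prefer_ntz):
    """Hypothetical helper: pick a type for one datetime value.

    prefer_ntz stands for spark.sql.timestampType being TIMESTAMP_NTZ.
    """
    if prefer_ntz and value.tzinfo is None:
        return TIMESTAMP_NTZ
    # Timezone-aware values (and, by assumption, everything when
    # TIMESTAMP_NTZ is not preferred) map to TimestampType.
    return TIMESTAMP


def merge_timestamp_types(a, b):
    """Hypothetical helper: on a conflict, TimestampType takes precedence."""
    return a if a == b else TIMESTAMP


def ntz_to_micros(value):
    """Encode a naive datetime as microseconds, reading it as UTC wall-clock
    time so that no local or session timezone shift is applied."""
    return (value.replace(tzinfo=timezone.utc) - EPOCH) // timedelta(microseconds=1)


def micros_to_ntz(micros):
    """Decode microseconds back to a naive datetime without localizing it."""
    return (EPOCH + timedelta(microseconds=micros)).replace(tzinfo=None)
```

Because the naive value is read as UTC in both directions, a round trip through `ntz_to_micros` and `micros_to_ntz` returns the same wall-clock time regardless of the machine's local timezone.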
   
   ### How was this patch tested?
   
   Tested manually, and unit tests were added.
   
   Closes #33517


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910049323


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142902/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909173737


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47387/
   




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909896157


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47405/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909373882


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142884/
   




[GitHub] [spark] ueshin commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r699811451



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
                         "has been set to false.\n  %s" % str(e))
                     warnings.warn(msg)
                     raise
-        data = self._convert_from_pandas(data, schema, timezone)
+
+        should_localize = not self._is_timestamp_ntz_preferred()

Review comment:
       I mean, I'm just wondering why it's called here.
   It can be called earlier to reuse the result, or it can be called later in each function.

##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
                         "has been set to false.\n  %s" % str(e))
                     warnings.warn(msg)
                     raise
-        data = self._convert_from_pandas(data, schema, timezone)
+
+        should_localize = not self._is_timestamp_ntz_preferred()

Review comment:
       nit: we should pass the result of `self._is_timestamp_ntz_preferred()` to `self._create_from_pandas_with_arrow(...)` at line 321 as well to avoid multiple calls?
   Otherwise, we can call it in `self._convert_from_pandas(...)` instead?
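
The first option in the comment above (look the setting up once and pass the result to both conversion paths) could look roughly like the following. This is a self-contained sketch, not the PySpark code: `ConversionSketch`, its method names, and the `conf_lookups` counter are hypothetical stand-ins used only to show the call pattern.

```python
class ConversionSketch:
    """Hypothetical stand-in for the session mixin discussed in the review."""

    def __init__(self, ntz_preferred):
        self._ntz_preferred = ntz_preferred
        self.conf_lookups = 0  # counts how often the setting is read

    def _is_timestamp_ntz_preferred(self):
        # Stand-in for the real configuration lookup.
        self.conf_lookups += 1
        return self._ntz_preferred

    def create_data_frame(self, data, arrow_enabled):
        # Read the setting once and reuse it on whichever path is taken.
        prefer_ntz = self._is_timestamp_ntz_preferred()
        if arrow_enabled:
            return self._create_with_arrow(data, prefer_ntz)
        return self._convert(data, should_localize=not prefer_ntz)

    def _create_with_arrow(self, data, prefer_ntz):
        # The Arrow path receives the flag instead of looking it up again.
        return ("arrow", data, not prefer_ntz)

    def _convert(self, data, should_localize):
        # The non-Arrow path only needs to know whether to localize.
        return ("plain", data, should_localize)
```

The alternative raised in the same comment, calling the lookup inside the conversion helper itself, trades the extra parameter for a helper that depends on the session.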






[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909835531


   **[Test build #142899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142899/testReport)** for PR 33876 at commit [`1c0f1bc`](https://github.com/apache/spark/commit/1c0f1bc78e964d8626ba0784252b3920d508d9e2).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r699816264



##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -167,6 +167,18 @@ def test_toPandas_arrow_toggle(self):
         assert_frame_equal(expected, pdf)
         assert_frame_equal(expected, pdf_arrow)
 
+    def test_create_data_frame_to_pandas_timestamp_ntz(self):
+        # SPARK-36608: Test TimestampNTZ in createDataFrame and toPandas

Review comment:
       ```suggestion
           # SPARK-36626: Test TimestampNTZ in createDataFrame and toPandas
   ```

##########
File path: python/pyspark/sql/tests/test_pandas_udf.py
##########
@@ -239,6 +240,23 @@ def udf(column):
         with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": False}):
             df.withColumn('udf', udf('id')).collect()
 
+    def test_pandas_udf_timestamp_ntz(self):
+        # SPARK-36608: Test TimestampNTZ in pandas UDF

Review comment:
       ```suggestion
           # SPARK-36626: Test TimestampNTZ in pandas UDF
   ```

##########
File path: python/pyspark/sql/tests/test_udf.py
##########
@@ -552,6 +553,23 @@ def __call__(self, x):
         self.assertEqual(f, f_.func)
         self.assertEqual(return_type, f_.returnType)
 
+    def test_udf_timestamp_ntz(self):
+        # SPARK-36608: Test TimestampNTZ in Python UDF

Review comment:
       ```suggestion
           # SPARK-36626: Test TimestampNTZ in Python UDF
   ```






[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909862278


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47402/
   




[GitHub] [spark] HyukjinKwon commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r699814247



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
                         "has been set to false.\n  %s" % str(e))
                     warnings.warn(msg)
                     raise
-        data = self._convert_from_pandas(data, schema, timezone)
+
+        should_localize = not self._is_timestamp_ntz_preferred()

Review comment:
       I think I called it here to keep `_convert_from_pandas` independent of `SparkSession`. `_convert_from_pandas` doesn't have any reference to `self` ... but it looks like I overthought it ... let me call it there.







[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910049323


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142902/
   




[GitHub] [spark] HyukjinKwon commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909125909


   cc @gengliangwang @ueshin @BryanCutler FYI




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909169501


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47387/
   




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909867183


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47405/
   




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909363727


   **[Test build #142884 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142884/testReport)** for PR 33876 at commit [`49840e6`](https://github.com/apache/spark/commit/49840e65bbd3ed0172aa61233de03cd23e7c357b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `class TimestampNTZType(AtomicType, metaclass=DataTypeSingleton):`




[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909173737


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47387/
   




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909855883


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47402/
   




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909137449


   **[Test build #142884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142884/testReport)** for PR 33876 at commit [`49840e6`](https://github.com/apache/spark/commit/49840e65bbd3ed0172aa61233de03cd23e7c357b).




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909844050


   **[Test build #142901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142901/testReport)** for PR 33876 at commit [`d01643a`](https://github.com/apache/spark/commit/d01643a4e50825e727814dc365cc26a09b39af95).




[GitHub] [spark] SparkQA removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909137449








[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910221383


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142901/
   








[GitHub] [spark] HyukjinKwon edited a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909125909


   cc @gengliangwang @ueshin @BryanCutler @viirya FYI




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910044527


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142899/
   




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909866340


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47404/
   






[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910221383


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142901/
   




[GitHub] [spark] HyukjinKwon closed pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #33876:
URL: https://github.com/apache/spark/pull/33876


   




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909846444


   **[Test build #142902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142902/testReport)** for PR 33876 at commit [`25aeb37`](https://github.com/apache/spark/commit/25aeb37c9efecbddfb7ef55ab0d17b39b499c852).




[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909173737








[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909862305


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47402/
   




[GitHub] [spark] ueshin commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r699570070



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
                         "has been set to false.\n  %s" % str(e))
                     warnings.warn(msg)
                     raise
-        data = self._convert_from_pandas(data, schema, timezone)
+
+        should_localize = self._wrapped._conf.timestampType().typeName() == "timestamp"

Review comment:
       Should we have a util method or function for this?

##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -710,9 +718,10 @@ def test_to_pandas_from_mixed_dataframe(self):
             CAST(col6 AS DOUBLE) AS double,
             CAST(col7 AS BOOLEAN) AS boolean,
             CAST(col8 AS STRING) AS string,
-            timestamp_seconds(col9) AS timestamp
-            FROM VALUES (1, 1, 1, 1, 1, 1, 1, 1, 1),
-                        (NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+            timestamp_seconds(col9) AS timestamp,
+            timestamp_seconds(col9) AS timestamp_ntz

Review comment:
       nit: `col10` instead of `col9`? Or you can reuse `col9` without adding values to the `VALUES` clause?

##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
                         "has been set to false.\n  %s" % str(e))
                     warnings.warn(msg)
                     raise
-        data = self._convert_from_pandas(data, schema, timezone)
+
+        should_localize = not self._is_timestamp_ntz_preferred()

Review comment:
       I mean, I'm just wondering why it's called here.
   It can be called earlier to reuse the result, or it can be called later in each function.

##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
                         "has been set to false.\n  %s" % str(e))
                     warnings.warn(msg)
                     raise
-        data = self._convert_from_pandas(data, schema, timezone)
+
+        should_localize = not self._is_timestamp_ntz_preferred()

Review comment:
       nit: we should pass the result of `self._is_timestamp_ntz_preferred()` to `self._create_from_pandas_with_arrow(...)` at line 321 as well to avoid multiple calls?
   Otherwise, we can call it in `self._convert_from_pandas(...)` instead?
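
The suggestion above (look the preference up once and pass the result to both code paths) could be sketched roughly as follows. This is a minimal illustration, not the PySpark implementation: `FakeSession`, `create_data_frame`, `_create_with_arrow`, and `_convert` are hypothetical stand-ins, and `_is_timestamp_ntz_preferred` is stubbed rather than reading a real Spark configuration.

```python
class FakeSession:
    """Hypothetical stand-in for a session object; counts preference lookups."""

    def __init__(self, timestamp_type="timestamp_ntz"):
        self._timestamp_type = timestamp_type
        self.lookups = 0

    def _is_timestamp_ntz_preferred(self):
        # Stub for the real lookup; here it only compares a stored string.
        self.lookups += 1
        return self._timestamp_type == "timestamp_ntz"

    def create_data_frame(self, data, use_arrow):
        # Look the preference up once and hand it to whichever path runs.
        prefers_ntz = self._is_timestamp_ntz_preferred()
        if use_arrow:
            return self._create_with_arrow(data, prefers_ntz)
        return self._convert(data, should_localize=not prefers_ntz)

    def _create_with_arrow(self, data, prefers_ntz):
        # Placeholder for the Arrow path; just echoes what it received.
        return ("arrow", prefers_ntz, list(data))

    def _convert(self, data, should_localize):
        # Placeholder for the non-Arrow path; just echoes what it received.
        return ("plain", should_localize, list(data))


session = FakeSession()
plain_result = session.create_data_frame([1, 2], use_arrow=False)
```

With this shape, either path sees the same value and the lookup runs once per call, which is the property the review comment asks about.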






[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909173737








[GitHub] [spark] SparkQA removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909835531


   **[Test build #142899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142899/testReport)** for PR 33876 at commit [`1c0f1bc`](https://github.com/apache/spark/commit/1c0f1bc78e964d8626ba0784252b3920d508d9e2).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909871168


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47404/
   




[GitHub] [spark] HyukjinKwon commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r699814247



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -336,10 +338,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
                         "has been set to false.\n  %s" % str(e))
                     warnings.warn(msg)
                     raise
-        data = self._convert_from_pandas(data, schema, timezone)
+
+        should_localize = not self._is_timestamp_ntz_preferred()

Review comment:
       I think I called it here to make `_convert_from_pandas` independent from `SparkSession`. `_convert_from_pandas` doesn't have any reference to `self` ... but it looks like I overthought it ... let me call it there.






[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909871133


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47404/
   




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910047641


   **[Test build #142902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142902/testReport)** for PR 33876 at commit [`25aeb37`](https://github.com/apache/spark/commit/25aeb37c9efecbddfb7ef55ab0d17b39b499c852).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910206437


   **[Test build #142901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142901/testReport)** for PR 33876 at commit [`d01643a`](https://github.com/apache/spark/commit/d01643a4e50825e727814dc365cc26a09b39af95).
    * This patch **fails from timeout after a configured wait of `500m`**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909170841


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47387/
   




[GitHub] [spark] SparkQA commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910032935


   **[Test build #142899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142899/testReport)** for PR 33876 at commit [`1c0f1bc`](https://github.com/apache/spark/commit/1c0f1bc78e964d8626ba0784252b3920d508d9e2).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] HyukjinKwon commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909125909


   cc @gengliangwang @ueshin @BryanCutler FYI




[GitHub] [spark] SparkQA removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909846444


   **[Test build #142902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142902/testReport)** for PR 33876 at commit [`25aeb37`](https://github.com/apache/spark/commit/25aeb37c9efecbddfb7ef55ab0d17b39b499c852).




[GitHub] [spark] ueshin commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r700422224



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -369,7 +373,10 @@ def _convert_from_pandas(self, pdf, schema, timezone):
                             pdf[field.name] = s
             else:
                 for column, series in pdf.iteritems():
-                    s = _check_series_convert_timestamps_tz_local(series, timezone)
+                    s = series
+                    should_localize = not self._is_timestamp_ntz_preferred()

Review comment:
       This should be out of the loop?
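
To illustrate the point about the loop: the preference does not change between columns, so it can be evaluated once before iterating. The sketch below is a hedged illustration only; `convert_columns` and `fake_preference` are hypothetical names, and the per-column work is reduced to a label instead of real timestamp conversion.

```python
def convert_columns(columns, is_ntz_preferred):
    """Hypothetical loop over (name, values) pairs; not the PySpark code."""
    # Loop-invariant: evaluate the preference once, before iterating.
    should_localize = not is_ntz_preferred()
    converted = {}
    for name, values in columns:
        label = "localized" if should_localize else "as-is"
        converted[name] = (label, values)
    return converted


calls = []


def fake_preference():
    # Records each invocation so the number of lookups can be checked.
    calls.append(1)
    return True


result = convert_columns([("a", [1]), ("b", [2])], fake_preference)
```

However many columns are processed, `fake_preference` is invoked a single time.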






[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909373882


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142884/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909896177


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47405/
   




[GitHub] [spark] viirya commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r700473408



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -369,7 +373,10 @@ def _convert_from_pandas(self, pdf, schema, timezone):
                             pdf[field.name] = s
             else:
                 for column, series in pdf.iteritems():
-                    s = _check_series_convert_timestamps_tz_local(series, timezone)
+                    s = series
+                    should_localize = not self._is_timestamp_ntz_preferred()
+                    if should_localize and is_datetime64tz_dtype(s.dtype) and s.dt.tz is not None:

Review comment:
       It seems we already handle `is_datetime64tz_dtype` in `_check_series_convert_timestamps_tz_local`? For the non-`is_datetime64tz_dtype` case, do we skip localization? Is that different from the previous behavior?
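
To make the three cases in the condition concrete, here is a minimal sketch using plain `datetime` objects instead of pandas Series. It does not settle whether behavior changed from before; `to_local_naive` is a hypothetical helper, and the fixed `+09:00` offset is an arbitrary choice for the example.

```python
from datetime import datetime, timedelta, timezone


def to_local_naive(values, should_localize, local_tz):
    """Hypothetical helper mirroring the shape of the guard for plain datetimes."""
    out = []
    for v in values:
        if should_localize and v.tzinfo is not None:
            # tz-aware value: shift to the local zone, then drop tzinfo.
            out.append(v.astimezone(local_tz).replace(tzinfo=None))
        else:
            # Naive value, or localization disabled: pass through unchanged.
            out.append(v)
    return out


local = timezone(timedelta(hours=9))  # arbitrary fixed offset for the example
aware = datetime(2021, 9, 1, 0, 0, tzinfo=timezone.utc)
naive = datetime(2021, 9, 1, 0, 0)
```

Only the combination "localization enabled and tz-aware input" changes a value; naive inputs and the localization-disabled case come back as they went in.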






[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909896177


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47405/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909862305


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47402/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909871168


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47404/
   




[GitHub] [spark] HyukjinKwon edited a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909125909


   cc @gengliangwang @ueshin @BryanCutler @viirya FYI




[GitHub] [spark] SparkQA removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909844050


   **[Test build #142901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142901/testReport)** for PR 33876 at commit [`d01643a`](https://github.com/apache/spark/commit/d01643a4e50825e727814dc365cc26a09b39af95).




[GitHub] [spark] HyukjinKwon commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-911179250


   All related tests passed.
   
   Merged to master.




[GitHub] [spark] HyukjinKwon commented on a change in pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33876:
URL: https://github.com/apache/spark/pull/33876#discussion_r700695609



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -369,7 +373,10 @@ def _convert_from_pandas(self, pdf, schema, timezone):
                             pdf[field.name] = s
             else:
                 for column, series in pdf.iteritems():
-                    s = _check_series_convert_timestamps_tz_local(series, timezone)
+                    s = series
+                    should_localize = not self._is_timestamp_ntz_preferred()
+                    if should_localize and is_datetime64tz_dtype(s.dtype) and s.dt.tz is not None:

Review comment:
       Yes, it is being handled there: https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/python/pyspark/sql/pandas/types.py#L284-L292
   
   For series that do not satisfy `is_datetime64tz_dtype`, we skip localization because the timezone is unavailable, for example when a pandas Series contains datetimes with `object` dtype:
   
   ```python
   >>> import pandas as pd
   >>> import datetime
   >>> s = pd.Series([datetime.datetime.now()])
   >>> s.astype("object").dt
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../pandas/core/generic.py", line 5487, in __getattr__
       return object.__getattribute__(self, name)
     File "/.../pandas/core/accessor.py", line 181, in __get__
       accessor_obj = self._accessor(obj)
     File "/.../pandas/core/indexes/accessors.py", line 506, in __new__
       raise AttributeError("Can only use .dt accessor with datetimelike values")
   AttributeError: Can only use .dt accessor with datetimelike values
   ```
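
The guard in the diff above can be sketched as follows. This is a minimal illustration, not the PySpark implementation: `has_tz` is a hypothetical helper, and it uses `isinstance(..., pd.DatetimeTZDtype)` as a stand-in for `is_datetime64tz_dtype`, on the assumption that the two agree for Series dtypes. Because the dtype check runs first, `.dt` is never accessed on an `object` column, so the `AttributeError` shown above is avoided.

```python
import datetime

import pandas as pd


def has_tz(series):
    """Hypothetical helper: True only for tz-aware datetime64 Series."""
    # Check the dtype first; `and` short-circuits, so `.dt` is only
    # reached when the Series really holds tz-aware datetimes.
    return isinstance(series.dtype, pd.DatetimeTZDtype) and series.dt.tz is not None


# A naive datetime Series, a tz-aware copy, and an object-dtype copy.
naive = pd.Series([datetime.datetime(2021, 9, 1, 9, 35, 39)])
aware = naive.dt.tz_localize("UTC")
as_object = naive.astype("object")

for name, s in [("naive", naive), ("aware", aware), ("object", as_object)]:
    print(name, s.dtype, has_tz(s))
```

Only the tz-aware Series passes the guard; the naive and `object` Series are left untouched, which matches the "skip localization" behavior described in the comment.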








[GitHub] [spark] AmplabJenkins commented on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-910044527


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142899/
   




[GitHub] [spark] SparkQA removed a comment on pull request #33876: [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33876:
URL: https://github.com/apache/spark/pull/33876#issuecomment-909137449


   **[Test build #142884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142884/testReport)** for PR 33876 at commit [`49840e6`](https://github.com/apache/spark/commit/49840e65bbd3ed0172aa61233de03cd23e7c357b).

