You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "HyukjinKwon (via GitHub)" <gi...@apache.org> on 2023/08/08 11:17:17 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request, #42392: [SPARK-44717][PYTHON][PS] Respect TimestampNTZ in resampling

HyukjinKwon opened a new pull request, #42392:
URL: https://github.com/apache/spark/pull/42392

   ### What changes were proposed in this pull request?
   
   This PR proposes to respect `TimestampNTZ` type in resampling at pandas API on Spark.
   
   ### Why are the changes needed?
   
   It still operates as if the timestamps are `TIMESTAMP_LTZ` even when `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`, which is unexpected.
   
   ### Does this PR introduce _any_ user-facing change?
   
   This fixes a bug so end users can use exactly same behaviour with pandas with `TimestampNTZType` - pandas does not respect the local timezone with DST. While we might need to follow this even for `TimestampType`, this PR does not address the case as it might be controversial.
   
   ### How was this patch tested?
   
   Unittest was added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #42392: [SPARK-44717][PYTHON][PS] Respect TimestampNTZ in resampling

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #42392:
URL: https://github.com/apache/spark/pull/42392#issuecomment-1669421854

   cc @zhengruifeng and @attilapiros FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #42392: [SPARK-44717][PYTHON][PS] Respect TimestampNTZ in resampling

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #42392:
URL: https://github.com/apache/spark/pull/42392#issuecomment-1670546962

   Merged to master and branch-3.5.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42392: [SPARK-44717][PYTHON][PS] Respect TimestampNTZ in resampling

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #42392:
URL: https://github.com/apache/spark/pull/42392#discussion_r1287820225


##########
python/pyspark/pandas/tests/connect/test_parity_resample.py:
##########
@@ -16,17 +16,25 @@
 #
 import unittest
 
-from pyspark.pandas.tests.test_resample import ResampleTestsMixin
+from pyspark.pandas.tests.test_resample import ResampleTestsMixin, ResampleWithTimezoneMixin
 from pyspark.testing.connectutils import ReusedConnectTestCase
 from pyspark.testing.pandasutils import PandasOnSparkTestUtils, TestUtils
 
 
-class ResampleTestsParityMixin(
+class ResampleParityTests(
     ResampleTestsMixin, PandasOnSparkTestUtils, TestUtils, ReusedConnectTestCase
 ):
     pass
 
 
+class ResampleWithTimezoneTests(
+    ResampleWithTimezoneMixin, PandasOnSparkTestUtils, TestUtils, ReusedConnectTestCase
+):
+    @unittest.skip("SPARK-44731: Support 'spark.sql.timestampType' in Python Spark Connect client")

Review Comment:
   cc @ueshin FYI



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #42392: [SPARK-44717][PYTHON][PS] Respect TimestampNTZ in resampling

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #42392: [SPARK-44717][PYTHON][PS] Respect TimestampNTZ in resampling
URL: https://github.com/apache/spark/pull/42392


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #42392: [SPARK-44717][PYTHON][PS] Respect TimestampNTZ in resampling

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #42392:
URL: https://github.com/apache/spark/pull/42392#issuecomment-1669430468

   cc @gengliangwang too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a diff in pull request #42392: [SPARK-44717][PYTHON][PS] Respect TimestampNTZ in resampling

Posted by "attilapiros (via GitHub)" <gi...@apache.org>.
attilapiros commented on code in PR #42392:
URL: https://github.com/apache/spark/pull/42392#discussion_r1287542152


##########
python/pyspark/pandas/tests/test_resample.py:
##########
@@ -252,14 +254,32 @@ def test_dataframe_resample(self):
         self._test_resample(self.pdf5, self.psdf5, ["55MIN", "2H", "D"], "left", "left", "std")
         self._test_resample(self.pdf6, self.psdf6, ["29S", "10MIN", "3H"], "left", "right", "var")
 
-    def test_series_resample(self):
+    def check_series_resample(self):
         self._test_resample(self.pdf1.A, self.psdf1.A, ["4Y"], "right", None, "min")
         self._test_resample(self.pdf2.A, self.psdf2.A, ["13M"], "right", "left", "max")
         self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", "sum")
         self._test_resample(self.pdf4.A, self.psdf4.A, ["6D"], None, None, "mean")
         self._test_resample(self.pdf5.A, self.psdf5.A, ["47T"], "left", "left", "var")
         self._test_resample(self.pdf6.A, self.psdf6.A, ["111S"], "right", "right", "std")
 
+    def test_series_resample(self):
+        self.check_series_resample()

Review Comment:
   We are still depending on the TZ environment setting indirectly. So just by running the test on a different TZ this test (and also the test_dataframe_resample) would simply fail. 
   
   We should either guarantee the correct TZ in the beginning of the test or validate the assumption and produce a meaningful error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42392: [SPARK-44717][PYTHON][PS] Respect TimestampNTZ in resampling

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #42392:
URL: https://github.com/apache/spark/pull/42392#discussion_r1299469771


##########
python/pyspark/pandas/tests/test_resample.py:
##########
@@ -252,14 +254,32 @@ def test_dataframe_resample(self):
         self._test_resample(self.pdf5, self.psdf5, ["55MIN", "2H", "D"], "left", "left", "std")
         self._test_resample(self.pdf6, self.psdf6, ["29S", "10MIN", "3H"], "left", "right", "var")
 
-    def test_series_resample(self):
+    def check_series_resample(self):
         self._test_resample(self.pdf1.A, self.psdf1.A, ["4Y"], "right", None, "min")
         self._test_resample(self.pdf2.A, self.psdf2.A, ["13M"], "right", "left", "max")
         self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", "sum")
         self._test_resample(self.pdf4.A, self.psdf4.A, ["6D"], None, None, "mean")
         self._test_resample(self.pdf5.A, self.psdf5.A, ["47T"], "left", "left", "var")
         self._test_resample(self.pdf6.A, self.psdf6.A, ["111S"], "right", "right", "std")
 
+    def test_series_resample(self):
+        self.check_series_resample()

Review Comment:
   Yeah ... this PR enables `spark.sql.timestampType=TIMESTAMP_NTZ` as a workaround for now ..
   To fix this, we need a bigger scope of change .. and can be arguable in a way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a diff in pull request #42392: [SPARK-44717][PYTHON][PS] Respect TimestampNTZ in resampling

Posted by "attilapiros (via GitHub)" <gi...@apache.org>.
attilapiros commented on code in PR #42392:
URL: https://github.com/apache/spark/pull/42392#discussion_r1296614626


##########
python/pyspark/pandas/tests/test_resample.py:
##########
@@ -252,14 +254,32 @@ def test_dataframe_resample(self):
         self._test_resample(self.pdf5, self.psdf5, ["55MIN", "2H", "D"], "left", "left", "std")
         self._test_resample(self.pdf6, self.psdf6, ["29S", "10MIN", "3H"], "left", "right", "var")
 
-    def test_series_resample(self):
+    def check_series_resample(self):
         self._test_resample(self.pdf1.A, self.psdf1.A, ["4Y"], "right", None, "min")
         self._test_resample(self.pdf2.A, self.psdf2.A, ["13M"], "right", "left", "max")
         self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", "sum")
         self._test_resample(self.pdf4.A, self.psdf4.A, ["6D"], None, None, "mean")
         self._test_resample(self.pdf5.A, self.psdf5.A, ["47T"], "left", "left", "var")
         self._test_resample(self.pdf6.A, self.psdf6.A, ["111S"], "right", "right", "std")
 
+    def test_series_resample(self):
+        self.check_series_resample()

Review Comment:
   @HyukjinKwon this is still an issue. The old tests still are failing when the TZ is not UTC i.e in America/New_York.
   
   See https://github.com/attilapiros/spark/actions/runs/5884585704/job/15960073184?pr=5
   
   Anyway for production we have the "spark.sql.timestampType=TIMESTAMP_NTZ" settings.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org