Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/06 22:39:00 UTC

[GitHub] [spark] xinrong-meng opened a new pull request, #38946: [SPARK-41414][CONNECT][PYTHON] Implement date/timestamp functions

xinrong-meng opened a new pull request, #38946:
URL: https://github.com/apache/spark/pull/38946

   ### What changes were proposed in this pull request?
   Implement date/timestamp functions on Spark Connect.
   
   ### Why are the changes needed?
   For API coverage on Spark Connect.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. New function APIs are supported.
   
   ### How was this patch tested?
   Unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-meng commented on a diff in pull request #38946: [SPARK-41414][CONNECT][PYTHON] Implement date/timestamp functions

Posted by GitBox <gi...@apache.org>.
xinrong-meng commented on code in PR #38946:
URL: https://github.com/apache/spark/pull/38946#discussion_r1043939193


##########
python/pyspark/sql/tests/connect/test_connect_function.py:
##########
@@ -645,6 +645,153 @@ def test_string_functions(self):
             sdf.select(SF.encode("c", "UTF-8")).toPandas(),
         )
 
+    # TODO(SPARK-41283): To compare toPandas for test cases with dtypes marked
+    def test_date_ts_functions(self):
+        from pyspark.sql import functions as SF
+        from pyspark.sql.connect import functions as CF
+
+        query = """
+            SELECT * FROM VALUES
+            ('1997/02/28 10:30:00', '2023/03/01 06:00:00', 'JST', 1428476400, 2020, 12, 6),
+            ('2000/01/01 04:30:05', '2020/05/01 12:15:00', 'PST', 1403892395, 2022, 12, 6)
+            AS tab(ts1, ts2, tz, seconds, Y, M, D)
+            """
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |                ts1|                ts2| tz|   seconds|   Y|  M|  D|
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |1997/02/28 10:30:00|2023/03/01 06:00:00|JST|1428476400|2020| 12|  6|
+        # |2000/01/01 04:30:05|2020/05/01 12:15:00|PST|1403892395|2022| 12|  6|
+        # +-------------------+-------------------+---+----------+----+---+---+
+
+        cdf = self.connect.sql(query)
+        sdf = self.spark.sql(query)
+
+        # With no parameters
+        for cfunc, sfunc in [
+            (CF.current_date, SF.current_date),
+        ]:
+            self.assert_eq(
+                cdf.select(cfunc()).toPandas(),
+                sdf.select(sfunc()).toPandas(),
+            )
+
+        # current_timestamp
+        # [left]:  datetime64[ns, America/Los_Angeles]
+        # [right]: datetime64[ns]
+        plan = cdf.select(CF.current_timestamp())._plan.to_proto(self.connect)
+        self.assertEqual(
+            plan.root.project.expressions.unresolved_function.function_name, "current_timestamp"

Review Comment:
   The result cannot be compared with PySpark's, since the two calls cannot be made at exactly the same instant.



##########
python/pyspark/sql/tests/connect/test_connect_function.py:
##########
@@ -645,6 +645,153 @@ def test_string_functions(self):
             sdf.select(SF.encode("c", "UTF-8")).toPandas(),
         )
 
+    # TODO(SPARK-41283): To compare toPandas for test cases with dtypes marked
+    def test_date_ts_functions(self):
+        from pyspark.sql import functions as SF
+        from pyspark.sql.connect import functions as CF
+
+        query = """
+            SELECT * FROM VALUES
+            ('1997/02/28 10:30:00', '2023/03/01 06:00:00', 'JST', 1428476400, 2020, 12, 6),
+            ('2000/01/01 04:30:05', '2020/05/01 12:15:00', 'PST', 1403892395, 2022, 12, 6)
+            AS tab(ts1, ts2, tz, seconds, Y, M, D)
+            """
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |                ts1|                ts2| tz|   seconds|   Y|  M|  D|
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |1997/02/28 10:30:00|2023/03/01 06:00:00|JST|1428476400|2020| 12|  6|
+        # |2000/01/01 04:30:05|2020/05/01 12:15:00|PST|1403892395|2022| 12|  6|
+        # +-------------------+-------------------+---+----------+----+---+---+
+
+        cdf = self.connect.sql(query)
+        sdf = self.spark.sql(query)
+
+        # With no parameters
+        for cfunc, sfunc in [
+            (CF.current_date, SF.current_date),
+        ]:
+            self.assert_eq(
+                cdf.select(cfunc()).toPandas(),
+                sdf.select(sfunc()).toPandas(),
+            )
+
+        # current_timestamp
+        # [left]:  datetime64[ns, America/Los_Angeles]
+        # [right]: datetime64[ns]
+        plan = cdf.select(CF.current_timestamp())._plan.to_proto(self.connect)
+        self.assertEqual(
+            plan.root.project.expressions.unresolved_function.function_name, "current_timestamp"
+        )
+
+        # localtimestamp
+        plan = cdf.select(CF.localtimestamp())._plan.to_proto(self.connect)
+        self.assertEqual(
+            plan.root.project.expressions.unresolved_function.function_name, "localtimestamp"

Review Comment:
   The result cannot be compared with PySpark's, since the two calls cannot be made at exactly the same instant.





[GitHub] [spark] grundprinzip commented on a diff in pull request #38946: [SPARK-41414][CONNECT][PYTHON] Implement date/timestamp functions

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38946:
URL: https://github.com/apache/spark/pull/38946#discussion_r1044356233


##########
python/pyspark/sql/tests/connect/test_connect_function.py:
##########
@@ -645,6 +645,153 @@ def test_string_functions(self):
             sdf.select(SF.encode("c", "UTF-8")).toPandas(),
         )
 
+    # TODO(SPARK-41283): To compare toPandas for test cases with dtypes marked
+    def test_date_ts_functions(self):
+        from pyspark.sql import functions as SF
+        from pyspark.sql.connect import functions as CF
+
+        query = """
+            SELECT * FROM VALUES
+            ('1997/02/28 10:30:00', '2023/03/01 06:00:00', 'JST', 1428476400, 2020, 12, 6),
+            ('2000/01/01 04:30:05', '2020/05/01 12:15:00', 'PST', 1403892395, 2022, 12, 6)
+            AS tab(ts1, ts2, tz, seconds, Y, M, D)
+            """
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |                ts1|                ts2| tz|   seconds|   Y|  M|  D|
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |1997/02/28 10:30:00|2023/03/01 06:00:00|JST|1428476400|2020| 12|  6|
+        # |2000/01/01 04:30:05|2020/05/01 12:15:00|PST|1403892395|2022| 12|  6|
+        # +-------------------+-------------------+---+----------+----+---+---+
+
+        cdf = self.connect.sql(query)
+        sdf = self.spark.sql(query)
+
+        # With no parameters
+        for cfunc, sfunc in [
+            (CF.current_date, SF.current_date),
+        ]:
+            self.assert_eq(
+                cdf.select(cfunc()).toPandas(),
+                sdf.select(sfunc()).toPandas(),
+            )
+
+        # current_timestamp
+        # [left]:  datetime64[ns, America/Los_Angeles]
+        # [right]: datetime64[ns]
+        plan = cdf.select(CF.current_timestamp())._plan.to_proto(self.connect)
+        self.assertEqual(
+            plan.root.project.expressions.unresolved_function.function_name, "current_timestamp"

Review Comment:
   You could call Spark first, then Connect, then Spark again, and check that the Connect value lies between the two Spark values and that the second Spark value is not earlier than the first.
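
The ordering check suggested above can be sketched without a Spark session. This is a minimal illustration, not the code from the PR: `assert_sandwiched` and the two clock callables are hypothetical names, and in the real test the reference calls would be the PySpark `current_timestamp` query and the candidate call the Spark Connect one.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


def assert_sandwiched(
    reference_now: Callable[[], datetime],
    candidate_now: Callable[[], datetime],
) -> datetime:
    """Check that candidate_now() falls between two reference_now() calls.

    Hypothetical helper: reference_now stands in for the PySpark call,
    candidate_now for the Spark Connect call.
    """
    before = reference_now()  # first reference call
    middle = candidate_now()  # call under test
    after = reference_now()   # second reference call
    assert before <= after, "reference clock went backwards"
    assert before <= middle <= after, "candidate is outside the reference window"
    return middle


# Deterministic demo with fake clocks (the values are made up for illustration).
base = datetime(2022, 12, 6, 22, 39, tzinfo=timezone.utc)
reference_ticks = iter([base, base + timedelta(seconds=2)])
checked = assert_sandwiched(
    lambda: next(reference_ticks),
    lambda: base + timedelta(seconds=1),
)
```

The check relies on wall-clock time, so a system clock adjustment between the three calls could still make it fail.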



##########
python/pyspark/sql/tests/connect/test_connect_function.py:
##########
@@ -645,6 +645,153 @@ def test_string_functions(self):
             sdf.select(SF.encode("c", "UTF-8")).toPandas(),
         )
 
+    # TODO(SPARK-41283): To compare toPandas for test cases with dtypes marked
+    def test_date_ts_functions(self):
+        from pyspark.sql import functions as SF
+        from pyspark.sql.connect import functions as CF
+
+        query = """
+            SELECT * FROM VALUES
+            ('1997/02/28 10:30:00', '2023/03/01 06:00:00', 'JST', 1428476400, 2020, 12, 6),
+            ('2000/01/01 04:30:05', '2020/05/01 12:15:00', 'PST', 1403892395, 2022, 12, 6)
+            AS tab(ts1, ts2, tz, seconds, Y, M, D)
+            """
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |                ts1|                ts2| tz|   seconds|   Y|  M|  D|
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |1997/02/28 10:30:00|2023/03/01 06:00:00|JST|1428476400|2020| 12|  6|
+        # |2000/01/01 04:30:05|2020/05/01 12:15:00|PST|1403892395|2022| 12|  6|
+        # +-------------------+-------------------+---+----------+----+---+---+
+
+        cdf = self.connect.sql(query)
+        sdf = self.spark.sql(query)
+
+        # With no parameters
+        for cfunc, sfunc in [
+            (CF.current_date, SF.current_date),
+        ]:
+            self.assert_eq(
+                cdf.select(cfunc()).toPandas(),
+                sdf.select(sfunc()).toPandas(),
+            )
+
+        # current_timestamp
+        # [left]:  datetime64[ns, America/Los_Angeles]
+        # [right]: datetime64[ns]
+        plan = cdf.select(CF.current_timestamp())._plan.to_proto(self.connect)
+        self.assertEqual(
+            plan.root.project.expressions.unresolved_function.function_name, "current_timestamp"
+        )
+
+        # localtimestamp
+        plan = cdf.select(CF.localtimestamp())._plan.to_proto(self.connect)
+        self.assertEqual(
+            plan.root.project.expressions.unresolved_function.function_name, "localtimestamp"

Review Comment:
   See above.





[GitHub] [spark] xinrong-meng commented on a diff in pull request #38946: [SPARK-41414][CONNECT][PYTHON] Implement date/timestamp functions

Posted by GitBox <gi...@apache.org>.
xinrong-meng commented on code in PR #38946:
URL: https://github.com/apache/spark/pull/38946#discussion_r1044694589


##########
python/pyspark/sql/tests/connect/test_connect_function.py:
##########
@@ -645,6 +645,153 @@ def test_string_functions(self):
             sdf.select(SF.encode("c", "UTF-8")).toPandas(),
         )
 
+    # TODO(SPARK-41283): To compare toPandas for test cases with dtypes marked
+    def test_date_ts_functions(self):
+        from pyspark.sql import functions as SF
+        from pyspark.sql.connect import functions as CF
+
+        query = """
+            SELECT * FROM VALUES
+            ('1997/02/28 10:30:00', '2023/03/01 06:00:00', 'JST', 1428476400, 2020, 12, 6),
+            ('2000/01/01 04:30:05', '2020/05/01 12:15:00', 'PST', 1403892395, 2022, 12, 6)
+            AS tab(ts1, ts2, tz, seconds, Y, M, D)
+            """
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |                ts1|                ts2| tz|   seconds|   Y|  M|  D|
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |1997/02/28 10:30:00|2023/03/01 06:00:00|JST|1428476400|2020| 12|  6|
+        # |2000/01/01 04:30:05|2020/05/01 12:15:00|PST|1403892395|2022| 12|  6|
+        # +-------------------+-------------------+---+----------+----+---+---+
+
+        cdf = self.connect.sql(query)
+        sdf = self.spark.sql(query)
+
+        # With no parameters
+        for cfunc, sfunc in [
+            (CF.current_date, SF.current_date),
+        ]:
+            self.assert_eq(
+                cdf.select(cfunc()).toPandas(),
+                sdf.select(sfunc()).toPandas(),
+            )
+
+        # current_timestamp
+        # [left]:  datetime64[ns, America/Los_Angeles]
+        # [right]: datetime64[ns]
+        plan = cdf.select(CF.current_timestamp())._plan.to_proto(self.connect)
+        self.assertEqual(
+            plan.root.project.expressions.unresolved_function.function_name, "current_timestamp"

Review Comment:
   Thank you both! The tests are improved in https://github.com/apache/spark/pull/38946/commits/0493a01a4e39795a522cdc58f8a69b0ed261c832.





[GitHub] [spark] AmplabJenkins commented on pull request #38946: [SPARK-41414][CONNECT][PYTHON] Implement date/timestamp functions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38946:
URL: https://github.com/apache/spark/pull/38946#issuecomment-1343716380

   Can one of the admins verify this patch?




[GitHub] [spark] zhengruifeng closed pull request #38946: [SPARK-41414][CONNECT][PYTHON] Implement date/timestamp functions

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #38946: [SPARK-41414][CONNECT][PYTHON] Implement date/timestamp functions
URL: https://github.com/apache/spark/pull/38946




[GitHub] [spark] xinrong-meng commented on a diff in pull request #38946: [SPARK-41414][CONNECT][PYTHON] Implement date/timestamp functions

Posted by GitBox <gi...@apache.org>.
xinrong-meng commented on code in PR #38946:
URL: https://github.com/apache/spark/pull/38946#discussion_r1043934853


##########
python/pyspark/sql/tests/connect/test_connect_function.py:
##########
@@ -412,6 +412,125 @@ def test_aggregation_functions(self):
             sdf.groupBy("a").agg(SF.percentile_approx(sdf.b, [0.1, 0.9])).toPandas(),
         )
 
+    def test_date_ts_functions(self):

Review Comment:
   Sounds good, thanks!





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38946: [SPARK-41414][CONNECT][PYTHON] Implement date/timestamp functions

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38946:
URL: https://github.com/apache/spark/pull/38946#discussion_r1044376204


##########
python/pyspark/sql/tests/connect/test_connect_function.py:
##########
@@ -645,6 +645,153 @@ def test_string_functions(self):
             sdf.select(SF.encode("c", "UTF-8")).toPandas(),
         )
 
+    # TODO(SPARK-41283): To compare toPandas for test cases with dtypes marked
+    def test_date_ts_functions(self):
+        from pyspark.sql import functions as SF
+        from pyspark.sql.connect import functions as CF
+
+        query = """
+            SELECT * FROM VALUES
+            ('1997/02/28 10:30:00', '2023/03/01 06:00:00', 'JST', 1428476400, 2020, 12, 6),
+            ('2000/01/01 04:30:05', '2020/05/01 12:15:00', 'PST', 1403892395, 2022, 12, 6)
+            AS tab(ts1, ts2, tz, seconds, Y, M, D)
+            """
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |                ts1|                ts2| tz|   seconds|   Y|  M|  D|
+        # +-------------------+-------------------+---+----------+----+---+---+
+        # |1997/02/28 10:30:00|2023/03/01 06:00:00|JST|1428476400|2020| 12|  6|
+        # |2000/01/01 04:30:05|2020/05/01 12:15:00|PST|1403892395|2022| 12|  6|
+        # +-------------------+-------------------+---+----------+----+---+---+
+
+        cdf = self.connect.sql(query)
+        sdf = self.spark.sql(query)
+
+        # With no parameters
+        for cfunc, sfunc in [
+            (CF.current_date, SF.current_date),
+        ]:
+            self.assert_eq(
+                cdf.select(cfunc()).toPandas(),
+                sdf.select(sfunc()).toPandas(),
+            )
+
+        # current_timestamp
+        # [left]:  datetime64[ns, America/Los_Angeles]
+        # [right]: datetime64[ns]
+        plan = cdf.select(CF.current_timestamp())._plan.to_proto(self.connect)
+        self.assertEqual(
+            plan.root.project.expressions.unresolved_function.function_name, "current_timestamp"

Review Comment:
   I think you can simply test it like this:
   
   https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/connect/test_connect_function.py#L144-L148





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38946: [SPARK-41414][CONNECT][PYTHON] Implement date/timestamp functions

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38946:
URL: https://github.com/apache/spark/pull/38946#discussion_r1043091903


##########
python/pyspark/sql/tests/connect/test_connect_function.py:
##########
@@ -412,6 +412,125 @@ def test_aggregation_functions(self):
             sdf.groupBy("a").agg(SF.percentile_approx(sdf.b, [0.1, 0.9])).toPandas(),
         )
 
+    def test_date_ts_functions(self):

Review Comment:
   If `toPandas` doesn't work correctly with timestamps, you can use the newly added comparison method [compare_by_show](https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/connect/test_connect_function.py#L66).
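
For context on why a direct comparison can fail: the diff in this thread notes that one side comes back as `datetime64[ns, America/Los_Angeles]` and the other as `datetime64[ns]`. The sketch below reproduces that kind of aware-versus-naive mismatch with the standard library only. It is not `compare_by_show` and not how the PR resolves the issue; `to_naive_local` and the fixed UTC-8 offset are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset standing in for America/Los_Angeles in winter (assumption).
LOCAL_TZ = timezone(timedelta(hours=-8))


def to_naive_local(ts: datetime, tz: timezone = LOCAL_TZ) -> datetime:
    """Return ts as a naive wall-clock time in tz (hypothetical helper)."""
    if ts.tzinfo is None:
        return ts  # already naive; assumed to be wall-clock time in tz
    return ts.astimezone(tz).replace(tzinfo=None)


aware = datetime(2022, 12, 6, 22, 39, tzinfo=timezone.utc)  # tz-aware value
naive = datetime(2022, 12, 6, 14, 39)                       # same instant, naive

# An aware and a naive datetime never compare equal, even for the same instant.
assert aware != naive
# After normalizing both sides to naive local time, they match.
assert to_naive_local(aware) == to_naive_local(naive)
```

Comparing the rendered output of both DataFrames, as the comment above suggests, avoids this normalization step altogether.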





[GitHub] [spark] zhengruifeng commented on pull request #38946: [SPARK-41414][CONNECT][PYTHON] Implement date/timestamp functions

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38946:
URL: https://github.com/apache/spark/pull/38946#issuecomment-1344961425

   Merged into master.

