Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/03/15 04:01:42 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

zhengruifeng opened a new pull request, #40432:
URL: https://github.com/apache/spark/pull/40432

   ### What changes were proposed in this pull request?
   Implement the ML functions `array_to_vector` and `vector_to_array` in the Spark Connect Python client.
   
   
   ### Why are the changes needed?
   Function parity with `pyspark.ml.functions`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it adds new functions.
   
   ### How was this patch tested?
   Added unit tests and checked manually:
   
   ```
   (spark_dev) ➜  spark git:(connect_ml_functions) ✗ bin/pyspark --remote "local[*]"    
   Python 3.9.16 (main, Mar  8 2023, 04:29:24) 
   Type 'copyright', 'credits' or 'license' for more information
   IPython 8.11.0 -- An enhanced Interactive Python. Type '?' for help.
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   23/03/15 11:56:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
         /_/
   
   Using Python version 3.9.16 (main, Mar  8 2023 04:29:24)
   Client connected to the Spark Connect server at localhost
   SparkSession available as 'spark'.
   
   In [1]: query = """
      ...:             SELECT * FROM VALUES
      ...:             (1, 4, ARRAY(1.0, 2.0, 3.0)),
      ...:             (1, 2, ARRAY(-1.0, -2.0, -3.0))
      ...:             AS tab(a, b, c)
      ...:             """
   
   In [2]: cdf = spark.sql(query)
   
   In [3]: from pyspark.sql.connect.ml import functions as CF
   
   In [4]: cdf1 = cdf.select("a", CF.array_to_vector(cdf.c).alias("d"))
   
   In [5]: cdf1.show()
   +---+----------------+
   |  a|               d|
   +---+----------------+
   |  1|   [1.0,2.0,3.0]|
   |  1|[-1.0,-2.0,-3.0]|
   +---+----------------+
   
   
   In [6]: cdf1.schema
   Out[6]: StructType([StructField('a', IntegerType(), False), StructField('d', VectorUDT(), True)])
   
   In [7]: cdf1.select(CF.vector_to_array(cdf1.d))
   Out[7]: DataFrame[UDF(d): array<double>]
   
   In [8]: cdf1.select(CF.vector_to_array(cdf1.d)).show()
   +------------------+
   |            UDF(d)|
   +------------------+
   |   [1.0, 2.0, 3.0]|
   |[-1.0, -2.0, -3.0]|
   +------------------+
   
   
   In [9]: cdf1.select(CF.vector_to_array(cdf1.d)).schema
   Out[9]: StructType([StructField('UDF(d)', ArrayType(DoubleType(), False), False)])
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136934621


##########
python/pyspark/sql/connect/ml/functions.py:
##########
@@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+from pyspark.sql.connect.utils import check_dependencies
+
+check_dependencies(__name__)
+
+from pyspark.ml import functions as PyMLFunctions
+
+from pyspark.sql.connect.column import Column
+from pyspark.sql.connect.functions import _invoke_function, _to_col, lit
+
+
+def vector_to_array(col: Column, dtype: str = "float64") -> Column:
+    return _invoke_function("vector_to_array", _to_col(col), lit(dtype))
+
+
+vector_to_array.__doc__ = PyMLFunctions.vector_to_array.__doc__
+
+
+def array_to_vector(col: Column) -> Column:
+    return _invoke_function("array_to_vector", _to_col(col))
+
+
+array_to_vector.__doc__ = PyMLFunctions.array_to_vector.__doc__

Review Comment:
   Actually, why don't you remove `python/pyspark/sql/tests/connect/ml/test_connect_ml_function.py` and run the same doctests here instead? That would mean less code with the same API coverage.
   
   You could just add something like:
   
   ```python
   def _test() -> None:
       import doctest
       import sys
       from pyspark.sql import SparkSession as PySparkSession
       import pyspark.sql.connect.ml.functions
   
       globs = pyspark.sql.connect.ml.functions.__dict__.copy()
       # Create a Spark Connect session and expose it to the doctests as `spark`.
       globs["spark"] = (
           PySparkSession.builder.appName("sql.connect.functions tests")
           .remote("local[4]")
           .getOrCreate()
       )
   
       (failure_count, test_count) = doctest.testmod(
           pyspark.sql.connect.ml.functions,
           globs=globs,
           optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE,
       )
       # Stop the session that was handed to the doctests.
       globs["spark"].stop()
       if failure_count:
           sys.exit(-1)
   
   
   if __name__ == "__main__":
       _test()
   ```
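
The pattern suggested above (copy a docstring onto a thin wrapper, then run its doctests with extra names injected into the globals) can be sketched without Spark. This is only a minimal illustration: `reference_double`, `double`, `scale` and `run_doctests` are hypothetical names invented for the example, and it uses `doctest.DocTestParser` and `doctest.DocTestRunner` from the standard library instead of `doctest.testmod`, so that it runs without a module on disk.

```python
import doctest


def reference_double(x):
    """Double a number.

    >>> double(scale)
    42
    """
    return 2 * x


def double(x):
    # Stand-in for a thin wrapper that reuses another function's documentation.
    return x + x


# Same trick as in the diff: the wrapper borrows the reference docstring.
double.__doc__ = reference_double.__doc__


def run_doctests(func, extra_globs):
    """Run the doctests in func.__doc__ and return (failed, attempted)."""
    # The docstring may refer to names that only exist because we inject them.
    globs = {func.__name__: func}
    globs.update(extra_globs)
    test = doctest.DocTestParser().get_doctest(func.__doc__, globs, func.__name__, None, 0)
    runner = doctest.DocTestRunner(
        verbose=False,
        optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE,
    )
    results = runner.run(test)
    return results.failed, results.attempted
```

Here `scale` plays the role of the injected `spark` session: calling `run_doctests(double, {"scale": 21})` runs the borrowed example against the wrapper, and the example would fail with a `NameError` if the name were not injected.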





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136936972


##########
python/pyspark/sql/tests/connect/ml/test_connect_ml_function.py:
##########
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import unittest
+
+from pyspark.sql import SparkSession as PySparkSession
+
+from pyspark.testing.sqlutils import SQLTestUtils
+from pyspark.testing.connectutils import (
+    should_test_connect,
+    ReusedConnectTestCase,
+)
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+
+if should_test_connect:
+    from pyspark.ml import functions as SF
+    from pyspark.sql.connect.ml import functions as CF
+    from pyspark.sql.dataframe import DataFrame as SDF
+    from pyspark.sql.connect.dataframe import DataFrame as CDF
+
+
+class SparkConnectMLFunctionTests(ReusedConnectTestCase, PandasOnSparkTestUtils, SQLTestUtils):
+    """These test cases exercise the interface to the proto plan
+    generation but do not call Spark."""
+
+    @classmethod
+    def setUpClass(cls):
+        super(SparkConnectMLFunctionTests, cls).setUpClass()
+        # Disable the shared namespace so pyspark.sql.functions, etc. point to the regular
+        # PySpark libraries.
+        os.environ["PYSPARK_NO_NAMESPACE_SHARE"] = "1"
+        cls.connect = cls.spark  # Switch Spark Connect session and regular PySpark session.
+        cls.spark = PySparkSession._instantiatedSession
+        assert cls.spark is not None
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.spark = cls.connect  # Stopping Spark Connect closes the session in JVM at the server.
+        super(SparkConnectMLFunctionTests, cls).tearDownClass()
+        del os.environ["PYSPARK_NO_NAMESPACE_SHARE"]
+
+    def compare_by_show(self, df1, df2, n: int = 20, truncate: int = 20):

Review Comment:
   👌 





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136941163


##########
python/pyspark/sql/connect/ml/functions.py:
##########
@@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+from pyspark.sql.connect.utils import check_dependencies
+
+check_dependencies(__name__)
+
+from pyspark.ml import functions as PyMLFunctions
+
+from pyspark.sql.connect.column import Column
+from pyspark.sql.connect.functions import _invoke_function, _to_col, lit
+
+
+def vector_to_array(col: Column, dtype: str = "float64") -> Column:
+    return _invoke_function("vector_to_array", _to_col(col), lit(dtype))
+
+
+vector_to_array.__doc__ = PyMLFunctions.vector_to_array.__doc__
+
+
+def array_to_vector(col: Column) -> Column:
+    return _invoke_function("array_to_vector", _to_col(col))
+
+
+array_to_vector.__doc__ = PyMLFunctions.array_to_vector.__doc__

Review Comment:
   Will give it a try.
   
   But I guess I will still need to add a few dedicated tests at first, due to https://github.com/apache/spark/pull/40432#discussion_r1136931940





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136929247


##########
python/pyspark/sql/tests/connect/ml/test_connect_ml_function.py:
##########
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import unittest
+
+from pyspark.sql import SparkSession as PySparkSession
+
+from pyspark.testing.sqlutils import SQLTestUtils
+from pyspark.testing.connectutils import (
+    should_test_connect,
+    ReusedConnectTestCase,
+)
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+
+if should_test_connect:
+    from pyspark.ml import functions as SF
+    from pyspark.sql.connect.ml import functions as CF
+    from pyspark.sql.dataframe import DataFrame as SDF
+    from pyspark.sql.connect.dataframe import DataFrame as CDF
+
+
+class SparkConnectMLFunctionTests(ReusedConnectTestCase, PandasOnSparkTestUtils, SQLTestUtils):
+    """These test cases exercise the interface to the proto plan
+    generation but do not call Spark."""
+
+    @classmethod
+    def setUpClass(cls):
+        super(SparkConnectMLFunctionTests, cls).setUpClass()
+        # Disable the shared namespace so pyspark.sql.functions, etc. point to the regular
+        # PySpark libraries.
+        os.environ["PYSPARK_NO_NAMESPACE_SHARE"] = "1"
+        cls.connect = cls.spark  # Switch Spark Connect session and regular PySpark session.
+        cls.spark = PySparkSession._instantiatedSession
+        assert cls.spark is not None
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.spark = cls.connect  # Stopping Spark Connect closes the session in JVM at the server.
+        super(SparkConnectMLFunctionTests, cls).tearDownClass()
+        del os.environ["PYSPARK_NO_NAMESPACE_SHARE"]
+
+    def compare_by_show(self, df1, df2, n: int = 20, truncate: int = 20):

Review Comment:
   Why should we do this though?





[GitHub] [spark] zhengruifeng commented on pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40432:
URL: https://github.com/apache/spark/pull/40432#issuecomment-1469295848

   cc @WeichenXu123 @HyukjinKwon 




[GitHub] [spark] zhengruifeng commented on pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40432:
URL: https://github.com/apache/spark/pull/40432#issuecomment-1471423001

   all tests passed, merged into master




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136935246


##########
python/pyspark/sql/connect/ml/functions.py:
##########
@@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+from pyspark.sql.connect.utils import check_dependencies
+
+check_dependencies(__name__)
+
+from pyspark.ml import functions as PyMLFunctions
+
+from pyspark.sql.connect.column import Column
+from pyspark.sql.connect.functions import _invoke_function, _to_col, lit
+
+
+def vector_to_array(col: Column, dtype: str = "float64") -> Column:
+    return _invoke_function("vector_to_array", _to_col(col), lit(dtype))
+
+
+vector_to_array.__doc__ = PyMLFunctions.vector_to_array.__doc__
+
+
+def array_to_vector(col: Column) -> Column:
+    return _invoke_function("array_to_vector", _to_col(col))
+
+
+array_to_vector.__doc__ = PyMLFunctions.array_to_vector.__doc__

Review Comment:
   Also add this file name here: https://github.com/apache/spark/blob/master/dev/sparktestsupport/modules.py#L530 so that the doctests actually run in CI.
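
The point about registration can be illustrated with a small, Spark-free sketch: a runner only executes doctests for the module names it has been told about, so an unregistered module is silently skipped even if its doctests are broken. This does not reproduce the contents of `modules.py`; `DOCTEST_GOALS`, `_make_module`, `run_registered_doctests` and the `demo_*` modules are hypothetical stand-ins built in memory for the example.

```python
import doctest
import importlib
import sys
import types

# Hypothetical registry, standing in for a list of Python test goals.
DOCTEST_GOALS = ["demo_covered"]


def _make_module(name, doc):
    # Build an in-memory module whose docstring carries the doctest.
    module = types.ModuleType(name, doc)
    sys.modules[name] = module
    return module


_make_module("demo_covered", ">>> 1 + 1\n2\n")
# This doctest is wrong, but nobody notices because the module is not registered.
_make_module("demo_forgotten", ">>> 1 + 1\n3\n")


def run_registered_doctests(goals):
    """Run doctests for each registered module name; return {name: (failed, attempted)}."""
    summary = {}
    for name in goals:
        module = importlib.import_module(name)
        result = doctest.testmod(module)
        summary[name] = (result.failed, result.attempted)
    return summary
```

Running `run_registered_doctests(DOCTEST_GOALS)` only touches `demo_covered`; the broken example in `demo_forgotten` is reported only once its name is added to the list.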





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1137906670


##########
python/pyspark/ml/tests/connect/test_connect_function.py:
##########
@@ -0,0 +1,113 @@
+#

Review Comment:
   I think you can remove this test - I believe the doctests in `pyspark.ml.connect.functions` cover the same cases.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136937191


##########
python/pyspark/ml/functions.py:
##########
@@ -119,6 +122,9 @@ def array_to_vector(col: Column) -> Column:
 

Review Comment:
   got it





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136929510


##########
python/pyspark/sql/tests/connect/ml/test_connect_ml_function.py:
##########
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import unittest
+
+from pyspark.sql import SparkSession as PySparkSession
+
+from pyspark.testing.sqlutils import SQLTestUtils
+from pyspark.testing.connectutils import (
+    should_test_connect,
+    ReusedConnectTestCase,
+)
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+
+if should_test_connect:
+    from pyspark.ml import functions as SF
+    from pyspark.sql.connect.ml import functions as CF
+    from pyspark.sql.dataframe import DataFrame as SDF
+    from pyspark.sql.connect.dataframe import DataFrame as CDF
+
+
+class SparkConnectMLFunctionTests(ReusedConnectTestCase, PandasOnSparkTestUtils, SQLTestUtils):
+    """These test cases exercise the interface to the proto plan
+    generation but do not call Spark."""
+
+    @classmethod
+    def setUpClass(cls):
+        super(SparkConnectMLFunctionTests, cls).setUpClass()
+        # Disable the shared namespace so pyspark.sql.functions, etc. point to the regular
+        # PySpark libraries.
+        os.environ["PYSPARK_NO_NAMESPACE_SHARE"] = "1"
+        cls.connect = cls.spark  # Switch Spark Connect session and regular PySpark session.
+        cls.spark = PySparkSession._instantiatedSession
+        assert cls.spark is not None
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.spark = cls.connect  # Stopping Spark Connect closes the session in JVM at the server.
+        super(SparkConnectMLFunctionTests, cls).tearDownClass()
+        del os.environ["PYSPARK_NO_NAMESPACE_SHARE"]
+
+    def compare_by_show(self, df1, df2, n: int = 20, truncate: int = 20):
+        assert isinstance(df1, (SDF, CDF))
+        if isinstance(df1, SDF):
+            str1 = df1._jdf.showString(n, truncate, False)
+        else:
+            str1 = df1._show_string(n, truncate, False)
+
+        assert isinstance(df2, (SDF, CDF))
+        if isinstance(df2, SDF):
+            str2 = df2._jdf.showString(n, truncate, False)
+        else:
+            str2 = df2._show_string(n, truncate, False)
+
+        self.assertEqual(str1, str2)
+
+    def test_array_vector_conversion(self):
+        query = """
+            SELECT * FROM VALUES
+            (1, 4, ARRAY(1.0, 2.0, 3.0)),
+            (1, 2, ARRAY(-1.0, -2.0, -3.0))
+            AS tab(a, b, c)
+            """
+
+        cdf = self.connect.sql(query)
+        sdf = self.spark.sql(query)
+
+        self.compare_by_show(
+            cdf.select(cdf.b, CF.array_to_vector(cdf.c)),
+            sdf.select(sdf.b, SF.array_to_vector(sdf.c)),
+        )
+
+        cdf1 = cdf.select("a", CF.array_to_vector(cdf.c).alias("d"))
+        sdf1 = sdf.select("a", SF.array_to_vector(sdf.c).alias("d"))
+
+        self.compare_by_show(
+            cdf1.select(CF.vector_to_array(cdf1.d)),
+            sdf1.select(SF.vector_to_array(sdf1.d)),
+        )
+        self.compare_by_show(
+            cdf1.select(CF.vector_to_array(cdf1.d, "float32")),
+            sdf1.select(SF.vector_to_array(sdf1.d, "float32")),
+        )
+        self.compare_by_show(
+            cdf1.select(CF.vector_to_array(cdf1.d, "float64")),
+            sdf1.select(SF.vector_to_array(sdf1.d, "float64")),
+        )
+
+
+if __name__ == "__main__":
+    import os
+    from pyspark.sql.tests.connect.ml.test_connect_ml_function import *  # noqa: F401
+
+    # TODO(SPARK-41547): Enable ANSI mode in this file.
+    os.environ["SPARK_ANSI_SQL_MODE"] = "false"

Review Comment:
   sure, I just copied this ...
   
   let me remove it





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136931940


##########
python/pyspark/sql/tests/connect/ml/test_connect_ml_function.py:
##########
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import unittest
+
+from pyspark.sql import SparkSession as PySparkSession
+
+from pyspark.testing.sqlutils import SQLTestUtils
+from pyspark.testing.connectutils import (
+    should_test_connect,
+    ReusedConnectTestCase,
+)
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+
+if should_test_connect:
+    from pyspark.ml import functions as SF
+    from pyspark.sql.connect.ml import functions as CF
+    from pyspark.sql.dataframe import DataFrame as SDF
+    from pyspark.sql.connect.dataframe import DataFrame as CDF
+
+
+class SparkConnectMLFunctionTests(ReusedConnectTestCase, PandasOnSparkTestUtils, SQLTestUtils):
+    """These test cases exercise the interface to the proto plan
+    generation but do not call Spark."""
+
+    @classmethod
+    def setUpClass(cls):
+        super(SparkConnectMLFunctionTests, cls).setUpClass()
+        # Disable the shared namespace so pyspark.sql.functions, etc. point to the regular
+        # PySpark libraries.
+        os.environ["PYSPARK_NO_NAMESPACE_SHARE"] = "1"
+        cls.connect = cls.spark  # Switch Spark Connect session and regular PySpark session.
+        cls.spark = PySparkSession._instantiatedSession
+        assert cls.spark is not None
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.spark = cls.connect  # Stopping Spark Connect closes the session in JVM at the server.
+        super(SparkConnectMLFunctionTests, cls).tearDownClass()
+        del os.environ["PYSPARK_NO_NAMESPACE_SHARE"]
+
+    def compare_by_show(self, df1, df2, n: int = 20, truncate: int = 20):

Review Comment:
   Currently, UDT is only supported in `DF.schema`.
   
   It's not supported in `DF.collect` / `Spark.createDataFrame` for now, so the test compares the `show` output strings instead.
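
Comparing two results by their rendered text, rather than by collected values, can be sketched in plain Python. This is only an illustration of the idea behind `compare_by_show`, not Spark's implementation: `render_table` and `assert_same_rendering` are hypothetical helpers, and the right-aligned layout with a minimum column width of 3 is an assumption modeled on the `show()` output in the PR description.

```python
def render_table(columns, rows, min_width=3):
    """Render rows as a right-aligned ASCII table, similar in spirit to `show()`."""
    cells = [[str(value) for value in row] for row in rows]
    # Each column is as wide as its longest cell or header, but at least min_width.
    widths = [
        max([min_width, len(name)] + [len(row[i]) for row in cells])
        for i, name in enumerate(columns)
    ]
    border = "+" + "+".join("-" * w for w in widths) + "+"

    def line(values):
        return "|" + "|".join(v.rjust(w) for v, w in zip(values, widths)) + "|"

    return "\n".join([border, line(columns), border] + [line(r) for r in cells] + [border])


def assert_same_rendering(table1, table2):
    # Compare only the rendered text, so no typed values need to be materialized.
    text1, text2 = render_table(*table1), render_table(*table2)
    assert text1 == text2, f"renderings differ:\n{text1}\n{text2}"
```

The trade-off is that a string comparison only checks what the rendering exposes; differences hidden by truncation or formatting would go unnoticed.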





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1138015354


##########
python/pyspark/ml/connect/functions.py:
##########
@@ -0,0 +1,76 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+from pyspark.sql.connect.utils import check_dependencies
+
+check_dependencies(__name__)
+
+from pyspark.ml import functions as PyMLFunctions
+
+from pyspark.sql.connect.column import Column
+from pyspark.sql.connect.functions import _invoke_function, _to_col, lit
+
+
+def vector_to_array(col: Column, dtype: str = "float64") -> Column:
+    return _invoke_function("vector_to_array", _to_col(col), lit(dtype))
+
+
+vector_to_array.__doc__ = PyMLFunctions.vector_to_array.__doc__
+
+
+def array_to_vector(col: Column) -> Column:
+    return _invoke_function("array_to_vector", _to_col(col))
+
+
+array_to_vector.__doc__ = PyMLFunctions.array_to_vector.__doc__
+
+
+def _test() -> None:
+    import sys
+    import doctest
+    from pyspark.sql import SparkSession as PySparkSession
+    import pyspark.ml.connect.functions
+
+    globs = pyspark.ml.connect.functions.__dict__.copy()
+
+    # TODO: split vector_to_array doctest since it includes .mllib vectors

Review Comment:
   @HyukjinKwon I guess `test_connect_function.py` is still needed, since we cannot enable the doctests for now.
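
One common way to handle doctests that cannot run yet is to exclude specific docstrings from the run while keeping the rest active. The following is a minimal sketch using only the standard `doctest` module. It is not the `_test()` code from this PR, and the function names (`double`, `needs_unavailable_backend`, `run_doctests`) are hypothetical.

```python
import doctest


def double(x):
    """
    >>> double(2)
    4
    """
    return 2 * x


def needs_unavailable_backend(x):
    """
    >>> needs_unavailable_backend(1)  # would fail in this environment
    'ok'
    """
    raise RuntimeError("backend not available")


def run_doctests(funcs, skip=()):
    """Run the docstring examples of `funcs`, skipping any name listed in `skip`."""
    # recurse=False: only look at the docstring of the object itself.
    finder = doctest.DocTestFinder(recurse=False)
    runner = doctest.DocTestRunner(verbose=False)
    for f in funcs:
        if f.__name__ in skip:
            continue
        # module=False tells the finder not to look up the defining module;
        # the globals each example needs are passed explicitly instead.
        for test in finder.find(f, name=f.__name__, module=False, globs={f.__name__: f}):
            runner.run(test)
    # Returns a result object with `failed` and `attempted` counts.
    return runner.summarize(verbose=False)
```

Calling `run_doctests([double, needs_unavailable_backend], skip={"needs_unavailable_backend"})` runs only the example in `double`; the skipped function still needs coverage elsewhere, which is the role a separate unit test file plays.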





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1138014700


##########
dev/sparktestsupport/modules.py:
##########
@@ -655,6 +655,7 @@ def __hash__(self):
         "pyspark.ml.tests.test_wrapper",
         "pyspark.ml.torch.tests.test_distributor",
         "pyspark.ml.torch.tests.test_log_communication",
+        "pyspark.ml.tests.connect.test_connect_function",

Review Comment:
   Good catch!
   
   I am wondering whether we should put the tests into `pyspark-connect`.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1138017400


##########
dev/sparktestsupport/modules.py:
##########
@@ -781,6 +740,57 @@ def __hash__(self):
     ],
 )
 
+
+pyspark_connect = Module(
+    name="pyspark-connect",
+    dependencies=[pyspark_sql, pyspark_ml, connect],
+    source_file_regexes=[
+        "python/pyspark/sql/connect",
+        "python/pyspark/ml/connect",

Review Comment:
   @HyukjinKwon I made an additional change: `pyspark-connect` now depends on `pyspark_ml`, and the ML tests are moved here, since I think it may not be a good idea to make `pyspark_ml` depend on `connect`.
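
To make the effect of the dependency direction concrete, here is a minimal, self-contained sketch of how a change to one module can select dependent modules for testing. The `Module` class and `modules_to_test` function below are hypothetical simplifications, not the code in `dev/sparktestsupport/modules.py`. Only the `pyspark-connect` dependency list is taken from the diff above; the other dependency lists are simplified assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Module:
    # Hypothetical, simplified stand-in for a test module definition.
    name: str
    dependencies: List[str] = field(default_factory=list)


def modules_to_test(changed: str, modules: Dict[str, Module]) -> Set[str]:
    """Return the changed module plus every module that depends on it, transitively."""
    result = {changed}
    grew = True
    while grew:
        grew = False
        for m in modules.values():
            if m.name not in result and any(d in result for d in m.dependencies):
                result.add(m.name)
                grew = True
    return result


modules = {
    m.name: m
    for m in [
        Module("pyspark-sql"),
        Module("connect"),
        # Assumption for this sketch: pyspark-ml depends only on pyspark-sql.
        Module("pyspark-ml", ["pyspark-sql"]),
        # Dependency list as proposed in the diff above.
        Module("pyspark-connect", ["pyspark-sql", "pyspark-ml", "connect"]),
    ]
}
```

Under these assumptions, `modules_to_test("connect", modules)` selects `connect` and `pyspark-connect` but not `pyspark-ml`, so a change on the Connect side does not trigger the whole ML test suite.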





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136930751


##########
python/pyspark/ml/functions.py:
##########
@@ -119,6 +122,9 @@ def array_to_vector(col: Column) -> Column:
 

Review Comment:
   You should probably decorate this with `@try_remote_functions`, like `functions.py` does.
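
For readers unfamiliar with this pattern, here is a minimal sketch of how a dispatching decorator of this kind can work. It is not the actual `try_remote_functions` implementation: the `is_remote` check, the `SKETCH_REMOTE_MODE` environment variable, the `REMOTE_IMPLS` registry, and the stand-in function bodies are all hypothetical.

```python
import functools
import os
from typing import Any, Callable, Dict

# Hypothetical registry of remote (Spark Connect) implementations, keyed by name.
REMOTE_IMPLS: Dict[str, Callable[..., Any]] = {}


def is_remote() -> bool:
    # Assumption for this sketch: an environment variable selects remote mode.
    return os.environ.get("SKETCH_REMOTE_MODE") == "1"


def try_remote(f: Callable[..., Any]) -> Callable[..., Any]:
    """Route calls to the remote implementation of the same name when in remote mode."""

    @functools.wraps(f)  # keep the original name and docstring on the wrapper
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if is_remote() and f.__name__ in REMOTE_IMPLS:
            return REMOTE_IMPLS[f.__name__](*args, **kwargs)
        return f(*args, **kwargs)

    return wrapped


@try_remote
def array_to_vector(col: str) -> str:
    # Stand-in for the regular (JVM-backed) code path.
    return f"classic:array_to_vector({col})"


def _remote_array_to_vector(col: str) -> str:
    # Stand-in for the Spark Connect code path.
    return f"connect:array_to_vector({col})"


REMOTE_IMPLS["array_to_vector"] = _remote_array_to_vector
```

With such a decorator, user code keeps importing the function from one place, and the session mode decides which implementation runs.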





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136928710


##########
python/pyspark/sql/tests/connect/ml/test_connect_ml_function.py:
##########
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import unittest
+
+from pyspark.sql import SparkSession as PySparkSession
+
+from pyspark.testing.sqlutils import SQLTestUtils
+from pyspark.testing.connectutils import (
+    should_test_connect,
+    ReusedConnectTestCase,
+)
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+
+if should_test_connect:
+    from pyspark.ml import functions as SF

Review Comment:
   `from pyspark.ml import functions as SF` and `from pyspark.sql.dataframe import DataFrame as SDF` can be imported at the top.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136927946


##########
python/pyspark/sql/tests/connect/ml/test_connect_ml_function.py:
##########
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import unittest
+
+from pyspark.sql import SparkSession as PySparkSession
+
+from pyspark.testing.sqlutils import SQLTestUtils
+from pyspark.testing.connectutils import (
+    should_test_connect,
+    ReusedConnectTestCase,
+)
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+
+if should_test_connect:
+    from pyspark.ml import functions as SF
+    from pyspark.sql.connect.ml import functions as CF
+    from pyspark.sql.dataframe import DataFrame as SDF
+    from pyspark.sql.connect.dataframe import DataFrame as CDF
+
+
+class SparkConnectMLFunctionTests(ReusedConnectTestCase, PandasOnSparkTestUtils, SQLTestUtils):
+    """These test cases exercise the interface to the proto plan
+    generation but do not call Spark."""
+
+    @classmethod
+    def setUpClass(cls):
+        super(SparkConnectMLFunctionTests, cls).setUpClass()
+        # Disable the shared namespace so pyspark.sql.functions, etc. point to the
+        # regular PySpark libraries.
+        os.environ["PYSPARK_NO_NAMESPACE_SHARE"] = "1"
+        cls.connect = cls.spark  # Switch Spark Connect session and regular PySpark session.
+        cls.spark = PySparkSession._instantiatedSession
+        assert cls.spark is not None
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.spark = cls.connect  # Stopping Spark Connect closes the session in JVM at the server.
+        super(SparkConnectMLFunctionTests, cls).tearDownClass()
+        del os.environ["PYSPARK_NO_NAMESPACE_SHARE"]
+
+    def compare_by_show(self, df1, df2, n: int = 20, truncate: int = 20):
+        assert isinstance(df1, (SDF, CDF))
+        if isinstance(df1, SDF):
+            str1 = df1._jdf.showString(n, truncate, False)
+        else:
+            str1 = df1._show_string(n, truncate, False)
+
+        assert isinstance(df2, (SDF, CDF))
+        if isinstance(df2, SDF):
+            str2 = df2._jdf.showString(n, truncate, False)
+        else:
+            str2 = df2._show_string(n, truncate, False)
+
+        self.assertEqual(str1, str2)
+
+    def test_array_vector_conversion(self):
+        query = """
+            SELECT * FROM VALUES
+            (1, 4, ARRAY(1.0, 2.0, 3.0)),
+            (1, 2, ARRAY(-1.0, -2.0, -3.0))
+            AS tab(a, b, c)
+            """
+
+        cdf = self.connect.sql(query)
+        sdf = self.spark.sql(query)
+
+        self.compare_by_show(
+            cdf.select(cdf.b, CF.array_to_vector(cdf.c)),
+            sdf.select(sdf.b, SF.array_to_vector(sdf.c)),
+        )
+
+        cdf1 = cdf.select("a", CF.array_to_vector(cdf.c).alias("d"))
+        sdf1 = sdf.select("a", SF.array_to_vector(sdf.c).alias("d"))
+
+        self.compare_by_show(
+            cdf1.select(CF.vector_to_array(cdf1.d)),
+            sdf1.select(SF.vector_to_array(sdf1.d)),
+        )
+        self.compare_by_show(
+            cdf1.select(CF.vector_to_array(cdf1.d, "float32")),
+            sdf1.select(SF.vector_to_array(sdf1.d, "float32")),
+        )
+        self.compare_by_show(
+            cdf1.select(CF.vector_to_array(cdf1.d, "float64")),
+            sdf1.select(SF.vector_to_array(sdf1.d, "float64")),
+        )
+
+
+if __name__ == "__main__":
+    import os
+    from pyspark.sql.tests.connect.ml.test_connect_ml_function import *  # noqa: F401
+
+    # TODO(SPARK-41547): Enable ANSI mode in this file.
+    os.environ["SPARK_ANSI_SQL_MODE"] = "false"

Review Comment:
   I think you can remove this. From a cursory look, I don't think any ANSI tests are broken.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136932191


##########
python/pyspark/sql/tests/connect/ml/test_connect_ml_function.py:
##########
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import unittest
+
+from pyspark.sql import SparkSession as PySparkSession
+
+from pyspark.testing.sqlutils import SQLTestUtils
+from pyspark.testing.connectutils import (
+    should_test_connect,
+    ReusedConnectTestCase,
+)
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+
+if should_test_connect:
+    from pyspark.ml import functions as SF
+    from pyspark.sql.connect.ml import functions as CF
+    from pyspark.sql.dataframe import DataFrame as SDF
+    from pyspark.sql.connect.dataframe import DataFrame as CDF
+
+
+class SparkConnectMLFunctionTests(ReusedConnectTestCase, PandasOnSparkTestUtils, SQLTestUtils):
+    """These test cases exercise the interface to the proto plan
+    generation but do not call Spark."""
+
+    @classmethod
+    def setUpClass(cls):
+        super(SparkConnectMLFunctionTests, cls).setUpClass()
+        # Disable the shared namespace so pyspark.sql.functions, etc. point to the
+        # regular PySpark libraries.
+        os.environ["PYSPARK_NO_NAMESPACE_SHARE"] = "1"
+        cls.connect = cls.spark  # Switch Spark Connect session and regular PySpark session.
+        cls.spark = PySparkSession._instantiatedSession
+        assert cls.spark is not None
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.spark = cls.connect  # Stopping Spark Connect closes the session in JVM at the server.
+        super(SparkConnectMLFunctionTests, cls).tearDownClass()
+        del os.environ["PYSPARK_NO_NAMESPACE_SHARE"]
+
+    def compare_by_show(self, df1, df2, n: int = 20, truncate: int = 20):

Review Comment:
   So we cannot compare the results for now.





[GitHub] [spark] WeichenXu123 commented on pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on PR #40432:
URL: https://github.com/apache/spark/pull/40432#issuecomment-1471290106

   Is it ready to merge?




[GitHub] [spark] zhengruifeng commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1136932841


##########
python/pyspark/sql/tests/connect/ml/test_connect_ml_function.py:
##########
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import unittest
+
+from pyspark.sql import SparkSession as PySparkSession
+
+from pyspark.testing.sqlutils import SQLTestUtils
+from pyspark.testing.connectutils import (
+    should_test_connect,
+    ReusedConnectTestCase,
+)
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+
+if should_test_connect:
+    from pyspark.ml import functions as SF

Review Comment:
   nice





[GitHub] [spark] zhengruifeng commented on pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40432:
URL: https://github.com/apache/spark/pull/40432#issuecomment-1471307277

   @WeichenXu123 Not ready yet.
   
   `sql - slow` failed with a message related to `mllib-common`:
   ```
   [error] /home/runner/work/spark/spark/mllib/common/src/test/scala/org/apache/spark/ml/attribute/AttributeGroupSuite.scala:35:11: exception during macro expansion: 
   [error] java.util.MissingResourceException: Can't find bundle for base name org.scalactic.ScalacticBundle, locale en
   [error] 	at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1581)
   [error] 	at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1396)
   [error] 	at java.util.ResourceBundle.getBundle(ResourceBundle.java:782)
   [error] 	at org.scalactic.Resources$.resourceBundle$lzycompute(Resources.scala:8)
   [error] 	at org.scalactic.Resources$.resourceBundle(Resources.scala:8)
   [error] 	at org.scalactic.Resources$.pleaseDefineScalacticFillFilePathnameEnvVar(Resources.scala:256)
   [error] 	at org.scalactic.source.PositionMacro$PositionMacroImpl.apply(PositionMacro.scala:65)
   [error] 	at org.scalactic.source.PositionMacro$.genPosition(PositionMacro.scala:85)
   [error] Caused by: java.io.IOException: Stream closed
   [error] 	at java.util.zip.InflaterInputStream.ensureOpen(InflaterInputStream.java:67)
   ```
   
   
   The current test seems fine, but let's wait for the CI.




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40432:
URL: https://github.com/apache/spark/pull/40432#discussion_r1137906453


##########
dev/sparktestsupport/modules.py:
##########
@@ -655,6 +655,7 @@ def __hash__(self):
         "pyspark.ml.tests.test_wrapper",
         "pyspark.ml.torch.tests.test_distributor",
         "pyspark.ml.torch.tests.test_log_communication",
+        "pyspark.ml.tests.connect.test_connect_function",

Review Comment:
   We should also add `pyspark.ml.connect.functions`.





[GitHub] [spark] zhengruifeng commented on pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40432:
URL: https://github.com/apache/spark/pull/40432#issuecomment-1471248695

   `sql - slow` failed. I am not sure whether it is related, so let me investigate it first.




[GitHub] [spark] zhengruifeng closed pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #40432: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`
URL: https://github.com/apache/spark/pull/40432

