Posted to reviews@spark.apache.org by "WeichenXu123 (via GitHub)" <gi...@apache.org> on 2023/05/15 23:59:17 UTC

[GitHub] [spark] WeichenXu123 opened a new pull request, #41176: [WIP] [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

WeichenXu123 opened a new pull request, #41176:
URL: https://github.com/apache/spark/pull/41176

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor code and change classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
If benchmark tests were added, please run the benchmarks in GitHub Actions for a consistent environment, following the instructions at: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   




[GitHub] [spark] WeichenXu123 closed pull request #41176: [SPARK-43516][ML][PYTHON][CONNECT] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 closed pull request #41176: [SPARK-43516][ML][PYTHON][CONNECT] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator
URL: https://github.com/apache/spark/pull/41176




[GitHub] [spark] WeichenXu123 commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1198664831


##########
python/pyspark/mlv2/feature.py:
##########
@@ -0,0 +1,127 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import numpy as np
+import pandas as pd
+
+from pyspark.sql.functions import col, pandas_udf
+
+from pyspark.mlv2.base import Estimator, Model, Transformer
+from pyspark.mlv2.util import transform_dataframe_column
+from pyspark.mlv2.summarizer import summarize_dataframe
+from pyspark.ml.param.shared import HasInputCol, HasOutputCol
+from pyspark.ml.functions import vector_to_array
+
+
+class MaxAbsScaler(Estimator, HasInputCol, HasOutputCol):
+    """
+    Rescale each feature individually to range [-1, 1] by dividing through the largest maximum
+    absolute value in each feature. It does not shift/center the data, and thus does not destroy
+    any sparsity.
+    """
+
+    def __init__(self, inputCol, outputCol):
+        super().__init__()
+        self.set(self.inputCol, inputCol)
+        self.set(self.outputCol, outputCol)
+
+    def _fit(self, dataset):
+        input_col = self.getInputCol()
+
+        min_max_res = summarize_dataframe(dataset, input_col, ["min", "max"])
+        min_values = min_max_res["min"]
+        max_values = min_max_res["max"]
+
+        max_abs_values = np.maximum(np.abs(min_values), np.abs(max_values))
+
+        model = MaxAbsScalerModel(max_abs_values)
+        model._resetUid(self.uid)
+        return self._copyValues(model)
+
+
+class MaxAbsScalerModel(Transformer, HasInputCol, HasOutputCol):
+
+    def __init__(self, max_abs_values):
+        super().__init__()
+        self.max_abs_values = max_abs_values
+
+    def _input_column_name(self):
+        return self.getInputCol()
+
+    def _output_columns(self):
+        return [(self.getOutputCol(), "array<double>")]
+
+    def _get_transform_fn(self):
+        max_abs_values = self.max_abs_values
+        max_abs_values_zero_cond = (max_abs_values == 0.0)
+
+        def transform_fn(series):
+            def map_value(x):
+                return np.where(max_abs_values_zero_cond, 0.0, x / max_abs_values)
+
+            return series.apply(map_value)
+
+        return transform_fn
+
+
+class StandardScaler(Estimator, HasInputCol, HasOutputCol):
+    """
+    Standardizes features by removing the mean and scaling to unit variance using column summary
+    statistics on the samples in the training set.
+    """
+
+    def __init__(self, inputCol, outputCol):
+        super().__init__()
+        self.set(self.inputCol, inputCol)
+        self.set(self.outputCol, outputCol)
+
+    def _fit(self, dataset):
+        input_col = self.getInputCol()
+
+        min_max_res = summarize_dataframe(dataset, input_col, ["mean", "std"])
+        mean_values = min_max_res["mean"]
+        std_values = min_max_res["std"]
+
+        model = StandardScalerModel(mean_values, std_values)
+        model._resetUid(self.uid)
+        return self._copyValues(model)
+
+
+class StandardScalerModel(Transformer, HasInputCol, HasOutputCol):

Review Comment:
   Discussed with @zhengruifeng offline:
   
   Because the goal is to support fitting both pandas DataFrames and Spark DataFrames (and we hope the two cases share most of the fit implementation code), and we need to keep the array operations efficient, the current approach (implementing a summarizer via a Spark pandas UDF) should be the best one.
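   
   For reference, a minimal sketch of that aggregation pattern (the actual `aggregate_dataframe` helper in this PR may differ; the `mapInPandas`-based layout and the `make_state` callback below are assumptions for illustration):
   
   ```python
   import pickle
   
   import pandas as pd
   from pyspark.sql.types import BinaryType, StructField, StructType
   
   
   def aggregate_dataframe_sketch(df, input_col, make_state):
       # Fold each partition's rows into one partial aggregation state,
       # then emit that state as a single pickled binary row.
       def agg_partition(iterator):
           state = None
           for pdf in iterator:
               for value in pdf[input_col]:
                   if state is None:
                       state = make_state(value)
                   else:
                       state.update(value)
           if state is not None:
               yield pd.DataFrame({"state": [pickle.dumps(state)]})
   
       schema = StructType([StructField("state", BinaryType())])
       rows = df.select(input_col).mapInPandas(agg_partition, schema).collect()
   
       # Merge the per-partition partial states on the driver.
       states = [pickle.loads(row.state) for row in rows]
       merged = states[0]
       for other in states[1:]:
           merged = merged.merge(other)
       return merged
   ```
   
   With `SummarizerAggState` as the `make_state` callback, summarization then reduces to `aggregate_dataframe_sketch(df, col, SummarizerAggState).to_result(metrics)`, and a pandas-DataFrame input can reuse the same update/merge path locally, which is how the two cases share the fit code.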





[GitHub] [spark] zhengruifeng commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1200355639


##########
python/pyspark/mlv2/summarizer.py:
##########
@@ -0,0 +1,112 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import numpy as np
+from pyspark.mlv2.util import aggregate_dataframe
+
+
+class SummarizerAggState:
+    def __init__(self, input_array):
+        self.min_values = input_array.copy()
+        self.max_values = input_array.copy()
+        self.count = 1
+        self.sum_values = np.array(input_array.copy())
+        self.square_sum_values = np.square(input_array.copy())
+
+    def update(self, input_array):
+        self.count += 1
+        self.sum_values += input_array
+        self.square_sum_values += np.square(input_array)
+        self.min_values = np.minimum(self.min_values, input_array)
+        self.max_values = np.maximum(self.max_values, input_array)
+
+    def merge(self, state):
+        self.count += state.count
+        self.sum_values += state.sum_values
+        self.square_sum_values += state.square_sum_values
+        self.min_values = np.minimum(self.min_values, state.min_values)
+        self.max_values = np.maximum(self.max_values, state.max_values)
+        return self
+
+    def to_result(self, metrics):
+        result = {}
+
+        for metric in metrics:
+            if metric == "min":
+                result["min"] = self.min_values.copy()
+            if metric == "max":
+                result["max"] = self.max_values.copy()
+            if metric == "sum":
+                result["sum"] = self.sum_values.copy()
+            if metric == "mean":
+                result["mean"] = self.sum_values / self.count
+            if metric == "std":
+                if self.count <= 1:
+                    raise ValueError(
+                        "Standard deviation evaluation requires more than one row data."
+                    )
+                result["std"] = np.sqrt(

Review Comment:
   nit: why not reuse `torchmetrics` here?





[GitHub] [spark] zhengruifeng commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1198449915


##########
python/pyspark/mlv2/base.py:
##########
@@ -0,0 +1,258 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from abc import ABCMeta, abstractmethod
+
+import copy
+import threading
+import pandas as pd
+
+from typing import (
+    Any,
+    Callable,
+    Generic,
+    Iterator,
+    List,
+    Optional,
+    Sequence,
+    Tuple,
+    TypeVar,
+    Union,
+    cast,
+    overload,
+    TYPE_CHECKING,
+)
+
+from pyspark import since
+from pyspark.ml.param import P
+from pyspark.ml.common import inherit_doc
+from pyspark.ml.param.shared import (
+    HasInputCols,
+    HasOutputCols,
+    HasLabelCol,
+    HasFeaturesCol,
+    HasPredictionCol,
+)
+from pyspark.sql.dataframe import DataFrame
+from pyspark.sql.functions import udf
+from pyspark.sql.types import DataType, StructField, StructType
+from pyspark.ml.param import Param, Params, TypeConverters
+from pyspark.sql.functions import col, pandas_udf, struct
+import pickle
+
+from pyspark.mlv2.util import transform_dataframe_column
+
+if TYPE_CHECKING:
+    from pyspark.ml._typing import ParamMap
+
+T = TypeVar("T")
+M = TypeVar("M", bound="Transformer")
+
+
+@inherit_doc
+class Estimator(Params, Generic[M], metaclass=ABCMeta):
+    """
+    Abstract class for estimators that fit models to data.
+
+    .. versionadded:: 1.3.0

Review Comment:
   versions should be 3.5.0 in this PR?



##########
python/pyspark/mlv2/feature.py:
##########
@@ -0,0 +1,127 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import numpy as np
+import pandas as pd
+
+from pyspark.sql.functions import col, pandas_udf
+
+from pyspark.mlv2.base import Estimator, Model, Transformer
+from pyspark.mlv2.util import transform_dataframe_column
+from pyspark.mlv2.summarizer import summarize_dataframe
+from pyspark.ml.param.shared import HasInputCol, HasOutputCol
+from pyspark.ml.functions import vector_to_array
+
+
+class MaxAbsScaler(Estimator, HasInputCol, HasOutputCol):
+    """
+    Rescale each feature individually to range [-1, 1] by dividing through the largest maximum
+    absolute value in each feature. It does not shift/center the data, and thus does not destroy
+    any sparsity.
+    """
+
+    def __init__(self, inputCol, outputCol):
+        super().__init__()
+        self.set(self.inputCol, inputCol)
+        self.set(self.outputCol, outputCol)
+
+    def _fit(self, dataset):
+        input_col = self.getInputCol()
+
+        min_max_res = summarize_dataframe(dataset, input_col, ["min", "max"])
+        min_values = min_max_res["min"]
+        max_values = min_max_res["max"]
+
+        max_abs_values = np.maximum(np.abs(min_values), np.abs(max_values))
+
+        model = MaxAbsScalerModel(max_abs_values)
+        model._resetUid(self.uid)
+        return self._copyValues(model)
+
+
+class MaxAbsScalerModel(Transformer, HasInputCol, HasOutputCol):
+
+    def __init__(self, max_abs_values):
+        super().__init__()
+        self.max_abs_values = max_abs_values
+
+    def _input_column_name(self):
+        return self.getInputCol()
+
+    def _output_columns(self):
+        return [(self.getOutputCol(), "array<double>")]
+
+    def _get_transform_fn(self):
+        max_abs_values = self.max_abs_values
+        max_abs_values_zero_cond = (max_abs_values == 0.0)
+
+        def transform_fn(series):
+            def map_value(x):
+                return np.where(max_abs_values_zero_cond, 0.0, x / max_abs_values)
+
+            return series.apply(map_value)
+
+        return transform_fn
+
+
+class StandardScaler(Estimator, HasInputCol, HasOutputCol):
+    """
+    Standardizes features by removing the mean and scaling to unit variance using column summary
+    statistics on the samples in the training set.
+    """
+
+    def __init__(self, inputCol, outputCol):
+        super().__init__()
+        self.set(self.inputCol, inputCol)
+        self.set(self.outputCol, outputCol)
+
+    def _fit(self, dataset):
+        input_col = self.getInputCol()
+
+        min_max_res = summarize_dataframe(dataset, input_col, ["mean", "std"])
+        mean_values = min_max_res["mean"]
+        std_values = min_max_res["std"]
+
+        model = StandardScalerModel(mean_values, std_values)
+        model._resetUid(self.uid)
+        return self._copyValues(model)
+
+
+class StandardScalerModel(Transformer, HasInputCol, HasOutputCol):

Review Comment:
   For simple computation logic like `MaxAbsScalerModel`/`StandardScaler`, I guess we can implement it in two ways:
   
   1. leverage the torch util functions (as in this PR);
   2. leverage built-in SQL functions (adding a few helper functions if needed), for example
   
   ```
   In [19]: import pyspark.sql.functions as SF
   
   In [20]: import pyspark.ml.functions as MF
   
   In [21]: bdf.show()
   +-----+------+---------+
   |label|weight| features|
   +-----+------+---------+
   |  1.0|   1.0|[0.0,5.0]|
   |  0.0|   2.0|[1.0,2.0]|
   |  1.0|   3.0|[2.0,1.0]|
   |  0.0|   4.0|[3.0,3.0]|
   +-----+------+---------+
   
   
   In [22]: bdf.select(SF.posexplode(MF.vector_to_array(bdf.features)).alias("index", "value")).groupBy("index").agg(SF.avg("value"), SF.stddev_samp("value")).show()
   +-----+----------+------------------+
   |index|avg(value)|stddev_samp(value)|
   +-----+----------+------------------+
   |    1|      2.75|1.7078251276599332|
   |    0|       1.5|1.2909944487358056|
   +-----+----------+------------------+
   
   ```
   



##########
python/pyspark/mlv2/base.py:
##########
@@ -0,0 +1,258 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from abc import ABCMeta, abstractmethod
+
+import copy
+import threading
+import pandas as pd
+
+from typing import (
+    Any,
+    Callable,
+    Generic,
+    Iterator,
+    List,
+    Optional,
+    Sequence,
+    Tuple,
+    TypeVar,
+    Union,
+    cast,
+    overload,
+    TYPE_CHECKING,
+)
+
+from pyspark import since
+from pyspark.ml.param import P
+from pyspark.ml.common import inherit_doc
+from pyspark.ml.param.shared import (
+    HasInputCols,
+    HasOutputCols,
+    HasLabelCol,
+    HasFeaturesCol,
+    HasPredictionCol,
+)
+from pyspark.sql.dataframe import DataFrame
+from pyspark.sql.functions import udf
+from pyspark.sql.types import DataType, StructField, StructType
+from pyspark.ml.param import Param, Params, TypeConverters
+from pyspark.sql.functions import col, pandas_udf, struct

Review Comment:
   ```suggestion
   from pyspark.sql.types import DataType, StructField, StructType
   from pyspark.ml.param import Param, Params, TypeConverters
   from pyspark.sql.functions import col, pandas_udf, struct, udf
   ```



##########
python/pyspark/mlv2/tests/test_feature.py:
##########
@@ -0,0 +1,103 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import unittest
+import numpy as np
+
+from pyspark.ml.linalg import DenseVector, SparseVector, Vectors
+from pyspark.sql import Row
+from pyspark.testing.utils import QuietTest
+from pyspark.testing.mlutils import check_params, SparkSessionTestCase
+
+from pyspark.mlv2.feature import MaxAbsScaler, StandardScaler
+
+
+class FeatureTests(SparkSessionTestCase):

Review Comment:
   does this work on Spark Connect?





[GitHub] [spark] WeichenXu123 commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1198515956


##########
python/pyspark/mlv2/tests/test_feature.py:
##########
@@ -0,0 +1,103 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import unittest
+import numpy as np
+
+from pyspark.ml.linalg import DenseVector, SparseVector, Vectors
+from pyspark.sql import Row
+from pyspark.testing.utils import QuietTest
+from pyspark.testing.mlutils import check_params, SparkSessionTestCase
+
+from pyspark.mlv2.feature import MaxAbsScaler, StandardScaler
+
+
+class FeatureTests(SparkSessionTestCase):

Review Comment:
   I will add a Spark Connect side test.
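   
   For reference, the Connect parity tests elsewhere in the code base roughly follow this shape (a sketch; the mixin split and the class names here are illustrative, not the final test):
   
   ```python
   import unittest
   
   from pyspark.testing.connectutils import ReusedConnectTestCase
   
   
   class FeatureTestsMixin:
       # test methods shared by the local-mode and Connect test classes
       def test_max_abs_scaler(self):
           ...
   
   
   class FeatureParityTests(FeatureTestsMixin, ReusedConnectTestCase):
       """Runs the shared feature tests against a Spark Connect session."""
   
   
   if __name__ == "__main__":
       unittest.main()
   ```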





[GitHub] [spark] zhengruifeng commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1200357845


##########
python/pyspark/mlv2/evaluation.py:
##########
@@ -0,0 +1,79 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.mlv2.base import Evaluator
+
+from pyspark.ml.param import Param, Params, TypeConverters
+from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol
+from pyspark.mlv2.util import aggregate_dataframe
+
+import torch
+import torcheval.metrics as torchmetrics

Review Comment:
   is `torcheval` a new dep?





[GitHub] [spark] zhengruifeng commented on pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41176:
URL: https://github.com/apache/spark/pull/41176#issuecomment-1558518780

   I think you may need to add `torchmetrics` to https://github.com/apache/spark/blob/f55fdca10b1d9df2d8126cde07a26f89a75ae1d2/dev/infra/Dockerfile#L74




[GitHub] [spark] WeichenXu123 commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1200559825


##########
python/pyspark/mlv2/evaluation.py:
##########
@@ -0,0 +1,79 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.mlv2.base import Evaluator
+
+from pyspark.ml.param import Param, Params, TypeConverters
+from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol
+from pyspark.mlv2.util import aggregate_dataframe
+
+import torch
+import torcheval.metrics as torchmetrics

Review Comment:
   Yes. I talked with @mengxr about this and he said we can use it :)
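   
   For context, the torcheval pattern that maps onto per-partition aggregation is update → merge_state → compute (a sketch; the metric choice and inputs are illustrative):
   
   ```python
   import torch
   from torcheval.metrics import MulticlassAccuracy
   
   # Each partition keeps its own metric instance and folds its rows in.
   metric_a = MulticlassAccuracy(num_classes=3)
   metric_a.update(torch.tensor([0, 2, 1]), torch.tensor([0, 1, 1]))
   
   metric_b = MulticlassAccuracy(num_classes=3)
   metric_b.update(torch.tensor([2, 2]), torch.tensor([2, 0]))
   
   # The driver merges the partial states and computes the final value.
   merged = metric_a.merge_state([metric_b])
   print(merged.compute())  # tensor(0.6000): 3 of 5 predictions correct
   ```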





[GitHub] [spark] WeichenXu123 commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1201332234


##########
python/pyspark/mlv2/summarizer.py:
##########
@@ -0,0 +1,112 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import numpy as np
+from pyspark.mlv2.util import aggregate_dataframe
+
+
+class SummarizerAggState:
+    def __init__(self, input_array):
+        self.min_values = input_array.copy()
+        self.max_values = input_array.copy()
+        self.count = 1
+        self.sum_values = np.array(input_array.copy())
+        self.square_sum_values = np.square(input_array.copy())
+
+    def update(self, input_array):
+        self.count += 1
+        self.sum_values += input_array
+        self.square_sum_values += np.square(input_array)
+        self.min_values = np.minimum(self.min_values, input_array)
+        self.max_values = np.maximum(self.max_values, input_array)
+
+    def merge(self, state):
+        self.count += state.count
+        self.sum_values += state.sum_values
+        self.square_sum_values += state.square_sum_values
+        self.min_values = np.minimum(self.min_values, state.min_values)
+        self.max_values = np.maximum(self.max_values, state.max_values)
+        return self
+
+    def to_result(self, metrics):
+        result = {}
+
+        for metric in metrics:
+            if metric == "min":
+                result["min"] = self.min_values.copy()
+            if metric == "max":
+                result["max"] = self.max_values.copy()
+            if metric == "sum":
+                result["sum"] = self.sum_values.copy()
+            if metric == "mean":
+                result["mean"] = self.sum_values / self.count
+            if metric == "std":
+                if self.count <= 1:
+                    raise ValueError(
+                        "Standard deviation evaluation requires more than one row data."
+                    )
+                result["std"] = np.sqrt(

Review Comment:
   We need to evaluate std values over feature arrays; I think using the summarizer will be more efficient.
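   
   Concretely, the std then falls out of the accumulated moments in a single pass (the expression in the diff is truncated above, so the sample-variance form below, matching SQL's `stddev_samp`, is a reconstruction):
   
   ```python
   import numpy as np
   
   
   def std_from_moments(sum_values, square_sum_values, count):
       # sample variance from running sums: (sum(x^2) - n * mean^2) / (n - 1)
       mean = sum_values / count
       variance = (square_sum_values - count * np.square(mean)) / (count - 1)
       # clamp tiny negative values caused by floating-point cancellation
       return np.sqrt(np.maximum(variance, 0.0))
   ```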


