Posted to reviews@spark.apache.org by "WeichenXu123 (via GitHub)" <gi...@apache.org> on 2023/05/15 23:59:17 UTC

[GitHub] [spark] WeichenXu123 opened a new pull request, #41176: [WIP] [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

WeichenXu123 opened a new pull request, #41176:
URL: https://github.com/apache/spark/pull/41176

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor code and change classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
If benchmark tests were added, please run the benchmarks in GitHub Actions for a consistent environment, following the instructions at: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   




[GitHub] [spark] WeichenXu123 closed pull request #41176: [SPARK-43516][ML][PYTHON][CONNECT] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 closed pull request #41176: [SPARK-43516][ML][PYTHON][CONNECT] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator
URL: https://github.com/apache/spark/pull/41176




[GitHub] [spark] WeichenXu123 commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1198664831


##########
python/pyspark/mlv2/feature.py:
##########
@@ -0,0 +1,127 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import numpy as np
+import pandas as pd
+
+from pyspark.sql.functions import col, pandas_udf
+
+from pyspark.mlv2.base import Estimator, Model, Transformer
+from pyspark.mlv2.util import transform_dataframe_column
+from pyspark.mlv2.summarizer import summarize_dataframe
+from pyspark.ml.param.shared import HasInputCol, HasOutputCol
+from pyspark.ml.functions import vector_to_array
+
+
+class MaxAbsScaler(Estimator, HasInputCol, HasOutputCol):
+    """
+    Rescale each feature individually to range [-1, 1] by dividing through the largest maximum
+    absolute value in each feature. It does not shift/center the data, and thus does not destroy
+    any sparsity.
+    """
+
+    def __init__(self, inputCol, outputCol):
+        super().__init__()
+        self.set(self.inputCol, inputCol)
+        self.set(self.outputCol, outputCol)
+
+    def _fit(self, dataset):
+        input_col = self.getInputCol()
+
+        min_max_res = summarize_dataframe(dataset, input_col, ["min", "max"])
+        min_values = min_max_res["min"]
+        max_values = min_max_res["max"]
+
+        max_abs_values = np.maximum(np.abs(min_values), np.abs(max_values))
+
+        model = MaxAbsScalerModel(max_abs_values)
+        model._resetUid(self.uid)
+        return self._copyValues(model)
+
+
+class MaxAbsScalerModel(Transformer, HasInputCol, HasOutputCol):
+
+    def __init__(self, max_abs_values):
+        super().__init__()
+        self.max_abs_values = max_abs_values
+
+    def _input_column_name(self):
+        return self.getInputCol()
+
+    def _output_columns(self):
+        return [(self.getOutputCol(), "array<double>")]
+
+    def _get_transform_fn(self):
+        max_abs_values = self.max_abs_values
+        max_abs_values_zero_cond = (max_abs_values == 0.0)
+
+        def transform_fn(series):
+            def map_value(x):
+                return np.where(max_abs_values_zero_cond, 0.0, x / max_abs_values)
+
+            return series.apply(map_value)
+
+        return transform_fn
+
+
+class StandardScaler(Estimator, HasInputCol, HasOutputCol):
+    """
+    Standardizes features by removing the mean and scaling to unit variance using column summary
+    statistics on the samples in the training set.
+    """
+
+    def __init__(self, inputCol, outputCol):
+        super().__init__()
+        self.set(self.inputCol, inputCol)
+        self.set(self.outputCol, outputCol)
+
+    def _fit(self, dataset):
+        input_col = self.getInputCol()
+
+        min_max_res = summarize_dataframe(dataset, input_col, ["mean", "std"])
+        mean_values = min_max_res["mean"]
+        std_values = min_max_res["std"]
+
+        model = StandardScalerModel(mean_values, std_values)
+        model._resetUid(self.uid)
+        return self._copyValues(model)
+
+
+class StandardScalerModel(Transformer, HasInputCol, HasOutputCol):

Review Comment:
   Discussed with @zhengruifeng offline:
   
   Because the goal is to support fitting both pandas DataFrames and Spark DataFrames (and we hope the two cases share most of the fit implementation code), and we need to keep the array operations efficient, the current approach (implementing a summarizer via a Spark pandas UDF) should be the best one.
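   
   For reference, a minimal sketch of that aggregation pattern (the actual `aggregate_dataframe` helper in this PR may differ; the `mapInPandas`-based layout and the `make_state` callback below are assumptions for illustration):
   
   ```python
   import pickle
   
   import pandas as pd
   from pyspark.sql.types import BinaryType, StructField, StructType
   
   
   def aggregate_dataframe_sketch(df, input_col, make_state):
       # Fold each partition's rows into one partial aggregation state,
       # then emit that state as a single pickled binary row.
       def agg_partition(iterator):
           state = None
           for pdf in iterator:
               for value in pdf[input_col]:
                   if state is None:
                       state = make_state(value)
                   else:
                       state.update(value)
           if state is not None:
               yield pd.DataFrame({"state": [pickle.dumps(state)]})
   
       schema = StructType([StructField("state", BinaryType())])
       rows = df.select(input_col).mapInPandas(agg_partition, schema).collect()
   
       # Merge the per-partition partial states on the driver.
       states = [pickle.loads(row.state) for row in rows]
       merged = states[0]
       for other in states[1:]:
           merged = merged.merge(other)
       return merged
   ```
   
   With `SummarizerAggState` as the `make_state` callback, summarization then reduces to `aggregate_dataframe_sketch(df, col, SummarizerAggState).to_result(metrics)`, and a pandas-DataFrame input can reuse the same update/merge path locally, which is how the two cases share the fit code.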





[GitHub] [spark] zhengruifeng commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1200355639


##########
python/pyspark/mlv2/summarizer.py:
##########
@@ -0,0 +1,112 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import numpy as np
+from pyspark.mlv2.util import aggregate_dataframe
+
+
+class SummarizerAggState:
+    def __init__(self, input_array):
+        self.min_values = input_array.copy()
+        self.max_values = input_array.copy()
+        self.count = 1
+        self.sum_values = np.array(input_array.copy())
+        self.square_sum_values = np.square(input_array.copy())
+
+    def update(self, input_array):
+        self.count += 1
+        self.sum_values += input_array
+        self.square_sum_values += np.square(input_array)
+        self.min_values = np.minimum(self.min_values, input_array)
+        self.max_values = np.maximum(self.max_values, input_array)
+
+    def merge(self, state):
+        self.count += state.count
+        self.sum_values += state.sum_values
+        self.square_sum_values += state.square_sum_values
+        self.min_values = np.minimum(self.min_values, state.min_values)
+        self.max_values = np.maximum(self.max_values, state.max_values)
+        return self
+
+    def to_result(self, metrics):
+        result = {}
+
+        for metric in metrics:
+            if metric == "min":
+                result["min"] = self.min_values.copy()
+            if metric == "max":
+                result["max"] = self.max_values.copy()
+            if metric == "sum":
+                result["sum"] = self.sum_values.copy()
+            if metric == "mean":
+                result["mean"] = self.sum_values / self.count
+            if metric == "std":
+                if self.count <= 1:
+                    raise ValueError(
+                        "Standard deviation evaluation requires more than one row data."
+                    )
+                result["std"] = np.sqrt(

Review Comment:
   nit: why not reuse `torchmetrics` here?





[GitHub] [spark] zhengruifeng commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1198449915


##########
python/pyspark/mlv2/base.py:
##########
@@ -0,0 +1,258 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from abc import ABCMeta, abstractmethod
+
+import copy
+import threading
+import pandas as pd
+
+from typing import (
+    Any,
+    Callable,
+    Generic,
+    Iterator,
+    List,
+    Optional,
+    Sequence,
+    Tuple,
+    TypeVar,
+    Union,
+    cast,
+    overload,
+    TYPE_CHECKING,
+)
+
+from pyspark import since
+from pyspark.ml.param import P
+from pyspark.ml.common import inherit_doc
+from pyspark.ml.param.shared import (
+    HasInputCols,
+    HasOutputCols,
+    HasLabelCol,
+    HasFeaturesCol,
+    HasPredictionCol,
+)
+from pyspark.sql.dataframe import DataFrame
+from pyspark.sql.functions import udf
+from pyspark.sql.types import DataType, StructField, StructType
+from pyspark.ml.param import Param, Params, TypeConverters
+from pyspark.sql.functions import col, pandas_udf, struct
+import pickle
+
+from pyspark.mlv2.util import transform_dataframe_column
+
+if TYPE_CHECKING:
+    from pyspark.ml._typing import ParamMap
+
+T = TypeVar("T")
+M = TypeVar("M", bound="Transformer")
+
+
+@inherit_doc
+class Estimator(Params, Generic[M], metaclass=ABCMeta):
+    """
+    Abstract class for estimators that fit models to data.
+
+    .. versionadded:: 1.3.0

Review Comment:
   versions should be 3.5.0 in this PR?



##########
python/pyspark/mlv2/feature.py:
##########
@@ -0,0 +1,127 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import numpy as np
+import pandas as pd
+
+from pyspark.sql.functions import col, pandas_udf
+
+from pyspark.mlv2.base import Estimator, Model, Transformer
+from pyspark.mlv2.util import transform_dataframe_column
+from pyspark.mlv2.summarizer import summarize_dataframe
+from pyspark.ml.param.shared import HasInputCol, HasOutputCol
+from pyspark.ml.functions import vector_to_array
+
+
+class MaxAbsScaler(Estimator, HasInputCol, HasOutputCol):
+    """
+    Rescale each feature individually to range [-1, 1] by dividing through the largest maximum
+    absolute value in each feature. It does not shift/center the data, and thus does not destroy
+    any sparsity.
+    """
+
+    def __init__(self, inputCol, outputCol):
+        super().__init__()
+        self.set(self.inputCol, inputCol)
+        self.set(self.outputCol, outputCol)
+
+    def _fit(self, dataset):
+        input_col = self.getInputCol()
+
+        min_max_res = summarize_dataframe(dataset, input_col, ["min", "max"])
+        min_values = min_max_res["min"]
+        max_values = min_max_res["max"]
+
+        max_abs_values = np.maximum(np.abs(min_values), np.abs(max_values))
+
+        model = MaxAbsScalerModel(max_abs_values)
+        model._resetUid(self.uid)
+        return self._copyValues(model)
+
+
+class MaxAbsScalerModel(Transformer, HasInputCol, HasOutputCol):
+
+    def __init__(self, max_abs_values):
+        super().__init__()
+        self.max_abs_values = max_abs_values
+
+    def _input_column_name(self):
+        return self.getInputCol()
+
+    def _output_columns(self):
+        return [(self.getOutputCol(), "array<double>")]
+
+    def _get_transform_fn(self):
+        max_abs_values = self.max_abs_values
+        max_abs_values_zero_cond = (max_abs_values == 0.0)
+
+        def transform_fn(series):
+            def map_value(x):
+                return np.where(max_abs_values_zero_cond, 0.0, x / max_abs_values)
+
+            return series.apply(map_value)
+
+        return transform_fn
+
+
+class StandardScaler(Estimator, HasInputCol, HasOutputCol):
+    """
+    Standardizes features by removing the mean and scaling to unit variance using column summary
+    statistics on the samples in the training set.
+    """
+
+    def __init__(self, inputCol, outputCol):
+        super().__init__()
+        self.set(self.inputCol, inputCol)
+        self.set(self.outputCol, outputCol)
+
+    def _fit(self, dataset):
+        input_col = self.getInputCol()
+
+        min_max_res = summarize_dataframe(dataset, input_col, ["mean", "std"])
+        mean_values = min_max_res["mean"]
+        std_values = min_max_res["std"]
+
+        model = StandardScalerModel(mean_values, std_values)
+        model._resetUid(self.uid)
+        return self._copyValues(model)
+
+
+class StandardScalerModel(Transformer, HasInputCol, HasOutputCol):

Review Comment:
   For simple computation logic like `MaxAbsScalerModel`/`StandardScaler`, I guess we can implement it in two ways:
   
   1. leverage the torch util functions (as in this PR);
   2. leverage built-in SQL functions (adding a few helper functions if needed), for example
   
   ```
   In [19]: import pyspark.sql.functions as SF
   
   In [20]: import pyspark.ml.functions as MF
   
   In [21]: bdf.show()
   +-----+------+---------+
   |label|weight| features|
   +-----+------+---------+
   |  1.0|   1.0|[0.0,5.0]|
   |  0.0|   2.0|[1.0,2.0]|
   |  1.0|   3.0|[2.0,1.0]|
   |  0.0|   4.0|[3.0,3.0]|
   +-----+------+---------+
   
   
   In [22]: bdf.select(SF.posexplode(MF.vector_to_array(bdf.features)).alias("index", "value")).groupBy("index").agg(SF.avg("value"), SF.stddev_samp("value")).show()
   +-----+----------+------------------+
   |index|avg(value)|stddev_samp(value)|
   +-----+----------+------------------+
   |    1|      2.75|1.7078251276599332|
   |    0|       1.5|1.2909944487358056|
   +-----+----------+------------------+
   
   ```
   



##########
python/pyspark/mlv2/base.py:
##########
@@ -0,0 +1,258 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from abc import ABCMeta, abstractmethod
+
+import copy
+import threading
+import pandas as pd
+
+from typing import (
+    Any,
+    Callable,
+    Generic,
+    Iterator,
+    List,
+    Optional,
+    Sequence,
+    Tuple,
+    TypeVar,
+    Union,
+    cast,
+    overload,
+    TYPE_CHECKING,
+)
+
+from pyspark import since
+from pyspark.ml.param import P
+from pyspark.ml.common import inherit_doc
+from pyspark.ml.param.shared import (
+    HasInputCols,
+    HasOutputCols,
+    HasLabelCol,
+    HasFeaturesCol,
+    HasPredictionCol,
+)
+from pyspark.sql.dataframe import DataFrame
+from pyspark.sql.functions import udf
+from pyspark.sql.types import DataType, StructField, StructType
+from pyspark.ml.param import Param, Params, TypeConverters
+from pyspark.sql.functions import col, pandas_udf, struct

Review Comment:
   ```suggestion
   from pyspark.sql.types import DataType, StructField, StructType
   from pyspark.ml.param import Param, Params, TypeConverters
   from pyspark.sql.functions import col, pandas_udf, struct, udf
   ```



##########
python/pyspark/mlv2/tests/test_feature.py:
##########
@@ -0,0 +1,103 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import unittest
+import numpy as np
+
+from pyspark.ml.linalg import DenseVector, SparseVector, Vectors
+from pyspark.sql import Row
+from pyspark.testing.utils import QuietTest
+from pyspark.testing.mlutils import check_params, SparkSessionTestCase
+
+from pyspark.mlv2.feature import MaxAbsScaler, StandardScaler
+
+
+class FeatureTests(SparkSessionTestCase):

Review Comment:
   does this work on Spark Connect?





[GitHub] [spark] WeichenXu123 commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1198515956


##########
python/pyspark/mlv2/tests/test_feature.py:
##########
@@ -0,0 +1,103 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import unittest
+import numpy as np
+
+from pyspark.ml.linalg import DenseVector, SparseVector, Vectors
+from pyspark.sql import Row
+from pyspark.testing.utils import QuietTest
+from pyspark.testing.mlutils import check_params, SparkSessionTestCase
+
+from pyspark.mlv2.feature import MaxAbsScaler, StandardScaler
+
+
+class FeatureTests(SparkSessionTestCase):

Review Comment:
   I will add a Spark Connect side test.
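   
   For reference, the Connect parity tests elsewhere in the code base roughly follow this shape (a sketch; the mixin split and the class names here are illustrative, not the final test):
   
   ```python
   import unittest
   
   from pyspark.testing.connectutils import ReusedConnectTestCase
   
   
   class FeatureTestsMixin:
       # test methods shared by the local-mode and Connect test classes
       def test_max_abs_scaler(self):
           ...
   
   
   class FeatureParityTests(FeatureTestsMixin, ReusedConnectTestCase):
       """Runs the shared feature tests against a Spark Connect session."""
   
   
   if __name__ == "__main__":
       unittest.main()
   ```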





[GitHub] [spark] zhengruifeng commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1200357845


##########
python/pyspark/mlv2/evaluation.py:
##########
@@ -0,0 +1,79 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.mlv2.base import Evaluator
+
+from pyspark.ml.param import Param, Params, TypeConverters
+from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol
+from pyspark.mlv2.util import aggregate_dataframe
+
+import torch
+import torcheval.metrics as torchmetrics

Review Comment:
   is `torcheval` a new dep?





[GitHub] [spark] zhengruifeng commented on pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41176:
URL: https://github.com/apache/spark/pull/41176#issuecomment-1558518780

   I think you may need to add `torchmetrics` to https://github.com/apache/spark/blob/f55fdca10b1d9df2d8126cde07a26f89a75ae1d2/dev/infra/Dockerfile#L74




[GitHub] [spark] WeichenXu123 commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1200559825


##########
python/pyspark/mlv2/evaluation.py:
##########
@@ -0,0 +1,79 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.mlv2.base import Evaluator
+
+from pyspark.ml.param import Param, Params, TypeConverters
+from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol
+from pyspark.mlv2.util import aggregate_dataframe
+
+import torch
+import torcheval.metrics as torchmetrics

Review Comment:
   Yes. I talked with @mengxr about this and he said we can use it :)
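   
   For context, the torcheval pattern that maps onto per-partition aggregation is update → merge_state → compute (a sketch; the metric choice and inputs are illustrative):
   
   ```python
   import torch
   from torcheval.metrics import MulticlassAccuracy
   
   # Each partition keeps its own metric instance and folds its rows in.
   metric_a = MulticlassAccuracy(num_classes=3)
   metric_a.update(torch.tensor([0, 2, 1]), torch.tensor([0, 1, 1]))
   
   metric_b = MulticlassAccuracy(num_classes=3)
   metric_b.update(torch.tensor([2, 2]), torch.tensor([2, 0]))
   
   # The driver merges the partial states and computes the final value.
   merged = metric_a.merge_state([metric_b])
   print(merged.compute())  # tensor(0.6000): 3 of 5 predictions correct
   ```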





[GitHub] [spark] WeichenXu123 commented on a diff in pull request #41176: [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #41176:
URL: https://github.com/apache/spark/pull/41176#discussion_r1201332234


##########
python/pyspark/mlv2/summarizer.py:
##########
@@ -0,0 +1,112 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import numpy as np
+from pyspark.mlv2.util import aggregate_dataframe
+
+
+class SummarizerAggState:
+    def __init__(self, input_array):
+        self.min_values = input_array.copy()
+        self.max_values = input_array.copy()
+        self.count = 1
+        self.sum_values = np.array(input_array.copy())
+        self.square_sum_values = np.square(input_array.copy())
+
+    def update(self, input_array):
+        self.count += 1
+        self.sum_values += input_array
+        self.square_sum_values += np.square(input_array)
+        self.min_values = np.minimum(self.min_values, input_array)
+        self.max_values = np.maximum(self.max_values, input_array)
+
+    def merge(self, state):
+        self.count += state.count
+        self.sum_values += state.sum_values
+        self.square_sum_values += state.square_sum_values
+        self.min_values = np.minimum(self.min_values, state.min_values)
+        self.max_values = np.maximum(self.max_values, state.max_values)
+        return self
+
+    def to_result(self, metrics):
+        result = {}
+
+        for metric in metrics:
+            if metric == "min":
+                result["min"] = self.min_values.copy()
+            if metric == "max":
+                result["max"] = self.max_values.copy()
+            if metric == "sum":
+                result["sum"] = self.sum_values.copy()
+            if metric == "mean":
+                result["mean"] = self.sum_values / self.count
+            if metric == "std":
+                if self.count <= 1:
+                    raise ValueError(
+                        "Standard deviation evaluation requires more than one row data."
+                    )
+                result["std"] = np.sqrt(

Review Comment:
   We need to evaluate std values over feature arrays; I think using the summarizer will be more efficient.
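   
   Concretely, the std then falls out of the accumulated moments in a single pass (the expression in the diff is truncated above, so the sample-variance form below, matching SQL's `stddev_samp`, is a reconstruction):
   
   ```python
   import numpy as np
   
   
   def std_from_moments(sum_values, square_sum_values, count):
       # sample variance from running sums: (sum(x^2) - n * mean^2) / (n - 1)
       mean = sum_values / count
       variance = (square_sum_values - count * np.square(mean)) / (count - 1)
       # clamp tiny negative values caused by floating-point cancellation
       return np.sqrt(np.maximum(variance, 0.0))
   ```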


