You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/11/23 15:42:30 UTC

[GitHub] [spark] WeichenXu123 opened a new pull request #30471: [WIP][ML-12519] make CrossValidator/TrainValidateSplit support Python backend estimator/model

WeichenXu123 opened a new pull request #30471:
URL: https://github.com/apache/spark/pull/30471


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-734385072






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529100588



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \
+                        isinstance(Evaluator):
+                    relative_path = f'epm_{p.name}{numParamsNotJson}'
+                    param_path = os.path.join(path, relative_path)
+                    numParamsNotJson += 1
+                    v.save(param_path)
+                    jsonParam['value'] = relative_path
+                    jsonParam['isJson'] = False
+                elif isinstance(v, MLWritable):
+                    raise RuntimeError(
+                        "ValidatorSharedReadWrite.saveImpl does not handle parameters of type: "
+                        "MLWritable that are not Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+                        "it cannot be Validator or OneVsRest")
+                else:
+                    jsonParam['value'] = v
+                    jsonParam['isJson'] = True
+                jsonParamMap.append(jsonParam)
+            jsonEstimatorParamMaps.append(jsonParamMap)
+
+        skipParams = ['estimator', 'evaluator', 'estimatorParamMaps']
+
+        jsonParams = {}
+        for p, v in instance._paramMap.items():
+            if p.name not in skipParams:
+                jsonParams[p.name] = v
+
+        jsonParams['estimatorParamMaps'] = jsonEstimatorParamMaps
+
+        DefaultParamsWriter.saveMetadata(instance, path, sc, extraMetadata, jsonParams)
+        evaluatorPath = os.path.join(path, 'evaluator')
+        instance.getEvaluator().save(evaluatorPath)
+        estimatorPath = os.path.join(path, 'estimator')
+        instance.getEstimator().save(estimatorPath)
+
+    @staticmethod
+    def load(path, sc, metadata):
+        evaluatorPath = os.path.join(path, 'evaluator')
+        evaluator = DefaultParamsReader.loadParamsInstance(evaluatorPath, sc)
+        estimatorPath = os.path.join(path, 'estimator')
+        estimator = DefaultParamsReader.loadParamsInstance(estimatorPath, sc)
+
+        uidToParams = MetaAlgorithmReadWrite.getUidMap(estimator)

Review comment:
       This map: `MetaAlgorithmReadWrite.getUidMap` used to find the parent of each param in `estimatorParamMaps`
   
   e.g.
   CrossValidator on pipeline, and pipeline include [transformer1, transformer2, estimator1], and CrossValidator want to tune on params : [transformer1.param1, transformer2.param1, estimator1.params1]
   
   https://github.com/apache/spark/pull/30471/files#r529101510
   
   Then the  getUidMap is used for: providing uid, get the corresponding transformer / estimator which is the parent of the param.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732502019






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-735786611






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737946159


   **[Test build #132130 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132130/testReport)** for PR 30471 at commit [`0d02012`](https://github.com/apache/spark/commit/0d020120f216e7f382c83916ca828d2b4525a402).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737067285






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r533535796



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \
+                        isinstance(Evaluator):
+                    relative_path = f'epm_{p.name}{numParamsNotJson}'
+                    param_path = os.path.join(path, relative_path)
+                    numParamsNotJson += 1
+                    v.save(param_path)
+                    jsonParam['value'] = relative_path
+                    jsonParam['isJson'] = False
+                elif isinstance(v, MLWritable):
+                    raise RuntimeError(
+                        "ValidatorSharedReadWrite.saveImpl does not handle parameters of type: "
+                        "MLWritable that are not Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+                        "it cannot be Validator or OneVsRest")

Review comment:
       Keep consistent with java side. Java side support `Evaluator` type param. Although current spark estimators don't have such params, but 3rd party estimator may has that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-733267314






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732269457


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36156/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732591133






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737671169






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r531111061



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \

Review comment:
       > It's easy to add the extension when that case is explicitly supported
   
   Seems no.
   Suppose there're some 3rd party libraries depends on spark. And 3rd party estimators may include an evaluator param (e.g. for supervising in training early stop) . We need to support them.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][ML-12519] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732244362


   **[Test build #131553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131553/testReport)** for PR 30471 at commit [`37bb750`](https://github.com/apache/spark/commit/37bb7506b9f89d3bd0400e96edadac90611345c8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737998785






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 closed pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
WeichenXu123 closed pull request #30471:
URL: https://github.com/apache/spark/pull/30471


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737067285






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737074437






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-733267314






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-733032271






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737675405






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liangz1 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530029579



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \

Review comment:
       The Validators class will directly take `Estimator` and `Evaluator`, and the `Transformer` will be part of the pipeline Estimator. Should the `Transformer` params be part of the pipeline params?

##########
File path: python/pyspark/ml/util.py
##########
@@ -554,19 +564,72 @@ def getAndSetParams(instance, metadata):
                 paramValue = metadata['defaultParamMap'][paramName]
                 instance._setDefault(**{paramName: paramValue})
 
+    @staticmethod
+    def isPythonParamsInstance(metadata):
+        return 'language' in metadata['paramMap'] and \
+               metadata['paramMap']['language'].lower() == 'python'
+
     @staticmethod
     def loadParamsInstance(path, sc):
         """
         Load a :py:class:`Params` instance from the given path, and return it.
         This assumes the instance inherits from :py:class:`MLReadable`.
         """
         metadata = DefaultParamsReader.loadMetadata(path, sc)
-        pythonClassName = metadata['class'].replace("org.apache.spark", "pyspark")
+        if DefaultParamsReader.isPythonParamsInstance(metadata):
+            pythonClassName = metadata['class']
+        else:
+            pythonClassName = metadata['class'].replace("org.apache.spark", "pyspark")
         py_type = DefaultParamsReader.__get_class(pythonClassName)
         instance = py_type.load(path)
         return instance
 
 
+class MetaAlgorithmReadWrite:
+
+    @staticmethod
+    def getUidMap(instance):
+        uidList = MetaAlgorithmReadWrite.getUidMapImpl(instance)
+        uidMap = dict(uidList)
+        if len(uidList) != len(uidMap):
+            raise RuntimeError(f'{instance.__class__.__module__}.{instance.__class__.__name__}'
+                               f'.load found a compound estimator with stages with duplicate '
+                               f'UIDs. List of UIDs: {list(uidMap.keys())}.')
+        return uidMap
+
+    @staticmethod
+    def getUidMapImpl(instance):

Review comment:
       Naming it as `getUidList` sounds more natural to me.

##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \
+                        isinstance(Evaluator):
+                    relative_path = f'epm_{p.name}{numParamsNotJson}'
+                    param_path = os.path.join(path, relative_path)
+                    numParamsNotJson += 1
+                    v.save(param_path)
+                    jsonParam['value'] = relative_path
+                    jsonParam['isJson'] = False
+                elif isinstance(v, MLWritable):
+                    raise RuntimeError(
+                        "ValidatorSharedReadWrite.saveImpl does not handle parameters of type: "
+                        "MLWritable that are not Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+                        "it cannot be Validator or OneVsRest")

Review comment:
       It would be clearer if we can create an interface similar to `DefaultParamsWritable` and do not use
   ~~~
                  if (isinstance(v, Estimator) and not (
                           isinstance(v, _ValidatorParams) or
                           isinstance(v, OneVsRest))
                       ) or isinstance(v, Transformer) or \
                           isinstance(Evaluator):
   ~~~
   since it looks very confusing.

##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []

Review comment:
       L219 and L221 are called maps but are actually lists. Can you fix the names?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732522903


   JVM side logic:
   
   Shared load/save impl
   https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala
   
   MetaAlgorithmReadWrite
   https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L638
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-735786141


   **[Test build #131988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131988/testReport)** for PR 30471 at commit [`18f0551`](https://github.com/apache/spark/commit/18f0551d2162b91675f16f7bde3b63382fc7c20d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737055896


   **[Test build #132028 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132028/testReport)** for PR 30471 at commit [`e4f8acb`](https://github.com/apache/spark/commit/e4f8acbdb82d762d9323bc0c00d2e1b3993f097d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [WIP][ML-12519] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732245114






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737725930


   I would like to keep this PR open for one more day to see whether @srowen has some comment.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737136996






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zero323 commented on a change in pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
zero323 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r533570707



##########
File path: python/pyspark/ml/classification.py
##########
@@ -2991,8 +2994,59 @@ def _to_java(self):
         _java_obj.setRawPredictionCol(self.getRawPredictionCol())
         return _java_obj
 
+    @classmethod
+    def read(cls):
+        return OneVsRestReader(cls)
+
+    def write(self):
+        if isinstance(self.getClassifier(), JavaMLWritable):
+            return JavaMLWriter(self)
+        else:
+            return OneVsRestWriter(self)
+
+
+class OneVsRestSharedReadWrite:
+    @staticmethod
+    def saveImpl(instance, sc, path, extraMetadata=None):
+        skipParams = ['classifier']
+        jsonParams = DefaultParamsWriter.extractJsonParams(instance, skipParams)
+        DefaultParamsWriter.saveMetadata(instance, path, sc, paramMap=jsonParams,
+                                         extraMetadata=extraMetadata)
+        classifierPath = os.path.join(path, 'classifier')
+        instance.getClassifier().save(classifierPath)
+
+    @staticmethod
+    def loadClassifier(path, sc):
+        classifierPath = os.path.join(path, 'classifier')
+        return DefaultParamsReader.loadParamsInstance(classifierPath, sc)
+
+
+class OneVsRestReader(MLReader):
+    def __init__(self, cls):
+        super(OneVsRestReader, self).__init__()
+        self.cls = cls
+
+    def load(self, path):
+        metadata = DefaultParamsReader.loadMetadata(path, self.sc)
+        if not DefaultParamsReader.isPythonParamsInstance(metadata):
+            return JavaMLReader(self.cls).load(path)
+        else:
+            classifier = OneVsRestSharedReadWrite.loadClassifier(path, self.sc)
+            ova = OneVsRest(classifier=classifier)._resetUid(metadata['uid'])
+            DefaultParamsReader.getAndSetParams(ova, metadata, skipParams=['classifier'])
+            return ova
+
+
+class OneVsRestWriter(MLWriter):
+    def __init__(self, instance):
+        super(OneVsRestWriter, self).__init__()
+        self.instance = instance
+
+    def saveImpl(self, path):
+        OneVsRestSharedReadWrite.saveImpl(self.instance, self.sc, path)

Review comment:
       These seem to be intended for internal usage only. If that's the case, shall we mark them as a such (`OneVsRestSharedReadWrite` -> `_OneVsRestSharedReadWrite`, etc.) and skip annotations?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737099589


   **[Test build #132036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132036/testReport)** for PR 30471 at commit [`95364b2`](https://github.com/apache/spark/commit/95364b2f84e8347c03444ae4a2120c9bcd5272b5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [WIP][ML-12519] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732245128


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131553/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liangz1 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530073036



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \
+                        isinstance(Evaluator):
+                    relative_path = f'epm_{p.name}{numParamsNotJson}'
+                    param_path = os.path.join(path, relative_path)
+                    numParamsNotJson += 1
+                    v.save(param_path)
+                    jsonParam['value'] = relative_path
+                    jsonParam['isJson'] = False
+                elif isinstance(v, MLWritable):
+                    raise RuntimeError(
+                        "ValidatorSharedReadWrite.saveImpl does not handle parameters of type: "
+                        "MLWritable that are not Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+                        "it cannot be Validator or OneVsRest")

Review comment:
       Why support `Evaluator`? It seems that it's estimator param maps, not evaluator param maps.
   Why don't support `OneVsRest`? Is it not supported for tuning or just to keep consistent with the scala side? We may just save the `OneVsRest.classifier` the same way as we save the estimator.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737998345


   Would this have any effect for code that already extends these classes? I like the cleanup and refactoring, just looking for anything this would break.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liangz1 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530073036



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \
+                        isinstance(Evaluator):
+                    relative_path = f'epm_{p.name}{numParamsNotJson}'
+                    param_path = os.path.join(path, relative_path)
+                    numParamsNotJson += 1
+                    v.save(param_path)
+                    jsonParam['value'] = relative_path
+                    jsonParam['isJson'] = False
+                elif isinstance(v, MLWritable):
+                    raise RuntimeError(
+                        "ValidatorSharedReadWrite.saveImpl does not handle parameters of type: "
+                        "MLWritable that are not Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+                        "it cannot be Validator or OneVsRest")

Review comment:
       Why support `Evaluator`? It seems that it's estimator param maps, not evaluator param maps.
   Why don't support `OneVsRest`? Is it not supported for tuning or just to keep consistent with the scala side? We may just save the `OneVsRest.classifier` the same way as we save the estimator.
   Why don't support `MLWritable`?
   Any examples of classes that will trigger the `RuntimeError` here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732502019






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732289011






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-735854075






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30471: [WIP][ML-12519] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732244362


   **[Test build #131553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131553/testReport)** for PR 30471 at commit [`37bb750`](https://github.com/apache/spark/commit/37bb7506b9f89d3bd0400e96edadac90611345c8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737654941


   **[Test build #132088 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132088/testReport)** for PR 30471 at commit [`b79aaf9`](https://github.com/apache/spark/commit/b79aaf96c24e69862843b530a85e7754a2d9bc57).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737671169






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732501993


   **[Test build #131582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131582/testReport)** for PR 30471 at commit [`37bb750`](https://github.com/apache/spark/commit/37bb7506b9f89d3bd0400e96edadac90611345c8).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `class CrossValidatorReader(MLReader):`
     * `class CrossValidatorWriter(MLWriter):`
     * `class CrossValidatorModelReader(MLReader):`
     * `class CrossValidatorModelWriter(MLWriter):`
     * `class TrainValidationSplitReader(MLReader):`
     * `class TrainValidationSplitWriter(MLWriter):`
     * `class TrainValidationSplitModelReader(MLReader):`
     * `class TrainValidationSplitModelWriter(MLWriter):`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-736651512


   Remaining minors issue:
   * update related pyi files.
   * add param validation before saving. (Similar to java side `ValidatorParams.validateParams` and `OneVsRestParams.validateParams`
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529101510



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}

Review comment:
       When we save a param,
   We will first save:
   * parent (uid of estimator/transformer)
   * param name




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737136996






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-733032253


   **[Test build #131671 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131671/testReport)** for PR 30471 at commit [`5a39258`](https://github.com/apache/spark/commit/5a3925800b8ecff9911b779eec97bee6f1e6d5da).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529100588



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \
+                        isinstance(Evaluator):
+                    relative_path = f'epm_{p.name}{numParamsNotJson}'
+                    param_path = os.path.join(path, relative_path)
+                    numParamsNotJson += 1
+                    v.save(param_path)
+                    jsonParam['value'] = relative_path
+                    jsonParam['isJson'] = False
+                elif isinstance(v, MLWritable):
+                    raise RuntimeError(
+                        "ValidatorSharedReadWrite.saveImpl does not handle parameters of type: "
+                        "MLWritable that are not Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+                        "it cannot be Validator or OneVsRest")
+                else:
+                    jsonParam['value'] = v
+                    jsonParam['isJson'] = True
+                jsonParamMap.append(jsonParam)
+            jsonEstimatorParamMaps.append(jsonParamMap)
+
+        skipParams = ['estimator', 'evaluator', 'estimatorParamMaps']
+
+        jsonParams = {}
+        for p, v in instance._paramMap.items():
+            if p.name not in skipParams:
+                jsonParams[p.name] = v
+
+        jsonParams['estimatorParamMaps'] = jsonEstimatorParamMaps
+
+        DefaultParamsWriter.saveMetadata(instance, path, sc, extraMetadata, jsonParams)
+        evaluatorPath = os.path.join(path, 'evaluator')
+        instance.getEvaluator().save(evaluatorPath)
+        estimatorPath = os.path.join(path, 'estimator')
+        instance.getEstimator().save(estimatorPath)
+
+    @staticmethod
+    def load(path, sc, metadata):
+        evaluatorPath = os.path.join(path, 'evaluator')
+        evaluator = DefaultParamsReader.loadParamsInstance(evaluatorPath, sc)
+        estimatorPath = os.path.join(path, 'estimator')
+        estimator = DefaultParamsReader.loadParamsInstance(estimatorPath, sc)
+
+        uidToParams = MetaAlgorithmReadWrite.getUidMap(estimator)

Review comment:
       This map: `MetaAlgorithmReadWrite.getUidMap` used to find the parent of each param in `estimatorParamMaps`
   
   e.g.
   CrossValidator on pipeline, and pipeline include [transformer1, transformer2, estimator1], and CrossValidator want to tune on params : [transformer1.params1, transformer2.params1, estimator1]
   
   Then the  getUidMap is used for: providing uid, get the corresponding transformer / estimator which is the parent of the param.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737675405






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liangz1 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530073036



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \
+                        isinstance(Evaluator):
+                    relative_path = f'epm_{p.name}{numParamsNotJson}'
+                    param_path = os.path.join(path, relative_path)
+                    numParamsNotJson += 1
+                    v.save(param_path)
+                    jsonParam['value'] = relative_path
+                    jsonParam['isJson'] = False
+                elif isinstance(v, MLWritable):
+                    raise RuntimeError(
+                        "ValidatorSharedReadWrite.saveImpl does not handle parameters of type: "
+                        "MLWritable that are not Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+                        "it cannot be Validator or OneVsRest")

Review comment:
       Why support `Evaluator`? It seems that it's estimator param maps, not evaluator param maps.
   Why don't support `OneVsRest`? Is it not supported for tuning or just to keep consistent with the scala side? We may just save the `OneVsRest.classifier` the same way as we save the estimator.
   Why don't support `MLWritable`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737914601


   **[Test build #132130 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132130/testReport)** for PR 30471 at commit [`0d02012`](https://github.com/apache/spark/commit/0d020120f216e7f382c83916ca828d2b4525a402).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [WIP][ML-12519] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732245114


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737914601


   **[Test build #132130 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132130/testReport)** for PR 30471 at commit [`0d02012`](https://github.com/apache/spark/commit/0d020120f216e7f382c83916ca828d2b4525a402).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737113824


   **[Test build #132036 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132036/testReport)** for PR 30471 at commit [`95364b2`](https://github.com/apache/spark/commit/95364b2f84e8347c03444ae4a2120c9bcd5272b5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-738013420


   @srowen 
   
   > Would this have any effect for code that already extends these classes? I like the cleanup and refactoring, just looking for anything this would break.
   
   I think won't break.
   
   * For interface, it only change CrossValidator/TrainValidateSplit/OneVsRest to inherit MLReadable/MLWritable instead of inherit JavaMLReadable/JavaMLWritable . This won't break anything except user run code like `isinstance(crossvalidator, JavaMLReadable)`, which is meaningless.
   
   * For save/load backward compatibility, also won't break anything. Any model which can be saved successfully by old code, in new code it will also be saved exactly the same, it will go _to_java().save() route. For python backend estimator/evaluator cases, old code will directly raise error but new code it will go python writer/reader route to save/load.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liangz1 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530076126



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \

Review comment:
       > A pyspark param value can be estimator/transformer/evaluator. They're all legal.
   Although currently pyspark does not have the case "transformer" to be a param value,
   but, allow it here is to provide extensibility.
   
   Not allowing them may reduce the confusion by a lot. It's easy to add the extension when that case is explicitly supported, right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732501457


   **[Test build #131582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131582/testReport)** for PR 30471 at commit [`37bb750`](https://github.com/apache/spark/commit/37bb7506b9f89d3bd0400e96edadac90611345c8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737998759


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36732/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-735786141


   **[Test build #131988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131988/testReport)** for PR 30471 at commit [`18f0551`](https://github.com/apache/spark/commit/18f0551d2162b91675f16f7bde3b63382fc7c20d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][ML-12519] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732245097


   **[Test build #131553 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131553/testReport)** for PR 30471 at commit [`37bb750`](https://github.com/apache/spark/commit/37bb7506b9f89d3bd0400e96edadac90611345c8).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `class CrossValidatorReader(MLReader):`
     * `class CrossValidatorWriter(MLWriter):`
     * `class CrossValidatorModelReader(MLReader):`
     * `class CrossValidatorModelWriter(MLWriter):`
     * `class TrainValidationSplitReader(MLReader):`
     * `class TrainValidationSplitWriter(MLWriter):`
     * `class TrainValidationSplitModelReader(MLReader):`
     * `class TrainValidationSplitModelWriter(MLWriter):`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732289011






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737082643


   Everything is ready. Gentle ping @zhengruifeng @srowen @zero323 Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-738474898


   merge to master. Thanks !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737648921


   > LGTM
   > It seems that python backend Evaluator is not tested, but I think it can be added in follow-up PRs.
   
   Added test for python backend evaluator.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737984893






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529099004



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \
+                        isinstance(Evaluator):
+                    relative_path = f'epm_{p.name}{numParamsNotJson}'
+                    param_path = os.path.join(path, relative_path)
+                    numParamsNotJson += 1
+                    v.save(param_path)
+                    jsonParam['value'] = relative_path
+                    jsonParam['isJson'] = False

Review comment:
       For the case:
   CrossValidator on estimator of OneVsRest, and need to turning the parameter `OneVsRest.classifier`, then  EstimatorParamMaps will include a param (OneVsRest.classifier -> A_Classifier)
   We need to consider how to saving Classifier type param value.
   This is what the code do:
    * if it is normal param, we save it as Json format and `isJson` field is True.
    * otherwise set `isJson` field False, and save it by object.save().




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737042079


   **[Test build #132028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132028/testReport)** for PR 30471 at commit [`e4f8acb`](https://github.com/apache/spark/commit/e4f8acbdb82d762d9323bc0c00d2e1b3993f097d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732501457


   **[Test build #131582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131582/testReport)** for PR 30471 at commit [`37bb750`](https://github.com/apache/spark/commit/37bb7506b9f89d3bd0400e96edadac90611345c8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530060435



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \

Review comment:
       > The Validators class will directly take Estimator and Evaluator, and the Transformer will be part of the pipeline Estimator. Should the Transformer params be part of the pipeline params?
   
   A pyspark param value can be estimator/transformer/evaluator. They're all legal.
   Although currently pyspark does not have the case "transformer" to be a param value,
   but, allow it here is to provide extensibility.
   
   ```
   if (isinstance(v, Estimator) and not (
                           isinstance(v, _ValidatorParams) or
                           isinstance(v, OneVsRest))
                       ) or isinstance(v, Transformer) or \
                           isinstance(Evaluator):
   ```
   The logic try to keep equivalent with jvm side logic `isInstanceOf[DefaultParamsWritable] && !isInstanceOf[MLWritable]`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-733031641


   **[Test build #131671 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131671/testReport)** for PR 30471 at commit [`5a39258`](https://github.com/apache/spark/commit/5a3925800b8ecff9911b779eec97bee6f1e6d5da).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737740399


   @WeichenXu123 do you aim to get this in for Spark 3.1? if so, I would encourage to merge it soon. Branch-cut is tomorrow and will happen 00:00 4th PDT.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737667602


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36687/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-733031641


   **[Test build #131671 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131671/testReport)** for PR 30471 at commit [`5a39258`](https://github.com/apache/spark/commit/5a3925800b8ecff9911b779eec97bee6f1e6d5da).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530055236



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []

Review comment:
       `jsonParamMap` ==> `jsonParamList`: sounds good.
   
   `jsonEstimatorParamMaps`, the name `xxxMaps` actually means `xxxMapList`. Similar to existing codes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-733032271






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529100588



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \
+                        isinstance(Evaluator):
+                    relative_path = f'epm_{p.name}{numParamsNotJson}'
+                    param_path = os.path.join(path, relative_path)
+                    numParamsNotJson += 1
+                    v.save(param_path)
+                    jsonParam['value'] = relative_path
+                    jsonParam['isJson'] = False
+                elif isinstance(v, MLWritable):
+                    raise RuntimeError(
+                        "ValidatorSharedReadWrite.saveImpl does not handle parameters of type: "
+                        "MLWritable that are not Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+                        "it cannot be Validator or OneVsRest")
+                else:
+                    jsonParam['value'] = v
+                    jsonParam['isJson'] = True
+                jsonParamMap.append(jsonParam)
+            jsonEstimatorParamMaps.append(jsonParamMap)
+
+        skipParams = ['estimator', 'evaluator', 'estimatorParamMaps']
+
+        jsonParams = {}
+        for p, v in instance._paramMap.items():
+            if p.name not in skipParams:
+                jsonParams[p.name] = v
+
+        jsonParams['estimatorParamMaps'] = jsonEstimatorParamMaps
+
+        DefaultParamsWriter.saveMetadata(instance, path, sc, extraMetadata, jsonParams)
+        evaluatorPath = os.path.join(path, 'evaluator')
+        instance.getEvaluator().save(evaluatorPath)
+        estimatorPath = os.path.join(path, 'estimator')
+        instance.getEstimator().save(estimatorPath)
+
+    @staticmethod
+    def load(path, sc, metadata):
+        evaluatorPath = os.path.join(path, 'evaluator')
+        evaluator = DefaultParamsReader.loadParamsInstance(evaluatorPath, sc)
+        estimatorPath = os.path.join(path, 'estimator')
+        estimator = DefaultParamsReader.loadParamsInstance(estimatorPath, sc)
+
+        uidToParams = MetaAlgorithmReadWrite.getUidMap(estimator)

Review comment:
       This map: `MetaAlgorithmReadWrite.getUidMap` used to find the parent of each param in `estimatorParamMaps`
   
   e.g.
   CrossValidator on pipeline, and pipeline include [transformer1, transformer2, estimator1], and CrossValidator want to tune on params : [transformer1.param1, transformer2.param1, estimator1.params1]
   
   Then the  getUidMap is used for: providing uid, get the corresponding transformer / estimator which is the parent of the param.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737099589


   **[Test build #132036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132036/testReport)** for PR 30471 at commit [`95364b2`](https://github.com/apache/spark/commit/95364b2f84e8347c03444ae4a2120c9bcd5272b5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737131327






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732288984


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36156/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530055236



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []

Review comment:
       they're maps in "json" style (can be directly serialized by json.dumps()) . So named like jsonXXX. It is similar to other existing codes




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737975058


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36732/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737042079


   **[Test build #132028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132028/testReport)** for PR 30471 at commit [`e4f8acb`](https://github.com/apache/spark/commit/e4f8acbdb82d762d9323bc0c00d2e1b3993f097d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732578139


   Example:
   
   In following:
   ```
   lr1 = LogisticRegression()
   lr2 = LogisticRegression(**) #different params setting with lr1
   ovs1 = OneVsRest()
   hashTF1 = HashTF()
   ```
   
   (1) `CrossValidator(estimator=lr1, evaluator=ev1)`, tune on paramMaps: [ {lr1.maxIter: 3, lr1.regParam: 0.1},   {lr1.maxIter: 5, lr1.regParam: 0.01} ], then:
   
   uidMap = {lr1.uid: lr1, ev1.uid: ev1},  all params in estimatorParamMaps  `isJson` = True
   
   (2)
   ```
   pipeline1 = Pipeline(stages=[hashTF1, lr1])
   CrossValidator(estimator=pipeline1, evaluator=ev1)
   ```
   , tune on paramMaps: [ {hashTF1.numFeatures: 10, lr1.maxIter: 3},   {hashTF1.numFeatures: 20, lr1.maxIter: 5} ], then:
   
   uidMap = {pipeline1.uid: pipeline1, hashTF1.uid: hashTF1, lr1.uid: lr1, ev1.uid: ev1}, all params in estimatorParamMaps `isJson` = True
   
   (3) 
   `CrossValidator(estimator=ovs1, evaluator=ev1)`,  tune on paramMaps: [{ovs1.classifier: lr1}, {ovs1.classifier: lr2}], then
   
   uidMap = {ovs1.uid: ovs1, lr1.uid: lr1, ev1.uid: ev1},  all params in estimatorParamMaps  `isJson` = False  (the param values is lr1 or lr2) 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737654941


   **[Test build #132088 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132088/testReport)** for PR 30471 at commit [`b79aaf9`](https://github.com/apache/spark/commit/b79aaf96c24e69862843b530a85e7754a2d9bc57).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-736686889


   This is an important feature. Hope it go into spark 3.1, remaining 3 days before code freezing...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-735786611






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737675382


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36687/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-735786596


   **[Test build #131988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131988/testReport)** for PR 30471 at commit [`18f0551`](https://github.com/apache/spark/commit/18f0551d2162b91675f16f7bde3b63382fc7c20d).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-734417060






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-734384562


   **[Test build #131850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131850/testReport)** for PR 30471 at commit [`e8f4885`](https://github.com/apache/spark/commit/e8f4885989d31378a013e361a8ee8e239a4d2650).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-735854075






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737998785






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-734385072






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737984893






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-732591133






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530056747



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \

Review comment:
       > It would be clearer if we can create an interface similar to DefaultParamsWritable
   
   Pyspark already has this interface, the issue is the existing pyspark estimator/transformer did not inherit `DefaultParamsWritable`, change them to inherit `DefaultParamsWritable` may cause breaking changes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r533574012



##########
File path: python/pyspark/ml/classification.py
##########
@@ -2991,8 +2994,59 @@ def _to_java(self):
         _java_obj.setRawPredictionCol(self.getRawPredictionCol())
         return _java_obj
 
+    @classmethod
+    def read(cls):
+        return OneVsRestReader(cls)
+
+    def write(self):
+        if isinstance(self.getClassifier(), JavaMLWritable):
+            return JavaMLWriter(self)
+        else:
+            return OneVsRestWriter(self)
+
+
+class OneVsRestSharedReadWrite:
+    @staticmethod
+    def saveImpl(instance, sc, path, extraMetadata=None):
+        skipParams = ['classifier']
+        jsonParams = DefaultParamsWriter.extractJsonParams(instance, skipParams)
+        DefaultParamsWriter.saveMetadata(instance, path, sc, paramMap=jsonParams,
+                                         extraMetadata=extraMetadata)
+        classifierPath = os.path.join(path, 'classifier')
+        instance.getClassifier().save(classifierPath)
+
+    @staticmethod
+    def loadClassifier(path, sc):
+        classifierPath = os.path.join(path, 'classifier')
+        return DefaultParamsReader.loadParamsInstance(classifierPath, sc)
+
+
+class OneVsRestReader(MLReader):
+    def __init__(self, cls):
+        super(OneVsRestReader, self).__init__()
+        self.cls = cls
+
+    def load(self, path):
+        metadata = DefaultParamsReader.loadMetadata(path, self.sc)
+        if not DefaultParamsReader.isPythonParamsInstance(metadata):
+            return JavaMLReader(self.cls).load(path)
+        else:
+            classifier = OneVsRestSharedReadWrite.loadClassifier(path, self.sc)
+            ova = OneVsRest(classifier=classifier)._resetUid(metadata['uid'])
+            DefaultParamsReader.getAndSetParams(ova, metadata, skipParams=['classifier'])
+            return ova
+
+
+class OneVsRestWriter(MLWriter):
+    def __init__(self, instance):
+        super(OneVsRestWriter, self).__init__()
+        self.instance = instance
+
+    def saveImpl(self, path):
+        OneVsRestSharedReadWrite.saveImpl(self.instance, self.sc, path)

Review comment:
       Except OneVsRestSharedReadWrite, other Reader/Writer should be public, similar to Pipeline Reader/Writer




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737131327






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-734384562


   **[Test build #131850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131850/testReport)** for PR 30471 at commit [`e8f4885`](https://github.com/apache/spark/commit/e8f4885989d31378a013e361a8ee8e239a4d2650).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zero323 commented on a change in pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
zero323 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r534487321



##########
File path: python/pyspark/ml/classification.py
##########
@@ -2991,8 +2994,59 @@ def _to_java(self):
         _java_obj.setRawPredictionCol(self.getRawPredictionCol())
         return _java_obj
 
+    @classmethod
+    def read(cls):
+        return OneVsRestReader(cls)
+
+    def write(self):
+        if isinstance(self.getClassifier(), JavaMLWritable):
+            return JavaMLWriter(self)
+        else:
+            return OneVsRestWriter(self)
+
+
+class OneVsRestSharedReadWrite:
+    @staticmethod
+    def saveImpl(instance, sc, path, extraMetadata=None):
+        skipParams = ['classifier']
+        jsonParams = DefaultParamsWriter.extractJsonParams(instance, skipParams)
+        DefaultParamsWriter.saveMetadata(instance, path, sc, paramMap=jsonParams,
+                                         extraMetadata=extraMetadata)
+        classifierPath = os.path.join(path, 'classifier')
+        instance.getClassifier().save(classifierPath)
+
+    @staticmethod
+    def loadClassifier(path, sc):
+        classifierPath = os.path.join(path, 'classifier')
+        return DefaultParamsReader.loadParamsInstance(classifierPath, sc)
+
+
+class OneVsRestReader(MLReader):
+    def __init__(self, cls):
+        super(OneVsRestReader, self).__init__()
+        self.cls = cls
+
+    def load(self, path):
+        metadata = DefaultParamsReader.loadMetadata(path, self.sc)
+        if not DefaultParamsReader.isPythonParamsInstance(metadata):
+            return JavaMLReader(self.cls).load(path)
+        else:
+            classifier = OneVsRestSharedReadWrite.loadClassifier(path, self.sc)
+            ova = OneVsRest(classifier=classifier)._resetUid(metadata['uid'])
+            DefaultParamsReader.getAndSetParams(ova, metadata, skipParams=['classifier'])
+            return ova
+
+
+class OneVsRestWriter(MLWriter):
+    def __init__(self, instance):
+        super(OneVsRestWriter, self).__init__()
+        self.instance = instance
+
+    def saveImpl(self, path):
+        OneVsRestSharedReadWrite.saveImpl(self.instance, self.sc, path)

Review comment:
       > Except OneVsRestSharedReadWrite, other Reader/Writer should be public, similar to Pipeline Reader/Writer
   
   My point  was that the particular implementation of `Reader` / `Writer` is completely irrelevant to the end user.  In terms of types we still expect `MLReader` / `MLWriter` and `OneVsRest` variants are unlikely to be useful for anyone by devs.
   
   I believe that for the same reason we mark `PipelineReader` and `PipelineModelWriter` as private in comments:
   
   https://github.com/apache/spark/blob/92bfbcb2e372e8fecfe65bc582c779d9df4036bb/python/pyspark/ml/pipeline.py#L197
   
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529109498



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \
+                        isinstance(Evaluator):
+                    relative_path = f'epm_{p.name}{numParamsNotJson}'
+                    param_path = os.path.join(path, relative_path)
+                    numParamsNotJson += 1
+                    v.save(param_path)
+                    jsonParam['value'] = relative_path
+                    jsonParam['isJson'] = False
+                elif isinstance(v, MLWritable):
+                    raise RuntimeError(
+                        "ValidatorSharedReadWrite.saveImpl does not handle parameters of type: "
+                        "MLWritable that are not Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+                        "it cannot be Validator or OneVsRest")

Review comment:
       In JVM side, the logic is only support `DefaultParamsWritable` instance. 
   Pyspark side we cannot use it, but we can equivalently use:
   Support Estimaor/Evaluator/Transformer type but excludes Validator or OneVsRest. These types in JVM side are inherits `DefaultParamsWritable`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liangz1 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530073036



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []
+            for p, v in paramMap.items():
+                jsonParam = {'parent': p.parent, 'name': p.name}
+                if (isinstance(v, Estimator) and not (
+                        isinstance(v, _ValidatorParams) or
+                        isinstance(v, OneVsRest))
+                    ) or isinstance(v, Transformer) or \
+                        isinstance(Evaluator):
+                    relative_path = f'epm_{p.name}{numParamsNotJson}'
+                    param_path = os.path.join(path, relative_path)
+                    numParamsNotJson += 1
+                    v.save(param_path)
+                    jsonParam['value'] = relative_path
+                    jsonParam['isJson'] = False
+                elif isinstance(v, MLWritable):
+                    raise RuntimeError(
+                        "ValidatorSharedReadWrite.saveImpl does not handle parameters of type: "
+                        "MLWritable that are not Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+                        "it cannot be Validator or OneVsRest")

Review comment:
       Why support `Evaluator`? It seems that it's estimator param maps, not evaluator param maps.
   Why don't support `OneVsRest`? Is it not supported for tuning or just to keep consistent with the scala side? We may just save the `OneVsRest.classifier` the same way as we save the estimator.
   Why don't support `MLWritable`?
   Any examples of classes that will trigger the `RuntimeError` here?
   It's also preferred that we refactor this function to highlight the high-level logic and hide details.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-734385050


   **[Test build #131850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131850/testReport)** for PR 30471 at commit [`e8f4885`](https://github.com/apache/spark/commit/e8f4885989d31378a013e361a8ee8e239a4d2650).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530055236



##########
File path: python/pyspark/ml/tuning.py
##########
@@ -207,6 +210,205 @@ def _to_java_impl(self):
         return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+    @staticmethod
+    def saveImpl(path, instance, sc, extraMetadata=None):
+        from pyspark.ml.classification import OneVsRest
+        numParamsNotJson = 0
+        jsonEstimatorParamMaps = []
+        for paramMap in instance.getEstimatorParamMaps():
+            jsonParamMap = []

Review comment:
       oh. there's a mistake here. I will update.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30471: [SPARK-33520][ML][PySpark] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737661701


   **[Test build #132088 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132088/testReport)** for PR 30471 at commit [`b79aaf9`](https://github.com/apache/spark/commit/b79aaf96c24e69862843b530a85e7754a2d9bc57).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `class DummyEvaluator(Evaluator, DefaultParamsReadable, DefaultParamsWritable):`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-734417060






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30471: [SPARK-33520][ML] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30471:
URL: https://github.com/apache/spark/pull/30471#issuecomment-737074437






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org