Posted to reviews@spark.apache.org by ajaysaini725 <gi...@git.apache.org> on 2017/07/26 22:57:11 UTC

[GitHub] spark pull request #18742: Python persistence helper functions

GitHub user ajaysaini725 opened a pull request:

    https://github.com/apache/spark/pull/18742

    Python persistence helper functions

    ## What changes were proposed in this pull request?
    
    Added DefaultParamsWritable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to Python to support Python-only persistence of JSON-serializable parameters.
    
    ## How was this patch tested?
    
    Instantiated an estimator with JSON-serializable parameters (e.g., LogisticRegression), saved it with the new helper functions, loaded it back, and compared it to the original instance to verify they match. This was tested both in the Python REPL and in the unit tests.
    
    Note to reviewers: there are a few extra comments left in the code for clarity; I will remove them before this is merged to master.
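
    For illustration, a minimal end-to-end sketch of the intended usage (the `ScaleBy` transformer below is a hypothetical Python-only Params class, not part of this PR; it assumes an active SparkSession, a "value" column in the input, and the API shape in which overwrite() returns self):

        from pyspark.ml import Transformer
        from pyspark.ml.param import Param, Params
        from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
        from pyspark.sql.functions import col

        class ScaleBy(Transformer, DefaultParamsReadable, DefaultParamsWritable):
            """Hypothetical transformer whose only state is a JSON-serializable Param."""
            factor = Param(Params._dummy(), "factor", "multiplicative factor")

            def __init__(self, factor=1.0):
                super(ScaleBy, self).__init__()
                self._set(factor=factor)

            def _transform(self, dataset):
                # assumes the input DataFrame has a numeric "value" column
                return dataset.withColumn(
                    "scaled", col("value") * self.getOrDefault(self.factor))

        t = ScaleBy(factor=2.0)
        t.write().overwrite().save("/tmp/scale_by")
        loaded = ScaleBy.load("/tmp/scale_by")   # round-trips via DefaultParamsReader
        assert loaded.getOrDefault(loaded.factor) == 2.0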


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajaysaini725/spark PythonPersistenceHelperFunctions

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18742.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18742
    
----
commit 71771e55517f7b00960b54ab7ac0248643990cbc
Author: Ajay Saini <aj...@gmail.com>
Date:   2017-07-20T02:16:50Z

    Function outline

commit 7974496912aca3512bac69039fd27b2ca0c35626
Author: Ajay Saini <aj...@gmail.com>
Date:   2017-07-26T22:44:31Z

    Working default read and write helper functions. Code does not pass style tests and still has comments.

commit 9cb114bfce8a86e50b57384435f850c75d8507b7
Author: Ajay Saini <aj...@gmail.com>
Date:   2017-07-26T22:53:24Z

    Removed excess comments and fixed style issues.

----




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131000859
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,32 +66,82 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
    +    """
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self):
    +        self.sparkSession = None
    +
    +    def context(self, sqlContext):
    +        """
    +        Sets the SQL context to use for saving.
    +
    +        .. note:: Deprecated in 2.1 and will be removed in 3.0, use session instead.
    +        """
    +        # raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +        self.sparkSession = sqlContext.sparkSession
    +        return self
    +
    +    def session(self, sparkSession):
    +        """Sets the Spark Session to use for saving."""
    --- End diff --
    
    Copy doc from Scala; no need to rewrite it




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131288786
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +353,143 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    Class for making simple Params types writable. Assumes that all parameters
    +    are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def write(self):
    +        """Returns a DefaultParamsWriter instance for this class."""
    +        if isinstance(self, Params):
    +            return DefaultParamsWriter(self)
    +        else:
    +            raise TypeError("Cannot use DefautParamsWritable with type %s because it does not " +
    +                            " extend Params.", type(self))
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +    """
    +    Class for writing Estimators and Transformers whose parameters are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +
    +    def saveImpl(self, path):
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    +    @staticmethod
    +    def save_metadata(instance, path, sc, extraMetadata=None, paramMap=None):
    +        metadataPath = os.path.join(path, "metadata")
    +        metadataJson = DefaultParamsWriter.get_metadata_to_save(instance,
    +                                                                metadataPath,
    +                                                                sc,
    +                                                                extraMetadata,
    +                                                                paramMap)
    +        sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
    +
    +    @staticmethod
    +    def get_metadata_to_save(instance, path, sc, extraMetadata=None, paramMap=None):
    --- End diff --
    
    path is not used




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r130521214
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +289,124 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +
    +    # overrides the write() function in MLWriteable
    +    # users call .save() in MLWriteable which calls this write() function and then calls
    +    # the .save() in DefaultParamsWriter
    +    # this can be overridden to return a different Writer (ex. OneVsRestWriter as seen in Scala)
    +    def write(self):
    +        # instance of check for params?
    +        return DefaultParamsWriter(self)
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +        self.sc = SparkContext._active_spark_context
    +
    +    # if a model extends DefaultParamsWriteable this save() function is called
    +    def save(self, path):
    +        if self.shouldOverwrite:
    +            # This command removes a file. Is this enough?
    +            os.remove(path)
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    +    def overwrite(self):
    +        self.shouldOverwrite = True
    +        return self
    +
    +    @staticmethod
    +    def save_metadata(instance, path, sc, extraMetadata=None, paramMap=None):
    +        metadataPath = os.path.join(path, "metadata")
    +        metadataJson = DefaultParamsWriter.get_metadata_to_save(instance,
    +                                                                metadataPath,
    +                                                                sc,
    +                                                                extraMetadata,
    +                                                                paramMap)
    +        sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
    +
    +    @staticmethod
    +    def get_metadata_to_save(instance, path, sc, extraMetadata=None, paramMap=None):
    +        uid = instance.uid
    +        cls = instance.__module__ + '.' + instance.__class__.__name__
    +        params = instance.extractParamMap()
    +        jsonParams = {}
    +        if paramMap is not None:
    +            for p in paramMap:
    +                jsonParams[p.name] = paramMap[p]
    +        else:
    +            for p in params:
    +                jsonParams[p.name] = params[p]
    +        basicMetadata = {"class": cls, "timestamp": int(round(time.time() * 1000)),
    +                         "sparkVersion": sc.version, "uid": uid, "paramMap": jsonParams}
    --- End diff --
    
    Maybe we should use `long(round(time.time() * 1000))`?
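
    For context, `long` exists only in Python 2; Python 3 removed it, and its `int` is already arbitrary precision, so no cast is needed there to match the Scala `Long` timestamp. One portable sketch:

        import sys
        import time

        if sys.version_info[0] >= 3:
            long = int  # Python 3: plain int is unbounded; `long` was removed

        timestamp = long(round(time.time() * 1000))  # milliseconds since the epoch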




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131288351
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +353,143 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    Class for making simple Params types writable. Assumes that all parameters
    --- End diff --
    
    Could you please merge the Scala doc into this to make this more detailed?




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80364 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80364/testReport)** for PR 18742 at commit [`860132e`](https://github.com/apache/spark/commit/860132e1eb9df9388d47ae838db805b58630b2d1).




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131001546
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,32 +66,82 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
    +    """
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self):
    +        self.sparkSession = None
    +
    +    def context(self, sqlContext):
    +        """
    +        Sets the SQL context to use for saving.
    +
    +        .. note:: Deprecated in 2.1 and will be removed in 3.0, use session instead.
    +        """
    +        # raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +        self.sparkSession = sqlContext.sparkSession
    +        return self
    +
    +    def session(self, sparkSession):
    +        """Sets the Spark Session to use for saving."""
    +        # raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +        self.sparkSession = sparkSession
    +        return self
    +
    +    def getOrCreateSparkSession(self):
    +        if self.sparkSession is None:
    +            self.sparkSession = SparkSession.builder.getOrCreate()
    +        return self.sparkSession
    +
    +    @property
    +    def sc(self):
    +        return self.getOrCreateSparkSession().sparkContext
    +
    +
    +@inherit_doc
    +class MLWriter(BaseReadWrite):
         """
         Utility class that can save ML instances.
     
         .. versionadded:: 2.0.0
         """
     
    +    def __init__(self):
    +        super(MLWriter, self).__init__()
    +        self.shouldOverwrite = False
    +
         def save(self, path):
             """Save the ML instance to the input path."""
    +        if isinstance(self, JavaMLWriter):
    +            self.saveImpl(path)
    +        else:
    +            if self.shouldOverwrite:
    +                os.remove(path)
    +            self.saveImpl(path)
    +
    +    def saveImpl(self, path):
             raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
     
         def overwrite(self):
             """Overwrites if the output path already exists."""
    -        raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +        self.shouldOverwrite = True
    --- End diff --
    
    return self (to allow builder pattern); always match Scala API where possible & where it makes sense
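
    A sketch of the suggested form and the chained call it enables:

        def overwrite(self):
            """Overwrites if the output path already exists."""
            self.shouldOverwrite = True
            return self  # allows: instance.write().overwrite().save(path)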




[GitHub] spark issue #18742: [Spark-21542][Python]Python persistence helper functions

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #79980 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79980/testReport)** for PR 18742 at commit [`9cb114b`](https://github.com/apache/spark/commit/9cb114bfce8a86e50b57384435f850c75d8507b7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `                \"class name `




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80145/
    Test FAILed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131749037
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +333,201 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Helper trait for making simple :py:class`Params` types writable.  If a :py:class`Params`
    --- End diff --
    
    Does this work with the missing colon?  Please make sure to generate the API docs and check them.  See https://github.com/apache/spark/tree/master/docs for instructions
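
    For reference, the role needs a second colon to render as a cross-reference; a sketch of both forms:

        # Broken: `:py:class` is missing the trailing colon, so Sphinx emits it literally
        """Helper trait for making simple :py:class`Params` types writable."""

        # Correct reST role syntax, which Sphinx renders as a link:
        """Helper trait for making simple :py:class:`Params` types writable."""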




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80145 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80145/testReport)** for PR 18742 at commit [`ac4cf70`](https://github.com/apache/spark/commit/ac4cf70d7968848638acc080c96f5397275b4655).




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80361 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80361/testReport)** for PR 18742 at commit [`d62154c`](https://github.com/apache/spark/commit/d62154c2657e02d23ffa6ca78e248a4bccd5f1bf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131516809
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +333,204 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Helper trait for making simple `Params` types writable.  If a `Params` class stores
    +    all data as [[pyspark.ml.param.Param]] values, then extending this trait will provide
    --- End diff --
    
    There are still some instances of Scaladoc-style class references using double brackets like this: ```[[MyClass]]```.  Switch to the python style like ```:py:class:`MyClass` ```




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80367/testReport)** for PR 18742 at commit [`4e96355`](https://github.com/apache/spark/commit/4e963553983383cb1a8cb336e50b1d8105e7c7ae).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131772274
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,33 +66,86 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
         """
    -    Utility class that can save ML instances.
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
     
    -    .. versionadded:: 2.0.0
    +    .. versionadded:: 2.3.0
         """
     
    -    def save(self, path):
    -        """Save the ML instance to the input path."""
    -        raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    -
    -    def overwrite(self):
    -        """Overwrites if the output path already exists."""
    -        raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +    def __init__(self):
    +        self._sparkSession = None
     
         def context(self, sqlContext):
             """
    -        Sets the SQL context to use for saving.
    +        Sets the Spark SQLContext to use for saving/loading.
     
             .. note:: Deprecated in 2.1 and will be removed in 3.0, use session instead.
             """
    -        raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +        raise NotImplementedError("Read/Write is not yet implemented for type: %s" % type(self))
     
         def session(self, sparkSession):
    -        """Sets the Spark Session to use for saving."""
    +        """
    +        Sets the Spark Session to use for saving/loading.
    +        """
    +        self._sparkSession = sparkSession
    +        return self
    +
    +    @property
    +    def sparkSession(self):
    +        """
    +        Returns the user-specified Spark Session or the default.
    +        """
    +        if self._sparkSession is None:
    +            self._sparkSession = SparkSession.builder.getOrCreate()
    +        return self._sparkSession
    +
    +    @property
    +    def sc(self):
    +        """
    +        Returns the underlying `SparkContext`.
    +        """
    +        return self.sparkSession.sparkContext
    +
    +
    +@inherit_doc
    +class MLWriter(BaseReadWrite):
    +    """
    +    Utility class that can save ML instances.
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    def __init__(self):
    +        super(MLWriter, self).__init__()
    +        self.shouldOverwrite = False
    +
    +    def _handleOverwrite(self, path):
    +        from pyspark.ml.wrapper import JavaWrapper
    +
    +        _java_obj = JavaWrapper._new_java_obj("org.apache.spark.ml.util.FileSystemOverwrite")
    +        wrapper = JavaWrapper(_java_obj)
    +        wrapper._call_java("handleOverwrite", path, True, self.sc._jsc.sc())
    --- End diff --
    
    Regarding the `FileSystemOverwrite` wrapper: could we abstract a more generic interface, such as `FileSystem`? It could then be used elsewhere in pyspark, not only for overwriting files but for other file system operations as well.
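
    One possible shape for such an abstraction, as a sketch only (the Java-side class `org.apache.spark.ml.util.FileSystemUtils` and its `delete` method are hypothetical, not part of this PR; the plumbing mirrors the `handleOverwrite` call quoted above):

        from pyspark.ml.wrapper import JavaWrapper

        class _FileSystem(JavaWrapper):
            """Hypothetical generic wrapper over JVM file system operations."""

            def __init__(self, sc):
                _java_obj = JavaWrapper._new_java_obj(
                    "org.apache.spark.ml.util.FileSystemUtils")  # hypothetical class
                super(_FileSystem, self).__init__(_java_obj)
                self._sc = sc

            def delete(self, path):
                # hypothetical Java-side method, called like handleOverwrite above
                self._call_java("delete", path, self._sc._jsc.sc())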




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80364/
    Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r130519716
  
    --- Diff: python/pyspark/ml/param/__init__.py ---
    @@ -375,6 +375,18 @@ def copy(self, extra=None):
             that._defaultParamMap = {}
             return self._copyValues(that, extra)
     
    +    def set(self, param, value):
    +        """
    --- End diff --
    
    We already have `def _set(self, **kwargs)`, so do we really need this method?
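
    For comparison, a sketch of the two call styles (`lr` as a LogisticRegression instance and `maxIter` are illustrative):

        # proposed public method in this PR:
        lr.set(lr.maxIter, 10)
        # existing private helper, which already validates and stores the value:
        lr._set(maxIter=10)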




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80366/testReport)** for PR 18742 at commit [`2f216b7`](https://github.com/apache/spark/commit/2f216b748eb19df7f98c703c34667039edeee29e).




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80226/
    Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80367/testReport)** for PR 18742 at commit [`4e96355`](https://github.com/apache/spark/commit/4e963553983383cb1a8cb336e50b1d8105e7c7ae).




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131285744
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
    @@ -471,3 +471,24 @@ private[ml] object MetaAlgorithmReadWrite {
         List((instance.uid, instance)) ++ subStageMaps
       }
     }
    +
    +private[ml] class FileSystemOverwrite extends Logging {
    +
    +  def handleOverwrite(path: String, shouldOverwrite: Boolean, sc: SparkContext): Unit = {
    --- End diff --
    
    This is the same now as the code in MLWriter.save, so please use this within MLWriter.save to eliminate the duplicated code.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131290424
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +353,143 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    Class for making simple Params types writable. Assumes that all parameters
    +    are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def write(self):
    +        """Returns a DefaultParamsWriter instance for this class."""
    +        if isinstance(self, Params):
    +            return DefaultParamsWriter(self)
    +        else:
    +            raise TypeError("Cannot use DefautParamsWritable with type %s because it does not " +
    +                            " extend Params.", type(self))
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +    """
    +    Class for writing Estimators and Transformers whose parameters are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +
    +    def saveImpl(self, path):
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    +    @staticmethod
    +    def save_metadata(instance, path, sc, extraMetadata=None, paramMap=None):
    +        metadataPath = os.path.join(path, "metadata")
    +        metadataJson = DefaultParamsWriter.get_metadata_to_save(instance,
    +                                                                metadataPath,
    +                                                                sc,
    +                                                                extraMetadata,
    +                                                                paramMap)
    +        sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
    +
    +    @staticmethod
    +    def get_metadata_to_save(instance, path, sc, extraMetadata=None, paramMap=None):
    +        uid = instance.uid
    +        cls = instance.__module__ + '.' + instance.__class__.__name__
    +        params = instance.extractParamMap()
    +        jsonParams = {}
    +        if paramMap is not None:
    +            for p in paramMap:
    +                jsonParams[p.name] = paramMap[p]
    +        else:
    +            for p in params:
    +                jsonParams[p.name] = params[p]
    +        basicMetadata = {"class": cls, "timestamp": long(round(time.time() * 1000)),
    +                         "sparkVersion": sc.version, "uid": uid, "paramMap": jsonParams}
    +        if extraMetadata is not None:
    +            basicMetadata.update(extraMetadata)
    +        return json.dumps(basicMetadata, separators=[',',  ':'])
    +
    +
    +@inherit_doc
    +class DefaultParamsReadable(MLReadable):
    +    """
    +    Class for making simple Params types readable. Assumes that all parameters
    +    are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    @classmethod
    +    def read(cls):
    +        """Returns a DefaultParamsReader instance for this class."""
    +        return DefaultParamsReader(cls)
    +
    +
    +@inherit_doc
    +class DefaultParamsReader(MLReader):
    +    """
    +    Class for reading Estimators and Transformers whose parameters are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self, cls):
    +        super(DefaultParamsReader, self).__init__()
    +        self.cls = cls
    +
    +    @staticmethod
    +    def __get_class(clazz):
    +        """
    +        Loads Python class from its name.
    +        """
    +        parts = clazz.split('.')
    +        module = ".".join(parts[:-1])
    +        m = __import__(module)
    +        for comp in parts[1:]:
    +            m = getattr(m, comp)
    +        return m
    +
    +    def load(self, path):
    +        metadata = DefaultParamsReader.loadMetadata(path, self.sc)
    +        py_type = DefaultParamsReader.__get_class(metadata['class'])
    +        instance = py_type()
    +        instance._resetUid(metadata['uid'])
    +        DefaultParamsReader.getAndSetParams(instance, metadata)
    +        return instance
    +
    +    @staticmethod
    +    def loadMetadata(path, sc, expectedClassName=""):
    --- End diff --
    
    These static methods can be private (add leading underscores)




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131286586
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,20 +66,74 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
    +    """
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self):
    +        self._sparkSession = None
    +
    +    def context(self, sqlContext):
    +        """
    +        Sets the Spark SQLContext to use for saving/loading.
    +
    +        .. note:: Deprecated in 2.1 and will be removed in 3.0, use session instead.
    +        """
    +        raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +
    +    def session(self, sparkSession):
    +        """
    +        Sets the Spark Session to use for saving/loading.
    +        """
    +        self._sparkSession = sparkSession
    +        return self
    +
    +    def sparkSession(self):
    --- End diff --
    
    Copy doc from Scala.  Also, this could be a ```@property``` to match Scala
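
    A sketch of the ```@property``` form, with the doc text matching the Scala sparkSession description (this is also the shape a later revision of the diff adopts):

        @property
        def sparkSession(self):
            """
            Returns the user-specified Spark Session or the default.
            """
            if self._sparkSession is None:
                self._sparkSession = SparkSession.builder.getOrCreate()
            return self._sparkSession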




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80220 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80220/testReport)** for PR 18742 at commit [`be67caa`](https://github.com/apache/spark/commit/be67caa070f5ad7201c5f4c74d9340a2c35830f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80188 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80188/testReport)** for PR 18742 at commit [`6fb82be`](https://github.com/apache/spark/commit/6fb82be6786793c32f4da06f76b09f61dcab8e68).




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80189 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80189/testReport)** for PR 18742 at commit [`b7e29ee`](https://github.com/apache/spark/commit/b7e29eed881075c6cca2e2267a0ba258ec54f754).




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131287820
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -237,6 +300,13 @@ def _load_java_obj(cls, clazz):
                 java_obj = getattr(java_obj, name)
             return java_obj
     
    +    @classmethod
    +    def _load_given_name(cls, java_class):
    --- End diff --
    
    This can be static since you don't use ```cls```
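
    The distinction, as a minimal self-contained sketch (names are illustrative):

        class Loader(object):
            @classmethod
            def from_name(cls, name):
                return cls()            # uses cls, so @classmethod is warranted

            @staticmethod
            def parse_name(name):
                return name.split(".")  # touches neither cls nor self: @staticmethod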




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80366/
    Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80145 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80145/testReport)** for PR 18742 at commit [`ac4cf70`](https://github.com/apache/spark/commit/ac4cf70d7968848638acc080c96f5397275b4655).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131262239
  
    --- Diff: python/pyspark/ml/param/__init__.py ---
    @@ -375,6 +375,18 @@ def copy(self, extra=None):
             that._defaultParamMap = {}
             return self._copyValues(that, extra)
     
    +    def set(self, param, value):
    +        """
    +            Sets the value for a parameter in the parameter map.
    +        """
    +        if not self.hasParam(param.name):
    +            raise TypeError('Invalid param %s given for instance %r. %s' % (p.name, self, e))
    --- End diff --
    
    Can you add a test which tries to set this with an invalid Param?  This line will not work since e is not defined, but tests pass since this line of code is not tested.
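
    A minimal test sketch along the lines requested (the pure-Python Params subclasses are illustrative; note the flagged line references both `p` and `e`, neither of which is defined, so exercising it would currently raise NameError rather than the intended TypeError, which this test would surface):

        import unittest
        from pyspark.ml.param import Param, Params

        class _HasA(Params):
            a = Param(Params._dummy(), "a", "doc for a")

        class _HasB(Params):
            b = Param(Params._dummy(), "b", "doc for b")

        class ParamSetTests(unittest.TestCase):
            def test_set_invalid_param(self):
                owner = _HasA()
                foreign = _HasB().b       # a Param not owned by `owner`
                with self.assertRaises(TypeError):
                    owner.set(foreign, 1)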




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18742: Python persistence helper functions

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #79980 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79980/testReport)** for PR 18742 at commit [`9cb114b`](https://github.com/apache/spark/commit/9cb114bfce8a86e50b57384435f850c75d8507b7).




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131261087
  
    --- Diff: python/pyspark/ml/param/__init__.py ---
    @@ -375,6 +375,18 @@ def copy(self, extra=None):
             that._defaultParamMap = {}
             return self._copyValues(that, extra)
     
    +    def set(self, param, value):
    +        """
    +            Sets the value for a parameter in the parameter map.
    +        """
    +        if not self.hasParam(param.name):
    --- End diff --
    
    Reuse _shouldOwn for validation here.
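
    A sketch of what that could look like, reusing the existing _shouldOwn
    helper (which validates both the owner and the param name) together with
    the param's type converter:

        def set(self, param, value):
            """
            Sets a parameter in the embedded param map.
            """
            self._shouldOwn(param)  # raises ValueError for a foreign Param
            try:
                value = param.typeConverter(value)
            except ValueError as e:
                raise ValueError('Invalid param value given for param "%s". %s'
                                 % (param.name, e))
            self._paramMap[param] = value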




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131332176
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +341,198 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Helper trait for making simple `Params` types writable.  If a `Params` class stores
    +    all data as [[pyspark.ml.param.Param]] values, then extending this trait will provide
    +    a default implementation of writing saved instances of the class.
    +    This only handles simple [[pyspark.ml.param.Param]] types; e.g., it will not handle
    +    [[pyspark.sql.Dataset]].
    +
    +    @see `DefaultParamsReadable`, the counterpart to this trait
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def write(self):
    +        """Returns a DefaultParamsWriter instance for this class."""
    +        if isinstance(self, Params):
    +            return DefaultParamsWriter(self)
    +        else:
    +            raise TypeError("Cannot use DefaultParamsWritable with type %s because it does not "
    +                            "extend Params." % type(self))
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Class for writing Estimators and Transformers whose parameters are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +
    +    def saveImpl(self, path):
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    +    @staticmethod
    +    def save_metadata(instance, path, sc, extraMetadata=None, paramMap=None):
    +        """
    +        Saves metadata + Params to: path + "/metadata"
    +        - class
    +        - timestamp
    +        - sparkVersion
    +        - uid
    +        - paramMap
    +        - (optionally, extra metadata)
    +        @param extraMetadata  Extra metadata to be saved at same level as uid, paramMap, etc.
    +        @param paramMap  If given, this is saved in the "paramMap" field.
    +        """
    +        metadataPath = os.path.join(path, "metadata")
    +        metadataJson = DefaultParamsWriter._get_metadata_to_save(instance,
    +                                                                 sc,
    +                                                                 extraMetadata,
    +                                                                 paramMap)
    +        sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
    +
    +    @staticmethod
    +    def _get_metadata_to_save(instance, sc, extraMetadata=None, paramMap=None):
    +        """
    +        Helper for [[save_metadata()]] which extracts the JSON to save.
    +        This is useful for ensemble models which need to save metadata for many sub-models.
    +
    +        @see [[save_metadata()]] for details on what this includes.
    +        """
    +        uid = instance.uid
    +        cls = instance.__module__ + '.' + instance.__class__.__name__
    +        params = instance.extractParamMap()
    +        jsonParams = {}
    +        if paramMap is not None:
    +            jsonParams = paramMap
    +        else:
    +            for p in params:
    +                jsonParams[p.name] = params[p]
    +        basicMetadata = {"class": cls, "timestamp": long(round(time.time() * 1000)),
    +                         "sparkVersion": sc.version, "uid": uid, "paramMap": jsonParams}
    +        if extraMetadata is not None:
    +            basicMetadata.update(extraMetadata)
    +        return json.dumps(basicMetadata, separators=[',',  ':'])
    +
    +
    +@inherit_doc
    +class DefaultParamsReadable(MLReadable):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Helper trait for making simple `Params` types readable.  If a `Params` class stores
    +    all data as [[pyspark.ml.param.Param]] values, then extending this trait will provide
    --- End diff --
    
    For Python docs, follow other examples such as ```:py:class:`MLWriter` ```
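
    For example, the cross-references in this docstring could read as follows
    (a sketch; note that the Python analogue of Dataset is DataFrame):

        """
        Helper trait for making simple :py:class:`Params` types readable.
        If a :py:class:`Params` class stores all data as
        :py:class:`~pyspark.ml.param.Param` values, then extending this
        trait will provide a default implementation of reading saved
        instances of the class.  This only handles simple
        :py:class:`~pyspark.ml.param.Param` types; e.g., it will not
        handle :py:class:`~pyspark.sql.DataFrame`.

        .. seealso:: :py:class:`DefaultParamsWritable`, the counterpart
            to this trait
        """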




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131000706
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,32 +66,82 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
    +    """
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self):
    +        self.sparkSession = None
    +
    +    def context(self, sqlContext):
    --- End diff --
    
    No need for this, right?




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by ajaysaini725 <gi...@git.apache.org>.
Github user ajaysaini725 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131781393
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1158,6 +1165,33 @@ def test_decisiontree_regressor(self):
             except OSError:
                 pass
     
    +    def test_default_read_write(self):
    +        temp_path = tempfile.mkdtemp()
    +
    +        lr = LogisticRegression()
    +        lr.setMaxIter(50)
    +        lr.setThreshold(.75)
    +        writer = DefaultParamsWriter(lr)
    +
    +        savePath = temp_path + "/lr"
    +        writer.saveImpl(savePath)
    --- End diff --
    
    Pushed a fix




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80220/testReport)** for PR 18742 at commit [`be67caa`](https://github.com/apache/spark/commit/be67caa070f5ad7201c5f4c74d9340a2c35830f2).




[GitHub] spark issue #18742: [Spark-21542][Python]Python persistence helper functions

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79980/
    Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131782033
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,33 +66,86 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
         """
    -    Utility class that can save ML instances.
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
     
    -    .. versionadded:: 2.0.0
    +    .. versionadded:: 2.3.0
         """
     
    -    def save(self, path):
    -        """Save the ML instance to the input path."""
    -        raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    -
    -    def overwrite(self):
    -        """Overwrites if the output path already exists."""
    -        raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +    def __init__(self):
    +        self._sparkSession = None
     
         def context(self, sqlContext):
             """
    -        Sets the SQL context to use for saving.
    +        Sets the Spark SQLContext to use for saving/loading.
     
             .. note:: Deprecated in 2.1 and will be removed in 3.0, use session instead.
             """
    -        raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +        raise NotImplementedError("Read/Write is not yet implemented for type: %s" % type(self))
     
         def session(self, sparkSession):
    -        """Sets the Spark Session to use for saving."""
    +        """
    +        Sets the Spark Session to use for saving/loading.
    +        """
    +        self._sparkSession = sparkSession
    +        return self
    +
    +    @property
    +    def sparkSession(self):
    +        """
    +        Returns the user-specified Spark Session or the default.
    +        """
    +        if self._sparkSession is None:
    +            self._sparkSession = SparkSession.builder.getOrCreate()
    +        return self._sparkSession
    +
    +    @property
    +    def sc(self):
    +        """
    +        Returns the underlying `SparkContext`.
    +        """
    +        return self.sparkSession.sparkContext
    +
    +
    +@inherit_doc
    +class MLWriter(BaseReadWrite):
    +    """
    +    Utility class that can save ML instances.
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    def __init__(self):
    +        super(MLWriter, self).__init__()
    +        self.shouldOverwrite = False
    +
    +    def _handleOverwrite(self, path):
    +        from pyspark.ml.wrapper import JavaWrapper
    +
    +        _java_obj = JavaWrapper._new_java_obj("org.apache.spark.ml.util.FileSystemOverwrite")
    +        wrapper = JavaWrapper(_java_obj)
    +        wrapper._call_java("handleOverwrite", path, True, self.sc._jsc.sc())
    --- End diff --
    
    Let's do that when needed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131288896
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +353,143 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    Class for making simple Params types writable. Assumes that all parameters
    +    are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def write(self):
    +        """Returns a DefaultParamsWriter instance for this class."""
    +        if isinstance(self, Params):
    +            return DefaultParamsWriter(self)
    +        else:
    +            raise TypeError("Cannot use DefaultParamsWritable with type %s because it does not "
    +                            "extend Params." % type(self))
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +    """
    +    Class for writing Estimators and Transformers whose parameters are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +
    +    def saveImpl(self, path):
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    +    @staticmethod
    +    def save_metadata(instance, path, sc, extraMetadata=None, paramMap=None):
    --- End diff --
    
    add leading underscore to make this private




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131516889
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +333,204 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Helper trait for making simple `Params` types writable.  If a `Params` class stores
    +    all data as [[pyspark.ml.param.Param]] values, then extending this trait will provide
    +    a default implementation of writing saved instances of the class.
    +    This only handles simple [[pyspark.ml.param.Param]] types; e.g., it will not handle
    +    [[pyspark.sql.Dataset]].
    +
    +    @see `DefaultParamsReadable`, the counterpart to this trait
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def write(self):
    +        """Returns a DefaultParamsWriter instance for this class."""
    +        from pyspark.ml.param import Params
    +
    +        if isinstance(self, Params):
    +            return DefaultParamsWriter(self)
    +        else:
    +            raise TypeError("Cannot use DefaultParamsWritable with type %s because it does not "
    +                            "extend Params." % type(self))
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Specialization of :py:class:`MLWriter` for :py:class:`Params` types
    +
    +    Class for writing Estimators and Transformers whose parameters are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +
    +    def saveImpl(self, path):
    +        DefaultParamsWriter.saveMetadata(self.instance, path, self.sc)
    +
    +    @staticmethod
    +    def saveMetadata(instance, path, sc, extraMetadata=None, paramMap=None):
    +        """
    +        Saves metadata + Params to: path + "/metadata"
    +        - class
    +        - timestamp
    +        - sparkVersion
    +        - uid
    +        - paramMap
    +        - (optionally, extra metadata)
    +        @param extraMetadata  Extra metadata to be saved at same level as uid, paramMap, etc.
    +        @param paramMap  If given, this is saved in the "paramMap" field.
    +        """
    +        metadataPath = os.path.join(path, "metadata")
    +        metadataJson = DefaultParamsWriter._get_metadata_to_save(instance,
    +                                                                 sc,
    +                                                                 extraMetadata,
    +                                                                 paramMap)
    +        sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
    +
    +    @staticmethod
    +    def _get_metadata_to_save(instance, sc, extraMetadata=None, paramMap=None):
    +        """
    +        Helper for [[saveMetadata()]] which extracts the JSON to save.
    +        This is useful for ensemble models which need to save metadata for many sub-models.
    +
    +        @see [[saveMetadata()]] for details on what this includes.
    +        """
    +        uid = instance.uid
    +        cls = instance.__module__ + '.' + instance.__class__.__name__
    +        params = instance.extractParamMap()
    +        jsonParams = {}
    +        if paramMap is not None:
    +            jsonParams = paramMap
    +        else:
    +            for p in params:
    +                jsonParams[p.name] = params[p]
    +        basicMetadata = {"class": cls, "timestamp": long(round(time.time() * 1000)),
    +                         "sparkVersion": sc.version, "uid": uid, "paramMap": jsonParams}
    +        if extraMetadata is not None:
    +            basicMetadata.update(extraMetadata)
    +        return json.dumps(basicMetadata, separators=[',',  ':'])
    +
    +
    +@inherit_doc
    +class DefaultParamsReadable(MLReadable):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Helper trait for making simple `Params` types readable.  If a `Params` class stores
    +    all data as [[pyspark.ml.param.Param]] values, then extending this trait will provide
    +    a default implementation of reading saved instances of the class.
    +    This only handles simple [[pyspark.ml.param.Param]] types; e.g., it will not handle
    +    [[pyspark.sql.Dataset]].
    +
    +    @see `DefaultParamsWritable`, the counterpart to this trait
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    @classmethod
    +    def read(cls):
    +        """Returns a DefaultParamsReader instance for this class."""
    +        return DefaultParamsReader(cls)
    +
    +
    +@inherit_doc
    +class DefaultParamsReader(MLReader):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Specialization of :py:class:`MLReader` for :py:class:`Params` types
    +
    +    Default `MLReader` implementation for transformers and estimators that
    +    contain basic (json-serializable) params and no data. This will not handle
    +    more complex params or types with data (e.g., models with coefficients).
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self, cls):
    +        super(DefaultParamsReader, self).__init__()
    +        self.cls = cls
    +
    +    @staticmethod
    +    def __get_class(clazz):
    +        """
    +        Loads Python class from its name.
    +        """
    +        parts = clazz.split('.')
    +        module = ".".join(parts[:-1])
    +        m = __import__(module)
    +        for comp in parts[1:]:
    +            m = getattr(m, comp)
    +        return m
    +
    +    def load(self, path):
    +        metadata = DefaultParamsReader.loadMetadata(path, self.sc)
    +        py_type = DefaultParamsReader.__get_class(metadata['class'])
    +        instance = py_type()
    +        instance._resetUid(metadata['uid'])
    +        DefaultParamsReader.getAndSetParams(instance, metadata)
    +        return instance
    +
    +    @staticmethod
    +    def loadMetadata(path, sc, expectedClassName=""):
    +        """
    +        Load metadata saved using [[DefaultParamsWriter.saveMetadata()]]
    +
    +        @param expectedClassName  If non empty, this is checked against the loaded metadata.
    --- End diff --
    
    I missed this before: the Scala-style ```@param``` and ```@see``` do not work in Python. Look in other files like pipeline.py for examples of ```:param```, and maybe use ```.. note::``` to replace ```@see```.
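
    A sketch of this docstring rewritten with those directives:

        @staticmethod
        def loadMetadata(path, sc, expectedClassName=""):
            """
            Load metadata saved using
            :py:meth:`DefaultParamsWriter.saveMetadata`.

            :param path: directory containing the saved metadata
            :param sc: active :py:class:`SparkContext`
            :param expectedClassName: if non-empty, this is checked against
                the class name loaded from the metadata

            .. note:: See :py:meth:`DefaultParamsWriter.saveMetadata` for
                the fields this expects.
            """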




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131264975
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +289,124 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +
    +    # overrides the write() function in MLWritable
    +    # users call .save() in MLWritable, which calls this write() function and then calls
    +    # the .save() in DefaultParamsWriter
    +    # this can be overridden to return a different Writer (ex. OneVsRestWriter as seen in Scala)
    +    def write(self):
    +        # instance of check for params?
    +        return DefaultParamsWriter(self)
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +        self.sc = SparkContext._active_spark_context
    +
    +    # if a model extends DefaultParamsWriteable this save() function is called
    +    def save(self, path):
    +        if self.shouldOverwrite:
    +            # This command removes a file. Is this enough?
    +            os.remove(path)
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    --- End diff --
    
    I'd strongly prefer to mimic the Scala APIs.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131262008
  
    --- Diff: python/pyspark/ml/param/__init__.py ---
    @@ -375,6 +375,18 @@ def copy(self, extra=None):
             that._defaultParamMap = {}
             return self._copyValues(that, extra)
     
    +    def set(self, param, value):
    +        """
    --- End diff --
    
    OK, makes sense, let's leave it.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80269 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80269/testReport)** for PR 18742 at commit [`e8c3041`](https://github.com/apache/spark/commit/e8c3041ce38c707b854a661b751ed3350e8f0371).




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r130996252
  
    --- Diff: python/pyspark/ml/param/__init__.py ---
    @@ -375,6 +375,18 @@ def copy(self, extra=None):
             that._defaultParamMap = {}
             return self._copyValues(that, extra)
     
    +    def set(self, param, value):
    +        """
    --- End diff --
    
    This does look extraneous.  I'm OK with adding it to match the Scala API, but it'd be better to put it in a separate PR and JIRA.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    LGTM pending tests




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131749278
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +333,201 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Helper trait for making simple :py:class`Params` types writable.  If a :py:class`Params`
    +    class stores all data as :py:class:'Param' values, then extending this trait will provide
    --- End diff --
    
    Does a normal tick ' work, or do you have to use backticks `?
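
    For reference, reST only recognizes a role when its target is wrapped in
    backticks; a small contrast (illustrative strings only):

        # Renders as a cross-reference in the generated docs:
        good = "If a :py:class:`Params` class stores all data as ..."

        # Plain ticks are not parsed as a role target and show up verbatim:
        bad = "If a :py:class:'Param' class stores all data as ..."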




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80188/
    Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80189/
    Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131288910
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +353,143 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    Class for making simple Params types writable. Assumes that all parameters
    +    are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def write(self):
    +        """Returns a DefaultParamsWriter instance for this class."""
    +        if isinstance(self, Params):
    +            return DefaultParamsWriter(self)
    +        else:
    +            raise TypeError("Cannot use DefaultParamsWritable with type %s because it does not "
    +                            "extend Params." % type(self))
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +    """
    +    Class for writing Estimators and Transformers whose parameters are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +
    +    def saveImpl(self, path):
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    +    @staticmethod
    +    def save_metadata(instance, path, sc, extraMetadata=None, paramMap=None):
    +        metadataPath = os.path.join(path, "metadata")
    +        metadataJson = DefaultParamsWriter.get_metadata_to_save(instance,
    +                                                                metadataPath,
    +                                                                sc,
    +                                                                extraMetadata,
    +                                                                paramMap)
    +        sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
    +
    +    @staticmethod
    +    def get_metadata_to_save(instance, path, sc, extraMetadata=None, paramMap=None):
    --- End diff --
    
    add leading underscore to make this private




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r130523538
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +289,124 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +
    +    # overrides the write() function in MLWritable
    +    # users call .save() in MLWritable, which calls this write() function and then calls
    +    # the .save() in DefaultParamsWriter
    +    # this can be overridden to return a different Writer (ex. OneVsRestWriter as seen in Scala)
    +    def write(self):
    +        # instance of check for params?
    +        return DefaultParamsWriter(self)
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +        self.sc = SparkContext._active_spark_context
    --- End diff --
    
    Can we move `sc` into the `BaseReadWrite` abstract hierarchy, like the Scala side does?
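
    A sketch of that refactor (this mirrors what a later revision in this
    thread adds):

        from pyspark.sql import SparkSession

        class BaseReadWrite(object):
            def __init__(self):
                self._sparkSession = None

            @property
            def sparkSession(self):
                # Fall back to the default session if none was set explicitly.
                if self._sparkSession is None:
                    self._sparkSession = SparkSession.builder.getOrCreate()
                return self._sparkSession

            @property
            def sc(self):
                # Derive the SparkContext from the session, as the Scala side does.
                return self.sparkSession.sparkContext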




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80226 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80226/testReport)** for PR 18742 at commit [`bb3d998`](https://github.com/apache/spark/commit/bb3d9983bd9ee6cc758950aa84a530184db7b68d).




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by ajaysaini725 <gi...@git.apache.org>.
Github user ajaysaini725 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131028435
  
    --- Diff: python/pyspark/ml/param/__init__.py ---
    @@ -375,6 +375,18 @@ def copy(self, extra=None):
             that._defaultParamMap = {}
             return self._copyValues(that, extra)
     
    +    def set(self, param, value):
    +        """
    --- End diff --
    
    I included this function because the getAndSetParams function in DefaultParamsReader needed a way to set a parameter on an instance given the Param object and its value. There wasn't any helper functionality in the Params class that did this, so I added the set() function from Scala. Is it okay to leave it for this purpose, or is there a better way of handling this case?
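
    For context, the reader's use of set() looks roughly like this (a sketch
    of getAndSetParams):

        @staticmethod
        def getAndSetParams(instance, metadata):
            """
            Extract Params from metadata, and set them in the instance.
            """
            for paramName in metadata['paramMap']:
                param = instance.getParam(paramName)
                paramValue = metadata['paramMap'][paramName]
                instance.set(param, paramValue)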




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80189 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80189/testReport)** for PR 18742 at commit [`b7e29ee`](https://github.com/apache/spark/commit/b7e29eed881075c6cca2e2267a0ba258ec54f754).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class FileSystemOverwrite extends BaseReadWrite with Logging `




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80220/
    Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131284503
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1957,6 +1964,46 @@ def test_chisquaretest(self):
             self.assertTrue(all(field in fieldNames for field in expectedFields))
     
     
    +class DefaultReadWriteTests(SparkSessionTestCase):
    +
    +    def test_default_read_write(self):
    +        temp_path = tempfile.mkdtemp()
    +
    +        lr = LogisticRegression()
    +        lr.setMaxIter(50)
    +        lr.setThreshold(.75)
    +        writer = DefaultParamsWriter(lr)
    +
    +        savePath = temp_path + "/lr"
    +        writer.saveImpl(savePath)
    +
    +        reader = DefaultParamsReadable.read()
    +        lr2 = reader.load(savePath)
    +
    +        self.assertEqual(lr.uid, lr2.uid)
    +        self.assertEqual(lr.extractParamMap(), lr2.extractParamMap())
    +
    +    def test_default_read_write_with_overwrite(self):
    --- End diff --
    
    Since these tests are almost the same, can you combine them to reduce code duplication?
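
    One way to fold the overwrite case into the first test (a sketch using
    this PR's helpers; the nested helper name is illustrative):

        def test_default_read_write(self):
            temp_path = tempfile.mkdtemp()

            lr = LogisticRegression()
            lr.setMaxIter(50)
            lr.setThreshold(.75)

            def check_save_load(writer, path):
                writer.save(path)
                lr2 = DefaultParamsReadable.read().load(path)
                self.assertEqual(lr.uid, lr2.uid)
                self.assertEqual(lr.extractParamMap(), lr2.extractParamMap())

            save_path = temp_path + "/lr"
            check_save_load(DefaultParamsWriter(lr), save_path)
            # Saving to the same path again exercises the overwrite branch.
            check_save_load(DefaultParamsWriter(lr).overwrite(), save_path)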




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131332345
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +341,198 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Helper trait for making simple `Params` types writable.  If a `Params` class stores
    +    all data as [[pyspark.ml.param.Param]] values, then extending this trait will provide
    +    a default implementation of writing saved instances of the class.
    +    This only handles simple [[pyspark.ml.param.Param]] types; e.g., it will not handle
    +    [[pyspark.sql.Dataset]].
    +
    +    @see `DefaultParamsReadable`, the counterpart to this trait
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def write(self):
    +        """Returns a DefaultParamsWriter instance for this class."""
    +        if isinstance(self, Params):
    +            return DefaultParamsWriter(self)
    +        else:
    +            raise TypeError("Cannot use DefaultParamsWritable with type %s because it does not "
    +                            "extend Params." % type(self))
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +    """
    +    .. note:: DeveloperApi
    +
    +    Class for writing Estimators and Transformers whose parameters are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +
    +    def saveImpl(self, path):
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    +    @staticmethod
    +    def save_metadata(instance, path, sc, extraMetadata=None, paramMap=None):
    --- End diff --
    
    If this is going to be public, let's use camelCase to match style




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131331807
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -156,28 +218,23 @@ def write(self):
     
     
     @inherit_doc
    -class MLReader(object):
    +class MLReader(BaseReadWrite):
         """
         Utility class that can load ML instances.
     
         .. versionadded:: 2.0.0
         """
     
    +    def __init__(self):
    +        super(MLReader, self).__init__()
    +
         def load(self, path):
             """Load the ML instance from the input path."""
             raise NotImplementedError("MLReader is not yet implemented for type: %s" % type(self))
     
    -    def context(self, sqlContext):
    -        """
    -        Sets the SQL context to use for loading.
    -
    -        .. note:: Deprecated in 2.1 and will be removed in 3.0, use session instead.
    -        """
    -        raise NotImplementedError("MLReader is not yet implemented for type: %s" % type(self))
    -
         def session(self, sparkSession):
    --- End diff --
    
    You can remove this instance of session since it is inherited.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131771002
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1158,6 +1165,33 @@ def test_decisiontree_regressor(self):
             except OSError:
                 pass
     
    +    def test_default_read_write(self):
    +        temp_path = tempfile.mkdtemp()
    +
    +        lr = LogisticRegression()
    +        lr.setMaxIter(50)
    +        lr.setThreshold(.75)
    +        writer = DefaultParamsWriter(lr)
    +
    +        savePath = temp_path + "/lr"
    +        writer.saveImpl(savePath)
    --- End diff --
    
    Why call `saveImpl` directly here?
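
    For comparison, the public entry point is save(), which wraps saveImpl()
    with the overwrite handling; the test could go through it instead
    (a sketch):

        writer = DefaultParamsWriter(lr)
        savePath = temp_path + "/lr"
        writer.save(savePath)  # runs any overwrite handling, then saveImpl()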




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131288629
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -237,6 +300,13 @@ def _load_java_obj(cls, clazz):
                 java_obj = getattr(java_obj, name)
             return java_obj
     
    +    @classmethod
    +    def _load_given_name(cls, java_class):
    --- End diff --
    
    Please also stick with camelCase for methods since that's what pyspark does.  (here and elsewhere)




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131260736
  
    --- Diff: python/pyspark/ml/param/__init__.py ---
    @@ -375,6 +375,18 @@ def copy(self, extra=None):
             that._defaultParamMap = {}
             return self._copyValues(that, extra)
     
    +    def set(self, param, value):
    +        """
    +            Sets the value for a parameter in the parameter map.
    --- End diff --
    
    style: reduce the indent by 4 spaces




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80188 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80188/testReport)** for PR 18742 at commit [`6fb82be`](https://github.com/apache/spark/commit/6fb82be6786793c32f4da06f76b09f61dcab8e68).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80234/
    Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80269 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80269/testReport)** for PR 18742 at commit [`e8c3041`](https://github.com/apache/spark/commit/e8c3041ce38c707b854a661b751ed3350e8f0371).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131001072
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,32 +66,82 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
    +    """
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self):
    +        self.sparkSession = None
    +
    +    def context(self, sqlContext):
    +        """
    +        Sets the SQL context to use for saving.
    +
    +        .. note:: Deprecated in 2.1 and will be removed in 3.0, use session instead.
    +        """
    +        # raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +        self.sparkSession = sqlContext.sparkSession
    +        return self
    +
    +    def session(self, sparkSession):
    +        """Sets the Spark Session to use for saving."""
    +        # raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +        self.sparkSession = sparkSession
    +        return self
    +
    +    def getOrCreateSparkSession(self):
    --- End diff --
    
    Rename to sparkSession to match the Scala API.
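
    For concreteness, a minimal sketch of the renamed accessor as a property (names taken from the diff above; this is also the shape the later revision in this thread converges on):
    ```
    @property
    def sparkSession(self):
        """
        Returns the user-specified SparkSession or the default.
        """
        if self._sparkSession is None:
            self._sparkSession = SparkSession.builder.getOrCreate()
        return self._sparkSession
    ```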




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131286314
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,20 +66,74 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
    +    """
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self):
    +        self._sparkSession = None
    +
    +    def context(self, sqlContext):
    --- End diff --
    
    Leaving this here is OK if you remove it from the subclasses, for the sake of code simplification.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80146/
    Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by ajaysaini725 <gi...@git.apache.org>.
Github user ajaysaini725 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131017418
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,32 +66,82 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
    +    """
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self):
    +        self.sparkSession = None
    +
    +    def context(self, sqlContext):
    --- End diff --
    
    Yeah, it isn't used; I just included the function definition because JavaMLWriter, which inherits from MLWriter, which in turn inherits from BaseReadWrite, uses it. I changed the function body to raise a NotImplementedError, though.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80361 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80361/testReport)** for PR 18742 at commit [`d62154c`](https://github.com/apache/spark/commit/d62154c2657e02d23ffa6ca78e248a4bccd5f1bf).




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131780734
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1158,6 +1165,33 @@ def test_decisiontree_regressor(self):
             except OSError:
                 pass
     
    +    def test_default_read_write(self):
    +        temp_path = tempfile.mkdtemp()
    +
    +        lr = LogisticRegression()
    +        lr.setMaxIter(50)
    +        lr.setThreshold(.75)
    +        writer = DefaultParamsWriter(lr)
    +
    +        savePath = temp_path + "/lr"
    +        writer.saveImpl(savePath)
    --- End diff --
    
    Good point; +1 for calling save
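
    That is, the test would exercise the public entry point, roughly (a sketch reusing the names from the test above):
    ```
    writer = DefaultParamsWriter(lr)
    savePath = temp_path + "/lr"
    writer.save(savePath)  # save() handles overwriting, then delegates to saveImpl()
    ```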




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131265462
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
    @@ -471,3 +471,26 @@ private[ml] object MetaAlgorithmReadWrite {
         List((instance.uid, instance)) ++ subStageMaps
       }
     }
    +
    +class FileSystemOverwrite extends BaseReadWrite with Logging {
    --- End diff --
    
    Also, I'd make this a static object, eliminate the BaseReadWrite dependency, and just take ```sc``` as an argument.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80230/
    Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131001862
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,32 +66,82 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
    +    """
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self):
    +        self.sparkSession = None
    +
    +    def context(self, sqlContext):
    +        """
    +        Sets the SQL context to use for saving.
    +
    +        .. note:: Deprecated in 2.1 and will be removed in 3.0, use session instead.
    +        """
    +        # raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +        self.sparkSession = sqlContext.sparkSession
    +        return self
    +
    +    def session(self, sparkSession):
    +        """Sets the Spark Session to use for saving."""
    +        # raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +        self.sparkSession = sparkSession
    +        return self
    +
    +    def getOrCreateSparkSession(self):
    +        if self.sparkSession is None:
    +            self.sparkSession = SparkSession.builder.getOrCreate()
    +        return self.sparkSession
    +
    +    @property
    +    def sc(self):
    +        return self.getOrCreateSparkSession().sparkContext
    +
    +
    +@inherit_doc
    +class MLWriter(BaseReadWrite):
         """
         Utility class that can save ML instances.
     
         .. versionadded:: 2.0.0
         """
     
    +    def __init__(self):
    +        super(MLWriter, self).__init__()
    +        self.shouldOverwrite = False
    +
         def save(self, path):
             """Save the ML instance to the input path."""
    +        if isinstance(self, JavaMLWriter):
    +            self.saveImpl(path)
    +        else:
    +            if self.shouldOverwrite:
    +                os.remove(path)
    +            self.saveImpl(path)
    +
    +    def saveImpl(self, path):
    --- End diff --
    
    Add a leading underscore to indicate private/protected.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r130521964
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +289,124 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +
    +    # overrides the write() function in MLWriteable
    +    # users call .save() in MLWriteable which calls this write() function and then calls
    +    # the .save() in DefaultParamsWriter
    +    # this can be overridden to return a different Writer (ex. OneVsRestWriter as seen in Scala)
    +    def write(self):
    +        # instance of check for params?
    +        return DefaultParamsWriter(self)
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +        self.sc = SparkContext._active_spark_context
    +
    +    # if a model extends DefaultParamsWriteable this save() function is called
    +    def save(self, path):
    +        if self.shouldOverwrite:
    +            # This command removes a file. Is this enough?
    +            os.remove(path)
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    --- End diff --
    
    We need to add a `saveImpl` method here and let the user override `saveImpl`; the `save` function should then call `saveImpl` (the user should not override the `save` method),
    like:
    ```
    def save(self, path):
        """
        save metadata first.
        """
        ....
        saveImpl(path)
    
    def saveImpl(self, path):
        raise NotImplementedError()
    ```
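
    With that split, a concrete writer would then only override `saveImpl`, along these lines (a sketch; `MyWriter` is made up for illustration):
    ```
    class MyWriter(DefaultParamsWriter):
        def saveImpl(self, path):
            # the actual persistence work; save() has already handled overwriting
            DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    ```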





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131000608
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,32 +66,82 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
    +    """
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self):
    +        self.sparkSession = None
    --- End diff --
    
    Add a leading underscore (`_sparkSession`) to be Pythonic.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80262 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80262/testReport)** for PR 18742 at commit [`3c2554c`](https://github.com/apache/spark/commit/3c2554cb989ab84b6f70af6679cec0c41bcb3460).




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131265094
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
    @@ -471,3 +471,26 @@ private[ml] object MetaAlgorithmReadWrite {
         List((instance.uid, instance)) ++ subStageMaps
       }
     }
    +
    +class FileSystemOverwrite extends BaseReadWrite with Logging {
    --- End diff --
    
    Make this package private: ```private[util]```




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80234/testReport)** for PR 18742 at commit [`8a3c6d6`](https://github.com/apache/spark/commit/8a3c6d6fdceeb9f11454a96d9302bc17ca1cf1e9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r130745275
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +289,124 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +
    +    # overrides the write() function in MLWriteable
    +    # users call .save() in MLWriteable which calls this write() function and then calls
    +    # the .save() in DefaultParamsWriter
    +    # this can be overridden to return a different Writer (ex. OneVsRestWriter as seen in Scala)
    +    def write(self):
    +        # instance of check for params?
    +        return DefaultParamsWriter(self)
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +        self.sc = SparkContext._active_spark_context
    +
    +    # if a model extends DefaultParamsWriteable this save() function is called
    +    def save(self, path):
    +        if self.shouldOverwrite:
    +            # This command removes a file. Is this enough?
    +            os.remove(path)
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    --- End diff --
    
    Good question. In my opinion, the best way is:
    when the Python side uses `JavaMLWriter`, just directly override `save`;
    but when the Python side is a custom `MLWriter` implemented in Python, tell the user to inherit from `DefaultParamsWriter` and override `saveImpl`. It will make the interface easier to use.
    cc @jkbradley what do you think about this?




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80262 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80262/testReport)** for PR 18742 at commit [`3c2554c`](https://github.com/apache/spark/commit/3c2554cb989ab84b6f70af6679cec0c41bcb3460).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r130520335
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +289,124 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +
    +    # overrides the write() function in MLWriteable
    +    # users call .save() in MLWriteable which calls this write() function and then calls
    +    # the .save() in DefaultParamsWriter
    +    # this can be overridden to return a different Writer (ex. OneVsRestWriter as seen in Scala)
    +    def write(self):
    +        # instance of check for params?
    +        return DefaultParamsWriter(self)
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +        self.sc = SparkContext._active_spark_context
    +
    +    # if a model extends DefaultParamsWriteable this save() function is called
    +    def save(self, path):
    +        if self.shouldOverwrite:
    +            # This command removes a file. Is this enough?
    +            os.remove(path)
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    +    def overwrite(self):
    +        self.shouldOverwrite = True
    +        return self
    +
    +    @staticmethod
    +    def save_metadata(instance, path, sc, extraMetadata=None, paramMap=None):
    +        metadataPath = os.path.join(path, "metadata")
    +        metadataJson = DefaultParamsWriter.get_metadata_to_save(instance,
    +                                                                metadataPath,
    +                                                                sc,
    +                                                                extraMetadata,
    +                                                                paramMap)
    +        sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
    +
    +    @staticmethod
    +    def get_metadata_to_save(instance, path, sc, extraMetadata=None, paramMap=None):
    +        uid = instance.uid
    +        cls = instance.__module__ + '.' + instance.__class__.__name__
    +        params = instance.extractParamMap()
    +        jsonParams = {}
    +        if paramMap is not None:
    +            for p in paramMap:
    +                jsonParams[p.name] = paramMap[p]
    +        else:
    +            for p in params:
    +                jsonParams[p.name] = params[p]
    +        basicMetadata = {"class": cls, "timestamp": int(round(time.time() * 1000)),
    +                         "sparkVersion": sc.version, "uid": uid, "paramMap": jsonParams}
    +        if extraMetadata is not None:
    +            basicMetadata.update(extraMetadata)
    +        return json.dumps(basicMetadata)
    --- End diff --
    
    I suggest using
    `json.dumps(basicMetadata, separators=(',',':'))`
    so that the output string is compacted, with no whitespace around separators.
    (The Scala side also does this compaction.)
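
    For example (using a single-key dict so the output is deterministic):
    ```
    >>> import json
    >>> json.dumps({"maxIter": 50})
    '{"maxIter": 50}'
    >>> json.dumps({"maxIter": 50}, separators=(',', ':'))
    '{"maxIter":50}'
    ```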




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merging with master
    Thanks @ajaysaini725 and @WeichenXu123 !




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131287269
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -86,7 +145,7 @@ def context(self, sqlContext):
     
         def session(self, sparkSession):
    --- End diff --
    
    You can just remove this now, right?  It will be inherited.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131331793
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,32 +66,89 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
    +    """
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self):
    +        self._sparkSession = None
    +
    +    def context(self, sqlContext):
    +        """
    +        Sets the Spark SQLContext to use for saving/loading.
    +
    +        .. note:: Deprecated in 2.1 and will be removed in 3.0, use session instead.
    +        """
    +        raise NotImplementedError("Read/Write is not yet implemented for type: %s" % type(self))
    +
    +    def session(self, sparkSession):
    +        """
    +        Sets the Spark Session to use for saving/loading.
    +        """
    +        self._sparkSession = sparkSession
    +        return self
    +
    +    @property
    +    def sparkSession(self):
    +        """
    +        Returns the user-specified Spark Session or the default.
    +        """
    +        if self._sparkSession is None:
    +            self._sparkSession = SparkSession.builder.getOrCreate()
    +        return self._sparkSession
    +
    +    @property
    +    def sc(self):
    +        """
    +        Returns the underlying `SparkContext`.
    +        """
    +        return self.sparkSession.sparkContext
    +
    +
    +@inherit_doc
    +class MLWriter(BaseReadWrite):
         """
         Utility class that can save ML instances.
     
         .. versionadded:: 2.0.0
         """
     
    +    def __init__(self):
    +        super(MLWriter, self).__init__()
    +        self.shouldOverwrite = False
    +
    +    def _handleOverwrite(self, path):
    +        from pyspark.ml.wrapper import JavaWrapper
    +
    +        _java_obj = JavaWrapper._new_java_obj("org.apache.spark.ml.util.FileSystemOverwrite")
    +        wrapper = JavaWrapper(_java_obj)
    +        wrapper._call_java("handleOverwrite", path, True, self.sc._jsc.sc())
    +
         def save(self, path):
             """Save the ML instance to the input path."""
    -        raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    -
    -    def overwrite(self):
    -        """Overwrites if the output path already exists."""
    -        raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +        if self.shouldOverwrite:
    +            self._handleOverwrite(path)
    +        self.saveImpl(path)
     
    -    def context(self, sqlContext):
    +    def saveImpl(self, path):
             """
    -        Sets the SQL context to use for saving.
    -
    -        .. note:: Deprecated in 2.1 and will be removed in 3.0, use session instead.
    +        save() handles overwriting and then calls this method.  Subclasses should override this
    +        method to implement the actual saving of the instance.
             """
             raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
     
    +    def overwrite(self):
    +        """Overwrites if the output path already exists."""
    +        self.shouldOverwrite = True
    +        return self
    +
         def session(self, sparkSession):
    --- End diff --
    
    You can remove this instance of session since it is inherited.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80226 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80226/testReport)** for PR 18742 at commit [`bb3d998`](https://github.com/apache/spark/commit/bb3d9983bd9ee6cc758950aa84a530184db7b68d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80366/testReport)** for PR 18742 at commit [`2f216b7`](https://github.com/apache/spark/commit/2f216b748eb19df7f98c703c34667039edeee29e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131287028
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -61,20 +66,74 @@ def _randomUID(cls):
     
     
     @inherit_doc
    -class MLWriter(object):
    +class BaseReadWrite(object):
    +    """
    +    Base class for MLWriter and MLReader. Stores information about the SparkContext
    +    and SparkSession.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self):
    +        self._sparkSession = None
    +
    +    def context(self, sqlContext):
    +        """
    +        Sets the Spark SQLContext to use for saving/loading.
    +
    +        .. note:: Deprecated in 2.1 and will be removed in 3.0, use session instead.
    +        """
    +        raise NotImplementedError("MLWriter is not yet implemented for type: %s" % type(self))
    +
    +    def session(self, sparkSession):
    +        """
    +        Sets the Spark Session to use for saving/loading.
    +        """
    +        self._sparkSession = sparkSession
    +        return self
    +
    +    def sparkSession(self):
    +        if self._sparkSession is None:
    +            self._sparkSession = SparkSession.builder.getOrCreate()
    +        return self._sparkSession
    +
    +    @property
    +    def sc(self):
    +        return self.sparkSession().sparkContext
    +
    +
    +@inherit_doc
    +class MLWriter(BaseReadWrite):
         """
         Utility class that can save ML instances.
     
         .. versionadded:: 2.0.0
         """
     
    +    def __init__(self):
    +        super(MLWriter, self).__init__()
    +        self.shouldOverwrite = False
    +
    +    def _handleOverwrite(self, path):
    +        from pyspark.ml.wrapper import JavaWrapper
    +
    +        _java_obj = JavaWrapper._new_java_obj("org.apache.spark.ml.util.FileSystemOverwrite")
    +        wrapper = JavaWrapper(_java_obj)
    +        wrapper._call_java("handleOverwrite", path, True, self.sc._jsc.sc())
    +
         def save(self, path):
             """Save the ML instance to the input path."""
    +        if self.shouldOverwrite:
    +            self._handleOverwrite(path)
    +        self.saveImpl(path)
    +
    +    def saveImpl(self, path):
    --- End diff --
    
    Add doc from Scala




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18742: Python persistence helper functions

Posted by ajaysaini725 <gi...@git.apache.org>.
Github user ajaysaini725 commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    @jkbradley @thunterdb @MrBago Could you please review this PR?




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by ajaysaini725 <gi...@git.apache.org>.
Github user ajaysaini725 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r130741050
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +289,124 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +
    +    # overrides the write() function in MLWriteable
    +    # users call .save() in MLWriteable which calls this write() function and then calls
    +    # the .save() in DefaultParamsWriter
    +    # this can be overridden to return a different Writer (ex. OneVsRestWriter as seen in Scala)
    +    def write(self):
    +        # instance of check for params?
    +        return DefaultParamsWriter(self)
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +        self.sc = SparkContext._active_spark_context
    +
    +    # if a model extends DefaultParamsWriteable this save() function is called
    +    def save(self, path):
    +        if self.shouldOverwrite:
    +            # This command removes a file. Is this enough?
    +            os.remove(path)
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    --- End diff --
    
    This implementation exists on the Scala side. However, in Python, MLWriter does not even define the saveImpl() function and instead says that the user should override the save() function. I'm not entirely sure which is best to do in Python, but as of now everything in pyspark overrides save() instead of saveImpl(), so it is probably best to just stick with that pattern.
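
    For reference, the existing pattern described above is roughly this (a condensed sketch of pyspark's JavaMLWriter, not the full class):
    ```
    class JavaMLWriter(MLWriter):
        def __init__(self, instance):
            super(JavaMLWriter, self).__init__()
            self._jwrite = instance._to_java().write()

        def save(self, path):
            # overrides save() directly and delegates to the JVM-side writer
            self._jwrite.save(path)
    ```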




[GitHub] spark issue #18742: [Spark-21542][Python]Python persistence helper functions

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80234/testReport)** for PR 18742 at commit [`8a3c6d6`](https://github.com/apache/spark/commit/8a3c6d6fdceeb9f11454a96d9302bc17ca1cf1e9).




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80364 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80364/testReport)** for PR 18742 at commit [`860132e`](https://github.com/apache/spark/commit/860132e1eb9df9388d47ae838db805b58630b2d1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80361/
    Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80230 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80230/testReport)** for PR 18742 at commit [`cd22c59`](https://github.com/apache/spark/commit/cd22c59ed6e9b11ca9690f11a20febda02e9c896).




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80367/
    Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80230 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80230/testReport)** for PR 18742 at commit [`cd22c59`](https://github.com/apache/spark/commit/cd22c59ed6e9b11ca9690f11a20febda02e9c896).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80269/
    Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80262/
    Test PASSed.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80146 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80146/testReport)** for PR 18742 at commit [`470dd7c`](https://github.com/apache/spark/commit/470dd7ccdb9ea5185494b21cb8886e3597ad505e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131288360
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +353,143 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +    """
    +    Class for making simple Params types writable. Assumes that all parameters
    +    are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def write(self):
    +        """Returns a DefaultParamsWriter instance for this class."""
    +        if isinstance(self, Params):
    +            return DefaultParamsWriter(self)
    +        else:
    +            raise TypeError("Cannot use DefautParamsWritable with type %s because it does not " +
    +                            " extend Params.", type(self))
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +    """
    +    Class for writing Estimators and Transformers whose parameters are JSON-serializable.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +
    +    def saveImpl(self, path):
    +        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    +
    +    @staticmethod
    +    def save_metadata(instance, path, sc, extraMetadata=None, paramMap=None):
    +        metadataPath = os.path.join(path, "metadata")
    +        metadataJson = DefaultParamsWriter.get_metadata_to_save(instance,
    +                                                                metadataPath,
    +                                                                sc,
    +                                                                extraMetadata,
    +                                                                paramMap)
    +        sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
    +
    +    @staticmethod
    +    def get_metadata_to_save(instance, path, sc, extraMetadata=None, paramMap=None):
    +        uid = instance.uid
    +        cls = instance.__module__ + '.' + instance.__class__.__name__
    +        params = instance.extractParamMap()
    +        jsonParams = {}
    +        if paramMap is not None:
    +            for p in paramMap:
    +                jsonParams[p.name] = paramMap[p]
    +        else:
    +            for p in params:
    +                jsonParams[p.name] = params[p]
    +        basicMetadata = {"class": cls, "timestamp": long(round(time.time() * 1000)),
    +                         "sparkVersion": sc.version, "uid": uid, "paramMap": jsonParams}
    +        if extraMetadata is not None:
    +            basicMetadata.update(extraMetadata)
    +        return json.dumps(basicMetadata, separators=[',',  ':'])
    +
    +
    +@inherit_doc
    +class DefaultParamsReadable(MLReadable):
    +    """
    +    Class for making simple Params types readable. Assumes that all parameters
    --- End diff --
    
    Could you please merge the Scala doc into this to make this more detailed?
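
    For readers skimming the archive, a minimal sketch of the metadata JSON that
    `get_metadata_to_save` in the diff above would emit; the class name, uid,
    timestamp, and param values here are made-up illustrations, not output from
    the actual patch:

        # Illustrative layout of the persisted metadata (all values hypothetical).
        import json

        metadata = {
            "class": "pyspark.ml.classification.LogisticRegression",
            "timestamp": 1501545600000,   # long(round(time.time() * 1000))
            "sparkVersion": "2.3.0",
            "uid": "LogisticRegression_4d2f9a8b1c3e",
            "paramMap": {"maxIter": 100, "regParam": 0.0},
        }
        # Same compact separators as get_metadata_to_save in the diff.
        print(json.dumps(metadata, separators=[',', ':']))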




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18742




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    **[Test build #80146 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80146/testReport)** for PR 18742 at commit [`470dd7c`](https://github.com/apache/spark/commit/470dd7ccdb9ea5185494b21cb8886e3597ad505e).




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r131261798
  
    --- Diff: python/pyspark/ml/param/__init__.py ---
    @@ -375,6 +375,18 @@ def copy(self, extra=None):
             that._defaultParamMap = {}
             return self._copyValues(that, extra)
     
    +    def set(self, param, value):
    +        """
    +            Sets the value for a parameter in the parameter map.
    --- End diff --
    
    Also, just copy the doc from Scala.
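
    For context, a short usage sketch of the `set` method being added in this
    diff, assuming it writes the value straight into the instance's param map
    the way Scala's `Params.set` does; the estimator and values are
    illustrative only:

        # Hypothetical usage of the new Params.set (values are illustrative).
        from pyspark.ml.classification import LogisticRegression

        lr = LogisticRegression()
        lr.set(lr.maxIter, 50)                  # store the value in the param map
        assert lr.getOrDefault(lr.maxIter) == 50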




[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18742
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18742#discussion_r130522066
  
    --- Diff: python/pyspark/ml/util.py ---
    @@ -283,3 +289,124 @@ def numFeatures(self):
             Returns the number of features the model was trained on. If unknown, returns -1
             """
             return self._call_java("numFeatures")
    +
    +
    +@inherit_doc
    +class DefaultParamsWritable(MLWritable):
    +
    +    # overrides the write() function in MLWriteable
    +    # users call .save() in MLWriteable which calls this write() function and then calls
    +    # the .save() in DefaultParamsWriter
    +    # this can be overridden to return a different Writer (ex. OneVsRestWriter as seen in Scala)
    +    def write(self):
    +        # instance of check for params?
    +        return DefaultParamsWriter(self)
    +
    +
    +@inherit_doc
    +class DefaultParamsWriter(MLWriter):
    +
    +    def __init__(self, instance):
    +        super(DefaultParamsWriter, self).__init__()
    +        self.instance = instance
    +        self.sc = SparkContext._active_spark_context
    +
    +    # if a model extends DefaultParamsWriteable this save() function is called
    +    def save(self, path):
    +        if self.shouldOverwrite:
    +            # This command removes a file. Is this enough?
    +            os.remove(path)
    --- End diff --
    
    We use `saveAsTextFile` to save the metadata, so the path may be on `HDFS` or another non-local file system; `os.remove` cannot remove it in that case. We first need to get the file system that owns the `path` and call the corresponding API.
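
    A rough sketch of the file-system-aware delete being suggested, going
    through the JVM Hadoop API via py4j instead of `os.remove`; this assumes an
    active SparkContext and illustrates the idea, not the code that eventually
    landed:

        # Resolve the file system owning `path` (local, HDFS, ...) and delete it.
        def _remove_path(sc, path):
            hadoop_conf = sc._jsc.hadoopConfiguration()
            hadoop_path = sc._jvm.org.apache.hadoop.fs.Path(path)
            fs = hadoop_path.getFileSystem(hadoop_conf)
            if fs.exists(hadoop_path):
                fs.delete(hadoop_path, True)  # True = recursive delete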

