Posted to reviews@spark.apache.org by yinxusen <gi...@git.apache.org> on 2016/04/02 04:46:48 UTC

[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

GitHub user yinxusen opened a pull request:

    https://github.com/apache/spark/pull/12124

    [SPARK-7861][ML] PySpark OneVsRest

    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-7861
    
    Add OneVsRest to PySpark. It is implemented in Python since it's a meta-pipeline.
    
    ## How was this patch tested?
    
    Tested with doctests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yinxusen/spark SPARK-14306-7861

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12124
    
----
commit 84f292bc04eac58dd624a8fd7fce54c18f20cd15
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-03-31T23:52:20Z

    initial add for OneVsRest

commit a296a86a9d600c347774403a97be26f2cc370820
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-04-01T07:09:52Z

    ser/de error

commit 417d13f34dcf323559e3885960d6633d98da75c0
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-04-02T01:26:54Z

    fix error caused by treating nparray as list

commit 6d30d772ca619bf2a331f361764a036ab4f4603c
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-04-02T01:36:34Z

    add copy and more tests

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210204467
  
    **[Test build #55868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55868/consoleFull)** for PR 12124 at commit [`4e95ecb`](https://github.com/apache/spark/commit/4e95ecb05b08a96d37fd3fbf6212b2f743a79af4).



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r58813906
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1134,6 +1139,216 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +        """
    +        self._paramMap[self.classifier] = value
    +        return self
    +
    +    @since("2.0.0")
    +    def getClassifier(self):
    +        """
    +        Gets the value of classifier or its default value.
    +        """
    +        return self.getOrDefault(self.classifier)
    +
    +    def _fit(self, dataset):
    +        labelCol = self.getLabelCol()
    +        featuresCol = self.getFeaturesCol()
    +        predictionCol = self.getPredictionCol()
    +        classifier = self.getClassifier()
    +
    +        numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
    +
    +        multiclassLabeled = dataset.select(labelCol, featuresCol)
    +
    +        # persist if underlying dataset is not persistent.
    +        handlePersistence = \
    +            dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False)
    +        if handlePersistence:
    +            multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
    +
    +        def trainSingleClass(index):
    +            binaryLabelCol = "mc2b$" + str(index)
    +            trainingDataset = multiclassLabeled.withColumn(
    +                binaryLabelCol,
    +                when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
    --- End diff --
    
    Yeah, that's absolutely a problem since PySpark cannot handle metadata for now. I'll document it.
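
As a rough illustration of the label handling in the quoted `_fit`, the following pure-Python sketch mirrors two steps from the diff: inferring the number of classes from the largest label value, and building one binary label column per class. The function names `infer_num_classes` and `binarize` are hypothetical and not part of the patch, and the sketch assumes labels are already indexed as 0.0 through k-1.

```python
# Pure-Python sketch of the label handling in the quoted _fit; no Spark needed.
# infer_num_classes and binarize are hypothetical names, not part of the patch.


def infer_num_classes(labels):
    # Mirrors int(max(label)) + 1 from the diff.
    # Assumes labels are already indexed as 0.0 .. k-1.
    return int(max(labels)) + 1


def binarize(labels, index):
    # Mirrors when(label == float(index), 1.0).otherwise(0.0) from the diff.
    return [1.0 if label == float(index) else 0.0 for label in labels]


labels = [0.0, 1.0, 2.0, 1.0]
num_classes = infer_num_classes(labels)

# One binary label column per class, named like the "mc2b$<index>" columns.
binary_label_sets = [binarize(labels, i) for i in range(num_classes)]
for i, column in enumerate(binary_label_sets):
    print("mc2b$" + str(i), column)
```

One consequence of inferring the count from the data rather than from column metadata: if the highest class never appears in the training set, the count comes out too small (for example, `infer_num_classes([0.0, 1.0])` gives 2 even if the problem has three classes).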



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-205519992
  
    Using a parallel for loop sounds good to me.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r59786340
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1145,6 +1149,213 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +
    +        .. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are
    --- End diff --
    
    Actually MultilayerPerceptronClassifier is not supported since it does not have a rawPredictionCol.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210206262
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206692927
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55181/
    Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-204635190
  
    **[Test build #54752 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54752/consoleFull)** for PR 12124 at commit [`6d30d77`](https://github.com/apache/spark/commit/6d30d772ca619bf2a331f361764a036ab4f4603c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206691767
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55178/
    Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206690378
  
    **[Test build #55178 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55178/consoleFull)** for PR 12124 at commit [`cf4df64`](https://github.com/apache/spark/commit/cf4df64d90cc00ac8a3a137088f8dab8c6650116).



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r59786344
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1145,6 +1149,213 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +
    +        .. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are
    +                  supported now.
    +        """
    +        self._paramMap[self.classifier] = value
    +        return self
    +
    +    @since("2.0.0")
    +    def getClassifier(self):
    +        """
    +        Gets the value of classifier or its default value.
    +        """
    +        return self.getOrDefault(self.classifier)
    +
    +    def _fit(self, dataset):
    +        labelCol = self.getLabelCol()
    +        featuresCol = self.getFeaturesCol()
    +        predictionCol = self.getPredictionCol()
    +        classifier = self.getClassifier()
    --- End diff --
    
    Could you ensure this is a valid classifier here?  You should be able to assert that it has a rawPredictionCol.
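
A minimal sketch of what such a check could look like, written with stand-in classes so it runs without Spark. Everything here is hypothetical: the mixin below only imitates the idea of a shared "has a rawPredictionCol" trait, and `check_classifier`, `FakeLogisticRegression`, and `FakePerceptron` are illustrative names, not code from the patch.

```python
# Stand-in classes only; nothing here imports or calls PySpark.


class HasRawPredictionCol(object):
    """Hypothetical mixin marking classifiers that output raw prediction scores."""
    rawPredictionCol = "rawPrediction"


class FakeLogisticRegression(HasRawPredictionCol):
    """Stand-in for a classifier that exposes rawPredictionCol."""


class FakePerceptron(object):
    """Stand-in for a classifier without rawPredictionCol."""


def check_classifier(classifier):
    # Fail early, before any training starts, if the base classifier cannot
    # produce the raw scores that one-vs-rest compares across the k models.
    if not isinstance(classifier, HasRawPredictionCol):
        raise TypeError(
            "Classifier %s does not have a rawPredictionCol."
            % type(classifier).__name__)
    return classifier


check_classifier(FakeLogisticRegression())  # passes silently

try:
    check_classifier(FakePerceptron())
except TypeError as err:
    print(err)
```

Checking at the top of `_fit` means an unsupported classifier is rejected with a clear message instead of failing later inside one of the k training runs.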



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-213635964
  
    @jkbradley Do you still have plans to solve the metadata problem for tree methods? I find that [SPARK-7126](https://issues.apache.org/jira/browse/SPARK-7126) aims to solve the problem via auto-index for DataFrame. 



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210141456
  
    Thanks, I am updating them now.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-209760561
  
    **[Test build #55795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55795/consoleFull)** for PR 12124 at commit [`2fb4e3d`](https://github.com/apache/spark/commit/2fb4e3d27197dbd60f10770d55e8698638673886).



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r58794228
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1134,6 +1139,216 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +        """
    +        self._paramMap[self.classifier] = value
    +        return self
    +
    +    @since("2.0.0")
    +    def getClassifier(self):
    +        """
    +        Gets the value of classifier or its default value.
    +        """
    +        return self.getOrDefault(self.classifier)
    +
    +    def _fit(self, dataset):
    +        labelCol = self.getLabelCol()
    +        featuresCol = self.getFeaturesCol()
    +        predictionCol = self.getPredictionCol()
    +        classifier = self.getClassifier()
    +
    +        numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
    +
    +        multiclassLabeled = dataset.select(labelCol, featuresCol)
    +
    +        # persist if underlying dataset is not persistent.
    +        handlePersistence = \
    +            dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False)
    +        if handlePersistence:
    +            multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
    +
    +        def trainSingleClass(index):
    +            binaryLabelCol = "mc2b$" + str(index)
    +            trainingDataset = multiclassLabeled.withColumn(
    +                binaryLabelCol,
    +                when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
    +            paramMap = dict([(classifier.labelCol, binaryLabelCol),
    +                            (classifier.featuresCol, featuresCol),
    +                            (classifier.predictionCol, predictionCol)])
    +            duplicatedClassifier = classifier.__class__()
    +            duplicatedClassifier._resetUid(classifier.uid)
    +            classifier._copyValues(duplicatedClassifier)
    +            return duplicatedClassifier.fit(trainingDataset, paramMap)
    --- End diff --
    
    @jkbradley I've added multi-thread support for OneVsRest. One thing to watch out for: `copy()` in spark.ml creates a new instance (i.e. a deep copy), while `copy()` in pyspark.ml is a shallow copy. The shallow copy causes a multi-threading issue in the `fit` method, because `fit` copies the `paramMap` into the current classifier.
    
    I've added the duplication here to work around that, but we could also change the `copy` method of pyspark.ml to do a deep copy.
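
The shared-state concern can be sketched without Spark. In the toy example below, `ToyClassifier` is a hypothetical stand-in for an estimator whose `fit` writes per-call parameters onto the instance, and each thread trains on its own duplicate. Note the substitutions: the diff duplicates the classifier with `classifier.__class__()` plus `_copyValues`, whereas this sketch uses the standard library's `copy.deepcopy`, and the use of `multiprocessing.pool.ThreadPool` is an assumption about how the threads might be run.

```python
import copy
from multiprocessing.pool import ThreadPool


class ToyClassifier(object):
    """Hypothetical stand-in for an estimator whose fit() mutates the instance."""

    def __init__(self):
        self.params = {"labelCol": "label"}

    def fit(self, label_col):
        # Imitates a fit(dataset, paramMap) that copies the override onto self.
        self.params["labelCol"] = label_col
        return self.params["labelCol"]


def train_single_class(args):
    classifier, index = args
    # Each task works on its own duplicate, so no task mutates shared state.
    duplicate = copy.deepcopy(classifier)
    return duplicate.fit("mc2b$" + str(index))


base = ToyClassifier()
pool = ThreadPool(processes=3)
try:
    # map() returns results in input order, one per class index.
    trained = pool.map(train_single_class, [(base, i) for i in range(3)])
finally:
    pool.close()
    pool.join()

print(trained)      # ['mc2b$0', 'mc2b$1', 'mc2b$2']
print(base.params)  # {'labelCol': 'label'}
```

If every task called `fit` on `base` directly, the threads would overwrite each other's `labelCol`, which is the kind of interference the per-task duplicate avoids.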



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-205522006
  
    I'll try to figure it out.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-205087601
  
    **[Test build #54816 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54816/consoleFull)** for PR 12124 at commit [`b17cc7b`](https://github.com/apache/spark/commit/b17cc7b8cb33af7bebb444832a2b7fd9e961ea93).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-204634630
  
    **[Test build #54752 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54752/consoleFull)** for PR 12124 at commit [`6d30d77`](https://github.com/apache/spark/commit/6d30d772ca619bf2a331f361764a036ab4f4603c).



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r59786325
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1145,6 +1149,213 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    --- End diff --
    
    No need to rename the predictionCol



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r58812140
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1134,6 +1139,216 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +        """
    +        self._paramMap[self.classifier] = value
    +        return self
    +
    +    @since("2.0.0")
    +    def getClassifier(self):
    +        """
    +        Gets the value of classifier or its default value.
    +        """
    +        return self.getOrDefault(self.classifier)
    +
    +    def _fit(self, dataset):
    +        labelCol = self.getLabelCol()
    +        featuresCol = self.getFeaturesCol()
    +        predictionCol = self.getPredictionCol()
    +        classifier = self.getClassifier()
    +
    +        numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
    +
    +        multiclassLabeled = dataset.select(labelCol, featuresCol)
    +
    +        # persist if underlying dataset is not persistent.
    +        handlePersistence = \
    +            dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False)
    +        if handlePersistence:
    +            multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
    +
    +        def trainSingleClass(index):
    +            binaryLabelCol = "mc2b$" + str(index)
    +            trainingDataset = multiclassLabeled.withColumn(
    +                binaryLabelCol,
    +                when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
    +            paramMap = dict([(classifier.labelCol, binaryLabelCol),
    +                            (classifier.featuresCol, featuresCol),
    +                            (classifier.predictionCol, predictionCol)])
    +            duplicatedClassifier = classifier.__class__()
    +            duplicatedClassifier._resetUid(classifier.uid)
    +            classifier._copyValues(duplicatedClassifier)
    +            return duplicatedClassifier.fit(trainingDataset, paramMap)
    --- End diff --
    
    I mean, it's possible that multiprocessing may work, depending on how the Py4J socket, locks, etc. are shared with the forked child processes... but yeah, there are some questions to answer. Explicit use of `Thread` within a single Python interpreter would probably be easier to reason about.
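
A thread-based version of the per-class training loop might look like the sketch below. It assumes a placeholder `train_single_class` that returns a string instead of fitting a model, and a hypothetical `parallelism` argument; neither comes from this pull request. `multiprocessing.pool.ThreadPool` runs its workers as threads inside the current interpreter, so no child processes are forked.

```python
from multiprocessing.pool import ThreadPool


def train_single_class(index):
    # Placeholder for fitting one binary classifier (hypothetical).
    return "model-for-class-%d" % index


def train_all_classes(num_classes, parallelism=2):
    # Worker threads live in this interpreter; nothing is forked.
    pool = ThreadPool(processes=min(parallelism, num_classes))
    try:
        # map() returns results in input order, one entry per class index.
        return pool.map(train_single_class, range(num_classes))
    finally:
        pool.close()
        pool.join()
```

Because `map()` preserves input order, the returned list lines up with the class indices, which matters when the models are later scored by position.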



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210618284
  
    Good catch on the model copy() method.
    LGTM
    Merging with master
    Thanks @yinxusen !



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r59786333
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1145,6 +1149,213 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    --- End diff --
    
    remove this line (this comment is outdated now)



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-204635220
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54752/
    Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-204634632
  
    @jkbradley @mengxr 
    
    One more thing to discuss: should we use a parallel for-loop in `fit()` of OneVsRest, just like its Scala counterpart?



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206602462
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55159/
    Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206633293
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55157/
    Test FAILed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r59639088
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1134,6 +1138,210 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +
    +        .. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are
    +                  supported now.
    +        """
    +        self._paramMap[self.classifier] = value
    +        return self
    +
    +    @since("2.0.0")
    +    def getClassifier(self):
    +        """
    +        Gets the value of classifier or its default value.
    +        """
    +        return self.getOrDefault(self.classifier)
    +
    +    def _fit(self, dataset):
    +        labelCol = self.getLabelCol()
    +        featuresCol = self.getFeaturesCol()
    +        predictionCol = self.getPredictionCol()
    +        classifier = self.getClassifier()
    +
    +        numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
    +
    +        multiclassLabeled = dataset.select(labelCol, featuresCol)
    +
    +        # persist if underlying dataset is not persistent.
    +        handlePersistence = \
    +            dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False)
    +        if handlePersistence:
    +            multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
    +
    +        def trainSingleClass(index):
    +            binaryLabelCol = "mc2b$" + str(index)
    +            trainingDataset = multiclassLabeled.withColumn(
    +                binaryLabelCol,
    +                when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
    +            paramMap = dict([(classifier.labelCol, binaryLabelCol),
    +                            (classifier.featuresCol, featuresCol),
    +                            (classifier.predictionCol, predictionCol)])
    +            return classifier.fit(trainingDataset, paramMap)
    +
    +        # TODO: Parallel training for all classes.
    +        models = [trainSingleClass(i) for i in range(numClasses)]
    +
    +        if handlePersistence:
    +            multiclassLabeled.unpersist()
    +
    +        return self._copyValues(OneVsRestModel(models=models))
    +
    +    @since("2.0.0")
    +    def copy(self, extra=None):
    +        """
    +        Creates a copy of this instance with a randomly generated uid
    +        and some extra params. This copies creates a deep copy of
    +        the embedded paramMap, and copies the embedded and extra parameters over.
    +
    +        :param extra: Extra parameters to copy to the new instance
    +        :return: Copy of this instance
    +        """
    +        if extra is None:
    +            extra = dict()
    +        return self._copyValues(OneVsRest(self.getClassifier().copy(extra)))
    --- End diff --
    
    Is this correct?  I think what you had before was better.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206602461
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r58811099
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1134,6 +1139,216 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +        """
    +        self._paramMap[self.classifier] = value
    +        return self
    +
    +    @since("2.0.0")
    +    def getClassifier(self):
    +        """
    +        Gets the value of classifier or its default value.
    +        """
    +        return self.getOrDefault(self.classifier)
    +
    +    def _fit(self, dataset):
    +        labelCol = self.getLabelCol()
    +        featuresCol = self.getFeaturesCol()
    +        predictionCol = self.getPredictionCol()
    +        classifier = self.getClassifier()
    +
    +        numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
    +
    +        multiclassLabeled = dataset.select(labelCol, featuresCol)
    +
    +        # persist if underlying dataset is not persistent.
    +        handlePersistence = \
    +            dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False)
    +        if handlePersistence:
    +            multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
    +
    +        def trainSingleClass(index):
    +            binaryLabelCol = "mc2b$" + str(index)
    +            trainingDataset = multiclassLabeled.withColumn(
    +                binaryLabelCol,
    +                when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
    --- End diff --
    
    But I'm hoping to fix trees to not need metadata for 2.0, if we have time.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12124



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r58813854
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1134,6 +1139,216 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +        """
    +        self._paramMap[self.classifier] = value
    +        return self
    +
    +    @since("2.0.0")
    +    def getClassifier(self):
    +        """
    +        Gets the value of classifier or its default value.
    +        """
    +        return self.getOrDefault(self.classifier)
    +
    +    def _fit(self, dataset):
    +        labelCol = self.getLabelCol()
    +        featuresCol = self.getFeaturesCol()
    +        predictionCol = self.getPredictionCol()
    +        classifier = self.getClassifier()
    +
    +        numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
    +
    +        multiclassLabeled = dataset.select(labelCol, featuresCol)
    +
    +        # persist if underlying dataset is not persistent.
    +        handlePersistence = \
    +            dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False)
    +        if handlePersistence:
    +            multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
    +
    +        def trainSingleClass(index):
    +            binaryLabelCol = "mc2b$" + str(index)
    +            trainingDataset = multiclassLabeled.withColumn(
    +                binaryLabelCol,
    +                when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
    +            paramMap = dict([(classifier.labelCol, binaryLabelCol),
    +                            (classifier.featuresCol, featuresCol),
    +                            (classifier.predictionCol, predictionCol)])
    +            duplicatedClassifier = classifier.__class__()
    +            duplicatedClassifier._resetUid(classifier.uid)
    +            classifier._copyValues(duplicatedClassifier)
    +            return duplicatedClassifier.fit(trainingDataset, paramMap)
    --- End diff --
    
    No problem, let's remove it for now.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r58811071
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1134,6 +1139,216 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +        """
    +        self._paramMap[self.classifier] = value
    +        return self
    +
    +    @since("2.0.0")
    +    def getClassifier(self):
    +        """
    +        Gets the value of classifier or its default value.
    +        """
    +        return self.getOrDefault(self.classifier)
    +
    +    def _fit(self, dataset):
    +        labelCol = self.getLabelCol()
    +        featuresCol = self.getFeaturesCol()
    +        predictionCol = self.getPredictionCol()
    +        classifier = self.getClassifier()
    +
    +        numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
    +
    +        multiclassLabeled = dataset.select(labelCol, featuresCol)
    +
    +        # persist if underlying dataset is not persistent.
    +        handlePersistence = \
    +            dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False)
    +        if handlePersistence:
    +            multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
    +
    +        def trainSingleClass(index):
    +            binaryLabelCol = "mc2b$" + str(index)
    +            trainingDataset = multiclassLabeled.withColumn(
    +                binaryLabelCol,
    +                when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
    --- End diff --
    
    Uh oh, I just realized this will only work with LogisticRegression and NaiveBayes.  With trees, there is no good way to set the metadata from PySpark.  We'll need to document that.
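
For reference, the relabeling step this comment is anchored on can be sketched without Spark. This is a minimal pure-Python stand-in, not the PySpark code from the patch: `num_classes_from_labels` and `binarize_labels` are hypothetical helper names, and plain lists take the place of DataFrame columns. It mirrors the `when(label == float(index), 1.0).otherwise(0.0)` expression and the `max(label) + 1` class count shown in the diff.

```python
def num_classes_from_labels(labels):
    """Infer the class count as max(label) + 1, as the diff's _fit does."""
    return int(max(labels)) + 1


def binarize_labels(labels, index):
    """Return 1.0 where the label equals the class index, else 0.0."""
    return [1.0 if label == float(index) else 0.0 for label in labels]


# Hypothetical sample labels, one entry per training row.
labels = [0.0, 1.0, 2.0, 1.0]
num_classes = num_classes_from_labels(labels)

# One derived binary label column per class index.
binary_columns = [binarize_labels(labels, i) for i in range(num_classes)]
```

The derived column is just a list of doubles; nothing attached to it records how many classes it holds, which appears to be the metadata gap the comment describes for tree classifiers.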



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210206150
  
    **[Test build #55868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55868/consoleFull)** for PR 12124 at commit [`4e95ecb`](https://github.com/apache/spark/commit/4e95ecb05b08a96d37fd3fbf6212b2f743a79af4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210205505
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55867/



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-209763064
  
    @jkbradley Merged and fixed the `copy`



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210548792
  
    @jkbradley Ready for another look



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210205438
  
    **[Test build #55867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55867/consoleFull)** for PR 12124 at commit [`6002b92`](https://github.com/apache/spark/commit/6002b923bdc1ad5b757bc73a89c44aa19c21424d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-204635219
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210202492
  
    **[Test build #55867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55867/consoleFull)** for PR 12124 at commit [`6002b92`](https://github.com/apache/spark/commit/6002b923bdc1ad5b757bc73a89c44aa19c21424d).



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-209591866
  
    @jkbradley 



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-205087839
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206692753
  
    **[Test build #55181 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55181/consoleFull)** for PR 12124 at commit [`fd4fc11`](https://github.com/apache/spark/commit/fd4fc11d1b954584cf06aeb68bfe8ad982519311).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206633101
  
    **[Test build #55157 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55157/consoleFull)** for PR 12124 at commit [`47bd709`](https://github.com/apache/spark/commit/47bd7091a75ee6dac34674240acc9a594b157ccd).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206691664
  
    **[Test build #55178 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55178/consoleFull)** for PR 12124 at commit [`cf4df64`](https://github.com/apache/spark/commit/cf4df64d90cc00ac8a3a137088f8dab8c6650116).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210140573
  
    One last comment: Since this implementation is fully in Python, could you please port some of the unit tests from OneVsRestSuite.scala to ml/tests.py?  Thanks!
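
A ported test could take roughly the following shape. This is a hedged sketch only: `ToyOneVsRestModel` is a hypothetical pure-Python stand-in so the example runs without Spark, whereas the real tests in ml/tests.py would build a DataFrame and call `OneVsRest.fit`. It checks the behaviour the class docstring describes, namely that the class whose binary model scores highest becomes the prediction.

```python
import unittest


class ToyOneVsRestModel(object):
    """Hypothetical stand-in: holds one scoring function per class."""

    def __init__(self, scorers):
        self.scorers = scorers

    def predict(self, features):
        scores = [scorer(features) for scorer in self.scorers]
        # Pick the class index whose binary model gives the highest score.
        return float(scores.index(max(scores)))


class OneVsRestSketchTest(unittest.TestCase):

    def test_highest_score_wins(self):
        model = ToyOneVsRestModel([
            lambda x: x[0],          # class 0 favours a large first feature
            lambda x: -x[0] - x[1],  # class 1 favours small features
            lambda x: x[1],          # class 2 favours a large second feature
        ])
        self.assertEqual(model.predict([1.0, 0.2]), 0.0)
        self.assertEqual(model.predict([-1.0, -1.0]), 1.0)
        self.assertEqual(model.predict([0.1, 0.9]), 2.0)
```

The scoring functions here are arbitrary; a real port would reuse the datasets and assertions from OneVsRestSuite.scala rather than these toy values.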



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-209762940
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55795/



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210205503
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-209762937
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r58811068
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1134,6 +1139,216 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +        """
    +        self._paramMap[self.classifier] = value
    +        return self
    +
    +    @since("2.0.0")
    +    def getClassifier(self):
    +        """
    +        Gets the value of classifier or its default value.
    +        """
    +        return self.getOrDefault(self.classifier)
    +
    +    def _fit(self, dataset):
    +        labelCol = self.getLabelCol()
    +        featuresCol = self.getFeaturesCol()
    +        predictionCol = self.getPredictionCol()
    +        classifier = self.getClassifier()
    +
    +        numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
    +
    +        multiclassLabeled = dataset.select(labelCol, featuresCol)
    +
    +        # persist if underlying dataset is not persistent.
    +        handlePersistence = \
    +            dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False)
    +        if handlePersistence:
    +            multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
    +
    +        def trainSingleClass(index):
    +            binaryLabelCol = "mc2b$" + str(index)
    +            trainingDataset = multiclassLabeled.withColumn(
    +                binaryLabelCol,
    +                when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
    +            paramMap = dict([(classifier.labelCol, binaryLabelCol),
    +                            (classifier.featuresCol, featuresCol),
    +                            (classifier.predictionCol, predictionCol)])
    +            duplicatedClassifier = classifier.__class__()
    +            duplicatedClassifier._resetUid(classifier.uid)
    +            classifier._copyValues(duplicatedClassifier)
    +            return duplicatedClassifier.fit(trainingDataset, paramMap)
    --- End diff --
    
    Thanks for doing this.  But...I just talked with Josh, who strongly recommended not using multiprocessing for fear of some possible side-effects.  Would you mind reverting the change and just training one model at a time?  My apologies for the switch!
    
    I'd like us to do multiple jobs at once in the future, but we should do more careful prototyping and testing than we have time for in Spark 2.0.  I'll make a new JIRA and link it to this one.
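
The requested revert amounts to training the per-class models in a plain loop instead of mapping over them in parallel. A minimal sketch, assuming a hypothetical `train_single_class(index)` callable that stands in for the patch's `trainSingleClass`; the stub below only records the order of calls and does not fit anything.

```python
def train_models_sequentially(train_single_class, num_classes):
    """Train one binary model per class, one at a time, in index order."""
    return [train_single_class(index) for index in range(num_classes)]


# Hypothetical stub: records call order instead of fitting a classifier.
call_order = []


def fake_train_single_class(index):
    call_order.append(index)
    return "model-%d" % index


models = train_models_sequentially(fake_train_single_class, 3)
```

Because the loop runs in the driver thread, each fit finishes before the next begins, which avoids the side-effects the comment is worried about at the cost of no overlap between jobs.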



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206597628
  
    **[Test build #55157 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55157/consoleFull)** for PR 12124 at commit [`47bd709`](https://github.com/apache/spark/commit/47bd7091a75ee6dac34674240acc9a594b157ccd).



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206602293
  
    **[Test build #55159 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55159/consoleFull)** for PR 12124 at commit [`ecdc742`](https://github.com/apache/spark/commit/ecdc74209e14ba1a1404570a53ea7b24c2635582).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206598793
  
    **[Test build #55159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55159/consoleFull)** for PR 12124 at commit [`ecdc742`](https://github.com/apache/spark/commit/ecdc74209e14ba1a1404570a53ea7b24c2635582).



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206691766
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210206265
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55868/
    Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-205087844
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54816/
    Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-209762901
  
    **[Test build #55795 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55795/consoleFull)** for PR 12124 at commit [`2fb4e3d`](https://github.com/apache/spark/commit/2fb4e3d27197dbd60f10770d55e8698638673886).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-209680745
  
    Thanks for pinging me!  I'll make a final pass after the merge conflicts are fixed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-213663303
  
    I'm working on a simpler fix for now: [https://issues.apache.org/jira/browse/SPARK-14862]



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-205086068
  
    **[Test build #54816 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54816/consoleFull)** for PR 12124 at commit [`b17cc7b`](https://github.com/apache/spark/commit/b17cc7b8cb33af7bebb444832a2b7fd9e961ea93).



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-205522754
  
    OK thanks.  Hopefully there are existing examples of parfors in the codebase to work from.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206633287
  
    Build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206692925
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-210136661
  
    Thanks for the updates!  I made a final pass.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r59786622
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1145,6 +1149,213 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +
    +        .. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are
    +                  supported now.
    +        """
    +        self._paramMap[self.classifier] = value
    +        return self
    +
    +    @since("2.0.0")
    +    def getClassifier(self):
    +        """
    +        Gets the value of classifier or its default value.
    +        """
    +        return self.getOrDefault(self.classifier)
    +
    +    def _fit(self, dataset):
    +        labelCol = self.getLabelCol()
    +        featuresCol = self.getFeaturesCol()
    +        predictionCol = self.getPredictionCol()
    +        classifier = self.getClassifier()
    +
    +        numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
    +
    +        multiclassLabeled = dataset.select(labelCol, featuresCol)
    +
    +        # persist if underlying dataset is not persistent.
    +        handlePersistence = \
    +            dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False)
    +        if handlePersistence:
    +            multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
    +
    +        def trainSingleClass(index):
    +            binaryLabelCol = "mc2b$" + str(index)
    +            trainingDataset = multiclassLabeled.withColumn(
    +                binaryLabelCol,
    +                when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
    +            paramMap = dict([(classifier.labelCol, binaryLabelCol),
    +                            (classifier.featuresCol, featuresCol),
    +                            (classifier.predictionCol, predictionCol)])
    +            return classifier.fit(trainingDataset, paramMap)
    +
    +        # TODO: Parallel training for all classes.
    +        models = [trainSingleClass(i) for i in range(numClasses)]
    +
    +        if handlePersistence:
    +            multiclassLabeled.unpersist()
    +
    +        return self._copyValues(OneVsRestModel(models=models))
    +
    +    @since("2.0.0")
    +    def copy(self, extra=None):
    +        """
    +        Creates a copy of this instance with a randomly generated uid
    +        and some extra params. This copies creates a deep copy of
    +        the embedded paramMap, and copies the embedded and extra parameters over.
    +
    +        :param extra: Extra parameters to copy to the new instance
    +        :return: Copy of this instance
    +        """
    +        if extra is None:
    +            extra = dict()
    +        newOvr = Params.copy(self, extra)
    +        if self.isSet(self.classifier):
    +            newOvr.setClassifier(self.getClassifier().copy(extra))
    +        return newOvr
    +
    +
    +class OneVsRestModel(Model, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Model fitted by OneVsRest.
    +    This stores the models resulting from training k binary classifiers: one for each class.
    +    Each example is scored against all k models, and the model with the highest score
    +    is picked to label the example.
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    def __init__(self, models):
    +        super(OneVsRestModel, self).__init__()
    +        #: best model from cross validation
    --- End diff --
    
    Remove this line.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r59786617
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1145,6 +1149,213 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +
    +        .. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are
    +                  supported now.
    +        """
    +        self._paramMap[self.classifier] = value
    +        return self
    +
    +    @since("2.0.0")
    +    def getClassifier(self):
    +        """
    +        Gets the value of classifier or its default value.
    +        """
    +        return self.getOrDefault(self.classifier)
    +
    +    def _fit(self, dataset):
    +        labelCol = self.getLabelCol()
    +        featuresCol = self.getFeaturesCol()
    +        predictionCol = self.getPredictionCol()
    +        classifier = self.getClassifier()
    +
    +        numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
    +
    +        multiclassLabeled = dataset.select(labelCol, featuresCol)
    +
    +        # persist if underlying dataset is not persistent.
    +        handlePersistence = \
    +            dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False)
    +        if handlePersistence:
    +            multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
    +
    +        def trainSingleClass(index):
    +            binaryLabelCol = "mc2b$" + str(index)
    +            trainingDataset = multiclassLabeled.withColumn(
    +                binaryLabelCol,
    +                when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
    +            paramMap = dict([(classifier.labelCol, binaryLabelCol),
    +                            (classifier.featuresCol, featuresCol),
    +                            (classifier.predictionCol, predictionCol)])
    +            return classifier.fit(trainingDataset, paramMap)
    +
    +        # TODO: Parallel training for all classes.
    +        models = [trainSingleClass(i) for i in range(numClasses)]
    +
    +        if handlePersistence:
    +            multiclassLabeled.unpersist()
    +
    +        return self._copyValues(OneVsRestModel(models=models))
    +
    +    @since("2.0.0")
    +    def copy(self, extra=None):
    +        """
    +        Creates a copy of this instance with a randomly generated uid
    +        and some extra params. This copies creates a deep copy of
    --- End diff --
    
    "This copies creates a deep copy " --> "This creates a deep copy "



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12124#discussion_r59790843
  
    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1145,6 +1149,213 @@ def weights(self):
             return self._call_java("weights")
     
     
    +@inherit_doc
    +class OneVsRest(Estimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
    +    """
    +    Reduction of Multiclass Classification to Binary Classification.
    +    Performs reduction using one against all strategy.
    +    For a multiclass classification with k classes, train k models (one per class).
    +    Each example is scored against all k models and the model with highest score
    +    is picked to label the example.
    +
    +    >>> from pyspark.sql import Row
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sc.parallelize([
    +    ...     Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    +    ...     Row(label=1.0, features=Vectors.sparse(2, [], [])),
    +    ...     Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
    +    >>> lr = LogisticRegression(maxIter=5, regParam=0.01)
    +    >>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
    +    >>> model = ovr.fit(df)
    +    >>> [x.coefficients for x in model.models]
    +    [DenseVector([3.3925, 1.8785]), DenseVector([-4.3016, -6.3163]), DenseVector([-4.5855, 6.1785])]
    +    >>> [x.intercept for x in model.models]
    +    [-3.6474708290602034, 2.5507881951814495, -1.1016513228162115]
    +    >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0))]).toDF()
    +    >>> model.transform(test0).head().indexed
    +    1.0
    +    >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
    +    >>> model.transform(test1).head().indexed
    +    0.0
    +    >>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4))]).toDF()
    +    >>> model.transform(test2).head().indexed
    +    2.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    classifier = Param(Params._dummy(), "classifier", "base binary classifier")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
    +                 classifier=None):
    +        """
    +        __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
    +                 classifier=None)
    +        """
    +        super(OneVsRest, self).__init__()
    +        kwargs = self.__init__._input_kwargs
    +        self._set(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        """
    +        setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None):
    +        Sets params for OneVsRest.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.0.0")
    +    def setClassifier(self, value):
    +        """
    +        Sets the value of :py:attr:`classifier`.
    +
    +        .. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are
    +                  supported now.
    +        """
    +        self._paramMap[self.classifier] = value
    --- End diff --
    
    Use _set instead.  See [https://github.com/apache/spark/pull/11939]
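
A minimal sketch of the idea behind this suggestion, assuming the point is that setters should go through one shared entry point rather than write into `_paramMap` directly. `MiniParams` is a hypothetical stand-in written for illustration, not `pyspark.ml.param.Params`, and its name check is an invented example of what a shared entry point could enforce. In the code under review, the change would presumably look like `return self._set(classifier=value)`, matching the `self._set(**kwargs)` calls already in the diff.

```python
class MiniParams:
    """Hypothetical stand-in for a params holder; not the PySpark class."""

    def __init__(self):
        self._paramMap = {}
        self._known = {"classifier", "predictionCol"}

    def _set(self, **kwargs):
        # Single entry point: every setter gets the same name check.
        for name, value in kwargs.items():
            if name not in self._known:
                raise AttributeError("unknown param: %s" % name)
            self._paramMap[name] = value
        return self

    def setClassifier(self, value):
        # Delegate to _set rather than assigning into _paramMap directly.
        return self._set(classifier=value)


ovr = MiniParams().setClassifier("lr-placeholder")
```

Routing every setter through `_set` means any check added there later applies to all params at once, whereas a direct `_paramMap` assignment would bypass it.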



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206692961
  
    @jkbradley Ready for review. I'll try to fix trees if there is still time before 2.0.



[GitHub] spark pull request: [SPARK-7861][ML] PySpark OneVsRest

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12124#issuecomment-206691541
  
    **[Test build #55181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55181/consoleFull)** for PR 12124 at commit [`fd4fc11`](https://github.com/apache/spark/commit/fd4fc11d1b954584cf06aeb68bfe8ad982519311).

