Posted to reviews@spark.apache.org by yanboliang <gi...@git.apache.org> on 2016/01/24 11:07:03 UTC

[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/10889

    [SPARK-12974] [ML] [PySpark] Add Python API for spark.ml bisecting k-means

    Add Python API for spark.ml bisecting k-means.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-12974

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10889.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10889
    
----
commit dc81222bde25c9f9b36b8a888e0792a1ed62765e
Author: Yanbo Liang <yb...@gmail.com>
Date:   2016-01-24T09:57:18Z

    Add Python API for spark.ml bisecting k-means

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by yanboliang <gi...@git.apache.org>.
GitHub user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10889#discussion_r50961234
  
    --- Diff: python/pyspark/ml/clustering.py ---
    @@ -170,6 +170,129 @@ def getInitSteps(self):
             return self.getOrDefault(self.initSteps)
     
     
    +class BisectingKMeansModel(JavaModel):
    +    """
    +    .. note:: Experimental
    +
    +    Model fitted by BisectingKMeans.
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    @since("2.0.0")
    +    def clusterCenters(self):
    +        """Get the cluster centers, represented as a list of NumPy arrays."""
    +        return [c.toArray() for c in self._call_java("clusterCenters")]
    +
    +    @since("2.0.0")
    +    def computeCost(self, dataset):
    +        """
    +        Computes the sum of squared distances between the input points
    +        and their corresponding cluster centers.
    +        """
    +        return self._call_java("computeCost", dataset)
    +
    +
    +@inherit_doc
    +class BisectingKMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasSeed):
    +    """
    +    .. note:: Experimental
    +
    +    A bisecting k-means algorithm based on the paper "A comparison of document clustering
    +    techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.
    +    The algorithm starts from a single cluster that contains all points.
    +    Iteratively it finds divisible clusters on the bottom level and bisects each of them using
    +    k-means, until there are `k` leaf clusters in total or no leaf clusters are divisible.
    +    The bisecting steps of clusters on the same level are grouped together to increase parallelism.
    +    If bisecting all divisible clusters on the bottom level would result in more than `k` leaf
    +    clusters, larger clusters get higher priority.
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
    +    ...         (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
    +    >>> df = sqlContext.createDataFrame(data, ["features"])
    +    >>> bkm = BisectingKMeans(k=2, minDivisibleClusterSize=1.0)
    +    >>> model = bkm.fit(df)
    +    >>> centers = model.clusterCenters()
    +    >>> len(centers)
    +    2
    +    >>> model.computeCost(df)
    +    2.000...
    +    >>> transformed = model.transform(df).select("features", "prediction")
    +    >>> rows = transformed.collect()
    +    >>> rows[0].prediction == rows[1].prediction
    +    True
    +    >>> rows[2].prediction == rows[3].prediction
    +    True
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    k = Param(Params._dummy(), "k", "number of clusters to create")
    +    minDivisibleClusterSize = Param(Params._dummy(), "minDivisibleClusterSize",
    +                                    "the minimum number of points (if >= 1.0) " +
    +                                    "or the minimum proportion")
    +
    +    @keyword_only
    +    def __init__(self, featuresCol="features", predictionCol="prediction", maxIter=20,
    +                 seed=None, k=4, minDivisibleClusterSize=1.0):
    +        """
    +        __init__(self, featuresCol="features", predictionCol="prediction", maxIter=20, \
    +                 seed=None, k=4, minDivisibleClusterSize=1.0)
    +        """
    +        super(BisectingKMeans, self).__init__()
    +        self._java_obj = self._new_java_obj("org.apache.spark.ml.clustering.BisectingKMeans",
    +                                            self.uid)
    +        self._setDefault(maxIter=20, k=4, minDivisibleClusterSize=1.0)
    +        kwargs = self.__init__._input_kwargs
    +        self.setParams(**kwargs)
    --- End diff --
    
    After #10216, we do not need to declare param variables in ```__init__``` by hand.
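
One way this kind of boilerplate can be avoided is to let a base class copy the class-level params onto each instance. The sketch below is only an illustration under that assumption. The `Param` and `Params` classes here are hypothetical stand-ins, not the real `pyspark.ml.param` classes, and this is not the code from #10216.

```python
import copy


class Param(object):
    """Hypothetical, stripped-down stand-in for a param descriptor."""

    def __init__(self, parent, name, doc):
        self.parent = parent
        self.name = name
        self.doc = doc


class Params(object):
    """Hypothetical base class that rebinds class-level params per instance."""

    def __init__(self):
        cls = type(self)
        self.uid = "%s_%x" % (cls.__name__, id(self))
        # Copy every class-level Param onto this instance and point it at the
        # instance uid, so subclasses need no per-param lines in __init__.
        for name in dir(cls):
            attr = getattr(cls, name, None)
            if isinstance(attr, Param):
                bound = copy.copy(attr)
                bound.parent = self.uid
                setattr(self, name, bound)


class BisectingKMeansParams(Params):
    # Declared once at class level; "undefined" is a placeholder parent
    # chosen for this sketch.
    k = Param("undefined", "k", "number of clusters to create")
    minDivisibleClusterSize = Param(
        "undefined", "minDivisibleClusterSize",
        "the minimum number of points (if >= 1.0) or the minimum proportion")


est = BisectingKMeansParams()
print(est.k.parent == est.uid)            # True
print(BisectingKMeansParams.k.parent)     # undefined
```

The class-level declaration stays untouched, and each instance gets its own copy bound to its own uid.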

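The class docstring in the diff above describes the bisecting procedure in words. A rough pure-Python sketch of the idea follows. It is a simplification under stated assumptions: it splits one leaf at a time (largest first) instead of grouping splits per level as the docstring describes, and all function names are hypothetical, not part of Spark. The treatment of `min_divisible_cluster_size` follows the param description in the diff (a point count if >= 1.0, otherwise a proportion of all points).

```python
import random


def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    n = float(len(points))
    return [sum(col) / n for col in zip(*points)]


def _two_means(points, max_iter, rng):
    """Split points into two groups with plain k-means (k=2)."""
    centers = rng.sample(points, 2)
    left, right = [], []
    for _ in range(max_iter):
        left, right = [], []
        for p in points:
            if _sq_dist(p, centers[0]) <= _sq_dist(p, centers[1]):
                left.append(p)
            else:
                right.append(p)
        if not left or not right:
            break  # degenerate split; caller treats the cluster as indivisible
        new_centers = [_mean(left), _mean(right)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k, min_divisible_cluster_size=1.0,
                     max_iter=20, seed=0):
    """Return a list of leaf clusters (each a list of points)."""
    rng = random.Random(seed)
    if min_divisible_cluster_size >= 1.0:
        min_size = min_divisible_cluster_size
    else:
        min_size = min_divisible_cluster_size * len(points)
    divisible = [list(points)]  # leaves that may still be split
    finished = []               # leaves that cannot be split further
    while divisible and len(divisible) + len(finished) < k:
        divisible.sort(key=len)
        cluster = divisible.pop()  # larger clusters get higher priority
        if len(cluster) < max(min_size, 2):
            finished.append(cluster)
            continue
        left, right = _two_means(cluster, max_iter, rng)
        if not left or not right:
            finished.append(cluster)
        else:
            divisible.extend([left, right])
    return divisible + finished


def compute_cost(clusters):
    """Sum of squared distances from each point to its cluster mean."""
    return sum(_sq_dist(p, _mean(c)) for c in clusters for p in c)


data = [[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]]
clusters = bisecting_kmeans(data, k=2)
print(len(clusters))           # 2
print(compute_cost(clusters))  # 2.0
```

On the four points used in the doctest, this sketch produces two leaves with a cost of 2.0, which agrees with the `2.000...` shown there.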



[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-175471664
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50180/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-174408233
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-175468906
  
    **[Test build #50180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50180/consoleFull)** for PR 10889 at commit [`581914c`](https://github.com/apache/spark/commit/581914ce31d2bf02b3284fd28ffad81fe31fbb15).




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by asfgit <gi...@git.apache.org>.
GitHub user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10889




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by yanboliang <gi...@git.apache.org>.
GitHub user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-183240342
  
    @mengxr Actually, I already updated this PR after #10216 got merged.




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-174408235
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49978/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-174277703
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49953/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-174402415
  
    **[Test build #49978 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49978/consoleFull)** for PR 10889 at commit [`dcf6b8b`](https://github.com/apache/spark/commit/dcf6b8b222e5c9168e2f4d984f260ff86db60b7f).




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-174277650
  
    **[Test build #49953 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49953/consoleFull)** for PR 10889 at commit [`21acce0`](https://github.com/apache/spark/commit/21acce0bb7f04fd88411f65ab2a3624e28d27e4c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-175471561
  
    **[Test build #50180 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50180/consoleFull)** for PR 10889 at commit [`581914c`](https://github.com/apache/spark/commit/581914ce31d2bf02b3284fd28ffad81fe31fbb15).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-174277108
  
    **[Test build #49953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49953/consoleFull)** for PR 10889 at commit [`21acce0`](https://github.com/apache/spark/commit/21acce0bb7f04fd88411f65ab2a3624e28d27e4c).




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-174277060
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49952/
    Test FAILed.




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-174277059
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-175471662
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-174408135
  
    **[Test build #49978 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49978/consoleFull)** for PR 10889 at commit [`dcf6b8b`](https://github.com/apache/spark/commit/dcf6b8b222e5c9168e2f4d984f260ff86db60b7f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by mengxr <gi...@git.apache.org>.
GitHub user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-183117916
  
    @yanboliang Could you update this PR since #10216 was merged? Thanks!




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by yanboliang <gi...@git.apache.org>.
GitHub user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-175519675
  
    cc @mengxr @jkbradley 




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by mengxr <gi...@git.apache.org>.
GitHub user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-183255063
  
    LGTM. Merged into master. Thanks!




[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10889#issuecomment-174277700
  
    Merged build finished. Test PASSed.

