Posted to reviews@spark.apache.org by hhbyyh <gi...@git.apache.org> on 2015/09/08 04:18:32 UTC

[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

GitHub user hhbyyh opened a pull request:

    https://github.com/apache/spark/pull/8650

    [SPARK-10482] [ML] Add Python interface for ml.CountVectorizer

    jira: https://issues.apache.org/jira/browse/SPARK-10482
    
    Add Python interface for feature transformer: ml.CountVectorizer

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hhbyyh/spark countVecPython

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8650.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8650
    
----
commit 0f1fa34198459e32cb5099a0720e8d4bf053b33e
Author: Yuhao Yang <hh...@gmail.com>
Date:   2015-09-08T02:07:57Z

    add python for countVectorizer

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8650#discussion_r38962140
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -167,6 +168,134 @@ def getSplits(self):
     
     
     @inherit_doc
    +class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    Extracts a vocabulary from document collections and generates a [[CountVectorizerModel]],
    +    which converts text documents to sparse vectors of token counts.
    +
    +    >>> df = sentenceData = sqlContext.createDataFrame([(0, ["a", "b", "c"]),
    +    ... (1, ["a", "b", "b", "c", "a"])], ["label", "raw"])
    +    >>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
    +    >>> model = cv.fit(df)
    +    >>> model.transform(sentenceData).show(truncate=False)
    +    +-----+---------------+-------------------------+
    +    |label|raw            |vectors                  |
    +    +-----+---------------+-------------------------+
    +    |0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
    +    |1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
    +    +-----+---------------+-------------------------+
    +    ...
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    vocabSize = Param(Params._dummy(), "vocabSize", "max size of the vocabulary")
    +    minDF = Param(Params._dummy(), "minDF",
    +                  "Specifies the minimum number of different documents a term must appear in " +
    +                  "to be included in the vocabulary. If this is an integer >= 1, this specifies " +
    +                  "the number of documents the term must appear in; if this is a double in " +
    +                  "[0,1), then this specifies the fraction of documents.")
    +
    +    minTF = Param(Params._dummy(), "minTF",
    +                  "Filter to ignore rare words in a document. For each document, terms with " +
    +                  "frequency/count less than the given threshold are ignored. If this is an " +
    +                  "integer >= 1, then this specifies a count (of times the term must appear in" +
    +                  " the document); if this is a double in [0,1), then this specifies a " +
    +                  "fraction (out of the document's token count). Note that the parameter is " +
    +                  "only used in transform of CountVectorizerModel and does not affect fitting.")
    +
    +    @keyword_only
    +    def __init__(self, minDF=1.0, minTF=1.0, vocabSize=1 << 18, inputCol=None, outputCol=None):
    +        """
    +        __init__(self, minDF=1.0, minTF=1.0, vocabSize=1<<18, inputCol=None, outputCol=None)
    +        """
    +        super(CountVectorizer, self).__init__()
    +        self._java_obj = self._new_java_obj(
    +            "org.apache.spark.ml.feature.CountVectorizer", self.uid)
    +        self.minDF = \
    +            Param(self, "minDF",
    +                  "Specifies the minimum number of different documents a term must appear in " +
    +                  "to be included in the vocabulary. If this is an integer >= 1, this specifies " +
    +                  "the number of documents the term must appear in; if this is a double in " +
    +                  "[0,1), then this specifies the fraction of documents.")
    +        self.minTF = \
    +            Param(self, "minTF",
    +                  "Filter to ignore rare words in a document. For each document, terms with " +
    +                  "frequency/count less than the given threshold are ignored. If this is an " +
    +                  "integer >= 1, then this specifies a count (of times the term must appear in" +
    +                  " the document); if this is a double in [0,1), then this specifies a " +
    +                  "fraction (out of the document's token count). Note that the parameter is " +
    +                  "only used in transform of CountVectorizerModel and does not affect fitting.")
    +        self.vocabSize = Param(self, "vocabSize", "max size of the vocabulary")
    +        self._setDefault(minDF=1.0, minTF=1.0, vocabSize=1 << 18)
    +        kwargs = self.__init__._input_kwargs
    +        self.setParams(**kwargs)
    +
    +    @keyword_only
    +    def setParams(self, minDF=1, minTF=1, vocabSize=1 << 18, inputCol=None, outputCol=None):
    +        """
    +        setParams(self, minDF=1, minTF=1, vocabSize=1 << 18, inputCol=None, outputCol=None)
    +        Sets params for this CountVectorizer.
    +        """
    +        kwargs = self.setParams._input_kwargs
    +        return self._set(**kwargs)
    +
    +    def setMinDF(self, value):
    +        """
    +        Sets the value of :py:attr:`minDF`.
    +        """
    +        self._paramMap[self.minDF] = value
    +        return self
    +
    +    def getMinDF(self):
    +        """
    +        Gets the value of minDF or its default value.
    +        """
    +        return self.getOrDefault(self.minDF)
    +
    +    def setMinTF(self, value):
    +        """
    +        Sets the value of :py:attr:`minTF`.
    +        """
    +        self._paramMap[self.minTF] = value
    +        return self
    +
    +    def getMinTF(self):
    +        """
    +        Gets the value of minTF or its default value.
    +        """
    +        return self.getOrDefault(self.minTF)
    +
    +    def setVocabSize(self, value):
    +        """
    +        Sets the value of :py:attr:`vocabSize`.
    +        """
    +        self._paramMap[self.vocabSize] = value
    +        return self
    +
    +    def getVocabSize(self):
    +        """
    +        Gets the value of vocabSize or its default value.
    +        """
    +        return self.getOrDefault(self.vocabSize)
    +
    +    def _create_model(self, java_model):
    +        return CountVectorizerModel(java_model)
    +
    +
    +class CountVectorizerModel(JavaModel):
    +    """
    +    Model fitted by CountVectorizer. Converts a text document to a sparse vector of token counts.
    +    """
    +
    +    @property
    +    def vocabulary(self):
    +        """
    +        An Array over terms. Only the terms in the vocabulary will be counted.
    --- End diff --
    
    `array`
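
The `minDF` and `minTF` descriptions in the diff above distinguish two cases: a value >= 1 is an absolute count, and a value in [0,1) is a fraction (of documents for `minDF`, of the document's token count for `minTF`). Below is a minimal pure-Python sketch of that rule as the parameter docs describe it. It is an illustration only, not the Spark implementation, and the helper names `resolve_threshold` and `filter_counts` are hypothetical.

```python
def resolve_threshold(value, total):
    """Turn a minDF/minTF-style setting into an absolute count.

    Assumption (taken from the param docs in the diff): values >= 1 are
    absolute counts; values in [0, 1) are fractions of `total`.
    """
    if value >= 1:
        return float(value)
    return value * total


def filter_counts(term_counts, min_tf):
    """Keep only terms whose in-document count meets the minTF threshold."""
    total_tokens = sum(term_counts.values())
    threshold = resolve_threshold(min_tf, total_tokens)
    return {t: c for t, c in term_counts.items() if c >= threshold}


# Token counts for the second doctest row: ["a", "b", "b", "c", "a"]
counts = {"a": 2, "b": 2, "c": 1}
as_count = filter_counts(counts, 2)       # absolute count of 2
as_fraction = filter_counts(counts, 0.3)  # 30% of 5 tokens
```

With these inputs both calls drop `c`, since it appears once and the resolved thresholds are 2 and roughly 1.5 respectively.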




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138411847
  
      [Test build #42112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42112/consoleFull) for   PR 8650 at commit [`0f1fa34`](https://github.com/apache/spark/commit/0f1fa34198459e32cb5099a0720e8d4bf053b33e).




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8650#discussion_r38962137
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -167,6 +168,134 @@ def getSplits(self):
     
     
     @inherit_doc
    +class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    Extracts a vocabulary from document collections and generates a [[CountVectorizerModel]],
    +    which converts text documents to sparse vectors of token counts.
    +
    +    >>> df = sentenceData = sqlContext.createDataFrame([(0, ["a", "b", "c"]),
    +    ... (1, ["a", "b", "b", "c", "a"])], ["label", "raw"])
    +    >>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
    +    >>> model = cv.fit(df)
    +    >>> model.transform(sentenceData).show(truncate=False)
    +    +-----+---------------+-------------------------+
    +    |label|raw            |vectors                  |
    +    +-----+---------------+-------------------------+
    +    |0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
    +    |1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
    +    +-----+---------------+-------------------------+
    +    ...
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    vocabSize = Param(Params._dummy(), "vocabSize", "max size of the vocabulary")
    +    minDF = Param(Params._dummy(), "minDF",
    +                  "Specifies the minimum number of different documents a term must appear in " +
    +                  "to be included in the vocabulary. If this is an integer >= 1, this specifies " +
    +                  "the number of documents the term must appear in; if this is a double in " +
    +                  "[0,1), then this specifies the fraction of documents.")
    +
    +    minTF = Param(Params._dummy(), "minTF",
    +                  "Filter to ignore rare words in a document. For each document, terms with " +
    +                  "frequency/count less than the given threshold are ignored. If this is an " +
    +                  "integer >= 1, then this specifies a count (of times the term must appear in" +
    +                  " the document); if this is a double in [0,1), then this specifies a " +
    +                  "fraction (out of the document's token count). Note that the parameter is " +
    +                  "only used in transform of CountVectorizerModel and does not affect fitting.")
    +
    +    @keyword_only
    +    def __init__(self, minDF=1.0, minTF=1.0, vocabSize=1 << 18, inputCol=None, outputCol=None):
    +        """
    +        __init__(self, minDF=1.0, minTF=1.0, vocabSize=1<<18, inputCol=None, outputCol=None)
    --- End diff --
    
    `1 << 18`
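
Both `__init__` and `setParams` in the diff are decorated with `@keyword_only` and then read `_input_kwargs`. The following is a simplified, standalone sketch of how a decorator of that kind can work. It is written under that assumption and is not a copy of pyspark's code; the `Demo` class is hypothetical.

```python
from functools import wraps


def keyword_only(func):
    """Reject positional arguments and remember the keywords actually passed."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        # args[0] is `self`; anything beyond it is a positional argument.
        if len(args) > 1:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        wrapper._input_kwargs = kwargs
        return func(*args, **kwargs)
    return wrapper


class Demo(object):
    """Hypothetical class mirroring the shape of the constructor in the diff."""

    @keyword_only
    def __init__(self, minDF=1.0, vocabSize=1 << 18):
        # Only explicitly passed keywords are recorded, not signature defaults.
        self.explicit = dict(self.__init__._input_kwargs)


default_vocab_size = 1 << 18  # 2 ** 18
```

In this sketch only explicitly passed keywords are recorded, so signature defaults never reach `setParams`. That is presumably why the diff also registers the defaults separately through `_setDefault`.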




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138680688
  
    LGTM except some minor issues




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138411639
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138459078
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42122/
    Test FAILed.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh closed the pull request at:

    https://github.com/apache/spark/pull/8650




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138714699
  
    @holdenk Yes, I just noticed it. Could you merge some changes in this PR into yours? I think the doctest from @hhbyyh is better and the default values are specified correctly in this PR. I will make a pass after.
    
    @hhbyyh Since this duplicates #8561, do you mind closing this PR? You can check open PRs at https://spark-prs.appspot.com/#mllib.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138458027
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138411896
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138411897
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42112/
    Test FAILed.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138468566
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42125/
    Test PASSed.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138459076
  
      [Test build #42122 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42122/console) for   PR 8650 at commit [`d22ba5a`](https://github.com/apache/spark/commit/d22ba5a997aef2f0a21c97bcb2ab2ed7226f770b).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138464930
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138704510
  
    This seems to do the same work as the outstanding PR https://github.com/apache/spark/pull/8561 




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138458014
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138789555
  
    @mengxr Sorry for the extra effort during review.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138411895
  
      [Test build #42112 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42112/console) for   PR 8650 at commit [`0f1fa34`](https://github.com/apache/spark/commit/0f1fa34198459e32cb5099a0720e8d4bf053b33e).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138465382
  
      [Test build #42125 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42125/consoleFull) for   PR 8650 at commit [`dd0e933`](https://github.com/apache/spark/commit/dd0e933269832645f35c42a59d4d41ec4ef7f3fb).




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138411647
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138468472
  
      [Test build #42125 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42125/console) for   PR 8650 at commit [`dd0e933`](https://github.com/apache/spark/commit/dd0e933269832645f35c42a59d4d41ec4ef7f3fb).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol):`
      * `class CountVectorizerModel(JavaModel):`





[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138468564
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8650#discussion_r38962078
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -167,6 +168,134 @@ def getSplits(self):
     
     
     @inherit_doc
    +class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    Extracts a vocabulary from document collections and generates a [[CountVectorizerModel]],
    +    which converts text documents to sparse vectors of token counts.
    +
    +    >>> df = sentenceData = sqlContext.createDataFrame([(0, ["a", "b", "c"]),
    --- End diff --
    
    remove `sentenceData = `




[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138726975
  
    Ok, I'll merge in the doc tests.


[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8650#discussion_r38962082
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -167,6 +168,134 @@ def getSplits(self):
     
     
     @inherit_doc
    +class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    Extracts a vocabulary from document collections and generates a [[CountVectorizerModel]],
    +    which converts text documents to sparse vectors of token counts.
    +
    +    >>> df = sentenceData = sqlContext.createDataFrame([(0, ["a", "b", "c"]),
    +    ... (1, ["a", "b", "b", "c", "a"])], ["label", "raw"])
    --- End diff --
    
    The following style might be better:
    
    ~~~python
    df = sqlContext.createDataFrame(
        [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
        ["label", "raw"])
    ~~~
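
For readers following the docstring under review, here is a rough pure-Python sketch of what "extracts a vocabulary ... and converts text documents to sparse vectors of token counts" means for the two example rows above. It is not the Spark implementation: the helper names `fit_vocabulary` and `transform` are hypothetical, and the alphabetical tie-breaking between equally frequent terms is an assumption made only to keep the result deterministic.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Collect every token and order terms by total corpus frequency.

    Ties are broken alphabetically (an assumption for this sketch).
    """
    totals = Counter(tok for doc in docs for tok in doc)
    return sorted(totals, key=lambda t: (-totals[t], t))


def transform(doc, vocab):
    """Return (size, indices, values): a sparse count vector for one document."""
    index = {term: i for i, term in enumerate(vocab)}
    # Tokens that are not in the vocabulary are ignored.
    counts = Counter(index[tok] for tok in doc if tok in index)
    indices = sorted(counts)
    return len(vocab), indices, [float(counts[i]) for i in indices]


# The same two documents used in the doctest under review.
docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocab = fit_vocabulary(docs)
vectors = [transform(d, vocab) for d in docs]
```

With these inputs the second document maps to counts of 2, 2 and 1 for `a`, `b` and `c`; the term order Spark itself picks when frequencies tie may differ.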


[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138464950
  
    Merged build started.


[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138459077
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-10482] [ML] Add Python interface for ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8650#issuecomment-138459005
  
    [Test build #42122 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42122/consoleFull) for PR 8650 at commit [`d22ba5a`](https://github.com/apache/spark/commit/d22ba5a997aef2f0a21c97bcb2ab2ed7226f770b).

