Posted to reviews@spark.apache.org by zhengruifeng <gi...@git.apache.org> on 2017/05/31 04:10:06 UTC

[GitHub] spark pull request #18154: [SPARK-20932][ML]CountVectorizer support handle p...

GitHub user zhengruifeng opened a pull request:

    https://github.com/apache/spark/pull/18154

    [SPARK-20932][ML]CountVectorizer support handle persistence

    ## What changes were proposed in this pull request?
    Unpersist the RDDs `input` and `wordCounts` once the computation has finished.
    
    ## How was this patch tested?
    existing tests
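
The change described above can be illustrated with a minimal sketch. It is not the code from this pull request: `UnpersistSketch`, `topTokens` and their parameters are hypothetical names, and the sketch only assumes that `input` is scanned once to turn a fractional `minDF` into a document count and again to build `wordCounts`, so both RDDs are cached and then released.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  /** Hypothetical helper: the `vocabSize` most frequent tokens that occur in
    * at least a fraction `minDF` (assumed to be below 1.0) of the documents. */
  def topTokens(docs: RDD[Seq[String]], vocabSize: Int, minDF: Double): Array[String] = {
    // Derive a new RDD so that persist/unpersist never touches the caller's RDD.
    val input = docs.filter(_.nonEmpty).persist(StorageLevel.MEMORY_AND_DISK)
    try {
      // First pass over `input`: convert the fraction into a document count.
      val minDocs = minDF * input.count()

      // Second pass over `input`: (term count, document frequency) per token.
      val wordCounts = input
        .flatMap { tokens =>
          tokens.groupBy(identity).toSeq.map { case (w, ws) => (w, (ws.size.toLong, 1L)) }
        }
        .reduceByKey { case ((c1, d1), (c2, d2)) => (c1 + c2, d1 + d2) }
        .filter { case (_, (_, df)) => df >= minDocs }
        .map { case (w, (count, _)) => (w, count) }
        .persist(StorageLevel.MEMORY_AND_DISK)

      try {
        // `wordCounts` is also used twice: once by count(), once by top().
        val fullVocabSize = wordCounts.count()
        wordCounts
          .top(math.min(fullVocabSize, vocabSize.toLong).toInt)(Ordering.by(_._2))
          .map(_._1)
      } finally {
        wordCounts.unpersist() // release the cached counts
      }
    } finally {
      input.unpersist() // release the cached input
    }
  }
}
```

The `try`/`finally` blocks are a design choice of this sketch, so that the cached blocks are released even if an action fails.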

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark CountVectorizer_unpsersit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18154.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18154
    
----
commit f7c54421ba02e5dead1b638233d008ac6cdad2af
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2017-05-31T04:04:40Z

    create pr

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18154: [SPARK-20932][ML]CountVectorizer support handle persiste...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18154
  
    I don't know ML well enough to review this. I just wanted to check whether it is still in progress.



[GitHub] spark issue #18154: [SPARK-20932][ML]CountVectorizer support handle persiste...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18154
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18154: [SPARK-20932][ML]CountVectorizer support handle persiste...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18154
  
    **[Test build #77572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77572/testReport)** for PR 18154 at commit [`f7c5442`](https://github.com/apache/spark/commit/f7c54421ba02e5dead1b638233d008ac6cdad2af).



[GitHub] spark pull request #18154: [SPARK-20932][ML]CountVectorizer support handle p...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18154#discussion_r121836024
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -154,13 +155,19 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0") override val uid: String)
       override def fit(dataset: Dataset[_]): CountVectorizerModel = {
         transformSchema(dataset.schema, logging = true)
         val vocSize = $(vocabSize)
    +
    +    val handlePersistence = $(minDF) < 1.0 &&
    +      dataset.storageLevel == StorageLevel.NONE
    --- End diff --
    
    I understand your logic, but I don't think it's necessary to make it this complicated. How about we just add the unpersist calls for both `input` and `wordCounts`?
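
For context, the condition in the diff could be used along the following lines. This is only a sketch under stated assumptions: the diff shows the `handlePersistence` flag but not what it guards, so caching the dataset itself is a guess, and `HandlePersistenceSketch`, `shouldCache` and `withOptionalCache` are hypothetical names. `Dataset.storageLevel`, `persist` and `unpersist` are the only Spark calls relied on.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

object HandlePersistenceSketch {
  // Same condition as in the diff: cache only when minDF is a fraction
  // (an extra pass is then needed to count documents) and the caller has
  // not already cached the dataset.
  def shouldCache(minDF: Double, level: StorageLevel): Boolean =
    minDF < 1.0 && level == StorageLevel.NONE

  // Assumption: the dataset itself is what gets cached around the computation.
  def withOptionalCache[T](dataset: Dataset[_], minDF: Double)(body: => T): T = {
    val handlePersistence = shouldCache(minDF, dataset.storageLevel)
    if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)
    try body
    finally {
      // Only undo caching that this method introduced.
      if (handlePersistence) dataset.unpersist()
    }
  }
}
```

The simpler alternative suggested in the comment would drop the flag and unconditionally unpersist the RDDs that `fit` itself cached.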



[GitHub] spark issue #18154: [SPARK-20932][ML]CountVectorizer support handle persiste...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18154
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77572/
    Test PASSed.



[GitHub] spark issue #18154: [SPARK-20932][ML]CountVectorizer support handle persiste...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18154
  
    @zhengruifeng Could you respond to or address the review comments above?



[GitHub] spark issue #18154: [SPARK-20932][ML]CountVectorizer support handle persiste...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18154
  
    @hhbyyh @HyukjinKwon Sorry for the late reply.
    I think it may be better to keep the special-case logic if it performs better.
    What do you think? @yanboliang @HyukjinKwon



[GitHub] spark issue #18154: [SPARK-20932][ML]CountVectorizer support handle persiste...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18154
  
    **[Test build #77572 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77572/testReport)** for PR 18154 at commit [`f7c5442`](https://github.com/apache/spark/commit/f7c54421ba02e5dead1b638233d008ac6cdad2af).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18154: [SPARK-20932][ML]CountVectorizer support handle persiste...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18154
  
    This PR is out of date. I will close it.



[GitHub] spark pull request #18154: [SPARK-20932][ML]CountVectorizer support handle p...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng closed the pull request at:

    https://github.com/apache/spark/pull/18154

