Posted to reviews@spark.apache.org by dikejiang <gi...@git.apache.org> on 2014/12/03 13:28:54 UTC

[GitHub] spark pull request: [mllib] [random forest] functions returning th...

GitHub user dikejiang opened a pull request:

    https://github.com/apache/spark/pull/3583

    [mllib] [random forest] functions returning the category with weights

    This change adds two functions, 1) predictByVotingWithWeight(features: Vector) and 2) predictWithWeight(features: Vector), and modifies the existing function predictByVoting(features: Vector).
    
    There are at least two reasons for this improvement:
    
    1) In practice, we want to find the top N samples from one category. However, in version 1.3.0, predict returns only the predicted category, without a weight.
    
    2) In our data, the positive and negative samples are highly unbalanced: there are far fewer positive samples than negative samples. As a result, voting predicts very few samples as positive. If the weights are also returned, users can choose a suitable threshold to adjust the results and improve performance.
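
As a rough illustration of the two motivations above, here is a minimal sketch of voting with a weight and of thresholding on the positive vote share. It is not the code in this PR: the Tree type, the object name, voteShare, predictPositive, and the tie-breaking rule are assumptions made only for this example, and plain arrays stand in for the MLlib Vector type.

```scala
object VotingWithWeightSketch {
  // Hypothetical stand-in for a trained tree: maps a feature array to a class label.
  type Tree = Array[Double] => Int

  /** Returns the winning label and the fraction of trees that voted for it. */
  def predictByVotingWithWeight(trees: Seq[Tree], features: Array[Double]): (Int, Double) = {
    require(trees.nonEmpty, "need at least one tree")
    // Count votes per label.
    val votes = trees.map(_(features)).groupBy(identity).map { case (label, vs) => (label, vs.size) }
    // Most votes first; ties go to the smaller label so the result is deterministic (an assumption).
    val (label, count) = votes.toSeq.sortBy { case (l, c) => (-c, l) }.head
    (label, count.toDouble / trees.size)
  }

  /** Fraction of trees that voted for the given label. */
  def voteShare(trees: Seq[Tree], features: Array[Double], label: Int): Double = {
    require(trees.nonEmpty, "need at least one tree")
    trees.count(_(features) == label).toDouble / trees.size
  }

  /** Predicts the positive class (label 1) when its vote share reaches a user-chosen threshold. */
  def predictPositive(trees: Seq[Tree], features: Array[Double], threshold: Double): Boolean =
    voteShare(trees, features, 1) >= threshold
}
```

With unbalanced data, a threshold below 0.5 lets a sample count as positive even when it loses the majority vote, and sorting samples by voteShare gives a top N ranking within one category.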

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dikejiang/spark 20141203

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3583.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3583
    
----
commit c45247094016ff89829ce3ded74e8c29a7eeb878
Author: dikejiang <di...@yeah.net>
Date:   2014-12-03T12:23:24Z

    functions returning the category with weights
    
    This change adds two functions, 1) predictByVotingWithWeight(features: Vector) and 2) predictWithWeight(features: Vector), and modifies the existing function predictByVoting(features: Vector).
    
    There are at least two reasons for this improvement:
    
    1) In practice, we want to find the top N samples from one category. However, in version 1.3.0, predict returns only the predicted category, without a weight.
    
    2) In our data, the positive and negative samples are highly unbalanced: there are far fewer positive samples than negative samples. As a result, voting predicts very few samples as positive. If the weights are also returned, users can choose a suitable threshold to adjust the results and improve performance.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4736][mllib] [random forest] functions ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3583#issuecomment-66563683
  
    @dikejiang  Great, thanks!


[GitHub] spark pull request: [mllib] [random forest] functions returning th...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3583#issuecomment-65415563
  
    @dikejiang Do you mind creating a JIRA and adding the JIRA number to the PR title? Thanks!


[GitHub] spark pull request: [SPARK-4736][mllib] [random forest] functions ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3583#issuecomment-77604105
  
    @dikejiang  Do you still plan to update this PR to return a Vector of probabilities?  I'm planning a major reorganization of trees & ensembles APIs here: [https://issues.apache.org/jira/browse/SPARK-6113]
    I don't want it to mess up your PR; we could either finish up this PR soon, or we could wait until the API update (which should help by making the proper API clearer).


[GitHub] spark pull request: [SPARK-4736][mllib] [random forest] functions ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3583#issuecomment-96770011
  
    Can one of the admins verify this patch?


[GitHub] spark pull request: [SPARK-4736][mllib] [random forest] functions ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3583#issuecomment-66348700
  
    @dikejiang Thanks for the PR!  I'm wondering if you'd be interested in a more general API.  In the new experimental ML package, I have a PR [https://www.github.com/apache/spark/pull/3637] which introduces a few prediction methods, one of which is:
    ```
    def predictRaw(features: Vector): Vector // for each label, predict a confidence
    ```
    What do you think of using this instead of only predicting the top label's weight?  Eventually, confidence predictions could be improved by incorporating each tree's confidence in its prediction (rather than having each tree simply vote for a single label, as is done now).
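
To make the difference between the two kinds of confidence concrete, here is a minimal sketch of a predictRaw-style method for a forest. It is not the implementation in either PR: the SoftTree type, the object name, and both method names are assumptions for this example, and plain arrays stand in for the MLlib Vector type. The first method has each tree vote for a single label; the second sums each tree's per-label confidence.

```scala
object PredictRawSketch {
  // Hypothetical stand-in for a tree: returns one confidence value per label.
  type SoftTree = Array[Double] => Array[Double]

  /** Hard voting: each tree adds one vote to its most confident label. */
  def predictRawByVotes(trees: Seq[SoftTree], features: Array[Double], numClasses: Int): Array[Double] = {
    val raw = Array.fill(numClasses)(0.0)
    trees.foreach { tree =>
      val dist = tree(features)
      require(dist.length == numClasses, "one confidence value per label expected")
      // indexOf returns the first maximum, so ties go to the smaller label.
      raw(dist.indexOf(dist.max)) += 1.0
    }
    raw
  }

  /** Soft voting: each tree adds its full per-label confidence. */
  def predictRawByConfidence(trees: Seq[SoftTree], features: Array[Double], numClasses: Int): Array[Double] = {
    val raw = Array.fill(numClasses)(0.0)
    trees.foreach { tree =>
      val dist = tree(features)
      require(dist.length == numClasses, "one confidence value per label expected")
      for (i <- 0 until numClasses) raw(i) += dist(i)
    }
    raw
  }
}
```

The two methods can disagree: two trees that weakly prefer label 0 outvote one tree that is certain of label 1 under hard voting, while the summed confidences may favor label 1.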


[GitHub] spark pull request: [SPARK-4736][mllib] [random forest] functions ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3583#issuecomment-96780105
  
    @dikejiang  This work is now being done here: [https://issues.apache.org/jira/browse/SPARK-3727]
    Can you please close this PR?
    
    If you still want to work on this task, please coordinate on the JIRA I linked. Thanks!


[GitHub] spark pull request: [mllib] [random forest] functions returning th...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3583#issuecomment-65400376
  
    Can one of the admins verify this patch?


[GitHub] spark pull request: [SPARK-4736][mllib] [random forest] functions ...

Posted by dikejiang <gi...@git.apache.org>.
Github user dikejiang commented on the pull request:

    https://github.com/apache/spark/pull/3583#issuecomment-66735373
  
    @mengxr OK to go?


[GitHub] spark pull request: [SPARK-4736][mllib] [random forest] functions ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3583#issuecomment-67049092
  
    @dikejiang  Apologies--I think I was not clear.  I was recommending that you change this PR to implement predictRaw(), rather than predictWithWeight().  Does that sound reasonable?  Since predictRaw gives more info than predictWithWeight, it seems best to only include predictRaw.  Thanks!
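
One way to see why a raw per-label vector gives more information: the top label and its weight can be recovered from the vector, but not the reverse. The sketch below assumes the raw vector holds non-negative per-label scores such as vote counts. The object and method names are hypothetical, and a plain array stands in for the MLlib Vector type.

```scala
object RawToWeightSketch {
  /** Normalizes non-negative raw scores so that they sum to 1. */
  def normalize(raw: Array[Double]): Array[Double] = {
    val total = raw.sum
    require(total > 0.0, "raw scores must have a positive sum")
    raw.map(_ / total)
  }

  /** Returns the top label and its normalized weight. */
  def topLabelWithWeight(raw: Array[Double]): (Int, Double) = {
    val probs = normalize(raw)
    // indexOf returns the first maximum, so ties go to the smaller label.
    val label = probs.indexOf(probs.max)
    (label, probs(label))
  }
}
```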


[GitHub] spark pull request: [SPARK-4736][mllib] [random forest] functions ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3583

