You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by chouqin <gi...@git.apache.org> on 2014/08/14 10:12:42 UTC

[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

GitHub user chouqin opened a pull request:

    https://github.com/apache/spark/pull/1941

    [SPARK-3022] [mllib] FindBinsForLevel in decision tree should call findBin only once for each feature

    `findbinsForLevel` is applied to every `LabeledPoint` to find bins for all nodes at a given level. Given a specific `LabeledPoint` and a specific feature, the bin to put this labeled point should always be same.But in current implementation, `findBin` on a (labeledpoint, feature) pair is called for all nodes and all levels, which is a waste of computation.
    
    In my implementation, `findBin` for each (labeledpoint, feature) pair is executed only once before the start of level-wise training of decision tree. Then, at each level, this `feature2bin` array can be reused.
    
    What's more, `findbinsForLevel` now return a array of smaller size, all the nodes on which this labeledPoint is valid share the same `feature2bin` array, instead of each node having a copy of it.
    
    CC: @mengxr @manishamde @jkbradley,  Please have a look at this, thanks.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chouqin/spark dt-findbins

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1941.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1941
    
----
commit 065a42c21c5810c2a09a125b39e8d56c38a18ebc
Author: qiping.lqp <qi...@alibaba-inc.com>
Date:   2014-08-14T06:06:43Z

    improve decision tree: findbins called only once for each feature

commit 45549f777eafe166bfb49e606ac12c3faec5f965
Author: qiping.lqp <qi...@alibaba-inc.com>
Date:   2014-08-14T06:49:44Z

    fix unit test for decision tree

commit 4a7bc0bab5a31712ebc8a5f37349a125d0d125b0
Author: qiping.lqp <qi...@alibaba-inc.com>
Date:   2014-08-14T06:58:02Z

    fix style: line length doesn't exceed 100.

commit 59b67819cd0a256beae6e2ca67cfebfb20b690d7
Author: qiping.lqp <qi...@alibaba-inc.com>
Date:   2014-08-14T07:36:24Z

    add comments

commit 4ef9e7fd53f6cefc998def2be2e08e0c22ec9e7a
Author: qiping.lqp <qi...@alibaba-inc.com>
Date:   2014-08-14T07:51:33Z

    fix indentation

commit f1a20e3b347fb3df77da9e51d50a7e0d54c0a75e
Author: qiping.lqp <qi...@alibaba-inc.com>
Date:   2014-08-14T07:57:57Z

    fix indentation too

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/1941#issuecomment-52188362
  
    Thanks for the PR @chouqin. The redundant findBin calculation should definitely be performed once and it will definitely speedup the computation. A couple of thoughts after looking at your implementation:
    1. Do you need the store the original features along with the bins?
    2. You are using Int for bin ids. You can pack them tightly as Byte if the number of bins are less than 256. One can use multiple Byte format for other bins sizes.
    
    I have an implementation similar to yours
    https://github.com/manishamde/spark/compare/ent
    
    A slight difference is that I am creating an internal TreePoint class that can store the bin mapping and this class is extended while performing Random Forest computation. Finally, I think @jkbradley is working on more optimizations on top these changes. I will let him elaborate on that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1941#issuecomment-52247961
  
    @chouqin My apologies as well.  But I hope you find the soon-to-follow PRs useful, with additional optimizations.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

Posted by emef <gi...@git.apache.org>.
Github user emef commented on the pull request:

    https://github.com/apache/spark/pull/1941#issuecomment-52217523
  
    @manishamde I've been experimenting with a gradient boosting implementation that would definitely benefit from having the labeled point conversion done once.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

Posted by chouqin <gi...@git.apache.org>.
Github user chouqin commented on the pull request:

    https://github.com/apache/spark/pull/1941#issuecomment-52277965
  
    @mengxr @jkbradley never mind, I will help you review @1950 :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

Posted by chouqin <gi...@git.apache.org>.
Github user chouqin commented on the pull request:

    https://github.com/apache/spark/pull/1941#issuecomment-52279909
  
    I close this PR now and focus on #1950 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1941#issuecomment-52244001
  
    @chouqin Thanks for optimizing decision tree! As @manishamde mentioned, @jkbradley has been working on decision tree optimization and bug fixes, including this one and several others. Considering his following PRs will based on his version #1950, do you mind helping review his code?
    
    Btw, I'm fully responsible for duplicated efforts in MLlib. The correct procedure of open-source contribution should be: 1) create a JIRA and describe and discuss to work to be done, 2) get assigned for the work, 3) submit a PR. However, few of us follows this procedure closely. Usually a JIRA is created just before submitting the PR, this caused duplicated efforts. I will try to do a better job at this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

Posted by chouqin <gi...@git.apache.org>.
Github user chouqin closed the pull request at:

    https://github.com/apache/spark/pull/1941


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1941#issuecomment-52155272
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/1941#issuecomment-52231336
  
    @emef Yup. GBT, AdaBoost and RF implementations will also benefits from this LabeledPointConversion. Each will extend the TreePoint class in different ways: 1) GBT will add pseudo-residuals, 2) AdaBoost will add sample weights, and 3) RF will add poisson resampled weights for trees. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org