You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by facaiy <gi...@git.apache.org> on 2017/03/22 06:00:27 UTC

[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

GitHub user facaiy opened a pull request:

    https://github.com/apache/spark/pull/17383

    [SPARK-3165][MLlib][WIP] DecisionTree does not use sparsity in data

    ## What changes were proposed in this pull request?
    
    DecisionTree should take advantage of sparse feature vectors. Aggregation over training data could handle the empty/zero-valued data elements more efficiently.
    
    
    ## How was this patch tested?
    
    Modifying Inner implementation won't change behavior of DecisionTree module,
    hence all unit tests before should pass.
    
    Some performance benchmark perhaps are need.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark ENH/use_sparsity_in_decision_tree

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17383
    
----
commit d2eea0645110b3bcc6c0b905bc55e43e0af9debb
Author: \u989c\u53d1\u624d\uff08Yan Facai\uff09 <fa...@gmail.com>
Date:   2017-03-22T05:45:58Z

    CLN: use Vector to implement binnedFeatures in TreePoint

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not use spars...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17383
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

Posted by facaiy <gi...@git.apache.org>.

Github user facaiy commented on the issue:

    https://github.com/apache/spark/pull/17383
  
    Sure, @WeichenXu123 , perhaps one or two weeks later, is it OK?
    
    By the way, I think using sparse representation can only reduce memory usage, and it is in the cost of compute performance. Hence, it cannot speed up calculation for low-dimension data. The only advantage sparse representation has is for super large dimension in my opinion.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

Posted by facaiy <gi...@git.apache.org>.

GitHub user facaiy reopened a pull request:

    https://github.com/apache/spark/pull/17383

    [SPARK-3165][MLlib][WIP] DecisionTree does not use sparsity in data

    ## What changes were proposed in this pull request?
    
    DecisionTree should take advantage of sparse feature vectors. Aggregation over training data could handle the empty/zero-valued data elements more efficiently.
    
    
    ## How was this patch tested?
    
    Modifying Inner implementation won't change behavior of DecisionTree module,
    hence all unit tests before should pass.
    
    Some performance benchmark perhaps are need.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark ENH/use_sparsity_in_decision_tree

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17383
    
----
commit d2eea0645110b3bcc6c0b905bc55e43e0af9debb
Author: 颜发才（Yan Facai） <fa...@gmail.com>
Date:   2017-03-22T05:45:58Z

    CLN: use Vector to implement binnedFeatures in TreePoint

commit 9ce6b813beffb9d58e7b2907425a1262610256be
Author: 颜发才（Yan Facai） <fa...@gmail.com>
Date:   2017-03-22T09:15:30Z

    BUG: fix for incompatible argument of predictImpl method

commit 37f05f9b0386acc8bea048e72aff2b9c37ca4ca6
Author: 颜发才（Yan Facai） <fa...@gmail.com>
Date:   2017-03-22T09:18:04Z

    CLN: create sparse vector when converting to TreePoint

commit c9664ce6c94b98cbc76253817e637d9a968e4bd6
Author: 颜发才（Yan Facai） <fa...@gmail.com>
Date:   2017-03-22T09:21:59Z

    CLN: change Array to Vector in TreePoint when created

commit d6ef9e512ea4a58db2dccf3e7cca95f9e8b0df8f
Author: 颜发才（Yan Facai） <fa...@gmail.com>
Date:   2017-03-23T02:12:22Z

    PREP: use Vector[Int] to store binnedFeature

commit 59eb779a9d4f711e7b28d31d579cc49e3d3cc370
Author: 颜发才（Yan Facai） <fa...@gmail.com>
Date:   2017-03-23T03:50:14Z

    CLN: change binnedFeatures from def to val

commit 9cbe577b408e987f3026d01316f5a7f2d4c5cfb2
Author: 颜发才（Yan Facai） <fa...@gmail.com>
Date:   2017-03-28T00:57:42Z

    CLN: use filter to select non-zero bits

commit b5b0dc8683b6e2d7d274aa8d39932dec61e6193d
Author: 颜发才（Yan Facai） <fa...@gmail.com>
Date:   2017-03-28T01:03:55Z

    BUG: fix, compile fails

commit cf7e3d8e03f73df725336d0d5a9dd6cc16e7bf95
Author: Yan Facai (颜发才) <fa...@gmail.com>
Date:   2017-07-05T05:42:09Z

    Merge branch 'master' into ENH/use_sparsity_in_decision_tree

commit 032d50d8c8a851671ba2754cec817d0f6e9ae70f
Author: Yan Facai (颜发才) <fa...@gmail.com>
Date:   2017-07-05T06:20:38Z

    CLN: use BSV in predictImpl

commit 257ddf773eb47499962d6cc57fd1323324dd4ab8
Author: Yan Facai (颜发才) <fa...@gmail.com>
Date:   2017-07-05T06:42:24Z

    ENH: create subclass TreeSparsePoint

commit 8a919735f9474283d263df78feb2e176f66917f3
Author: Yan Facai (颜发才) <fa...@gmail.com>
Date:   2017-07-05T06:58:54Z

    ENH: use TreeDensePoint when numFeatures < 10000

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

Posted by facaiy <gi...@git.apache.org>.

Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17383#discussion_r108318833
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala ---
    @@ -301,7 +302,7 @@ private[tree] class LearningNode(
        *         group of nodes on one call to
        *         [[org.apache.spark.ml.tree.impl.RandomForest.findBestSplits()]].
        */
    -  def predictImpl(binnedFeatures: Array[Int], splits: Array[Array[Split]]): Int = {
    +  def predictImpl(binnedFeatures: Int => Int, splits: Array[Array[Split]]): Int = {
    --- End diff --
    
    when use `breeze.linalg.SparseVector`, compile fails:
    ```scala
    java.lang.AssertionError: assertion failed: List(method apply$mcI$sp, method apply$mcI$sp)
            at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1947)
            at scala.tools.nsc.transform.SpecializeTypes$$anon$2.matchingSymbolInPrefix$1(SpecializeTypes.scala:1474)
            at scala.tools.nsc.transform.SpecializeTypes$$anon$2.transformSelect$1(SpecializeTypes.scala:1497)
            at scala.tools.nsc.transform.SpecializeTypes$$anon$2.transform1(SpecializeTypes.scala:1593)
            at scala.tools.nsc.transform.SpecializeTypes$$anon$2$$anonfun$transform$3.apply(SpecializeTypes.scala:1442)
            ......
    ```
    
    I don't know why? Can anyone help? 
    Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

Posted by facaiy <gi...@git.apache.org>.

Github user facaiy commented on the issue:

    https://github.com/apache/spark/pull/17383
  
    Hi, since the work has been done for a long time, I take a review by myself. 
    
    After careful review, as SparseVector is compressed sparse row format, so the only benefit of the PR would be for data storage but in the cost of performance. But for tree-method, it is uncommon to handle a super large dimension features. Hence, it cannot satisfy me.
    
    I prefer to [SPARK-3717: DecisionTree, RandomForest: Partition by feature](https://issues.apache.org/jira/browse/SPARK-3717) as an alternative, which will be benefits in both performance and storage if I understand correctly. So the PR is closed. Thank everyone for review / comment.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

Posted by facaiy <gi...@git.apache.org>.

Github user facaiy commented on the issue:

    https://github.com/apache/spark/pull/17383
  
    Thank you for comment. 
    
    Very good question, at least for me, the answer to both questions is no. In most case, we feed dense raw data into tree model. However, if large dimensions required, I believe the sparse representation is necessary.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not use spars...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17383
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

Posted by facaiy <gi...@git.apache.org>.

Github user facaiy closed the pull request at:

    https://github.com/apache/spark/pull/17383


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree does not use sparsity i...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17383
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

Posted by facaiy <gi...@git.apache.org>.

Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17383#discussion_r108318997
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala ---
    @@ -301,7 +302,7 @@ private[tree] class LearningNode(
        *         group of nodes on one call to
        *         [[org.apache.spark.ml.tree.impl.RandomForest.findBestSplits()]].
        */
    -  def predictImpl(binnedFeatures: Array[Int], splits: Array[Array[Split]]): Int = {
    +  def predictImpl(binnedFeatures: Int => Int, splits: Array[Array[Split]]): Int = {
    --- End diff --
    
    when use `breeze.linalg.SparseVector`, compile fails:
    ```scala
    java.lang.AssertionError: assertion failed: List(method apply$mcI$sp, method apply$mcI$sp)
            at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1947)
            at scala.tools.nsc.transform.SpecializeTypes$$anon$2.matchingSymbolInPrefix$1(SpecializeTypes.scala:1474)
            at scala.tools.nsc.transform.SpecializeTypes$$anon$2.transformSelect$1(SpecializeTypes.scala:1497)
            at scala.tools.nsc.transform.SpecializeTypes$$anon$2.transform1(SpecializeTypes.scala:1593)
            at scala.tools.nsc.transform.SpecializeTypes$$anon$2$$anonfun$transform$3.apply(SpecializeTypes.scala:1442)
            ......
    ```
    
    I don't know why? Can anyone help? 
    Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/17383
  
    @facaiy So can you do benchmark first (by generating random testing data) ?  So we can see how much this can speed up.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in ...

Posted by facaiy <gi...@git.apache.org>.

Github user facaiy closed the pull request at:

    https://github.com/apache/spark/pull/17383


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org