You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by chouqin <gi...@git.apache.org> on 2014/08/28 10:25:13 UTC

[GitHub] spark pull request: Dt predict

GitHub user chouqin opened a pull request:

    https://github.com/apache/spark/pull/2180

    Dt predict

    In current implementation, prediction for a node is calculated along with calculation of information gain stats for each possible splits. The value to predict for a specific node is determined, no matter what the splits are.
    To save computation, we can first calculate prediction first and then calculate information gain stats for each split.
    
    This is also necessary if we want to support minimum instances per node parameters([SPARK-2207](https://issues.apache.org/jira/browse/SPARK-2207)) because when all splits don't satisfy minimum instances requirement , we don't use information gain of any splits. There should be a way to get the prediction value.
    
    This PR also removes unused function `nodeIndexToLevel`.
    
    CC: @mengxr @manishamde @jkbradley, do you think this is really necessary?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chouqin/spark dt-predict

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2180.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2180
    
----
commit 0552c7e798f5d62b74511372c0d38e08e50e6bac
Author: qiping.lqp <qi...@alibaba-inc.com>
Date:   2014-08-28T08:03:55Z

    separate calculation of predict of node from calculation of info gain of splits

commit c205eb8775a8dabfd567501972e2c9732c2fe80a
Author: qiping.lqp <qi...@alibaba-inc.com>
Date:   2014-08-28T08:05:20Z

    commit Predict.scala

commit d92b3d47666e1c907222605b873172ef4a2c770c
Author: qiping.lqp <qi...@alibaba-inc.com>
Date:   2014-08-28T08:19:59Z

    fix decision tree suite

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...

Posted by chouqin <gi...@git.apache.org>.
Github user chouqin closed the pull request at:

    https://github.com/apache/spark/pull/2180


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...

Posted by chouqin <gi...@git.apache.org>.
Github user chouqin commented on the pull request:

    https://github.com/apache/spark/pull/2180#issuecomment-54944410
  
    Close this PR and move to #2332 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...

Posted by chouqin <gi...@git.apache.org>.
Github user chouqin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2180#discussion_r16832992
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -34,9 +34,9 @@ import org.apache.spark.mllib.regression.LabeledPoint
     class DecisionTreeSuite extends FunSuite with LocalSparkContext {
     
       def validateClassifier(
    -      model: DecisionTreeModel,
    -      input: Seq[LabeledPoint],
    -      requiredAccuracy: Double) {
    +                          model: DecisionTreeModel,
    +                          input: Seq[LabeledPoint],
    --- End diff --
    
    Thanks, this is my fault, I will fix style soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2180#issuecomment-54694383
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Dt predict

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2180#issuecomment-53687434
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Dt predict

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2180#discussion_r16827247
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -885,7 +887,7 @@ object DecisionTreeSuite {
       }
     
       def generateCategoricalDataPointsForMulticlassForOrderedFeatures():
    -    Array[LabeledPoint] = {
    +  Array[LabeledPoint] = {
    --- End diff --
    
    Same here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Dt predict

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:

    https://github.com/apache/spark/pull/2180#issuecomment-53688602
  
    I can not say anything about the usefulness of the patch. But we follow the spark style guide across our code base. https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2180#issuecomment-53746339
  
    @chouqin Thanks for observing that we can sometimes avoid calculating the prediction and/or the info gain.  I'm worried that this won't really change the scaling of the algorithm much since calculating the prediction is a low-cost operation.  (This computation is done on the master node, and for any reasonable size dataset, the time spent on the master node is negligible compared to the time spent on the treeAggregate() call.)
    
    I'm also worried about this PR clashing with the current DecisionTree PR: [https://github.com/apache/spark/pull/2125], which moves the calculation of predictions into separate Impurity* classes.  Would it be possible to update this once [https://github.com/apache/spark/pull/2125] has gone through?
    
    At that time, I think this PR could be simplified a bit by removing the Predict class.  InformationGainStats.predict already holds the prediction, and InformationGainStats.gain can be computed or ignored as needed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...

Posted by chouqin <gi...@git.apache.org>.
Github user chouqin commented on the pull request:

    https://github.com/apache/spark/pull/2180#issuecomment-53706696
  
    @ScrapCodes thanks for you comments, I have changed indentation to meet the spark style guide just now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Dt predict

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2180#discussion_r16827224
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -34,9 +34,9 @@ import org.apache.spark.mllib.regression.LabeledPoint
     class DecisionTreeSuite extends FunSuite with LocalSparkContext {
     
       def validateClassifier(
    -      model: DecisionTreeModel,
    -      input: Seq[LabeledPoint],
    -      requiredAccuracy: Double) {
    +                          model: DecisionTreeModel,
    +                          input: Seq[LabeledPoint],
    --- End diff --
    
    According to style guide for spark this change should be reverted.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Dt predict

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2180#discussion_r16827232
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -47,9 +47,9 @@ class DecisionTreeSuite extends FunSuite with LocalSparkContext {
       }
     
       def validateRegressor(
    -      model: DecisionTreeModel,
    -      input: Seq[LabeledPoint],
    -      requiredMSE: Double) {
    +                         model: DecisionTreeModel,
    --- End diff --
    
    same here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org