You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by chouqin <gi...@git.apache.org> on 2014/08/28 10:25:13 UTC
[GitHub] spark pull request: Dt predict
GitHub user chouqin opened a pull request:
https://github.com/apache/spark/pull/2180
Dt predict
In current implementation, prediction for a node is calculated along with calculation of information gain stats for each possible splits. The value to predict for a specific node is determined, no matter what the splits are.
To save computation, we can first calculate prediction first and then calculate information gain stats for each split.
This is also necessary if we want to support minimum instances per node parameters([SPARK-2207](https://issues.apache.org/jira/browse/SPARK-2207)) because when all splits don't satisfy minimum instances requirement , we don't use information gain of any splits. There should be a way to get the prediction value.
This PR also removes unused function `nodeIndexToLevel`.
CC: @mengxr @manishamde @jkbradley, do you think this is really necessary?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/chouqin/spark dt-predict
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2180.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2180
----
commit 0552c7e798f5d62b74511372c0d38e08e50e6bac
Author: qiping.lqp <qi...@alibaba-inc.com>
Date: 2014-08-28T08:03:55Z
separate calculation of predict of node from calculation of info gain of splits
commit c205eb8775a8dabfd567501972e2c9732c2fe80a
Author: qiping.lqp <qi...@alibaba-inc.com>
Date: 2014-08-28T08:05:20Z
commit Predict.scala
commit d92b3d47666e1c907222605b873172ef4a2c770c
Author: qiping.lqp <qi...@alibaba-inc.com>
Date: 2014-08-28T08:19:59Z
fix decision tree suite
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...
Posted by chouqin <gi...@git.apache.org>.
Github user chouqin closed the pull request at:
https://github.com/apache/spark/pull/2180
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...
Posted by chouqin <gi...@git.apache.org>.
Github user chouqin commented on the pull request:
https://github.com/apache/spark/pull/2180#issuecomment-54944410
Close this PR and move to #2332
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...
Posted by chouqin <gi...@git.apache.org>.
Github user chouqin commented on a diff in the pull request:
https://github.com/apache/spark/pull/2180#discussion_r16832992
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -34,9 +34,9 @@ import org.apache.spark.mllib.regression.LabeledPoint
class DecisionTreeSuite extends FunSuite with LocalSparkContext {
def validateClassifier(
- model: DecisionTreeModel,
- input: Seq[LabeledPoint],
- requiredAccuracy: Double) {
+ model: DecisionTreeModel,
+ input: Seq[LabeledPoint],
--- End diff --
Thanks, this is my fault, I will fix style soon
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2180#issuecomment-54694383
Can one of the admins verify this patch?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: Dt predict
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2180#issuecomment-53687434
Can one of the admins verify this patch?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: Dt predict
Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on a diff in the pull request:
https://github.com/apache/spark/pull/2180#discussion_r16827247
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -885,7 +887,7 @@ object DecisionTreeSuite {
}
def generateCategoricalDataPointsForMulticlassForOrderedFeatures():
- Array[LabeledPoint] = {
+ Array[LabeledPoint] = {
--- End diff --
Same here.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: Dt predict
Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:
https://github.com/apache/spark/pull/2180#issuecomment-53688602
I can not say anything about the usefulness of the patch. But we follow the spark style guide across our code base. https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2180#issuecomment-53746339
@chouqin Thanks for observing that we can sometimes avoid calculating the prediction and/or the info gain. I'm worried that this won't really change the scaling of the algorithm much since calculating the prediction is a low-cost operation. (This computation is done on the master node, and for any reasonable size dataset, the time spent on the master node is negligible compared to the time spent on the treeAggregate() call.)
I'm also worried about this PR clashing with the current DecisionTree PR: [https://github.com/apache/spark/pull/2125], which moves the calculation of predictions into separate Impurity* classes. Would it be possible to update this once [https://github.com/apache/spark/pull/2125] has gone through?
At that time, I think this PR could be simplified a bit by removing the Predict class. InformationGainStats.predict already holds the prediction, and InformationGainStats.gain can be computed or ignored as needed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-3272][MLLib]Calculate prediction for no...
Posted by chouqin <gi...@git.apache.org>.
Github user chouqin commented on the pull request:
https://github.com/apache/spark/pull/2180#issuecomment-53706696
@ScrapCodes thanks for you comments, I have changed indentation to meet the spark style guide just now.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: Dt predict
Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on a diff in the pull request:
https://github.com/apache/spark/pull/2180#discussion_r16827224
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -34,9 +34,9 @@ import org.apache.spark.mllib.regression.LabeledPoint
class DecisionTreeSuite extends FunSuite with LocalSparkContext {
def validateClassifier(
- model: DecisionTreeModel,
- input: Seq[LabeledPoint],
- requiredAccuracy: Double) {
+ model: DecisionTreeModel,
+ input: Seq[LabeledPoint],
--- End diff --
According to style guide for spark this change should be reverted.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: Dt predict
Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on a diff in the pull request:
https://github.com/apache/spark/pull/2180#discussion_r16827232
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -47,9 +47,9 @@ class DecisionTreeSuite extends FunSuite with LocalSparkContext {
}
def validateRegressor(
- model: DecisionTreeModel,
- input: Seq[LabeledPoint],
- requiredMSE: Double) {
+ model: DecisionTreeModel,
--- End diff --
same here.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org