Posted to user@spark.apache.org by SK <sk...@gmail.com> on 2014/07/11 23:25:48 UTC

Decision tree classifier in MLlib

Hi,

I have a small dataset (120 training points, 30 test points) that I am
trying to classify into two classes (1 or 0). The dataset has 4 numerical
features and 1 binary label (1 or 0).

I used LogisticRegression and SVM in MLlib and got 100% accuracy in both
cases. But with DecisionTree I get only 33% accuracy (essentially all the
predicted test labels are 1, whereas only 10 of the 30 should be 1). I tried
modifying the different parameters (maxDepth, bins, impurity, etc.) and
still get only 33% accuracy.
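
As a quick sanity check on the 33% figure: a model that predicts 1 for every
test point is correct only on the 10 true positives out of 30. A minimal
Python sketch of that arithmetic follows. The label lists are made up and
only mirror the counts described above; they are not the actual dataset, and
the accuracy helper is a hypothetical name used for illustration.

```python
# Hypothetical labels mirroring the counts in the post: 10 ones, 20 zeros.
actual = [1] * 10 + [0] * 20
# A degenerate model that predicts class 1 for every test point.
predicted = [1] * len(actual)


def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


print(f"accuracy = {accuracy(predicted, actual):.2f}")  # prints "accuracy = 0.33"
```

An accuracy of exactly 10/30 is therefore consistent with the tree collapsing
to a single constant prediction rather than with a partially useful model.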

I used the same dataset with R's decision tree (rpart) and got 100%
accuracy. I would like to understand why MLlib's decision tree model
performs poorly and whether there is some way I can improve it.

Thanks



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Decision-tree-classifier-in-MLlib-tp9457.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Decision tree classifier in MLlib

Posted by Joseph Bradley <jo...@databricks.com>.

Hi Sudha,
Have you checked if the labels are being loaded correctly?  It sounds like
the DT algorithm can't find any useful splits to make, so maybe it thinks
they are all the same?  Some data loading functions threshold labels to
make them binary.
Hope it helps,
Joseph
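
One way to act on the suggestion above is to count the distinct labels after
loading, before any training. The sketch below is plain Python with made-up
rows; the helper names (label_counts, looks_degenerate) and the sample values
are assumptions for illustration, not part of MLlib.

```python
from collections import Counter


def label_counts(points):
    """Count how often each label occurs in (label, features) pairs."""
    return Counter(label for label, _features in points)


def looks_degenerate(points):
    """True when every point carries the same label, so no split can help."""
    return len(label_counts(points)) <= 1


# Hypothetical parsed rows: (label, [4 numeric features]).
good = [(1.0, [5.1, 3.5, 1.4, 0.2]), (0.0, [6.2, 2.9, 4.3, 1.3])]
# What a loading step might produce if thresholding maps every label to 1.0.
bad = [(1.0, features) for _label, features in good]

print(label_counts(good))      # both labels should appear
print(looks_degenerate(bad))   # only one label survives the loader
```

If the data is already in an RDD of LabeledPoint, a similar count could
probably be obtained with something like
rdd.map(lambda p: p.label).countByValue(), assuming the PySpark API is in use.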



Re: Decision tree classifier in MLlib

Posted by "Evan R. Sparks" <ev...@gmail.com>.

Can you share the dataset via a gist or something similar so we can take a
look at what's going on?


On Fri, Jul 25, 2014 at 10:51 AM, SK <sk...@gmail.com> wrote:

> yes, the output  is continuous. So I used a threshold to get binary labels.
> If prediction < threshold, then class is 0 else 1. I use this binary label
> to then compute the accuracy. Even with this binary transformation, the
> accuracy with decision tree model is low compared to LR or SVM (for the
> specific dataset I used).

Re: Decision tree classifier in MLlib

Posted by SK <sk...@gmail.com>.

Yes, the output is continuous, so I used a threshold to get binary labels:
if prediction < threshold, the class is 0, else 1. I then use this binary
label to compute the accuracy. Even with this binary transformation, the
accuracy of the decision tree model is low compared to LR or SVM (for the
specific dataset I used).
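
The thresholding step described above can be sketched as follows. This is a
plain-Python illustration, not the code actually used: the scores, the true
labels, the default threshold of 0.5, and the helper names are all
assumptions.

```python
def to_binary(scores, threshold=0.5):
    """Map each continuous score to 0 if it is below the threshold, else 1."""
    return [0 if s < threshold else 1 for s in scores]


def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical continuous model outputs and the corresponding true labels.
scores = [0.1, 0.8, 0.5, 0.3]
actual = [0, 1, 1, 0]

predicted = to_binary(scores)
print(predicted)
print(accuracy(predicted, actual))
```

Printing the raw scores next to the thresholded labels is a cheap way to see
whether the model output is nearly constant, in which case no threshold
choice would change the accuracy much.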



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Decision-tree-classifier-in-MLlib-tp9457p10678.html