Posted to user@spark.apache.org by pkphlam <pk...@gmail.com> on 2015/08/03 07:20:56 UTC

Extremely poor predictive performance with RF in mllib

Hi,

This might be a long shot, but has anybody run into very poor predictive
performance using RandomForest with MLlib? Here is what I'm doing:

- Spark 1.4.1 with PySpark
- Python 3.4.2
- ~30,000 Tweets of text
- 12289 1s and 15956 0s
- Whitespace tokenization and then the hashing trick for feature extraction
with 10,000 features (see the sketch after this list)
- Run RF with 100 trees and maxDepth of 4 and then predict using the
features from all the 1s observations.
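
Roughly, the preprocessing looks like this (a simplified sketch; "tweets" is a
placeholder for my RDD of (label, text) pairs, not the actual variable name in
my script):

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint

htf = HashingTF(numFeatures=10000)

# whitespace tokenization + hashing trick into 10,000-dimensional vectors
lp = tweets.map(lambda x: LabeledPoint(x[0], htf.transform(x[1].split(" "))))

# features of all the 1-labeled observations, used for prediction below
predict_feat = lp.filter(lambda p: p.label == 1.0).map(lambda p: p.features)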

So in theory, I should get close to 12289 predicted 1s (especially if the
model overfits). But I'm getting exactly zero 1s, which sounds ludicrous to
me and makes me suspect that something is wrong with my code or that I'm
missing something. I notice similar behavior (although not as extreme) when I
play around with the settings. But I get normal behavior with other
classifiers, so I don't think my setup is the problem.

For example:

>>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
>>> logit_predict = lrm.predict(predict_feat)
>>> logit_predict.sum()
9077

>>> nb = NaiveBayes.train(lp)
>>> nb_predict = nb.predict(predict_feat)
>>> nb_predict.sum()
10287.0

>>> rf = RandomForest.trainClassifier(lp, numClasses=2,
...     categoricalFeaturesInfo={}, numTrees=100, seed=422)
>>> rf_predict = rf.predict(predict_feat)
>>> rf_predict.sum()
0.0
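
For clarity, the trainClassifier call above leaves the tree parameters at
their defaults; as far as I can tell, maxDepth defaults to 4 in this version,
which matches the setup described above. Written out explicitly, the
equivalent call would be:

rf = RandomForest.trainClassifier(lp, numClasses=2, categoricalFeaturesInfo={},
                                  numTrees=100, maxDepth=4, seed=422)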

This code was all run back to back, so I didn't change anything in between.
Does anybody have a possible explanation for this?

Thanks!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Extremely-poor-predictive-performance-with-RF-in-mllib-tp24112.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Extremely poor predictive performance with RF in mllib

Posted by Yanbo Liang <yb...@gmail.com>.
I can reproduce this issue, so it looks like a bug in Random Forest. I will
try to find some clues.
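
One place to start digging (a rough sketch, reusing the rf model variable from
the original code) is to dump the trained ensemble and check what the trees
actually split on and predict:

print(rf.numTrees())
print(rf.toDebugString())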


Re: Extremely poor predictive performance with RF in mllib

Posted by Patrick Lam <pk...@gmail.com>.
Yes, I rechecked and the labels are correct. As you can see in the code
posted, I used the exact same features for all the classifiers, so unless RF
somehow switches the labels, it should be correct.

I have posted a sample dataset and sample code to reproduce what I'm
getting here:

https://github.com/pkphlam/spark_rfpredict


-- 
Patrick Lam
Institute for Quantitative Social Science, Harvard University
http://www.patricklam.org

Re: Extremely poor predictive performance with RF in mllib

Posted by Yanbo Liang <yb...@gmail.com>.
It looks like the predicted result is just the opposite of what you would
expect, so could you check whether the labels are right?
Or could you share a small sample of the data to help reproduce this output?
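
For example, something like this (a quick sketch, using the variable names
from your code) would show whether the training labels and the predictions
look sane:

# distribution of labels in the training data
# (should be roughly 12289 ones and 15956 zeros)
print(lp.map(lambda p: p.label).countByValue())

# distribution of predicted classes from the forest
print(rf_predict.countByValue())

# a few example records that could be shared
print(lp.take(3))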


Re: Extremely poor predictive performance with RF in mllib

Posted by Barak Gitsis <ba...@similarweb.com>.
Hi,
I've run into some poor RF behavior as well, although not as pronounced as
yours. It would be great to get more insight into this one.

Thanks!

--
-Barak