Posted to user@mahout.apache.org by Night Wolf <ni...@gmail.com> on 2011/07/29 18:25:09 UTC

Random Decision Forests - Binary Classification of large data set

Hi all,

I have been playing around with the Random Decision Forests in Mahout. The
classifier seems to produce good results with the test programs.

I am wondering whether this classifier can be used on larger data sets, with
around 35,000 features and 100k+ message instances to classify, on a small
Hadoop cluster or even a single-node development install.

Has anyone used the Random Forest classifier to work with massive data sets
reliably and with high accuracy? My previous experience with the RF model
has been good for sparse data sets, and I think this is one area where Mahout
could really shine. The data sets I'm testing with now are just too large for
tools like Weka and even R to work well, so I was hoping Mahout might be the
answer for this problem as well.
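
To put rough numbers on "too large", here is a back-of-the-envelope sketch of
the memory a 100k x 35,000 matrix would need. The 8-byte values, 4-byte
column indices, and the figure of 35 non-zero features per message are
assumptions made for illustration only, not measurements of the actual data
or of how Weka, R, or Mahout store it.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory for a dense matrix that stores every cell."""
    return n_rows * n_cols * bytes_per_value


def sparse_bytes(n_rows, nonzeros_per_row, bytes_per_value=8, bytes_per_index=4):
    """Memory for a sparse layout that stores one (index, value) pair per non-zero."""
    return n_rows * nonzeros_per_row * (bytes_per_value + bytes_per_index)


rows, cols = 100_000, 35_000

# Dense storage: every one of the 3.5 billion cells takes 8 bytes.
print("dense :", dense_bytes(rows, cols) / 1e9, "GB")

# Sparse storage, assuming (hypothetically) 35 non-zero features per message.
print("sparse:", sparse_bytes(rows, 35) / 1e6, "MB")
```

Under these assumptions the dense form is far beyond a typical workstation's
memory, while the sparse form is small, which is why a sparse representation
matters more than the choice of tool.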

So is it worth working with the Random Forest classifier to get a production
or near-production system running?

Does anyone have any examples and stories of their Mahout RF usage?

Thanks!

Re: Random Decision Forests - Binary Classification of large data set

Posted by Hector Yee <he...@gmail.com>.
I haven't had much luck with random forests (vs. other stuff) in production.
It's harder to control the regularization and thresholds.
If you have 35K features, chances are your data is linearly separable anyway,
so you might as well stick to the logistic regression in Mahout.
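
To make "regularization and thresholds" concrete, here is a minimal sketch of
L2-regularized logistic regression trained by stochastic gradient descent on
sparse feature dictionaries. It is plain Python for illustration and is not
Mahout's implementation or API; the function names, the toy data, and the
default values for learning_rate, l2, and threshold are all assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """x maps feature index -> value; only non-zero features are stored."""
    z = bias + sum(weights.get(i, 0.0) * v for i, v in x.items())
    return sigmoid(z)


def train(examples, epochs=200, learning_rate=0.1, l2=1e-4, seed=0):
    """SGD on log loss. The L2 penalty is applied only to the features
    present in the current example, which is a simplification."""
    rng = random.Random(seed)
    examples = list(examples)
    weights, bias = {}, 0.0
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            err = predict_proba(weights, bias, x) - y
            bias -= learning_rate * err
            for i, v in x.items():
                w = weights.get(i, 0.0)
                weights[i] = w - learning_rate * (err * v + l2 * w)
    return weights, bias


def classify(weights, bias, x, threshold=0.5):
    """The decision threshold is an explicit, tunable knob."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0


# Hypothetical toy data: feature 0 marks class 1, feature 1 marks class 0.
toy = [({0: 1.0, 7: 1.0}, 1), ({0: 1.0, 9: 1.0}, 1),
       ({1: 1.0, 7: 1.0}, 0), ({1: 1.0, 9: 1.0}, 0)]
w, b = train(toy)
print([classify(w, b, x) for x, _ in toy])
```

The two knobs mentioned above appear here as single numbers: l2 controls how
strongly weights are shrunk, and threshold trades precision against recall at
prediction time.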

-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)