Posted to user@mahout.apache.org by ey-chih chow <ey...@gmail.com> on 2013/04/07 02:45:03 UTC

Re: Classification Algorithms in Mahout

I actually got a lot of overfitting.  The parameter that I can adjust is
minSplitNum.  Are there any other parameters that I can adjust to avoid
overfitting?  Thanks.

Ey-Chih
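
For readers following the overfitting question, here is a minimal sketch of how a minimum-split-size rule limits tree growth. The thread does not define minSplitNum; the sketch assumes it is the minimum number of samples a node must hold before it may be split. The class and method names are hypothetical and are not Mahout's API, and the even halving of each node is an idealization.

```java
/** Hypothetical illustration only; this is not Mahout code. */
class MinSplitSketch {

    /**
     * Counts the leaves of an idealized tree that halves every node
     * until the node holds fewer than minSplitNum samples (assumed meaning).
     */
    static int countLeaves(int numSamples, int minSplitNum) {
        // A node with too few samples (or a single sample) becomes a leaf.
        if (numSamples < minSplitNum || numSamples < 2) {
            return 1;
        }
        int left = numSamples / 2;
        int right = numSamples - left;
        return countLeaves(left, minSplitNum) + countLeaves(right, minSplitNum);
    }

    public static void main(String[] args) {
        int numSamples = 1000;
        for (int minSplitNum : new int[] {2, 10, 50, 200}) {
            System.out.println("minSplitNum=" + minSplitNum
                    + " leaves=" + countLeaves(numSamples, minSplitNum));
        }
    }
}
```

Under this assumption, a larger value yields fewer and larger leaves, so each tree fits the training data less closely.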


On Wed, Mar 27, 2013 at 3:12 PM, Andy Twigg <an...@gmail.com> wrote:

> Dear Ey-Chih,
>
> What are your use cases for a better random forest?
>
> On 27 March 2013 11:59, Yutaka Mandai <20...@gmail.com> wrote:
> > My understanding is that the current Random Forest has a certain level of
> > improvement for running on a Hadoop cluster, from a data-splitting
> > alignment perspective, for better-balanced CPU utilization.
> > Regards,
> > Y.Mandai
> >
> > Sent from my iPhone
> >
> > On 2013/03/25, at 14:48, Ted Dunning <te...@gmail.com> wrote:
> >
> >> I think that there are some others who could say more.
> >>
> >> On Mon, Mar 25, 2013 at 6:01 AM, Ey-Chih chow <ey...@gmail.com> wrote:
> >>
> >>> On Mar 24, 2013, at 1:00 AM, Ted Dunning wrote:
> >>>
> >>>> - random forest, sequential and parallel implementations, new versions
> >>>> are being developed, the current version may or may not be useful to
> >>>> you.
> >>>>
> >>> Can you elaborate on the usefulness of the current version and the
> >>> features of the new versions?  Thanks.
> >>>
> >>> Ey-Chih Chow
> >>>
> >>>
> >>> On Mar 24, 2013, at 1:00 AM, Ted Dunning wrote:
> >>>
> >>>> You are correct to suspect that this page is substantially out of
> >>>> date.
> >>>>
> >>>> Currently, Mahout has the following classifiers:
> >>>>
> >>>> - stochastic gradient descent for logistic regression (SGD) with L_1 or
> >>>> L_2 regularization, sequential version only.  These classifiers can be
> >>>> easily extended with other gradients and regularizers, which should make
> >>>> linear SVMs easy to implement.
> >>>>
> >>>> - naive bayes, sequential and parallel implementations
> >>>>
> >>>> - random forest, sequential and parallel implementations, new versions
> >>>> are being developed, the current version may or may not be useful to
> >>>> you.
> >>>>
> >>>> There are a variety of other classifiers which are in various states of
> >>>> utility.
> >>>>
> >>>> On Mar 24, 2013, at 4:07 AM, Chidananda Sridhar wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am doing a class project on classification and want to use Mahout. I
> >>>>> was searching for the classification algorithms already implemented in
> >>>>> Mahout and came to this page:
> >>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
> >>>>>
> >>>>> The webpage says that Online Passive Aggressive
> >>>>> <https://cwiki.apache.org/confluence/display/MAHOUT/Online+Passive+Aggressive>
> >>>>> is integrated and the rest of the classification algorithms are open or
> >>>>> awaiting commit. Does the webpage have the latest information, or is it
> >>>>> yet to be updated? Is "Online Passive Aggressive" the only algorithm I
> >>>>> can use for now? On the other hand, I see that most of the clustering
> >>>>> algorithms have been integrated.
> >>>>>
> >>>>> Thanks,
> >>>>> Chidananda
> >>>>
> >>>
> >>>
>
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.twigg@cs.ox.ac.uk | +447799647538
>
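
The quoted remark about SGD with L_2 regularization, and about swapping in other gradients to get a linear SVM, can be sketched roughly as follows. This is a self-contained illustration in plain Java and does not use Mahout's classes; the names (SgdSketch, train, LOGISTIC, HINGE), the learning rate, and the regularization strength are all assumptions chosen for the example.

```java
import java.util.function.DoubleBinaryOperator;

/** Hypothetical illustration only; this is not Mahout code. */
class SgdSketch {

    // Derivative of each loss with respect to the score s = w.x, labels y in {-1, +1}.
    // Logistic loss log(1 + exp(-y*s)) gives logistic regression.
    static final DoubleBinaryOperator LOGISTIC = (s, y) -> -y / (1.0 + Math.exp(y * s));
    // Hinge loss max(0, 1 - y*s) gives a linear SVM (subgradient).
    static final DoubleBinaryOperator HINGE = (s, y) -> (y * s < 1.0) ? -y : 0.0;

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += a[j] * b[j];
        }
        return sum;
    }

    /** Plain SGD with an L_2 penalty (lambda / 2) * ||w||^2. */
    static double[] train(double[][] x, double[] y, DoubleBinaryOperator lossGradient,
                          double lambda, double rate, int epochs) {
        double[] w = new double[x[0].length];
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < x.length; i++) {
                double g = lossGradient.applyAsDouble(dot(w, x[i]), y[i]);
                for (int j = 0; j < w.length; j++) {
                    // Loss gradient plus the gradient of the L_2 penalty.
                    w[j] -= rate * (g * x[i][j] + lambda * w[j]);
                }
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // First column is a constant bias feature.
        double[][] x = {{1, 2}, {1, 3}, {1, -2}, {1, -3}};
        double[] y = {1, 1, -1, -1};
        double[] logistic = train(x, y, LOGISTIC, 0.01, 0.1, 100);
        double[] svm = train(x, y, HINGE, 0.01, 0.1, 100);
        System.out.println("logistic score for x[0]: " + dot(logistic, x[0]));
        System.out.println("hinge score for x[0]: " + dot(svm, x[0]));
    }
}
```

Only the loss-gradient function changes between the two models; the update loop and the regularizer stay the same, which is the sense in which the extension is easy.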

RE: Classification Algorithms in Mahout

Posted by "Bhattacharjee, Rohan" <ro...@ebay.com>.
Doesn't the "random" part of random forest defend against overfitting?
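
As a point of reference for the question above, the "random" part of a random forest usually refers to two things: each tree is trained on a bootstrap sample of the rows, and each split considers only a random subset of the features. A minimal sketch of both follows. The names are hypothetical and this is not Mahout's implementation; how well this randomness protects against overfitting on a given data set is not something the sketch can show.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration only; this is not Mahout code. */
class ForestRandomnessSketch {

    /** Draws numRows row indices with replacement (a bootstrap sample). */
    static int[] bootstrapSample(int numRows, Random rng) {
        int[] indices = new int[numRows];
        for (int i = 0; i < numRows; i++) {
            indices[i] = rng.nextInt(numRows);
        }
        return indices;
    }

    /** Picks m distinct feature indices out of numFeatures for one split. */
    static List<Integer> randomFeatureSubset(int numFeatures, int m, Random rng) {
        List<Integer> all = new ArrayList<>();
        for (int f = 0; f < numFeatures; f++) {
            all.add(f);
        }
        Collections.shuffle(all, rng);
        return new ArrayList<>(all.subList(0, m));
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int[] rows = bootstrapSample(10, rng);
        List<Integer> features = randomFeatureSubset(9, 3, rng);
        System.out.println("bootstrap rows: " + java.util.Arrays.toString(rows));
        System.out.println("features tried at one split: " + features);
    }
}
```

Each tree sees a different sample and different candidate features, so the trees' errors are less correlated when their votes are combined.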
