Posted to user@mahout.apache.org by Chidananda Sridhar <ch...@gmail.com> on 2013/03/24 05:07:32 UTC

Classification Algorithms in Mahout

Hi,

I am doing a class project on classification and want to use Mahout. I was
searching for the classification algorithms already implemented in Mahout
and came to this page:
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

The webpage says that Online Passive Aggressive
<https://cwiki.apache.org/confluence/display/MAHOUT/Online+Passive+Aggressive>
is integrated and the rest of the classification algorithms are open or
awaiting commit. Does the webpage have the latest information, or is it yet
to be updated? Is "Online Passive Aggressive" the only algorithm I can use
for now? On the other hand, I see that most of the clustering algorithms
have been integrated.

Thanks,
Chidananda

RE: Classification Algorithms in Mahout

Posted by "Bhattacharjee, Rohan" <ro...@ebay.com>.
Doesn't the "random" part of random forest defend against overfitting?
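
To make the question concrete, here is a minimal sketch of where the randomness enters: each tree is trained on a bootstrap sample and the trees vote. This is not Mahout's implementation. The names `train_stump`, `train_forest`, `forest_predict`, and `TOY` are made up for illustration, and the trees are one-split stumps on a single feature to keep the example short. Averaging over resampled trees mainly reduces variance; it does not by itself limit how closely each individual tree fits noise, which is why tree-size parameters may still matter.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-split tree on (x, label) pairs.

    Returns (threshold, left_label, right_label)."""
    pts = sorted(sample)
    best = None
    for i in range(1, len(pts)):
        if pts[i - 1][0] == pts[i][0]:
            continue  # cannot split between equal x values
        left = [y for _, y in pts[:i]]
        right = [y for _, y in pts[i:]]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            threshold = (pts[i - 1][0] + pts[i][0]) / 2.0
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # Every x in the sample is identical: predict the majority label.
        label = Counter(y for _, y in pts).most_common(1)[0][0]
        return (0.0, label, label)
    return best[1:]

def train_forest(data, n_trees, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data])
            for _ in range(n_trees)]

def forest_predict(forest, x):
    """Majority vote over all stumps."""
    votes = [left if x < t else right for t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical toy data: negative x is class 0, positive x is class 1.
TOY = [(-5.0, 0), (-4.0, 0), (-3.0, 0), (-2.0, 0), (-1.0, 0),
       (1.0, 1), (2.0, 1), (3.0, 1), (4.0, 1), (5.0, 1)]
```

A full random forest usually also picks a random subset of features at each split; with a single feature that step has nothing to choose from, so it is left out here.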


-----Original Message-----
From: ey-chih chow [mailto:eychih@gmail.com] 
Sent: Saturday, April 06, 2013 5:45 PM
To: user@mahout.apache.org
Subject: Re: Classification Algorithms in Mahout

I actually got a lot of overfitting.  The parameter that I can adjust is minSplitNum.  Are there any other parameters that I can adjust to avoid overfitting?  Thanks.

Ey-Chih



Re: Classification Algorithms in Mahout

Posted by ey-chih chow <ey...@gmail.com>.
I actually got a lot of overfitting.  The parameter that I can adjust is
minSplitNum.  Are there any other parameters that I can adjust to avoid
overfitting?  Thanks.

Ey-Chih
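
As an illustration of why a minimum split size matters, the sketch below grows a tiny decision tree on one feature and refuses to split any node with fewer than `min_split` samples. It assumes minSplitNum plays a similar role; the thread does not spell out its exact semantics, and `build_tree`, `classify`, `count_leaves`, and `NOISY` are hypothetical names, not Mahout APIs. The toy labels follow the rule "x >= 5" except for two deliberately flipped points.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def build_tree(points, min_split):
    """Grow a tree on (x, label) pairs; nodes smaller than min_split stay leaves."""
    labels = [y for _, y in points]
    majority = 1 if 2 * sum(labels) >= len(labels) else 0
    if len(points) < min_split or gini(labels) == 0.0:
        return {"leaf": majority}
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between equal x values
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best[0]:
            best = (score, i)
    if best is None:
        return {"leaf": majority}
    i = best[1]
    return {
        "threshold": (points[i - 1][0] + points[i][0]) / 2.0,
        "left": build_tree(points[:i], min_split),
        "right": build_tree(points[i:], min_split),
    }

def classify(tree, x):
    """Walk from the root to a leaf and return its label."""
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["leaf"]

def count_leaves(tree):
    if "leaf" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])

# Hypothetical data: true rule is x >= 5; labels at x=2 and x=7 are flipped.
NOISY = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0),
         (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]

deep = build_tree(NOISY, min_split=2)     # splits until every leaf is pure
shallow = build_tree(NOISY, min_split=6)  # stops early, ignores the two flips
```

On this data the unrestricted tree memorizes both flipped labels, while the tree with the larger minimum split size keeps only the split near x = 4.5. Whether a similar setting helps in practice depends on the data.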


On Wed, Mar 27, 2013 at 3:12 PM, Andy Twigg <an...@gmail.com> wrote:

> Dear Ey-Chih,
>
> What are your use cases for a better random forest?

Re: Classification Algorithms in Mahout

Posted by Andy Twigg <an...@gmail.com>.
Dear Ey-Chih,

What are your use cases for a better random forest?

On 27 March 2013 11:59, Yutaka Mandai <20...@gmail.com> wrote:
> My understanding is that the current Random Forest has seen some improvement for running on a Hadoop cluster: data splits are better aligned, which gives more balanced CPU utilization.
> Regards,
> Y.Mandai



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Classification Algorithms in Mahout

Posted by Yutaka Mandai <20...@gmail.com>.
My understanding is that the current Random Forest has seen some improvement for running on a Hadoop cluster: data splits are better aligned, which gives more balanced CPU utilization.
Regards,
Y.Mandai

Sent from iPhone

On 2013/03/25, at 14:48, Ted Dunning <te...@gmail.com> wrote:

> I think that there are some others who could say more.

Re: Classification Algorithms in Mahout

Posted by Ted Dunning <te...@gmail.com>.
I think that there are some others who could say more.

On Mon, Mar 25, 2013 at 6:01 AM, Ey-Chih chow <ey...@gmail.com> wrote:

> On Mar 24, 2013, at 1:00 AM, Ted Dunning wrote:
>
> > - random forest, sequential and parallel implementations, new versions
> are being developed, the current version may or may not be useful to you.
> >
> Can you elaborate on the usefulness of the current version and the
> features of the new versions?  Thanks.
>
> Ey-Chih Chow

Re: Classification Algorithms in Mahout

Posted by Ey-Chih chow <ey...@gmail.com>.
On Mar 24, 2013, at 1:00 AM, Ted Dunning wrote:

> - random forest, sequential and parallel implementations, new versions are being developed, the current version may or may not be useful to you.
> 
Can you elaborate on the usefulness of the current version and the features of the new versions?  Thanks.

Ey-Chih Chow




Re: Classification Algorithms in Mahout

Posted by Ted Dunning <te...@gmail.com>.
You are correct to suspect that this page is substantially out of date.

Currently, Mahout has the following classifiers:

- stochastic gradient descent for logistic regression (SGD) with L_1 or L_2 regularization, sequential version only.  These classifiers can be easily extended with other gradients and regularizers, which should make linear SVMs easy to implement.

- naive Bayes, sequential and parallel implementations

- random forest, sequential and parallel implementations.  New versions are being developed; the current version may or may not be useful to you.

There are a variety of other classifiers in various states of utility.
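
The point about swapping gradients and regularizers can be sketched in a few lines. This is a standalone illustration, not Mahout's SGD code: `sgd_train`, `logistic_gradient`, `hinge_gradient`, `l1_penalty`, `l2_penalty`, and the toy data are all hypothetical names chosen here. The training loop is the same for both models; passing the hinge-loss subgradient instead of the log-loss gradient turns the logistic regression into a linear SVM.

```python
import math
import random

def logistic_gradient(w, x, y):
    """Gradient of the log loss for one example; y is 0 or 1."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

def hinge_gradient(w, x, y):
    """Subgradient of the hinge loss (linear SVM); y is 0 or 1."""
    s = 1.0 if y == 1 else -1.0
    margin = s * sum(wi * xi for wi, xi in zip(w, x))
    return [-s * xi if margin < 1.0 else 0.0 for xi in x]

def l2_penalty(w, lam):
    """Gradient of (lam / 2) * ||w||^2."""
    return [lam * wi for wi in w]

def l1_penalty(w, lam):
    """Subgradient of lam * ||w||_1."""
    return [lam * ((wi > 0) - (wi < 0)) for wi in w]

def sgd_train(data, gradient, penalty, lam=0.01, rate=0.1, epochs=50, seed=0):
    """Plain sequential SGD; the loss and regularizer are pluggable."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            g = gradient(w, x, y)
            r = penalty(w, lam)
            w = [wi - rate * (gi + ri) for wi, gi, ri in zip(w, g, r)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Hypothetical toy data; the second feature is a constant bias term.
TOY = [([-2.0, 1.0], 0), ([-1.0, 1.0], 0), ([1.0, 1.0], 1), ([2.0, 1.0], 1)]

w_logistic = sgd_train(TOY, logistic_gradient, l2_penalty)
w_svm = sgd_train(TOY, hinge_gradient, l2_penalty)
```

Passing `l1_penalty` in place of `l2_penalty` changes the regularizer the same way. Production SGD code would typically add learning-rate decay and a more careful treatment of L_1, both omitted here.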
	
On Mar 24, 2013, at 4:07 AM, Chidananda Sridhar wrote:

> Hi,
> 
> I am doing a class project on classification and want to use Mahout. I was
> searching for the classification algorithms already implemented in Mahout
> and came to this page:
> https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
> 
> The webpage says that Online Passive Aggressive
> <https://cwiki.apache.org/confluence/display/MAHOUT/Online+Passive+Aggressive>
> is integrated and the rest of the classification algorithms are open or
> awaiting commit. Does the webpage have the latest information, or is it yet
> to be updated? Is "Online Passive Aggressive" the only algorithm I can use
> for now? On the other hand, I see that most of the clustering algorithms
> have been integrated.
> 
> Thanks,
> Chidananda