Posted to user@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2013/01/01 00:48:09 UTC

Re: Parallel MapReduce Classification Examples?

I have experimented with a variety of ways of combining clustering and classification.  The methods in your reference are not generally the most effective in my experience. 

The general architecture that I think subsumes all of the important approaches is to use the distances (or inverse distances) to the cluster centroids as the features for a conventional classifier.  One kind of classifier is a simple lookup table that gives the probabilities for each class based on the nearest centroid, but there are lots of alternatives.
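
A minimal sketch of that architecture in Python, assuming plain k-means with Euclidean distance and seeding from the first k points (none of which is prescribed above).  The function names and the toy data are illustrative only.  `distance_features` produces the per-centroid distances that could be fed to any conventional classifier; `fit_lookup_table` is the simple lookup-table variant.

```python
import math
from collections import Counter, defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (arbitrary choice)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, members in groups.items():
            centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def distance_features(point, centroids):
    """Distances to every centroid: the feature vector for a downstream classifier."""
    return [math.dist(point, c) for c in centroids]

def fit_lookup_table(points, labels, centroids):
    """Estimate P(class | nearest centroid) from training counts."""
    counts = defaultdict(Counter)
    for p, y in zip(points, labels):
        counts[nearest(p, centroids)][y] += 1
    return {i: {y: n / sum(c.values()) for y, n in c.items()}
            for i, c in counts.items()}

def predict_proba(point, centroids, table):
    """Class probabilities of the centroid nearest to point (empty if never seen)."""
    return table.get(nearest(point, centroids), {})

# Toy data (made up): class "a" sits in two opposite corners, class "b" in the other two.
points = [(0, 0), (10, 10), (0, 10), (10, 0),
          (0, 1), (10, 11), (1, 10), (11, 0),
          (1, 0), (11, 10), (0, 11), (10, 1)]
labels = ["a", "a", "b", "b"] * 3

centroids = kmeans(points, k=4)          # more clusters than classes
table = fit_lookup_table(points, labels, centroids)
print(predict_proba((0.5, 0.5), centroids, table))
print(distance_features((0.5, 0.5), centroids))
```

With four clusters for two classes, each centroid in this toy data covers a single class, so the lookup table assigns probability 1.0 to that class for any point nearest to it.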

How you produce the clusters is the other major degree of freedom.  Simple k-means is one option.  Another option that I prefer is to include the target variable during the clustering.  This encourages the cluster boundaries to align with class boundaries.  
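
One possible way to include the target variable (an assumption for illustration, not necessarily the approach meant above) is to append a scaled one-hot encoding of the label to each training vector, cluster in that augmented space, and then drop the label dimensions from the centroids.  The `weight` parameter and all function names here are hypothetical, and the k-means is the same plain Lloyd's variant seeded from the first k points.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (arbitrary choice)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[j].append(p)
        for i, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def augment(points, labels, classes, weight):
    """Append a one-hot label encoding, scaled by weight, to each point."""
    return [list(p) + [weight if y == c else 0.0 for c in classes]
            for p, y in zip(points, labels)]

def supervised_centroids(points, labels, k, weight=5.0):
    """Cluster with the label included, then return centroids in feature space only."""
    classes = sorted(set(labels))
    dim = len(points[0])
    augmented = augment(points, labels, classes, weight)
    return [c[:dim] for c in kmeans(augmented, k)]

# Toy 1-D data (made up): class "b" is spread out and lies right next to class "a".
xs = [[0], [1], [2], [3], [10], [11]]
ys = ["a", "a", "b", "b", "b", "b"]

print(kmeans(xs, 2))                    # clustering on the features alone
print(supervised_centroids(xs, ys, 2))  # clustering with the label included
```

On this toy data the feature-only clustering groups 0, 1, 2 and 3 together, mixing both classes in one cluster, while the label-augmented clustering puts each class in its own cluster.  A larger `weight` pulls the cluster boundaries harder toward the class boundaries.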

The key factor for success is to have enough clusters.  Typically you want way more clusters than you have target classes.  


On Dec 31, 2012, at 11:50 AM, Adam Baron <ad...@gmail.com> wrote:

> Thanks, that is very helpful.
> 
> Does anyone have experience Classifying via Clustering?  If so, I'd love to
> hear any feedback on that technique!  The basic concept being to cluster on
> each training label and corresponding data subset separately then combine
> the centroids to comprise your full model.  I got the idea from here:
> https://onlinecourses.science.psu.edu/stat557/book/export/html/63
> 
> If this is a Classification technique worth exploring, I'd like to know if
> there are any wrapper functions out there to help enable it?  Otherwise, it
> would feel pretty monotonous to run clustering on every possible label
> value.
> 
> Thanks,
>          Adam
> 
> On Fri, Dec 28, 2012 at 6:54 PM, Ted Dunning <te...@gmail.com> wrote:
> 
>> On Fri, Dec 28, 2012 at 5:30 PM, Adam Baron <ad...@gmail.com>
>> wrote:
>> 
>>> I'm trying to get familiar with the parallel MapReduce Classification
>>> algorithms offered in Mahout.  ... Online Passive Aggressive and Hidden
>>> Markov Models might be ready to explore as well.
>> 
>> 
>> I don't think that either of these really got to full production quality in
>> Mahout. The HMM, in particular, may have slow convergence on large problems
>> which is just where you want the parallel program.
>> 
>> I thought that the Online Passive Aggressive code never made it very far,
>> either.
>> 
>> 
>>> Also, is there a parallel version of Logistic Regression officially in
>>> Mahout?
>> 
>> 
>> Nope.
>> 
>> 
>>> ... I ask because I
>>> came across this parallel Logistic Regression implementation which is
>>> apparently based off of Mahout, though not in Mahout:
>>> https://github.com/jpatanooga/KnittingBoar/wiki/Code-Development-Notes
>>> 
>> 
>> Yes.  That is a personal project of Josh Patterson's.  He should comment on
>> it.
>> 
>> It appears to be based on parameter averaging [1], which is an OK approach,
>> but I think that you can do better.  I would generally recommend an
>> alternative with asynchronous parameter updates.  Jeff Dean describes a
>> nice implementation in [2].  Josh's work is based on an experimental
>> map-reduce+ implementation (where + indicates iterated reduce similar to
>> BSP).  The Google learner can be implemented using the standard hack of
>> long-lived mappers that simply re-read their inputs repeatedly in an
>> asynchronous way.
>> 
>> An alternative BSP implementation can be found in Giraph [3].  All BSP
>> implementations tend to use batch synchronous update.
>> 
>> Graphlab [4] uses asynchronous updates.  I don't know the details of what
>> they have available.
>> 
>> 
>>> Also, are there any other parallel MapReduce Classification algorithms in
>>> Mahout which I failed to mention worth checking out?
>>> 
>> 
>> I think you did a good survey.
>> 
>> 
>> [1] http://www.aclweb.org/anthology-new/N/N10/N10-1069.pdf
>> [2] http://techtalks.tv/talks/57639/
>> [3] http://incubator.apache.org/giraph/
>> [4] http://graphlab.org/
>>