Posted to user@mahout.apache.org by Raviv Pavel <st...@gmail.com> on 2012/01/12 20:29:32 UTC

Clustering user profiles

Hi,

I'm new to Mahout (and machine learning) but did quite a lot of reading,
especially "Mahout in Action".

I'm trying to cluster users based on their profiles.
By profile I mean attributes such as age, gender, location, and a set of
interests.

All the examples I have seen so far were about vectors whose dimensions are
all of a similar type (e.g. occurrence of words in text), but in my case each
dimension is of a different "type" and seems to require a different distance
measure.

* Gender has two possible values - what distance measure should I use here?
* Age has a larger set of possible values - euclidean distance?
* Location, expressed as latitude and longitude - euclidean distance, but
between pairs of points?
* Interests, if expressed as a subset of a finite set - the distance could be
based on the number of items shared between two vectors. I assume I can write
a custom distance measure for it.

Other than deciding on correct distance measures, I'm not sure how to
combine them into one clustering process.

As I said, I'm new to this field so any help would be much appreciated.

Thanks,
Raviv

--
View this message in context: http://lucene.472066.n3.nabble.com/Clustering-user-profiles-tp3654652p3654652.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: Clustering user profiles

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
What you have read is correct. Mahout clustering (unsupervised 
classification) can only deal with continuous, homogeneous vector 
representations of the input data, where each vector element is weighted 
the same as the other elements. Mahout (supervised) classification can 
deal with continuous, categorical, word-like and text-like features such 
as in your problem space.

To address your problem with Mahout clustering, you would need to 
develop a mapping for each of your features to continuous vector 
elements and use a WeightedDistanceMeasure to account for the different 
element types and their relative impacts on the overall distance 
computation. This would be an iterative process which might or might not 
produce useful results.
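
A minimal sketch of such a mapping in plain Java, to make the idea concrete. It does not call Mahout: the class name, the vector layout, the scaling constants and the interest vocabulary are all hypothetical choices made for illustration, and the weighting is written out by hand instead of going through a WeightedDistanceMeasure. Treating scaled latitude and longitude as plain coordinates is also only a rough approximation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; none of these names come from Mahout. */
final class ProfileVectors {

    /** Assumed fixed vocabulary; each interest gets its own 0/1 element. */
    static final List<String> INTERESTS = Arrays.asList("music", "sports", "travel");

    /**
     * Maps one profile to continuous elements.
     * Layout: [gender, age, latitude, longitude, interest_0 .. interest_n].
     * The divisors only bring every element into a comparable range.
     */
    static double[] encode(int genderCode, int age, double latDeg, double lonDeg,
                           Set<String> interests) {
        double[] v = new double[4 + INTERESTS.size()];
        v[0] = genderCode;        // two possible values, coded as 0 or 1
        v[1] = age / 100.0;       // assumes ages fall roughly in 0..100
        v[2] = latDeg / 90.0;     // degrees scaled to -1..1
        v[3] = lonDeg / 180.0;    // ignores wrap-around at +/-180 degrees
        for (int i = 0; i < INTERESTS.size(); i++) {
            v[4 + i] = interests.contains(INTERESTS.get(i)) ? 1.0 : 0.0;
        }
        return v;
    }

    /** Weighted Euclidean distance: sqrt(sum of w[i] * (a[i] - b[i])^2). */
    static double weightedDistance(double[] a, double[] b, double[] w) {
        if (a.length != b.length || a.length != w.length) {
            throw new IllegalArgumentException("vectors and weights must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += w[i] * d * d;
        }
        return Math.sqrt(sum);
    }
}
```

With unit weights, two profiles that differ only in gender are at distance 1.0; raising the gender weight to 4.0 doubles that to 2.0. On the 0/1 interest elements, the unweighted squared distance equals the number of interests held by exactly one of the two users. Tuning the weights and re-running the clustering is the iterative part.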

An alternative approach would be to train a Mahout classifier on the
various features, using labeled training data that assigns similar
users to a finite number of "clusters" that seem natural to you. With
such a model, you could then classify new users into those "clusters".
This approach would not be very useful for discovering new "clusters" in
your data, but it would let the classifier training mechanisms build
the model for you, treating it as more of a black box than the
weighted-distance approach above.
