Posted to user@mahout.apache.org by Clive Cox <cl...@rummble.com> on 2011/07/15 22:07:39 UTC

Clustering demographic data

Hi,

 Suppose one has an implicit dataset of users and actions (purchases, page
clicks, for example) and also has demographics for those users (age,
gender, location, etc.). Does Mahout have any algorithms that could be
used to cluster users/actions by the users' demographics, so that one could
derive information about certain user groups performing certain actions?

I suppose one could run the CF algorithms or matrix factorization on just
the user/action data, map the result back onto the demographics, and then
see whether it corresponds to any significant demographic group.

But are there clustering algorithms in Mahout that work directly on the
demographic data in this situation?

 Thanks,

 Clive



Re: Clustering demographic data

Posted by Ted Dunning <te...@gmail.com>.
Yes.  It all depends on which features you build in as predictors.  You can
use actions to predict (retro-dict, really) demographics.  Or you can use
geo-demographics to predict actions.

It is common to have multiple models to deal with different levels of
cold-start.  One common such model is simply to have a regional top-40 list.
 Simple and to the point.
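
A regional top-40 list of this kind can be sketched in a few lines. This is
a minimal illustration in Python, not Mahout code. It assumes events are
available as a list of (region, item) pairs; the function names and the
global fallback for unseen regions are assumptions made for the example.

```python
from collections import Counter, defaultdict


def regional_top_n(events, n=40):
    """Build a per-region popularity list from (region, item) pairs."""
    counts = defaultdict(Counter)
    for region, item in events:
        counts[region][item] += 1
    # most_common() orders items by count, highest first.
    return {region: [item for item, _ in c.most_common(n)]
            for region, c in counts.items()}


def global_top_n(events, n=40):
    """Popularity list over all regions, used here as a fallback (assumption)."""
    overall = Counter(item for _, item in events)
    return [item for item, _ in overall.most_common(n)]


def cold_start_recommend(region, regional, fallback):
    """Return the region's list, or the global list for an unseen region."""
    return regional.get(region, fallback)
```

A new user from a known region gets that region's list; anyone else gets the
global list.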

Pure cold-start models are typically pretty simple and typically don't
provide huge lift over basic heuristic ideas like top-40.

Pure recommendation models are also relatively simple and well understood.

More complex are the intermediate models where you have some action
information, some demographics and some item meta-data.  These models often
require large numbers of interaction variables, which can make them
difficult to train.
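
To see why interaction variables multiply so quickly, here is a small Python
sketch. It is illustrative only: the feature names and the `cross` helper
are made up for the example. It crosses action features with demographic
features, and then with item meta-data features as well.

```python
from itertools import product


def cross(*groups):
    """Yield one conjunction feature per combination of one value per group."""
    for combo in product(*groups):
        yield "&".join(combo)


# Hypothetical feature vocabularies, kept tiny on purpose.
actions = ["clicked=shoes", "clicked=books"]
demographics = ["age=18-24", "age=25-34", "region=uk"]
item_meta = ["brand=acme", "price=high"]

pairwise = list(cross(actions, demographics))
three_way = list(cross(actions, demographics, item_meta))

print(len(pairwise))   # 2 * 3 = 6
print(len(three_way))  # 2 * 3 * 2 = 12
```

The count is the product of the vocabulary sizes, so with thousands of items
and hundreds of regions the crossed feature space becomes very large and
very sparse.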

Keep in mind that for recommendation type models, simple classifiers will
often wind up fixated on predicting which users will click rather than which
items each user will click on.  Segmented or per-user AUC is a critical tool
in these cases.
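
The difference between global and per-user AUC can be shown with a toy
example. The sketch below is plain Python rather than Mahout code, and the
function names are assumptions. The hypothetical model scores every item
0.9 for user A (a heavy clicker) and 0.1 for user B, so it knows who clicks
but nothing about which items they click.

```python
from collections import defaultdict


def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Returns None if either class is missing.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def per_user_auc(rows):
    """Mean of the AUCs computed separately within each user's rows.

    rows: iterable of (user, score, label). Users with only one class
    are skipped because their AUC is undefined.
    """
    by_user = defaultdict(list)
    for user, score, label in rows:
        by_user[user].append((score, label))
    aucs = []
    for pairs in by_user.values():
        value = auc([s for s, _ in pairs], [y for _, y in pairs])
        if value is not None:
            aucs.append(value)
    return sum(aucs) / len(aucs) if aucs else None


rows = [
    ("A", 0.9, 1), ("A", 0.9, 1), ("A", 0.9, 0),  # A clicks a lot
    ("B", 0.1, 0), ("B", 0.1, 0), ("B", 0.1, 1),  # B clicks rarely
]
global_auc = auc([s for _, s, _ in rows], [y for _, _, y in rows])
user_auc = per_user_auc(rows)
print(global_auc)  # about 0.667: looks better than chance
print(user_auc)    # 0.5: no ability to rank items within a user
```

The global number rewards the model for separating users; the per-user
number shows it cannot rank items for any individual user.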


On Fri, Jul 15, 2011 at 1:55 PM, Clive Cox <cl...@rummble.com> wrote:

>
> Just to clarify, though: can this workflow be used in the situation where
> you have a cold-start user with demographics and want to predict the set
> of actions they might take?
>
>

Re: Clustering demographic data

Posted by Clive Cox <cl...@rummble.com>.
Thanks Ted, that workflow sounds good.

Just to clarify, though: can this workflow be used in the situation where
you have a cold-start user with demographics and want to predict the set
of actions they might take?

Thanks

 Clive



On Fri, 2011-07-15 at 13:16 -0700, Ted Dunning wrote:
> A typical work-flow for this is to define a disjoint set of demographic
> groups and then train a classifier that has access to user actions and
> "free" geo-demographic data such as IP, geo-IP, time of day and email
> domain.  If you have meta-data from the actions, then you can augment these
> variables by joining the action history to the meta-data and including that
> in your training data.
> 
> Once you have the training data, I would do the standard sort of exploratory
> data analysis using a tool like R.  If you verify with R that relatively
> simple models show predictive lift, then you can switch to training with
> Mahout to get a deployable model.  R is great for agile, interactive
> analysis.  Mahout is great for scaling and deployability.  Use both.
> 



Re: Clustering demographic data

Posted by Ted Dunning <te...@gmail.com>.
A typical work-flow for this is to define a disjoint set of demographic
groups and then train a classifier that has access to user actions and
"free" geo-demographic data such as IP, geo-IP, time of day and email
domain.  If you have meta-data from the actions, then you can augment these
variables by joining the action history to the meta-data and including that
in your training data.
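
The data preparation described above might look roughly like the following
Python sketch. Everything specific in it is an assumption for illustration:
the age bands, the field names (`email_domain`, `geo`, `hour`) and the shape
of the meta-data table are hypothetical, and the output is just a feature
dictionary plus a group label that could be fed to whatever classifier is
used.

```python
from collections import Counter


def demographic_group(age, gender):
    """Map a user to exactly one group, so the groups are disjoint.

    The age bands are arbitrary and chosen only for the example.
    """
    band = "under25" if age < 25 else "25to44" if age < 45 else "45plus"
    return f"{gender}_{band}"


def training_row(user, actions, item_meta):
    """Return (features, label) for one user.

    user      : dict with 'email_domain', 'geo', 'hour', 'age', 'gender'
    actions   : list of item ids the user acted on
    item_meta : dict mapping item id -> list of meta-data tags
    """
    features = Counter()
    # "Free" geo-demographic features available without asking the user.
    features["email_domain=" + user["email_domain"]] = 1
    features["geo=" + user["geo"]] = 1
    features["hour=" + str(user["hour"])] = 1
    # Action history, augmented by joining each item to its meta-data.
    for item in actions:
        features["item=" + item] += 1
        for tag in item_meta.get(item, []):
            features["tag=" + tag] += 1
    label = demographic_group(user["age"], user["gender"])
    return dict(features), label
```

Rows in this form can be written out once and read by both an exploratory
tool and the final training job.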

Once you have the training data, I would do the standard sort of exploratory
data analysis using a tool like R.  If you verify with R that relatively
simple models show predictive lift, then you can switch to training with
Mahout to get a deployable model.  R is great for agile, interactive
analysis.  Mahout is great for scaling and deployability.  Use both.
