Posted to user@mahout.apache.org by Radu Spineanu <ra...@timisoara.roedu.net> on 2010/11/17 22:12:56 UTC

classification algorithm

Hi.


We have data about users that perform certain actions:
user, age, sex, interests has performed actions 1,2,3
(training data)

Our goal is to ask in real time how likely it is that another user
with a given age, sex, and interests would perform the same actions.


Can we use mahout for this? If yes, which algorithm do you think would 
be best? Would it work if we had partial data, like only age?


Thank you.
-r.

Re: classification algorithm

Posted by Ted Dunning <te...@gmail.com>.

Possibly, but not out of the box.

One way to deal with this is to build multiple models for the different
patterns of data that you have.

Another way is to build models that can sample the missing variables from a
predictive model and then use these as inputs for a more complete model.
Then you can use the distribution of the outputs for predictive purposes.

But your specific case will tell.  Your most important priority will be to
figure out how to test models realistically off-line.

On Wed, Nov 17, 2010 at 1:12 PM, Radu Spineanu <ra...@timisoara.roedu.net> wrote:

> Would it work if we had partial data, like only age?

Re: classification algorithm

Posted by Sebastian Schelter <ss...@apache.org>.

apt-get install mahout

Cool :)

On 22.11.2010 17:04, Isabel Drost wrote:
> On Thu, 18 Nov 2010 Radu Spineanu <ra...@timisoara.roedu.net> wrote:
>
>> I'm a Debian Developer and I noticed Mahout is not in Debian. If I'm
>> able to wrap my head around everything and get it working I would
>> love to contribute back and package it.
>>      
> That would be awesome. Mahout does have quite a few dependencies which
> might make it an interesting packaging exercise. I am not sure whether
> all of them are available in Debian already. At least Hadoop should be
> available in Debian testing, but did not yet make it to the latest
> stable release.
>
> Isabel
>

Re: classification algorithm

Posted by Radu Spineanu <ra...@timisoara.roedu.net>.

>
> That would be awesome. Mahout does have quite a few dependencies which
> might make it an interesting packaging exercise. I am not sure whether
> all of them are available in Debian already. At least Hadoop should be
> available in Debian testing, but did not yet make it to the latest
> stable release.

It may take a while since I am just starting to get accustomed to R.

If anyone here wants to package it before I get a chance, I'll be more than
happy to sponsor you. (In order to get a package into Debian you have to
be a developer/maintainer or have a sponsor.)

-r.

Re: classification algorithm

Posted by Isabel Drost <is...@apache.org>.

On Thu, 18 Nov 2010 Radu Spineanu <ra...@timisoara.roedu.net> wrote:
> I'm a Debian Developer and I noticed Mahout is not in Debian. If I'm 
> able to wrap my head around everything and get it working I would
> love to contribute back and package it.

That would be awesome. Mahout does have quite a few dependencies which
might make it an interesting packaging exercise. I am not sure whether
all of them are available in Debian already. At least Hadoop should be
available in Debian testing, but did not yet make it to the latest
stable release.

Isabel

Re: classification algorithm

Posted by Ted Dunning <te...@gmail.com>.

On Thu, Nov 18, 2010 at 8:54 AM, Radu Spineanu <ra...@timisoara.roedu.net> wrote:

> Offtopic: I can't find examples about how to implement my setup with
> partial queries. In either mahout or R.
>

In R, you build a data frame with all of your columns.  Then when training,
you specify your model using the formula notation:

    m.all = glm(result ~ age + interest1 + interest2 + gender,
                data = yourDataHere, family = binomial())

or

    m.small = glm(result ~ age + gender, data = yourDataHere, family = binomial())

This gives you two models, m.all and m.small.  You can select which one you
want to use based on what data you have.
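
A minimal sketch of that selection, assuming a synthetic stand-in for
yourDataHere and a hypothetical helper called scoreUser (neither is from
the thread); it falls back to m.small whenever an interest column is
absent or NA in a single-row query:

```r
# Synthetic stand-in for yourDataHere; every column and value is made up.
set.seed(1)
n <- 500
yourDataHere <- data.frame(
  age       = sample(18:65, n, replace = TRUE),
  gender    = factor(sample(c("f", "m"), n, replace = TRUE)),
  interest1 = rbinom(n, 1, 0.4),
  interest2 = rbinom(n, 1, 0.3)
)
yourDataHere$result <- rbinom(
  n, 1, plogis(-2 + 0.03 * yourDataHere$age + 0.8 * yourDataHere$interest1))

m.all   <- glm(result ~ age + interest1 + interest2 + gender,
               data = yourDataHere, family = binomial())
m.small <- glm(result ~ age + gender,
               data = yourDataHere, family = binomial())

# Hypothetical helper: use the full model only when the (single-row)
# query carries every predictor the full model needs.
scoreUser <- function(query) {
  needed   <- c("age", "gender", "interest1", "interest2")
  complete <- all(needed %in% names(query)) && !any(is.na(query[needed]))
  model    <- if (complete) m.all else m.small
  unname(predict(model, newdata = query, type = "response"))
}

genderLevels <- levels(yourDataHere$gender)
full    <- data.frame(age = 30, gender = factor("f", levels = genderLevels),
                      interest1 = 1, interest2 = 0)
partial <- data.frame(age = 30, gender = factor("f", levels = genderLevels))

print(scoreUser(full))     # scored by m.all
print(scoreUser(partial))  # scored by m.small
```

With more patterns of missing data, the same idea extends to one model
per pattern, at the cost of having less training signal per model.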

This is not quite the same as what you were asking for.  Using m.all when
you only have age and gender is a tricky business since it requires picking
some values for interest1 and interest2.  One thing you can do is sample from
your training data for all examples that match the specified age and gender.
This gives you a cloud of results, but may not work (what if you haven't seen
*exactly* that combination of age and gender enough to get a good sample?)
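
The sampling idea could be sketched roughly as follows. The data is
synthetic, and cloudFor and its minRows threshold are hypothetical names
chosen for illustration; the helper returns NULL when there are too few
exact matches, which is the failure case described above:

```r
# Synthetic training data; every column and value is made up.
set.seed(2)
n <- 2000
train <- data.frame(
  age       = sample(18:65, n, replace = TRUE),
  gender    = factor(sample(c("f", "m"), n, replace = TRUE)),
  interest1 = rbinom(n, 1, 0.4),
  interest2 = rbinom(n, 1, 0.3)
)
train$result <- rbinom(
  n, 1, plogis(-2 + 0.03 * train$age + 0.8 * train$interest1))

m.all <- glm(result ~ age + interest1 + interest2 + gender,
             data = train, family = binomial())

# Hypothetical helper: borrow interest1/interest2 from training rows with
# exactly the requested age and gender, and score each of them with m.all.
cloudFor <- function(age, gender, minRows = 5) {
  matches <- train[train$age == age & train$gender == gender, ]
  if (nrow(matches) < minRows) return(NULL)  # too few exact matches
  unname(predict(m.all, newdata = matches, type = "response"))
}

cloud <- cloudFor(30, "f")
if (is.null(cloud)) {
  cat("not enough matching training rows\n")
} else {
  print(summary(cloud))  # spread of predicted probabilities
}
```

The spread of the cloud, not just its mean, is what tells you how much
the missing interests matter for this kind of user.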

You can't just put in zeros for interest1 and interest2 because of the
internal way that the models are encoded.  Putting in zeros implicitly
chooses the default value (typically interest1 because it sorts first)
which is definitely wrong.
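
To see the encoding issue concretely, here is a small sketch. It assumes
interests are stored as a single factor column named interest with
made-up levels, and R's default treatment contrasts; the thread does not
say how the real data is encoded:

```r
# Made-up factor with three levels; "books" sorts first and becomes
# the reference level under the default treatment contrasts.
d <- data.frame(interest = factor(c("books", "math", "sports")))

X <- model.matrix(~ interest, data = d)
print(X)

# The "books" row has zeros in every interest column, so feeding zeros
# to the model is read as "interest is books", not "interest unknown".
print(X[1, c("interestmath", "interestsports")])
```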

> I can train with "age", "interest1" ... "interestN", "demographic1",
> ..."demographicX" and when querying I could ask with "age", "interest1" ..
> "interestM" where M could be bigger or smaller than N.
>

You really only have the choice of synthesizing data or having multiple
models.  I recommend multiple models in most cases.

> I could break them into multiple rows, but it would produce fake results.
> Someone interested in Books + Math could yield results, but just Math
> wouldn't.
>

I don't understand it, but I don't recommend it.  The multiple model
approach above is much simpler.

> Do you guys know anyone that offers consulting services at a reasonable
> price to help with modelling?
>

I don't know about reasonable, but there are several people who can offer
some help:

- most university stats departments have somebody interested in data-mining.
  There is probably a grad student who could help.

- Chris Poulin at Patterns and Predictions might be able to help.

- Mike Driscoll at Dataspora offers such consulting.

- Joseph Turian might be able to help.

- many others that don't come to mind instantly.

Good luck!

Re: classification algorithm

Posted by Radu Spineanu <ra...@timisoara.roedu.net>.

>
> Use R.  Mahout is major over-kill for this problem.  You can always
> transition later.
>
> I am not saying your problem isn't difficult or that it isn't valuable to
> solve.  Just that the virtue that Mahout brings (scale) isn't the virtue you
> need (models sooner with least effort).
>

Got it. I've started reading on R today.

>
>> I'm a Debian Developer and I noticed Mahout is not in Debian. If I'm able
>> to wrap my head around everything and get it working I would love to
>> contribute back and package it.
>>
>
> We would love it if you did.  Mahout is fast moving and trunk will be
> significantly more useful for most people for a while yet.  How does that
> affect packaging for debian?
>

Debian has 3 distributions: stable, testing, unstable. A new stable gets 
released every 18 months or so. I've seen packages following trunk, like 
ruby1.9, so in theory it should be OK.

I want to look into this when I start using mahout; having it packaged
would help me too. Right now I'm trying to get the hang of R though.


Offtopic: I can't find examples about how to implement my setup with 
partial queries. In either mahout or R.

I can train with "age", "interest1" ... "interestN", "demographic1", 
..."demographicX" and when querying I could ask with "age", "interest1" 
.. "interestM" where M could be bigger or smaller than N.

I could break them into multiple rows, but it would produce fake results.
Someone interested in Books + Math could yield results, but just Math
wouldn't.

Do you guys know anyone that offers consulting services at a reasonable 
price to help with modelling?

-r.

Re: classification algorithm

Posted by Ted Dunning <te...@gmail.com>.

On Wed, Nov 17, 2010 at 2:50 PM, Radu Spineanu <ra...@timisoara.roedu.net> wrote:

> We're going to start with < 1.000 observations but we have to be able to
> scale out very quickly if it works. It could get to 100.000 observations in
> 6-8 months.
>

Use R.  Mahout is major over-kill for this problem.  You can always
transition later.

I am not saying your problem isn't difficult or that it isn't valuable to
solve.  Just that the virtue that Mahout brings (scale) isn't the virtue you
need (models sooner with least effort).

>
> The model is a combination between c) and b). All actions except the first
> one are independent. If we build the model around c) would it be hard to
> move to b) later on if that's the case? I want to go the easier route for
> now.
>

(c) is the easiest (independent models for each outcome).
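
As a rough sketch of (c) in R, one binary logistic regression can be fit
per action. The data frame, the action1..action3 columns and the choice
of age and gender as predictors are all made up for illustration:

```r
# Synthetic users with one 0/1 column per action; all values are made up.
set.seed(3)
n <- 400
users <- data.frame(
  age    = sample(18:65, n, replace = TRUE),
  gender = factor(sample(c("f", "m"), n, replace = TRUE))
)
users$action1 <- rbinom(n, 1, plogis(-1 + 0.02 * users$age))
users$action2 <- rbinom(n, 1, 0.3)
users$action3 <- rbinom(n, 1, plogis(0.5 - 0.02 * users$age))

actions <- c("action1", "action2", "action3")

# One independent binary logistic regression per action.
models <- lapply(actions, function(a) {
  glm(reformulate(c("age", "gender"), response = a),
      data = users, family = binomial())
})
names(models) <- actions

# Score a new user against every action model.
newUser <- data.frame(age = 42,
                      gender = factor("m", levels = levels(users$gender)))
probs <- sapply(models, function(m) {
  unname(predict(m, newdata = newUser, type = "response"))
})
print(round(probs, 3))
```

Because the models are independent, the probabilities need not sum to
one, and any nesting among the actions is ignored.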

>
> Could you point me to books, docs, howtos, articles about getting up and
> running with c)?
>

General data-mining books should be good.

Chris Bishop's book is excellent but a bit advanced.
http://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738/ref=ntt_at_ep_dpi_1/192-5152996-7376364

> I'm a Debian Developer and I noticed Mahout is not in Debian. If I'm able
> to wrap my head around everything and get it working I would love to
> contribute back and package it.
>

We would love it if you did.  Mahout is fast moving and trunk will be
significantly more useful for most people for a while yet.  How does that
affect packaging for debian?

>
>
> > But your specific case will tell.  Your most important priority will be
> > to figure out how to test models realistically off-line.
>
> What do you mean by this?
>

I mean that you need to be able to tell if your models are doing some good
without going back to your live audience for more data.
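
One common way to do this is to hold out part of the data you already
have. The sketch below assumes synthetic data, a random 80/20 split and
log loss as the metric; none of these choices come from the thread, and
a split by time may be closer to how the model would really be used:

```r
# Synthetic data; every column and value is made up.
set.seed(4)
n <- 1000
d <- data.frame(
  age    = sample(18:65, n, replace = TRUE),
  gender = factor(sample(c("f", "m"), n, replace = TRUE))
)
d$result <- rbinom(n, 1, plogis(-3 + 0.06 * d$age))

# Hold out 20% of the rows; the model never sees them during training.
testIdx <- sample(n, size = n / 5)
train   <- d[-testIdx, ]
test    <- d[testIdx, ]

m <- glm(result ~ age + gender, data = train, family = binomial())
p <- predict(m, newdata = test, type = "response")

# Log loss: lower is better.
logLoss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

# Baseline that predicts the training base rate for everyone.
baseline <- rep(mean(train$result), nrow(test))

cat("model log loss:   ", logLoss(test$result, p), "\n")
cat("baseline log loss:", logLoss(test$result, baseline), "\n")
```

A model that cannot beat the base-rate baseline on held-out rows is
unlikely to help on live traffic.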

Re: classification algorithm

Posted by Radu Spineanu <ra...@timisoara.roedu.net>.

We're going to start with < 1.000 observations but we have to be able to 
scale out very quickly if it works. It could get to 100.000 observations 
in 6-8 months.

The model is a combination between c) and b). All actions except the 
first one are independent. If we build the model around c) would it be 
hard to move to b) later on if that's the case? I want to go the easier 
route for now.

Could you point me to books, docs, howtos, articles about getting up and 
running with c)?

I'm a Debian Developer and I noticed Mahout is not in Debian. If I'm 
able to wrap my head around everything and get it working I would love 
to contribute back and package it.

 > But your specific case will tell.  Your most important priority will be to
 > figure out how to test models realistically off-line.

What do you mean by this?

-r.


On 11/18/2010 12:03 AM, Ted Dunning wrote:
> Yes.
>
> I would start with the SGD system and possibly use the naive bayes models if
> you have massive amounts of data.
>
> In fact, if you have < 100,000 observations I would strongly recommend using
> a more user friendly system such as R.
>
> Regardless of which system, you need to decide what kind of model you need
> to build.  There are several natural alternatives:
>
> a) only one of the possible actions matters (or only one can be done) and
> the actions are not ordered.  Use multinomial logistic regression (SGD
> implements this very nicely).
>
> b) the actions nest in some way.  An example might be progression by a web
> visitor toward economic conversion.  Action 1 might be any visitor, action 2
> is clicking on product information, action 3 might be putting an item in a
> shopping cart and action 4 might be buying an item.  These items have a
> clear and important ordering and all users who complete action n have
> completed all lower actions.  Ordinal logistic regression is a natural
> choice here.  Mahout does not really support this.  You can do the poor
> man's version by just using the largest action completed and using
> multinomial logistic regression.
>
> c) the actions are relatively independent.  Here you can start with n binary
> logistic regression models.  This will ignore any nesting
> or implication structure among actions.  Mahout can help here with the
> binary logistic regression.
>
> On Wed, Nov 17, 2010 at 1:12 PM, Radu Spineanu <ra...@timisoara.roedu.net> wrote:
>
>> Hi.
>>
>>
>> We have data about users that perform certain actions:
>> user, age, sex, interests has performed actions 1,2,3
>> (training data)
>>
>> Our goal is to ask in real time how likely is it that another user having
>> age, sex, interests would perform the same actions.
>>
>>
>> Can we use mahout for this? If yes, which algorithm do you think would be
>> best? Would it work if we had partial data, like only age?
>>
>>
>> Thank you.
>> -r.
>>
>

Re: classification algorithm

Posted by Ted Dunning <te...@gmail.com>.

Yes.

I would start with the SGD system and possibly use the naive bayes models if
you have massive amounts of data.

In fact, if you have < 100,000 observations I would strongly recommend using
a more user friendly system such as R.

Regardless of which system, you need to decide what kind of model you need
to build.  There are several natural alternatives:

a) only one of the possible actions matters (or only one can be done) and
the actions are not ordered.  Use multinomial logistic regression (SGD
implements this very nicely).

b) the actions nest in some way.  An example might be progression by a web
visitor toward economic conversion.  Action 1 might be any visitor, action 2
is clicking on product information, action 3 might be putting an item in a
shopping cart and action 4 might be buying an item.  These items have a
clear and important ordering and all users who complete action n have
completed all lower actions.  Ordinal logistic regression is a natural
choice here.  Mahout does not really support this.  You can do the poor
man's version by just using the largest action completed and using
multinomial logistic regression.

c) the actions are relatively independent.  Here you can start with n binary
logistic regression models.  This will ignore any nesting
or implication structure among actions.  Mahout can help here with the
binary logistic regression.
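
The poor man's version of (b) could be sketched in R like this. The
funnel data is synthetic, the column names are made up, and the fit
uses multinom from the nnet package only if it happens to be installed
(an assumption; Mahout's SGD classifier is not shown here):

```r
# Synthetic nested funnel: each visitor reaches some furthest step 1..4,
# and has completed every step up to it.  All values are made up.
set.seed(5)
n <- 300
furthest <- sample(1:4, n, replace = TRUE, prob = c(0.5, 0.3, 0.15, 0.05))
visits   <- data.frame(age = sample(18:65, n, replace = TRUE))

# 0/1 flag per action, as the raw training data might record it.
flags <- sapply(1:4, function(k) as.integer(furthest >= k))
colnames(flags) <- paste0("action", 1:4)

# Collapse the flags to a single label: the largest action completed.
visits$largest <- factor(apply(flags, 1, function(r) max(which(r == 1))))
print(table(visits$largest))

# Treat that label as an unordered class and fit a multinomial model.
if (requireNamespace("nnet", quietly = TRUE)) {
  m <- nnet::multinom(largest ~ age, data = visits, trace = FALSE)
  print(predict(m, newdata = data.frame(age = 30), type = "probs"))
}
```

This throws away the ordering of the steps, which is why it is only a
stand-in for ordinal logistic regression.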

On Wed, Nov 17, 2010 at 1:12 PM, Radu Spineanu <ra...@timisoara.roedu.net> wrote:

> Hi.
>
>
> We have data about users that perform certain actions:
> user, age, sex, interests has performed actions 1,2,3
> (training data)
>
> Our goal is to ask in real time how likely is it that another user having
> age, sex, interests would perform the same actions.
>
>
> Can we use mahout for this? If yes, which algorithm do you think would be
> best? Would it work if we had partial data, like only age?
>
>
> Thank you.
> -r.
>