Posted to dev@mahout.apache.org by Pradeep Pujari <pp...@gmail.com> on 2009/06/23 21:14:03 UTC

Re: [jira] Created: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

Hi All,

I followed the Mahout Taste demo, and it was working fine. It looks like this demo
only considers one parameter, "Rating". How can I use other input values, like
a) purchase history b) demographics, etc., besides Rating to compute
user-based recommendations?

Thank you very much.

Pradeep.

Re: [jira] Created: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

Posted by Grant Ingersoll <gs...@apache.org>.
Please ask your question on mahout-user@lucene.apache.org and start a  
new thread.

Thanks,
Grant

On Jun 23, 2009, at 3:14 PM, Pradeep Pujari wrote:

> Hi All,
>
> I followed the Mahout Taste demo, and it was working fine. It looks like
> this demo only considers one parameter, "Rating". How can I use other
> input values, like a) purchase history b) demographics, etc., besides
> Rating to compute user-based recommendations?
>
> Thank you very much.
>
> Pradeep.


Re: [jira] Created: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

Posted by Sean Owen <sr...@gmail.com>.
Any range is OK. Higher values must indicate a stronger positive
preference -- "liking the item more".
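
For instance, values on a 0-to-10 scale work unchanged. Here is a rough sketch
against the Taste recommender API (class names and signatures as in later
Mahout releases; the file name, neighborhood size, and IDs are placeholders):

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ZeroToTenDemo {
  public static void main(String[] args) throws Exception {
    // prefs.csv holds lines of "userID,itemID,value" with value in 0..10;
    // Taste assumes no particular scale, only that higher means "likes more".
    DataModel model = new FileDataModel(new File("prefs.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    for (RecommendedItem item : recommender.recommend(1L, 5)) {
      System.out.println(item);
    }
  }
}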

On Tue, Jun 23, 2009 at 6:19 PM, Pradeep Pujari<pp...@gmail.com> wrote:
> Thanks, all. This is very valuable info for a beginner. Does Mahout
> require preference values to be binary, in the range -1 to +1, or can it
> take any range, like 0 to 10 (say)?
> thanks,

Re: [jira] Created: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

Posted by Pradeep Pujari <pp...@gmail.com>.
Thanks, all. This is very valuable info for a beginner. Does Mahout
require preference values to be binary, in the range -1 to +1, or can it
take any range, like 0 to 10 (say)?
thanks,
Pradeep.

On Tue, Jun 23, 2009 at 2:12 PM, Ted Dunning <te...@gmail.com> wrote:

> This is what is traditionally done, but it is distinctly sub-optimal in
> many ways.  The most serious problem is that there is a heuristic decision
> that says what is important and what is not.
>
> A preferable (and, as far as I know, never used or implemented) approach
> would be to build a real model that includes factors that actually help
> predict the desired outcome.  Methods to do this might include:
>
> a) LLR feature selection from several behavior types followed by
> IDF-weighted scoring.  I have used this, with additional follow-on steps,
> in attrition and loss models for insurance with very good results, but
> never in recommendations.  The basic idea in the attrition and loss models
> was to develop positive and negative indicator sets for each outcome and
> then cluster in the space of indicator scores.  Finally, we built ANN
> models over the variables formed by distances to cluster centroids.  For
> recommendations, this would mean building positive and negative feature
> sets for all items for each kind of behavior.  I would expect little gain
> from negative scores but would still use them.  With positive-only sets,
> this reduces (almost) to the sum of cooccurrence scores done in isolation
> on each kind of input.
>
> b) shared latent variable reductions across multiple behavior types.  For
> SVD or similar decomposition-based techniques, this is equivalent to
> reducing column-adjoined matrices for the independent behaviors.  Then, if
> you have only one kind of information, you can use the SVD to fill in the
> other, missing information.
>
> c) probabilistic latent variable approaches.  For LDA and such, you can
> put all of the behavioral information together and use the model to
> predict missing observations in the standard Bayesian way.  This is
> similar to (b), but much better founded.
>
> On Tue, Jun 23, 2009 at 12:23 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > For example, you could write a script that combines rating,
> > purchase history, demographics, in some way that you think is useful,
> > to produce 'preference' values.
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
>

Fwd: [jira] Created: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

Posted by Ted Dunning <te...@gmail.com>.
Transposing my answer to mahout-user based on Grant's suggestion:

---------- Forwarded message ----------
From: Ted Dunning <te...@gmail.com>
Date: Tue, Jun 23, 2009 at 2:12 PM
Subject: Re: [jira] Created: (MAHOUT-138) Convert main() methods to use
Commons CLI for argument processing
To: mahout-dev@lucene.apache.org



This is what is traditionally done, but it is distinctly sub-optimal in many
ways.  The most serious problem is that there is a heuristic decision that
says what is important and what is not.

A preferable (and, as far as I know, never used or implemented) approach would
be to build a real model that includes factors that actually help predict the
desired outcome.  Methods to do this might include:

a) LLR feature selection from several behavior types followed by IDF-weighted
scoring.  I have used this, with additional follow-on steps, in attrition and
loss models for insurance with very good results, but never in
recommendations.  The basic idea in the attrition and loss models was to
develop positive and negative indicator sets for each outcome and then cluster
in the space of indicator scores.  Finally, we built ANN models over the
variables formed by distances to cluster centroids.  For recommendations, this
would mean building positive and negative feature sets for all items for each
kind of behavior.  I would expect little gain from negative scores but would
still use them.  With positive-only sets, this reduces (almost) to the sum of
cooccurrence scores done in isolation on each kind of input.

b) shared latent variable reductions across multiple behavior types.  For SVD
or similar decomposition-based techniques, this is equivalent to reducing
column-adjoined matrices for the independent behaviors.  Then, if you have
only one kind of information, you can use the SVD to fill in the other,
missing information.

c) probabilistic latent variable approaches.  For LDA and such, you can put
all of the behavioral information together and use the model to predict
missing observations in the standard Bayesian way.  This is similar to (b),
but much better founded.

On Tue, Jun 23, 2009 at 12:23 PM, Sean Owen <sr...@gmail.com> wrote:

> For example, you could write a script that combines rating,
> purchase history, demographics, in some way that you think is useful,
> to produce 'preference' values.
>

Re: [jira] Created: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

Posted by Ted Dunning <te...@gmail.com>.
This is what is traditionally done, but it is distinctly sub-optimal in many
ways.  The most serious problem is that there is a heuristic decision that
says what is important and what is not.

A preferable (and, as far as I know, never used or implemented) approach would
be to build a real model that includes factors that actually help predict the
desired outcome.  Methods to do this might include:

a) LLR feature selection from several behavior types followed by IDF-weighted
scoring.  I have used this, with additional follow-on steps, in attrition and
loss models for insurance with very good results, but never in
recommendations.  The basic idea in the attrition and loss models was to
develop positive and negative indicator sets for each outcome and then cluster
in the space of indicator scores.  Finally, we built ANN models over the
variables formed by distances to cluster centroids.  For recommendations, this
would mean building positive and negative feature sets for all items for each
kind of behavior.  I would expect little gain from negative scores but would
still use them.  With positive-only sets, this reduces (almost) to the sum of
cooccurrence scores done in isolation on each kind of input.  (A sketch of the
LLR statistic itself follows below.)

b) shared latent variable reductions across multiple behavior types.  For SVD
or similar decomposition-based techniques, this is equivalent to reducing
column-adjoined matrices for the independent behaviors.  Then, if you have
only one kind of information, you can use the SVD to fill in the other,
missing information.  (See the column-adjoining sketch below.)

c) probabilistic latent variable approaches.  For LDA and such, you can put
all of the behavioral information together and use the model to predict
missing observations in the standard Bayesian way.  This is similar to (b),
but much better founded.
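
Following up on (a): here is a minimal, self-contained sketch of the LLR score
over a 2x2 contingency table of event counts (the standard G^2 formulation;
the class and method names are just illustrative):

// k11 = count of A and B together, k12 = A without B,
// k21 = B without A, k22 = neither.
public final class Llr {

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized Shannon entropy, N*H(p), computed from raw counts.
  private static double entropy(long... counts) {
    long sum = 0;
    double sumXLogX = 0.0;
    for (long c : counts) {
      sum += c;
      sumXLogX += xLogX(c);
    }
    return xLogX(sum) - sumXLogX;
  }

  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Clamp tiny negative values caused by floating-point rounding.
    return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
  }
}

Large scores mean the two events co-occur far more (or less) often than chance
would predict; keeping only the highest-scoring pairs is the feature-selection
step in (a).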
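
And to make "column-adjoined" in (b) concrete: if R is the users-by-items
ratings matrix and P is a purchase matrix over the same users, you decompose
[R | P] rather than either one alone. A small sketch of just the adjoining
step (plain arrays; the SVD itself would come from any linear algebra package):

// Adjoin two matrices with the same row count (users) column-wise,
// producing the combined users x (columnsA + columnsB) matrix to decompose.
public static double[][] adjoinColumns(double[][] a, double[][] b) {
  if (a.length != b.length) {
    throw new IllegalArgumentException("row counts must match");
  }
  double[][] result = new double[a.length][];
  for (int row = 0; row < a.length; row++) {
    double[] combined = new double[a[row].length + b[row].length];
    System.arraycopy(a[row], 0, combined, 0, a[row].length);
    System.arraycopy(b[row], 0, combined, a[row].length, b[row].length);
    result[row] = combined;
  }
  return result;
}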

On Tue, Jun 23, 2009 at 12:23 PM, Sean Owen <sr...@gmail.com> wrote:

> For example, you could write a script that combines rating,
> purchase history, demographics, in some way that you think is useful,
> to produce 'preference' values.
>



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

Re: [jira] Created: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

Posted by Sean Owen <sr...@gmail.com>.
This is a great question. The short answer is that this sort of thing
is currently outside the scope of the collaborative filtering portion
of Mahout.

That is, the library assumes you already have, as input, some
'preference' value for users and items, and it takes it from there. It
says nothing about how you come up with those preference values.

Now, you could compute some preference value based on any information
you like. For example, you could write a script that combines rating,
purchase history, demographics, in some way that you think is useful,
to produce 'preference' values. Then the library can help you from
there.
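
As a rough illustration only (the signal names, normalizations, and weights
below are entirely made up; the right combination is domain-specific):

// Hypothetical blend of several signals into one Taste-style preference.
// Every constant here is a placeholder, not a recommendation.
public static float inferPreference(float rating,             // e.g. 1..5 stars
                                    int purchaseCount,        // times user bought the item
                                    float demographicAffinity) { // 0..1 segment-match score
  float fromRating = rating / 5.0f;                           // normalize to 0..1
  float fromPurchases = Math.min(1.0f, purchaseCount / 3.0f); // cap repeat purchases
  // Weighted blend rescaled to 0..10; higher must mean "likes it more".
  return 10.0f * (0.6f * fromRating + 0.3f * fromPurchases + 0.1f * demographicAffinity);
}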

The reason the library can't really help you with producing a
preference is that it is so domain-specific, so tied to the problem
you are solving and the data you have. There are few general
solutions. (But I think it would make an interesting sister project,
or new module, to implement some means of inferring preferences!)


On Tue, Jun 23, 2009 at 3:14 PM, Pradeep Pujari<pp...@gmail.com> wrote:
> Hi All,
>
> I followed the Mahout Taste demo, and it was working fine. It looks like this demo
> only considers one parameter, "Rating". How can I use other input values, like
> a) purchase history b) demographics, etc., besides Rating to compute
> user-based recommendations?
>
> Thank you very much.
>
> Pradeep.
>