Posted to user@mahout.apache.org by Koobas <ko...@gmail.com> on 2012/12/02 22:12:00 UTC

Mahout Amazon EMR usage cost

I was wondering if somebody could give me a rough estimate of the cost of
running Mahout on Amazon's Elastic MapReduce for a specific problem.
I am working with a common case of implicit feedback.
I have a simple, boolean input, i.e., user-item pairs (userID, itemID).
I would like to find 50 nearest neighbors for each item.
I have 10M users, 10K items, and 500M records.
If anybody has any ballpark idea of the kind of cost it would take to solve
the problem using EMR, I would appreciate it very much.
Jacob

Re: Mahout Amazon EMR usage cost

Posted by Sean Owen <sr...@gmail.com>.
Agree with Ted. If you really want to do this, use the Tanimoto
similarity implementation in the job I described earlier and you
should have similarity ranked by overlap. It's one of the simplest
similarity functions. But it's not a great idea. You will find that
most of the 'recommendations' are skewed towards top-selling items.

Something based on cooccurrence or a latent factor model should give
better results. For example, I don't think Amazon actually uses this
for its most-similar-items calculations. If it ever shows such a figure,
it's probably just because it is something humans can understand as a
justification. I would choose a different similarity metric.

These aren't recommendations; they're not personalized. They're just
most-similar items. That may be fine if that's what you want but you
could also explore making actual personalized recommendations. That
would take more computation of course.
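
A minimal single-machine sketch of that approach, for reference (this uses
the non-distributed Taste API rather than the Hadoop job; class names assume
a Mahout 0.7-era release, and the input path is a placeholder):

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class TanimotoMostSimilar {
      public static void main(String[] args) throws Exception {
        // Boolean input: one "userID,itemID" line per event, no rating column.
        DataModel model = new FileDataModel(new File("pairs.csv"));
        GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(
            model, new TanimotoCoefficientSimilarity(model));
        // The 50 most-similar items to item 123, ranked by Tanimoto overlap.
        for (RecommendedItem similar : recommender.mostSimilarItems(123L, 50)) {
          System.out.println(similar.getItemID() + "\t" + similar.getValue());
        }
      }
    }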

On Mon, Dec 3, 2012 at 8:03 AM, Ted Dunning <te...@gmail.com> wrote:
> On Mon, Dec 3, 2012 at 3:06 AM, Koobas <ko...@gmail.com> wrote:
>
>> Thank you very much.
>> The pointer to Myrrix is a very useful piece of information.
>> Myrrix, however, relies on an iterative sparse matrix factorization to do
>> PCA.
>> I want to produce Amazon-like recommendations.
>> I.e., "70% of users who bought this also bought that."
>>
>
> You can always quote figures like that no matter how you got the
> recommendation, but it is usually very bad to simply use such cooccurrence
> statistics directly to form recommendations, since they are seriously
> affected by accidental coincidence.
>
>
>> So, I specifically want the direct kNN algorithm.
>> Any clue what Mahout + Hadoop can deliver on that one?
>>
>
> Yes. Mahout can do this.

Re: Mahout Amazon EMR usage cost

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Dec 3, 2012 at 3:06 AM, Koobas <ko...@gmail.com> wrote:

> Thank you very much.
> The pointer to Myrrix is a very useful piece of information.
> Myrrix, however, relies on an iterative sparse matrix factorization to do
> PCA.
> I want to produce Amazon-like recommendations.
> I.e., "70% of users who bought this also bought that."
>

You can always quote figures like that no matter how you got the
recommendation, but it is usually very bad to simply use such cooccurrence
statistics directly to form recommendations, since they are seriously
affected by accidental coincidence.


> So, I specifically want the direct kNN algorithm.
> Any clue what Mahout + Hadoop can deliver on that one?
>

Yes. Mahout can do this.

Re: Mahout Amazon EMR usage cost

Posted by Koobas <ko...@gmail.com>.
On Wed, Dec 5, 2012 at 7:03 PM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Dec 5, 2012 at 5:29 PM, Koobas <ko...@gmail.com> wrote:
>
> > ...
> > Now yet another naive question.
> > Ted is probably going to go ballistic ;)
> >
>
> I hope not.
>
>
> > Assuming that simple overlap methods suck,
> > is there still a metric that works better than others
> > (i.e. Tanimoto vs. Jaccard vs something else)?
> >
>
> LLR works well on usage data.
>
> The idea here is to use a robust test for anomalous cooccurrence between
> pairs of items.  If you find an anomaly, you record a 1 for that pair in
> the item-item matrix.  Otherwise, record a 0.
>

Thanks!

Re: Mahout Amazon EMR usage cost

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Dec 5, 2012 at 5:29 PM, Koobas <ko...@gmail.com> wrote:

> ...
> Now yet another naive question.
> Ted is probably going to go ballistic ;)
>

I hope not.


> Assuming that simple overlap methods suck,
> is there still a metric that works better than others
> (i.e. Tanimoto vs. Jaccard vs something else)?
>

LLR works well on usage data.

The idea here is to use a robust test for anomalous cooccurrence between
pairs of items.  If you find an anomaly, you record a 1 for that pair in
the item-item matrix.  Otherwise, record a 0.
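
A sketch of that test, assuming Mahout's LogLikelihood helper (the counts
and the cutoff value below are illustrative assumptions, not recommendations):

    import org.apache.mahout.math.stats.LogLikelihood;

    public class LlrFlag {
      // k11: users who interacted with both A and B
      // k12: users who interacted with A but not B
      // k21: users who interacted with B but not A
      // k22: users who interacted with neither
      static int flag(long k11, long k12, long k21, long k22) {
        double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
        return llr > 10.0 ? 1 : 0;  // the 10.0 cutoff is a hypothetical choice
      }
    }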

Re: Mahout Amazon EMR usage cost

Posted by Koobas <ko...@gmail.com>.
I am very happy to see that I started a lively thread.
I am a newcomer to the field, so this is all very useful.

Now yet another naive question.
Ted is probably going to go ballistic ;)
Assuming that simple overlap methods suck,
is there still a metric that works better than others
(i.e. Tanimoto vs. Jaccard vs something else)?


On Wed, Dec 5, 2012 at 3:24 AM, Paulo Villegas <pa...@gmail.com> wrote:

> I don't disagree at all with what you're saying. I never said (or intended
> to say) that explanations would have to be a thorough dump of the engine's
> internal computation; this would not make sense to the user and would just
> overwhelm him. Picking out a couple of representative items would be more
> than enough.
>
> And if the original algorithm is too complicated, yes, it may make sense to
> bring up an additional, simpler and more understandable engine just to
> produce explanations. But then you need to ensure that the explanations fit well
> with the results you're actually delivering. And in any case if you've got
> that additional engine and it works sensibly, you could as well aggregate
> its results into the main system and build up an ensemble. It may not work
> in all cases, but may do well in others. YMMV.
>
> I'm also not saying I know exactly what Amazon is doing internally; you
> need a lot more than a casual look at the UI to infer that. They could be
> doing frequent itemset mining, or they might not be. But I maintain it can be a
> valid approach. A recommendation coming from association rules will have
> less coverage than a "standard" CF engine, and will probably miss a bigger
> part of the long tail, but for the goal of enlarging the basket of items
> the user is willing to buy in a single transaction it is perfectly well suited
> (i.e. don't find "the next best item", find "the item that goes along well
> with this one").
>
> And if you model transactions adequately (like items watched in a single
> browsing session, when you might assume that the user has a single main
> intent, as opposed to coming back the next day with a different thing in mind)
> then it might help to discard spurious associations (such as you see
> sometimes in Amazon, anyway). Of course, a similar effect can be achieved
> with a "standard" recommender engine if you introduce time effects.
>
>
>
>
>> On Wed, Dec 5, 2012 at 6:57 AM, Paulo Villegas <paulo.vllgs@gmail.com> wrote:
>>
>>> On 05/12/12 00:53, Ted Dunning wrote:
>>>
>>>> Also, you have to separate UI considerations from algorithm considerations.
>>>> Whatever algorithm populates the recommendations is the recommender
>>>> algorithm. It has two responsibilities: first, find items that the users
>>>> will like, and second, pick out a variety of less certain items to learn
>>>> about. It is not responsible for justifying choices to the user. The UI
>>>> does that, and it may use analytics of some kind to make claims about the
>>>> choices made, but that won't change the choices.
>>>>
>>>
>>> Here I disagree: explaining recommendations to the user is an important
>>> factor in user acceptance (and therefore uptake) of the results, since if
>>> she can understand why some completely unknown item was recommended it'll
>>> make her more confident that it's a good choice (this has also been
>>> proven
>>> experimentally).
>>>
>>
>>
>> I have demonstrated that explanations help as well in some cases.  Not in
>> all.
>>
>>
>>> And the one best placed to know why something was recommended is the
>>> engine itself.
>>>
>>
>>
>> This is simply not true.  The engine may have very complex reasons for
>> recommendation.  This applies in classification as well.  It is completely
>> conventional, and often critical to performance to have one engine for
>> recommendation or classification and a completely independent one for
>> explanation.
>>
>>
>>> That's one good additional reason why item-based neighbourhood is more
>>> advantageous than user-based: you can communicate item neighbours to the
>>> user, who then sees items she knows that are similar to the one being
>>> recommended (it's one of the things Amazon does in its recommendation
>>> lists).
>>>
>>
>>
>> Again.  This simply isn't that important.  The major goal of the
>> recommendation engine is to produce high quality recommendations and one of
>> the major problems in doing that is avoiding noise effects.  Ironically, it
>> is also important for the recommendation engine to inject metered amounts
>> of a different kind of noise as well.  Neither of those capabilities makes
>> sense to explain to the user and these may actually dominate the
>> decisions.
>>
>> Once an explainer is given a clean set of recommendations, then the problem
>> of explaining is vastly different than the job of recommending.  For
>> instance Tanimoto or Jaccard are horrible for recommendation but great for
>> explaining.  The issue is that the explainer doesn't have to explain all of
>> the items that are *not* shown, only those which are shown.
>>
>> Note that Amazon does not actually explain their market basket
>> recommendations.  And in their personal recommendations (which they have
>> partially hidden now), you have to ask for the explanation.  The
>> explanation that they give is typically one or two of your actions which is
>> patently not a complete explanation.  So they clearly are saying one thing
>> and doing another, just as I am recommending here.
>>
>>
>>> Speaking about Amazon, the "also bought" UI thing is still there in their
>>> website, only *not* in their specific recommendation lists.
>>>
>>
>>
>> But note that they don't give percentages any more.  Also note that they
>> don't explain all of the things that they *don't* show you.
>>
>>
>>> It's down in the page, in sections like "Continue Shopping: Customers Who
>>> Bought Items in Your Recent History Also Bought". It does not give % values
>>> now, but it's essentially the same (and it works also when you are not
>>> logged in, since it is using your recent viewing history). That's why I
>>> thought it's coming from Market Basket Analysis (i.e. frequent itemsets).
>>>
>>>
>> I doubt it seriously.  Frequent itemset mining is typically much more
>> expensive than simple recommendations.
>>
>>
>>> Lift is indeed a good metric for the interestingness of a rule, but it can
>>> also produce unreasonably big values for rare itemsets. On the other hand,
>>> maybe this is good for uncovering long tail associations.
>>>
>>>
>> I have built a number of commercially successful recommendation engines and
>> simple overlap has always been a complete disaster.  I have also counseled
>> a number of companies along the lines given here and the resulting numbers
>> that they have achieved have been quite striking when they switched to
>> roughly what I am describing here.
>>
>> The only time the overlap is likely to work is if you have absolutely
>> massive data and can afford very high thresholds.  That completely
>> obliterates the long tail.
>>
>> You can claim to understand a system like Amazon's from the UI, but I would
>> seriously doubt that you are seeing 5% of what the recommendation engine is
>> really doing.
>>
>>
>

Re: Mahout Amazon EMR usage cost

Posted by Paulo Villegas <pa...@gmail.com>.
I don't disagree at all with what you're saying. I never said (or 
intended to say) that explanations would have to be a thorough dump of 
the engine's internal computation; this would not make sense to the user 
and would just overwhelm him. Picking out a couple of representative 
items would be more than enough.

And if the original algorithm is too complicated, yes, it may make sense 
to bring up an additional, simpler and more understandable engine just 
to produce explanations. But then you need to ensure that the 
explanations fit well with the results you're actually delivering. And 
in any case if you've got that additional engine and it works sensibly, 
you could as well aggregate its results into the main system and build 
up an ensemble. It may not work in all cases, but may do well in others. 
YMMV.

I'm also not saying I know exactly what Amazon is doing internally; you 
need a lot more than a casual look at the UI to infer that. They could 
be doing frequent itemset mining, or they might not be. But I maintain it 
can be a valid approach. A recommendation coming from association rules 
will have less coverage than a "standard" CF engine, and will probably 
miss a bigger part of the long tail, but for the goal of enlarging the 
basket of items the user is willing to buy in a single transaction it is 
perfectly well suited (i.e. don't find "the next best item", find "the 
item that goes along well with this one").

And if you model transactions adequately (like items watched in a single 
browsing session, when you might assume that the user has a single main 
intent, as opposed to coming back the next day with a different thing in 
mind) then it might help to discard spurious associations (such as you 
see sometimes in Amazon, anyway). Of course, a similar effect can be 
achieved with a "standard" recommender engine if you introduce time effects.
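
One simple way to build such transactions (a generic sketch, nothing
Mahout-specific; the 30-minute timeout is an arbitrary assumption) is to cut
each user's event stream wherever the gap between consecutive events exceeds
a timeout:

    import java.util.ArrayList;
    import java.util.List;

    public class Sessionizer {
      static final long GAP_MS = 30 * 60 * 1000L;  // assumed session timeout

      // One user's event timestamps, sorted ascending; returns the
      // half-open index ranges [start, end) of each session.
      static List<int[]> sessions(long[] timestamps) {
        List<int[]> out = new ArrayList<int[]>();
        int start = 0;
        for (int i = 1; i < timestamps.length; i++) {
          if (timestamps[i] - timestamps[i - 1] > GAP_MS) {
            out.add(new int[] {start, i});
            start = i;
          }
        }
        if (timestamps.length > 0) {
          out.add(new int[] {start, timestamps.length});
        }
        return out;
      }
    }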



> On Wed, Dec 5, 2012 at 6:57 AM, Paulo Villegas <pa...@gmail.com> wrote:
>
>> On 05/12/12 00:53, Ted Dunning wrote:
>>
>>> Also, you have to separate UI considerations from algorithm considerations.
>>> Whatever algorithm populates the recommendations is the recommender
>>> algorithm. It has two responsibilities: first, find items that the users
>>> will like, and second, pick out a variety of less certain items to learn
>>> about. It is not responsible for justifying choices to the user. The UI
>>> does that, and it may use analytics of some kind to make claims about the
>>> choices made, but that won't change the choices.
>>>
>>
>> Here I disagree: explaining recommendations to the user is an important
>> factor in user acceptance (and therefore uptake) of the results, since if
>> she can understand why some completely unknown item was recommended it'll
>> make her more confident that it's a good choice (this has also been proven
>> experimentally).
>
>
> I have demonstrated that explanations help as well in some cases.  Not in
> all.
>
>
>> And the one best placed to know why something was recommended is the
>> engine itself.
>
>
> This is simply not true.  The engine may have very complex reasons for
> recommendation.  This applies in classification as well.  It is completely
> conventional, and often critical to performance to have one engine for
> recommendation or classification and a completely independent one for
> explanation.
>
>
>> That's one good additional reason why item-based neighbourhood is more
>> advantageous than user-based: you can communicate item neighbours to the
>> user, who then sees items she knows that are similar to the one being
>> recommended (it's one of the things Amazon does in its recommendation
>> lists).
>
>
> Again.  This simply isn't that important.  The major goal of the
> recommendation engine is to produce high quality recommendations and one of
> the major problems in doing that is avoiding noise effects.  Ironically, it
> is also important for the recommendation engine to inject metered amounts
> of a different kind of noise as well.  Neither of those capabilities makes
> sense to explain to the user and these may actually dominate the decisions.
>
> Once an explainer is given a clean set of recommendations, then the problem
> of explaining is vastly different than the job of recommending.  For
> instance Tanimoto or Jaccard are horrible for recommendation but great for
> explaining.  The issue is that the explainer doesn't have to explain all of
> the items that are *not* shown, only those which are shown.
>
> Note that Amazon does not actually explain their market basket
> recommendations.  And in their personal recommendations (which they have
> partially hidden now), you have to ask for the explanation.  The
> explanation that they give is typically one or two of your actions which is
> patently not a complete explanation.  So they clearly are saying one thing
> and doing another, just as I am recommending here.
>
>
>> Speaking about Amazon, the "also bought" UI thing is still there in their
>> website, only *not* in their specific recommendation lists.
>
>
> But note that they don't give percentages any more.  Also note that they
> don't explain all of the things that they *don't* show you.
>
>
>> It's down in the page, in sections like "Continue Shopping: Customers Who
>> Bought Items in Your Recent History Also Bought". It does not give % values
>> now, but it's essentially the same (and it works also when you are not
>> logged in, since it is using your recent viewing history). That's why I
>> thought it's coming from Market Basket Analysis (i.e. frequent itemsets).
>>
>
> I doubt it seriously.  Frequent itemset mining is typically much more
> expensive than simple recommendations.
>
>
>> Lift is indeed a good metric for the interestingness of a rule, but it can
>> also produce unreasonably big values for rare itemsets. On the other hand,
>> maybe this is good for uncovering long tail associations.
>>
>
> I have built a number of commercially successful recommendation engines and
> simple overlap has always been a complete disaster.  I have also counseled
> a number of companies along the lines given here and the resulting numbers
> that they have achieved have been quite striking when they switched to
> roughly what I am describing here.
>
> The only time the overlap is likely to work is if you have absolutely
> massive data and can afford very high thresholds.  That completely
> obliterates the long tail.
>
> You can claim to understand a system like Amazon's from the UI, but I would
> seriously doubt that you are seeing 5% of what the recommendation engine is
> really doing.
>


Re: Mahout Amazon EMR usage cost

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Dec 5, 2012 at 6:57 AM, Paulo Villegas <pa...@gmail.com> wrote:

> On 05/12/12 00:53, Ted Dunning wrote:
>
>> Also, you have to separate UI considerations from algorithm considerations.
>> Whatever algorithm populates the recommendations is the recommender
>> algorithm. It has two responsibilities: first, find items that the users
>> will like, and second, pick out a variety of less certain items to learn
>> about. It is not responsible for justifying choices to the user. The UI
>> does that, and it may use analytics of some kind to make claims about the
>> choices made, but that won't change the choices.
>>
>
> Here I disagree: explaining recommendations to the user is an important
> factor in user acceptance (and therefore uptake) of the results, since if
> she can understand why some completely unknown item was recommended it'll
> make her more confident that it's a good choice (this has also been proven
> experimentally).


I have demonstrated that explanations help as well in some cases.  Not in
all.


> And the one best placed to know why something was recommended is the
> engine itself.


This is simply not true.  The engine may have very complex reasons for
recommendation.  This applies in classification as well.  It is completely
conventional, and often critical to performance to have one engine for
recommendation or classification and a completely independent one for
explanation.


> That's one good additional reason why item-based neighbourhood is more
> advantageous than user-based: you can communicate item neighbours to the
> user, who then sees items she knows that are similar to the one being
> recommended (it's one of the things Amazon does in its recommendation
> lists).


Again.  This simply isn't that important.  The major goal of the
recommendation engine is to produce high quality recommendations and one of
the major problems in doing that is avoiding noise effects.  Ironically, it
is also important for the recommendation engine to inject metered amounts
of a different kind of noise as well.  Neither of those capabilities makes
sense to explain to the user and these may actually dominate the decisions.
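
One well-known way to inject that kind of metered noise is dithering (a
sketch of the general idea, not necessarily the method meant here): re-score
the final ranked list as log(rank) plus Gaussian noise and re-sort, so a
small sigma barely disturbs the top of the list while a larger sigma mixes
deeper items in.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Random;

    public class Dither {
      // Returns a re-ordered copy of a ranked list, sorted ascending by
      // score(i) = log(i + 1) + N(0, sigma^2).
      static <T> List<T> dither(List<T> ranked, final double sigma, Random rnd) {
        final double[] score = new double[ranked.size()];
        List<Integer> idx = new ArrayList<Integer>();
        for (int i = 0; i < score.length; i++) {
          score[i] = Math.log(i + 1) + sigma * rnd.nextGaussian();
          idx.add(i);
        }
        Collections.sort(idx, new Comparator<Integer>() {
          public int compare(Integer a, Integer b) {
            return Double.compare(score[a], score[b]);
          }
        });
        List<T> out = new ArrayList<T>();
        for (int i : idx) {
          out.add(ranked.get(i));
        }
        return out;
      }
    }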

Once an explainer is given a clean set of recommendations, then the problem
of explaining is vastly different than the job of recommending.  For
instance Tanimoto or Jaccard are horrible for recommendation but great for
explaining.  The issue is that the explainer doesn't have to explain all of
the items that are *not* shown, only those which are shown.
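
To make the two-engine split concrete, a toy explainer might look like this
(entirely a sketch with hypothetical names; the recommender has already
chosen what to show, and the explainer only picks, for each shown item, the
best-overlapping item from the user's own history):

    import java.util.Map;
    import java.util.Set;

    public class Explainer {
      // Tanimoto overlap between the user sets of two items
      // (assumes both sets are non-empty).
      static double tanimoto(Set<Long> usersOfA, Set<Long> usersOfB) {
        int both = 0;
        for (long u : usersOfA) {
          if (usersOfB.contains(u)) both++;
        }
        return both / (double) (usersOfA.size() + usersOfB.size() - both);
      }

      // "Recommended because you bought X": X is the item from the user's
      // own history with the highest overlap with the shown item.
      static long bestReason(long shownItem, Iterable<Long> userHistory,
                             Map<Long, Set<Long>> usersByItem) {
        long best = -1L;
        double bestScore = -1.0;
        for (long h : userHistory) {
          double s = tanimoto(usersByItem.get(shownItem), usersByItem.get(h));
          if (s > bestScore) {
            bestScore = s;
            best = h;
          }
        }
        return best;
      }
    }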

Note that Amazon does not actually explain their market basket
recommendations.  And in their personal recommendations (which they have
partially hidden now), you have to ask for the explanation.  The
explanation that they give is typically one or two of your actions which is
patently not a complete explanation.  So they clearly are saying one thing
and doing another, just as I am recommending here.


> Speaking about Amazon, the "also bought" UI thing is still there in their
> website, only *not* in their specific recommendation lists.


But note that they don't give percentages any more.  Also note that they
don't explain all of the things that they *don't* show you.


> It's down in the page, in sections like "Continue Shopping: Customers Who
> Bought Items in Your Recent History Also Bought". It does not give % values
> now, but it's essentially the same (and it works also when you are not
> logged in, since it is using your recent viewing history). That's why I
> thought it's coming from Market Basket Analysis (i.e. frequent itemsets).
>

I doubt it seriously.  Frequent itemset mining is typically much more
expensive than simple recommendations.


> Lift is indeed a good metric for the interestingness of a rule, but it can
> also produce unreasonably big values for rare itemsets. On the other hand,
> maybe this is good for uncovering long tail associations.
>

I have built a number of commercially successful recommendation engines and
simple overlap has always been a complete disaster.  I have also counseled
a number of companies along the lines given here and the resulting numbers
that they have achieved have been quite striking when they switched to
roughly what I am describing here.

The only time the overlap is likely to work is if you have absolutely
massive data and can afford very high thresholds.  That completely
obliterates the long tail.

You can claim to understand a system like Amazon's from the UI, but I would
seriously doubt that you are seeing 5% of what the recommendation engine is
really doing.

Re: Mahout Amazon EMR usage cost

Posted by Paulo Villegas <pa...@gmail.com>.
On 05/12/12 00:53, Ted Dunning wrote:
> Also, you have to separate UI considerations from algorithm considerations.
> Whatever algorithm populates the recommendations is the recommender
> algorithm. It has two responsibilities: first, find items that the users
> will like, and second, pick out a variety of less certain items to learn
> about. It is not responsible for justifying choices to the user. The UI
> does that, and it may use analytics of some kind to make claims about the
> choices made, but that won't change the choices.

Here I disagree: explaining recommendations to the user is an important 
factor in user acceptance (and therefore uptake) of the results, since 
if she can understand why some completely unknown item was recommended 
it'll make her more confident that it's a good choice (this has also 
been proven experimentally). And the one best placed to know why something 
was recommended is the engine itself. That's one good additional reason why 
item-based neighbourhood is more advantageous than user-based: you can 
communicate item neighbours to the user, who then sees items she knows 
that are similar to the one being recommended (it's one of the things 
Amazon does in its recommendation lists). You can achieve more or less 
the same with matrix factorization approaches.

Speaking about Amazon, the "also bought" UI thing is still there in 
their website, only *not* in their specific recommendation lists. It's 
down in the page, in sections like "Continue Shopping: Customers Who 
Bought Items in Your Recent History Also Bought". It does not give % 
values now, but it's essentially the same (and it works also when you 
are not logged in, since it is using your recent viewing history). 
That's why I thought it's coming from Market Basket Analysis (i.e. 
frequent itemsets).

Lift is indeed a good metric for the interestingness of a rule, but it 
can also produce unreasonably big values for rare itemsets. On the other 
hand, maybe this is good for uncovering long tail associations.

Paulo

>
> On Wed, Dec 5, 2012 at 12:48 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> Yes, it's not a recommender problem, it's a most-similar-items
>> problem. Frequent itemset mining is really just a most-similar-items
>> algorithm, for one particular definition of similar (confidence). In
>> that sense they are nearly the same.
>>
>> For frequent itemsets, you have to pick a minimum support -- that's an
>> extra parameter to figure out, but that is precisely what speeds up
>> frequent itemset mining. But, it will also mean you have no answer for
>> long-tail items since they are excluded by the min support.
>>
>> If min support is 0, then you run into a different issue. For an item
>> that was bought by 1 person, anything else bought by those people has
>> confidence 1; it loses ability to discriminate. I just looked it up on
>> Wikipedia and found there's an idea of "lift" which would be better to
>> rank on. It's essentially a likelihood ratio. Which takes you right
>> back to Ted's advice to just find items with highest (log-)likelihood
>> similarity.
>>
>> For these reasons I also suspect that this is not actually how Amazon
>> et al determine which items to show you. In fact, I don't see any such
>> "70% of users also bought..." figures on their site now?
>>
>>
>> On Tue, Dec 4, 2012 at 10:51 PM, Paulo Villegas <pa...@tid.es> wrote:
>>> While the "70% of users also bought ..." figure could be generated by a
>>> suitable recommendation engine, I think it fits better with a frequent
>>> pattern mining approach i.e. Association Rules. I don't know if Amazon
>>> implements it that way, but it seems likely, since it's not really a
>>> personalized recommendation (unless we interpret the personalization as
>>> coming from the pages the user is visiting, i.e. real-time profile
>>> building).
>>>
>>> I believe Mahout has a frequent itemset mining algorithm (FPGrowth),
>>> though I've never tried it myself. For your problem, you would select
>>> the minimum support for your itemsets (this would eliminate spurious
>>> associations), and the confidence obtained would be directly your 70%
>> value.
>>>
>>> Although your formulation selects only the rules with 1 item in the
>>> antecedent, i.e. item1 -> item2, you could use the items visited before
>>> to build bigger antecedents.
>>>
>>> Paulo
>>>
>>
>


Re: Mahout Amazon EMR usage cost

Posted by Ted Dunning <te...@gmail.com>.
Also, you have to separate UI considerations from algorithm considerations.
Whatever algorithm populates the recommendations is the recommender
algorithm. It has two responsibilities: first, find items that the users
will like, and second, pick out a variety of less certain items to learn
about. It is not responsible for justifying choices to the user. The UI
does that, and it may use analytics of some kind to make claims about the
choices made, but that won't change the choices.

On Wed, Dec 5, 2012 at 12:48 AM, Sean Owen <sr...@gmail.com> wrote:

> Yes, it's not a recommender problem, it's a most-similar-items
> problem. Frequent itemset mining is really just a most-similar-items
> algorithm, for one particular definition of similar (confidence). In
> that sense they are nearly the same.
>
> For frequent itemsets, you have to pick a minimum support -- that's an
> extra parameter to figure out, but that is precisely what speeds up
> frequent itemset mining. But, it will also mean you have no answer for
> long-tail items since they are excluded by the min support.
>
> If min support is 0, then you run into a different issue. For an item
> that was bought by 1 person, anything else bought by those people has
> confidence 1; it loses ability to discriminate. I just looked it up on
> Wikipedia and found there's an idea of "lift" which would be better to
> rank on. It's essentially a likelihood ratio. Which takes you right
> back to Ted's advice to just find items with highest (log-)likelihood
> similarity.
>
> For these reasons I also suspect that this is not actually how Amazon
> et al determine which items to show you. In fact, I don't see any such
> "70% of users also bought..." figures on their site now?
>
>
> On Tue, Dec 4, 2012 at 10:51 PM, Paulo Villegas <pa...@tid.es> wrote:
> > While the "70% of users also bought ..." figure could be generated by a
> > suitable recommendation engine, I think it fits better with a frequent
> > pattern mining approach i.e. Association Rules. I don't know if Amazon
> > implements it that way, but it seems likely, since it's not really a
> > personalized recommendation (unless we interpret the personalization as
> > coming from the pages the user is visiting, i.e. real-time profile
> > building).
> >
> > I believe Mahout has a frequent itemset mining algorithm (FPGrowth),
> > though I've never tried it myself. For your problem, you would select
> > the minimum support for your itemsets (this would eliminate spurious
> > associations), and the confidence obtained would be directly your 70%
> value.
> >
> > Although your formulation selects only the rules with 1 item in the
> > antecedent, i.e. item1 -> item2, you could use the items visited before
> > to build bigger antecedents.
> >
> > Paulo
> >
>

Re: Mahout Amazon EMR usage cost

Posted by Sean Owen <sr...@gmail.com>.
Yes, it's not a recommender problem, it's a most-similar-items
problem. Frequent itemset mining is really just a most-similar-items
algorithm, for one particular definition of similar (confidence). In
that sense they are nearly the same.

For frequent itemsets, you have to pick a minimum support -- that's an
extra parameter to figure out, but that is precisely what speeds up
frequent itemset mining. But, it will also mean you have no answer for
long-tail items since they are excluded by the min support.

If min support is 0, then you run into a different issue. For an item
that was bought by 1 person, anything else bought by those people has
confidence 1; it loses ability to discriminate. I just looked it up on
Wikipedia and found there's an idea of "lift" which would be better to
rank on. It's essentially a likelihood ratio. Which takes you right
back to Ted's advice to just find items with highest (log-)likelihood
similarity.
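
Concretely, with made-up counts: confidence(A -> B) = n(A,B) / n(A), and
lift divides that by B's base rate n(B) / N, which is what rescues the
single-buyer case:

    public class RuleStats {
      static double confidence(long nAB, long nA) {
        return nAB / (double) nA;
      }

      static double lift(long nAB, long nA, long nB, long n) {
        return confidence(nAB, nA) / (nB / (double) n);
      }

      public static void main(String[] args) {
        // One buyer of A who also bought blockbuster B, owned by half of 1000 users:
        System.out.println(confidence(1, 1));      // 1.0: maximal, but meaningless
        System.out.println(lift(1, 1, 500, 1000)); // 2.0: deflated by B's popularity
      }
    }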

For these reasons I also suspect that this is not actually how Amazon
et al determine which items to show you. In fact, I don't see any such
"70% of users also bought..." figures on their site now?


On Tue, Dec 4, 2012 at 10:51 PM, Paulo Villegas <pa...@tid.es> wrote:
> While the "70% of users also bought ..." figure could be generated by a
> suitable recommendation engine, I think it fits better with a frequent
> pattern mining approach i.e. Association Rules. I don't know if Amazon
> implements it that way, but it seems likely, since it's not really a
> personalized recommendation (unless we interpret the personalization as
> coming from the pages the user is visiting, i.e. real-time profile
> building).
>
> I believe Mahout has a frequent itemset mining algorithm (FPGrowth),
> though I've never tried it myself. For your problem, you would select
> the minimum support for your itemsets (this would eliminate spurious
> associations), and the confidence obtained would be directly your 70% value.
>
> Although your formulation selects only the rules with 1 item in the
> antecedent, i.e. item1 -> item2, you could use the items visited before
> to build bigger antecedents.
>
> Paulo
>

Re: Mahout Amazon EMR usage cost

Posted by Paulo Villegas <pa...@tid.es>.
On 03/12/12 04:06, Koobas wrote:
> Thank you very much.
> The pointer to Myrrix is a very useful piece of information.
> Myrrix, however, relies on an iterative sparse matrix factorization to do
> PCA.
> I want to produce Amazon-like recommendations.
> I.e., "70% of users who bought this also bought that."
> So, I specifically want the direct kNN algorithm.
> Any clue what Mahout + Hadoop can deliver on that one?
> Thanks,
> Jacob

While the "70% of users also bought ..." figure could be generated by a
suitable recommendation engine, I think it fits better with a frequent
pattern mining approach i.e. Association Rules. I don't know if Amazon
implements it that way, but it seems likely, since it's not really a
personalized recommendation (unless we interpret the personalization as
coming from the pages the user is visiting, i.e. real-time profile
building).

I believe Mahout has a frequent itemset mining algorithm (FPGrowth),
though I've never tried it myself. For your problem, you would select
the minimum support for your itemsets (this would eliminate spurious
associations), and the confidence obtained would be directly your 70% value.
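
In rule terms (a sketch with hypothetical counts): for the rule
item1 -> item2, both quantities come straight from the pair counts; the
confidence is the "70%" figure, and the minimum support is what discards
pairs seen too rarely to trust.

    public class AssocRule {
      public static void main(String[] args) {
        long n = 1000000;     // total transactions (made-up numbers)
        long nItem1 = 40000;  // transactions containing item1
        long nBoth = 28000;   // transactions containing both item1 and item2

        double support = nBoth / (double) n;          // 0.028, checked against min support
        double confidence = nBoth / (double) nItem1;  // 0.70, the "70%" figure
        System.out.println(support + " " + confidence);
      }
    }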

Although your formulation selects only the rules with 1 item in the
antecedent, i.e. item1 -> item2, you could use the items visited before
to build bigger antecedents.

Paulo



Re: Mahout Amazon EMR usage cost

Posted by Koobas <ko...@gmail.com>.
Thank you very much.
The pointer to Myrrix is a very useful piece of information.
Myrrix, however, relies on an iterative sparse matrix factorization to do
PCA.
I want to produce Amazon-like recommendations.
I.e., "70% of users who bought this also bought that."
So, I specifically want the direct kNN algorithm.
Any clue what Mahout + Hadoop can deliver on that one?
Thanks,
Jacob


On Sun, Dec 2, 2012 at 5:25 PM, Sean Owen <sr...@gmail.com> wrote:

> My guess is: less than $10. Little enough that I wouldn't worry about
> it. But I have not tried it directly.
>
> You just have 10K items, so it ought to be relatively quick to find
> similar items for them. You will want to look at ItemSimilarityJob.
> Setting some parameters like --maxSimilaritiesPerRow and --threshold
> will be important to speed. On EMR, I suggest using 2-4 m1.xlarge
> instances and using spot instances. For the master, use on-demand and
> use m1.large. The usual Hadoop tunings like mapred.reduce.tasks matter
> a lot too. When set up well it should be quite economical.
>
> Since you mentioned implicit feedback and EMR, you may benefit from a
> look at Myrrix (http://myrrix.com). It can compute recommendations or
> item-item similarities, on Hadoop / EMR if desired, and is built for
> this implicit feedback model. The scale is no problem. It's
> pre-packaged and tuned to run by itself, so, might save you time and
> money versus trying to configure, run and tune it from scratch
> (http://myrrix.com/purchase-computation-layer/).  For what it may be
> worth I do have one recent benchmark on EMR
> (http://myrrix.com/example-wikipedia-links/) computing a model over
> 13M Wikipedia articles for about $7.
>
> On Sun, Dec 2, 2012 at 9:12 PM, Koobas <ko...@gmail.com> wrote:
> > I was wondering if somebody could give me a rough estimate of the cost of
> > running Mahout on Amazon's Elastic MapReduce for a specific problem.
> > I am working with a common case of implicit feedback.
> > I have a simple, boolean input, i.e., user-item pairs (userID, itemID).
> > I would like to find 50 nearest neighbors for each item.
> > I have 10M users, 10K items, and 500M records.
> > If anybody has any ballpark idea of the kind of cost it would take to
> solve
> > the problem using EMR, I would appreciate it very much.
> > Jacob
>

Re: Mahout Amazon EMR usage cost

Posted by Sean Owen <sr...@gmail.com>.
My guess is: less than $10. Little enough that I wouldn't worry about
it. But I have not tried it directly.

You just have 10K items, so it ought to be relatively quick to find
similar items for them. You will want to look at ItemSimilarityJob.
Setting some parameters like --maxSimilaritiesPerRow and --threshold
will be important to speed. On EMR, I suggest using 2-4 m1.xlarge
instances and using spot instances. For the master, use on-demand and
use m1.large. The usual Hadoop tunings like mapred.reduce.tasks matter
a lot too. When set up well it should be quite economical.
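
For reference, a sketch of launching that job from Java (option names follow
the 0.7-era ItemSimilarityJob and may differ by version, so check the job's
--help; the S3 paths and the threshold are placeholders):

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

    public class RunItemSimilarity {
      public static void main(String[] args) throws Exception {
        ToolRunner.run(new ItemSimilarityJob(), new String[] {
            "--input", "s3://mybucket/pairs",    // userID,itemID lines
            "--output", "s3://mybucket/similarities",
            "--similarityClassname", "SIMILARITY_TANIMOTO_COEFFICIENT",
            "--maxSimilaritiesPerItem", "50",    // the 50 neighbors wanted
            "--booleanData", "true",
            "--threshold", "0.1"                 // placeholder cutoff
        });
      }
    }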

Since you mentioned implicit feedback and EMR, you may benefit from a
look at Myrrix (http://myrrix.com). It can compute recommendations or
item-item similarities, on Hadoop / EMR if desired, and is built for
this implicit feedback model. The scale is no problem. It's
pre-packaged and tuned to run by itself, so, might save you time and
money versus trying to configure, run and tune it from scratch
(http://myrrix.com/purchase-computation-layer/).  For what it may be
worth I do have one recent benchmark on EMR
(http://myrrix.com/example-wikipedia-links/) computing a model over
13M Wikipedia articles for about $7.

On Sun, Dec 2, 2012 at 9:12 PM, Koobas <ko...@gmail.com> wrote:
> I was wondering if somebody could give me a rough estimate of the cost of
> running Mahout on Amazon's Elastic MapReduce for a specific problem.
> I am working with a common case of implicit feedback.
> I have a simple, boolean input, i.e., user-item pairs (userID, itemID).
> I would like to find 50 nearest neighbors for each item.
> I have 10M users, 10K items, and 500M records.
> If anybody has any ballpark idea of the kind of cost it would take to solve
> the problem using EMR, I would appreciate it very much.
> Jacob