Posted to user@mahout.apache.org by Serega Sheypak <se...@gmail.com> on 2014/09/03 09:43:55 UTC

org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob strange outcome

Hi, I have a regular process for recommendation calculation. It has been in
production and running for more than a week.
At the start of each day, the process:
1. consumes data from the last 2 months (current_day - 2 months)
2. prepares the data for Mahout (maps source ids to Mahout ids)
3. feeds the prepared data to ItemSimilarityJob
4. remaps the results from Mahout ids back to source ids
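
A minimal sketch of the id mapping in steps 2 and 4 (the names and structure here are hypothetical, not the actual production code):

```python
def build_id_maps(source_ids):
    # Step 2: assign contiguous Mahout ids 0..n-1 to the source ids.
    to_mahout = {s: i for i, s in enumerate(sorted(source_ids))}
    # Step 4: invert the mapping to translate Mahout output back.
    to_source = {i: s for s, i in to_mahout.items()}
    return to_mahout, to_source

to_mahout, to_source = build_id_maps({"item-b", "item-a", "item-c"})
```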

I've started to get strange results for extremely popular items.
For example: the iPhone gets covers and iPhone-related accessories as
recommendations, but absolutely unrelated items also appear in the top
recommendations.

What is the right way to debug such situations? What can I read? There
were no changes to any system.

Re: org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob strange outcome

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Sep 3, 2014 at 2:21 PM, Serega Sheypak <se...@gmail.com>
wrote:

> Hi, thanks for the response.
> 1. Where can I read about how LLR works in Mahout? I'm not a
> math person, so the Java code gives no intuition.
>

See here:

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
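
In essence, the LLR (G²) score described in that post can be computed from the 2×2 cooccurrence contingency table as a difference of entropies. A rough sketch of the formula (a toy illustration, not Mahout's actual implementation):

```python
import math

def entropy(*counts):
    # Shannon entropy (in nats) of a list of raw counts.
    total = sum(counts)
    return sum(-c / total * math.log(c / total) for c in counts if c > 0)

def llr(k11, k12, k21, k22):
    # k11: users with both items, k12: item A only,
    # k21: item B only, k22: neither item.
    # G^2 = 2 * N * (H(row sums) + H(col sums) - H(whole table))
    n = k11 + k12 + k21 + k22
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2 * n * (row_entropy + col_entropy - mat_entropy)
```

Statistically independent items score near zero; items that cooccur much more (or less) than chance would predict score high.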



> 2. What is "down-sampling" in the context of this problem?
> I've found the translation: ~ "reduce the size of the group you choose". I do
> not change anything. It worked pretty well for a week, then started to show
> strange items at the top.
>

Down-sampling is the process of discarding data (hopefully without changing
results).  In recommendations, it is common to down-sample user histories
to a maximum size (typically 500 or 1000 items) because the cost of the
algorithm is proportional to the sum of the squares of the sizes of all
histories.  Likewise, it is common to down-sample very common items.  If you
keep enough information here, it makes no difference to the quality of the
results under normal circumstances.

See

http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf
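
Capping a user history can be sketched as follows (a toy illustration of the idea, not the actual Mahout job, which does this inside its MapReduce passes):

```python
import random

def downsample_history(history, cap=500, seed=42):
    # Keep at most `cap` interactions per user, sampled uniformly at
    # random, so the quadratic cost of pairing items within a history
    # stays bounded. Short histories pass through unchanged.
    if len(history) <= cap:
        return list(history)
    return random.Random(seed).sample(history, cap)
```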



> 3. Transmission error.
> I can't find a place for this. The guys don't keep history for the item
> dictionary, but I keep a snapshot of it for each calculation day. I've checked
> the suspicious item ids - no changes at all.
>

OK.  Good to know.

Good for you for keeping snapshots.


>
> 4. the counts are somehow very wrong.  conceivable but unlikely.
> I've lost the context. The count of what?
>

The count of the number of times the items in question occurred in user
histories with and without the other items.
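
Concretely, for a pair of items those counts form a 2×2 contingency table over user histories. A small sketch of how one might tally them when debugging (a hypothetical helper, not part of Mahout):

```python
def cooccurrence_table(histories, item_a, item_b):
    # k11: histories containing both items, k12: item_a only,
    # k21: item_b only, k22: neither item.
    k11 = k12 = k21 = k22 = 0
    for h in histories:
        a, b = item_a in h, item_b in h
        if a and b:
            k11 += 1
        elif a:
            k12 += 1
        elif b:
            k21 += 1
        else:
            k22 += 1
    return k11, k12, k21, k22
```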


>
> 5. somehow encouraged users to combine these seemingly unrelated items
> Checking the cooccurrence...
>

This is basically the same action as for (4).

> 2014-09-03 12:06 GMT+04:00 Ted Dunning <te...@gmail.com>:
>
> > On Wed, Sep 3, 2014 at 12:43 AM, Serega Sheypak <
> serega.sheypak@gmail.com>
> > wrote:
> >
> > > What is the right way to debug such situations? What can I read? There
> > > were no changes to any system.
> > >
> > >
> > First step is to dump the cooccurrence counts for the items in question.
> >
> > The process from there is to find out what the problem is.  It can be:
> >
> > 1) the LLR calculation went wrong (unlikely given the high usage of that
> > code)
> >
> > 2) the counts are somehow very wrong.  conceivable but unlikely.
> >
> > 3) the down-sampling causes some strange pathology.  You should record
> > counts before and after downsampling.  This is very unlikely.
> >
> > 4) the data has some format or other transmission error.  Moderately
> > likely.
> >
> > 5) the system has somehow encouraged users to combine these seemingly
> > unrelated items.  This is most likely in my experience.
> >
> > If the case is (5), you can't fix it with math or code.  The best answer
> is
> > to simply have an exception list that says when you have *this* item, you
> > must not show *that* item as a recommendation.  Essentially, this is an
> > edit on the indicator list.
> >
>

Re: org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob strange outcome

Posted by Serega Sheypak <se...@gmail.com>.
Hi, thanks for the response.
1. Where can I read about how LLR works in Mahout? I'm not a
math person, so the Java code gives no intuition.

2. What is "down-sampling" in the context of this problem?
I've found the translation: ~ "reduce the size of the group you choose". I do
not change anything. It worked pretty well for a week, then started to show
strange items at the top.

3. Transmission error.
I can't find a place for this. The guys don't keep history for the item
dictionary, but I keep a snapshot of it for each calculation day. I've checked
the suspicious item ids - no changes at all.

4. the counts are somehow very wrong.  conceivable but unlikely.
I've lost the context. The count of what?

5. somehow encouraged users to combine these seemingly unrelated items
Checking the cooccurrence...




2014-09-03 12:06 GMT+04:00 Ted Dunning <te...@gmail.com>:

> On Wed, Sep 3, 2014 at 12:43 AM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> > What is the right way to debug such situations? What can I read? There
> > were no changes to any system.
> >
> >
> First step is to dump the cooccurrence counts for the items in question.
>
> The process from there is to find out what the problem is.  It can be:
>
> 1) the LLR calculation went wrong (unlikely given the high usage of that
> code)
>
> 2) the counts are somehow very wrong.  conceivable but unlikely.
>
> 3) the down-sampling causes some strange pathology.  You should record
> counts before and after downsampling.  This is very unlikely.
>
> 4) the data has some format or other transmission error.  Moderately
> likely.
>
> 5) the system has somehow encouraged users to combine these seemingly
> unrelated items.  This is most likely in my experience.
>
> If the case is (5), you can't fix it with math or code.  The best answer is
> to simply have an exception list that says when you have *this* item, you
> must not show *that* item as a recommendation.  Essentially, this is an
> edit on the indicator list.
>

Re: org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob strange outcome

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Sep 3, 2014 at 12:43 AM, Serega Sheypak <se...@gmail.com>
wrote:

> What is the right way to debug such situations? What can I read? There
> were no changes to any system.
>
>
First step is to dump the cooccurrence counts for the items in question.

The process from there is to find out what the problem is.  It can be:

1) the LLR calculation went wrong (unlikely given the high usage of that
code)

2) the counts are somehow very wrong.  conceivable but unlikely.

3) the down-sampling causes some strange pathology.  You should record
counts before and after downsampling.  This is very unlikely.

4) the data has some format or other transmission error.  Moderately likely.

5) the system has somehow encouraged users to combine these seemingly
unrelated items.  This is most likely in my experience.

If the case is (5), you can't fix it with math or code.  The best answer is
to simply have an exception list that says when you have *this* item, you
must not show *that* item as a recommendation.  Essentially, this is an
edit on the indicator list.
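
Such an exception list can be as simple as a post-filter on the indicator output. A minimal sketch (item names are made up for illustration):

```python
def apply_exception_list(item, recommendations, exceptions):
    # `exceptions` maps an item id to the set of item ids that must never
    # be shown alongside it -- a manual edit of the indicator list.
    banned = exceptions.get(item, set())
    return [r for r in recommendations if r not in banned]

filtered = apply_exception_list(
    "iphone",
    ["iphone-cover", "garden-hose"],
    {"iphone": {"garden-hose"}},
)
```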