Posted to user@mahout.apache.org by "Kai R. Larsen" <ka...@Colorado.EDU> on 2012/12/22 00:45:56 UTC

Mahout for item-item tables

Hi,

My sincere apologies if this is a naïve question (I'm sure it is).

I've engaged a programmer to take a weblog and focus on 250 pages containing items that may be similar (or not).  The goal is to create item-item relationship tables where every cell contains a score for how similar two items are.  He now tells me that only two of the (many) Mahout algorithms can be used to generate such tables, and that those two generate a distance of 1 or some other constant value between all pairs.

This can't be true, can it?  There must be a way to tease out such information from the algorithms.  Any advice?  Any ideas why all relationships would be one?  While it is common for the website users to have visited only one page at a time, it is not pervasive.

Best,

Kai Larsen

Re: Mahout for item-item tables

Posted by Ted Dunning <te...@gmail.com>.

On Sat, Dec 22, 2012 at 4:33 AM, Kai R. Larsen <ka...@colorado.edu> wrote:

> ...
> I'm not quite sure that your answer is directly responsive to the
> question

That would definitely not be the first time that I have missed the point.

> ...
> 1. Goal is to examine relationship between 250 web pages, so we extract
> the user sessions (they end after 1/2 hour of inactivity), remove bot
> entries, and input looks like this:
> User#   Page#
> 1       5
> 1       8
> 2       1
>

This looks very good.

> We do not include number of hits on a page or a star rating for each page
> (we have none). Sounds like you're saying that this is where the problem
> lies.

No.  I think that this is a good idea.  Occasionally, a threshold on number
of hits is useful, but normally not.  It is common for some additional
measure of engagement to be helpful as well.  For instance, if you can tell
that the page survived for some number of seconds in the user's browser,
that might be better than a simple page load (JS-based beacons work well for
this).  You might also get clues from an unload event (not usually
reliable) or evidence that the user went somewhere else right away (this is
very tricky to get right in the presence of multiple tabs).

But the idea of a binary they-did-it feature is good in principle.

> Mahout expecting either a binary variable or a count of number of
> accesses would explain the weird results.

Yes.  You can check for data format sensitivity by just putting a 1 as the
third value on each line.
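
A minimal sketch of that check, assuming the input is the two-column
user/page text shown above. The helper name add_binary_preference and the
comma-separated "user,page,1" output layout are assumptions made for
illustration; use whatever delimiter your Mahout job actually expects.

```python
def add_binary_preference(lines):
    """Turn 'user page' records into 'user,page,1' records.

    Hypothetical helper: the comma-separated output layout is an
    assumption, not something stated in this thread.
    """
    out = []
    for line in lines:
        fields = line.split()  # splits on tabs or runs of spaces
        # Skip blank lines and the 'User#  Page#' header row.
        if len(fields) != 2 or not all(f.isdigit() for f in fields):
            continue
        user, page = fields
        out.append(f"{user},{page},1")
    return out


if __name__ == "__main__":
    sample = ["User#\tPage#", "1\t5", "1\t8", "2\t1"]
    for record in add_binary_preference(sample):
        print(record)
```

If the results change once the explicit 1 is present, the job was probably
misreading the two-column file.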

> Doing some kind of log-entropy
> weighting makes further sense, thanks! Is what you shared log-entropy, by
> the way?
>

It is closely related.

See http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html for
my preferred method.  This is available in Mahout.
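
For readers who want to see the shape of that computation, here is a rough
sketch of a log-likelihood ratio score for one pair of pages, written from
the entropy formulation of the 2x2 contingency table. It is an independent
illustration, not Mahout's code; the function names and the session
dictionary layout are assumptions.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR score for a 2x2 table of co-occurrence counts.

    k11: sessions with both pages, k12: only page A,
    k21: only page B, k22: neither page.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def cooccurrence_counts(sessions, a, b):
    """Build the 2x2 table for pages a and b.

    sessions: dict mapping a session id to the set of pages it visited
    (an assumed layout, chosen for this sketch).
    """
    k11 = k12 = k21 = k22 = 0
    for pages in sessions.values():
        in_a, in_b = a in pages, b in pages
        if in_a and in_b:
            k11 += 1
        elif in_a:
            k12 += 1
        elif in_b:
            k21 += 1
        else:
            k22 += 1
    return k11, k12, k21, k22
```

A score near zero means the two pages co-occur about as often as chance
would predict; larger scores indicate a stronger association.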

Re: Mahout for item-item tables

Posted by "Kai R. Larsen" <ka...@Colorado.EDU>.

Thanks so much for this Ted,

I'm not quite sure that your answer is directly responsive to the
question, so let me try to clarify. As far as I understand Mahout, this is
our process:
1. Goal is to examine relationship between 250 web pages, so we extract
the user sessions (they end after 1/2 hour of inactivity), remove bot
entries, and input looks like this:
User#	Page#
1	5
1	8
2	1
...

We do not include number of hits on a page or a star rating for each page
(we have none). Sounds like you're saying that this is where the problem
lies.  Mahout expecting either a binary variable or a count of number of
accesses would explain the weird results. Doing some kind of log-entropy
weighting makes further sense, thanks! Is what you shared log-entropy, by
the way?
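
The session extraction in step 1 could be sketched roughly as follows. This
assumes each log entry has already been reduced to a (visitor, timestamp,
page) tuple with bot entries removed; the function name and tuple layout are
hypothetical, and treating a gap of more than 30 minutes as the cutoff is one
reading of "1/2 hour of inactivity".

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # inactivity timeout described above


def sessionize(events, gap=SESSION_GAP):
    """Split (visitor, timestamp, page) events into sessions.

    Returns (session_id, page) pairs in timestamp order. A visitor gets
    a new session id when more than `gap` has passed since their
    previous hit.
    """
    pairs = []
    last_seen = {}  # visitor -> timestamp of that visitor's previous hit
    current = {}    # visitor -> that visitor's current session id
    next_id = 1
    for visitor, ts, page in sorted(events, key=lambda e: e[1]):
        if visitor not in last_seen or ts - last_seen[visitor] > gap:
            current[visitor] = next_id
            next_id += 1
        last_seen[visitor] = ts
        pairs.append((current[visitor], page))
    return pairs
```

The session ids then play the role of the User# column in the input shown
above.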

Kai :-)

Re: Mahout for item-item tables

Posted by Ted Dunning <te...@gmail.com>.

The basic reason that it is common to binarize the relationships is that
putting weights on these relationships makes it really easy to over-fit,
thus giving you very goofy results.

One method for putting weights on these elements is to simply use

weight(i,j) = log((N_rows + 1) / (rowSum_i + 1)) * log((N_cols + 1) / (colSum_j + 1))

where the weight is set to zero for any cell of the item-item matrix that
does not contain a 1.

Another reasonable weighting is to simply use row or column counts
(depending on the application).  You get something very similar to this
weighting when you use a text retrieval engine to produce recommendations
where documents are columns of the item-item matrix and you multiply by a
user history expressed in items.
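
A small sketch of the weighting formula given earlier in this message,
reading the two adjacent log terms as a product. It assumes the matrix is a
plain list of lists holding 0/1 values; the function name is made up for
this example.

```python
import math


def binary_matrix_weights(matrix):
    """Apply weight(i,j) to every cell of a 0/1 matrix.

    Cells that do not contain a 1 keep a weight of zero.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(row[j] for row in matrix) for j in range(n_cols)]

    weights = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            if matrix[i][j] == 1:
                row_term = math.log((n_rows + 1) / (row_sums[i] + 1))
                col_term = math.log((n_cols + 1) / (col_sums[j] + 1))
                weights[i][j] = row_term * col_term
    return weights
```

Rows and columns with many 1s get small log terms, so very popular entries
are down-weighted much as common words are in text retrieval.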

Re: Mahout for item-item tables

Posted by Sean Owen <sr...@gmail.com>.

No, I don't know what use it would be if a similarity measure always
gave a 1. I assume he's thinking of the fact that some similarity
metrics use features that are not real-valued, but binary (exist or
not). But the result is not always 1, or even 1/0, but a value in
[-1,1].
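
As a rough illustration of a similarity built purely on binary
(visited or not) features, here is a sketch using the Tanimoto coefficient.
That measure is swapped in as one example of such a metric; it ranges over
[0,1] rather than the full [-1,1], and the function names and input layout
are assumptions for this example.

```python
def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient of two sets of users."""
    union = len(users_a | users_b)
    return len(users_a & users_b) / union if union else 0.0


def item_item_table(pairs):
    """Build {(page_p, page_q): similarity} from (user, page) pairs.

    Only pairs with page_p < page_q are stored, since the measure
    is symmetric.
    """
    visitors = {}  # page -> set of users who visited it
    for user, page in pairs:
        visitors.setdefault(page, set()).add(user)
    pages = sorted(visitors)
    return {
        (p, q): tanimoto(visitors[p], visitors[q])
        for p in pages
        for q in pages
        if p < q
    }
```

With any real overlap pattern in the data, different page pairs come out
with different scores, which is why a constant value for every pair points
to a problem with the input or the configuration.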
