Posted to dev@mahout.apache.org by "Ankur (JIRA)" <ji...@apache.org> on 2009/01/13 13:33:01 UTC

[jira] Commented: (MAHOUT-19) Hierarchial clusterer

    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663319#action_12663319 ] 

Ankur commented on MAHOUT-19:
-----------------------------

Hi Karl, Welcome back :-)
Can you share the following few things about this patch?

1. Assuming you are training the tree top-down, what division criterion are you using?
2. How well does it scale?
3. Was the data on which this was tried sparse?
4. What distance metric has been used?

Basically I have a use case wherein I have a set of 5 - 10 million URLs which have an inherent hierarchical relationship, and a set of user clicks on them. I would like to cluster them in a tree and use the model to answer near-neighborhood queries, i.e., which URLs are related to which other URLs. I did implement a sequential bottom-up hierarchical clustering algorithm, but its complexity is prohibitive for my data set. I then thought about implementing a top-down hierarchical clustering algorithm using the Jaccard coefficient as my distance measure, and came across this patch.

Can you suggest if this patch will help?
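
A side note on the measures involved: on binary feature sets the Jaccard coefficient proposed here and the Tanimoto similarity used in the patch coincide. A minimal sketch of the measure, assuming items are represented as sets of feature IDs; the class below is illustrative, not Mahout API:

import java.util.HashSet;
import java.util.Set;

public final class JaccardSimilarity {

  /** |intersection| / |union|; defined as 0.0 for two empty sets. */
  public static double similarity(Set<Long> a, Set<Long> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 0.0;
    }
    Set<Long> common = new HashSet<Long>(a);
    common.retainAll(b);
    int union = a.size() + b.size() - common.size();
    return (double) common.size() / union;
  }

  /** Distance form usable in clustering: 1 - similarity. */
  public static double distance(Set<Long> a, Set<Long> b) {
    return 1.0 - similarity(a, b);
  }
}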

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchical clusterer the instances are the leaf nodes of a tree whose branch nodes contain the mean features of, and the distance between, their children.
> For performance reasons I always trained trees top->down. I have been told that this can cause various effects I never encountered, and I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map-reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search of a cluster.
> Navigating the tree and gathering nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to gain from extracting all clusters from the start. Also, it usually makes sense to allow the user to modify the cluster boundary variables in real time using a slider, or perhaps to present a named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not too bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for an instant "more like this" feature in search engines, using Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a hierarchical clusterer.
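
A rough sketch of the threshold-based cluster extraction the description outlines; the node layout and names below are assumptions, not the patch's actual classes:

import java.util.List;

/** Hypothetical node layout: a leaf holds an instance; a branch holds
    the mean features of and the distance between its two children. */
final class ClusterNode {
  ClusterNode left;
  ClusterNode right;
  double[] meanFeatures;
  double childDistance;  // distance between left and right
  Object instance;       // non-null only at leaves

  boolean isLeaf() { return instance != null; }

  /** Collect one cluster: all leaves reachable from this node without
      crossing an edge whose child distance exceeds the threshold. */
  void gather(double maxDistance, List<Object> cluster) {
    if (isLeaf()) {
      cluster.add(instance);
    } else if (childDistance <= maxDistance) {
      left.gather(maxDistance, cluster);
      right.gather(maxDistance, cluster);
    }
  }
}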

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.

Thanks Sean, I will try this one out on my dataset and keep the list
posted on how well it worked.

Regards
-Ankur

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Friday, January 23, 2009 5:55 AM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Here's a DataModel you could try out for your purposes; the rest
should be as I described earlier.


package org.apache.mahout.cf.taste.example;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.BooleanPrefUser;
import org.apache.mahout.cf.taste.impl.common.FastSet;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.Item;
import org.apache.mahout.cf.taste.model.User;

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.ArrayList;

public final class AnkursDataModel extends FileDataModel {

  public AnkursDataModel(File ratingsFile) throws IOException {
    super(ratingsFile);
  }

  @Override
  protected void processLine(String line,
                             Map<String, List<Preference>> data,
                             Map<String, Item> itemCache) {
    String[] tokens = line.split("\t");
    String userID = tokens[0];
    List<Preference> prefs = new ArrayList<Preference>(tokens.length - 1);
    for (int tokenNum = 1; tokenNum < tokens.length; tokenNum++) {
      String itemID = tokens[tokenNum];
      Item item = itemCache.get(itemID);
      if (item == null) {
        item = buildItem(itemID);
        itemCache.put(itemID, item);
      }
      prefs.add(new GenericPreference(null, item, 1.0));
      // this is a little ugly but makes it easy to reuse FileDataModel --
      // pref values are tossed below
    }
    data.put(userID, prefs);
  }

  @Override
  protected User buildUser(String id, List<Preference> prefs) {
    FastSet<Object> itemIDs = new FastSet<Object>();
    for (Preference pref : prefs) {
      itemIDs.add(pref.getItem().getID());
    }
    return new BooleanPrefUser(id, itemIDs);
  }
}
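
Hypothetical usage, wiring this model into the item-based recommender from the earlier snippet; the file name here is made up:

// clicks.txt: one user per line -- userID \t itemID1 \t itemID2 ...
DataModel model = new AnkursDataModel(new File("clicks.txt"));
ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
similarity = new CachingItemSimilarity(similarity); // for speed
Recommender recommender = new GenericItemBasedRecommender(model, similarity);
List<RecommendedItem> recommendations = recommender.recommend("someUserID", 10);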



On Wed, Jan 21, 2009 at 7:57 AM, Goel, Ankur <an...@corp.aol.com>
wrote:
> The input data format is typically
> User-id \t item-id \t (other information)
>
> From here it can be transformed into either of the formats, as they are just
> 1 map-red away. After transformation the input data set will contain
> lines only in 1 format and not both. The data format that I use has each
> line of the form
>
> User-id \t (Item-id1:other_info) \t (Item-id2:other_info) ...
>
> As for co-occurrence counting the way Ted mentioned, I implemented a
> map-red version of it and have found it to be pretty efficient, simple
> and effective too.
>
> A couple of tricks, like only keeping the top-X co-occurred items for an
> item by count and emitting only those item pairs that match a certain
> criterion, have worked very well.
>
> I would like to contribute it to Mahout and have filed a JIRA for the same:
> https://issues.apache.org/jira/browse/MAHOUT-103
>
> I will have a patch coming soon.
>
> What I am looking for is a complementary technique that does not depend
> so much on co-occurrences and tries to do some sort of latent-variable
> analysis to answer my query.
>
> Thanks
> -Ankur

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Here's a DataModel you could try out for your purposes; the rest
should be as I described earlier.


package org.apache.mahout.cf.taste.example;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.BooleanPrefUser;
import org.apache.mahout.cf.taste.impl.common.FastSet;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.Item;
import org.apache.mahout.cf.taste.model.User;

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.ArrayList;

public final class AnkursDataModel extends FileDataModel {

  public AnkursDataModel(File ratingsFile) throws IOException {
    super(ratingsFile);
  }

  @Override
  protected void processLine(String line,
                             Map<String, List<Preference>> data,
                             Map<String, Item> itemCache) {
    String[] tokens = line.split("\t");
    String userID = tokens[0];
    List<Preference> prefs = new ArrayList<Preference>(tokens.length - 1);
    for (int tokenNum = 1; tokenNum < tokens.length; tokenNum++) {
      String itemID = tokens[tokenNum];
      Item item = itemCache.get(itemID);
      if (item == null) {
        item = buildItem(itemID);
        itemCache.put(itemID, item);
      }
      prefs.add(new GenericPreference(null, item, 1.0));
      // this is a little ugly but makes it easy to reuse FileDataModel --
      // pref values are tossed below
    }
    data.put(userID, prefs);
  }

  @Override
  protected User buildUser(String id, List<Preference> prefs) {
    FastSet<Object> itemIDs = new FastSet<Object>();
    for (Preference pref : prefs) {
      itemIDs.add(pref.getItem().getID());
    }
    return new BooleanPrefUser(id, itemIDs);
  }
}



On Wed, Jan 21, 2009 at 7:57 AM, Goel, Ankur <an...@corp.aol.com> wrote:
> The input data format is typically
> User-id \t item-id \t (other information)
>
> From here it can be transformed into either of the formats, as they are just
> 1 map-red away. After transformation the input data set will contain
> lines only in 1 format and not both. The data format that I use has each
> line of the form
>
> User-id \t (Item-id1:other_info) \t (Item-id2:other_info) ...
>
> As for co-occurrence counting the way Ted mentioned, I implemented a
> map-red version of it and have found it to be pretty efficient, simple
> and effective too.
>
> A couple of tricks, like only keeping the top-X co-occurred items for an
> item by count and emitting only those item pairs that match a certain
> criterion, have worked very well.
>
> I would like to contribute it to Mahout and have filed a JIRA for the same:
> https://issues.apache.org/jira/browse/MAHOUT-103
>
> I will have a patch coming soon.
>
> What I am looking for is a complementary technique that does not depend
> so much on co-occurrences and tries to do some sort of latent-variable
> analysis to answer my query.
>
> Thanks
> -Ankur

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
The input data format is typically
User-id \t item-id \t (other information)

From here it can be transformed into either of the formats, as they are just
1 map-red away. After transformation the input data set will contain
lines only in 1 format and not both. The data format that I use has each
line of the form

User-id \t (Item-id1:other_info) \t (Item-id2:other_info) ...

As for co-occurrence counting the way Ted mentioned, I implemented a
map-red version of it and have found it to be pretty efficient, simple
and effective too.

A couple of tricks, like only keeping the top-X co-occurred items for an
item by count and emitting only those item pairs that match a certain
criterion, have worked very well.

I would like to contribute it to Mahout and have filed a JIRA for the same:
https://issues.apache.org/jira/browse/MAHOUT-103

I will have a patch coming soon.

What I am looking for is a complementary technique that does not depend
so much on co-occurrences and tries to do some sort of latent-variable
analysis to answer my query.

Thanks
-Ankur
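
A minimal sketch of such a co-occurrence counting job, against the Hadoop 0.19-era mapred API; the class names and the one-user-per-line input layout are assumptions, and MAHOUT-103 will carry the actual patch:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/** Map: one user's history per line (userID \t item1 \t item2 ...);
    emits every ordered pair of items seen together in that history. */
public class CooccurrenceMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String[] tokens = value.toString().split("\t");
    for (int i = 1; i < tokens.length; i++) {
      for (int j = 1; j < tokens.length; j++) {
        if (i != j) {
          output.collect(new Text(tokens[i] + '\t' + tokens[j]), ONE);
        }
      }
    }
  }
}

/** Reduce: sums the pair counts. Top-X pruning per item would be a
    second pass keyed on the first item of each pair. */
class CooccurrenceReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text pair, Iterator<IntWritable> counts,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    output.collect(pair, new IntWritable(sum));
  }
}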

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Tuesday, January 20, 2009 11:33 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Surprisingly, both forms are about the same in difficulty for
co-occurrence.  Firstly, both forms are a single map-reduce apart.
Secondly, both forms are likely the output of a log analysis where the
input form is actually more likely user/item pairs.  From that form,
co-occurrence counting is most easily done by reducing on user, emitting
all pairs of items and then counting in traditional wise.

But with very large data sets, even before doing the actual
co-occurrence, it is commonly advisable to reduce to item-major form and
down-sample the users associated with the most common items.  This is
similar to the row and column normalization done in singular value
techniques, but is applied to the original data.

Map-reduce is pretty impressive though; sampling is not necessary except
for the largest data sets on the smallest clusters.

The biggest surprise I have had in using this sort of data reduction is
that simply emitting all of the item pairs is pretty danged efficient.
There are clever things to do to avoid so much data motion, but they
save surprisingly little and are much more complex to implement
(correctly).

On Tue, Jan 20, 2009 at 4:26 AM, Sean Owen <sr...@gmail.com> wrote:

> how can I tell when a line specifies the opposite, item
> followed by user IDs? the former is easier, BTW.
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
Surprisingly, both forms are about the same in difficulty for co-occurrence.
Firstly, both forms are a single map-reduce apart.  Secondly, both forms are
likely the output of a log analysis where the input form is actually more
likely user/item pairs.  From that form, co-occurrence counting is most
easily done by reducing on user, emitting all pairs of items and then
counting in traditional wise.

But with very large data sets, even before doing the actual co-occurrence,
it is commonly advisable to reduce to item-major form and down-sample the
users associated with the most common items.  This is similar to the row and
column normalization done in singular value techniques, but is applied to
the original data.

Map-reduce is pretty impressive though; sampling is not necessary except for
the largest data sets on the smallest clusters.

The biggest surprise I have had in using this sort of data reduction is that
simply emitting all of the item pairs is pretty danged efficient.  There are
clever things to do to avoid so much data motion, but they save surprisingly
little and are much more complex to implement (correctly).

On Tue, Jan 20, 2009 at 4:26 AM, Sean Owen <sr...@gmail.com> wrote:

> how can I tell when a line specifies the opposite, item
> followed by user IDs? the former is easier, BTW.
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Continuing talking to myself here -- I reverse myself again. After
thinking about this a bit more... the "boolean" preference case
doesn't need special attention in the context of item-based
recommenders or similarity metrics. It doesn't change the
implementation.

So Otis I think you could try item-based recommenders in your case, in
the way I suggested above.

On Tue, Jan 20, 2009 at 10:09 PM, Sean Owen <sr...@gmail.com> wrote:
> Er yeah Otis reminds me that I really actually need to make a few more
> classes to support item-based recommendation in this case. The code
> snippet I gave will work but will be better after I get to writing in
> a few more changes to actually support item-based recommenders here.
>
> On Tue, Jan 20, 2009 at 7:37 PM, Otis Gospodnetic
> <ot...@yahoo.com> wrote:
>> Hello,
>>
>> Boolean* classes sped up things for me:
>>
>>    UserSimilarity similarity = new BooleanTanimotoCoefficientSimilarity(model);
>>    hood = new NearestNUserNeighborhood(HOOD_SIZE, MIN_SIMILARITY, similarity, model);
>>    recommender = new BooleanUserGenericUserBasedRecommender(model, hood, similarity);
>>
>> Sean did recommend using Item-based recommender when the number of items is relatively low compared to the number of users, but we only have Boolean flavour of User-based recommender in svn for now.
>

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
User based recommendations are very close to item based recommendations.

If you view all of your user history as a sparse matrix H that has users x
items, then the users who have similar histories to a particular history x
can be computed by (roughly) H x.

Then if you want to look at what items those users have interacted with, you
multiply again to get H' (H x).

Because multiplication is associative, this is the same as (H' H) x which is
just item level recommendations.

Now, it is true that H shouldn't really be exactly the user histories
(sampling and scaling will intervene) and it is true that H x isn't really
quite multiplication (you probably will want to trim the number of non-zero
elements, for example) and it is true that because of all these imprecisions
that H' (H x) isn't quite the same as (H' H) x.  It is pretty darned close,
however, and the pattern of computation is substantially the same no matter
what you do.

If you look at the cost you naturally start by dividing the cost into
on-line and off-line components.  The on-line cost is what it costs you to
do each recommendation.  For user based recommendation, this consists of two
matrix-vector multiplications.  For item based recommendation, this consists
of one matrix vector multiplication.  The off-line costs are 0 and one
matrix-matrix product respectively.  Since off-line computation is vastly
cheaper than on-line computation, it usually behooves one to move
computation off-line.  Also, note that the user based recommendation
involves multiplication by H' and by H.  Unless you keep both versions in
memory, you will pay a terrible cost.  That makes user based recs even more
expensive than they first appear.

Latent variable techniques change the structure of H to be something like a
product decomposition.  For SVD, we have H = U S V' where S is a small
diagonal matrix and U' U = I and V' V = I.  This means that H' H = V S^2 V'
which can make computing H' H x more or less expensive depending on
implementation, but, more importantly, it smooths H' H substantially so that
elements that would have been zero are not zero (in a good way).  Other
latent variable techniques like LDA use different terminology and may not
actually do multiplication, but the pattern of computation is still the same
and many of the conclusions still hold at least approximately.

Almost all of the work in recommendations centers around small changes in
exactly what the matrix multiplication really means and how many non-zero
elements are retained.  Looking at the problem in a unified way helps
understand how to make trade-offs in the architecture.
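
To make the associativity point concrete, a toy dense sketch; nothing here is Mahout code, and real data would be sparse, sampled and trimmed rather than materialized like this:

import java.util.Arrays;

/** Toy illustration: H' (H x) and (H' H) x yield the same scores. */
public class RecAssociativityDemo {

  static double[] times(double[][] m, double[] v) {
    double[] r = new double[m.length];
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < v.length; j++) {
        r[i] += m[i][j] * v[j];
      }
    }
    return r;
  }

  static double[][] transpose(double[][] m) {
    double[][] t = new double[m[0].length][m.length];
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < m[0].length; j++) {
        t[j][i] = m[i][j];
      }
    }
    return t;
  }

  static double[][] times(double[][] a, double[][] b) {
    double[][] c = new double[a.length][b[0].length];
    for (int i = 0; i < a.length; i++) {
      for (int k = 0; k < b.length; k++) {
        for (int j = 0; j < b[0].length; j++) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
    return c;
  }

  public static void main(String[] args) {
    double[][] h = {{1, 1, 0, 0}, {0, 1, 1, 0}, {1, 0, 1, 1}};  // users x items
    double[] x = {1, 0, 0, 1};                                  // one user's history
    double[] userBased = times(transpose(h), times(h, x));      // H' (H x): two on-line products
    double[] itemBased = times(times(transpose(h), h), x);      // (H' H) x: H' H precomputed off-line
    System.out.println(Arrays.toString(userBased));             // identical scores
    System.out.println(Arrays.toString(itemBased));
  }
}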

On Tue, Jan 20, 2009 at 11:37 AM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:

> Hello,
>
> Boolean* classes sped up things for me:
>
>    UserSimilarity similarity = new BooleanTanimotoCoefficientSimilarity(model);
>    hood = new NearestNUserNeighborhood(HOOD_SIZE, MIN_SIMILARITY, similarity, model);
>    recommender = new BooleanUserGenericUserBasedRecommender(model, hood, similarity);
>
> Sean did recommend using Item-based recommender when the number of items is
> relatively low compared to the number of users, but we only have Boolean
> flavour of User-based recommender in svn for now.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: Sean Owen <sr...@gmail.com>
> > To: mahout-dev@lucene.apache.org
> > Sent: Tuesday, January 20, 2009 7:26:07 AM
> > Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
> >
> > Yes log-likelihood is already implemented as well as a similarity
> > metric, no problem.
> >
> > What I need to do is cook up a quick DataModel that reads your file
> > format. Is it really like this: user ID followed by item IDs that are
> > associated? how can I tell when a line specifies the opposite, item
> > followed by user IDs? the former is easier, BTW.
> >
> > [1234: 324, 555, 333]
> >
> > Then the code to build a recommender is basically:
> >
> > DataModel model = ...; // the custom DataModel I'll make
> > ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
> > // can also try ... = new TanimotoCoefficientSimilarity(model);
> > similarity = new CachingItemSimilarity(similarity); // for speed
> > Recommender recommender = new GenericItemBasedRecommender(model, similarity);
> > List<RecommendedItem> recommendations = recommender.recommend("1234", 10);
> >
> > Your data set is big but not so large that it couldn't fit in memory,
> > I think. For now I think this easily runs on a reasonably beefy
> > machine -- it's going to need a lot of RAM to cache lots of item-item
> > similarities or else that computation will slow things down a lot.
> >
> > Easy enough to try and see how it flies.
> >
> > Sean
> >
> > On Tue, Jan 20, 2009 at 5:30 AM, Goel, Ankur wrote:
> > > Hi Sean,
> > >        Thanks for helping out here. The data can be assumed to be in
> > > either form mentioned below, since both forms are interchangeable:-
> > >
> > > [userA: item1, item2, item3 ... ]
> > > OR
> > > [item1: userA, userB, userC ..]
> > >
> > > Each user and each item is assigned a unique identifier. The ratings
> > > can be considered as binary 1 if user clicked on an item and 0 otherwise.
> > > Thing to note here is that in case of 0 the item does not exist in the
> > > user history. So what we have essentially is a sparse representation
> > > where 0's are not stored at all.
> > >
> > > As for which one is more (user/item) from the dataset we have relatively
> > > high number of users and fewer items. There are around 200 - 300 thousand
> > > unique items but expected to grow to 1 - 2 million. So I think an
> > > item-based recommender sounds like something we can try out.
> > >
> > > About Tanimoto measure, I thought of using it in hierarchical clustering
> > > but Ted suggested it might not solve the purpose. He suggested that we
> > > can try computing the log-likelihood of co-occurrence of items.
> > >
> > > I would like to try out both the item based recommender you suggested
> > > and also the log-likelihood approach. Do we have the map-red version of
> > > log-likelihood code in Mahout?
> > >
> > > Ted, any thoughts?
> > >
> > > Regards
> > > -Ankur
>
>


-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Er yeah Otis reminds me that I really actually need to make a few more
classes to support item-based recommendation in this case. The code
snippet I gave will work but will be better after I get to writing in
a few more changes to actually support item-based recommenders here.

On Tue, Jan 20, 2009 at 7:37 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Hello,
>
> Boolean* classes sped up things for me:
>
>    UserSimilarity similarity = new BooleanTanimotoCoefficientSimilarity(model);
>    hood = new NearestNUserNeighborhood(HOOD_SIZE, MIN_SIMILARITY, similarity, model);
>    recommender = new BooleanUserGenericUserBasedRecommender(model, hood, similarity);
>
> Sean did recommend using Item-based recommender when the number of items is relatively low compared to the number of users, but we only have Boolean flavour of User-based recommender in svn for now.

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

Boolean* classes sped up things for me:

    UserSimilarity similarity = new BooleanTanimotoCoefficientSimilarity(model);
    hood = new NearestNUserNeighborhood(HOOD_SIZE, MIN_SIMILARITY, similarity, model);
    recommender = new BooleanUserGenericUserBasedRecommender(model, hood, similarity);

Sean did recommend using Item-based recommender when the number of items is relatively low compared to the number of users, but we only have Boolean flavour of User-based recommender in svn for now.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Sean Owen <sr...@gmail.com>
> To: mahout-dev@lucene.apache.org
> Sent: Tuesday, January 20, 2009 7:26:07 AM
> Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
> 
> Yes log-likelihood is already implemented as well as a similarity
> metric, no problem.
> 
> What I need to do is cook up a quick DataModel that reads your file
> format. Is it really like this: user ID followed by item IDs that are
> associated? how can I tell when a line specifies the opposite, item
> followed by user IDs? the former is easier, BTW.
> 
> [1234: 324, 555, 333]
> 
> Then the code to build a recommender is basically:
> 
> DataModel model = ...; // the custom DataModel I'll make
> ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
> // can also try ... = new TanimotoCoefficientSimilarity(model);
> similarity = new CachingItemSimilarity(similarity); // for speed
> Recommender recommender = new GenericItemBasedRecommender(model, similarity);
> List<RecommendedItem> recommendations = recommender.recommend("1234", 10);
> 
> Your data set is big but not so large that it couldn't fit in memory,
> I think. For now I think this easily runs on a reasonably beefy
> machine -- it's going to need a lot of RAM to cache lots of item-item
> similarities or else that computation will slow things down a lot.
> 
> Easy enough to try and see how it flies.
> 
> Sean
> 
> On Tue, Jan 20, 2009 at 5:30 AM, Goel, Ankur wrote:
> > Hi Sean,
> >        Thanks for helping out here. The data can be assumed to be in
> > either form mentioned below, since both forms are interchangeable:-
> >
> > [userA: item1, item2, item3 ... ]
> > OR
> > [item1: userA, userB, userC ..]
> >
> > Each user and each item is assigned a unique identifier. The ratings can
> > be considered as binary 1 if user clicked on an item and 0 otherwise.
> > Thing to note here is that in case of 0 the item does not exist in the
> > user history. So what we have essentially is a sparse representation
> > where 0's are not stored at all.
> >
> > As for which one is more (user/item) from the dataset we have relatively
> > high number of users and fewer items. There are around 200 - 300 thousand
> > unique items but expected to grow to 1 - 2 million. So I think an
> > item-based recommender sounds like something we can try out.
> >
> > About Tanimoto measure, I thought of using it in hierarchical clustering
> > but Ted suggested it might not solve the purpose. He suggested that we
> > can try computing the log-likelihood of co-occurrence of items.
> >
> > I would like to try out both the item based recommender you suggested
> > and also the log-likelihood approach. Do we have the map-red version of
> > log-likelihood code in Mahout?
> >
> > Ted, any thoughts?
> >
> > Regards
> > -Ankur


Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Yes log-likelihood is already implemented as well as a similarity
metric, no problem.

What I need to do is cook up a quick DataModel that reads your file
format. Is it really like this: user ID followed by item IDs that are
associated? how can I tell when a line specifies the opposite, item
followed by user IDs? the former is easier, BTW.

[1234: 324, 555, 333]

Then the code to build a recommender is basically:

DataModel model = ...; // the custom DataModel I'll make
ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
// can also try ... = new TanimotoCoefficientSimilarity(model);
similarity = new CachingItemSimilarity(similarity); // for speed
Recommender recommender = new GenericItemBasedRecommender(model, similarity);
List<RecommendedItem> recommendations = recommender.recommend("1234", 10);

Your data set is big but not so large that it couldn't fit in memory,
I think. For now I think this easily runs on a reasonably beefy
machine -- it's going to need a lot of RAM to cache lots of item-item
similarities or else that computation will slow things down a lot.

Easy enough to try and see how it flies.

Sean

On Tue, Jan 20, 2009 at 5:30 AM, Goel, Ankur <an...@corp.aol.com> wrote:
> Hi Sean,
>        Thanks for helping out here. The data can be assumed to be in
> either form mentioned below, since both forms are interchangeable:-
>
> [userA: item1, item2, item3 ... ]
> OR
> [item1: userA, userB, userC ..]
>
> Each user and each item is assigned a unique identifier. The ratings can
> be considered as binary 1 if user clicked on an item and 0 otherwise.
> Thing to note here is that in case of 0 the item does not exist in the
> user history. So what we have essentially is a sparse representation
> where 0's are not stored at all.
>
> As for which one is more (user/item) from the dataset we have relatively
> high number of users and fewer items. There are around 200 - 300 thousand
> unique items but expected to grow to 1 - 2 million. So I think an
> item-based recommender sounds like something we can try out.
>
> About Tanimoto measure, I thought of using it in hierarchical clustering
> but Ted suggested it might not solve the purpose. He suggested that we
> can try computing the log-likelihood of co-occurrence of items.
>
> I would like to try out both the item based recommender you suggested
> and also the log-likelihood approach. Do we have the map-red version of
> log-likelihood code in Mahout?
>
> Ted, any thoughts?
>
> Regards
> -Ankur

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Which methods are you referring to specifically? Are they available in
Mahout? If not then perhaps we can think about including them.

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Tuesday, January 20, 2009 12:19 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Item based recommenders can be very effective even with very large
numbers of items, especially if the distribution of usage is a
long-tailed one so that you get significant amounts of co-occurrence.

Ultimately, the sparsity will hurt you, however.  Then you have to move
to a latent variable formulation.  That provides smoothing so that you
can reason from one item to another even without any explicit
co-occurrence.  LSI was a very fashionable choice for this at one time,
but there are much better methods available now.

On Mon, Jan 19, 2009 at 5:31 AM, Sean Owen <sr...@gmail.com> wrote:

>
> Do you have relatively lots of users, or lots of items? If you have
> relatively few items, an item-based recommender is ideal -- and vice
> versa with user-based recommenders.




-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
Item based recommenders can be very effective even with very large numbers
of items, especially if the distribution of usage is a long-tailed one so
that you get significant amounts of co-occurrence.

Ultimately, the sparsity will hurt you, however.  Then you have to move to a
latent variable formulation.  That provides smoothing so that you can reason
from one item to another even without any explicit co-occurrence.  LSI was a
very fashionable choice for this at one time, but there are much better
methods available now.

On Mon, Jan 19, 2009 at 5:31 AM, Sean Owen <sr...@gmail.com> wrote:

>
> Do you have relatively lots of users, or lots of items? If you have
> relatively few items, an item-based recommender is ideal -- and vice
> versa with user-based recommenders.




-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Hi Ted,
        I think it will be interesting to have such code in Mahout.
The first part of it would be to compute the matrix of co-occurrence
counts and the next part would be to compute the log-likelihood scores.

Should we open a JIRA wherein you can provide a worked example and we
can take it from there?

Thanks
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Tuesday, January 20, 2009 12:16 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

I can supply code for computing the measure itself, but not for the
map-reduce computation of the counts involved.

In my experience, this only requires about 10-15 lines of pig but rather a
larger amount of native map-reduce code.  At Veoh, we used this and other
mechanisms to reduce very large amounts of data (7 months at billions of
events per month) into a form usable for recommendation.  Even with a
relatively small cluster, this is not an extremely long computation.

The four inputs to the log-likelihood ratio test for independence are all
counts.  For item A and item B, the necessary counts are the number of
users who interacted with both item A and item B, the number of users who
interacted with A but not B, with B but not A, and the number of users who
interacted with neither item.  To minimize issues with click spam it is
customary to count only one interaction per user, so all of the counts can
be considered a count of users rather than events.

If you view your set of histories as a binary matrix H containing rows
that correspond to users and columns that correspond to items, then H' H is
the matrix of co-occurrence counts for all possible A's and B's.  Columns of
H' H provide the information needed to get the A-not-B and B-not-A counts,
and the total of the matrix gives the information for the not-A-not-B
counts.

This matrix multiplication is, in fact, the same as a join.

I have a blog posting on the subject of computing log-likelihood ratios
here:

 http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

If need be, I can add a worked example of how to compute co-occurrence
using map-reduce.


On Mon, Jan 19, 2009 at 9:30 PM, Goel, Ankur <an...@corp.aol.com> wrote:

> About Tanimoto measure, I thought of using it in hierarchical clustering
> but Ted suggested it might not solve the purpose. He suggested that we
> can try computing the log-likelihood of co-occurrence of items.
>
> I would like to try out both the item based recommender you suggested
> and also the log-likelihood approach. Do we have the map-red version of
> log-likelihood code in Mahout?
>
> Ted, any thoughts?
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
I can supply code for computing the measure itself, but not for the
map-reduce computation of the counts involved.

In my experience, this only requires about 10-15 lines of pig but rather a
larger amount of native map-reduce code.  At Veoh, we used this and other
mechanisms to reduce very large amounts of data (7 months at billions of
events per month) into a form usable for recommendation.  Even with a
relatively small cluster, this is not an extremely long computation.

The four inputs to the log-likelihood ratio test for independence are all
counts.  For item A and item B, the necessary counts are the number of
users who interacted with both item A and item B, the number of users who
interacted with A but not B, with B but not A, and the number of users who
interacted with neither item.  To minimize issues with click spam it is
customary to count only one interaction per user, so all of the counts can
be considered a count of users rather than events.

If you view your set of histories as a binary matrix H containing rows
that correspond to users and columns that correspond to items, then H' H is
the matrix of co-occurrence counts for all possible A's and B's.  Columns of
H' H provide the information needed to get the A-not-B and B-not-A counts,
and the total of the matrix gives the information for the not-A-not-B
counts.

This matrix multiplication is, in fact, the same as a join.

I have a blog posting on the subject of computing log-likelihood ratios
here:

 http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

If need be, I can add a worked example of how to compute co-occurrence using
map-reduce.
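
For reference, the measure itself is small once the four counts exist. A sketch of the G^2 statistic described in the blog post above; the names are illustrative, not Mahout API:

/** Log-likelihood ratio test for independence of a 2x2 contingency
    table: k11 = users who saw both A and B, k12 = A but not B,
    k21 = B but not A, k22 = neither. */
public final class LogLikelihoodRatio {

  public static double llr(long k11, long k12, long k21, long k22) {
    long n = k11 + k12 + k21 + k22;
    // 2 * sum over cells of k * ln(k * N / (rowSum * colSum))
    return 2.0 * (term(k11, k11 + k12, k11 + k21, n)
                + term(k12, k11 + k12, k12 + k22, n)
                + term(k21, k21 + k22, k11 + k21, n)
                + term(k22, k21 + k22, k12 + k22, n));
  }

  private static double term(long k, long rowSum, long colSum, long n) {
    if (k == 0) {
      return 0.0;  // x ln x -> 0 as x -> 0
    }
    return k * Math.log((double) k * n / ((double) rowSum * colSum));
  }
}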


On Mon, Jan 19, 2009 at 9:30 PM, Goel, Ankur <an...@corp.aol.com> wrote:

> About Tanimoto measure, I thought of using it in hierarchical clustering
> but Ted suggested it might not solve the purpose. He suggested that we
> can try computing the log-likelihood of co-occurrence of items.
>
> I would like to try out both the item based recommender you suggested
> and also the log-likelihood approach. Do we have the map-red version of
> log-likelihood code in Mahout?
>
> Ted, any thoughts?
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Hi Sean,
        Thanks for helping out here. The data can be assumed to be in
either form mentioned below, since both forms are interchangeable:-

[userA: item1, item2, item3 ... ]
OR
[item1: userA, userB, userC ..]

Each user and each item is assigned a unique identifier. The ratings can
be considered as binary 1 if user clicked on an item and 0 otherwise.
Thing to note here is that in case of 0 the item does not exist in the
user history. So what we have essentially is a sparse representation
where 0's are not stored at all.

As for which one is more (user/item) from the dataset we have relatively
high number of users and fewer items. There are around 200 - 300 thousand
unique items but expected to grow to 1 - 2 million. So I think an
item-based recommender sounds like something we can try out.

About Tanimoto measure, I thought of using it in hierarchical clustering
but Ted suggested it might not solve the purpose. He suggested that we
can try computing the log-likelihood of co-occurrence of items. 

I would like to try out both the item based recommender you suggested
and also the log-likelihood approach. Do we have the map-red version of
log-likelihood code in Mahout?

Ted, any thoughts?

Regards
-Ankur

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Monday, January 19, 2009 7:02 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Binary meaning you just have a "yes, the user likes or has seen or
bought the book" versus no relation at all? Yeah that should be a
special case of what CF normally works on, where you have some degree
of preference versus yes/no. In that sense yes it is supported and in
theory should be a lot faster -- in practice it's only easy to gain a
little speedup since to really re-orient the algorithms to take
advantage of this case would take a lot of change.

Thinking it through... I am not sure slope one would work in this
case. It operates on relative differences in ratings across items, and
if all your ratings are "1.0" if they exist at all, then it falls
apart.

So perhaps the other algorithms are a better place to start after all.
The binary case does allow you to use fast similarity metrics like the
Tanimoto measure, and if you have a fast similarity metric you
generally have a fast algorithm since most algorithms rely heavily on
computing similarity metrics.

Do you have relatively lots of users, or lots of items? If you have
relatively few items, an item-based recommender is ideal -- and vice
versa with user-based recommenders.

How is that sounding? What form is your data in? I could send over
rough draft code to try out.

On Mon, Jan 19, 2009 at 12:37 PM, Goel, Ankur <an...@corp.aol.com> wrote:
> Yep! I actually want to recommend items of interest, where item depends
> on the context; say for an online bookshop it is books. A few questions
> regarding slope one:
> 1. Can it be applied to a binary data setting like mine?
> 2. Do we have an implementation for it in Mahout?
> 3. Will it scale well?

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Binary meaning you just have a "yes, the user likes or has seen or
bought the book" versus no relation at all? Yeah that should be a
special case of what CF normally works on, where you have some degree
of preference versus yes/no. In that sense yes it is supported and in
theory should be a lot faster -- in practice it's only easy to gain a
little speedup since to really re-orient the algorithms to take
advantage of this case would take a lot of change.

Thinking it through... I am not sure slope one would work in this
case. It operates on relative differences in ratings across items, and
if all your ratings are "1.0" if they exist at all, then it falls
apart.

So perhaps the other algorithms are a better place to start after all.
The binary case does allow you to use fast similarity metrics like the
Tanimoto measure, and if you have a fast similarity metric you
generally have a fast algorithm since most algorithms rely heavily on
computing similarity metrics.

Do you have relatively lots of users, or lots of items? If you have
relatively few items, an item-based recommender is ideal -- and vice
versa with user-based recommenders.

How is that sounding? What form is your data in? I could send over
rough draft code to try out.
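
To see the slope-one point concretely: on all-1.0 data every pairwise rating difference averages to zero, so every prediction collapses to 1.0. A toy sketch, not Mahout code:

/** Toy slope-one core: predict item 0 from item 1 via the average
    difference diff = avg(r_0 - r_1) over co-raters.  With all-1.0
    "boolean" data the diff is 0 and the prediction is always 1.0. */
public class SlopeOneToy {

  public static void main(String[] args) {
    double[][] r = {{1.0, 1.0}, {1.0, 1.0}};   // 2 users x 2 items, all clicks
    double diffSum = 0.0;
    for (double[] user : r) {
      diffSum += user[0] - user[1];
    }
    double diff = diffSum / r.length;          // avg(r_0 - r_1) == 0.0
    double prediction = r[0][1] + diff;        // always 1.0: no ranking signal
    System.out.println(diff + " -> " + prediction);
  }
}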

On Mon, Jan 19, 2009 at 12:37 PM, Goel, Ankur <an...@corp.aol.com> wrote:
> Yep! I actually want to recommend items of interest, where item depends
> on the context; say for an online bookshop it is books. A few questions
> regarding slope one:
> 1. Can it be applied to a binary data setting like mine?
> 2. Do we have an implementation for it in Mahout?
> 3. Will it scale well?

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Yep! I actually want to recommend items of interest, where item depends
on the context; say for an online bookshop it is books. A few questions
regarding slope one:
1. Can it be applied to a binary data setting like mine?
2. Do we have an implementation for it in Mahout?
3. Will it scale well?

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Monday, January 19, 2009 5:36 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Oh, so you are really recommending things like books, rather than URLs
-- URLs don't have anything directly to do with it? Well then this is
indeed a straightforward CF problem.

My favorite CF algorithm at the moment is slope-one -- fast, good
recommendations, and fairly resilient to noise.

On Mon, Jan 19, 2009 at 11:44 AM, Goel, Ankur <an...@corp.aol.com>
wrote:
> Not all URLs represent unique items / entities of interest. E.g. a
> lot of URLs would be just site-specific search/listing pages or pages
> that have a lot of navigational information but do not actually
> represent an entity or item of interest.
>
> Given such a page we do not want to recommend links to items already on
> the page but items that were far ahead (listing page 3, 4) and were also
> liked most by the users on the site.
>
> Also for a URL that does represent a unique entity (e.g. a book on
> Amazon), we do not want to recommend other search/listing/navigational
> pages but pages with actual items that people have liked w.r.t. the
> current page.
>
> The intent is to gauge the relative popularity or model the
> co-occurrence of items with respect to each other and also remove the
> anomalies.
>
> Let's say A = book1, C = listing-page, B = book2, D = book3
>
> So if we have patterns like A-C-B, B-C-D-A, A-C-D-B, then A and B can be
> both recommended for each other, given that one does not have the link
> for the other already on the page.
>
> Whether or not a Markov chain will work I do not know, as I need to read
> about Markov chains and find out.
>
> As for log-likelihood ratio tests, that sounds like a reasonable
> candidate but I am a bit worried about scalability.
>
> Ted, what's your thought on this?
>
> Thanks
> -Ankur

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Oh, so you are really recommending things like books, rather than URLs
-- URLs don't have anything directly to do with it? Well then this is
indeed a straightforward CF problem.

My favorite CF algorithm at the moment is slope-one -- fast, good
recommendations, and fairly resilient to noise.

On Mon, Jan 19, 2009 at 11:44 AM, Goel, Ankur <an...@corp.aol.com> wrote:
> Not all URLs represent unique items / entities of interest. E.g. a
> lot of URLs would be just site-specific search/listing pages or pages
> that have a lot of navigational information but do not actually
> represent an entity or item of interest.
>
> Given such a page we do not want to recommend links to items already on
> the page but items that were far ahead (listing page 3, 4) and were also
> liked most by the users on the site.
>
> Also for a URL that does represent a unique entity (e.g. a book on
> Amazon), we do not want to recommend other search/listing/navigational
> pages but pages with actual items that people have liked w.r.t. the
> current page.
>
> The intent is to gauge the relative popularity or model the
> co-occurrence of items with respect to each other and also remove the
> anomalies.
>
> Let's say A = book1, C = listing-page, B = book2, D = book3
>
> So if we have patterns like A-C-B, B-C-D-A, A-C-D-B, then A and B can be
> both recommended for each other, given that one does not have the link
> for the other already on the page.
>
> Whether or not a Markov chain will work I do not know, as I need to read
> about Markov chains and find out.
>
> As for log-likelihood ratio tests, that sounds like a reasonable
> candidate but I am a bit worried about scalability.
>
> Ted, what's your thought on this?
>
> Thanks
> -Ankur

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Not all URLs represent unique items / entities of interest. E.g. a
lot of URLs would be just site-specific search/listing pages or pages
that have a lot of navigational information but do not actually
represent an entity or item of interest.

Given such a page we do not want to recommend links to items already on
the page but items that were far ahead (listing page 3, 4) and were also
liked most by the users on the site.

Also for a URL that does represent a unique entity (e.g. a book on
Amazon), we do not want to recommend other search/listing/navigational
pages but pages with actual items that people have liked w.r.t. the
current page.
 
The intent is to gauge the relative popularity or model the
co-occurrence of items with respect to each other and also remove the
anomalies.

Let's say A = book1, C = listing-page, B = book2, D = book3

So if we have patterns like A-C-B, B-C-D-A, A-C-D-B, then A and B can be
both recommended for each other, given that one does not have the link
for the other already on the page.

Whether or not a Markov chain will work I do not know, as I need to read
about Markov chains and find out.

As for log-likelihood ratio tests, that sounds like a reasonable
candidate but I am a bit worried about scalability. 

Ted, what's your thought on this?

Thanks
-Ankur

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Monday, January 19, 2009 3:18 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Maybe a dumb question, but why subtract the links? I can only get from A
to B via a hyperlink (well, if I navigate directly to B, is the fact that
I was on A meaningful?)  Normalizing for transitions that correspond to a
link seems to do nothing. Maybe I do not understand the problem fully.

An A-C-B transition doesn't suggest that A should be recommended from B,
right, but, say, A-B-A would. My point was only that it is not always
symmetric of course, and so applying CF gets a little trickier since the
algorithms would assume symmetry.

Would a short Markov chain work and scale? For 3 elements, it needs
storage proportional to the cube of the average number of links per page.
I don't think CF will scale nearly as well here; it is not feeling like
quite the right tool for the job.

Sean

On 19 Jan 2009, 8:45 AM, "Goel, Ankur" <an...@corp.aol.com> wrote:



Ted / Sean,

The link structure should definitely be subtracted. From the original
dataset or from the recommended item-set is left to the implementation.
I think it will be easier to do this from the recommended item-set.

As for not recommending URLs in reverse order (B for A but not A for B,
given B appeared after A) one will have to keep track of his current
browsing history and remove those that user has already seen. Although
if user does reach B through some other link C then it does make sense
to recommend A.

Given the size of the data-set what kind of algorithm and keeping in
mind that it could grow in future what algorithms would you try out?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Sunday, January 18, 2009 2:06 AM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

> Predicting next URL is an i...

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Maybe a dumb question, but why subtract the links? I can only get from A to
B via a hyperlink (well, if I navigate directly to B, is the fact that I was
on A meaningful?)  Normalizing for transitions that correspond to a link
seems to do nothing. Maybe I do not understand the problem fully.

An A-C-B transition doesn't suggest that A should be recommended from B,
right, but, say, A-B-A would. My point was only that it is not always
symmetric of course, and so applying CF gets a little trickier since the
algorithms would assume symmetry.

Would a short Markov chain work and scale? For 3 elements, it needs storage
proportional to the cube of the average number of links per page. I don't
think CF will scale nearly as well here; it is not feeling like quite the
right tool for the job.

Sean
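
A toy order-1 version of the Markov idea; illustrative only, and an order-k chain would key on the last k URLs, which is where the storage cube above comes from:

import java.util.HashMap;
import java.util.Map;

/** Count transitions from the last URL and predict the most frequent
    successor -- an order-1 Markov chain over click sequences. */
public class NextUrlCounter {

  private final Map<String, Map<String, Integer>> counts =
      new HashMap<String, Map<String, Integer>>();

  /** Record one observed transition from -> to. */
  public void observe(String from, String to) {
    Map<String, Integer> next = counts.get(from);
    if (next == null) {
      next = new HashMap<String, Integer>();
      counts.put(from, next);
    }
    Integer c = next.get(to);
    next.put(to, c == null ? 1 : c + 1);
  }

  /** Most frequently observed successor of the given URL, or null. */
  public String predict(String from) {
    Map<String, Integer> next = counts.get(from);
    if (next == null) {
      return null;
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> e : next.entrySet()) {
      if (e.getValue() > bestCount) {
        bestCount = e.getValue();
        best = e.getKey();
      }
    }
    return best;
  }
}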

On 19 Jan 2009, 8:45 AM, "Goel, Ankur" <an...@corp.aol.com> wrote:



Ted / Sean,

The link structure should definitely be subtracted. From the original
dataset or from the recommended item-set is left to the implementation.
I think it will be easier to do this from the recommended item-set.

As for not recommending URLs in reverse order (B for A but not A for B,
given B appeared after A) one will have to keep track of his current
browsing history and remove those that user has already seen. Although
if user does reach B through some other link C then it does make sense
to recommend A.

Given the size of the data-set what kind of algorithm and keeping in
mind that it could grow in future what algorithms would you try out?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Sunday, January 18, 2009 2:06 AM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

> Predicting next URL is an i...

RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.

Ted / Sean,

The link structure should definitely be subtracted. From the original
dataset or from the recommended item-set is left to the implementation.
I think it will be easier to do this from the recommended item-set.

As for not recommending URLs in reverse order (B for A but not A for B,
given B appeared after A) one will have to keep track of his current
browsing history and remove those that user has already seen. Although
if user does reach B through some other link C then it does make sense
to recommend A.

Given the size of the data-set what kind of algorithm and keeping in
mind that it could grow in future what algorithms would you try out?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Sunday, January 18, 2009 2:06 AM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer


> Predicting next URL is an important RI problem and using an item-set
> predictor or an indicator-set predictor or a latent-variable predictor is
> likely to work reasonably well.  The asymmetry of the prediction is not
> particularly a problem since it captures important structural cues (web
> links are unidirectional).  It is important, however, to subtract away the
> link structure of the web pages before evaluating the system since
> suggesting that the user simply follow links that already exist is less
> than interesting.  As such, a raw Markov chain isn't likely to work well.

On Sat, Jan 17, 2009 at 5:24 AM, Sean Owen <sr...@gmail.com> wrote:

> But then again is this a CF problem? Sounds like Markov chains... given
> the last 1 or 2 or 3 URLs visited, which URL has been next, most often?
> I think that's relatively easy and fast, does that work?
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
Actually, collaborative filtering as originally stated, and often reiterated
in the form of predicting ratings, has very limited utility, but it is an
example of a more important super-class of algorithms that I call reflected
intelligence, where you use user behavior in a more or less simple way to
cause a system to behave in an apparently intelligent way.

The problem with ratings predictors is that you have posed some very
difficult problems for yourself from the outset.  These include getting
people to rate things at all (very doable on Netflix, almost impossible in
my experience on Veoh, Musicmatch and all the fraud systems I have built),
and you have to translate a ratings prediction into useful action (very,
very hard to do well in almost all cases).

Predicting next URL is an important RI problem and using an item-set
predictor or an indicator-set predictor or a latent-variable predictor is
likely to work reasonably well.  The asymmetry of the prediction is not
particularly a problem since it captures important structural cues (web
links are unidirectional).  It is important, however, to subtract away the
link structure of the web pages before evaluating the system since
suggesting that the user simply follow links that already exist is less than
interesting.  As such, a raw Markov chain isn't likely to work well.

On Sat, Jan 17, 2009 at 5:24 AM, Sean Owen <sr...@gmail.com> wrote:

> But then again is this a CF problem? Sounds like Markov chains... given the
> last 1 or 2 or 3 URLs visited, which URL has been next, most often? I think
> that's relatively easy and fast, does that work?
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
At first this indeed sounds like a CF problem. You can use clustering to
solve a CF problem (see for instance TreeClusteringRecommender).

But you could use other algorithms just as well - see any of the other
Recommenders. You don't have ratings for a URL, just a binary 'yes, visited'
or nothing. You can take advantage of that by using the 'Boolean*' classes
and the Tanimoto similarity metric.

This doesn't capture the fact that there is an ordering that is important -
URL A was clicked just before B so when I am on A we should recommend B (but
not necessarily the reverse). To capture this I think you want to try an
item-based recommender with an item-item similarity that captures this
relation. It won't be symmetric which messes up some other things - may need
more tweaking of existing code to get right.

But then again is this a CF problem? Sounds like Markov chains... given the
last 1 or 2 or 3 URLs visited, which URL has been next, most often? I think
that's relatively easy and fast, does that work?

As for the data, I would indeed consider throwing out data you believe is
just noise.

Sean


RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Ted / Karl, Thank you both for your comments and suggestions. Continuing
on the comments from Ted...

The end goal is definitely not clustering but rather recommendations.
This can be broken down into 2 separate tasks typical of a
recommendation engine.
1. Given a URL show other URLs people have liked.
2. Given a User session and the URL he is seeing, suggest other URLs he
might like.

I experimented a bit with clustering but couldn't get good
recommendations.
From your advice, log-likelihood ratio sounds like a potential solution
for the first one. I remember having a discussion with you and Sean a long
time back where you pointed to a useful paper:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962

Please pardon me if I am asking the question again, but do you think it's
a promising approach for problem 1? Do we have an implementation of it
in Mahout? If not, I can open a JIRA and work on it (given my other
work commitments allow enough time). Also, since I have no formal
statistics background, I am working on 'rebuilding' my statistics
knowledge so that I can grasp these concepts better.

As for the data rates, I really don't know what counts as low or high in
the context of these techniques, but what I have learnt after accumulating
weeks of data is that there are a few users who have good engagement
(sufficient clicks) over a period of time, a moderate number of users who
have a small number of clicks, and a large number of users who have very
few clicks and are just casual surfers.

Also regarding building a user model as a simple mixture, I am not sure
which one you are referring to. Is it the LDA JIRA that Jeff is working
on?

Once again thanks for all the help, much appreciated.

Regards
-Ankur

Re: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
That kind of clustering is not what would answer the important question for
me.

It sounds like the important question is one of two things:

a) what drives the viewing of the page (using external evidence only and not
estimating any kind of user model)

b) can we estimate some sort of mental state that leads to viewing (using
evidence to drive a user model which drives new evidence)

The first approach has the virtue of great simplicity.  To predict a view,
you can find anomalously co-occurrent urls (co-occurrent in the sense that
they were viewed by the same user).  Then you can use the anomalously
co-occurrent urls to build a viewing model, probably using something like
ridge logistic regression or simply using IDF weighting.  I don't think
that NaiveBayes is as good for these models as people think.  Anomalous
co-occurrence is not something well detected by Jaccard or cousins, but it
is well detected by log-likelihood ratio tests based on 2x2 contingency
tables.  Simply using anomalously co-occurrent URLs makes for a very good
recommendation engine if you have good data rates.
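
For reference, that test reduces to four counts per pair of urls.  A
minimal sketch of the G^2 statistic, written from the standard formula
(my code, not taken from Mahout):

public class LogLikelihoodRatio {

  // G^2 for a 2x2 contingency table of co-occurrence counts:
  // k11 = users who viewed both urls, k12 = A but not B,
  // k21 = B but not A, k22 = neither.
  public static double llr(long k11, long k12, long k21, long k22) {
    double sumKLogK = xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22);
    double sumRowLogRow = xLogX(k11 + k12) + xLogX(k21 + k22);
    double sumColLogCol = xLogX(k11 + k21) + xLogX(k12 + k22);
    double nLogN = xLogX(k11 + k12 + k21 + k22);
    // 2 * (sum k ln k - sum row ln row - sum col ln col + N ln N)
    return 2.0 * (sumKLogK - sumRowLogRow - sumColLogCol + nLogN);
  }

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }
}

Larger values flag pairs that co-occur anomalously often (or anomalously
rarely) relative to what independence would predict.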

The second approach has the virtue that you can define a nice metric for
urls.  First you take the likelihood that two urls will both be visited by a
particular user, then you average over the distribution of all users.  This
gives you an overall co-occurrence probability of two urls, the negative log
of which is likely to be informative if interpreted as a distance.  Building
the user model gives you the virtue of smoothing, which is where Jaccard
falls apart.  For low data rates, this is likely to give you the best
results.
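
Spelled out in symbols (my notation, not from the thread), that distance is

  d(u_1, u_2) = -\log \, \mathbb{E}_{\mathrm{user}}\!\left[ P(u_1 \mid \mathrm{user}) \, P(u_2 \mid \mathrm{user}) \right]

where the expectation runs over the (estimated) distribution of users.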

Neither of these approaches involves clustering because clustering is not
the goal.  If you really want the clustering as an end in itself, I think
you should restate the coordinate space before clustering.  A good way
to do that is to build a user model that is a simple mixture; then the
mixture coefficients for each document make good coordinates for normal
clustering algorithms (because dot products have such a natural
interpretation, which implies that Euclidean distance is a useful measure).
In fact, the dot product for the mixture model is exactly the "average over
all users" idea from above for the special case of a mixture model.  Other
user models might not be so simple.  Jeff Eastman has done a lot of work on
building such a mixture model estimator for Mahout.
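
One way to make the dot-product remark explicit (my notation, under the
assumption that each user is drawn from a single mixture component k with
weight \pi_k): setting \theta_{u,k} = \sqrt{\pi_k} \, p(u \mid k) gives

  \mathbb{E}_{\mathrm{user}}\!\left[ p(u_1 \mid \mathrm{user}) \, p(u_2 \mid \mathrm{user}) \right] = \sum_k \pi_k \, p(u_1 \mid k) \, p(u_2 \mid k) = \theta_{u_1} \cdot \theta_{u_2}

so the co-visit probability above is literally a dot product of per-url
coordinate vectors.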

I have had cases where clustering and model building were essentially unable
to interpret the data at all in the original data space, but were able to
extract enormous amounts of useful information when the data was restated in
terms of mixture model coefficients.


On Tue, Jan 13, 2009 at 4:33 AM, Ankur (JIRA) <ji...@apache.org> wrote:

> I would like to cluster them in a tree and use the model to answer the near
> neighborhood type queries i.e. what urls are related to what other urls. I
> did implement a sequential bottom-up hierarchical clustering algorithm but
> the complexity is too bad for my data-set. I then thought about implementing
> a top-down hierarchical clustering algorithm using Jaccard co-efficient as
> my distance measure and came across this patch.
>