Posted to dev@mahout.apache.org by "Karl Wettin (JIRA)" <ji...@apache.org> on 2008/03/30 08:11:24 UTC

[jira] Created: (MAHOUT-19) Hierarchial clusterer

Hierarchial clusterer
---------------------

                 Key: MAHOUT-19
                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
             Project: Mahout
          Issue Type: New Feature
          Components: Clustering
            Reporter: Karl Wettin
            Assignee: Karl Wettin
            Priority: Minor


In a hierarchical clusterer the instances are the leaf nodes of a tree in which each branch node contains the mean features of, and the distance between, its children.

For performance reasons I have always trained trees top-down. I have been told that this can cause various effects that I never encountered, and I believe Huffman solved his problem by training bottom-up. The thing is, I don't think it is possible to train the tree top-down using map reduce, but I do think it is possible to train it bottom-up. I would very much appreciate any thoughts on this.

Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search of a cluster.

Navigating the tree and gathering nodes that are not too far from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to gain from extracting all clusters from the start. Also, it usually makes sense to allow the user to modify the cluster boundary variables in real time using a slider, or perhaps to present a named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not too bad to use secondary classification on the instances to create wormholes in the tree. I always thought it would be cool to visualize it using Touchgraph.

My focus is on clustering text documents for an instant "more like this" feature in search engines, using Tanimoto similarity on the vector spaces to calculate the distance.
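
For concreteness, here is a minimal sketch of what that distance looks like on sparse term-weight vectors. It is only an illustration: the class and the map-based representation are made up here, not the LUCENE-1025 code.

import java.util.Map;

public final class TanimotoDistance {

  private TanimotoDistance() { }

  /** Tanimoto distance = 1 - (a.b) / (a.a + b.b - a.b), on sparse term -> weight maps. */
  public static double distance(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (Map.Entry<String, Double> entry : a.entrySet()) {
      normA += entry.getValue() * entry.getValue();
      Double other = b.get(entry.getKey());
      if (other != null) {
        dot += entry.getValue() * other;
      }
    }
    for (double value : b.values()) {
      normB += value * value;
    }
    double denominator = normA + normB - dot;
    return denominator == 0.0 ? 0.0 : 1.0 - (dot / denominator);
  }
}

For binary term vectors this reduces to 1 - |A intersect B| / |A union B|, which is the form used later in this thread for click data.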

See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a hierarchical clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.

Thanks Sean, I will try this one out on my dataset and keep the list
posted on how well it works.

Regards
-Ankur

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Friday, January 23, 2009 5:55 AM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Here's a DataModel you could try out for your purposes; the rest
should be as I described earlier.


package org.apache.mahout.cf.taste.example;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.BooleanPrefUser;
import org.apache.mahout.cf.taste.impl.common.FastSet;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.Item;
import org.apache.mahout.cf.taste.model.User;

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.ArrayList;

public final class AnkursDataModel extends FileDataModel {

  public AnkursDataModel(File ratingsFile) throws IOException {
    super(ratingsFile);
  }

  @Override
  protected void processLine(String line, Map<String, List<Preference>> data, Map<String, Item> itemCache) {
    String[] tokens = line.split("\t");
    String userID = tokens[0];
    List<Preference> prefs = new ArrayList<Preference>(tokens.length - 1);
    for (int tokenNum = 1; tokenNum < tokens.length; tokenNum++) {
      String itemID = tokens[tokenNum];
      Item item = itemCache.get(itemID);
      if (item == null) {
        item = buildItem(itemID);
        itemCache.put(itemID, item);
      }
      prefs.add(new GenericPreference(null, item, 1.0));
      // this is a little ugly but makes it easy to reuse FileDataModel -- pref values are tossed below
    }
    data.put(userID, prefs);
  }

  @Override
  protected User buildUser(String id, List<Preference> prefs) {
    FastSet<Object> itemIDs = new FastSet<Object>();
    for (Preference pref : prefs) {
      itemIDs.add(pref.getItem().getID());
    }
    return new BooleanPrefUser(id, itemIDs);
  }
}



On Wed, Jan 21, 2009 at 7:57 AM, Goel, Ankur <an...@corp.aol.com>
wrote:
> The input data format is typically
> User-id \t item-id \t (other information)
>
> From here it can be transformed into either of the formats as they are just
> 1 map-red away. After transformation the input data set will contain
> lines only in 1 format and not both. The data format that I use has each
> line of the form
>
> User-id \t (Item-id1:other_info) \t (Item-id2:other_info) ...
>
> As for co-occurrence counting the way Ted mentioned, I implemented a
> map-red implementation for the same and I have found it to be pretty
> efficient, simple and effective too.
>
> Couple of tricks like only keeping top-X co-occurred items for an item
> by count and emitting only those item pairs that match a certain
> criteria have worked very well.
>
> I would like to contribute it to Mahout and filed a JIRA for the same
> https://issues.apache.org/jira/browse/MAHOUT-103
>
> I will have a patch coming soon.
>
> What I am looking for is a complementary technique that does not depend
> so much on co-occurrences and tries to do some sort of latent variable
> analysis to answer my query.
>
> Thanks
> -Ankur

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Here's a DataModel you could try out for your purposes; the rest
should be as I described earlier.


package org.apache.mahout.cf.taste.example;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.BooleanPrefUser;
import org.apache.mahout.cf.taste.impl.common.FastSet;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.Item;
import org.apache.mahout.cf.taste.model.User;

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.ArrayList;

public final class AnkursDataModel extends FileDataModel {

  public AnkursDataModel(File ratingsFile) throws IOException {
    super(ratingsFile);
  }

  @Override
  protected void processLine(String line, Map<String, List<Preference>> data, Map<String, Item> itemCache) {
    String[] tokens = line.split("\t");
    String userID = tokens[0];
    List<Preference> prefs = new ArrayList<Preference>(tokens.length - 1);
    for (int tokenNum = 1; tokenNum < tokens.length; tokenNum++) {
      String itemID = tokens[tokenNum];
      Item item = itemCache.get(itemID);
      if (item == null) {
        item = buildItem(itemID);
        itemCache.put(itemID, item);
      }
      prefs.add(new GenericPreference(null, item, 1.0));
      // this is a little ugly but makes it easy to reuse FileDataModel -- pref values are tossed below
    }
    data.put(userID, prefs);
  }

  @Override
  protected User buildUser(String id, List<Preference> prefs) {
    FastSet<Object> itemIDs = new FastSet<Object>();
    for (Preference pref : prefs) {
      itemIDs.add(pref.getItem().getID());
    }
    return new BooleanPrefUser(id, itemIDs);
  }
}
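
Putting the two messages together, the wiring would presumably look roughly like the following fragment, reusing the snippet Sean gives elsewhere in this thread with this DataModel dropped in (imports omitted; the file path and user ID are made up):

DataModel model = new AnkursDataModel(new File("/path/to/user-item-clicks.txt"));
ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
// can also try ... = new TanimotoCoefficientSimilarity(model);
similarity = new CachingItemSimilarity(similarity); // for speed
Recommender recommender = new GenericItemBasedRecommender(model, similarity);
List<RecommendedItem> recommendations = recommender.recommend("someUserID", 10);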



On Wed, Jan 21, 2009 at 7:57 AM, Goel, Ankur <an...@corp.aol.com> wrote:
> The input data format is typically
> User-id \t item-id \t (other information)
>
> From here it can be transformed into either of the formats as they are just
> 1 map-red away. After transformation the input data set will contain
> lines only in 1 format and not both. The data format that I use has each
> line of the form
>
> User-id \t (Item-id1:other_info) \t (Item-id2:other_info) ...
>
> As for co-occurrence counting the way Ted mentioned, I implemented a
> map-red implementation for the same and I have found it to be pretty
> efficient, simple and effective too.
>
> Couple of tricks like only keeping top-X co-occurred items for an item
> by count and emitting only those item pairs that match a certain
> criteria have worked very well.
>
> I would like to contribute it to Mahout and filed a JIRA for the same
> https://issues.apache.org/jira/browse/MAHOUT-103
>
> I will have a patch coming soon.
>
> What I am looking for is a complementary technique that does not depend
> so much on co-occurrences and tries to do some sort of latent variable
> analysis to answer my query.
>
> Thanks
> -Ankur

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
The input data format is typically 
User-id \t item-id \t (other information)

From here it can be transformed into either of the formats as they are just
1 map-red away. After transformation the input data set will contain
lines only in 1 format and not both. The data format that I use has each
line of the form

User-id \t (Item-id1:other_info) \t (Item-id2:other_info) ...

As for co-occurrence counting the way Ted mentioned, I wrote a
map-red implementation for it and have found it to be pretty
efficient, simple and effective too.

A couple of tricks, like only keeping the top-X co-occurred items for an item
by count and emitting only those item pairs that match a certain
criterion, have worked very well.
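
For illustration, a small sketch of that top-X trimming trick, applied to one item's already-aggregated co-occurrence counts; the class name and structure are hypothetical, not the patch being described.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public final class TopXCooccurrence {

  /** Keep only the X most frequently co-occurring items for one item (returned in no particular order). */
  public static List<Map.Entry<String, Integer>> topX(Map<String, Integer> counts, int x) {
    PriorityQueue<Map.Entry<String, Integer>> queue =
        new PriorityQueue<Map.Entry<String, Integer>>(x + 1, new Comparator<Map.Entry<String, Integer>>() {
          public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
            return a.getValue().compareTo(b.getValue()); // smallest count at the head
          }
        });
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      queue.add(entry);
      if (queue.size() > x) {
        queue.poll(); // drop the current smallest once we exceed X
      }
    }
    return new ArrayList<Map.Entry<String, Integer>>(queue);
  }
}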

I would like to contribute it to Mahout and filed a JIRA for the same
https://issues.apache.org/jira/browse/MAHOUT-103

I will have a patch coming soon.

What I am looking for is a complementary technique that does not depend
so much on co-occurrences and tries to do some sort of latent variable
analysis to answer my query.

Thanks
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Tuesday, January 20, 2009 11:33 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Surprisingly, both forms are about the same in difficulty for co-occurrence.
Firstly, both forms are a single map-reduce apart.  Secondly, both forms are
likely the output of a log analysis where the input form is actually more
likely user/item pairs.  From that form, co-occurrence counting is most
easily done by reducing on user, emitting all pairs of items and then
counting in traditional wise.

But with very large data sets, even before doing the actual co-occurrence,
it is commonly advisable to reduce to item-major form and down-sample the
users associated with the most common items.  This is similar to the row and
column normalization done in singular value techniques, but is applied to
the original data.

Map-reduce is pretty impressive though; sampling is not necessary except for
the largest data sets on the smallest clusters.

The biggest surprise I have had in using this sort of data reduction is that
simply emitting all of the item pairs is pretty danged efficient.  There are
clever things to do to avoid so much data motion, but they save surprisingly
little and are much more complex to implement (correctly).

On Tue, Jan 20, 2009 at 4:26 AM, Sean Owen <sr...@gmail.com> wrote:

> how can I tell when a line specifies the opposite, item
> followed by user IDs? the former is easier, BTW.
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
Surprisingly, both forms are about the same in difficulty for co-occurrence.
Firstly, both forms are a single map-reduce apart.  Secondly, both forms are
likely the output of a log analysis where the input form is actually more
likely user/item pairs.  From that form, co-occurrence counting is most
easily done by reducing on user, emitting all pairs of items and then
counting in traditional wise.

But with very large data sets, even before doing the actual co-occurrence,
it is commonly advisable to reduce to item-major form and down-sample the
users associated with the most common items.  This is similar to the row and
column normalization done in singular value techniques, but is applied to
the original data.

Map-reduce is pretty impressive though; sampling is not necessary except for
the largest data sets on the smallest clusters.

The biggest surprise I have had in using this sort of data reduction is that
simply emitting all of the item pairs is pretty danged efficient.  There are
clever things to do to avoid so much data motion, but they save surprisingly
little and are much more complex to implement (correctly).
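
To make that counting step concrete, here is a rough Hadoop sketch of the pair-emitting approach described above. It assumes the input has already been transformed to user-major lines of the form user \t item1 \t item2 ...; the class names are hypothetical, and a real job would still need a driver plus the down-sampling mentioned above.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public final class CooccurrenceCounting {

  /** Emits every unordered pair of items seen in one user's history, with a count of 1. */
  public static class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] tokens = value.toString().split("\t");
      // tokens[0] is the user ID; the remaining tokens are the items that user touched
      for (int i = 1; i < tokens.length; i++) {
        for (int j = i + 1; j < tokens.length; j++) {
          String a = tokens[i];
          String b = tokens[j];
          if (a.compareTo(b) > 0) {
            String tmp = a; a = b; b = tmp; // canonical order so (A,B) and (B,A) count together
          }
          context.write(new Text(a + '\t' + b), ONE);
        }
      }
    }
  }

  /** Sums the 1s to produce the co-occurrence count for each item pair. */
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}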

On Tue, Jan 20, 2009 at 4:26 AM, Sean Owen <sr...@gmail.com> wrote:

> how can I tell when a line specifies the opposite, item
> followed by user IDs? the former is easier, BTW.
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Continuing talking to myself here -- I reverse myself again. After
thinking about this a bit more... the "boolean" preference case
doesn't need special attention in the context of item-based
recommenders or similarity metrics. It doesn't change the
implementation.

So Otis I think you could try item-based recommenders in your case, in
the way I suggested above.

On Tue, Jan 20, 2009 at 10:09 PM, Sean Owen <sr...@gmail.com> wrote:
> Er yeah Otis reminds me that I really actually need to make a few more
> classes to support item-based recommendation in this case. The code
> snippet I gave will work but will be better after I get to writing in
> a few more changes to actually support item-based recommenders here.
>
> On Tue, Jan 20, 2009 at 7:37 PM, Otis Gospodnetic
> <ot...@yahoo.com> wrote:
>> Hello,
>>
>> Boolean* classes sped up things for me:
>>
>>    UserSimilarity similarity = new BooleanTanimotoCoefficientSimilarity(model);
>>    hood = new NearestNUserNeighborhood(HOOD_SIZE, MIN_SIMILARITY, similarity, model);
>>    recommender = new BooleanUserGenericUserBasedRecommender(model, hood, similarity);
>>
>> Sean did recommend using Item-based recommender when the number of items is relatively low compared to the number of users, but we only have Boolean flavour of User-based recommender in svn for now.
>

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
User based recommendations are very close to item based recommendations.

If you view all of your user history as a sparse matrix H that has users x
items, then the users who have similar histories to a particular history x
can be computed by (roughly) H x.

Then if you want to look at what items those users have interacted with, you
multiply again to get H' (H x).

Because multiplication is associative, this is the same as (H' H) x which is
just item level recommendations.

Now, it is true that H shouldn't really be exactly the user histories
(sampling and scaling will intervene) and it is true that H x isn't really
quite multiplication (you probably will want to trim the number of non-zero
elements, for example) and it is true that because of all these imprecisions
that H' (H x) isn't quite the same as (H' H) x.  It is pretty darned close,
however, and the pattern of computation is substantially the same no matter
what you do.

If you look at the cost you naturally start by dividing the cost into
on-line and off-line components.  The on-line cost is what it costs you to
do each recommendation.  For user based recommendation, this consists of two
matrix-vector multiplications.  For item based recommendation, this consists
of one matrix vector multiplication.  The off-line costs are 0 and one
matrix-matrix product respectively.  Since off-line computation is vastly
cheaper than on-line computation, it usually behooves one to move
computation off-line.  Also, note that the user based recommendation
involves multiplication by H' and by H.  Unless you keep both versions in
memory, you will pay a terrible cost.  That makes user based recs even more
expensive than they first appear.

Latent variable techniques change the structure of H to be something like a
product decomposition.  For SVD, we have H = U S V' where S is a small
diagonal matrix and U' U = I and V' V = I.  This means that H' H = V S^2 V'
which can make computing H' H x more or less expensive depending on
implementation, but, more importantly, it smooths H' H substantially so that
elements that would have been zero are not zero (in a good way).  Other
latent variable techniques like LDA use different terminology and may not
actually do multiplication, but the pattern of computation is still the same
and many of the conclusions still hold at least approximately.

Almost all of the work in recommendations centers around small changes in
exactly what the matrix multiplication really means and how many non-zero
elements are retained.  Looking at the problem in a unified way helps
understand how to make trade-offs in the architecture.
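
To make the pattern of computation concrete, here is a tiny in-memory sketch of the (H' H) x view: given precomputed co-occurrence rows (H' H) and a user's binary history x, candidate items are scored by summing co-occurrence counts over the items in the history. The names and types are illustrative only, not a Mahout API.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public final class CooccurrenceScorer {

  /**
   * cooccurrence maps each item to its row of (H' H): co-occurring item -> count.
   * history is the user's binary vector x, represented as the set of items touched.
   */
  public static Map<String, Integer> score(Map<String, Map<String, Integer>> cooccurrence,
                                            Set<String> history) {
    Map<String, Integer> scores = new HashMap<String, Integer>();
    for (String item : history) {
      Map<String, Integer> row = cooccurrence.get(item);
      if (row == null) {
        continue;
      }
      for (Map.Entry<String, Integer> entry : row.entrySet()) {
        String candidate = entry.getKey();
        if (history.contains(candidate)) {
          continue; // don't recommend items the user already has
        }
        Integer current = scores.get(candidate);
        scores.put(candidate, (current == null ? 0 : current) + entry.getValue());
      }
    }
    return scores; // sort by value and take the top N to get recommendations
  }
}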

On Tue, Jan 20, 2009 at 11:37 AM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hello,
>
> Boolean* classes sped up things for me:
>
>    UserSimilarity similarity = new BooleanTanimotoCoefficientSimilarity(model);
>    hood = new NearestNUserNeighborhood(HOOD_SIZE, MIN_SIMILARITY, similarity, model);
>    recommender = new BooleanUserGenericUserBasedRecommender(model, hood, similarity);
>
> Sean did recommend using Item-based recommender when the number of items is
> relatively low compared to the number of users, but we only have Boolean
> flavour of User-based recommender in svn for now.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: Sean Owen <sr...@gmail.com>
> > To: mahout-dev@lucene.apache.org
> > Sent: Tuesday, January 20, 2009 7:26:07 AM
> > Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
> >
> > Yes log-likelihood is already implemented as well as a similarity
> > metric, no problem.
> >
> > What I need to do is cook up a quick DataModel that reads your file
> > format. Is it really like this: user ID followed by item IDs that are
> > associated? how can I tell when a line specifies the opposite, item
> > followed by user IDs? the former is easier, BTW.
> >
> > [1234: 324, 555, 333]
> >
> > Then the code to build a recommender is basically:
> >
> > DataModel model = ...; // the custom DataModel I'll make
> > ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
> > // can also try ... = new TanimotoCoefficientSimilarity(model);
> > similarity = new CachingItemSimilarity(similarity); // for speed
> > Recommender recommender = new GenericItemBasedRecommender(model, similarity);
> > List<RecommendedItem> recommendations = recommender.recommend("1234", 10);
> >
> > Your data set is big but not so large that it couldn't fit in memory,
> > I think. For now I think this easily runs on a reasonably beefy
> > machine -- it's going to need a lot of RAM to cache lots of item-item
> > similarities or else that computation will slow things down a lot.
> >
> > Easy enough to try and see how it flies.
> >
> > Sean
> >
> > On Tue, Jan 20, 2009 at 5:30 AM, Goel, Ankur wrote:
> > > Hi Sean,
> > >        Thanks for helping out here. The data can be assumed to be in
> > > either form mentioned below, since both forms are interchangeable:-
> > >
> > > [userA: item1, item2, item3 ... ]
> > > OR
> > > [item1: userA, userB, userC ..]
> > >
> > > Each user and each item is assigned a unique identifier. The ratings can
> > > be considered as binary 1 if user clicked on an item and 0 otherwise.
> > > Thing to note here is that in case of 0 the item does not exist in the
> > > user history. So what we have essentially is a sparse representation
> > > where 0's are not stored at all.
> > >
> > > As for which one is more (user/item) from the dataset we have relatively
> > > high number of users and less items. There are around 200 - 300 thousand
> > > unique items but expected to grow to 1 - 2 million. So I think item
> > > based recommender sounds like something we can try out.
> > >
> > > About Tanimoto measure, I thought of using it in hierarchical clustering
> > > but Ted suggested it might not solve the purpose. He suggested that we
> > > can try computing the log-likelihood of co-occurrence of items.
> > >
> > > I would like to try out both the item based recommender you suggested
> > > and also the log-likelihood approach. Do we have the map-red version of
> > > log-likelihood code in Mahout?
> > >
> > > Ted, any thoughts?
> > >
> > > Regards
> > > -Ankur
>
>


-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Er yeah Otis reminds me that I really actually need to make a few more
classes to support item-based recommendation in this case. The code
snippet I gave will work but will be better after I get to writing in
a few more changes to actually support item-based recommenders here.

On Tue, Jan 20, 2009 at 7:37 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Hello,
>
> Boolean* classes sped up things for me:
>
>    UserSimilarity similarity = new BooleanTanimotoCoefficientSimilarity(model);
>    hood = new NearestNUserNeighborhood(HOOD_SIZE, MIN_SIMILARITY, similarity, model);
>    recommender = new BooleanUserGenericUserBasedRecommender(model, hood, similarity);
>
> Sean did recommend using Item-based recommender when the number of items is relatively low compared to the number of users, but we only have Boolean flavour of User-based recommender in svn for now.

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

Boolean* classes sped up things for me:

    UserSimilarity similarity = new BooleanTanimotoCoefficientSimilarity(model);
    hood = new NearestNUserNeighborhood(HOOD_SIZE, MIN_SIMILARITY, similarity, model);
    recommender = new BooleanUserGenericUserBasedRecommender(model, hood, similarity);

Sean did recommend using Item-based recommender when the number of items is relatively low compared to the number of users, but we only have Boolean flavour of User-based recommender in svn for now.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Sean Owen <sr...@gmail.com>
> To: mahout-dev@lucene.apache.org
> Sent: Tuesday, January 20, 2009 7:26:07 AM
> Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
> 
> Yes log-likelihood is already implemented as well as a similarity
> metric, no problem.
> 
> What I need to do is cook up a quick DataModel that reads your file
> format. Is it really like this: user ID followed by item IDs that are
> associated? how can I tell when a line specifies the opposite, item
> followed by user IDs? the former is easier, BTW.
> 
> [1234: 324, 555, 333]
> 
> Then the code to build a recommender is basically:
> 
> DataModel model = ...; // the custom DataModel I'll make
> ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
> // can also try ... = new TanimotoCoefficientSimilarity(model);
> similarity = new CachingItemSimilarity(similarity); // for speed
> Recommender recommender = new GenericItemBasedRecommender(model, similarity);
> List<RecommendedItem> recommendations = recommender.recommend("1234", 10);
> 
> Your data set is big but not so large that it couldn't fit in memory,
> I think. For now I think this easily runs on a reasonably beefy
> machine -- it's going to need a lot of RAM to cache lots of item-item
> similarities or else that computation will slow things down a lot.
> 
> Easy enough to try and see how it flies.
> 
> Sean
> 
> On Tue, Jan 20, 2009 at 5:30 AM, Goel, Ankur wrote:
> > Hi Sean,
> >        Thanks for helping out here. The data can be assumed to be in
> > either form mentioned below, since both forms are interchangeable:-
> >
> > [userA: item1, item2, item3 ... ]
> > OR
> > [item1: userA, userB, userC ..]
> >
> > Each user and each item is assigned a unique identifier. The ratings can
> > be considered as binary 1 if user clicked on an item and 0 otherwise.
> > Thing to note here is that in case of 0 the item does not exist in the
> > user history. So what we have essentially is a sparse representation
> > where 0's are not stored at all.
> >
> > As for which one is more (user/item) from the dataset we have relatively
> > high number of users and less items. There are around 200 - 300 thousand
> > unique items but expected to grow to 1 - 2 million. So I think item
> > based recommender sounds like something we can try out.
> >
> > About Tanimoto measure, I thought of using it in hierarchical clustering
> > but Ted suggested it might not solve the purpose. He suggested that we
> > can try computing the log-likelihood of co-occurrence of items.
> >
> > I would like to try out both the item based recommender you suggested
> > and also the log-likelihood approach. Do we have the map-red version of
> > log-likelihood code in Mahout?
> >
> > Ted, any thoughts?
> >
> > Regards
> > -Ankur


Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Yes log-likelihood is already implemented as well as a similarity
metric, no problem.

What I need to do is cook up a quick DataModel that reads your file
format. Is it really like this: user ID followed by item IDs that are
associated? how can I tell when a line specifies the opposite, item
followed by user IDs? the former is easier, BTW.

[1234: 324, 555, 333]

Then the code to build a recommender is basically:

DataModel model = ...; // the custom DataModel I'll make
ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
// can also try ... = new TanimotoCoefficientSimilarity(model);
similarity = new CachingItemSimilarity(similarity); // for speed
Recommender recommender = new GenericItemBasedRecommender(model, similarity);
List<RecommendedItem> recommendations = recommender.recommend("1234", 10);

Your data set is big but not so large that it couldn't fit in memory,
I think. For now I think this easily runs on a reasonably beefy
machine -- it's going to need a lot of RAM to cache lots of item-item
similarities or else that computation will slow things down a lot.

Easy enough to try and see how it flies.

Sean

On Tue, Jan 20, 2009 at 5:30 AM, Goel, Ankur <an...@corp.aol.com> wrote:
> Hi Sean,
>        Thanks for helping out here. The data can be assumed to be in
> either form mentioned below, since both forms are interchangeable:-
>
> [userA: item1, item2, item3 ... ]
> OR
> [item1: userA, userB, userC ..]
>
> Each user and each item is assigned a unique identifier. The ratings can
> be considered as binary 1 if user clicked on an item and 0 otherwise.
> Thing to note here is that in case of 0 the item does not exist in the
> user history. So what we have essentially is a sparse representation
> where 0's are not stored at all.
>
> As for which one is more (user/item) from the dataset we have relatively
> high number of users and less items. There are around 200 - 300 thousand
> unique items but expected to grow to 1 - 2 million. So I think item
> based recommender sounds like something we can try out.
>
> About Tanimoto measure, I thought of using it in hierarchical clustering
> but Ted suggested it might not solve the purpose. He suggested that we
> can try computing the log-likelihood of co-occurrence of items.
>
> I would like to try out both the item based recommender you suggested
> and also the log-likelihood approach. Do we have the map-red version of
> log-likelihood code in Mahout?
>
> Ted, any thoughts?
>
> Regards
> -Ankur

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Which methods are you referring to specifically? Are they available in
Mahout? If not, then perhaps we can think about including them.

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Tuesday, January 20, 2009 12:19 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Item based recommenders can be very effective even with very large numbers
of items, especially if the distribution of usage is a long-tailed one so
that you get significant amounts of co-occurrence.

Ultimately, the sparsity will hurt you, however.  Then you have to move to a
latent variable formulation.  That provides smoothing so that you can reason
from one item to another even without any explicit co-occurrence.  LSI was a
very fashionable choice for this at one time, but there are much better
methods available now.

On Mon, Jan 19, 2009 at 5:31 AM, Sean Owen <sr...@gmail.com> wrote:

>
> Do you have relatively lots of users, or lots of items? If you have
> relatively few items, an item-based recommender is ideal -- and vice
> versa with user-based recommenders.




-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
Item based recommenders can be very effective even with very large numbers
of items, especially if the distribution of usage is a long-tailed one so
that you get significant amounts of co-occurrence.

Ultimately, the sparsity will hurt you, however.  Then you have to move to a
latent variable formulation.  That provides smoothing so that you can reason
from one item to another even without any explicit co-occurrence.  LSI was a
very fashionable choice for this at one time, but there are much better
methods available now.

On Mon, Jan 19, 2009 at 5:31 AM, Sean Owen <sr...@gmail.com> wrote:

>
> Do you have relatively lots of users, or lots of items? If you have
> relatively few items, an item-based recommender is ideal -- and vice
> versa with user-based recommenders.




-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Hi Ted,
        I think it will be interesting to have such a code in Mahout.
The first part of it would be to compute the matrix of co-occurrence
counts and the next part would be to compute the log-likelihood scores.

Should we open a JIRA wherein you can provide a worked example and we
can take it from there?

Thanks
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Tuesday, January 20, 2009 12:16 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

I can supply code for computing the measure itself, but not for the
map-reduce computation of the counts involved.

In my experience, this only requires about 10-15 lines of pig but rather a
larger amount of native map-reduce code.  At Veoh, we used this and other
mechanisms to reduce very large amounts of data (7 months at billions of
events per month) into form usable for recommendation.  Even with a
relatively small cluster, this is not an extremely long computation.

The four inputs to the log-likelihood ratio test for independence are all
counts.  For item A and item B, the necessary counts are the number of
users who interacted with both item A and item B, the number of users who
interacted with A but not B, with B but not A, and the number of users who
interacted with neither item.  To minimize issues with click spam it is
customary to count only one interaction per user, so all of the counts can
be considered a count of users rather than events.

If you view your set of histories as a binary matrix H containing rows
that correspond to users and columns that correspond to items, then H' H is
the matrix of co-occurrence counts for all possible A's and B's.  Columns of
H' H provide the information needed to get the A-not-B and B-not-A counts and
the total of the matrix gives the information for the not-A-not-B counts.

This matrix multiplication is, in fact, the same as a join.

I have a blog posting on the subject of computing log-likelihood ratios
here:

 http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

If need be, I can add a worked example of how to compute co-occurrence using
map-reduce.


On Mon, Jan 19, 2009 at 9:30 PM, Goel, Ankur
<an...@corp.aol.com>wrote:

> About Tanimoto measure, I thought of using it in hierarchical clustering
> but Ted suggested it might not solve the purpose. He suggested that we
> can try computing the log-likelihood of co-occurrence of items.
>
> I would like to try out both the item based recommender you suggested
> and also the log-likelihood approach. Do we have the map-red version of
> log-likelihood code in Mahout?
>
> Ted, any thoughts?
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
I can supply code for computing the measure itself, but not for the
map-reduce computation of the counts involved.

In my experience, this only requires about 10-15 lines of pig but rather a
larger amount of native map-reduce code.  At Veoh, we used this and other
mechanisms to reduce very large amounts of data (7 months at billions of
events per month) into form usable for recommendation.  Even with a
relatively small cluster, this is not an extremely long computation.

The four inputs to the log-likelihood ratio test for independence are all
counts.  For item A and item B, the necessary counts are the number of
users who interacted with both item A and item B, the number of users who
interacted with A but not B, with B but not A, and the number of users who
interacted with neither item.  To minimize issues with click spam it is
customary to count only one interaction per user, so all of the counts can
be considered a count of users rather than events.

If you view your set of histories as a binary matrix H containing rows
that correspond to users and columns that correspond to items, then H' H is
the matrix of co-occurrence counts for all possible A's and B's.  Columns of
H' H provide the information needed to get the A-not-B and B-not-A counts and
the total of the matrix gives the information for the not-A-not-B counts.

This matrix multiplication is, in fact, the same as a join.

I have a blog posting on the subject of computing log-likelihood ratios
here:

 http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

If need be, I can add a worked example of how to compute co-occurrence using
map-reduce.
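
For reference, a self-contained sketch of the ratio itself, computed from the four counts via unnormalized entropies as described in the blog post above. Whether and where such a utility should live in Mahout is exactly what this thread is discussing, so the class here is illustrative rather than an existing API.

public final class LogLikelihoodRatio {

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  /** Unnormalized entropy of a set of counts: xLogX(sum) - sum of xLogX(count). */
  private static double entropy(long... counts) {
    long sum = 0;
    double sumXLogX = 0.0;
    for (long count : counts) {
      sum += count;
      sumXLogX += xLogX(count);
    }
    return xLogX(sum) - sumXLogX;
  }

  /**
   * k11 = users who interacted with both A and B, k12 = A but not B,
   * k21 = B but not A, k22 = neither.  Large values indicate anomalous co-occurrence.
   */
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }
}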


On Mon, Jan 19, 2009 at 9:30 PM, Goel, Ankur <an...@corp.aol.com>wrote:

> About Tanimoto measure, I thought of using it in hierarchical clustering
> but Ted suggested it might not solve the purpose. He suggested that we
> can try computing the log-likelihood of co-occurrence of items.
>
> I would like to try out both the item based recommender you suggested
> and also the log-likelihood approach. Do we have the map-red version of
> log-likelihood code in Mahout?
>
> Ted, any thoughts?
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Hi Sean,
        Thanks for helping out here. The data can be assumed to be in
either form mentioned below, since both forms are interchangeable:-

[userA: item1, item2, item3 ... ]
OR
[item1: userA, userB, userC ..]

Each user and each item is assigned a unique identifier. The ratings can
be considered as binary 1 if user clicked on an item and 0 otherwise.
Thing to note here is that in case of 0 the item does not exist in the
user history. So what we have essentially is a sparse representation
where 0's are not stored at all.

As for which one is more (user/item) from the dataset we have relatively
high number of users and less items. There are around 200 - 300 thousand
unique items but expected to grow to 1 - 2 million. So I think item
based recommender sounds like something we can try out.

About Tanimoto measure, I thought of using it in hierarchical clustering
but Ted suggested it might not solve the purpose. He suggested that we
can try computing the log-likelihood of co-occurrence of items. 

I would like to try out both the item based recommender you suggested
and also the log-likelihood approach. Do we have the map-red version of
log-likelihood code in Mahout?

Ted, any thoughts?

Regards
-Ankur

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Monday, January 19, 2009 7:02 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Binary meaning you just have a "yes, the user likes or has seen or
bought the book" versus no relation at all? Yeah that should be a
special case of what CF normally works on, where you have some degree
of preference versus yes/no. In that sense yes it is supported and in
theory should be a lot faster -- in practice it's only easy to gain a
little speedup since to really re-orient the algorithms to take
advantage of this case would take a lot of change.

Thinking it through... I am not sure slope one would work in this
case. It operates on relative differences in ratings across items, and
if all your ratings are "1.0" if they exist at all, then it falls
apart.

So perhaps the other algorithms are a better place to start after all.
The binary case does allow you to use fast similarity metrics like the
Tanimoto measure, and if you have a fast similarity metric you
generally have a fast algorithm since most algorithms rely heavily on
computing similarity metrics.

Do you have relatively lots of users, or lots of items? If you have
relatively few items, an item-based recommender is ideal -- and vice
versa with user-based recommenders.

How is that sounding? what form is your data in? I could send over
rough draft code to try out.

On Mon, Jan 19, 2009 at 12:37 PM, Goel, Ankur <an...@corp.aol.com>
wrote:
> Yep! I actually want to recommend items of interest, where item depends
> on the context, say for an online bookshop it is books. A few questions
> regarding slope one:
> 1. Can it be applied to a binary data setting like mine?
> 2. Do we have an implementation for it in Mahout?
> 3. Will it scale well?

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Binary meaning you just have a "yes, the user likes or has seen or
bought the book" versus no relation at all? Yeah that should be a
special case of what CF normally works on, where you have some degree
of preference versus yes/no. In that sense yes it is supported and in
theory should be a lot faster -- in practice it's only easy to gain a
little speedup since to really re-orient the algorithms to take
advantage of this case would take a lot of change.

Thinking it through... I am not sure slope one would work in this
case. It operates on relative differences in ratings across items, and
if all your ratings are "1.0" if they exist at all, then it falls
apart.

So perhaps the other algorithms are a better place to start after all.
The binary case does allow you to use fast similarity metrics like the
Tanimoto measure, and if you have a fast similarity metric you
generally have a fast algorithm since most algorithms rely heavily on
computing similarity metrics.

Do you have relatively lots of users, or lots of items? If you have
relatively few items, an item-based recommender is ideal -- and vice
versa with user-based recommenders.

How is that sounding? what form is your data in? I could send over
rough draft code to try out.

On Mon, Jan 19, 2009 at 12:37 PM, Goel, Ankur <an...@corp.aol.com> wrote:
> Yep! I actually want to recommend items of interest, where item depends
> on the context, say for an online bookshop it is books. A few questions
> regarding slope one:
> 1. Can it be applied to a binary data setting like mine?
> 2. Do we have an implementation for it in Mahout?
> 3. Will it scale well?

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Yep! I actually want to recommend items of interest, where item depends
on the context, say for an online bookshop it is books. A few questions
regarding slope one:
1. Can it be applied to a binary data setting like mine?
2. Do we have an implementation for it in Mahout?
3. Will it scale well?

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Monday, January 19, 2009 5:36 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Oh, so you are really recommending things like books, rather than URLs
-- URLs don't have anything directly to do with it? Well then this is
indeed a straightforward CF problem.

My favorite CF algorithm at the moment is slope-one -- fast, good
recommendations, and fairly resilient to noise.

On Mon, Jan 19, 2009 at 11:44 AM, Goel, Ankur <an...@corp.aol.com>
wrote:
> Not all URLs represent unique items / entities of interest. For e.g. a
> lot of URLs would be just site specific search/listing pages or pages
> that have a lot of navigational information but do not actually
> represent an entity or item of interest.
>
> Given such a page we do not want to recommend links to items already on
> the page but items that were far ahead (listing page 3, 4) and were also
> liked most by the users on the site.
>
> Also for a URL that does represent a unique entity (For e.g. a book on
> Amazon), we do not want to recommend other search/listing/navigational
> pages but pages with actual items that people have liked w.r.t the
> current page.
>
> The intent is to gauge the relative popularity or model the
> co-occurrence of items with respect to each other and also remove the
> anomalies.
>
> Lets say A = book1, C = listing-page, B=book2, D=book3
>
> So if we have patterns like A-C-B, B-C-D-A, A-C-D-B, then A and B can be
> both recommended for each other, given that one does not have the link
> for the other already on the page.
>
> Whether or not Markov chain will work? I do not know as I need to read
> about Markov chain and find out.
>
> As for log-likelihood ratio tests that sounds like a reasonable
> candidate but I am a bit worried about scalability.
>
> Ted, what's your thought on this?
>
> Thanks
> -Ankur

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Oh, so you are really recommending things like books, rather than URLs
-- URLs don't have anything directly to do with it? Well then this is
indeed a straightforward CF problem.

My favorite CF algorithm at the moment is slope-one -- fast, good
recommendations, and fairly resilient to noise.

On Mon, Jan 19, 2009 at 11:44 AM, Goel, Ankur <an...@corp.aol.com> wrote:
> Not all URLs represent unique items / entities of interest. For e.g. a
> lot of URLs would be just site specific search/listing pages or pages
> that have a lot of navigational information but do not actually
> represent an entity or item of interest.
>
> Given such a page we do not want to recommend links to items already on
> the page but items that were far ahead (listing page 3, 4) and were also
> liked most by the users on the site.
>
> Also for a URL that does represent a unique entity (For e.g. a book on
> Amazon), we do not want to recommend other search/listing/navigational
> pages but pages with actual items that people have liked w.r.t the
> current page.
>
> The intent is to gauge the relative popularity or model the
> co-occurrence of items with respect to each other and also remove the
> anomalies.
>
> Lets say A = book1, C = listing-page, B=book2, D=book3
>
> So if we have patterns like A-C-B, B-C-D-A, A-C-D-B, then A and B can be
> both recommended for each other, given that one does not have the link
> for the other already on the page.
>
> Whether or not Markov chain will work? I do not know as I need to read
> about Markov chain and find out.
>
> As for log-likelihood ratio tests that sounds like a reasonable
> candidate but I am a bit worried about scalability.
>
> Ted, what's your thought on this?
>
> Thanks
> -Ankur

RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Not all URLs represent unique items / entities of interest. For e.g. a
lot of URLs would be just site specific search/listing pages or pages
that have a lot of navigational information but do not actually
represent an entity or item of interest.

Given such a page we do not want to recommend links to items already on
the page but items that were far ahead (listing page 3, 4) and were also
liked most by the users on the site.

Also for a URL that does represent a unique entity (For e.g. a book on
Amazon), we do not want to recommend other search/listing/navigational
pages but pages with actual items that people have liked w.r.t the
current page.
 
The intent is to gauge the relative popularity or model the
co-occurrence of items with respect to each other and also remove the
anomalies.

Lets say A = book1, C = listing-page, B=book2, D=book3

So if we have patterns like A-C-B, B-C-D-A, A-C-D-B, then A and B can be
both recommended for each other, given that one does not have the link
for the other already on the page.

Whether or not Markov chain will work? I do not know as I need to read
about Markov chain and find out.

As for log-likelihood ratio tests that sounds like a reasonable
candidate but I am a bit worried about scalability. 

Ted, what's your thought on this?

Thanks
-Ankur

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Monday, January 19, 2009 3:18 PM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Maybe a dumb question, but why subtract the links? I can only get from A to
B via a hyperlink (well, if I navigate directly to B, is the fact that I was
on A meaningful?)  Normalizing for transitions that correspond to a link
seems to do nothing. Maybe I do not understand the problem fully.

An A-C-B transition doesn't suggest that A should be recommended from B
right, but, say, A-B-A would. My point was only that it is not always
symmetric of course, and so applying CF gets a little trickier since the
algorithms would assume symmetry.

Would a short Markov chain work and scale? For 3 elements, it needs storage
proportional to the cube of the average number of links per page. I don't
think CF will scale nearly as well here; it is not feeling like quite the
right tool for the job.

Sean

On 19 Jan 2009, 8:45 AM, "Goel, Ankur" <an...@corp.aol.com> wrote:



Ted / Sean,

The link structure should definitely be subtracted. From the original
dataset or from the recommended item-set is left to the implementation.
I think it will be easier to do this from the recommended item-set.

As for not recommending urls in reverse order (B for A but not A for B,
given B appeared after A) one will have to keep track of his current
browsing history and remove those that user has already seen. Although
if user does reach B through some other link C then it does make sense
to recommend A.

Given the size of the data-set what kind of algorithm and keeping in
mind that it could grow in future what algorithms would you try out?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Sunday, January 18, 2009 2:06 AM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

> Predicting next URL is an i...

Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
Maybe a dumb question, but why subtract the links? I can only get from A to
B via a hyperlink (well, if I navigate directly to B, is the fact that I was
on A meaningful?)  Normalizing for transitions that correspond to a link
seems to do nothing. Maybe I do not understand the problem fully.

An A-C-B transition doesn't suggest that A should be recommended from B
right, but, say, A-B-A would. My point was only that it is not always
symmetric of course, and so applying CF gets a little trickier since the
algorithms would assume symmetry.

Would a short Markov chain work and scale? For 3 elements, it needs storage
proportional to the cube of the average number of links per page. I don't
think CF will scale nearly as well here; it is not feeling like quite the
right tool for the job.
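
As an illustration of what such a short-memory chain amounts to, here is a rough sketch that counts next URLs conditioned on the previous two; it is not anything in Mahout, and the number of distinct contexts it must store is exactly the scaling concern raised above.

import java.util.HashMap;
import java.util.Map;

public final class NextUrlCounter {

  /** (previousUrl, currentUrl) context -> next URL -> count. */
  private final Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();

  /** Feed one user's click sequence, in order. */
  public void addSession(String[] urls) {
    for (int i = 2; i < urls.length; i++) {
      String context = urls[i - 2] + ' ' + urls[i - 1];
      Map<String, Integer> nextCounts = counts.get(context);
      if (nextCounts == null) {
        nextCounts = new HashMap<String, Integer>();
        counts.put(context, nextCounts);
      }
      Integer current = nextCounts.get(urls[i]);
      nextCounts.put(urls[i], (current == null ? 0 : current) + 1);
    }
  }

  /** The most frequently observed next URL for this (previous, current) pair, or null if unseen. */
  public String mostLikelyNext(String previous, String current) {
    Map<String, Integer> nextCounts = counts.get(previous + ' ' + current);
    if (nextCounts == null) {
      return null;
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> entry : nextCounts.entrySet()) {
      if (entry.getValue() > bestCount) {
        bestCount = entry.getValue();
        best = entry.getKey();
      }
    }
    return best;
  }
}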

Sean

On 19 Jan 2009, 8:45 AM, "Goel, Ankur" <an...@corp.aol.com> wrote:



Ted / Sean,

The link structure should definitely be subtracted. From the original
dataset or from the recommended item-set is left to the implementation.
I think it will be easier to do this from the recommended item-set.

As for not recommending urls in reverse order (B for A but not A for B,
given B appeared after A) one will have to keep track of his current
browsing history and remove those that user has already seen. Although
if user does reach B through some other link C then it does make sense
to recommend A.

Given the size of the data-set what kind of algorithm and keeping in
mind that it could grow in future what algorithms would you try out?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Sunday, January 18, 2009 2:06 AM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

> Predicting next URL is an i...

RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.

Ted / Sean,

The link structure should definitely be subtracted. From the original
dataset or from the recommended item-set is left to the implementation.
I think it will be easier to do this from the recommended item-set.

As for not recommending urls in reverse order (B for A but not A for B,
given B appeared after A) one will have to keep track of his current
browsing history and remove those that user has already seen. Although
if user does reach B through some other link C then it does make sense
to recommend A.

Given the size of the data-set what kind of algorithm and keeping in
mind that it could grow in future what algorithms would you try out?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Sunday, January 18, 2009 2:06 AM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer


> Predicting next URL is an important RI problem and using an item-set
> predictor or an indicator-set predictor or a latent-variable predictor is
> likely to work reasonably well.  The asymmetry of the prediction is not
> particularly a problem since it captures important structural cues (web
> links are unidirectional).  It is important, however, to subtract away the
> link structure of the web pages before evaluating the system since
> suggesting that the user simply follow links that already exist is less
> than interesting.  As such, a raw markov chain isn't likely to work well.

On Sat, Jan 17, 2009 at 5:24 AM, Sean Owen <sr...@gmail.com> wrote:

> But then again is this a CF problem? Sounds like markov chains... given the
> last 1 or 2 or 3 URLs visited, which URL has been next, most often? I think
> that's relatively easy and fast, does that work?
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
Actually, collaborative filtering as originally stated and often reiterated
in the form of predicting ratings has very limited utility, but is an
example of a more important super-class of algorithms that I call reflected
intelligence where you use user behavior in a more or less simple way to
cause a system to behave in an apparently intelligent way.

The problem with ratings predictors is that you have posed some very
difficult problems for yourself from the outset.  These include getting
people to rate things at all (very doable on netflix, almost impossible in
my experience on Veoh, Musicmatch and all the fraud systems I have built)
and you have to translate a ratings prediction into useful action (very,
very hard to do well in almost all cases).

Predicting next URL is an important RI problem and using an item-set
predictor or an indicator-set predictor or a latent-variable predictor is
likely to work reasonably well.  The asymmetry of the prediction is not
particularly a problem since it captures important structural cues (web
links are unidirectional).  It is important, however, to subtract away the
link structure of the web pages before evaluating the system since
suggesting that the user simply follow links that already exist is less than
interesting.  As such, a raw markov chain isn't likely to work well.

On Sat, Jan 17, 2009 at 5:24 AM, Sean Owen <sr...@gmail.com> wrote:

> But then again is this a CF problem? Sounds like markov chains... given the
> last 1 or 2 or 3 URLs visited, which URL has been next, most often? I think
> that's relatively easy and fast, does that work?
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Sean Owen <sr...@gmail.com>.
At first this indeed sounds like a CF problem. You can use clustering to
solve a CF problem (see for instance TreeClusteringRecommender)

But you could use other algorithms just as well - see any of the other
Recommenders. You don't have ratings for a URL, just a binary 'yes, visited'
or nothing. You can take advantage of that by using the 'Boolean*' classes
and the Tanimoto similarity metric.

This doesn't capture the fact that there is an ordering that is important -
URL A was clicked just before B so when I am on A we should recommend B (but
not necessarily the reverse). To capture this I think you want to try an
item-based recommender with an item-item similarity that captures this
relation. It won't be symmetric which messes up some other things - may need
more tweaking of existing code to get right.

But then again is this a CF problem? Sounds like markov chains... given the
last 1 or 2 or 3 URLs visited, which URL has been next, most often? I think
that's relatively easy and fast, does that work?

As for data I would indeed consider throwing out data you believe is just
noise.

Sean

On 16 Jan 2009, 12:25 PM, "Goel, Ankur" <an...@corp.aol.com> wrote:

Ted / Karl, Thank you both for your comments and suggestions. Continuing
on the comments from Ted...

The end goal is definitely not clustering but rather recommendations.
This can be broken down into 2 separate tasks typical of a
recommendation engine.
1. Given a URL show other URLs people have liked.
2. Given a User session and the URL he is seeing, suggest other URLs he
might like.

I experimented a bit with clustering but couldn't get good
recommendations.
From your advice, the log-likelihood ratio sounds like a potential solution
for the first one. I remember having a discussion with you and Sean long
time back where you pointed to a useful paper
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962

Please pardon me if I am asking the question again but do you think it's
a promising approach for problem 1? Do we have an implementation for
this in Mahout? If no then I can open it and work on it (given my other
work commitments allow enough time). Also since I have no formal
statistics background, I am working on 'rebuilding' my statistics
knowledge so that I can grasp these concepts better.

As for the data rates, I really don't know in the context of these
techniques what's low and what's high but what I have learnt after
accumulating weeks of data is that there are a few users who have good
engagement (sufficient clicks) over a period of time, a moderate number of
users who have a small number of clicks, and a large number of users who
have very few clicks and are just casual surfers.

Also regarding building a user model as a simple mixture, I am not sure
which one you are referring to. Is it the LDA JIRA that Jeff is working
on?

Once again thanks for all the help, much appreciated.

Regards
-Ankur

-----Original Message----- From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Thursday, Januar...

RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Ted / Karl, Thank you both for your comments and suggestions. Continuing
on the comments from Ted...

The end goal is definitely not clustering but rather recommendations.
This can be broken down into 2 separate tasks typical of a
recommendation engine.
1. Given a URL show other URLs people have liked.
2. Given a User session and the URL he is seeing, suggest other URLs he
might like.

I experimented a bit with clustering but couldn't get good
recommendations.
From your advice, the log-likelihood ratio sounds like a potential solution
for the first one. I remember having a discussion with you and Sean long
time back where you pointed to a useful paper
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962

Please pardon me if I am asking the question again but do you think it's
a promising approach for problem 1? Do we have an implementation for
this in Mahout? If not, then I can open it and work on it (given my other
work commitments allow enough time). Also since I have no formal
statistics background, I am working on 'rebuilding' my statistics
knowledge so that I can grasp these concepts better.

As for the data rates, I really don't know in the context of these
techniques what's low and what's high but what I have learnt after
accumulating weeks of data is that there are a few users who have good
engagement (sufficient clicks) over a period of time, a moderate number of
users who have a small number of clicks, and a large number of users who
have very few clicks and are just casual surfers.

Also regarding building a user model as a simple mixture, I am not sure
which one you are referring to. Is it the LDA JIRA that Jeff is working
on?

Once again thanks for all the help, much appreciated.

Regards
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Thursday, January 15, 2009 12:58 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

That kind of clustering is not what would answer the important question for
me.

It sounds like the important question is one of two things:

a) what drives the viewing of the page (using external evidence only and not
estimating any kind of user model)

b) can we estimate some sort of mental state that leads to viewing (using
evidence to drive a user model which drives new evidence)

The first approach has the virtue of great simplicity.  To predict a view,
you can find anomalously co-occurrent urls (co-occurrent in the sense that
they were viewed by the same user).  Then you can use the anomalously
co-occurrent urls to build a viewing model, probably using something like
ridge logistic regression or simply using IDF weighting.  I don't think
that NaiveBayes is as good for these models as people think.  Anomalous
co-occurrence is not something well detected by Jaccard or cousins, but it
is well detected by log-likelihood ratio tests based on 2x2 contingency
tables.  Simply using anomalously co-occurrent URLs makes for a very good
recommendation engine if you have good data rates.

The second approach has the virtue that you can define a nice metric for
urls.  First you take the likelihood that two urls will both be visited by a
particular user, then you average over the distribution of all users.  This
gives you an overall co-occurrence probability of two urls, the negative log
of which is likely to be informative if interpreted as a distance.  Building
the user model gives you the virtue of smoothing, which is where Jaccard
falls apart.  For low data rates, this is likely to give you the best
results.

Neither of these approaches involves clustering because clustering is not
the goal.  If you really want the clustering as an end in itself, I think
you should restate the coordinate space before clustering.  A good way
to do that is to build a user model that is a simple mixture; then the
mixture coefficients for each document make good coordinates for normal
clustering algorithms (because dot products have such a natural
interpretation, which implies that Euclidean distance is a useful measure).
In fact, the dot product for the mixture model is exactly the "average over
all users" idea from above for the special case of a mixture model.  Other
user models might not be so simple.  Jeff Eastman has done a lot of work on
building such a mixture model estimator for Mahout.

I have had cases where clustering and model building were essentially unable
to interpret the data at all in the original data space, but were able to
extract enormous amounts of useful information when the data was restated in
terms of mixture model coefficients.


On Tue, Jan 13, 2009 at 4:33 AM, Ankur (JIRA) <ji...@apache.org> wrote:

> I would like to cluster them in a tree and use the model to answer the near
> neighborhood type queries i.e. what urls are related to what other urls. I
> did implement a sequential bottom-up hierarchical clustering algorithm but
> the complexity is too bad for my data-set. I then thought about implementing
> a top-down hierarchical clustering algorithm using Jaccard co-efficient as
> my distance measure and came across this patch.
>

Re: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <te...@gmail.com>.
That kind of clustering is not what would answer the important question for
me.

It sounds like the important question is one of two things:

a) what drives the viewing of the page (using external evidence only and not
estimating any kind of user model)

b) can we estimate some sort of mental state that leads to viewing (using
evidence to drive a user model which drives new evidence)

The first approach has the virtue of great simplicity.  To predict a view,
you can find anomalously co-occurrent urls (co-occurrent in the sense that they
were viewed by the same user).  Then you can use the anomalously
co-occurrent urls to build a viewing model, probably using something like
ridge logistic regression or simply using IDF weighting.  I don't think
that NaiveBayes is as good for these models as people think.  Anomalous
co-occurrence is not something well detected by Jaccard or cousins, but it is
well detected by log-likelihood ratio tests based on 2x2 contingency
tables.  Simply using anomalously co-occurrent URLs makes for a very good
recommendation engine if you have good data rates.
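
For reference, a minimal sketch of the log-likelihood ratio (G^2) score on such
a 2x2 table follows; the class name and the example counts are purely
hypothetical:

/** Illustrative log-likelihood ratio (G^2) for a 2x2 contingency table. */
public final class LlrSketch {

  /**
   * k11: users who viewed both URLs, k12: viewed A but not B,
   * k21: viewed B but not A, k22: viewed neither.
   */
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double n = k11 + k12 + k21 + k22;
    double llr = 0.0;
    llr += term(k11, (k11 + k12) * (k11 + k21) / n);
    llr += term(k12, (k11 + k12) * (k12 + k22) / n);
    llr += term(k21, (k21 + k22) * (k11 + k21) / n);
    llr += term(k22, (k21 + k22) * (k12 + k22) / n);
    return 2.0 * llr;
  }

  // Each cell contributes observed * ln(observed / expected); empty cells contribute nothing.
  private static double term(long observed, double expected) {
    return observed == 0 ? 0.0 : observed * Math.log(observed / expected);
  }

  public static void main(String[] args) {
    // Hypothetical counts: 20 co-views among 10,000 users scores far above chance.
    System.out.println(logLikelihoodRatio(20, 80, 100, 9800));
  }
}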

The second approach has the virtue that you can define a nice metric for
urls.  First you take the likelihood that two urls will both be visited by a
particular user, then you average over the distribution of all users.  This
gives you an overall co-occurrence probability of two urls, the negative log
of which is likely to be informative if interpreted as a distance.  Building
the user model gives you the virtue of smoothing, which is where Jaccard
falls apart.  For low data rates, this is likely to give you the best
results.
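
As a minimal sketch of that distance, assuming each user's visit probabilities
are simply their click counts normalized to sum to 1 (the class name and the
shape of the user model are placeholders, not anything from Mahout):

import java.util.Map;

/** Illustrative sketch: -log of the user-averaged co-view probability of two URLs. */
public final class CoViewDistanceSketch {

  /** Each element maps URL -> that user's probability of viewing it (values sum to 1 per user). */
  public static double distance(Iterable<Map<String, Double>> users, String urlA, String urlB) {
    double sum = 0.0;
    int n = 0;
    for (Map<String, Double> userModel : users) {
      Double pa = userModel.get(urlA);
      Double pb = userModel.get(urlB);
      // Probability this user views both, assuming independence given the user.
      sum += (pa == null ? 0.0 : pa) * (pb == null ? 0.0 : pb);
      n++;
    }
    double average = n == 0 ? 0.0 : sum / n;
    // Pairs that are never co-viewed get an effectively infinite distance.
    return average == 0.0 ? Double.POSITIVE_INFINITY : -Math.log(average);
  }
}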

Neither of these approaches involves clustering because clustering is not
the goal.  If you really want the clustering as an end in itself, I think
you should restate the coordinate space before clustering.  A good way
to do that is to build a user model that is a simple mixture, then the
mixture coefficients for each document make good coordinates for normal
clustering algorithms (because dot products have such a natural
interpretation which implies that Euclidean distance is a useful measure).
In fact, the dot product for the mixture model is exactly the "average over
all users" idea from above for the special case of a mixture model.  Other
user models might not be so simple.  Jeff Eastman has done a lot of work on
building such a mixture model estimator for Mahout.

I have had cases where clustering and model building were essentially unable
to interpret the data at all in the original data space, but were able to
extract enormous amounts of useful information when the data was restated in
terms of mixture model coefficients.


On Tue, Jan 13, 2009 at 4:33 AM, Ankur (JIRA) <ji...@apache.org> wrote:

> I would like to cluster them in a tree and use the model to answer the near
> neighborhood type queries i.e. what urls are related to what other urls. I
> did implement a sequential bottom-up hierarchical clustering algorithm but
> the complexity is too bad for my data-set. I then thought about implementing
> a top-down hierarchical clustering algorithm using Jaccard co-efficient as
> my distance measure and came across this patch.
>

[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588801#action_12588801 ] 

Karl Wettin commented on MAHOUT-19:
-----------------------------------

The bottom feed graph actually demonstrates a few problems.

Max distance between nodes to be included in a cluster is the mean distance to leaf node siblings. It is currently 1.5 or so, due to the number of leaf nodes that have 0 distance to their sibling. The end result should be something like 40 for the best result on this data.

Look at the yellow leaf nodes.

If a leaf node were able to represent more than one vector when the distance between them is below a threshold, the yellow nodes would be gathered as a single child of the branch with 49,042 between siblings. Looking at the rest of the graph I think this would end up with a mean distance of 30 or so, which results in much larger, fewer and better clusters.

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-19:
------------------------------

    Attachment: MAHOUT-19.txt

Rewrote most of the ugly code from scratch, added some tree integrity checks and simplified a bunch of things.

I'll try to get this hooked up with 20newsgroups now.

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (MAHOUT-19) Hierarchial clusterer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-19.
-----------------------------

    Resolution: Later

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588622#action_12588622 ] 

Karl Wettin commented on MAHOUT-19:
-----------------------------------

This is the first real thing I've done with Hadoop. It would be great to get some input on how I have used it. Pretend that DistributedBottomFeed is a driver class.

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590640#action_12590640 ] 

Karl Wettin commented on MAHOUT-19:
-----------------------------------

This is doing quite OK now. You need to add quite a number of instances before it makes sense to distribute the calculations. A leaf can now contain many instances if they are similar enough, as a simple form of pruning to avoid very deep trees. There are a couple of optimizable todos in the code.

All that's left for it to be usable is to persist the tree so other applications can access it.

This is a rather slow algorithm as the tree grows big. Perhaps it could work to keep track of all clusters, calculate their mean instances, look for the n closest cluster means, and in a second iteration look for the closest instances available. But I don't know, this might have effects similar to top feeding the tree.

I'm also not sure if I bottom feed the best way right now. Once the closest instance is found I look for the closest node in the chain of parents towards the root. It might actually be good to also check whether sibling nodes along the way up are closer to the new instance.

I don't know. I'll have to test some.
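
To make the above concrete, here is a rough sketch of that bottom-feed insertion; the interfaces and method names are invented for illustration and are not taken from the patch:

{noformat}
import java.util.List;

/** Illustrative bottom-feed insertion; assumes the tree already holds at least one leaf. */
public final class BottomFeedSketch {

  interface Node {
    double[] meanVector();   // a leaf simply returns its instance vector
    Node parent();           // null at the root
  }

  interface Tree {
    List<Node> leaves();
    /** Replace 'sibling' with a new branch whose children are 'sibling' and the new leaf. */
    void insertAsSiblingOf(Node sibling, double[] newInstance);
  }

  interface DistanceMeasure {
    double distance(double[] a, double[] b);
  }

  public static void insert(Tree tree, double[] instance, DistanceMeasure measure) {
    // 1. Scan all leaves for the closest one (the expensive step, the one worth distributing).
    Node closest = null;
    double best = Double.POSITIVE_INFINITY;
    for (Node leaf : tree.leaves()) {
      double d = measure.distance(instance, leaf.meanVector());
      if (d < best) {
        best = d;
        closest = leaf;
      }
    }
    // 2. Walk the chain of parents towards the root and keep the closest node seen.
    Node attachPoint = closest;
    for (Node n = closest.parent(); n != null; n = n.parent()) {
      double d = measure.distance(instance, n.meanVector());
      if (d < best) {
        best = d;
        attachPoint = n;
      }
    }
    // 3. Attach the new instance as a sibling of the closest node found.
    tree.insertAsSiblingOf(attachPoint, instance);
  }
}
{noformat}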



> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589582#action_12589582 ] 

Karl Wettin commented on MAHOUT-19:
-----------------------------------

{quote}
Ok, here is the current map reduce strategy for building the tree:
{quote}

This will not work. Leaves must be inserted one at a time into the tree in order to get the second dimension correct. Perhaps it is possible to run a second job where all paired leaves are moved to the closest branch, but I don't think so. The tree will be messed up from the start. So now I'm back at the original strategy: train young trees non-distributed and then execute one job for each instance to be inserted.

This requires appending to files on DFS, so I have hacked up an appender (open, write, append, close) until it's available in Hadoop.
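
For readers unfamiliar with the limitation: DFS had no append() at the time, so the workaround amounts to rewriting the file with the extra bytes and swapping it into place. A rough sketch of that idea against the Hadoop FileSystem API is below; it only illustrates the approach and is not the actual appender from the patch:

{noformat}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Illustrative "append by rewrite" workaround for a pre-append DFS. */
public final class DfsAppendSketch {

  public static void append(Configuration conf, Path target, byte[] extra) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path tmp = new Path(target.getParent(), target.getName() + ".tmp");

    FSDataOutputStream out = fs.create(tmp, true);
    try {
      if (fs.exists(target)) {
        FSDataInputStream in = fs.open(target);
        try {
          IOUtils.copyBytes(in, out, conf, false); // copy the existing contents first
        } finally {
          in.close();
        }
      }
      out.write(extra);                            // then the appended bytes
    } finally {
      out.close();
    }

    fs.delete(target, true);                       // swap the rewritten file into place
    fs.rename(tmp, target);
  }
}
{noformat}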

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788106#action_12788106 ] 

Karl Wettin commented on MAHOUT-19:
-----------------------------------

The code should work; it is however in need of some sort of persistence layer to store the tree. It could run in memory but that would not allow for a very big tree. If I'm not mistaken there is code in the patch that stores the tree in a BDB JE store.

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589060#action_12589060 ] 

Karl Wettin commented on MAHOUT-19:
-----------------------------------

Ok, here is the current map reduce strategy for building the tree:

{noformat}
add all instances to tree as leaf nodes with no parents;
while (tree contains more than 2 nodes with no parent) {
  create file with all permutations of nodes with no parent;
  execute job to find which node pairs are closest to each other;
  add node pairs as siblings to each other;
  calculate mean vectors in all branches; (another map reduce job)
}
place last two parentless nodes in root node;
calculate root mean vector;
{noformat}

Any comments to that?

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-19:
------------------------------

    Attachment: MAHOUT-19.txt

This patch depends on Hadoop 0.19 (trunk) and JDK 1.6

Primary local fs storage of the cluster tree is PersistentMap, a transactionless quick and dirty Map<Writable, Writable> that uses RandomAccessFile and consumes no RAM no matter the size of the map. There is also support for BDB JE in the contrib/jeTree package; this is however not yet handled from Ant/Maven. Still looking for a better alternative.
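
As a rough illustration of the idea (keep every entry on disk, nothing resident in RAM), here is a deliberately naive sketch that appends records to a RandomAccessFile and answers get() by scanning the file; the real PersistentMap in the patch is more elaborate and works on Writables, so treat this only as a toy version of the concept:

{noformat}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;

/** Naive disk-only map backed by a RandomAccessFile; a toy illustration, not the patch's PersistentMap. */
public final class DiskOnlyMapSketch {

  private static final Charset UTF8 = Charset.forName("UTF-8");
  private final RandomAccessFile file;

  public DiskOnlyMapSketch(File f) throws IOException {
    this.file = new RandomAccessFile(f, "rw");
  }

  /** Appends a record; the newest record for a key wins on lookup. */
  public void put(String key, String value) throws IOException {
    file.seek(file.length());
    writeBytes(key.getBytes(UTF8));
    writeBytes(value.getBytes(UTF8));
  }

  /** Scans the whole file; O(n) per lookup, but no in-memory index at all. */
  public String get(String key) throws IOException {
    String result = null;
    file.seek(0L);
    while (file.getFilePointer() < file.length()) {
      String k = new String(readBytes(), UTF8);
      String v = new String(readBytes(), UTF8);
      if (k.equals(key)) {
        result = v; // keep scanning, a later record may overwrite this one
      }
    }
    return result;
  }

  private void writeBytes(byte[] bytes) throws IOException {
    file.writeInt(bytes.length);
    file.write(bytes);
  }

  private byte[] readBytes() throws IOException {
    byte[] bytes = new byte[file.readInt()];
    file.readFully(bytes);
    return bytes;
  }

  public void close() throws IOException {
    file.close();
  }
}
{noformat}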


> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663797#action_12663797 ] 

Karl Wettin commented on MAHOUT-19:
-----------------------------------

bq. 1. Assuming you are training the tree top-down, what is the division criteria you are using ?

The distributed code uses bottom-up training; one could however add some code that trains it top-down.

bq. 2. How well does it scale ?

It scales OK, the tree grows rather large though.

bq. 3. Was the data on which this was tried, sparse ?

I've tried a bit of everything. My aim is to produce a tree representing a Lucene index in order to supply instant "more like this" results. Such data is definitely sparse.

bq. 4. What is the distance metric that has been used ?

It can use any of the metrics in Mahout. For my text tests I use Tanimoto (Jaccard).

bq. Basically I have a use -case where-in I have a set of 5 - 10 million urls which have an inherent hierarchical relationship and a set of user-clicks on them. I would like to cluster them in a tree and use the model to answer the near neighborhood type queries i.e. what urls are related to what other urls. I did implement a sequential bottom-up hierarchical clustering algorithm but the complexity is too bad for my data-set. I then thought about implementing a top-down hierarchical clustering algorithm using Jaccard co-efficient as my distance measure and came across this patch. Can you suggest if this patch will help?

The tree will probably be huge, but disk space is cheap. There are a few ways of improving speed; one could for instance find the closest clusters and navigate to the closest leaf rather than iterating over all leaves. That should speed things up a lot, I've calculated it to be something like a factor of 10 on text data. There is however no support for that right now, but I think there are some comments about it in the code.

Whether 5-10 million instances will take a long time to train, I do not know. But once trained it will only take a few milliseconds to extract a cluster for a given trained instance.

There is a bug in this code: a tree will only grow in one child of the root. Not a big problem, and it should be rather easy to fix too. I just need to find some time to trace it out.

Another problem is the tree persistence. Currently it uses PersistentHashMap, something I hacked up for this patch. There is also a contrib module that uses BerkeleyDB JE, and that is much nicer. PersistentHashMap has actually been factored out, improved and posted as a brand new project called BananaDB, available in Apache Labs:

http://svn.apache.org/repos/asf/labs/bananadb/

The BerkeleyDB dependency (read: the Sleepycat License) might not be a problem anymore as we are moving towards Maven. We should probably talk about that on the dev list.

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594330#action_12594330 ] 

Karl Wettin commented on MAHOUT-19:
-----------------------------------

This is just a comment that I've been really busy and will be for another week or two. Then I've got some time to clean this up and commit it.

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-19:
------------------------------

    Attachment: MAHOUT-19_20100319.diff

A working but non-scalable, non-thread-safe and potentially RAM-hogging implementation.

This is a refactored version of something I've been using for a while. I think no bugs managed to get in there, at least the test case works.

It is (as always) in dire need of more documentation but I think the very small and simple test case might be enough if you already know when a hierarchical clusterer makes more sense. I'll try to add some this weekend but I'm busier than ever right now. I have been missing all discussions on the board for a long time so I'm not going to commit anything unless you tell me this is something you want me to commit.

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19_20100319.diff, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (MAHOUT-19) Hierarchial clusterer

Posted by Karl Wettin <ka...@gmail.com>.
Goel,

the persistence is only local, on the machine executing the job. It 
"needs" to spawn a new job for each instance to be inserted into the tree. 
Where to insert a new instance is what the spawned jobs figure out; 
they do however never touch the tree. It is only touched by the Driver 
class on the local machine. Never DFS.

The reason I don't use MapFile is that it loads the complete 
hashtable into memory (not the values, only the entries table) and that 
means it is not scalable forever. A tree with 10 million instances 
will probably contain ten times as many tree nodes, and each one contains 
a mean instance of its children. So 10 million training instances 
equals some 200 million object instances to keep track of.

If my maths is right, that means something like 700MB of RAM.

That is why I use local object storage and not DFS-based storage.

      karl


Goel, Ankur skrev:
> Karl,
>       Did you try using Hadoop MapFile ? Currently HBase (Hadoop Simple
> database)
> uses them for their indexing requirements for data lying in HDFS. I
> think it would
> make a better choice for Hadoop version of the algorithm.
> 
> JDBM would work fine for the non-hadoop version but for the hadoop
> version the JDBM source code would require modification  to talk to 
> the underlying HDFS for persistence. This would include changing the
> serialization code of JDBM to fit the hadoop writable model. 
> This would be quite involved I think.
> 
> -Ankur
> 
> -----Original Message-----
> From: Karl Wettin (JIRA) [mailto:jira@apache.org] 
> Sent: Wednesday, April 23, 2008 3:37 AM
> To: mahout-dev@lucene.apache.org
> Subject: [jira] Updated: (MAHOUT-19) Hierarchial clusterer
> 
> 
>      [
> https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.
> plugin.system.issuetabpanels:all-tabpanel ]
> 
> Karl Wettin updated MAHOUT-19:
> ------------------------------
> 
>     Attachment: MAHOUT-19.txt
> 
> This works now.  Just needs a bit better tests and the tree->dot
> graphviz code needs to be fixed again before I want to commit anything.
> Also want to try out those training optimization strategies I've written
> about earlier (find closest cluster or leaf node first and then the
> closest instance) and have a few small todos.
> 
> It uses my quick and dirty PersistentMap, a Map<Writable, Writable> that
> keeps all data on disk at all time using RandomAccessFile (local
> storage, not dfs) but will probably be replaced by BSDed
> jdbm.sourceforge.net that Andrzej pointed out.
> 
> 
> 
> 
>> Hierarchial clusterer
>> ---------------------
>>
>>                 Key: MAHOUT-19
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>>             Project: Mahout
>>          Issue Type: New Feature
>>          Components: Clustering
>>            Reporter: Karl Wettin
>>            Assignee: Karl Wettin
>>            Priority: Minor
>>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, 
>> TestBottomFeed.test.png, TestTopFeed.test.png
>>
>>
>> In a hierarchial clusterer the instances are the leaf nodes in a tree
> where branch nodes contains the mean features of and the distance
> between its children.
>> For performance reasons I always trained trees from the top->down. I
> have been told that it can cause various effects I never encountered.
> And I believe Huffman solved his problem by training bottom->up? The
> thing is, I don't think it is possible to train the tree top->down using
> map reduce. I do however think it is possible to train it bottom->up. I
> would very much appreciate any thoughts on this.
>> Once this tree is trained one can extract clusters in various ways.
> The mean distance between all instances is usually a good maximum
> distance to allow between nodes when navigating the tree in search for a
> cluster. 
>> Navigating the tree and gather nodes that are not too far away from
> each other is usually instant if the tree is available in memory or
> persisted in a smart way. In my experience there is not much to win from
> extracting all clusters from start. Also, it usually makes sense to
> allow for the user to modify the cluster boundary variables in real time
> using a slider or perhaps present the named summary of neighbouring
> clusters, blacklist paths in the tree, etc. It is also not to bad to use
> secondary classification on the instances to create worm holes in the
> tree. I always thought it would be cool to visualize it using
> Touchgraph.
>> My focus is on clustering text documents for instant "more like
> this"-feature in search engines and use Tanimoto similarity on the
> vector spaces to calculate the distance.
>> See LUCENE-1025 for a single threaded all in memory proof of concept
> of a hierarchial clusterer.
> 
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 


RE: [jira] Updated: (MAHOUT-19) Hierarchial clusterer

Posted by "Goel, Ankur" <An...@corp.aol.com>.
Karl,
      Did you try using Hadoop MapFile? Currently HBase (Hadoop Simple
database) uses them for their indexing requirements for data lying in HDFS.
I think it would make a better choice for the Hadoop version of the
algorithm.

JDBM would work fine for the non-Hadoop version but for the Hadoop
version the JDBM source code would require modification to talk to
the underlying HDFS for persistence. This would include changing the
serialization code of JDBM to fit the Hadoop Writable model.
This would be quite involved, I think.

-Ankur

-----Original Message-----
From: Karl Wettin (JIRA) [mailto:jira@apache.org] 
Sent: Wednesday, April 23, 2008 3:37 AM
To: mahout-dev@lucene.apache.org
Subject: [jira] Updated: (MAHOUT-19) Hierarchial clusterer


     [
https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.
plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-19:
------------------------------

    Attachment: MAHOUT-19.txt

This works now.  Just needs a bit better tests and the tree->dot
graphviz code needs to be fixed again before I want to commit anything.
Also want to try out those training optimization strategies I've written
about earlier (find closest cluster or leaf node first and then the
closest instance) and have a few small todos.

It uses my quick and dirty PersistentMap, a Map<Writable, Writable> that
keeps all data on disk at all time using RandomAccessFile (local
storage, not dfs) but will probably be replaced by BSDed
jdbm.sourceforge.net that Andrzej pointed out.




> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, 
> TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree
where branch nodes contains the mean features of and the distance
between its children.
> For performance reasons I always trained trees from the top->down. I
have been told that it can cause various effects I never encountered.
And I believe Huffman solved his problem by training bottom->up? The
thing is, I don't think it is possible to train the tree top->down using
map reduce. I do however think it is possible to train it bottom->up. I
would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways.
The mean distance between all instances is usually a good maximum
distance to allow between nodes when navigating the tree in search for a
cluster. 
> Navigating the tree and gather nodes that are not too far away from
each other is usually instant if the tree is available in memory or
persisted in a smart way. In my experience there is not much to win from
extracting all clusters from start. Also, it usually makes sense to
allow for the user to modify the cluster boundary variables in real time
using a slider or perhaps present the named summary of neighbouring
clusters, blacklist paths in the tree, etc. It is also not to bad to use
secondary classification on the instances to create worm holes in the
tree. I always thought it would be cool to visualize it using
Touchgraph.
> My focus is on clustering text documents for instant "more like
this"-feature in search engines and use Tanimoto similarity on the
vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept
of a hierarchial clusterer.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-19:
------------------------------

    Attachment: MAHOUT-19.txt

This works now. It just needs somewhat better tests, and the tree->dot graphviz code needs to be fixed again before I want to commit anything. I also want to try out those training optimization strategies I've written about earlier (find the closest cluster or leaf node first and then the closest instance), and I have a few small todos left.

It uses my quick and dirty PersistentMap, a Map<Writable, Writable> that keeps all data on disk at all times using RandomAccessFile (local storage, not DFS), but it will probably be replaced by the BSD-licensed jdbm.sourceforge.net that Andrzej pointed out.




> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by Ted Dunning <td...@veoh.com>.
Min, max, median and mean distances are all commonly used for this kind of
problem.

This variety is often symptomatic of the limitations of the clustering
algorithm, however.

Here are some examples of alternatives for hierarchical clustering:

http://www.cs.berkeley.edu/~jordan/papers/hierarchical-dp.pdf
http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/nips2004a.pdf
http://sra.itc.it/people/sriharsha/hierarchical_dirichlet.pdf
http://citeseer.ist.psu.edu/mackay94hierarchical.html

Jordan has been involved in some very nice genomic work that infers
cladistic relationships from genetic data as well.

On 4/14/08 3:27 PM, "Sean Owen (JIRA)" <ji...@apache.org> wrote:

> 
>     [ 
> https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin
> .system.issuetabpanels:comment-tabpanel&focusedCommentId=12588828#action_12588
> 828 ] 
> 
> Sean Owen commented on MAHOUT-19:
> ---------------------------------
> 
> This is probably barely relevant, but I thought I would mention that the
> cluster-based recommenders I've used define the distance between clusters
> as the minimum distance between any pair of nodes, one from the first
> cluster and one from the other. Alternatively you could define it as the
> maximum distance between any pair. Maybe that idea could translate to this
> situation if the distance metric is proving problematic.
> 
>> Hierarchial clusterer
>> ---------------------
>> 
>>                 Key: MAHOUT-19
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>>             Project: Mahout
>>          Issue Type: New Feature
>>          Components: Clustering
>>            Reporter: Karl Wettin
>>            Assignee: Karl Wettin
>>            Priority: Minor
>>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png,
>> TestTopFeed.test.png
>> 
>> 
>> In a hierarchial clusterer the instances are the leaf nodes in a tree where
>> branch nodes contains the mean features of and the distance between its
>> children.
>> For performance reasons I always trained trees from the top->down. I have
>> been told that it can cause various effects I never encountered. And I
>> believe Huffman solved his problem by training bottom->up? The thing is, I
>> don't think it is possible to train the tree top->down using map reduce. I do
>> however think it is possible to train it bottom->up. I would very much
>> appreciate any thoughts on this.
>> Once this tree is trained one can extract clusters in various ways. The mean
>> distance between all instances is usually a good maximum distance to allow
>> between nodes when navigating the tree in search for a cluster.
>> Navigating the tree and gather nodes that are not too far away from each
>> other is usually instant if the tree is available in memory or persisted in a
>> smart way. In my experience there is not much to win from extracting all
>> clusters from start. Also, it usually makes sense to allow for the user to
>> modify the cluster boundary variables in real time using a slider or perhaps
>> present the named summary of neighbouring clusters, blacklist paths in the
>> tree, etc. It is also not to bad to use secondary classification on the
>> instances to create worm holes in the tree. I always thought it would be cool
>> to visualize it using Touchgraph.
>> My focus is on clustering text documents for instant "more like this"-feature
>> in search engines and use Tanimoto similarity on the vector spaces to
>> calculate the distance.
>> See LUCENE-1025 for a single threaded all in memory proof of concept of a
>> hierarchial clusterer.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588828#action_12588828 ] 

Sean Owen commented on MAHOUT-19:
---------------------------------

This is probably barely relevant, but I thought I would mention that the cluster-based recommenders I've used define the distance between clusters as the minimum distance between any pair of nodes, one from the first cluster and one from the other. Alternatively you could define it as the maximum distance between any pair. Maybe that idea could translate to this situation if the distance metric is proving problematic.
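
To make the two options concrete, here is a minimal, self-contained sketch of the minimum-pair ("single linkage") and maximum-pair ("complete linkage") inter-cluster distances. It is not taken from any patch on this issue; the double[] representation and the DistanceMeasure interface are stand-ins for whatever types the clusterer ends up using.

{noformat}
import java.util.List;

public final class LinkageDistances {

  public interface DistanceMeasure {
    double distance(double[] a, double[] b);
  }

  // Single linkage: the closest pair across the two clusters.
  public static double singleLinkage(List<double[]> c1, List<double[]> c2, DistanceMeasure m) {
    double best = Double.POSITIVE_INFINITY;
    for (double[] a : c1) {
      for (double[] b : c2) {
        best = Math.min(best, m.distance(a, b));
      }
    }
    return best;
  }

  // Complete linkage: the farthest pair across the two clusters.
  public static double completeLinkage(List<double[]> c1, List<double[]> c2, DistanceMeasure m) {
    double worst = 0.0;
    for (double[] a : c1) {
      for (double[] b : c2) {
        worst = Math.max(worst, m.distance(a, b));
      }
    }
    return worst;
  }
}
{noformat}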

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589006#action_12589006 ] 

Karl Wettin commented on MAHOUT-19:
-----------------------------------

{quote}
This is the first real thing I've done with Hadoop. It would be great with some input on how I have used it.
{quote}

Don't. I just figured something out.

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589060#action_12589060 ] 

karl.wettin edited comment on MAHOUT-19 at 4/15/08 5:52 AM:
------------------------------------------------------------

Ok, here is the current map reduce strategy for building the tree:

{noformat}
add all instances to tree as leaf nodes with no parents;
while (tree contains more than 2 nodes with no parent) {
  create a file with all permutations of nodes with no parent;
  execute a job to find which node pairs are closest to each other;
  add node pairs as siblings to each other; (create new branches with the node pairs as children)
  calculate mean vectors in all new branches; (another map reduce job)
}
place last two parentless nodes in root node;
calculate root mean vector;
{noformat}

Any comments to that?
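
As a rough illustration of the "execute a job to find which node pairs are closest" step, here is a hedged sketch using the old mapred API. Everything in it is an assumption made for the example, not the actual patch: the input line format, the class names, and the simplification that a single reducer picks one globally closest pair per iteration.

{noformat}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ClosestPairSketch {

  // Input value is assumed to look like "id1,id2|v1,v2,...|w1,w2,..." for one candidate pair.
  // Emits (constant key, "distance<TAB>pairIds") so a single reducer sees every pair.
  public static class PairDistanceMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    private static final Text ALL = new Text("all");

    public void map(Text key, Text value, OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] parts = value.toString().split("\\|");
      double d = tanimotoDistance(parse(parts[1]), parse(parts[2]));
      out.collect(ALL, new Text(d + "\t" + parts[0]));
    }
  }

  // Keeps only the candidate pair with the smallest distance.
  public static class MinDistanceReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> out,
        Reporter reporter) throws IOException {
      double best = Double.POSITIVE_INFINITY;
      String bestPair = null;
      while (values.hasNext()) {
        String[] parts = values.next().toString().split("\t");
        double d = Double.parseDouble(parts[0]);
        if (d < best) {
          best = d;
          bestPair = parts[1];
        }
      }
      if (bestPair != null) {
        out.collect(new Text(bestPair), new Text(Double.toString(best)));
      }
    }
  }

  static double[] parse(String csv) {
    String[] parts = csv.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      v[i] = Double.parseDouble(parts[i]);
    }
    return v;
  }

  // Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b)
  static double tanimotoDistance(double[] a, double[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return 1.0 - dot / (normA + normB - dot);
  }
}
{noformat}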

      was (Author: karl.wettin):
    Ok, here is the current map reduce strategy for building the tree:

{noformat}
add all instances to tree as leaf nodes with no parents;
while (tree contains more than 2 nodes with no parent) {
  create file with all permutation of nodes with no parent;
  execute job to find what node pairs are closes to each other;
  add nodes pairs as siblings to each other;
  calculate mean vectors in all branches; (another map reduce job)
}
place last two parentless nodes in root node;
calculate root mean vector;
{noformat}

Any comments to that?
  
> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788060#action_12788060 ] 

Sean Owen commented on MAHOUT-19:
---------------------------------

Looks like there was a lot of work done here that was never committed?

Same question, in the name of housekeeping: should this be committed or shelved? We should commit it if it's reasonably complete, has some tests, and someone might reasonably maintain it over time.

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789264#action_12789264 ] 

Sean Owen commented on MAHOUT-19:
---------------------------------

Sounds like this is something that should be marked "Later"? It seems mostly done but not 100% complete, and perhaps there isn't time to finish and maintain it right now?

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Posted by "Ankur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663319#action_12663319 ] 

Ankur commented on MAHOUT-19:
-----------------------------

Hi Karl, welcome back :-)
Can you share a few things about this patch?

1. Assuming you are training the tree top-down, what division criterion are you using?
2. How well does it scale?
3. Was the data this was tried on sparse?
4. What distance metric has been used?

Basically I have a use case in which I have a set of 5-10 million URLs with an inherent hierarchical relationship and a set of user clicks on them. I would like to cluster them in a tree and use the model to answer near-neighborhood queries, i.e. which URLs are related to which other URLs. I did implement a sequential bottom-up hierarchical clustering algorithm, but its complexity is too high for my data set. I then thought about implementing a top-down hierarchical clustering algorithm using the Jaccard coefficient as my distance measure and came across this patch.

Can you suggest whether this patch will help?
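
For reference, the Jaccard coefficient I have in mind is just the size of the intersection over the size of the union of two URLs' click sets. A throwaway sketch (the sorted, duplicate-free int array representation is only for illustration, not from any patch here):

{noformat}
public final class Jaccard {

  // |A intersect B| / |A union B| for two sets given as sorted, duplicate-free arrays.
  public static double coefficient(int[] a, int[] b) {
    int i = 0, j = 0, intersection = 0;
    while (i < a.length && j < b.length) {
      if (a[i] == b[j]) { intersection++; i++; j++; }
      else if (a[i] < b[j]) { i++; }
      else { j++; }
    }
    int union = a.length + b.length - intersection;
    return union == 0 ? 0.0 : (double) intersection / union;
  }

  // A common way to turn the similarity into a distance for clustering.
  public static double distance(int[] a, int[] b) {
    return 1.0 - coefficient(a, b);
  }
}
{noformat}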

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-19:
------------------------------

    Attachment: MAHOUT-19.txt

v2

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-19) Hierarchial clusterer

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-19:
------------------------------

    Attachment: TestBottomFeed.test.png
                TestTopFeed.test.png
                MAHOUT-19.txt

Implementation with non-distributed top feed and bottom feed. The latter is used by the map-reduced tree feed when the tree is too young for it to make sense to distribute. Top feed is mostly a proof of concept to show the difference.

Patch also contains MAHOUT-20 and MAHOUT-36.

Attached is also the graphviz output from the generated test data I've placed in test/resources.

There are a couple of changes and additions to the core that should be extracted to own patches:

 * build.xml creates a job file like Nutch does.
 * Mht - reads ARFF-ish text files into a Matrix and maps ordinal values to doubles.
 * Vector#fill(double)
 * Vector#fill(double, int, int)
 * VectorStringUtils - converts a vector to a string and vice versa
 * DoubleWritable - see HADOOP-3243
 * SparseVectorWritable - accepts any vector but resolves as SparseVector


I simulate the Hadoop framework to test DistributedBottomFeed, as all the driver code is inside the same class and will not run for obvious reasons. I'll refactor and get it running in Hadoop for real soon.

The tree is an abstract relational storage that needs a lot of refactoring.

From Package.html:

h3. Feeding a tree with instances

Every node in the tree is either a branch with two children, a leaf (an instance), or the root.

{noformat}
      root
      2.75
     /   \
  leaf1  branch1
   4.0     1.5
          /   \
       leaf2  leaf3
        1.0    2.0
{noformat}

A new instance is inserted into the tree as the child of a new branch placed next to the closest existing node.

Insert new instance 1.2:

{noformat}
      root
      2.775
     /   \
  leaf1  branch1
   4.0     1.55
          /   \
       branch2 leaf3
         1.1   2.0
        /   \
     leaf2 leaf4
      1.0   1.2
{noformat}

<p>
  Adding an instance requires sequentially recalculating the mean vector of all the nodes from the new leaf
  to the root. This is a <b>non-distributable operation</b>. Finding where to insert the new leaf is, however,
  distributable. Traditionally this is done by top or bottom feeding, i.e. finding the closest leaf node by
  navigating from the root towards the closest child until a leaf is reached, or by comparing against all
  leaves and navigating from there to the closest node.
</p>
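
<p>
  A minimal sketch of that sequential step, assuming a branch's mean is simply the mean of its two children's
  means (as in the figures above); the Node class and field names are made up for the illustration and are not
  the patch's actual types:
</p>

{noformat}
public final class MeanUpdateSketch {

  public static class Node {
    Node parent, left, right;
    double[] mean;                       // for a leaf, this is just the instance vector
  }

  // Walk from the new leaf's parent up to the root, recomputing each branch's mean.
  public static void updateMeansToRoot(Node newLeaf) {
    for (Node n = newLeaf.parent; n != null; n = n.parent) {
      n.mean = meanOf(n.left.mean, n.right.mean);
    }
  }

  private static double[] meanOf(double[] a, double[] b) {
    double[] m = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      m[i] = (a[i] + b[i]) / 2.0;
    }
    return m;
  }
}
{noformat}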

<p>
  The first distributed implementation assumes you have an unlimited number of Hadoop nodes and calculates the
  distance to all nodes: root, branches and leaves, in one large job in search of the closest one.
</p>

<p>
  A second implementation will spawn multiple jobs before it knows where to insert an instance:
  <ul>
    <li>find the closest leaf node</li>
    <li>find the closest node on the path between that leaf and the root</li>
  </ul>
</p>

<p>
  It does not make sense to distribute additions to a young tree! The first couple of hundred instances (or so,
  depending on vector size) could be bottom fed non-distributed. (By the way, what is a good antonym for
  distributed?)
</p>

h3. Clustering

<p>
  Clustering means navigating the tree from a given node using some set of rules. Usually this means looking at
  the distance between the children of a node or to the parent of a node. The default max-distance computations
  are based on the mean distance between leaves that are siblings of other leaves, or the mean distance to all
  leaf-node siblings. I'm certain there are much better solutions, e.g. based on the distance to the root worked
  in somehow together with the previously mentioned values. Or the value could be found using classes in the
  training data? What algorithm could that be?
</p>
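
<p>
  For example, gathering a cluster can be as simple as expanding every branch whose child distance is below the
  chosen maximum. The sketch below is hedged: the Node fields and the threshold handling are assumptions for the
  illustration, not the patch's actual API.
</p>

{noformat}
import java.util.ArrayList;
import java.util.List;

public final class ClusterExtractionSketch {

  public static class Node {
    Node left, right;
    double distance;                     // distance between the two children (branches only)

    boolean isLeaf() { return left == null && right == null; }
  }

  public static List<Node> gatherCluster(Node start, double maxDistance) {
    List<Node> cluster = new ArrayList<Node>();
    collect(start, maxDistance, cluster);
    return cluster;
  }

  private static void collect(Node node, double maxDistance, List<Node> cluster) {
    if (node.isLeaf()) {
      cluster.add(node);
    } else if (node.distance <= maxDistance) {
      // Both subtrees are close enough to stay in the same cluster.
      collect(node.left, maxDistance, cluster);
      collect(node.right, maxDistance, cluster);
    }
    // Branches whose children are too far apart are not expanded.
  }
}
{noformat}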

<p>
  If the rule for what to include in a cluster is set very conservatively, one could train a classifier for each
  cluster and allow neighbouring clusters to join with the classifying cluster. Perhaps one could even connect
  clusters that seem to contain the same class even if they are far away from each other. But that has to wait
  until we have classifiers.
</p>

h3. Persistency

<p>
  The tree is an abstract relational object storage for distances, vectors, clusters, etc., accessed via the
  primary keys of the nodes. Only a HashMap implementation is available.
</p>

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchial clusterer the instances are the leaf nodes in a tree where branch nodes contains the mean features of and the distance between its children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gather nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from start. Also, it usually makes sense to allow for the user to modify the cluster boundary variables in real time using a slider or perhaps present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not to bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for instant "more like this"-feature in search engines and use Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded all in memory proof of concept of a hierarchial clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.