Posted to dev@mahout.apache.org by "Goel, Ankur" <An...@corp.aol.com> on 2008/05/20 11:20:57 UTC

RE: Taste on Mahout

 
Hey Sean,
       I actually plan to use slope-one to start with since
- It's simple and known to work well.
- Can be parallelized nicely into the Map-Reduce style.
I also plan to use Tanimoto coefficient for item-item diffs. 
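
For binary click data the Tanimoto coefficient is just the ratio of users
who clicked both items to users who clicked either, so something like this
minimal sketch is what I have in mind (the user-ID sets and names here are
placeholders, not the Taste API):

  import java.util.Set;

  // Sketch: Tanimoto similarity between two items, each represented by the
  // set of user IDs that clicked it.
  static double tanimoto(Set<Long> usersOfItemA, Set<Long> usersOfItemB) {
    int intersection = 0;
    for (Long user : usersOfItemA) {
      if (usersOfItemB.contains(user)) {
        intersection++;
      }
    }
    int union = usersOfItemA.size() + usersOfItemB.size() - intersection;
    return union == 0 ? 0.0 : (double) intersection / union;
  }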

Do we have something on slope-one already in Taste as a part of Mahout ?

At the moment I am going through the available documentation on Taste
and code that's present in Mahout.

Your suggestions would be greatly appreciated.

Thanks
-Ankur

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Tuesday, April 29, 2008 11:09 PM
To: mahout-dev@lucene.apache.org; Goel, Ankur
Subject: Re: Taste on Mahout

I have some Hadoop code mostly ready to go for Taste.

The first thing to do is let you generate recommendations for all your
users via Hadoop. Unfortunately none of the recommenders truly
parallelize in the way that MapReduce needs it to -- you need all data
to compute any recommendation really -- but you can at least get
parallelization out of this. You can use the framework to run n
recommenders, each computing 1/nth of all recommendations.

The next application is specific to slope-one. Computing the item-item
diffs is exactly the kind of thing that MapReduce is good for, so,
writing a Hadoop job to do this seems like a no-brainer.
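
Roughly the shape I have in mind, using the old org.apache.hadoop.mapred API
-- just a sketch, and it assumes input lines already grouped per user as
"userID<TAB>item:pref,item:pref,...", which is not a format Taste produces
today (each class would live in its own source file):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // Sketch only: emit one (itemA,itemB) -> (prefA - prefB) diff per co-rated
  // item pair found in a single user's line of preferences.
  public class SlopeOneDiffsMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, DoubleWritable> out, Reporter reporter)
        throws IOException {
      String[] prefs = value.toString().split("\t")[1].split(",");
      for (int i = 0; i < prefs.length; i++) {
        String[] a = prefs[i].split(":");
        for (int j = i + 1; j < prefs.length; j++) {
          String[] b = prefs[j].split(":");
          double diff = Double.parseDouble(a[1]) - Double.parseDouble(b[1]);
          out.collect(new Text(a[0] + ',' + b[0]), new DoubleWritable(diff));
        }
      }
    }
  }

  // Sketch only: average all diffs observed for an item pair.
  public class SlopeOneDiffsReducer extends MapReduceBase
      implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text itemPair, Iterator<DoubleWritable> diffs,
                       OutputCollector<Text, DoubleWritable> out, Reporter reporter)
        throws IOException {
      double sum = 0.0;
      int count = 0;
      while (diffs.hasNext()) {
        sum += diffs.next().get();
        count++;
      }
      out.collect(itemPair, new DoubleWritable(sum / count));
    }
  }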

On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur <An...@corp.aol.com>
wrote:
> Hi Folks,
>       What's the status of hadoopifying Taste on Mahout ?
>  What's been done and what is in progress/pending ?
>
>  I am looking at using a scalable version of Taste for my project.
>  So basically trying to figure out what's already done and where  I 
> can pitch in.
>
>  Thanks
>  -Ankur
>

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
Yup, actually that will be the most natural way to incorporate it into
my code, which is set up for the more general case of a spectrum of
user ratings, and a range of item-item similarities, so by default it
will do what you suggest here. Good good.

... and the rest, I admit, is outside the scope of my knowledge. The
whole "model-based" recommender thing is conspicuously missing from
the package.

I'm encouraged by this book "Collective Intelligence" which explained
much of this to me at a basic level, so I can probably finally cook up
a naive Bayesian classifier-based recommender at some point.

... and then understand all the better ideas you are suggesting.

On Sat, May 31, 2008 at 2:32 PM, Ted Dunning <te...@gmail.com> wrote:
> Yes.  That is a good basic recommendation system.  Another approach is to
> use the co-occurrence matrix to find items that have anomalous co-occurrence
> and then build a weighted model based on overall frequency.  This allows you
> to weight the recommendations differently than you would with the raw
> co-occurrence score.  If you have the right audience and interface then you
> will still do quite well even with some moderately poor ordering of the
> recommendations because your viewers will dig pretty far down into the
> list.  Some other interfaces are not so forgiving (think radio).
>
>
> Variants on that include finding a latent variable representation of movies
> and people that explains which movies people have seen.  The movies you have
> seen will define a latent variable representation for you and that should
> allow you to determine which movies you should have seen.  This general
> approach subsumes LSI, pLSI, LDA, MDCA and non-negative matrix factorization
> for different definitions of latent variable structure.  It would be nice to
> be able to distinguish movies that you have not seen because you never heard
> of them from movies that you declined to see, but in many domains where
> marketing is not such a strong effect, you can presume that all things that
> you have not consumed are things you know nothing about.  For movies, this
> is a weak approximation, for music it is slightly better, for user generated
> content it is very accurate.

Re: Taste on Mahout

Posted by Ted Dunning <te...@gmail.com>.
The comparison matrix is actually sparse because you can eliminate all cases
with zero cooccurrence.  That means that (except for super users), the
comparison process scales roughly with the log file size.  For super active
users, Sean's suggestion for sampling is a good one.
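
Concretely, the counting loop only ever creates entries for pairs that
actually appear together in some user's history (a plain-Java sketch;
clicksByUser and the rest are made-up names):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Sketch: sparse co-occurrence counts.  Pairs with zero co-occurrence are
  // never materialized, so the table grows with the sum over users of
  // (items per user)^2, not with (number of items)^2.
  static Map<String, Integer> countCooccurrences(Map<Long, List<Long>> clicksByUser) {
    Map<String, Integer> cooccurrence = new HashMap<String, Integer>();
    for (List<Long> items : clicksByUser.values()) {
      for (int i = 0; i < items.size(); i++) {
        for (int j = i + 1; j < items.size(); j++) {
          String pair = items.get(i) + "," + items.get(j);
          Integer count = cooccurrence.get(pair);
          cooccurrence.put(pair, count == null ? 1 : count + 1);
        }
      }
    }
    return cooccurrence;
  }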

On Wed, Jun 4, 2008 at 4:41 AM, Goel, Ankur <An...@corp.aol.com> wrote:

> After making some sense of what Ted is suggesting and looking at the
> sample code from Sean, I finally understood what we are trying to do with
> this.
> The concern here is that this technique could hit scalability issues
> with a large number of users and items even if the computation is
> distributed over a reasonably large cluster (50 nodes), since every item
> needs to be compared against every other item. Say, for example, in my case
> there would be 50 - 100 million items (maybe more) and an
> equivalent number of users.
>
> Thoughts ?
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Monday, June 02, 2008 4:50 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Taste on Mahout
>
> OK Ted I implemented this and it works, in the sense that I get the same
> numbers as in your paper. But I have run into some interesting corner
> cases. Here's my code, which will help illustrate:
>
>  public final double itemCorrelation(Item item1, Item item2) throws
> TasteException {
>    if (item1 == null || item2 == null) {
>      throw new IllegalArgumentException("item1 or item2 is null");
>    }
>    int preferring1and2 =
> dataModel.getNumUsersWithPreferenceFor(item1.getID(), item2.getID());
>    if (preferring1and2 == 0) {
>      // I suppose we can say the similarity is 0 if nobody prefers both
> at the same time?
>      return 0.0;
>    }
>    int preferring1 =
> dataModel.getNumUsersWithPreferenceFor(item1.getID());
>    if (preferring1 == preferring1and2) {
>      // everyone who prefers 1 also prefers 2 -- perfect similarity
> then?
>      return 1.0;
>    }
>    int preferring2 =
> dataModel.getNumUsersWithPreferenceFor(item2.getID());
>    int numUsers = dataModel.getNumUsers();
>    double logLikelihood =
>      twoLogLambda(preferring1and2, preferring1 - preferring1and2,
> preferring2, numUsers - preferring2);
>    return 1.0 - 1.0 / (1.0 + logLikelihood);
>  }
>
>  private static double twoLogLambda(double k1, double k2, double n1,
> double n2) {
>    double p1 = k1 / n1;
>    double p2 = k2 / n2;
>    double p = (k1 + k2) / (n1 + n2);
>    return 2.0 * (logL(p1, k1, n1) + logL(p2, k2, n2) - logL(p, k1,
> n1) - logL(p, k2, n2));
>  }
>
>  private static double logL(double p, double k, double n) {
>    return k * Math.log(p) + (n - k) * Math.log(1.0 - p);
>  }
>
>
> So we're going to get "NaN" whenever a "p" value is 0.0 or 1.0 in
> logL(). This comes up in some cases with clear interpretations:
>
> "preferring1and2 == 0"
> If nobody likes both items at the same time, I'd say their similarity is
> very low, or 0.0.
>
> "prefererring1and2 == preferring1" or
> "preferring1and2 == preferring2"
> This means everyone who prefers one items also prefers the other. In
> this case I'd say the two item similarities is quite high, or 1.0.
>
> "preferring2 == 0"
> I guess we can't say anything about the similarity of items 1 and 2 if
> we know of nobody expressing a pref for #2. Or we could say it is 0?
> that or NaN. I am not clear on which is the more natural interpretation.
>
> "numUsers == preferring2"
> Everyone likes item 2. Again not clear what to do here except say we
> can't figure out a similarity? NaN?
>
> "preferring1 - preferring1and2 == numUsers - preferring2"
> Honestly I cannot figure out an interpretation for this case! NaN?
>
> Thoughts would be much appreciated.
>
> On Sat, May 31, 2008 at 2:32 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Yes.  That is a good basic recommendation system.  Another approach is
>
> > to use the co-occurrence matrix to find items that have anomalous
> > co-occurrence and then build a weighted model based on overall
> > frequency.  This allows you to weight the recommendations differently
> > than you would with the raw co-occurrence score.  If you have the
> > right audience and interface then you will still do quite well even
> > with some moderately poor ordering of the recommendations because your
>
> > viewers will dig pretty far down into the list.  Some other interfaces
> are not so forgiving (think radio).
>



-- 
ted

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
Right, yes. One idea is to use a user-based recommender, which
compares users to users. If you don't have so many users, maybe that
is scalable enough.

There is also support in the code (and I can add more) for sampling:
rather than really look at every pair of items, look at some fraction
of them to get some good-enough results.

Another thought is that you don't necessarily need to compute all
item-item similarities; the code is only going to compute what is
needed by the algorithm, and caches the results. So not precomputing all that
might save a lot of time if the same pairs of items tend to co-occur
in user prefs a lot. The more memory you can give the recommender, the
more it will cache.

There is a clustering-based recommender in here, but it clusters
users, and is at the moment pretty slow. Its notion of cluster
similarity is based on distances between all elements in both
clusters, which grows as the square of the size of the clusters. A
faster cluster-similarity approach (k means?) would probably make it
scale up much better. I haven't thought it through, maybe that could
provide a faster solution -- but it does not exist yet.

The number of items you have is pretty daunting and indeed I think
you'll have to distribute the computation. Right now I have committed
a Hadoop job that just runs a recommender n ways. So, you can take any
of the above ideas, bake them into a Recommender, and parallelize it
in that basic way.
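
In other words, each of the n workers does no more than something like this
(a toy sketch of the partitioning idea -- every name here is a placeholder,
not the committed job's API):

  // Sketch: worker number `myShard` of `numShards` computes recommendations
  // only for the users whose ID hashes into its shard.
  for (long userID : allUserIDs) {
    if ((int) (Math.abs(userID) % numShards) == myShard) {
      writeRecommendations(userID, computeRecommendationsFor(userID, 10));
    }
  }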

On Wed, Jun 4, 2008 at 7:41 AM, Goel, Ankur <An...@corp.aol.com> wrote:
> After making some sense of what Ted is suggesting and looking at the
> sample code from Sean, I finally understood what we are trying to do with
> this.
> The concern here is that this technique could hit scalability issues
> with a large number of users and items even if the computation is
> distributed over a reasonably large cluster (50 nodes), since every item
> needs to be compared against every other item. Say, for example, in my case
> there would be 50 - 100 million items (maybe more) and an
> equivalent number of users.
>
> Thoughts ?
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Monday, June 02, 2008 4:50 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Taste on Mahout
>
> OK Ted I implemented this and it works, in the sense that I get the same
> numbers as in your paper. But I have run into some interesting corner
> cases. Here's my code, which will help illustrate:
>
>  public final double itemCorrelation(Item item1, Item item2) throws
> TasteException {
>    if (item1 == null || item2 == null) {
>      throw new IllegalArgumentException("item1 or item2 is null");
>    }
>    int preferring1and2 =
> dataModel.getNumUsersWithPreferenceFor(item1.getID(), item2.getID());
>    if (preferring1and2 == 0) {
>      // I suppose we can say the similarity is 0 if nobody prefers both
> at the same time?
>      return 0.0;
>    }
>    int preferring1 =
> dataModel.getNumUsersWithPreferenceFor(item1.getID());
>    if (preferring1 == preferring1and2) {
>      // everyone who prefers 1 also prefers 2 -- perfect similarity
> then?
>      return 1.0;
>    }
>    int preferring2 =
> dataModel.getNumUsersWithPreferenceFor(item2.getID());
>    int numUsers = dataModel.getNumUsers();
>    double logLikelihood =
>      twoLogLambda(preferring1and2, preferring1 - preferring1and2,
> preferring2, numUsers - preferring2);
>    return 1.0 - 1.0 / (1.0 + logLikelihood);
>  }
>
>  private static double twoLogLambda(double k1, double k2, double n1,
> double n2) {
>    double p1 = k1 / n1;
>    double p2 = k2 / n2;
>    double p = (k1 + k2) / (n1 + n2);
>    return 2.0 * (logL(p1, k1, n1) + logL(p2, k2, n2) - logL(p, k1,
> n1) - logL(p, k2, n2));
>  }
>
>  private static double logL(double p, double k, double n) {
>    return k * Math.log(p) + (n - k) * Math.log(1.0 - p);
>  }
>
>
> So we're going to get "NaN" whenever a "p" value is 0.0 or 1.0 in
> logL(). This comes up in some cases with clear interpretations:
>
> "preferring1and2 == 0"
> If nobody likes both items at the same time, I'd say their similarity is
> very low, or 0.0.
>
> "prefererring1and2 == preferring1" or
> "preferring1and2 == preferring2"
> This means everyone who prefers one items also prefers the other. In
> this case I'd say the two item similarities is quite high, or 1.0.
>
> "preferring2 == 0"
> I guess we can't say anything about the similarity of items 1 and 2 if
> we know of nobody expressing a pref for #2. Or we could say it is 0?
> that or NaN. I am not clear on which is the more natural interpretation.
>
> "numUsers == preferring2"
> Everyone likes item 2. Again not clear what to do here except say we
> can't figure out a similarity? NaN?
>
> "preferring1 - preferring1and2 == numUsers - preferring2"
> Honestly I cannot figure out an interpretation for this case! NaN?
>
> Thoughts would be much appreciated.
>
> On Sat, May 31, 2008 at 2:32 PM, Ted Dunning <te...@gmail.com>
> wrote:
>> Yes.  That is a good basic recommendation system.  Another approach is
>
>> to use the co-occurrence matrix to find items that have anomalous
>> co-occurrence and then build a weighted model based on overall
>> frequency.  This allows you to weight the recommendations differently
>> than you would with the raw co-occurrence score.  If you have the
>> right audience and interface then you will still do quite well even
>> with some moderately poor ordering of the recommendations because your
>
>> viewers will dig pretty far down into the list.  Some other interfaces
> are not so forgiving (think radio).
>

RE: Taste on Mahout

Posted by "Goel, Ankur" <An...@corp.aol.com>.
After making some sense of what Ted is suggesting and looking at the
sample code from Sean, I finally understood what we are trying to do with
this.
The concern here is that this technique could hit scalability issues
with a large number of users and items even if the computation is
distributed over a reasonably large cluster (50 nodes), since every item
needs to be compared against every other item. Say, for example, in my case
there would be 50 - 100 million items (maybe more) and an
equivalent number of users. 

Thoughts ?


-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Monday, June 02, 2008 4:50 AM
To: mahout-dev@lucene.apache.org
Subject: Re: Taste on Mahout

OK Ted I implemented this and it works, in the sense that I get the same
numbers as in your paper. But I have run into some interesting corner
cases. Here's my code, which will help illustrate:

  public final double itemCorrelation(Item item1, Item item2) throws TasteException {
    if (item1 == null || item2 == null) {
      throw new IllegalArgumentException("item1 or item2 is null");
    }
    int preferring1and2 = dataModel.getNumUsersWithPreferenceFor(item1.getID(), item2.getID());
    if (preferring1and2 == 0) {
      // I suppose we can say the similarity is 0 if nobody prefers both at the same time?
      return 0.0;
    }
    int preferring1 = dataModel.getNumUsersWithPreferenceFor(item1.getID());
    if (preferring1 == preferring1and2) {
      // everyone who prefers 1 also prefers 2 -- perfect similarity then?
      return 1.0;
    }
    int preferring2 = dataModel.getNumUsersWithPreferenceFor(item2.getID());
    int numUsers = dataModel.getNumUsers();
    double logLikelihood =
        twoLogLambda(preferring1and2, preferring1 - preferring1and2,
                     preferring2, numUsers - preferring2);
    return 1.0 - 1.0 / (1.0 + logLikelihood);
  }

  private static double twoLogLambda(double k1, double k2, double n1, double n2) {
    double p1 = k1 / n1;
    double p2 = k2 / n2;
    double p = (k1 + k2) / (n1 + n2);
    return 2.0 * (logL(p1, k1, n1) + logL(p2, k2, n2)
                  - logL(p, k1, n1) - logL(p, k2, n2));
  }

  private static double logL(double p, double k, double n) {
    return k * Math.log(p) + (n - k) * Math.log(1.0 - p);
  }


So we're going to get "NaN" whenever a "p" value is 0.0 or 1.0 in
logL(). This comes up in some cases with clear interpretations:

"preferring1and2 == 0"
If nobody likes both items at the same time, I'd say their similarity is
very low, or 0.0.

"prefererring1and2 == preferring1" or
"preferring1and2 == preferring2"
This means everyone who prefers one items also prefers the other. In
this case I'd say the two item similarities is quite high, or 1.0.

"preferring2 == 0"
I guess we can't say anything about the similarity of items 1 and 2 if
we know of nobody expressing a pref for #2. Or we could say it is 0?
that or NaN. I am not clear on which is the more natural interpretation.

"numUsers == preferring2"
Everyone likes item 2. Again not clear what to do here except say we
can't figure out a similarity? NaN?

"preferring1 - preferring1and2 == numUsers - preferring2"
Honestly I cannot figure out an interpretation for this case! NaN?

Thoughts would be much appreciated.

On Sat, May 31, 2008 at 2:32 PM, Ted Dunning <te...@gmail.com>
wrote:
> Yes.  That is a good basic recommendation system.  Another approach is

> to use the co-occurrence matrix to find items that have anomalous 
> co-occurrence and then build a weighted model based on overall 
> frequency.  This allows you to weight the recommendations differently 
> than you would with the raw co-occurrence score.  If you have the 
> right audience and interface then you will still do quite well even 
> with some moderately poor ordering of the recommendations because your

> viewers will dig pretty far down into the list.  Some other interfaces
are not so forgiving (think radio).

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
OK I just took your word for it and implemented safeLog() and that
solves it. Now I see what you mean: in the computation k*log(p), with
p=k/n, it's like k*log(k) - k*log(n), and that does indeed approach 0 as
k approaches 0.

On Mon, Jun 2, 2008 at 9:44 PM, Ted Dunning <te...@gmail.com> wrote:
>
> The matrix k contains counts in the form of a contingency table.  More about
> that in a moment.
>
> The rowSums and colSums functions compute the row and column sums of the
> matrix k.  The result is a vector.
>
> When R applies log to a matrix, it does element-wise evaluation of the
> function to each element of the matrix, likewise with * (matrix
> multiplication is %*%).  The sum function adds up all elements.
>
> The attached image has it in mathematical notation.
>
> I have attached a Java implementation as well (which you are free to use,
> but which needs to have its dependency on Jama removed).

Re: Taste on Mahout

Posted by Ted Dunning <te...@gmail.com>.
The matrix k contains counts in the form of a contingency table.  More about
that in a moment.

The rowSums and colSums functions compute the row and column sums of the
matrix k.  The result is a vector.

When R applies log to a matrix, it does element-wise evaluation of the
function to each element of the matrix, likewise with * (matrix
multiplication is %*%).  The sum function adds up all elements.

The attached image has it in mathematical notation.

I have attached a Java implementation as well (which you are free to use,
but which needs to have its dependency on Jama removed).

On Mon, Jun 2, 2008 at 6:08 PM, Sean Owen <sr...@gmail.com> wrote:

> Sorry I am still not clear on the algorithm. What is rowSums,
> columnSums, and what is k? k seem to be used as both a scalar, and
> vector (argument to a function sum()?)
>
> Maybe it is just that I don't know R. Can you give me a hint or
> forward any Java implementations? I can figure it out from there.
> Don't want to be a bother but I do want to make sure I understand what
> you're saying.
>
> On Mon, Jun 2, 2008 at 8:56 PM, Ted Dunning <te...@gmail.com> wrote:
> > Hmmm... may not have sent that to you.
> >
> >   function(k) {
> >     return (2 * (h(k) - h(rowSums(k)) - h(colSums(k))));
> >   }
> >
> >   function(k) {
> >     p = k / sum(k) + (k == 0);
> >     return (sum(k * log(p)));
> >   }
> >
> > This is the pithiest implementation that I know of.
> >
> > On Mon, Jun 2, 2008 at 5:50 PM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> PS what's the R code? sorry if I am missing something basic.
> >>
> >> On Sun, Jun 1, 2008 at 8:32 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >> > The R code is instructive here.  Everywhere you are about to take the
> log
> >> of
> >> > a zero probability, you should add 1.  This is equivalent to saying
> that
> >> >
> >> >   lim_{x -> 0} x log x = 0
> >>
> >> OK true but...
> >>
> >> > One easy way to do this is to wrap your log function:
> >> >
> >> >    safeLog = {x -> return (x==0)?0:Math.log(x)}
> >>
> >> ... log(x) approaches -infinity as x approaches 0 so why is 0 a good
> >> substitute for log(0)?
> >>
> >> But I would buy that, for purposes of this function, "log(p)" can be
> >> replaced with "log(p+1)" so at least the output value is >= 0 and less
> >> than 1?
> >>
> >
> >
> >
> > --
> > ted
> >
>



-- 
ted

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
Sorry I am still not clear on the algorithm. What is rowSums,
columnSums, and what is k? k seem to be used as both a scalar, and
vector (argument to a function sum()?)

Maybe it is just that I don't know R. Can you give me a hint or
forward any Java implementations? I can figure it out from there.
Don't want to be a bother but I do want to make sure I understand what
you're saying.

On Mon, Jun 2, 2008 at 8:56 PM, Ted Dunning <te...@gmail.com> wrote:
> Hmmm... may not have sent that to you.
>
>   function(k) {
>     return (2 * (h(k) - h(rowSums(k)) - h(colSums(k))));
>   }
>
>   function(k) {
>     p = k / sum(k) + (k == 0);
>     return (sum(k * log(p)));
>   }
>
> This is the pithiest implementation that I know of.
>
> On Mon, Jun 2, 2008 at 5:50 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> PS what's the R code? sorry if I am missing something basic.
>>
>> On Sun, Jun 1, 2008 at 8:32 PM, Ted Dunning <te...@gmail.com> wrote:
>> > The R code is instructive here.  Everywhere you are about to take the log
>> of
>> > a zero probability, you should add 1.  This is equivalent to saying that
>> >
>> >   lim_{x -> 0} x log x = 0
>>
>> OK true but...
>>
>> > One easy way to do this is to wrap your log function:
>> >
>> >    safeLog = {x -> return (x==0)?0:Math.log(x)}
>>
>> ... log(x) approaches -infinity as x approaches 0 so why is 0 a good
>> substitute for log(0)?
>>
>> But I would buy that, for purposes of this function, "log(p)" can be
>> replaced with "log(p+1)" so at least the output value is >= 0 and less
>> than 1?
>>
>
>
>
> --
> ted
>

Re: Taste on Mahout

Posted by Ted Dunning <te...@gmail.com>.
Hmmm... may not have sent that to you.

   function(k) {
     return (2 * (h(k) - h(rowSums(k)) - h(colSums(k))));
   }

   function(k) {
     p = k / sum(k) + (k == 0);
     return (sum(k * log(p)));
   }

This is the pithiest implementation that I know of.
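
If a Java rendering helps, here is my quick translation of the same idea for
the 2x2 case (a sketch, not the attached implementation):

  // Sketch: -2 log lambda for a 2x2 contingency table
  //     k11  k12
  //     k21  k22
  // using 2 * (rowEntropy + columnEntropy - matrixEntropy), where "entropy"
  // below is the unnormalized N*H form.
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }

  static double entropy(long... counts) {
    long sum = 0;
    double xLogXSum = 0.0;
    for (long count : counts) {
      sum += count;
      xLogXSum += xLogX(count);
    }
    return xLogX(sum) - xLogXSum;
  }

  // xLogX(0) == 0 plays the same role as the (k == 0) guard in the R code.
  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }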

On Mon, Jun 2, 2008 at 5:50 PM, Sean Owen <sr...@gmail.com> wrote:

> PS what's the R code? sorry if I am missing something basic.
>
> On Sun, Jun 1, 2008 at 8:32 PM, Ted Dunning <te...@gmail.com> wrote:
> > The R code is instructive here.  Everywhere you are about to take the log
> of
> > a zero probability, you should add 1.  This is equivalent to saying that
> >
> >   lim_{x -> 0} x log x = 0
>
> OK true but...
>
> > One easy way to do this is to wrap your log function:
> >
> >    safeLog = {x -> return (x==0)?0:Math.log(x)}
>
> ... log(x) approaches -infinity as x approaches 0 so why is 0 a good
> substitute for log(0)?
>
> But I would buy that, for purposes of this function, "log(p)" can be
> replaced with "log(p+1)" so at least the output value is >= 0 and less
> than 1?
>



-- 
ted

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
PS what's the R code? sorry if I am missing something basic.

On Sun, Jun 1, 2008 at 8:32 PM, Ted Dunning <te...@gmail.com> wrote:
> The R code is instructive here.  Everywhere you are about to take the log of
> a zero probability, you should add 1.  This is equivalent to saying that
>
>   lim_{x -> 0} x log x = 0

OK true but...

> One easy way to do this is to wrap your log function:
>
>    safeLog = {x -> return (x==0)?0:Math.log(x)}

... log(x) approaches -infinity as x approaches 0 so why is 0 a good
substitute for log(0)?

But I would buy that, for purposes of this function, "log(p)" can be
replaced with "log(p+1)" so at least the output value is >= 0 and less
than 1?

Re: Taste on Mahout

Posted by Ted Dunning <te...@gmail.com>.
The R code is instructive here.  Everywhere you are about to take the log of
a zero probability, you should add 1.  This is equivalent to saying that

   lim_{x -> 0} x log x = 0

One easy way to do this is to wrap your log function:

    safeLog = {x -> return (x==0)?0:Math.log(x)}
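
In Java that wrapper is just (sketch):

    static double safeLog(double x) {
      // log(x), except 0 when x is 0; fine here because each term has the
      // form count * log(p), and the count is also 0 whenever p is 0.
      return x == 0.0 ? 0.0 : Math.log(x);
    }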

On Sun, Jun 1, 2008 at 4:20 PM, Sean Owen <sr...@gmail.com> wrote:

> OK Ted I implemented this and it works, in the sense that I get the
> same numbers as in your paper. But I have run into some interesting
> corner cases. Here's my code, which will help illustrate:
>
>  public final double itemCorrelation(Item item1, Item item2) throws
> TasteException {
>    if (item1 == null || item2 == null) {
>      throw new IllegalArgumentException("item1 or item2 is null");
>    }
>    int preferring1and2 =
> dataModel.getNumUsersWithPreferenceFor(item1.getID(), item2.getID());
>    if (preferring1and2 == 0) {
>      // I suppose we can say the similarity is 0 if nobody prefers
> both at the same time?
>      return 0.0;
>    }
>    int preferring1 = dataModel.getNumUsersWithPreferenceFor(item1.getID());
>    if (preferring1 == preferring1and2) {
>      // everyone who prefers 1 also prefers 2 -- perfect similarity then?
>      return 1.0;
>    }
>    int preferring2 = dataModel.getNumUsersWithPreferenceFor(item2.getID());
>    int numUsers = dataModel.getNumUsers();
>    double logLikelihood =
>      twoLogLambda(preferring1and2, preferring1 - preferring1and2,
> preferring2, numUsers - preferring2);
>    return 1.0 - 1.0 / (1.0 + logLikelihood);
>  }
>
>  private static double twoLogLambda(double k1, double k2, double n1,
> double n2) {
>    double p1 = k1 / n1;
>    double p2 = k2 / n2;
>    double p = (k1 + k2) / (n1 + n2);
>    return 2.0 * (logL(p1, k1, n1) + logL(p2, k2, n2) - logL(p, k1,
> n1) - logL(p, k2, n2));
>  }
>
>  private static double logL(double p, double k, double n) {
>    return k * Math.log(p) + (n - k) * Math.log(1.0 - p);
>  }
>
>
> So we're going to get "NaN" whenever a "p" value is 0.0 or 1.0 in
> logL(). This comes up in some cases with clear interpretations:
>
> "preferring1and2 == 0"
> If nobody likes both items at the same time, I'd say their similarity
> is very low, or 0.0.
>
> "prefererring1and2 == preferring1" or
> "preferring1and2 == preferring2"
> This means everyone who prefers one items also prefers the other. In
> this case I'd say the two item similarities is quite high, or 1.0.
>
> "preferring2 == 0"
> I guess we can't say anything about the similarity of items 1 and 2 if
> we know of nobody expressing a pref for #2. Or we could say it is 0?
> that or NaN. I am not clear on which is the more natural
> interpretation.
>
> "numUsers == preferring2"
> Everyone likes item 2. Again not clear what to do here except say we
> can't figure out a similarity? NaN?
>
> "preferring1 - preferring1and2 == numUsers - preferring2"
> Honestly I cannot figure out an interpretation for this case! NaN?
>
> Thoughts would be much appreciated.
>
> On Sat, May 31, 2008 at 2:32 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Yes.  That is a good basic recommendation system.  Another approach is to
> > use the co-occurrence matrix to find items that have anomalous
> co-occurrence
> > and then build a weighted model based on overall frequency.  This allows
> you
> > to weight the recommendations differently than you would with the raw
> > co-occurrence score.  If you have the right audience and interface then
> you
> > will still do quite well even with some moderately poor ordering of the
> > recommendations because your viewers will dig pretty far down into the
> > list.  Some other interfaces are not so forgiving (think radio).
>



-- 
ted

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
OK Ted I implemented this and it works, in the sense that I get the
same numbers as in your paper. But I have run into some interesting
corner cases. Here's my code, which will help illustrate:

  public final double itemCorrelation(Item item1, Item item2) throws TasteException {
    if (item1 == null || item2 == null) {
      throw new IllegalArgumentException("item1 or item2 is null");
    }
    int preferring1and2 = dataModel.getNumUsersWithPreferenceFor(item1.getID(), item2.getID());
    if (preferring1and2 == 0) {
      // I suppose we can say the similarity is 0 if nobody prefers both at the same time?
      return 0.0;
    }
    int preferring1 = dataModel.getNumUsersWithPreferenceFor(item1.getID());
    if (preferring1 == preferring1and2) {
      // everyone who prefers 1 also prefers 2 -- perfect similarity then?
      return 1.0;
    }
    int preferring2 = dataModel.getNumUsersWithPreferenceFor(item2.getID());
    int numUsers = dataModel.getNumUsers();
    double logLikelihood =
        twoLogLambda(preferring1and2, preferring1 - preferring1and2,
                     preferring2, numUsers - preferring2);
    return 1.0 - 1.0 / (1.0 + logLikelihood);
  }

  private static double twoLogLambda(double k1, double k2, double n1, double n2) {
    double p1 = k1 / n1;
    double p2 = k2 / n2;
    double p = (k1 + k2) / (n1 + n2);
    return 2.0 * (logL(p1, k1, n1) + logL(p2, k2, n2)
                  - logL(p, k1, n1) - logL(p, k2, n2));
  }

  private static double logL(double p, double k, double n) {
    return k * Math.log(p) + (n - k) * Math.log(1.0 - p);
  }


So we're going to get "NaN" whenever a "p" value is 0.0 or 1.0 in
logL(). This comes up in some cases with clear interpretations:

"preferring1and2 == 0"
If nobody likes both items at the same time, I'd say their similarity
is very low, or 0.0.

"prefererring1and2 == preferring1" or
"preferring1and2 == preferring2"
This means everyone who prefers one items also prefers the other. In
this case I'd say the two item similarities is quite high, or 1.0.

"preferring2 == 0"
I guess we can't say anything about the similarity of items 1 and 2 if
we know of nobody expressing a pref for #2. Or we could say it is 0?
that or NaN. I am not clear on which is the more natural
interpretation.

"numUsers == preferring2"
Everyone likes item 2. Again not clear what to do here except say we
can't figure out a similarity? NaN?

"preferring1 - preferring1and2 == numUsers - preferring2"
Honestly I cannot figure out an interpretation for this case! NaN?

Thoughts would be much appreciated.

On Sat, May 31, 2008 at 2:32 PM, Ted Dunning <te...@gmail.com> wrote:
> Yes.  That is a good basic recommendation system.  Another approach is to
> use the co-occurrence matrix to find items that have anomalous co-occurrence
> and then build a weighted model based on overall frequency.  This allows you
> to weight the recommendations differently than you would with the raw
> co-occurrence score.  If you have the right audience and interface then you
> will still do quite well even with some moderately poor ordering of the
> recommendations because your viewers will dig pretty far down into the
> list.  Some other interfaces are not so forgiving (think radio).

Re: Taste on Mahout

Posted by Ted Dunning <te...@gmail.com>.
Yes.  That is a good basic recommendation system.  Another approach is to
use the co-occurrence matrix to find items that have anomalous co-occurrence
and then build a weighted model based on overall frequency.  This allows you
to weight the recommendations differently than you would with the raw
co-occurrence score.  If you have the right audience and interface then you
will still do quite well even with some moderately poor ordering of the
recommendations because your viewers will dig pretty far down into the
list.  Some other interfaces are not so forgiving (think radio).


Variants on that include finding a latent variable representation of movies
and people that explains which movies people have seen.  The movies you have
seen will define a latent variable representation for you and that should
allow you to determine which movies you should have seen.  This general
approach subsumes LSI, pLSI, LDA, MDCA and non-negative matrix factorization
for different definitions of latent variable structure.  It would be nice to
be able to distinguish movies that you have not seen because you never heard
of them from movies that you declined to see, but in many domains where
marketing is not such a strong effect, you can presume that all things that
you have not consumed are things you know nothing about.  For movies, this
is a weak approximation, for music it is slightly better, for user generated
content it is very accurate.


On Sat, May 31, 2008 at 8:59 AM, Sean Owen <sr...@gmail.com> wrote:

> Ted coming back to your message here as well -- let me see if I
> understand how to apply this.
>
> Imagine we know which users have seen which movies (no ratings, just
> the set of movies). For each movie I know how many users have seen
> both, or only one, or neither. I can use this to figure out pairs of
> movies that are seen together unusually often.
>
> Then for each user, find the movies among those they haven't seen
> which are most likely to occur together with the ones they have seen,
> and recommend them.
>
>

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
Ted coming back to your message here as well -- let me see if I
understand how to apply this.

Imagine we know which users have seen which movies (no ratings, just
the set of movies). For each movie I know how many users have seen
both, or only one, or neither. I can use this to figure out pairs of
movies that are seen together unusually often.

Then for each user, find the movies among those they haven't seen
which are most likely to occur together with the ones they have seen,
and recommend them.
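
So the scoring step would be roughly this (a sketch; "anomalousCooccurrence"
is whatever filtered item-item table falls out of the previous step, and all
the names are placeholders):

  import java.util.HashMap;
  import java.util.Map;
  import java.util.Set;

  // Sketch: score each unseen movie by summing its (filtered) co-occurrence
  // strength with the movies the user has already seen.
  static Map<Long, Double> scoreUnseen(Set<Long> moviesSeenByUser,
                                       Map<Long, Map<Long, Double>> anomalousCooccurrence) {
    Map<Long, Double> scores = new HashMap<Long, Double>();
    for (Long seen : moviesSeenByUser) {
      Map<Long, Double> related = anomalousCooccurrence.get(seen);
      if (related == null) {
        continue;
      }
      for (Map.Entry<Long, Double> e : related.entrySet()) {
        if (!moviesSeenByUser.contains(e.getKey())) {
          Double soFar = scores.get(e.getKey());
          scores.put(e.getKey(), (soFar == null ? 0.0 : soFar) + e.getValue());
        }
      }
    }
    return scores;  // sort by value descending and take the top N as recommendations
  }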

On Fri, May 30, 2008 at 9:13 PM, Ted Dunning <te...@gmail.com> wrote:
> I am a strong proponent (not surprisingly) of using log likelihood ratio
> (LLR) tests as a primary filter for finding good matches in cooccurrence
> analyses like this.   It is very simple and has proven effective for many
years in work I have done as well as in other people's work.  Correlation-based
> measures are very bad at extreme counts.
>
> See here for a bit more detail: http://citeseer.ist.psu.edu/29096.html . I
> should have a blog entry up which illustrates how simple LLR tests are  to
> implement (literally just a few lines of R and only twice or four times that
> many in Java).
>
> LLR can also be used very effectively for document routing using what are
> essentially variants on Naive Bayes classifiers where features are selected
> using LLR and weighted using corpus frequencies (usually something like log
> IDF).  Other classifiers where you have vast numbers of possible features
> should work well based on the same work.

Re: Taste on Mahout

Posted by Ted Dunning <te...@gmail.com>.
I am a strong proponent (not surprisingly) of using log likelihood ratio
(LLR) tests as a primary filter for finding good matches in cooccurrence
analyses like this.   It is very simple and has proven effective for many
years in work I have done as well as in other people's work.  Correlation-based
measures are very bad at extreme counts.

See here for a bit more detail: http://citeseer.ist.psu.edu/29096.html . I
should have a blog entry up which illustrates how simple LLR tests are  to
implement (literally just a few lines of R and only twice or four times that
many in Java).

LLR can also be used very effectively for document routing using what are
essentially variants on Naive Bayes classifiers where features are selected
using LLR and weighted using corpus frequencies (usually something like log
IDF).  Other classifiers where you have vast numbers of possible features
should work well based on the same work.



On Fri, May 30, 2008 at 4:07 PM, Sean Owen <sr...@gmail.com> wrote:

> ...
>
> Coming back to your message Ankur, yes I think the code already
> supports what you suggest, in part. His "item similarity table" is the
> "ItemCorrelation" you feed into GenericItemBasedRecommender. He
> rightly suggests this should be precomputed, which is where
> GenericItemCorrelation comes in -- it can just compute item-item
> correlations based on any algorithm you want like PearsonCorrelation
> and save them off. You could modify this to save only "strong"
> correlations as you say. I can surely dig in here with you and write
> code to do this if it's not clear.
>
>

-- 
ted

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
What's wrong with EuclideanDistance -- looks straightforward to me. Is
it that you are looking to handle a sparse vector more efficiently?

FWIW I have a similar situation since my vectors are very sparse
(users don't express a preference for nearly all items). I just
stipulate that preferences are handled in sorted order by item ID in
all cases. I just iterate through both vectors' actual entries at the
same time.

You can see the iteration in PearsonCorrelation, though that example
has a bunch of other junk in it that you wouldn't need.

On this note, might be good to think more about the implementation of
SparseVector. For example if it is desirable to iterate over entries
in order, a SortedMap like TreeMap would be better. Actually, right now
I am not sure how the values are coming back in any predictable
order...
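
For what it's worth, the "walk both in order" loop is just this (a sketch
assuming each sparse vector is a pair of parallel arrays, sorted item IDs
plus values -- not the current SparseVector API):

  // Sketch: accumulate the dot product while touching only entries that
  // actually exist in either vector.
  static double sparseDot(long[] ids1, double[] vals1, long[] ids2, double[] vals2) {
    int i = 0;
    int j = 0;
    double dot = 0.0;
    while (i < ids1.length && j < ids2.length) {
      if (ids1[i] == ids2[j]) {
        dot += vals1[i] * vals2[j];  // item present in both vectors
        i++;
        j++;
      } else if (ids1[i] < ids2[j]) {
        i++;                         // item only in the first vector
      } else {
        j++;                         // item only in the second vector
      }
    }
    return dot;
  }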

On Sat, May 31, 2008 at 11:14 AM, Karl Wettin <ka...@gmail.com> wrote:
>
> On 31 May 2008, at 01:07, Sean Owen wrote:
>>
>> metrics -- Euclidean distance and Tanimoto coefficient -- that were
>> interesting enough that I implemented them today. So I think I've
>> covered chapter 3!
>
>
> https://issues.apache.org/jira/browse/MAHOUT-42
>
> I'm not quite happy with this implementation, suggestions appreciated.
>
>
>         karl
>

Re: Taste on Mahout

Posted by Karl Wettin <ka...@gmail.com>.
On 31 May 2008, at 01:07, Sean Owen wrote:
> metrics -- Euclidean distance and Tanimoto coefficient -- that were
> interesting enough that I implemented them today. So I think I've
> covered chapter 3!


https://issues.apache.org/jira/browse/MAHOUT-42

I'm not quite happy with this implementation, suggestions appreciated.


          karl

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
PS I got a copy of "Collective Intelligence" and it is indeed a great overview.

I think my experience agrees with everything he says in Chapter 3 on
Collaborative Filtering. He mentions a few different similarity
metrics -- Euclidean distance and Tanimoto coefficient -- that were
interesting enough that I implemented them today. So I think I've
covered chapter 3!

Coming back to your message Ankur, yes I think the code already
supports what you suggest, in part. His "item similarity table" is the
"ItemCorrelation" you feed into GenericItemBasedRecommender. He
rightly suggests this should be precomputed, which is where
GenericItemCorrelation comes in -- it can just compute item-item
correlations based on any algorithm you want like PearsonCorrelation
and save them off. You could modify this to save only "strong"
correlations as you say. I can surely dig in here with you and write
code to do this if it's not clear.

Yes it is parallelizable in the way you suggest. After you get past
maybe 10,000 items it is going to get quite painful to calculate this
on one machine, indeed.

Another approach is to use some external a priori knowledge to define
item similarity rather than using user preferences. For example maybe
I just say that two movies in the same genre are pretty similar, and
ones that aren't, aren't. That takes no precomputation of any kind.


And finally back to your question about binary yes/no user preferences
-- the TanimotoCoefficientCorrelation I just added should be pretty
suitable for situations where you only care about whether a preference
exists or not.

I need to continue along this line and build optimized versions of
User that can more efficiently store this sort of thing rather than a
bunch of dummy Item and Preference objects that are superfluous in
this context. This could also greatly speed up item-item similarity
computations.
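
Roughly what I have in mind (a sketch with made-up names, not committed
code): store just a sorted array of item IDs per user and binary-search it,
instead of building dummy Item and Preference objects.

  import java.util.Arrays;

  // Sketch: a "boolean preference" user -- presence of an item ID means the
  // user clicked it; there is no rating value to store.
  public final class BooleanPrefUser {
    private final long[] itemIDs;  // kept sorted ascending

    public BooleanPrefUser(long[] itemIDs) {
      this.itemIDs = itemIDs.clone();
      Arrays.sort(this.itemIDs);
    }

    public boolean hasPreferenceFor(long itemID) {
      return Arrays.binarySearch(itemIDs, itemID) >= 0;
    }

    public int numPreferences() {
      return itemIDs.length;
    }
  }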





On Wed, May 21, 2008 at 7:36 AM, Goel, Ankur <An...@corp.aol.com> wrote:
> Hey Sean,
>          Thanks for the suggestions. In my case the data-set is only
> going to tell me if the user clicked on a particular item. So let's say
> there are 10,000 items; a user might only have clicked 20 - 30 items. I
> was thinking more along the lines of building an item similarity table by
> comparing each item with every other item and retaining only the top 100
> items, decayed by time.
>
> So a recommender for a user would use his recent browsing history to
> figure out top 10 or 20 most similar items.
>
> The approach is documented in Toby Segaran's "Collective Intelligence"
> book and looks simple to implement even though it is costly since every
> item needs to be compared with every other item. This can be
> parallelized in way that for M items in a cluster of N machines, each
> node has to compare M/N items to M items. Since the data-set is going to
> be sparse (no. of items having common users), I believe this wouldn't be
> overwhelming for the cluster.
>
> The other approach that I am thinking to reduce the computation cost is
> to use a clustering algorithm like K-Means that's available in Mahout to
> cluster similar user/items together and then use clustering information
> to make recommendations.
>
> Any suggestions?
>
> Thanks
> -Ankur
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Tuesday, May 20, 2008 9:37 PM
> To: mahout-dev@lucene.apache.org; Goel, Ankur
> Subject: Re: Taste on Mahout
>
> + Ankur directly, since I am not sure you are on the dev list.
>
> On Tue, May 20, 2008 at 12:06 PM, Sean Owen <sr...@gmail.com> wrote:
>> All of the algorithms assume a world where you have a continuous range
>
>> of ratings from users for items. Obviously a binary yes/no rating can
>> be mapped into that trivially -- 1 and -1 for example. This causes
>> some issues, most notably for correlation-based recommenders where the
>> correlation can be undefined between two items/users in special cases
>> that arise from this kind of input -- for example if we overlap in
>> rating 3 items and I voted "yes" for all 3, then no correlation can be
>
>> defined.
>>
>> Slope one doesn't run into this particular mathematical wrinkle.
>>
>> Also, methods like estimatePreference() are not going to give you
>> estimates that are always 1 or -1. Again, you could map this back onto
>> 1 / -1 by rounding or something, just something to note.
>>
>> So, in general it will be better if you can map whatever input you
>> have onto a larger range of input. You will feed more information in,
>> in this way, as well. For example, maybe you call a recent "yes"
>> rating a +2, and a recent "no" a -2, and others +1 and -1.
>>
>>
>> The part of slope one that parallelizes very well is the computing of
>> the item-item diffs. No I have not written this yet.
>>
>>
>> I have committed a first cut at a framework for computing
>> recommendations in parallel for any recommender. Dig in to
>> org.apache.mahout.cf.taste.impl.hadoop. In general, none of the
>> existing recommenders can be parallelized, because they generally need
>
>> access to all the data to produce any recommendation.
>>
>> But, we can take partial advantage of Hadoop by simply parallelizing
>> the computation of recommendations for many users across multiple
>> identical recommender instances. Better than nothing. In this
>> situation, one of the map or reduce phases is trivial.
>>
>> That is what I have committed so far and it works, locally. I am in
>> the middle of figuring out how to write it for real use on a remote
>> Hadoop cluster, and how I would go about testing that!
>>
>> Do we have any test bed available?
>>
>>
>>
>> On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur <An...@corp.aol.com>
> wrote:
>>> I just realized after going through the wikipedia that slope one is
>>> applicable when you have ratings for the items.
>>> In my case, I would be simply working with binary data (Item was
>>> clicked or not-clicked by user) using Tanimoto coefficient to
>>> calculate item similarity.
>>> The idea is to capture the simple intuition "What items have been
>>> visited most along with this item".
>>>
>>>
>>> -----Original Message-----
>>> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
>>> Sent: Tuesday, May 20, 2008 2:51 PM
>>> To: mahout-dev@lucene.apache.org
>>> Subject: RE: Taste on Mahout
>>>
>>>
>>> Hey Sean,
>>>       I actually plan to use slope-one to start with since
>>> - It's simple and known to work well.
>>> - Can be parallelized nicely into the Map-Reduce style.
>>> I also plan to use Tanimoto coefficient for item-item diffs.
>>>
>>> Do we have something on slope-one already in Taste as a part of
> Mahout ?
>>>
>>> At the moment I am going through the available documentation on Taste
>
>>> and code that's present in Mahout.
>>>
>>> Your suggestions would be greatly appreciated.
>>>
>>> Thanks
>>> -Ankur
>>>
>>> -----Original Message-----
>>> From: Sean Owen [mailto:srowen@gmail.com]
>>> Sent: Tuesday, April 29, 2008 11:09 PM
>>> To: mahout-dev@lucene.apache.org; Goel, Ankur
>>> Subject: Re: Taste on Mahout
>>>
>>> I have some Hadoop code mostly ready to go for Taste.
>>>
>>> The first thing to do is let you generate recommendations for all
>>> your users via Hadoop. Unfortunately none of the recommenders truly
>>> parallelize in the way that MapReduce needs it to -- you need all
>>> data to compute any recommendation really -- but you can at least get
>
>> parallelization out of this. You can use the framework to run n
>>> recommenders, each computing 1/nth of all recommendations.
>>>
>>> The next application is specific to slope-one. Computing the
>>> item-item diffs is exactly the kind of thing that MapReduce is good
>>> for, so, writing a Hadoop job to do this seems like a no-brainer.
>>>
>>> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur
>>> <An...@corp.aol.com>
>>> wrote:
>>>> Hi Folks,
>>>>       What's the status of hadoopifying Taste on Mahout ?
>>>>  What's been done and what is in progress/pending ?
>>>>
>>>>  I am looking at using a scalable version of Taste for my project.
>>>>  So basically trying to figure out what's already done and where  I
>>>> can pitch in.
>>>>
>>>>  Thanks
>>>>  -Ankur
>>>>
>>>
>>
>

Re: Taste on Mahout

Posted by Ted Dunning <te...@gmail.com>.
LLR stands for log likelihood ratio (= -2 log lambda).

The table that I gave in the previous email is a bit easier to get to from
the numbers you usually have in practice.

I have some Java code handy that will help implement this function.  It
would require a bit of API coercion to use the right kind of matrix, but
otherwise should work.   I should file a Jira and then post a patch.

The chi-squared value is a non-negative as you say.  Taking the square root
and adding a sign makes the 1 degree of freedom (aka 2x2 table aka
coocurrence) version into an asymptotically normally distributed value if
the events are independent.  That makes interpretation easier for lots of
people since you can talk about standard deviations.

The formula in section 8 is essentially the same as what I gave earlier, but
it is limited to the 2x2 case.

The confusion about notation is endemic with applications of this because it
can be applied for fairly different kinds of problems.

If you are doing feature selection, then usually you have two classes of
training examples.  Let's call these classes 1 and 2.  These classes might
be "relevant" and "non-relevant" for document classification or "appears
after A" and "appears after something not A" for testing which words coocur
with A.  We also have a feature of interest.  For document classification,
that feature might be "word w occurred at least once".  For cooccurrence
testing, that feature might be "word B appears".  In any case, we should be
able to get counts k_1 for the number of times the feature occurred in class
1 and n_1 for the number of examples in class 1.  Similarly, k_2 and n_2 are
the number of occurrences of the feature in class 2 and the number of
examples in class 2.  The table from my previous example would then be:

          k_1         n_1 - k_1
          k_2         n_2 - k_2

and the implementation would proceed as before.  It actually is useful to
have the case of arbitrary size and shape count matrices because we might
have classes 1, 2 and 3, or our feature might have more than 2 values.  An
example of 3 classes might be "judged relevant", "judged non-relevant" and
"unjudged" for the document classification problem.  An example of a trinary
feature might be "word occured > 1", "word occurred once" and "word didn't
occur".  The issue comes up often enough to make it important to implement
the general case.

Hope this helps.

On Wed, May 21, 2008 at 12:00 PM, Sean Owen <sr...@gmail.com> wrote:

> Thanks, I admit this is pushing the limits of my knowledge of stats.
> Maybe you can help me offline here. What's the LLR function, and how
> does this table of four values related to k1 and k2 in the formula in
> section 8? is that formula OK to implement, since I think I do get
> that one.
>
> The chi-squared stat is a positive value right? then yes to map into
> [-1,1] to make it correlation-like I'd need a simple transformation or
> two like what you describe.
>
> On Wed, May 21, 2008 at 2:49 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Yes.  If you have things A and B and counts k_ab for when A and B occur
> > together, k_a for when A appears with or without B, k_b similar to k_a
> but
> > for B and N which is the total observations then you build the table:
> >
> >       k_ab             k_a - k_ab
> >       k_b - k_ab     N - k_a - k_b + k_ab
> >
> > and then you can apply the LLR function directly.
> >
> > It is often helpful to report sqrt(-2 log lambda) with a sign according
> to
> > whether k_ab is greater or less than expected, i.e.
> >
> >      signum( k_ab / k_b - k_a / N ) * sqrt(- 2 log lambda)
> >
> > Also, for computation of - 2 log lambda, it is usually easier to compute
> it
> > in terms of mutual information which is in turn expressed in terms of
> > entropy:
> >
> >    - 2 log lambda = N * MI = N * ( H(K) - H(rowSums(K)) - H(colSums(K)) )
> >
> >    H(X) = - sum (X / sum(X)) logSafe (X / sum(X))
> >
> >    logSafe(x) = log(x + (x == 0))
> >
> > The resulting score is not directly suitable as a distance measure, but
> is
> > very handy for masking co-occurrence matrices.
> >
> > On Wed, May 21, 2008 at 11:22 AM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> Got it, so do I have it right that you suggest defining the
> >> "correlation" as really the chi-squared statistic, the -2 log lamba
> >> formula? k1 is the number of items 'preferred' by user 1, and n1 is
> >> the total number of items in the universe, and likewise for k2/n2? so
> >> n1 == n2?
> >>
> >> Simple is cool with me, to start. This sounds more sophisticated than
> >> a simple intersection/union approach. I could add that algorithm too
> >> for kicks.
> >>
> >> On Wed, May 21, 2008 at 12:58 PM, Ted Dunning <te...@gmail.com>
> >> wrote:
> >> > Correlation (per se) between such sparse binary vectors can be very
> >> > problematic.
> >> >
> >> > This is a general problem with this kind of data and really needs to
> be
> >> > handled directly.  Not clicking on an item is much less informative
> than
> >> > clicking on an item (so little time, so much to click).  Any system
> you
> >> > build has to deal with that and with coincidence.  For instance, raw
> >> > correlation gives 100% match for two people who happen to have clicked
> on
> >> > the same single item.  IF that item is a very popular one, however,
> this
> >> is
> >> > not a very interesting fact.
> >> >
> >> > One very simple way of dealing with this was described in
> >> > http://citeseer.ist.psu.edu/29096.html .  Since then, I have found
> >> other,
> >> > more comprehensive techniques but they are considerably more complex.
> >>
> >
> >
> >
> > --
> > ted
> >
>



-- 
ted

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
Thanks, I admit this is pushing the limits of my knowledge of stats.
Maybe you can help me offline here. What's the LLR function, and how
does this table of four values related to k1 and k2 in the formula in
section 8? is that formula OK to implement, since I think I do get
that one.

The chi-squared stat is a positive value right? then yes to map into
[-1,1] to make it correlation-like I'd need a simple transformation or
two like what you describe.

On Wed, May 21, 2008 at 2:49 PM, Ted Dunning <te...@gmail.com> wrote:
> Yes.  If you have things A and B and counts k_ab for when A and B occur
> together, k_a for when A appears with or without B, k_b similar to k_a but
> for B and N which is the total observations then you build the table:
>
>       k_ab             k_a - k_ab
>       k_b - k_ab     N - k_a - k_b + k_ab
>
> and then you can apply the LLR function directly.
>
> It is often helpful to report sqrt(-2 log lambda) with a sign according to
> whether k_ab is greater or less than expected, i.e.
>
>      signum( k_ab / k_b - k_a / N ) * sqrt(- 2 log lambda)
>
> Also, for computation of - 2 log lambda, it is usually easier to compute it
> in terms of mutual information which is in turn expressed in terms of
> entropy:
>
>    - 2 log lambda = N * MI = N * ( H(K) - H(rowSums(K)) - H(colSums(K)) )
>
>    H(X) = - sum (X / sum(X)) logSafe (X / sum(X))
>
>    logSafe(x) = log(x + (x == 0))
>
> The resulting score is not directly suitable as a distance measure, but is
> very handy for masking co-occurrence matrices.
>
> On Wed, May 21, 2008 at 11:22 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> Got it, so do I have it right that you suggest defining the
>> "correlation" as really the chi-squared statistic, the -2 log lamba
>> formula? k1 is the number of items 'preferred' by user 1, and n1 is
>> the total number of items in the universe, and likewise for k2/n2? so
>> n1 == n2?
>>
>> Simple is cool with me, to start. This sounds more sophisticated than
>> a simple intersection/union approach. I could add that algorithm too
>> for kicks.
>>
>> On Wed, May 21, 2008 at 12:58 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > Correlation (per se) between such sparse binary vectors can be very
>> > problematic.
>> >
>> > This is a general problem with this kind of data and really needs to be
>> > handled directly.  Not clicking on an item is much less informative than
>> > clicking on an item (so little time, so much to click).  Any system you
>> > build has to deal with that and with coincidence.  For instance, raw
>> > correlation gives 100% match for two people who happen to have clicked on
>> > the same single item.  IF that item is a very popular one, however, this
>> is
>> > not a very interesting fact.
>> >
>> > One very simple way of dealing with this was described in
>> > http://citeseer.ist.psu.edu/29096.html .  Since then, I have found
>> other,
>> > more comprehensive techniques but they are considerably more complex.
>>
>
>
>
> --
> ted
>

Re: Taste on Mahout

Posted by Ted Dunning <te...@gmail.com>.
Yes.  If you have things A and B and counts k_ab for when A and B occur
together, k_a for when A appears with or without B, k_b similar to k_a but
for B and N which is the total observations then you build the table:

       k_ab             k_a - k_ab
       k_b - k_ab     N - k_a - k_b + k_ab

and then you can apply the LLR function directly.

It is often helpful to report sqrt(-2 log lambda) with a sign according to
whether k_ab is greater or less than expected, i.e.

      signum( k_ab / k_b - k_a / N ) * sqrt(- 2 log lambda)

Also, for computation of - 2 log lambda, it is usually easier to compute it
in terms of mutual information, which is in turn expressed in terms of
entropy:

    - 2 log lambda = 2 * N * MI = 2 * N * ( H(rowSums(K)) + H(colSums(K)) - H(K) )

    H(X) = - sum (X / sum(X)) logSafe (X / sum(X))

    logSafe(x) = log(x + (x == 0))

The resulting score is not directly suitable as a distance measure, but is
very handy for masking co-occurrence matrices.
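
To make this concrete, here is a rough Java sketch of the signed score (the
class and method names are invented for illustration and are not part of
Taste/Mahout; the arguments follow the 2x2 table above):

// Sketch of the signed sqrt(-2 log lambda) score described above.
// kAB = co-occurrences of A and B, kA = occurrences of A, kB = occurrences
// of B, n = total number of observations.
public final class LogLikelihoodRatio {

  private LogLikelihoodRatio() {
  }

  // Shannon entropy of the distribution obtained by normalizing the counts;
  // zero counts contribute nothing (the logSafe trick above).
  private static double entropy(double... counts) {
    double total = 0.0;
    for (double x : counts) {
      total += x;
    }
    double h = 0.0;
    for (double x : counts) {
      if (x > 0.0) {
        double p = x / total;
        h -= p * Math.log(p);
      }
    }
    return h;
  }

  // -2 log lambda = 2 * N * ( H(rowSums(K)) + H(colSums(K)) - H(K) )
  public static double twoLogLambda(double kAB, double kA, double kB, double n) {
    double hK = entropy(kAB, kA - kAB, kB - kAB, n - kA - kB + kAB);
    double hRowSums = entropy(kA, n - kA);
    double hColSums = entropy(kB, n - kB);
    return 2.0 * n * (hRowSums + hColSums - hK);
  }

  // sqrt(-2 log lambda), signed by whether A and B co-occur more or less
  // often than expected under independence.
  public static double signedRootLLR(double kAB, double kA, double kB, double n) {
    double llr = Math.max(0.0, twoLogLambda(kAB, kA, kB, n)); // guard rounding
    return Math.signum(kAB / kB - kA / n) * Math.sqrt(llr);
  }
}

For example, signedRootLLR(20, 50, 60, 10000) comes out strongly positive,
since 20 co-occurrences is far more than the 50 * 60 / 10000 = 0.3 expected
under independence.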

On Wed, May 21, 2008 at 11:22 AM, Sean Owen <sr...@gmail.com> wrote:

> Got it, so do I have it right that you suggest defining the
> "correlation" as really the chi-squared statistic, the -2 log lamba
> formula? k1 is the number of items 'preferred' by user 1, and n1 is
> the total number of items in the universe, and likewise for k2/n2? so
> n1 == n2?
>
> Simple is cool with me, to start. This sounds more sophisticated than
> a simple intersection/union approach. I could add that algorithm too
> for kicks.
>
> On Wed, May 21, 2008 at 12:58 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Correlation (per se) between such sparse binary vectors can be very
> > problematic.
> >
> > This is a general problem with this kind of data and really needs to be
> > handled directly.  Not clicking on an item is much less informative than
> > clicking on an item (so little time, so much to click).  Any system you
> > build has to deal with that and with coincidence.  For instance, raw
> > correlation gives 100% match for two people who happen to have clicked on
> > the same single item.  IF that item is a very popular one, however, this
> is
> > not a very interesting fact.
> >
> > One very simple way of dealing with this was described in
> > http://citeseer.ist.psu.edu/29096.html .  Since then, I have found
> other,
> > more comprehensive techniques but they are considerably more complex.
>



-- 
ted

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
Got it, so do I have it right that you suggest defining the
"correlation" as really the chi-squared statistic, the -2 log lamba
formula? k1 is the number of items 'preferred' by user 1, and n1 is
the total number of items in the universe, and likewise for k2/n2? so
n1 == n2?

Simple is cool with me, to start. This sounds more sophisticated than
a simple intersection/union approach. I could add that algorithm too
for kicks.

On Wed, May 21, 2008 at 12:58 PM, Ted Dunning <te...@gmail.com> wrote:
> Correlation (per se) between such sparse binary vectors can be very
> problematic.
>
> This is a general problem with this kind of data and really needs to be
> handled directly.  Not clicking on an item is much less informative than
> clicking on an item (so little time, so much to click).  Any system you
> build has to deal with that and with coincidence.  For instance, raw
> correlation gives 100% match for two people who happen to have clicked on
> the same single item.  IF that item is a very popular one, however, this is
> not a very interesting fact.
>
> One very simple way of dealing with this was described in
> http://citeseer.ist.psu.edu/29096.html .  Since then, I have found other,
> more comprehensive techniques but they are considerably more complex.

Re: Taste on Mahout

Posted by Ted Dunning <te...@gmail.com>.
Correlation (per se) between such sparse binary vectors can be very
problematic.

This is a general problem with this kind of data and really needs to be
handled directly.  Not clicking on an item is much less informative than
clicking on an item (so little time, so much to click).  Any system you
build has to deal with that and with coincidence.  For instance, raw
correlation gives 100% match for two people who happen to have clicked on
the same single item.  IF that item is a very popular one, however, this is
not a very interesting fact.

One very simple way of dealing with this was described in
http://citeseer.ist.psu.edu/29096.html .  Since then, I have found other,
more comprehensive techniques but they are considerably more complex.

On Wed, May 21, 2008 at 9:29 AM, Sean Owen <sr...@gmail.com> wrote:

> Interesting, I was just having a separate conversation with another
> developer about best practices when your preferences are binary yes/no
> values rather than a number within a range.
>
> The standard techniques (well, that I have implemented) can have a bit
> of trouble with this sort of input, though it works. Already I wonder
> if we should put in place some new components that address this
> situation:
>
> - a new implementation of Preference that doesn't store a preference
> value, since that is wasteful; its mere existence is a "true"
>
> - a new implementation of UserCorrelation/ItemCorrelation that is more
> appropriate to binary data. It would probably report the ratio of the
> intersection to union of the two users' or items' preference sets
>
> I will pick up a copy of this book as it sounds like a good read and
> may refresh and expand my knowledge of techniques in this area.
>
>
> Yes comparing every item to every other is expensive. The ways to
> mitigate this are:
>
> - Do it offline, in batch, once a day or hour or something
>  - ... and maybe use Hadoop to do that, yes
>
> - Use sampling -- compare each item to only a random 10% of the other
> items to find items that are among the most similar, if not *the* most
> similar
>
> 10,000 items sounds reasonably small though; even doing it online may
> not be so bad, as long as you provide enough memory to the library for
> caching.
>
>
> Taste also has something like k-means clustering, in the
> "TreeClusteringRecommender". You would ask it to cluster to some
> number of clusters rather than until the cluster similarity reaches
> some threshold.
>
>
>
> On Wed, May 21, 2008 at 7:36 AM, Goel, Ankur <An...@corp.aol.com>
> wrote:
> > Hey Sean,
> >          Thanks for the suggestions. In my case the data-set is only
> > going to tell me if the user clicked on a particular item. So let's say
> > there are 10,000 items a user might only have clicked 20 - 30 items. I
> > was thinking more on the lines of building an item similarity table by
> > comparing each item with every other item and retaining only 100 top
> > items decayed by time.
> >
> > So a recommender for a user would use his recent browsing history to
> > figure out top 10 or 20 most similar items.
> >
> > The approach is documented in Toby Segaran's "Collective Intelligence"
> > book and looks simple to implement even though it is costly since every
> > item needs to be compared with every other item. This can be
> > parallelized in a way that for M items in a cluster of N machines, each
> > node has to compare M/N items to M items. Since the data-set is going to
> > be sparse (no. of items having common users), I believe this wouldn't be
> > overwhelming for the cluster.
> >
> > The other approach that I am thinking to reduce the computation cost is
> > to use a clustering algorithm like K-Means that's available in Mahout to
> > cluster similar user/items together and then use clustering information
> > to make recommendations.
> >
> > Any suggestions?
> >
> > Thanks
> > -Ankur
> >
> >
> > -----Original Message-----
> > From: Sean Owen [mailto:srowen@gmail.com]
> > Sent: Tuesday, May 20, 2008 9:37 PM
> > To: mahout-dev@lucene.apache.org; Goel, Ankur
> > Subject: Re: Taste on Mahout
> >
> > + Ankur directly, since I am not sure you are on the dev list.
> >
> > On Tue, May 20, 2008 at 12:06 PM, Sean Owen <sr...@gmail.com> wrote:
> >> All of the algorithms assume a world where you have a continuous range
> >
> >> of ratings from users for items. Obviously a binary yes/no rating can
> >> be mapped into that trivially -- 1 and -1 for example. This causes
> >> some issues, most notably for correlation-based recommenders where the
> >> correlation can be undefined between two items/users in special cases
> >> that arise from this kind of input -- for example if we overlap in
> >> rating 3 items and I voted "yes" for all 3, then no correlation can be
> >
> >> defined.
> >>
> >> Slope one doesn't run into this particular mathematical wrinkle.
> >>
> >> Also, methods like estimatePreference() are not going to give you
> >> estimates that are always 1 or -1. Again, you could map this back onto
> >> 1 / -1 by rounding or something, just something to note.
> >>
> >> So, in general it will be better if you can map whatever input you
> >> have onto a larger range of input. You will feed more information in,
> >> in this way, as well. For example, maybe you call a recent "yes"
> >> rating a +2, and a recent "no" a -2, and others +1 and -1.
> >>
> >>
> >> The part of slope one that parallelizes very well is the computing of
> >> the item-item diffs. No I have not written this yet.
> >>
> >>
> >> I have committed a first cut at a framework for computing
> >> recommendations in parallel for any recommender. Dig in to
> >> org.apache.mahout.cf.taste.impl.hadoop. In general, none of the
> >> existing recommenders can be parallelized, because they generally need
> >
> >> access to all the data to produce any recommendation.
> >>
> >> But, we can take partial advantage of Hadoop by simply parallelizing
> >> the computation of recommendations for many users across multiple
> >> identical recommender instances. Better than nothing. In this
> >> situation, one of the map or reduce phases is trivial.
> >>
> >> That is what I have committed so far and it works, locally. I am in
> >> the middle of figuring out how to write it for real use on a remote
> >> Hadoop cluster, and how I would go about testing that!
> >>
> >> Do we have any test bed available?
> >>
> >>
> >>
> >> On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur <An...@corp.aol.com>
> > wrote:
> >>> I just realized after going through the wikipedia that slope one is
> >>> applicable when you have ratings for the items.
> >>> In my case, I would be simply working with binary data (Item was
> >>> clicked or not-clicked by user) using Tanimoto coefficient to
> >>> calculate item similarity.
> >>> The idea is to capture the simple intuition "What items have been
> >>> visited most along with this item".
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> >>> Sent: Tuesday, May 20, 2008 2:51 PM
> >>> To: mahout-dev@lucene.apache.org
> >>> Subject: RE: Taste on Mahout
> >>>
> >>>
> >>> Hey Sean,
> >>>       I actually plan to use slope-one to start with since
> >>> - Its simple and known to work well.
> >>> - Can be parallelized nicely into the Map-Reduce style.
> >>> I also plan to use Tanimoto coefficient for item-item diffs.
> >>>
> >>> Do we have something on slope-one already in Taste as a part of
> > Mahout ?
> >>>
> >>> At the moment I am going through the available documentation on Taste
> >
> >>> and code that's present in Mahout.
> >>>
> >>> Your suggestions would be greatly appreciated.
> >>>
> >>> Thanks
> >>> -Ankur
> >>>
> >>> -----Original Message-----
> >>> From: Sean Owen [mailto:srowen@gmail.com]
> >>> Sent: Tuesday, April 29, 2008 11:09 PM
> >>> To: mahout-dev@lucene.apache.org; Goel, Ankur
> >>> Subject: Re: Taste on Mahout
> >>>
> >>> I have some Hadoop code mostly ready to go for Taste.
> >>>
> >>> The first thing to do is let you generate recommendations for all
> >>> your users via Hadoop. Unfortunately none of the recommenders truly
> >>> parallelize in the way that MapReduce needs it to -- you need all
> >>> data to compute any recommendation really -- but you can at least get
> >
> >>> paralellization out of this. You can use the framework to run n
> >>> recommenders, each computing 1/nth of all recommendations.
> >>>
> >>> The next application is specific to slope-one. Computing the
> >>> item-item diffs is exactly the kind of thing that MapReduce is good
> >>> for, so, writing a Hadoop job to do this seems like a no-brainer.
> >>>
> >>> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur
> >>> <An...@corp.aol.com>
> >>> wrote:
> >>>> Hi Folks,
> >>>>       What's the status of hadoopifying Taste on Mahout ?
> >>>>  What's been done and what is in progress/pending ?
> >>>>
> >>>>  I am looking using a scalable version of Taste for my project.
> >>>>  So basically trying to figure out what's already done and where  I
> >>>> can pitch in.
> >>>>
> >>>>  Thanks
> >>>>  -Ankur
> >>>>
> >>>
> >>
> >
>



-- 
ted

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
Interesting, I was just having a separate conversation with another
developer about best practices when your preferences are binary yes/no
values rather than a number within a range.

The standard techniques (well, that I have implemented) can have a bit
of trouble with this sort of input, though it works. Already I wonder
if we should put in place some new components that address this
situation:

- a new implementation of Preference that doesn't store a preference
value, since that is wasteful; its mere existence is a "true"

- a new implementation of UserCorrelation/ItemCorrelation that is more
appropriate to binary data. It would probably report the ratio of the
intersection to union of the two users' or items' preference sets
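
As a rough sketch of that second point (invented names, not the actual
UserCorrelation/ItemCorrelation interfaces), with each user's or item's
preferences held as a set of IDs:

import java.util.HashSet;
import java.util.Set;

// Tanimoto (Jaccard) similarity for binary preference data:
// |intersection| / |union| of the two preference sets.
public final class TanimotoSimilarity {

  private TanimotoSimilarity() {
  }

  public static double similarity(Set<Long> prefs1, Set<Long> prefs2) {
    if (prefs1.isEmpty() && prefs2.isEmpty()) {
      return 0.0; // no preferences at all, so no evidence of similarity
    }
    Set<Long> intersection = new HashSet<Long>(prefs1);
    intersection.retainAll(prefs2);
    int unionSize = prefs1.size() + prefs2.size() - intersection.size();
    return (double) intersection.size() / unionSize;
  }
}

It returns 1.0 only when the two sets are identical and falls toward 0.0 as
they share fewer items.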

I will pick up a copy of this book as it sounds like a good read and
may refresh and expand my knowledge of techniques in this area.


Yes comparing every item to every other is expensive. The ways to
mitigate this are:

- Do it offline, in batch, once a day or hour or something
  - ... and maybe use Hadoop to do that, yes

- Use sampling -- compare each item to only a random 10% of the other
items to find items that are among the most similar, if not *the* most
similar
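
A rough sketch of the sampling idea (invented names; it assumes some
item-item similarity function is available, for instance the Tanimoto one
sketched above):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Compare an item against only a random sample of the other items, then
// keep the most similar of that sample, instead of scanning every item.
public final class SampledMostSimilarItems {

  public interface ItemSimilarity {
    double similarity(long itemA, long itemB);
  }

  private final Random random = new Random();

  public List<Long> mostSimilar(final long item, List<Long> allItems,
                                final ItemSimilarity sim,
                                double sampleRate, int howMany) {
    List<Long> candidates = new ArrayList<Long>();
    for (Long other : allItems) {
      if (other.longValue() != item && random.nextDouble() < sampleRate) {
        candidates.add(other);
      }
    }
    // Sort the sampled candidates by descending similarity to the target item.
    Collections.sort(candidates, new Comparator<Long>() {
      public int compare(Long a, Long b) {
        return Double.compare(sim.similarity(item, b), sim.similarity(item, a));
      }
    });
    return candidates.subList(0, Math.min(howMany, candidates.size()));
  }
}

With sampleRate around 0.1 this cuts the per-item work to roughly a tenth, at
the cost of occasionally missing one of the truly most similar items.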

10,000 items sounds reasonably small though; even doing it online may
not be so bad, as long as you provide enough memory to the library for
caching.


Taste also has something like k-means clustering, in the
"TreeClusteringRecommender". You would ask it to cluster to some
number of clusters rather than until the cluster similarity reaches
some threshold.
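
(Not the TreeClusteringRecommender itself; just to make the clustering idea
concrete, here is a rough sketch with invented names of how cluster
membership could drive recommendations once users have been clustered by
whatever means: recommend the items most clicked inside the user's cluster
that the user has not clicked yet.)

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Cheap cluster-based recommender sketch: surface the most popular unseen
// items among the other users in the target user's cluster.
public final class ClusterPopularityRecommender {

  public static List<Long> recommend(Set<Long> userClicks,
                                     List<Set<Long>> clusterMemberClicks,
                                     int howMany) {
    final Map<Long, Integer> counts = new HashMap<Long, Integer>();
    for (Set<Long> memberClicks : clusterMemberClicks) {
      for (Long item : memberClicks) {
        if (!userClicks.contains(item)) {
          Integer c = counts.get(item);
          counts.put(item, c == null ? 1 : c + 1);
        }
      }
    }
    List<Long> items = new ArrayList<Long>(counts.keySet());
    Collections.sort(items, new Comparator<Long>() {
      public int compare(Long a, Long b) {
        return counts.get(b).compareTo(counts.get(a)); // most clicked first
      }
    });
    return items.subList(0, Math.min(howMany, items.size()));
  }
}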



On Wed, May 21, 2008 at 7:36 AM, Goel, Ankur <An...@corp.aol.com> wrote:
> Hey Sean,
>          Thanks for the suggestions. In my case the data-set is only
> going to tell me if the user clicked on a particular item. So let's say
> there are 10,000 items a user might only have clicked 20 - 30 items. I
> was thinking more on the lines of building an item similarity table by
> comparing each item with every other item and retaining only 100 top
> items decayed by time.
>
> So a recommender for a user would use his recent browsing history to
> figure out top 10 or 20 most similar items.
>
> The approach is documented in Toby Segaran's "Collective Intelligence"
> book and looks simple to implement even though it is costly since every
> item needs to be compared with every other item. This can be
> parallelized in a way that for M items in a cluster of N machines, each
> node has to compare M/N items to M items. Since the data-set is going to
> be sparse (no. of items having common users), I believe this wouldn't be
> overwhelming for the cluster.
>
> The other approach that I am thinking to reduce the computation cost is
> to use a clustering algorithm like K-Means that's available in Mahout to
> cluster similar user/items together and then use clustering information
> to make recommendations.
>
> Any suggestions?
>
> Thanks
> -Ankur
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Tuesday, May 20, 2008 9:37 PM
> To: mahout-dev@lucene.apache.org; Goel, Ankur
> Subject: Re: Taste on Mahout
>
> + Ankur directly, since I am not sure you are on the dev list.
>
> On Tue, May 20, 2008 at 12:06 PM, Sean Owen <sr...@gmail.com> wrote:
>> All of the algorithms assume a world where you have a continuous range
>
>> of ratings from users for items. Obviously a binary yes/no rating can
>> be mapped into that trivially -- 1 and -1 for example. This causes
>> some issues, most notably for correlation-based recommenders where the
>> correlation can be undefined between two items/users in special cases
>> that arise from this kind of input -- for example if we overlap in
>> rating 3 items and I voted "yes" for all 3, then no correlation can be
>
>> defined.
>>
>> Slope one doesn't run into this particular mathematical wrinkle.
>>
>> Also, methods like estimatePreference() are not going to give you
>> estimates that are always 1 or -1. Again, you could map this back onto
>> 1 / -1 by rounding or something, just something to note.
>>
>> So, in general it will be better if you can map whatever input you
>> have onto a larger range of input. You will feed more information in,
>> in this way, as well. For example, maybe you call a recent "yes"
>> rating a +2, and a recent "no" a -2, and others +1 and -1.
>>
>>
>> The part of slope one that parallelizes very well is the computing of
>> the item-item diffs. No I have not written this yet.
>>
>>
>> I have committed a first cut at a framework for computing
>> recommendations in parallel for any recommender. Dig in to
>> org.apache.mahout.cf.taste.impl.hadoop. In general, none of the
>> existing recommenders can be parallelized, because they generally need
>
>> access to all the data to produce any recommendation.
>>
>> But, we can take partial advantage of Hadoop by simply parallelizing
>> the computation of recommendations for many users across multiple
>> identical recommender instances. Better than nothing. In this
>> situation, one of the map or reduce phases is trivial.
>>
>> That is what I have committed so far and it works, locally. I am in
>> the middle of figuring out how to write it for real use on a remote
>> Hadoop cluster, and how I would go about testing that!
>>
>> Do we have any test bed available?
>>
>>
>>
>> On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur <An...@corp.aol.com>
> wrote:
>>> I just realized after going through the wikipedia that slope one is
>>> applicable when you have ratings for the items.
>>> In my case, I would be simply working with binary data (Item was
>>> clicked or not-clicked by user) using Tanimoto coefficient to
>>> calculate item similarity.
>>> The idea is to capture the simple intuition "What items have been
>>> visited most along with this item".
>>>
>>>
>>> -----Original Message-----
>>> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
>>> Sent: Tuesday, May 20, 2008 2:51 PM
>>> To: mahout-dev@lucene.apache.org
>>> Subject: RE: Taste on Mahout
>>>
>>>
>>> Hey Sean,
>>>       I actually plan to use slope-one to start with since
>>> - Its simple and known to work well.
>>> - Can be parallelized nicely into the Map-Reduce style.
>>> I also plan to use Tanimoto coefficient for item-item diffs.
>>>
>>> Do we have something on slope-one already in Taste as a part of
> Mahout ?
>>>
>>> At the moment I am going through the available documentation on Taste
>
>>> and code that's present in Mahout.
>>>
>>> Your suggestions would be greatly appreciated.
>>>
>>> Thanks
>>> -Ankur
>>>
>>> -----Original Message-----
>>> From: Sean Owen [mailto:srowen@gmail.com]
>>> Sent: Tuesday, April 29, 2008 11:09 PM
>>> To: mahout-dev@lucene.apache.org; Goel, Ankur
>>> Subject: Re: Taste on Mahout
>>>
>>> I have some Hadoop code mostly ready to go for Taste.
>>>
>>> The first thing to do is let you generate recommendations for all
>>> your users via Hadoop. Unfortunately none of the recommenders truly
>>> parallelize in the way that MapReduce needs it to -- you need all
>>> data to compute any recommendation really -- but you can at least get
>
>>> paralellization out of this. You can use the framework to run n
>>> recommenders, each computing 1/nth of all recommendations.
>>>
>>> The next application is specific to slope-one. Computing the
>>> item-item diffs is exactly the kind of thing that MapReduce is good
>>> for, so, writing a Hadoop job to do this seems like a no-brainer.
>>>
>>> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur
>>> <An...@corp.aol.com>
>>> wrote:
>>>> Hi Folks,
>>>>       What's the status of hadoopifying Taste on Mahout ?
>>>>  What's been done and what is in progress/pending ?
>>>>
>>>>  I am looking using a scalable version of Taste for my project.
>>>>  So basically trying to figure out what's already done and where  I
>>>> can pitch in.
>>>>
>>>>  Thanks
>>>>  -Ankur
>>>>
>>>
>>
>

RE: LDA [was RE: Taste on Mahout]

Posted by Daniel Kluesing <da...@ilike-inc.com>.
Just a note that LDA and CRFs are two different algorithms. They both
fall under the general class of graphical models but otherwise solve
different problems.

LDA is trained on un-labeled, unordered data using a bag-of-words
assumption, CRFs are trained on labeled, sequential data with markov
assumptions. (CRFs are sorta a generalization of HMMs)

But CRFs are neat in their own right. Biology folks would get excited
with a good CRF implementation.


-----Original Message-----
From: Robin Anil [mailto:robin.anil@gmail.com] 
Sent: Saturday, June 07, 2008 4:36 AM
To: mahout-dev@lucene.apache.org
Subject: Re: LDA [was RE: Taste on Mahout]

Hi,
     There are some LDA/CRF implementations available online. They might prove
useful when writing the code.

* GibbsLDA++ <http://gibbslda.sourceforge.net/>*: GibbsLDA++: A C/C++
Implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling
for parameter estimation and inference. GibbsLDA++ is fast and is
designed to analyze hidden/latent topic structures of large-scale (text)
data collections.

* CRFTagger <http://crftagger.sourceforge.net/> *: A Java-based
Conditional Random Fields Part-of-Speech (POS) Tagger for English. The
model was trained on sections 01..24 of WSJ corpus and using section 00
as the development test set (accuracy of 97.00%). Tagging speed: 500
sentences / second.

* CRFChunker <http://crfchunker.sourceforge.net/> *: A Java-based
Conditional Random Fields Phrase Chunker (Phrase Chunking Tool) for
English.
The model was trained on sections 01..24 of WSJ corpus and using section
00 as the development test set (F1-score of 95.77). Chunking speed: 700
sentences / second.

* JTextPro <http://jtextpro.sourceforge.net/>*: A Java-based text
processing tool that includes sentence boundary detection (using maximum
entropy classifier), word tokenization (following Penn convention),
part-of-speech tagging (using CRFTagger), and phrase chunking (using
CRFChunker).

*JWebPro <http://jwebpro.sourceforge.net/>*: A Java-based tool that can
interact with Google search via Google Web APIs and then process the
returned Web documents in a couple of ways. The outputs of JWebPro can
serve as inputs for natural language processing, information retrieval,
information extraction, Web data mining, online social network
extraction/analysis, and ontology development applications.

* JVnSegmenter <http://jvnsegmenter.sourceforge.net/>*: A Java-based and
open-source Vietnamese word segmentation tool. The segmentation model in
this tool was trained on about 8,000 labeled sentences using FlexCRFs.
It would be useful for Vietnamese NLP community.
*FlexCRFs: Flexible Conditional Random Fields* (Including PCRFs - a
parallel version of FlexCRFs)  http://flexcrfs.sourceforge.net/

CRF++: Yet Another CRF toolkit *http://flexcrfs.sourceforge.net/*
Robin
On Thu, Jun 5, 2008 at 9:59 PM, Ted Dunning <te...@gmail.com>
wrote:

> The buntine and jakulin paper is also useful reading.  I would avoid 
> fancy stuff like the powell rao-ization to start.
>
> http://citeseer.ist.psu.edu/750239.html
>
> The gibb's sampling approach is, at its heart, very simple in that 
> most of the math devolves into sampling discrete hidden variables from

> simple distributions and then counting the results as if they were
observed.
>
> On Thu, Jun 5, 2008 at 5:49 AM, Goel, Ankur <An...@corp.aol.com>
> wrote:
>
> > It draws reference from Java implementation - 
> > http://www.arbylon.net/projects/LdaGibbsSampler.java
> > which is a single class version of LDA using gibbs sampling with 
> > slightly better code documentation.
> > I am trying to understand the code while reading the paper you 
> > suggested
> > -
> > "Distributed Inference for Latent Drichlet Allocation".
> >
> > -----Original Message-----
> > From: Daniel Kluesing [mailto:daniel@ilike-inc.com]
> > Sent: Wednesday, June 04, 2008 8:31 PM
> > To: mahout-dev@lucene.apache.org
> > Subject: RE: LDA [was RE: Taste on Mahout]
> >
> > Ted may have a better one, but in my quick poking around at things 
> > http://gibbslda.sourceforge.net/ looks to be a good implementation 
> > of the Gibbs sampling approach.
> >
> > -----Original Message-----
> > From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> > Sent: Wednesday, June 04, 2008 4:58 AM
> > To: mahout-dev@lucene.apache.org
> > Subject: RE: LDA [was RE: Taste on Mahout]
> >
> > Ted, Do you have a sequential version of LDA implementation that can

> > be used for reference ?
> > If yes, can you please post it on Jira ? Should we open a new Jira 
> > or use MAHOUT-30 for this ?
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: Tuesday, May 27, 2008 11:50 AM
> > To: mahout-dev@lucene.apache.org
> > Subject: Re: LDA [was RE: Taste on Mahout]
> >
> > Chris Bishop's book has a very clear exposition of the relationship 
> > between the variational techniques and EM.  Very good reading.
> >
> > On Mon, May 26, 2008 at 10:13 PM, Goel, Ankur 
> > <An...@corp.aol.com>
> > wrote:
> >
> > > Daniel/Ted,
> > >      Thanks for the interesting pointers to more information on 
> > > LDA and EM.
> > > I am going through the docs to visualize and understand how LDA 
> > > approach would work for my specific case.
> > >
> > > Once I have some idea, I can volunteer to work on the Map-Reduce 
> > > side of
> > >
> > > thngs as this is something that will benefit both my project and 
> > > the community.
> > >
> > > Looking forward to share more ideas/information on this :-)
> > >
> > > Regards
> > > -Ankur
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: Tuesday, May 27, 2008 6:59 AM
> > > To: mahout-dev@lucene.apache.org
> > > Subject: Re: LDA [was RE: Taste on Mahout]
> > >
> > > Those are both new to me.  Both look interesting.  My own 
> > > experience is that the simplicity of the Gibb's sampling makes it 
> > > very much more attractive for implementation.  Also, since it is 
> > > (nearly) trivially parallelizable, it is more likely we will get a

> > > useful implementation right off the bat.
> > >
> > > On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing 
> > > <da...@ilike-inc.com>
> > > wrote:
> > >
> > > > (Hijacking the thread to discuss ways to implement LDA)
> > > >
> > > > Had you seen
> > > > http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
> > > > ?
> > > >
> > > > Their hierarchical distributed LDA formulation uses gibbs 
> > > > sampling and
> > >
> > > > fits into mapreduce.
> > > >
> > > > http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a mapreduce
> > > formulation for the variational EM method.
> > > >
> > > > I'm still chewing on them, but my first impression is that the 
> > > > EM approach would give better performance on bigger data sets. 
> > > > Opposing
> >
> > > > views welcome.
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > ted
> >
>
>
>
> --
> ted
>



--
Robin Anil
4th Year Dual Degree Student
Department of Computer Science & Engineering IIT Kharagpur

------------------------------------------------------------------------
--------------------
techdigger.wordpress.com
A discursive take on the world around us

www.minekey.com
You Might Like This

www.ithink.com
Express Yourself

Re: LDA [was RE: Taste on Mahout]

Posted by Robin Anil <ro...@gmail.com>.
Hi,
     There are some LDA/CRF implementations available online. They might prove useful
when writing the code.

* GibbsLDA++ <http://gibbslda.sourceforge.net/>*: GibbsLDA++: A C/C++
Implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling for
parameter estimation and inference. GibbsLDA++ is fast and is designed to
analyze hidden/latent topic structures of large-scale (text) data
collections.

* CRFTagger <http://crftagger.sourceforge.net/> *: A Java-based Conditional
Random Fields Part-of-Speech (POS) Tagger for English. The model was trained
on sections 01..24 of WSJ corpus and using section 00 as the development
test set (accuracy of 97.00%). Tagging speed: 500 sentences / second.

* CRFChunker <http://crfchunker.sourceforge.net/> *: A Java-based
Conditional Random Fields Phrase Chunker (Phrase Chunking Tool) for English.
The model was trained on sections 01..24 of WSJ corpus and using section 00
as the development test set (F1-score of 95.77). Chunking speed: 700
sentences / second.

* JTextPro <http://jtextpro.sourceforge.net/>*: A Java-based text processing
tool that includes sentence boundary detection (using maximum entropy
classifier), word tokenization (following Penn convention), part-of-speech
tagging (using CRFTagger), and phrase chunking (using CRFChunker).

*JWebPro <http://jwebpro.sourceforge.net/>*: A Java-based tool that can
interact with Google search via Google Web APIs and then process the
returned Web documents in a couple of ways. The outputs of JWebPro can serve
as inputs for natural language processing, information retrieval,
information extraction, Web data mining, online social network
extraction/analysis, and ontology development applications.

* JVnSegmenter <http://jvnsegmenter.sourceforge.net/>*: A Java-based and
open-source Vietnamese word segmentation tool. The segmentation model in
this tool was trained on about 8,000 labeled sentences using FlexCRFs. It
would be useful for Vietnamese NLP community.
*FlexCRFs: Flexible Conditional Random Fields* (Including PCRFs - a parallel
version of FlexCRFs)  http://flexcrfs.sourceforge.net/

CRF++: Yet Another CRF toolkit *http://flexcrfs.sourceforge.net/*
Robin
On Thu, Jun 5, 2008 at 9:59 PM, Ted Dunning <te...@gmail.com> wrote:

> The buntine and jakulin paper is also useful reading.  I would avoid fancy
> stuff like the powell rao-ization to start.
>
> http://citeseer.ist.psu.edu/750239.html
>
> The gibb's sampling approach is, at its heart, very simple in that most of
> the math devolves into sampling discrete hidden variables from simple
> distributions and then counting the results as if they were observed.
>
> On Thu, Jun 5, 2008 at 5:49 AM, Goel, Ankur <An...@corp.aol.com>
> wrote:
>
> > It draws reference from Java implementation -
> > http://www.arbylon.net/projects/LdaGibbsSampler.java
> > which is a single class version of LDA using gibbs sampling with
> > slightly better code documentation.
> > I am trying to understand the code while reading the paper you suggested
> > -
> > "Distributed Inference for Latent Drichlet Allocation".
> >
> > -----Original Message-----
> > From: Daniel Kluesing [mailto:daniel@ilike-inc.com]
> > Sent: Wednesday, June 04, 2008 8:31 PM
> > To: mahout-dev@lucene.apache.org
> > Subject: RE: LDA [was RE: Taste on Mahout]
> >
> > Ted may have a better one, but in my quick poking around at things
> > http://gibbslda.sourceforge.net/ looks to be a good implementation of
> > the Gibbs sampling approach.
> >
> > -----Original Message-----
> > From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> > Sent: Wednesday, June 04, 2008 4:58 AM
> > To: mahout-dev@lucene.apache.org
> > Subject: RE: LDA [was RE: Taste on Mahout]
> >
> > Ted, Do you have a sequential version of LDA implementation that can be
> > used for reference ?
> > If yes, can you please post it on Jira ? Should we open a new Jira or
> > use MAHOUT-30 for this ?
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: Tuesday, May 27, 2008 11:50 AM
> > To: mahout-dev@lucene.apache.org
> > Subject: Re: LDA [was RE: Taste on Mahout]
> >
> > Chris Bishop's book has a very clear exposition of the relationship
> > between the variational techniques and EM.  Very good reading.
> >
> > On Mon, May 26, 2008 at 10:13 PM, Goel, Ankur <An...@corp.aol.com>
> > wrote:
> >
> > > Daniel/Ted,
> > >      Thanks for the interesting pointers to more information on LDA
> > > and EM.
> > > I am going through the docs to visualize and understand how LDA
> > > approach would work for my specific case.
> > >
> > > Once I have some idea, I can volunteer to work on the Map-Reduce side
> > > of
> > >
> > > thngs as this is something that will benefit both my project and the
> > > community.
> > >
> > > Looking forward to share more ideas/information on this :-)
> > >
> > > Regards
> > > -Ankur
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: Tuesday, May 27, 2008 6:59 AM
> > > To: mahout-dev@lucene.apache.org
> > > Subject: Re: LDA [was RE: Taste on Mahout]
> > >
> > > Those are both new to me.  Both look interesting.  My own experience
> > > is that the simplicity of the Gibb's sampling makes it very much more
> > > attractive for implementation.  Also, since it is (nearly) trivially
> > > parallelizable, it is more likely we will get a useful implementation
> > > right off the bat.
> > >
> > > On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing
> > > <da...@ilike-inc.com>
> > > wrote:
> > >
> > > > (Hijacking the thread to discuss ways to implement LDA)
> > > >
> > > > Had you seen
> > > > http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
> > > > ?
> > > >
> > > > Their hierarchical distributed LDA formulation uses gibbs sampling
> > > > and
> > >
> > > > fits into mapreduce.
> > > >
> > > > http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a mapreduce
> > > formulation for the variational EM method.
> > > >
> > > > I'm still chewing on them, but my first impression is that the EM
> > > > approach would give better performance on bigger data sets. Opposing
> >
> > > > views welcome.
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > ted
> >
>
>
>
> --
> ted
>



-- 
Robin Anil
4th Year Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur

--------------------------------------------------------------------------------------------
techdigger.wordpress.com
A discursive take on the world around us

www.minekey.com
You Might Like This

www.ithink.com
Express Yourself

Re: LDA [was RE: Taste on Mahout]

Posted by Ted Dunning <te...@gmail.com>.
The buntine and jakulin paper is also useful reading.  I would avoid fancy
stuff like the powell rao-ization to start.

http://citeseer.ist.psu.edu/750239.html

The Gibbs sampling approach is, at its heart, very simple in that most of
the math devolves into sampling discrete hidden variables from simple
distributions and then counting the results as if they were observed.
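
To show how little machinery that takes, here is a bare-bones, single-machine
Java sketch of collapsed Gibbs sampling for LDA (names invented, no
hyperparameter optimization or convergence checks, nothing distributed --
just the sample-then-count-as-observed loop described above):

import java.util.Random;

// Minimal collapsed Gibbs sampler for LDA. docs[d][i] is the word id of
// token i in document d; z[d][i] is its current topic assignment.
public final class GibbsLda {

  private final int numTopics;
  private final int vocabSize;
  private final double alpha;       // document-topic smoothing
  private final double beta;        // topic-word smoothing
  private final int[][] docs;
  private final int[][] z;
  private final int[][] docTopic;   // docTopic[d][k]
  private final int[][] topicWord;  // topicWord[k][w]
  private final int[] topicTotal;   // topicTotal[k]
  private final Random random = new Random();

  public GibbsLda(int[][] docs, int numTopics, int vocabSize,
                  double alpha, double beta) {
    this.docs = docs;
    this.numTopics = numTopics;
    this.vocabSize = vocabSize;
    this.alpha = alpha;
    this.beta = beta;
    this.z = new int[docs.length][];
    this.docTopic = new int[docs.length][numTopics];
    this.topicWord = new int[numTopics][vocabSize];
    this.topicTotal = new int[numTopics];
    for (int d = 0; d < docs.length; d++) {
      z[d] = new int[docs[d].length];
      for (int i = 0; i < docs[d].length; i++) {
        int k = random.nextInt(numTopics);    // random initial assignment
        z[d][i] = k;
        docTopic[d][k]++;
        topicWord[k][docs[d][i]]++;
        topicTotal[k]++;
      }
    }
  }

  // One full sweep: resample the hidden topic of every token, then update
  // the counts as if the sampled value had been observed.
  public void sweep() {
    for (int d = 0; d < docs.length; d++) {
      for (int i = 0; i < docs[d].length; i++) {
        int w = docs[d][i];
        int old = z[d][i];
        docTopic[d][old]--;                   // remove the old assignment
        topicWord[old][w]--;
        topicTotal[old]--;
        double[] p = new double[numTopics];
        double sum = 0.0;
        for (int k = 0; k < numTopics; k++) { // unnormalized full conditional
          p[k] = (docTopic[d][k] + alpha)
              * (topicWord[k][w] + beta) / (topicTotal[k] + vocabSize * beta);
          sum += p[k];
        }
        double u = random.nextDouble() * sum; // draw a topic from p
        int k = 0;
        for (double acc = p[0]; acc < u && k < numTopics - 1; acc += p[++k]) {
        }
        z[d][i] = k;
        docTopic[d][k]++;                     // count it as if observed
        topicWord[k][w]++;
        topicTotal[k]++;
      }
    }
  }
}

After enough sweeps the docTopic and topicWord counts give the (smoothed,
unnormalized) document-topic and topic-word estimates. Roughly speaking, the
distributed formulations discussed in this thread partition the documents,
run this inner loop locally, and periodically merge the topicWord/topicTotal
counts.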

On Thu, Jun 5, 2008 at 5:49 AM, Goel, Ankur <An...@corp.aol.com> wrote:

> It draws reference from Java implementation -
> http://www.arbylon.net/projects/LdaGibbsSampler.java
> which is a single class version of LDA using gibbs sampling with
> slightly better code documentation.
> I am trying to understand the code while reading the paper you suggested
> -
> "Distributed Inference for Latent Drichlet Allocation".
>
> -----Original Message-----
> From: Daniel Kluesing [mailto:daniel@ilike-inc.com]
> Sent: Wednesday, June 04, 2008 8:31 PM
> To: mahout-dev@lucene.apache.org
> Subject: RE: LDA [was RE: Taste on Mahout]
>
> Ted may have a better one, but in my quick poking around at things
> http://gibbslda.sourceforge.net/ looks to be a good implementation of
> the Gibbs sampling approach.
>
> -----Original Message-----
> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> Sent: Wednesday, June 04, 2008 4:58 AM
> To: mahout-dev@lucene.apache.org
> Subject: RE: LDA [was RE: Taste on Mahout]
>
> Ted, Do you have a sequential version of LDA implementation that can be
> used for reference ?
> If yes, can you please post it on Jira ? Should we open a new Jira or
> use MAHOUT-30 for this ?
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Tuesday, May 27, 2008 11:50 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: LDA [was RE: Taste on Mahout]
>
> Chris Bishop's book has a very clear exposition of the relationship
> between the variational techniques and EM.  Very good reading.
>
> On Mon, May 26, 2008 at 10:13 PM, Goel, Ankur <An...@corp.aol.com>
> wrote:
>
> > Daniel/Ted,
> >      Thanks for the interesting pointers to more information on LDA
> > and EM.
> > I am going through the docs to visualize and understand how LDA
> > approach would work for my specific case.
> >
> > Once I have some idea, I can volunteer to work on the Map-Reduce side
> > of
> >
> > thngs as this is something that will benefit both my project and the
> > community.
> >
> > Looking forward to share more ideas/information on this :-)
> >
> > Regards
> > -Ankur
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: Tuesday, May 27, 2008 6:59 AM
> > To: mahout-dev@lucene.apache.org
> > Subject: Re: LDA [was RE: Taste on Mahout]
> >
> > Those are both new to me.  Both look interesting.  My own experience
> > is that the simplicity of the Gibb's sampling makes it very much more
> > attractive for implementation.  Also, since it is (nearly) trivially
> > parallelizable, it is more likely we will get a useful implementation
> > right off the bat.
> >
> > On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing
> > <da...@ilike-inc.com>
> > wrote:
> >
> > > (Hijacking the thread to discuss ways to implement LDA)
> > >
> > > Had you seen
> > > http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
> > > ?
> > >
> > > Their hierarchical distributed LDA formulation uses gibbs sampling
> > > and
> >
> > > fits into mapreduce.
> > >
> > > http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a mapreduce
> > formulation for the variational EM method.
> > >
> > > I'm still chewing on them, but my first impression is that the EM
> > > approach would give better performance on bigger data sets. Opposing
>
> > > views welcome.
> > >
> > >
> >
>
>
>
> --
> ted
>



-- 
ted

RE: LDA [was RE: Taste on Mahout]

Posted by "Goel, Ankur" <An...@corp.aol.com>.
It draws reference from Java implementation -
http://www.arbylon.net/projects/LdaGibbsSampler.java
which is a single class version of LDA using gibbs sampling with
slightly better code documentation.
I am trying to understand the code while reading the paper you suggested
- 
"Distributed Inference for Latent Drichlet Allocation".

-----Original Message-----
From: Daniel Kluesing [mailto:daniel@ilike-inc.com] 
Sent: Wednesday, June 04, 2008 8:31 PM
To: mahout-dev@lucene.apache.org
Subject: RE: LDA [was RE: Taste on Mahout]

Ted may have a better one, but in my quick poking around at things
http://gibbslda.sourceforge.net/ looks to be a good implementation of
the Gibbs sampling approach.  

-----Original Message-----
From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
Sent: Wednesday, June 04, 2008 4:58 AM
To: mahout-dev@lucene.apache.org
Subject: RE: LDA [was RE: Taste on Mahout]

Ted, Do you have a sequential version of LDA implementation that can be
used for reference ?
If yes, can you please post it on Jira ? Should we open a new Jira or
use MAHOUT-30 for this ?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Tuesday, May 27, 2008 11:50 AM
To: mahout-dev@lucene.apache.org
Subject: Re: LDA [was RE: Taste on Mahout]

Chris Bishop's book has a very clear exposition of the relationship
between the variational techniques and EM.  Very good reading.

On Mon, May 26, 2008 at 10:13 PM, Goel, Ankur <An...@corp.aol.com>
wrote:

> Daniel/Ted,
>      Thanks for the interesting pointers to more information on LDA 
> and EM.
> I am going through the docs to visualize and understand how LDA 
> approach would work for my specific case.
>
> Once I have some idea, I can volunteer to work on the Map-Reduce side 
> of
>
> thngs as this is something that will benefit both my project and the 
> community.
>
> Looking forward to share more ideas/information on this :-)
>
> Regards
> -Ankur
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Tuesday, May 27, 2008 6:59 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: LDA [was RE: Taste on Mahout]
>
> Those are both new to me.  Both look interesting.  My own experience 
> is that the simplicity of the Gibb's sampling makes it very much more 
> attractive for implementation.  Also, since it is (nearly) trivially 
> parallelizable, it is more likely we will get a useful implementation 
> right off the bat.
>
> On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing 
> <da...@ilike-inc.com>
> wrote:
>
> > (Hijacking the thread to discuss ways to implement LDA)
> >
> > Had you seen
> > http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
> > ?
> >
> > Their hierarchical distributed LDA formulation uses gibbs sampling 
> > and
>
> > fits into mapreduce.
> >
> > http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a mapreduce
> formulation for the variational EM method.
> >
> > I'm still chewing on them, but my first impression is that the EM 
> > approach would give better performance on bigger data sets. Opposing

> > views welcome.
> >
> >
>



--
ted

RE: LDA [was RE: Taste on Mahout]

Posted by Daniel Kluesing <da...@ilike-inc.com>.
Ted may have a better one, but in my quick poking around at things
http://gibbslda.sourceforge.net/ looks to be a good implementation of
the Gibbs sampling approach.  

-----Original Message-----
From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com] 
Sent: Wednesday, June 04, 2008 4:58 AM
To: mahout-dev@lucene.apache.org
Subject: RE: LDA [was RE: Taste on Mahout]

Ted, Do you have a sequential version of LDA implementation that can be
used for reference ?
If yes, can you please post it on Jira ? Should we open a new Jira or
use MAHOUT-30 for this ?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Tuesday, May 27, 2008 11:50 AM
To: mahout-dev@lucene.apache.org
Subject: Re: LDA [was RE: Taste on Mahout]

Chris Bishop's book has a very clear exposition of the relationship
between the variational techniques and EM.  Very good reading.

On Mon, May 26, 2008 at 10:13 PM, Goel, Ankur <An...@corp.aol.com>
wrote:

> Daniel/Ted,
>      Thanks for the interesting pointers to more information on LDA 
> and EM.
> I am going through the docs to visualize and understand how LDA 
> approach would work for my specific case.
>
> Once I have some idea, I can volunteer to work on the Map-Reduce side 
> of
>
> thngs as this is something that will benefit both my project and the 
> community.
>
> Looking forward to share more ideas/information on this :-)
>
> Regards
> -Ankur
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Tuesday, May 27, 2008 6:59 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: LDA [was RE: Taste on Mahout]
>
> Those are both new to me.  Both look interesting.  My own experience 
> is that the simplicity of the Gibb's sampling makes it very much more 
> attractive for implementation.  Also, since it is (nearly) trivially 
> parallelizable, it is more likely we will get a useful implementation 
> right off the bat.
>
> On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing 
> <da...@ilike-inc.com>
> wrote:
>
> > (Hijacking the thread to discuss ways to implement LDA)
> >
> > Had you seen
> > http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
> > ?
> >
> > Their hierarchical distributed LDA formulation uses gibbs sampling 
> > and
>
> > fits into mapreduce.
> >
> > http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a mapreduce
> formulation for the variational EM method.
> >
> > I'm still chewing on them, but my first impression is that the EM 
> > approach would give better performance on bigger data sets. Opposing

> > views welcome.
> >
> >
>



--
ted

RE: LDA [was RE: Taste on Mahout]

Posted by "Goel, Ankur" <An...@corp.aol.com>.
Ted, Do you have a sequential version of LDA implementation that can be
used for reference ?
If yes, can you please post it on Jira ? Should we open a new Jira or
use MAHOUT-30 for this ?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Tuesday, May 27, 2008 11:50 AM
To: mahout-dev@lucene.apache.org
Subject: Re: LDA [was RE: Taste on Mahout]

Chris Bishop's book has a very clear exposition of the relationship
between the variational techniques and EM.  Very good reading.

On Mon, May 26, 2008 at 10:13 PM, Goel, Ankur <An...@corp.aol.com>
wrote:

> Daniel/Ted,
>      Thanks for the interesting pointers to more information on LDA 
> and EM.
> I am going through the docs to visualize and understand how LDA 
> approach would work for my specific case.
>
> Once I have some idea, I can volunteer to work on the Map-Reduce side 
> of
>
> thngs as this is something that will benefit both my project and the 
> community.
>
> Looking forward to share more ideas/information on this :-)
>
> Regards
> -Ankur
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Tuesday, May 27, 2008 6:59 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: LDA [was RE: Taste on Mahout]
>
> Those are both new to me.  Both look interesting.  My own experience 
> is that the simplicity of the Gibb's sampling makes it very much more 
> attractive for implementation.  Also, since it is (nearly) trivially 
> parallelizable, it is more likely we will get a useful implementation 
> right off the bat.
>
> On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing 
> <da...@ilike-inc.com>
> wrote:
>
> > (Hijacking the thread to discuss ways to implement LDA)
> >
> > Had you seen
> > http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
> > ?
> >
> > Their hierarchical distributed LDA formulation uses gibbs sampling 
> > and
>
> > fits into mapreduce.
> >
> > http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a mapreduce
> formulation for the variational EM method.
> >
> > I'm still chewing on them, but my first impression is that the EM 
> > approach would give better performance on bigger data sets. Opposing

> > views welcome.
> >
> >
>



--
ted

Re: LDA [was RE: Taste on Mahout]

Posted by Ted Dunning <te...@gmail.com>.
Chris Bishop's book has a very clear exposition of the relationship between
the variational techniques and EM.  Very good reading.

On Mon, May 26, 2008 at 10:13 PM, Goel, Ankur <An...@corp.aol.com>
wrote:

> Daniel/Ted,
>      Thanks for the interesting pointers to more information on LDA and
> EM.
> I am going through the docs to visualize and understand how LDA approach
> would work for my specific case.
>
> Once I have some idea, I can volunteer to work on the Map-Reduce side of
>
> thngs as this is something that will benefit both my project and the
> community.
>
> Looking forward to share more ideas/information on this :-)
>
> Regards
> -Ankur
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Tuesday, May 27, 2008 6:59 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: LDA [was RE: Taste on Mahout]
>
> Those are both new to me.  Both look interesting.  My own experience is
> that the simplicity of the Gibb's sampling makes it very much more
> attractive for implementation.  Also, since it is (nearly) trivially
> parallelizable, it is more likely we will get a useful implementation
> right off the bat.
>
> On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing <da...@ilike-inc.com>
> wrote:
>
> > (Hijacking the thread to discuss ways to implement LDA)
> >
> > Had you seen
> > http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
> > ?
> >
> > Their hierarchical distributed LDA formulation uses gibbs sampling and
>
> > fits into mapreduce.
> >
> > http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a mapreduce
> formulation for the variational EM method.
> >
> > I'm still chewing on them, but my first impression is that the EM
> > approach would give better performance on bigger data sets. Opposing
> > views welcome.
> >
> >
>



-- 
ted

RE: LDA [was RE: Taste on Mahout]

Posted by "Goel, Ankur" <An...@corp.aol.com>.
Daniel/Ted,
      Thanks for the interesting pointers to more information on LDA and
EM. 
I am going through the docs to visualize and understand how LDA approach
would work for my specific case. 

Once I have some idea, I can volunteer to work on the Map-Reduce side of

things as this is something that will benefit both my project and the
community.

Looking forward to sharing more ideas/information on this :-)

Regards
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Tuesday, May 27, 2008 6:59 AM
To: mahout-dev@lucene.apache.org
Subject: Re: LDA [was RE: Taste on Mahout]

Those are both new to me.  Both look interesting.  My own experience is
that the simplicity of the Gibb's sampling makes it very much more
attractive for implementation.  Also, since it is (nearly) trivially
parallelizable, it is more likely we will get a useful implementation
right off the bat.

On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing <da...@ilike-inc.com>
wrote:

> (Hijacking the thread to discuss ways to implement LDA)
>
> Had you seen 
> http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
> ?
>
> Their hierarchical distributed LDA formulation uses gibbs sampling and

> fits into mapreduce.
>
> http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a mapreduce
formulation for the variational EM method.
>
> I'm still chewing on them, but my first impression is that the EM 
> approach would give better performance on bigger data sets. Opposing 
> views welcome.
>
>

Re: LDA [was RE: Taste on Mahout]

Posted by Ted Dunning <te...@gmail.com>.
Those are both new to me.  Both look interesting.  My own experience is that
the simplicity of Gibbs sampling makes it very much more attractive for
implementation.  Also, since it is (nearly) trivially parallelizable, it is
more likely we will get a useful implementation right off the bat.

On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing <da...@ilike-inc.com>
wrote:

> (Hijacking the thread to discuss ways to implement LDA)
>
> Had you seen http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
> ?
>
> Their hierarchical distributed LDA formulation uses gibbs sampling and
> fits into mapreduce.
>
> http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a
> mapreduce formulation for the variational EM method.
>
> I'm still chewing on them, but my first impression is that the EM
> approach would give better performance on bigger data sets. Opposing
> views welcome.
>
>

LDA [was RE: Taste on Mahout]

Posted by Daniel Kluesing <da...@ilike-inc.com>.
(Hijacking the thread to discuss ways to implement LDA) 

Had you seen http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
?

Their hierarchical distributed LDA formulation uses gibbs sampling and
fits into mapreduce.

http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a
mapreduce formulation for the variational EM method. 

I'm still chewing on them, but my first impression is that the EM
approach would give better performance on bigger data sets. Opposing
views welcome.

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Sunday, May 25, 2008 12:11 AM
To: mahout-dev@lucene.apache.org
Subject: Re: Taste on Mahout

Buntine and Jakulin provide even farther ranging common structure
between LDA, pLSI, LSI and non-negative matrix factorizations in
http://citeseer.ist.psu.edu/750239.html

They also provide a much simpler algorithm for estimating parameters
that is (to my mind) simpler to implement in map-reduce than the
variational optimization of Jordan et al.

I am very interested in helping with a good LDA map-reduce
implementation.
My time constraints limit how much actual code I can generate for the
implementation, but I would still like to help in whatever small way I
might be able to .

On Sat, May 24, 2008 at 2:27 PM, Daniel Kluesing <da...@ilike-inc.com>
wrote:

> LDA is a proper 'generative model', while pLSI 'fakes' being a 
> generative model. From a generative model you have the full 
> probability distribution of all variables, this matters when you're 
> working with new unseen data.
>
> You may find 
>
http://www.cs.bham.ac.uk/~axk/sigir2003_mgak.pdf a good comparison, says that pLSI is a
special case of LDA.
>
> If anybody is working on/interesting working on a mapreduce LDA 
> implementation for mahout, I'd love to chat with you.
>
>
> -----Original Message-----
> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> Sent: Thursday, May 22, 2008 5:31 AM
> To: mahout-dev@lucene.apache.org
> Subject: RE: Taste on Mahout
>
> Hey Ted,
>        I read the paper on LDA
> (http://citeseer.ist.psu.edu/blei03latent.html) and I have to admit I 
> could not understand how LDA would be any different than PLSI for the 
> problem setting that I have (user-click history for various users and 
> urls). Maybe it's my limited statistical knowledge and ML background 
> but I am making best efforts to learn things as they come along.
>
> I found the notations to be quite complex and it would be nice if you 
> could point me to a source offering simpler explanation of LDA model 
> parameters and their estimation methods as after reading the paper I 
> could not map those methods into my problem setting.
>
> Since I already have some understading of PLSI and Expectation 
> Maximization, an explanation describing the role of additional model 
> parameters and their estimation method would suffice. Maybe that's 
> something you could help me with offline.
>
> Thanks
> -Ankur
>
>
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Wednesday, May 21, 2008 10:24 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Taste on Mahout
>
> My suggestion is to build a class of probabilistic models of what 
> people click on.  You can build some number of models as necessary to 
> describe your users' histories well.
>
> These model will give you the answers you need.
>
> I can talk this evening a bit about how to do this.  If you want to 
> read up on it ahead of time, take a look at 
> http://citeseer.ist.psu.edu/750239.html and
> http://citeseer.ist.psu.edu/blei03latent.html
>
> (hint: consider each person a document and a thing to be clicked as a
> word)
>
> On Wed, May 21, 2008 at 4:36 AM, Goel, Ankur <An...@corp.aol.com>
> wrote:
>
> > Hey Sean,
> >          Thanks for the suggestions. In my case the data-set is only

> > going to tell me if the user clicked on a particular item. So let's

> > say there are 10,000 items a user might only have clicked 20 - 30 
> > items. I was thinking more on the lines of building an item 
> > similarity
>
> > table by comparing each item with every other item and retaining 
> > only 100 top items decayed by time.
> >
> > So a recommender for a user would use his recent browsing history to

> > figure out top 10 or 20 most similar items.
> >
> > The approach is documented in Toby Segaran's "Collective
Intelligence"
> > book and looks simple to implement even though it is costly since 
> > every item needs to be compared with every other item. This can be 
> > parallelized in a way that for M items in a cluster of N machines, 
> > each node has to compare M/N items to M items. Since the data-set is

> > going to be sparse (no. of items having common users), I believe this 
> > wouldn't be overwhelming for the cluster.
> >
> > The other approach that I am thinking to reduce the computation cost

> > is to use a clustering algorithm like K-Means that's available in 
> > Mahout to cluster similar user/items together and then use 
> > clustering information to make recommendations.
> >
> > Any suggestions?
> >
> > Thanks
> > -Ankur
> >
> >
> > -----Original Message-----
> > From: Sean Owen [mailto:srowen@gmail.com]
> > Sent: Tuesday, May 20, 2008 9:37 PM
> > To: mahout-dev@lucene.apache.org; Goel, Ankur
> > Subject: Re: Taste on Mahout
> >
> > + Ankur directly, since I am not sure you are on the dev list.
> >
> > On Tue, May 20, 2008 at 12:06 PM, Sean Owen <sr...@gmail.com>
wrote:
> > > All of the algorithms assume a world where you have a continuous 
> > > range
> >
> > > of ratings from users for items. Obviously a binary yes/no rating 
> > > can be mapped into that trivially -- 1 and -1 for example. This 
> > > causes some issues, most notably for corrletion-based recommenders

> > > where the correlation can be undefined between two items/users in 
> > > special cases that arise from this kind of input -- for example if

> > > we overlap in rating 3 items and I voted "yes" for all 3, then no 
> > > correlation can be
> >
> > > defined.
> > >
> > > Slope one doesn't run into this particular mathematical wrinkle.
> > >
> > > Also, methods like estimatePreference() are not going to give you 
> > > estimates that are always 1 or -1. Again, you could map this back 
> > > onto
> > > 1 / -1 by rounding or something, just something to note.
> > >
> > > So, in general it will be better if you can map whatever input you

> > > have onto a larger range of input. You will feed more information 
> > > in, in this way, as well. For example, maybe you call a recent
"yes"
> > > rating a +2, and a recent "no" a -2, and others +1 and -1.
> > >
> > >
> > > The part of slope one that parallelizes very well is the computing

> > > of the item-item diffs. No I have not written this yet.
> > >
> > >
> > > I have committed a first cut at a framework for computing 
> > > recommendations in parallel for any recommender. Dig in to 
> > > org.apache.mahout.cf.taste.impl.hadoop. In general, none of the 
> > > existing recommenders can be parallelized, because they generally 
> > > need
> >
> > > access to all the data to produce any recommendation.
> > >
> > > But, we can take partial advantage of Hadoop by simply 
> > > parallelizing
>
> > > the computation of recommendations for many users across multiple 
> > > identical recommender instances. Better than nothing. In this 
> > > situation, one of the map or reduce phase is trivial.
> > >
> > > That is what I have committed so far and it works, locally. I am 
> > > in the middle of figuring out how to write it for real use on a 
> > > remote Hadoop cluster, and how I would go about testing that!
> > >
> > > Do we have any test bed available?
> > >
> > >
> > >
> > > On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur 
> > > <An...@corp.aol.com>
> > wrote:
> > >> I just realized after going through the wikipedia that slope one 
> > >> is
>
> > >> applicable when you have ratings for the items.
> > >> In my case, I would be simply working with binary data (Item was 
> > >> clicked or not-clicked by user) using Tanimoto coefficient to 
> > >> calculate item similarity.
> > >> The idea is to capture the simple intuition "What items have been

> > >> visited most along with this item".
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> > >> Sent: Tuesday, May 20, 2008 2:51 PM
> > >> To: mahout-dev@lucene.apache.org
> > >> Subject: RE: Taste on Mahout
> > >>
> > >>
> > >> Hey Sean,
> > >>       I actually plan to use slope-one to start with since
> > >> - Its simple and known to work well.
> > >> - Can be parallelized nicely into the Map-Reduce style.
> > >> I also plan to use Tanimoto coefficient for item-item diffs.
> > >>
> > >> Do we have something on slope-one already in Taste as a part of
> > Mahout ?
> > >>
> > >> At the moment I am going through the available documentation on 
> > >> Taste
> >
> > >> and code that's present in Mahout.
> > >>
> > >> Your suggestions would be greatly appreciated.
> > >>
> > >> Thanks
> > >> -Ankur
> > >>
> > >> -----Original Message-----
> > >> From: Sean Owen [mailto:srowen@gmail.com]
> > >> Sent: Tuesday, April 29, 2008 11:09 PM
> > >> To: mahout-dev@lucene.apache.org; Goel, Ankur
> > >> Subject: Re: Taste on Mahout
> > >>
> > >> I have some Hadoop code mostly ready to go for Taste.
> > >>
> > >> The first thing to do is let you generate recommendations for all

> > >> your users via Hadoop. Unfortunately none of the recommenders 
> > >> truly
>
> > >> parallelize in the way that MapReduce needs it to -- you need all

> > >> data to compute any recommendation really -- but you can at least

> > >> get
> >
> > >> paralellization out of this. You can use the framework to run n 
> > >> recommenders, each computing 1/nth of all recommendations.
> > >>
> > >> The next application is specific to slope-one. Computing the 
> > >> item-item diffs is exactly the kind of thing that MapReduce is 
> > >> good
>
> > >> for, so, writing a Hadoop job to do this seems like a no-brainer.
> > >>
> > >> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur 
> > >> <An...@corp.aol.com>
> > >> wrote:
> > >>> Hi Folks,
> > >>>       What's the status of hadoopifying Taste on Mahout ?
> > >>>  What's been done and what is in progress/pending ?
> > >>>
> > >>>  I am looking using a scalable version of Taste for my project.
> > >>>  So basically trying to figure out what's already done and where

> > >>> I
>
> > >>> can pitch in.
> > >>>
> > >>>  Thanks
> > >>>  -Ankur
> > >>>
> > >>
> > >
> >
>
>
>
> --
> ted
>



--
ted

Re: Taste on Mahout

Posted by Ted Dunning <te...@gmail.com>.
Buntine and Jakulin describe an even broader common structure shared by
LDA, pLSI, LSI and non-negative matrix factorization in
http://citeseer.ist.psu.edu/750239.html

They also provide a much simpler algorithm for estimating the parameters,
one that is (to my mind) easier to implement in map-reduce than the
variational optimization of Jordan et al.

I am very interested in helping with a good LDA map-reduce implementation.
My time constraints limit how much actual code I can generate for the
implementation, but I would still like to help in whatever small way I can.
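
A rough sketch of the map/reduce shape this kind of estimation takes
(plain Java, not Mahout code; the class and method names are invented for
illustration, and a mixture-of-unigrams E-step stands in for the full
LDA/DCA update just to keep it short). The point is the split: a map call
needs only the current parameters plus one document and emits expected
topic counts, and the reduce side just sums those counts before the
driver renormalizes them into the next parameter estimate.

    import java.util.HashMap;
    import java.util.Map;

    public class TopicCountsSketch {

      /** "Map" step: expected (topic, word) counts for one document. */
      public static Map<String, double[]> expectedCounts(
          Map<String, Integer> docWordCounts,  // word -> count in this document
          Map<String, double[]> phi,           // word -> P(word | topic), one value per topic
          double[] topicPrior,                 // P(topic)
          int numTopics) {
        // Posterior over topics for this document, computed in log space.
        // Assumes every word in the document has an entry in phi.
        double[] logPost = new double[numTopics];
        for (int k = 0; k < numTopics; k++) {
          logPost[k] = Math.log(topicPrior[k]);
          for (Map.Entry<String, Integer> e : docWordCounts.entrySet()) {
            logPost[k] += e.getValue() * Math.log(phi.get(e.getKey())[k]);
          }
        }
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logPost) {
          max = Math.max(max, v);
        }
        double norm = 0.0;
        double[] post = new double[numTopics];
        for (int k = 0; k < numTopics; k++) {
          post[k] = Math.exp(logPost[k] - max);
          norm += post[k];
        }
        for (int k = 0; k < numTopics; k++) {
          post[k] /= norm;
        }
        // Expected counts to emit; a reducer would sum these across all
        // documents and the driver would renormalize them into the next phi.
        Map<String, double[]> out = new HashMap<String, double[]>();
        for (Map.Entry<String, Integer> e : docWordCounts.entrySet()) {
          double[] counts = new double[numTopics];
          for (int k = 0; k < numTopics; k++) {
            counts[k] = post[k] * e.getValue();
          }
          out.put(e.getKey(), counts);
        }
        return out;
      }
    }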

On Sat, May 24, 2008 at 2:27 PM, Daniel Kluesing <da...@ilike-inc.com>
wrote:

> LDA is a proper 'generative model', while pLSI 'fakes' being a
> generative model. From a generative model you have the full probability
> distribution of all variables, this matters when you're working with new
> unseen data.
>
> You may find http://www.cs.bham.ac.uk/~axk/sigir2003_mgak.pdf a good
> comparison; it says that pLSI is a special case of LDA.
>
> If anybody is working on/interesting working on a mapreduce LDA
> implementation for mahout, I'd love to chat with you.
>
>
> -----Original Message-----
> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> Sent: Thursday, May 22, 2008 5:31 AM
> To: mahout-dev@lucene.apache.org
> Subject: RE: Taste on Mahout
>
> Hey Ted,
>        I read the paper on LDA
> (http://citeseer.ist.psu.edu/blei03latent.html) and I have to admit I
> could not understand how LDA would be any different than PLSI for the
> problem setting that I have (user-click history for various users and
> urls). May be its my limited statistical knowledge and ML background but
> I am making best efforts to learn things as they come along.
>
> I found the notations to be quite complex and it would be nice if you
> could point me to a source offering simpler explanation of LDA model
> parameters and their estimation methods as after reading the paper I
> could not map those methods into my problem setting.
>
> Since I already have some understading of PLSI and Expectation
> Maximization, an explanation describing the role of additional model
> parameters and their estimation method would suffice. May be that's
> something you could help me with offline.
>
> Thanks
> -Ankur
>
>
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Wednesday, May 21, 2008 10:24 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Taste on Mahout
>
> My suggestion is to build a class of probabilistic models of what people
> click on.  You can build some number of models as necessary to describe
> your users' histories well.
>
> These model will give you the answers you need.
>
> I can talk this evening a bit about how to do this.  If you want to read
> up on it ahead of time, take a look at
> http://citeseer.ist.psu.edu/750239.html and
> http://citeseer.ist.psu.edu/blei03latent.html
>
> (hint: consider each person a document and a thing to be clicked as a
> word)
>
> On Wed, May 21, 2008 at 4:36 AM, Goel, Ankur <An...@corp.aol.com>
> wrote:
>
> > Hey Sean,
> >          Thanks for the suggestions. In my case the data-set os only
> > going to tell me if the useer clicked on a particualar item. So lets
> > say there are 10,000 items a user might only have clicked 20 - 30
> > items. I was thinking more on the lines of building an item similarity
>
> > table by comparing each item with every other item and retaining only
> > 100 top items decayed by time.
> >
> > So a recommender for a user would use his recent browsing history to
> > figure out top 10 or 20 most similar items.
> >
> > The approach is documented in Toby Segaran's "Collective Intelligence"
> > book and looks simple to implement even though it is costly since
> > every item needs to be compared with every other item. This can be
> > parallelized in way that for M items in a cluster of N machines, each
> > node has to compare M/N items to M items. Since the data-set is going
> > to sparse (no. of items having common users), I believe this would'nt
> > be overwhelming for the cluster.
> >
> > The other approach that I am thinking to reduce the computation cost
> > is to use a clustering algorithm like K-Means that's available in
> > Mahout to cluster similar user/items together and then use clustering
> > information to make recommendations.
> >
> > Any suggestions?
> >
> > Thanks
> > -Ankur
> >
> >
> > -----Original Message-----
> > From: Sean Owen [mailto:srowen@gmail.com]
> > Sent: Tuesday, May 20, 2008 9:37 PM
> > To: mahout-dev@lucene.apache.org; Goel, Ankur
> > Subject: Re: Taste on Mahout
> >
> > + Ankur directly, since I am not sure you are on the dev list.
> >
> > On Tue, May 20, 2008 at 12:06 PM, Sean Owen <sr...@gmail.com> wrote:
> > > All of the algorithms assume a world where you have a continuous
> > > range
> >
> > > of ratings from users for items. Obviously a binary yes/no rating
> > > can be mapped into that trivially -- 1 and -1 for example. This
> > > causes some issues, most notably for corrletion-based recommenders
> > > where the correlation can be undefined between two items/users in
> > > special cases that arise from this kind of input -- for example if
> > > we overlap in rating 3 items and I voted "yes" for all 3, then no
> > > correlation can be
> >
> > > defined.
> > >
> > > Slope one doesn't run into this particular mathematical wrinkle.
> > >
> > > Also, methods like estimatePreference() are not going to give you
> > > estimates that are always 1 or -1. Again, you could map this back
> > > onto
> > > 1 / -1 by rounding or something, just something to note.
> > >
> > > So, in general it will be better if you can map whatever input you
> > > have onto a larger range of input. You will feed more information
> > > in, in this way, as well. For example, maybe you call a recent "yes"
> > > rating a +2, and a recent "no" a -2, and others +1 and -1.
> > >
> > >
> > > The part of slope one that parallelizes very well is the computing
> > > of the item-item diffs. No I have not written this yet.
> > >
> > >
> > > I have committed a first cut at a framework for computing
> > > recommendations in parallel for any recommender. Dig in to
> > > org.apache.mahout.cf.taste.impl.hadoop. In general, none of the
> > > existing recommenders can be parallelized, because they generally
> > > need
> >
> > > access to all the data to produce any recommendation.
> > >
> > > But, we can take partial advantage of Hadoop by simply parallelizing
>
> > > the computation of recommendations for many users across multiple
> > > identical recommender instances. Better than nothing. In this
> > > situation, one of the map or reduce phase is trivial.
> > >
> > > That is what I have committed so far and it works, locally. I am in
> > > the middle of figuring out how to write it for real use on a remote
> > > Hadoop cluster, and how I would go about testing that!
> > >
> > > Do we have any test bed available?
> > >
> > >
> > >
> > > On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur
> > > <An...@corp.aol.com>
> > wrote:
> > >> I just realized after going through the wikipedia that slope one is
>
> > >> applicable when you have ratings for the items.
> > >> In my case, I would be simply working with binary data (Item was
> > >> clicked or not-clicked by user) using Tanimoto coefficient to
> > >> calculate item similarity.
> > >> The idea is to capture the simple intuition "What items have been
> > >> visited most along with this item".
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> > >> Sent: Tuesday, May 20, 2008 2:51 PM
> > >> To: mahout-dev@lucene.apache.org
> > >> Subject: RE: Taste on Mahout
> > >>
> > >>
> > >> Hey Sean,
> > >>       I actually plan to use slope-one to start with since
> > >> - Its simple and known to work well.
> > >> - Can be parallelized nicely into the Map-Reduce style.
> > >> I also plan to use Tanimoto coefficient for item-item diffs.
> > >>
> > >> Do we have something on slope-one already in Taste as a part of
> > Mahout ?
> > >>
> > >> At the moment I am going through the available documentation on
> > >> Taste
> >
> > >> and code that's present in Mahout.
> > >>
> > >> Your suggestions would be greatly appreciated.
> > >>
> > >> Thanks
> > >> -Ankur
> > >>
> > >> -----Original Message-----
> > >> From: Sean Owen [mailto:srowen@gmail.com]
> > >> Sent: Tuesday, April 29, 2008 11:09 PM
> > >> To: mahout-dev@lucene.apache.org; Goel, Ankur
> > >> Subject: Re: Taste on Mahout
> > >>
> > >> I have some Hadoop code mostly ready to go for Taste.
> > >>
> > >> The first thing to do is let you generate recommendations for all
> > >> your users via Hadoop. Unfortunately none of the recommenders truly
>
> > >> parallelize in the way that MapReduce needs it to -- you need all
> > >> data to compute any recommendation really -- but you can at least
> > >> get
> >
> > >> paralellization out of this. You can use the framework to run n
> > >> recommenders, each computing 1/nth of all recommendations.
> > >>
> > >> The next application is specific to slope-one. Computing the
> > >> item-item diffs is exactly the kind of thing that MapReduce is good
>
> > >> for, so, writing a Hadoop job to do this seems like a no-brainer.
> > >>
> > >> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur
> > >> <An...@corp.aol.com>
> > >> wrote:
> > >>> Hi Folks,
> > >>>       What's the status of hadoopifying Taste on Mahout ?
> > >>>  What's been done and what is in progress/pending ?
> > >>>
> > >>>  I am looking using a scalable version of Taste for my project.
> > >>>  So basically trying to figure out what's already done and where I
>
> > >>> can pitch in.
> > >>>
> > >>>  Thanks
> > >>>  -Ankur
> > >>>
> > >>
> > >
> >
>
>
>
> --
> ted
>



-- 
ted

RE: Taste on Mahout

Posted by Daniel Kluesing <da...@ilike-inc.com>.
LDA is a proper 'generative model', while pLSI 'fakes' being a
generative model. From a generative model you have the full probability
distribution of all variables, which matters when you're working with
new, unseen data.

You may find http://www.cs.bham.ac.uk/~axk/sigir2003_mgak.pdf a good
comparison; it says that pLSI is a special case of LDA.

If anybody is working on, or interested in working on, a map-reduce LDA
implementation for Mahout, I'd love to chat with you.
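
Roughly, in the standard notation (a gloss on why this matters for unseen
data, not a quote from either paper): pLSI models the joint over the
training corpus itself,

    p(d, w) = p(d) \sum_z p(z | d) p(w | z),

where the mixing weights p(z | d) are free parameters indexed by the
training documents d, so there is no principled way to assign a
probability to a document outside the training set. LDA replaces those
per-document parameters with a draw from a prior,

    \theta_d ~ Dirichlet(\alpha),  z_n ~ Multinomial(\theta_d),
    w_n ~ Multinomial(\beta_{z_n}),

so a new, unseen document gets a well-defined probability by integrating
over \theta_d.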


-----Original Message-----
From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com] 
Sent: Thursday, May 22, 2008 5:31 AM
To: mahout-dev@lucene.apache.org
Subject: RE: Taste on Mahout

Hey Ted,
        I read the paper on LDA
(http://citeseer.ist.psu.edu/blei03latent.html) and I have to admit I
could not understand how LDA would be any different than PLSI for the
problem setting that I have (user-click history for various users and
urls). May be its my limited statistical knowledge and ML background but
I am making best efforts to learn things as they come along. 

I found the notations to be quite complex and it would be nice if you
could point me to a source offering simpler explanation of LDA model
parameters and their estimation methods as after reading the paper I
could not map those methods into my problem setting.

Since I already have some understading of PLSI and Expectation
Maximization, an explanation describing the role of additional model
parameters and their estimation method would suffice. May be that's
something you could help me with offline.

Thanks
-Ankur



-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Wednesday, May 21, 2008 10:24 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Taste on Mahout

My suggestion is to build a class of probabilistic models of what people
click on.  You can build some number of models as necessary to describe
your users' histories well.

These model will give you the answers you need.

I can talk this evening a bit about how to do this.  If you want to read
up on it ahead of time, take a look at
http://citeseer.ist.psu.edu/750239.html and
http://citeseer.ist.psu.edu/blei03latent.html

(hint: consider each person a document and a thing to be clicked as a
word)

On Wed, May 21, 2008 at 4:36 AM, Goel, Ankur <An...@corp.aol.com>
wrote:

> Hey Sean,
>          Thanks for the suggestions. In my case the data-set os only 
> going to tell me if the useer clicked on a particualar item. So lets 
> say there are 10,000 items a user might only have clicked 20 - 30 
> items. I was thinking more on the lines of building an item similarity

> table by comparing each item with every other item and retaining only 
> 100 top items decayed by time.
>
> So a recommender for a user would use his recent browsing history to 
> figure out top 10 or 20 most similar items.
>
> The approach is documented in Toby Segaran's "Collective Intelligence"
> book and looks simple to implement even though it is costly since 
> every item needs to be compared with every other item. This can be 
> parallelized in way that for M items in a cluster of N machines, each 
> node has to compare M/N items to M items. Since the data-set is going 
> to sparse (no. of items having common users), I believe this would'nt 
> be overwhelming for the cluster.
>
> The other approach that I am thinking to reduce the computation cost 
> is to use a clustering algorithm like K-Means that's available in 
> Mahout to cluster similar user/items together and then use clustering 
> information to make recommendations.
>
> Any suggestions?
>
> Thanks
> -Ankur
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Tuesday, May 20, 2008 9:37 PM
> To: mahout-dev@lucene.apache.org; Goel, Ankur
> Subject: Re: Taste on Mahout
>
> + Ankur directly, since I am not sure you are on the dev list.
>
> On Tue, May 20, 2008 at 12:06 PM, Sean Owen <sr...@gmail.com> wrote:
> > All of the algorithms assume a world where you have a continuous 
> > range
>
> > of ratings from users for items. Obviously a binary yes/no rating 
> > can be mapped into that trivially -- 1 and -1 for example. This 
> > causes some issues, most notably for corrletion-based recommenders 
> > where the correlation can be undefined between two items/users in 
> > special cases that arise from this kind of input -- for example if 
> > we overlap in rating 3 items and I voted "yes" for all 3, then no 
> > correlation can be
>
> > defined.
> >
> > Slope one doesn't run into this particular mathematical wrinkle.
> >
> > Also, methods like estimatePreference() are not going to give you 
> > estimates that are always 1 or -1. Again, you could map this back 
> > onto
> > 1 / -1 by rounding or something, just something to note.
> >
> > So, in general it will be better if you can map whatever input you 
> > have onto a larger range of input. You will feed more information 
> > in, in this way, as well. For example, maybe you call a recent "yes"
> > rating a +2, and a recent "no" a -2, and others +1 and -1.
> >
> >
> > The part of slope one that parallelizes very well is the computing 
> > of the item-item diffs. No I have not written this yet.
> >
> >
> > I have committed a first cut at a framework for computing 
> > recommendations in parallel for any recommender. Dig in to 
> > org.apache.mahout.cf.taste.impl.hadoop. In general, none of the 
> > existing recommenders can be parallelized, because they generally 
> > need
>
> > access to all the data to produce any recommendation.
> >
> > But, we can take partial advantage of Hadoop by simply parallelizing

> > the computation of recommendations for many users across multiple 
> > identical recommender instances. Better than nothing. In this 
> > situation, one of the map or reduce phase is trivial.
> >
> > That is what I have committed so far and it works, locally. I am in 
> > the middle of figuring out how to write it for real use on a remote 
> > Hadoop cluster, and how I would go about testing that!
> >
> > Do we have any test bed available?
> >
> >
> >
> > On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur 
> > <An...@corp.aol.com>
> wrote:
> >> I just realized after going through the wikipedia that slope one is

> >> applicable when you have ratings for the items.
> >> In my case, I would be simply working with binary data (Item was 
> >> clicked or not-clicked by user) using Tanimoto coefficient to 
> >> calculate item similarity.
> >> The idea is to capture the simple intuition "What items have been 
> >> visited most along with this item".
> >>
> >>
> >> -----Original Message-----
> >> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> >> Sent: Tuesday, May 20, 2008 2:51 PM
> >> To: mahout-dev@lucene.apache.org
> >> Subject: RE: Taste on Mahout
> >>
> >>
> >> Hey Sean,
> >>       I actually plan to use slope-one to start with since
> >> - Its simple and known to work well.
> >> - Can be parallelized nicely into the Map-Reduce style.
> >> I also plan to use Tanimoto coefficient for item-item diffs.
> >>
> >> Do we have something on slope-one already in Taste as a part of
> Mahout ?
> >>
> >> At the moment I am going through the available documentation on 
> >> Taste
>
> >> and code that's present in Mahout.
> >>
> >> Your suggestions would be greatly appreciated.
> >>
> >> Thanks
> >> -Ankur
> >>
> >> -----Original Message-----
> >> From: Sean Owen [mailto:srowen@gmail.com]
> >> Sent: Tuesday, April 29, 2008 11:09 PM
> >> To: mahout-dev@lucene.apache.org; Goel, Ankur
> >> Subject: Re: Taste on Mahout
> >>
> >> I have some Hadoop code mostly ready to go for Taste.
> >>
> >> The first thing to do is let you generate recommendations for all 
> >> your users via Hadoop. Unfortunately none of the recommenders truly

> >> parallelize in the way that MapReduce needs it to -- you need all 
> >> data to compute any recommendation really -- but you can at least 
> >> get
>
> >> paralellization out of this. You can use the framework to run n 
> >> recommenders, each computing 1/nth of all recommendations.
> >>
> >> The next application is specific to slope-one. Computing the 
> >> item-item diffs is exactly the kind of thing that MapReduce is good

> >> for, so, writing a Hadoop job to do this seems like a no-brainer.
> >>
> >> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur 
> >> <An...@corp.aol.com>
> >> wrote:
> >>> Hi Folks,
> >>>       What's the status of hadoopifying Taste on Mahout ?
> >>>  What's been done and what is in progress/pending ?
> >>>
> >>>  I am looking using a scalable version of Taste for my project.
> >>>  So basically trying to figure out what's already done and where I

> >>> can pitch in.
> >>>
> >>>  Thanks
> >>>  -Ankur
> >>>
> >>
> >
>



--
ted

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
Yup, the whole "probabilistic model" thing I have never implemented.
It's the major gap in what is supported. Thank you for the reference;
probably time to get into this.

On Wed, May 21, 2008 at 12:54 PM, Ted Dunning <te...@gmail.com> wrote:
> My suggestion is to build a class of probabilistic models of what people
> click on.  You can build some number of models as necessary to describe your
> users' histories well.
>
> These model will give you the answers you need.
>
> I can talk this evening a bit about how to do this.  If you want to read up
> on it ahead of time, take a look at http://citeseer.ist.psu.edu/750239.html and
> http://citeseer.ist.psu.edu/blei03latent.html
>
> (hint: consider each person a document and a thing to be clicked as a word)

RE: Taste on Mahout

Posted by "Goel, Ankur" <An...@corp.aol.com>.
Hey Ted,
        I read the paper on LDA
(http://citeseer.ist.psu.edu/blei03latent.html) and I have to admit I
could not understand how LDA would be any different from PLSI for the
problem setting that I have (user-click history for various users and
URLs). Maybe it's my limited statistical knowledge and ML background, but
I am making my best effort to learn things as they come along.

I found the notation quite complex, and it would be nice if you could
point me to a source offering a simpler explanation of the LDA model
parameters and their estimation methods; after reading the paper I could
not map those methods onto my problem setting.

Since I already have some understanding of PLSI and Expectation
Maximization, an explanation describing the role of the additional model
parameters and their estimation method would suffice. Maybe that's
something you could help me with offline.

Thanks
-Ankur



-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Wednesday, May 21, 2008 10:24 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Taste on Mahout

My suggestion is to build a class of probabilistic models of what people
click on.  You can build some number of models as necessary to describe
your users' histories well.

These model will give you the answers you need.

I can talk this evening a bit about how to do this.  If you want to read
up on it ahead of time, take a look at
http://citeseer.ist.psu.edu/750239.html and
http://citeseer.ist.psu.edu/blei03latent.html

(hint: consider each person a document and a thing to be clicked as a
word)

On Wed, May 21, 2008 at 4:36 AM, Goel, Ankur <An...@corp.aol.com>
wrote:

> Hey Sean,
>          Thanks for the suggestions. In my case the data-set os only 
> going to tell me if the useer clicked on a particualar item. So lets 
> say there are 10,000 items a user might only have clicked 20 - 30 
> items. I was thinking more on the lines of building an item similarity

> table by comparing each item with every other item and retaining only 
> 100 top items decayed by time.
>
> So a recommender for a user would use his recent browsing history to 
> figure out top 10 or 20 most similar items.
>
> The approach is documented in Toby Segaran's "Collective Intelligence"
> book and looks simple to implement even though it is costly since 
> every item needs to be compared with every other item. This can be 
> parallelized in way that for M items in a cluster of N machines, each 
> node has to compare M/N items to M items. Since the data-set is going 
> to sparse (no. of items having common users), I believe this would'nt 
> be overwhelming for the cluster.
>
> The other approach that I am thinking to reduce the computation cost 
> is to use a clustering algorithm like K-Means that's available in 
> Mahout to cluster similar user/items together and then use clustering 
> information to make recommendations.
>
> Any suggestions?
>
> Thanks
> -Ankur
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Tuesday, May 20, 2008 9:37 PM
> To: mahout-dev@lucene.apache.org; Goel, Ankur
> Subject: Re: Taste on Mahout
>
> + Ankur directly, since I am not sure you are on the dev list.
>
> On Tue, May 20, 2008 at 12:06 PM, Sean Owen <sr...@gmail.com> wrote:
> > All of the algorithms assume a world where you have a continuous 
> > range
>
> > of ratings from users for items. Obviously a binary yes/no rating 
> > can be mapped into that trivially -- 1 and -1 for example. This 
> > causes some issues, most notably for corrletion-based recommenders 
> > where the correlation can be undefined between two items/users in 
> > special cases that arise from this kind of input -- for example if 
> > we overlap in rating 3 items and I voted "yes" for all 3, then no 
> > correlation can be
>
> > defined.
> >
> > Slope one doesn't run into this particular mathematical wrinkle.
> >
> > Also, methods like estimatePreference() are not going to give you 
> > estimates that are always 1 or -1. Again, you could map this back 
> > onto
> > 1 / -1 by rounding or something, just something to note.
> >
> > So, in general it will be better if you can map whatever input you 
> > have onto a larger range of input. You will feed more information 
> > in, in this way, as well. For example, maybe you call a recent "yes"
> > rating a +2, and a recent "no" a -2, and others +1 and -1.
> >
> >
> > The part of slope one that parallelizes very well is the computing 
> > of the item-item diffs. No I have not written this yet.
> >
> >
> > I have committed a first cut at a framework for computing 
> > recommendations in parallel for any recommender. Dig in to 
> > org.apache.mahout.cf.taste.impl.hadoop. In general, none of the 
> > existing recommenders can be parallelized, because they generally 
> > need
>
> > access to all the data to produce any recommendation.
> >
> > But, we can take partial advantage of Hadoop by simply parallelizing

> > the computation of recommendations for many users across multiple 
> > identical recommender instances. Better than nothing. In this 
> > situation, one of the map or reduce phase is trivial.
> >
> > That is what I have committed so far and it works, locally. I am in 
> > the middle of figuring out how to write it for real use on a remote 
> > Hadoop cluster, and how I would go about testing that!
> >
> > Do we have any test bed available?
> >
> >
> >
> > On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur 
> > <An...@corp.aol.com>
> wrote:
> >> I just realized after going through the wikipedia that slope one is

> >> applicable when you have ratings for the items.
> >> In my case, I would be simply working with binary data (Item was 
> >> clicked or not-clicked by user) using Tanimoto coefficient to 
> >> calculate item similarity.
> >> The idea is to capture the simple intuition "What items have been 
> >> visited most along with this item".
> >>
> >>
> >> -----Original Message-----
> >> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> >> Sent: Tuesday, May 20, 2008 2:51 PM
> >> To: mahout-dev@lucene.apache.org
> >> Subject: RE: Taste on Mahout
> >>
> >>
> >> Hey Sean,
> >>       I actually plan to use slope-one to start with since
> >> - Its simple and known to work well.
> >> - Can be parallelized nicely into the Map-Reduce style.
> >> I also plan to use Tanimoto coefficient for item-item diffs.
> >>
> >> Do we have something on slope-one already in Taste as a part of
> Mahout ?
> >>
> >> At the moment I am going through the available documentation on 
> >> Taste
>
> >> and code that's present in Mahout.
> >>
> >> Your suggestions would be greatly appreciated.
> >>
> >> Thanks
> >> -Ankur
> >>
> >> -----Original Message-----
> >> From: Sean Owen [mailto:srowen@gmail.com]
> >> Sent: Tuesday, April 29, 2008 11:09 PM
> >> To: mahout-dev@lucene.apache.org; Goel, Ankur
> >> Subject: Re: Taste on Mahout
> >>
> >> I have some Hadoop code mostly ready to go for Taste.
> >>
> >> The first thing to do is let you generate recommendations for all 
> >> your users via Hadoop. Unfortunately none of the recommenders truly

> >> parallelize in the way that MapReduce needs it to -- you need all 
> >> data to compute any recommendation really -- but you can at least 
> >> get
>
> >> paralellization out of this. You can use the framework to run n 
> >> recommenders, each computing 1/nth of all recommendations.
> >>
> >> The next application is specific to slope-one. Computing the 
> >> item-item diffs is exactly the kind of thing that MapReduce is good

> >> for, so, writing a Hadoop job to do this seems like a no-brainer.
> >>
> >> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur 
> >> <An...@corp.aol.com>
> >> wrote:
> >>> Hi Folks,
> >>>       What's the status of hadoopifying Taste on Mahout ?
> >>>  What's been done and what is in progress/pending ?
> >>>
> >>>  I am looking using a scalable version of Taste for my project.
> >>>  So basically trying to figure out what's already done and where  
> >>> I can pitch in.
> >>>
> >>>  Thanks
> >>>  -Ankur
> >>>
> >>
> >
>



--
ted

Re: Taste on Mahout

Posted by Ted Dunning <te...@gmail.com>.
My suggestion is to build a class of probabilistic models of what people
click on.  You can build some number of models as necessary to describe your
users' histories well.

These models will give you the answers you need.

I can talk this evening a bit about how to do this.  If you want to read up
on it ahead of time, take a look at http://citeseer.ist.psu.edu/750239.html
and http://citeseer.ist.psu.edu/blei03latent.html

(hint: consider each person a document and a thing to be clicked as a word)
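
To make the hint concrete, here is a minimal sketch of treating each
user's click history as a "document" over item "words" (plain Java; the
tab-separated "userId<TAB>itemId" click-log format and the class name are
just assumptions for illustration):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class ClicksAsDocuments {

      /** userId -> (itemId -> click count), i.e. one bag of words per user. */
      public static Map<String, Map<String, Integer>> load(String path)
          throws IOException {
        Map<String, Map<String, Integer>> docs =
            new HashMap<String, Map<String, Integer>>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
          String[] fields = line.split("\t");
          if (fields.length < 2) {
            continue; // skip malformed records
          }
          String user = fields[0];
          String item = fields[1];
          Map<String, Integer> doc = docs.get(user);
          if (doc == null) {
            doc = new HashMap<String, Integer>();
            docs.put(user, doc);
          }
          Integer count = doc.get(item);
          doc.put(item, count == null ? 1 : count + 1);
        }
        in.close();
        return docs;
      }
    }

Feeding these "documents" to a topic model would then give each user a
mixture over click-topics, which is the representation the models above
estimate.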

On Wed, May 21, 2008 at 4:36 AM, Goel, Ankur <An...@corp.aol.com>
wrote:

> Hey Sean,
>          Thanks for the suggestions. In my case the data-set os only
> going to tell me if the useer clicked on a particualar item. So lets say
> there are 10,000 items a user might only have clicked 20 - 30 items. I
> was thinking more on the lines of building an item similarity table by
> comparing each item with every other item and retaining only 100 top
> items decayed by time.
>
> So a recommender for a user would use his recent browsing history to
> figure out top 10 or 20 most similar items.
>
> The approach is documented in Toby Segaran's "Collective Intelligence"
> book and looks simple to implement even though it is costly since every
> item needs to be compared with every other item. This can be
> parallelized in way that for M items in a cluster of N machines, each
> node has to compare M/N items to M items. Since the data-set is going to
> sparse (no. of items having common users), I believe this would'nt be
> overwhelming for the cluster.
>
> The other approach that I am thinking to reduce the computation cost is
> to use a clustering algorithm like K-Means that's available in Mahout to
> cluster similar user/items together and then use clustering information
> to make recommendations.
>
> Any suggestions?
>
> Thanks
> -Ankur
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Tuesday, May 20, 2008 9:37 PM
> To: mahout-dev@lucene.apache.org; Goel, Ankur
> Subject: Re: Taste on Mahout
>
> + Ankur directly, since I am not sure you are on the dev list.
>
> On Tue, May 20, 2008 at 12:06 PM, Sean Owen <sr...@gmail.com> wrote:
> > All of the algorithms assume a world where you have a continuous range
>
> > of ratings from users for items. Obviously a binary yes/no rating can
> > be mapped into that trivially -- 1 and -1 for example. This causes
> > some issues, most notably for corrletion-based recommenders where the
> > correlation can be undefined between two items/users in special cases
> > that arise from this kind of input -- for example if we overlap in
> > rating 3 items and I voted "yes" for all 3, then no correlation can be
>
> > defined.
> >
> > Slope one doesn't run into this particular mathematical wrinkle.
> >
> > Also, methods like estimatePreference() are not going to give you
> > estimates that are always 1 or -1. Again, you could map this back onto
> > 1 / -1 by rounding or something, just something to note.
> >
> > So, in general it will be better if you can map whatever input you
> > have onto a larger range of input. You will feed more information in,
> > in this way, as well. For example, maybe you call a recent "yes"
> > rating a +2, and a recent "no" a -2, and others +1 and -1.
> >
> >
> > The part of slope one that parallelizes very well is the computing of
> > the item-item diffs. No I have not written this yet.
> >
> >
> > I have committed a first cut at a framework for computing
> > recommendations in parallel for any recommender. Dig in to
> > org.apache.mahout.cf.taste.impl.hadoop. In general, none of the
> > existing recommenders can be parallelized, because they generally need
>
> > access to all the data to produce any recommendation.
> >
> > But, we can take partial advantage of Hadoop by simply parallelizing
> > the computation of recommendations for many users across multiple
> > identical recommender instances. Better than nothing. In this
> > situation, one of the map or reduce phase is trivial.
> >
> > That is what I have committed so far and it works, locally. I am in
> > the middle of figuring out how to write it for real use on a remote
> > Hadoop cluster, and how I would go about testing that!
> >
> > Do we have any test bed available?
> >
> >
> >
> > On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur <An...@corp.aol.com>
> wrote:
> >> I just realized after going through the wikipedia that slope one is
> >> applicable when you have ratings for the items.
> >> In my case, I would be simply working with binary data (Item was
> >> clicked or not-clicked by user) using Tanimoto coefficient to
> >> calculate item similarity.
> >> The idea is to capture the simple intuition "What items have been
> >> visited most along with this item".
> >>
> >>
> >> -----Original Message-----
> >> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> >> Sent: Tuesday, May 20, 2008 2:51 PM
> >> To: mahout-dev@lucene.apache.org
> >> Subject: RE: Taste on Mahout
> >>
> >>
> >> Hey Sean,
> >>       I actually plan to use slope-one to start with since
> >> - Its simple and known to work well.
> >> - Can be parallelized nicely into the Map-Reduce style.
> >> I also plan to use Tanimoto coefficient for item-item diffs.
> >>
> >> Do we have something on slope-one already in Taste as a part of
> Mahout ?
> >>
> >> At the moment I am going through the available documentation on Taste
>
> >> and code that's present in Mahout.
> >>
> >> Your suggestions would be greatly appreciated.
> >>
> >> Thanks
> >> -Ankur
> >>
> >> -----Original Message-----
> >> From: Sean Owen [mailto:srowen@gmail.com]
> >> Sent: Tuesday, April 29, 2008 11:09 PM
> >> To: mahout-dev@lucene.apache.org; Goel, Ankur
> >> Subject: Re: Taste on Mahout
> >>
> >> I have some Hadoop code mostly ready to go for Taste.
> >>
> >> The first thing to do is let you generate recommendations for all
> >> your users via Hadoop. Unfortunately none of the recommenders truly
> >> parallelize in the way that MapReduce needs it to -- you need all
> >> data to compute any recommendation really -- but you can at least get
>
> >> paralellization out of this. You can use the framework to run n
> >> recommenders, each computing 1/nth of all recommendations.
> >>
> >> The next application is specific to slope-one. Computing the
> >> item-item diffs is exactly the kind of thing that MapReduce is good
> >> for, so, writing a Hadoop job to do this seems like a no-brainer.
> >>
> >> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur
> >> <An...@corp.aol.com>
> >> wrote:
> >>> Hi Folks,
> >>>       What's the status of hadoopifying Taste on Mahout ?
> >>>  What's been done and what is in progress/pending ?
> >>>
> >>>  I am looking using a scalable version of Taste for my project.
> >>>  So basically trying to figure out what's already done and where  I
> >>> can pitch in.
> >>>
> >>>  Thanks
> >>>  -Ankur
> >>>
> >>
> >
>



-- 
ted

RE: Taste on Mahout

Posted by "Goel, Ankur" <An...@corp.aol.com>.
Hey Sean,
          Thanks for the suggestions. In my case the data-set is only
going to tell me whether the user clicked on a particular item. So let's
say there are 10,000 items; a user might only have clicked 20 - 30 items.
I was thinking more along the lines of building an item similarity table
by comparing each item with every other item and retaining only the top
100 items, decayed by time.

So a recommender for a user would use his recent browsing history to
figure out the top 10 or 20 most similar items.

The approach is documented in Toby Segaran's "Collective Intelligence"
book and looks simple to implement, even though it is costly since every
item needs to be compared with every other item. It can be parallelized
in such a way that for M items on a cluster of N machines, each node has
to compare M/N items against all M items. Since the data-set is going to
be sparse (in terms of the number of items having common users), I
believe this wouldn't be overwhelming for the cluster.

The other approach I am considering, to reduce the computation cost, is
to use a clustering algorithm like K-Means that's available in Mahout to
cluster similar users/items together and then use the clustering
information to make recommendations.

Any suggestions?

Thanks
-Ankur
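
A back-of-the-envelope sketch of that sharded item-item comparison (plain
Java, not Mahout/Taste code; the class name, the round-robin sharding and
the in-memory maps are all just illustration, and the time decay is left
out): worker number shard of numShards scores its share of the items
against all M items with the Tanimoto coefficient over the sets of users
who clicked each item, keeping only the top k neighbours per item.

    import java.util.AbstractMap;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class ItemSimilarityShard {

      /** Tanimoto coefficient over the sets of users who clicked two items. */
      static double tanimoto(Set<String> a, Set<String> b) {
        int common = 0;
        for (String user : a) {
          if (b.contains(user)) {
            common++;
          }
        }
        int union = a.size() + b.size() - common;
        return union == 0 ? 0.0 : (double) common / union;
      }

      /** Top-k neighbours for the items assigned to this shard. */
      static Map<String, List<Map.Entry<String, Double>>> topK(
          Map<String, Set<String>> clicksByItem,  // itemId -> users who clicked it
          List<String> allItems, int shard, int numShards, int k) {
        Map<String, List<Map.Entry<String, Double>>> result =
            new HashMap<String, List<Map.Entry<String, Double>>>();
        // Round-robin assignment: this shard handles roughly M/N of the
        // items but still compares each of them against all M items.
        for (int i = shard; i < allItems.size(); i += numShards) {
          String item = allItems.get(i);
          List<Map.Entry<String, Double>> scored =
              new ArrayList<Map.Entry<String, Double>>();
          for (String other : allItems) {
            if (!other.equals(item)) {
              double s = tanimoto(clicksByItem.get(item), clicksByItem.get(other));
              scored.add(new AbstractMap.SimpleEntry<String, Double>(other, s));
            }
          }
          Collections.sort(scored, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> x,
                               Map.Entry<String, Double> y) {
              return Double.compare(y.getValue(), x.getValue()); // descending
            }
          });
          result.put(item, new ArrayList<Map.Entry<String, Double>>(
              scored.subList(0, Math.min(k, scored.size()))));
        }
        return result;
      }
    }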


-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Tuesday, May 20, 2008 9:37 PM
To: mahout-dev@lucene.apache.org; Goel, Ankur
Subject: Re: Taste on Mahout

+ Ankur directly, since I am not sure you are on the dev list.

On Tue, May 20, 2008 at 12:06 PM, Sean Owen <sr...@gmail.com> wrote:
> All of the algorithms assume a world where you have a continuous range

> of ratings from users for items. Obviously a binary yes/no rating can 
> be mapped into that trivially -- 1 and -1 for example. This causes 
> some issues, most notably for corrletion-based recommenders where the 
> correlation can be undefined between two items/users in special cases 
> that arise from this kind of input -- for example if we overlap in 
> rating 3 items and I voted "yes" for all 3, then no correlation can be

> defined.
>
> Slope one doesn't run into this particular mathematical wrinkle.
>
> Also, methods like estimatePreference() are not going to give you 
> estimates that are always 1 or -1. Again, you could map this back onto
> 1 / -1 by rounding or something, just something to note.
>
> So, in general it will be better if you can map whatever input you 
> have onto a larger range of input. You will feed more information in, 
> in this way, as well. For example, maybe you call a recent "yes"
> rating a +2, and a recent "no" a -2, and others +1 and -1.
>
>
> The part of slope one that parallelizes very well is the computing of 
> the item-item diffs. No I have not written this yet.
>
>
> I have committed a first cut at a framework for computing 
> recommendations in parallel for any recommender. Dig in to 
> org.apache.mahout.cf.taste.impl.hadoop. In general, none of the 
> existing recommenders can be parallelized, because they generally need

> access to all the data to produce any recommendation.
>
> But, we can take partial advantage of Hadoop by simply parallelizing 
> the computation of recommendations for many users across multiple 
> identical recommender instances. Better than nothing. In this 
> situation, one of the map or reduce phase is trivial.
>
> That is what I have committed so far and it works, locally. I am in 
> the middle of figuring out how to write it for real use on a remote 
> Hadoop cluster, and how I would go about testing that!
>
> Do we have any test bed available?
>
>
>
> On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur <An...@corp.aol.com>
wrote:
>> I just realized after going through the wikipedia that slope one is 
>> applicable when you have ratings for the items.
>> In my case, I would be simply working with binary data (Item was 
>> clicked or not-clicked by user) using Tanimoto coefficient to 
>> calculate item similarity.
>> The idea is to capture the simple intuition "What items have been 
>> visited most along with this item".
>>
>>
>> -----Original Message-----
>> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
>> Sent: Tuesday, May 20, 2008 2:51 PM
>> To: mahout-dev@lucene.apache.org
>> Subject: RE: Taste on Mahout
>>
>>
>> Hey Sean,
>>       I actually plan to use slope-one to start with since
>> - Its simple and known to work well.
>> - Can be parallelized nicely into the Map-Reduce style.
>> I also plan to use Tanimoto coefficient for item-item diffs.
>>
>> Do we have something on slope-one already in Taste as a part of
Mahout ?
>>
>> At the moment I am going through the available documentation on Taste

>> and code that's present in Mahout.
>>
>> Your suggestions would be greatly appreciated.
>>
>> Thanks
>> -Ankur
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:srowen@gmail.com]
>> Sent: Tuesday, April 29, 2008 11:09 PM
>> To: mahout-dev@lucene.apache.org; Goel, Ankur
>> Subject: Re: Taste on Mahout
>>
>> I have some Hadoop code mostly ready to go for Taste.
>>
>> The first thing to do is let you generate recommendations for all 
>> your users via Hadoop. Unfortunately none of the recommenders truly 
>> parallelize in the way that MapReduce needs it to -- you need all 
>> data to compute any recommendation really -- but you can at least get

>> paralellization out of this. You can use the framework to run n 
>> recommenders, each computing 1/nth of all recommendations.
>>
>> The next application is specific to slope-one. Computing the 
>> item-item diffs is exactly the kind of thing that MapReduce is good 
>> for, so, writing a Hadoop job to do this seems like a no-brainer.
>>
>> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur 
>> <An...@corp.aol.com>
>> wrote:
>>> Hi Folks,
>>>       What's the status of hadoopifying Taste on Mahout ?
>>>  What's been done and what is in progress/pending ?
>>>
>>>  I am looking using a scalable version of Taste for my project.
>>>  So basically trying to figure out what's already done and where  I 
>>> can pitch in.
>>>
>>>  Thanks
>>>  -Ankur
>>>
>>
>

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
+ Ankur directly, since I am not sure you are on the dev list.

On Tue, May 20, 2008 at 12:06 PM, Sean Owen <sr...@gmail.com> wrote:
> All of the algorithms assume a world where you have a continuous range
> of ratings from users for items. Obviously a binary yes/no rating can
> be mapped into that trivially -- 1 and -1 for example. This causes
> some issues, most notably for corrletion-based recommenders where the
> correlation can be undefined between two items/users in special cases
> that arise from this kind of input -- for example if we overlap in
> rating 3 items and I voted "yes" for all 3, then no correlation can be
> defined.
>
> Slope one doesn't run into this particular mathematical wrinkle.
>
> Also, methods like estimatePreference() are not going to give you
> estimates that are always 1 or -1. Again, you could map this back onto
> 1 / -1 by rounding or something, just something to note.
>
> So, in general it will be better if you can map whatever input you
> have onto a larger range of input. You will feed more information in,
> in this way, as well. For example, maybe you call a recent "yes"
> rating a +2, and a recent "no" a -2, and others +1 and -1.
>
>
> The part of slope one that parallelizes very well is the computing of
> the item-item diffs. No I have not written this yet.
>
>
> I have committed a first cut at a framework for computing
> recommendations in parallel for any recommender. Dig in to
> org.apache.mahout.cf.taste.impl.hadoop. In general, none of the
> existing recommenders can be parallelized, because they generally need
> access to all the data to produce any recommendation.
>
> But, we can take partial advantage of Hadoop by simply parallelizing
> the computation of recommendations for many users across multiple
> identical recommender instances. Better than nothing. In this
> situation, one of the map or reduce phase is trivial.
>
> That is what I have committed so far and it works, locally. I am in
> the middle of figuring out how to write it for real use on a remote
> Hadoop cluster, and how I would go about testing that!
>
> Do we have any test bed available?
>
>
>
> On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur <An...@corp.aol.com> wrote:
>> I just realized after going through the wikipedia that slope one is
>> applicable when you have ratings for the items.
>> In my case, I would be simply working with binary data (Item was clicked
>> or not-clicked by user) using Tanimoto coefficient to calculate item
>> similarity.
>> The idea is to capture the simple intuition "What items have been
>> visited most along with this item".
>>
>>
>> -----Original Message-----
>> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
>> Sent: Tuesday, May 20, 2008 2:51 PM
>> To: mahout-dev@lucene.apache.org
>> Subject: RE: Taste on Mahout
>>
>>
>> Hey Sean,
>>       I actually plan to use slope-one to start with since
>> - Its simple and known to work well.
>> - Can be parallelized nicely into the Map-Reduce style.
>> I also plan to use Tanimoto coefficient for item-item diffs.
>>
>> Do we have something on slope-one already in Taste as a part of Mahout ?
>>
>> At the moment I am going through the available documentation on Taste
>> and code that's present in Mahout.
>>
>> Your suggestions would be greatly appreciated.
>>
>> Thanks
>> -Ankur
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:srowen@gmail.com]
>> Sent: Tuesday, April 29, 2008 11:09 PM
>> To: mahout-dev@lucene.apache.org; Goel, Ankur
>> Subject: Re: Taste on Mahout
>>
>> I have some Hadoop code mostly ready to go for Taste.
>>
>> The first thing to do is let you generate recommendations for all your
>> users via Hadoop. Unfortunately none of the recommenders truly
>> parallelize in the way that MapReduce needs it to -- you need all data
>> to compute any recommendation really -- but you can at least get
>> paralellization out of this. You can use the framework to run n
>> recommenders, each computing 1/nth of all recommendations.
>>
>> The next application is specific to slope-one. Computing the item-item
>> diffs is exactly the kind of thing that MapReduce is good for, so,
>> writing a Hadoop job to do this seems like a no-brainer.
>>
>> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur <An...@corp.aol.com>
>> wrote:
>>> Hi Folks,
>>>       What's the status of hadoopifying Taste on Mahout ?
>>>  What's been done and what is in progress/pending ?
>>>
>>>  I am looking using a scalable version of Taste for my project.
>>>  So basically trying to figure out what's already done and where  I
>>> can pitch in.
>>>
>>>  Thanks
>>>  -Ankur
>>>
>>
>

Re: Taste on Mahout

Posted by Sean Owen <sr...@gmail.com>.
All of the algorithms assume a world where you have a continuous range
of ratings from users for items. Obviously a binary yes/no rating can
be mapped into that trivially -- 1 and -1 for example. This causes
some issues, most notably for correlation-based recommenders where the
correlation can be undefined between two items/users in special cases
that arise from this kind of input -- for example if we overlap in
rating 3 items and I voted "yes" for all 3, then no correlation can be
defined.

Slope one doesn't run into this particular mathematical wrinkle.

Also, methods like estimatePreference() are not going to give you
estimates that are always 1 or -1. Again, you could map this back onto
1 / -1 by rounding or the like; just something to note.

So, in general it will be better if you can map whatever input you
have onto a larger range of input. You will also feed in more
information this way. For example, maybe you call a recent "yes"
rating a +2, and a recent "no" a -2, and others +1 and -1.
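
Purely as an illustration of that kind of mapping (the 30-day "recent"
window, the exact values and the class name are arbitrary choices for the
example, not anything in Taste):

    public final class ClickToPreference {

      private static final long THIRTY_DAYS_MS = 30L * 24 * 60 * 60 * 1000;

      /** Graded preference from a binary click plus how recent it was. */
      public static double preference(boolean clicked, long eventTimeMs, long nowMs) {
        boolean recent = (nowMs - eventTimeMs) < THIRTY_DAYS_MS;
        if (clicked) {
          return recent ? 2.0 : 1.0;   // recent "yes" counts more
        }
        return recent ? -2.0 : -1.0;   // recent "no" counts more too
      }

      private ClickToPreference() {
      }
    }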


The part of slope one that parallelizes very well is the computing of
the item-item diffs. No, I have not written this yet.


I have committed a first cut at a framework for computing
recommendations in parallel for any recommender. Dig in to
org.apache.mahout.cf.taste.impl.hadoop. In general, none of the
existing recommenders can be parallelized, because they generally need
access to all the data to produce any recommendation.

But, we can take partial advantage of Hadoop by simply parallelizing
the computation of recommendations for many users across multiple
identical recommender instances. Better than nothing. In this
situation, one of the map or reduce phases is trivial.

That is what I have committed so far and it works, locally. I am in
the middle of figuring out how to write it for real use on a remote
Hadoop cluster, and how I would go about testing that!

Do we have any test bed available?
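
To make the "n identical recommenders, 1/nth of the users each" split
concrete, here is a minimal sketch of the partitioning step only (the
class is a placeholder for illustration, not the actual Taste or Hadoop
API; each worker would then feed its slice of users to its own
recommender instance):

    import java.util.ArrayList;
    import java.util.List;

    public class RecommenderShard {

      /** The users this shard handles; every shard runs the same model. */
      public static List<String> usersForShard(List<String> allUsers,
                                               int shard, int numShards) {
        List<String> mine = new ArrayList<String>();
        for (String user : allUsers) {
          // Crude but deterministic partition by hash; shards are disjoint
          // and together cover all users.
          if (Math.abs(user.hashCode() % numShards) == shard) {
            mine.add(user);
          }
        }
        return mine;
      }
    }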



On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur <An...@corp.aol.com> wrote:
> I just realized after going through the wikipedia that slope one is
> applicable when you have ratings for the items.
> In my case, I would be simply working with binary data (Item was clicked
> or not-clicked by user) using Tanimoto coefficient to calculate item
> similarity.
> The idea is to capture the simple intuition "What items have been
> visited most along with this item".
>
>
> -----Original Message-----
> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> Sent: Tuesday, May 20, 2008 2:51 PM
> To: mahout-dev@lucene.apache.org
> Subject: RE: Taste on Mahout
>
>
> Hey Sean,
>       I actually plan to use slope-one to start with since
> - Its simple and known to work well.
> - Can be parallelized nicely into the Map-Reduce style.
> I also plan to use Tanimoto coefficient for item-item diffs.
>
> Do we have something on slope-one already in Taste as a part of Mahout ?
>
> At the moment I am going through the available documentation on Taste
> and code that's present in Mahout.
>
> Your suggestions would be greatly appreciated.
>
> Thanks
> -Ankur
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Tuesday, April 29, 2008 11:09 PM
> To: mahout-dev@lucene.apache.org; Goel, Ankur
> Subject: Re: Taste on Mahout
>
> I have some Hadoop code mostly ready to go for Taste.
>
> The first thing to do is let you generate recommendations for all your
> users via Hadoop. Unfortunately none of the recommenders truly
> parallelize in the way that MapReduce needs it to -- you need all data
> to compute any recommendation really -- but you can at least get
> paralellization out of this. You can use the framework to run n
> recommenders, each computing 1/nth of all recommendations.
>
> The next application is specific to slope-one. Computing the item-item
> diffs is exactly the kind of thing that MapReduce is good for, so,
> writing a Hadoop job to do this seems like a no-brainer.
>
> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur <An...@corp.aol.com>
> wrote:
>> Hi Folks,
>>       What's the status of hadoopifying Taste on Mahout ?
>>  What's been done and what is in progress/pending ?
>>
>>  I am looking using a scalable version of Taste for my project.
>>  So basically trying to figure out what's already done and where  I
>> can pitch in.
>>
>>  Thanks
>>  -Ankur
>>
>

RE: Taste on Mahout

Posted by "Goel, Ankur" <An...@corp.aol.com>.
I just realized after going through the Wikipedia article that slope one
is applicable only when you have ratings for the items.
In my case, I would simply be working with binary data (an item was
clicked or not clicked by a user), using the Tanimoto coefficient to
calculate item similarity.
The idea is to capture the simple intuition "What items have been
visited most along with this item".
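
A tiny sketch of that co-visitation intuition (plain Java; the class name
and the in-memory map are just for illustration): count, for each pair of
items, how many users clicked both. Those overlaps are the numerator of
the Tanimoto coefficient; dividing by clicks(a) + clicks(b) - overlap
gives the similarity itself.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class CoVisitCounts {

      /** "itemA|itemB" (with itemA < itemB) -> number of users who clicked both. */
      public static Map<String, Integer> count(Map<String, Set<String>> clicksByUser) {
        Map<String, Integer> pairCounts = new HashMap<String, Integer>();
        for (Set<String> items : clicksByUser.values()) {
          for (String a : items) {
            for (String b : items) {
              if (a.compareTo(b) < 0) {          // each unordered pair once
                String key = a + "|" + b;
                Integer c = pairCounts.get(key);
                pairCounts.put(key, c == null ? 1 : c + 1);
              }
            }
          }
        }
        return pairCounts;
      }
    }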
 

-----Original Message-----
From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com] 
Sent: Tuesday, May 20, 2008 2:51 PM
To: mahout-dev@lucene.apache.org
Subject: RE: Taste on Mahout

 
Hey Sean,
       I actually plan to use slope-one to start with since
- Its simple and known to work well.
- Can be parallelized nicely into the Map-Reduce style.
I also plan to use Tanimoto coefficient for item-item diffs. 

Do we have something on slope-one already in Taste as a part of Mahout ?

At the moment I am going through the available documentation on Taste
and code that's present in Mahout.

Your suggestions would be greatly appreciated.

Thanks
-Ankur

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com]
Sent: Tuesday, April 29, 2008 11:09 PM
To: mahout-dev@lucene.apache.org; Goel, Ankur
Subject: Re: Taste on Mahout

I have some Hadoop code mostly ready to go for Taste.

The first thing to do is let you generate recommendations for all your
users via Hadoop. Unfortunately none of the recommenders truly
parallelize in the way that MapReduce needs it to -- you need all data
to compute any recommendation really -- but you can at least get
paralellization out of this. You can use the framework to run n
recommenders, each computing 1/nth of all recommendations.

The next application is specific to slope-one. Computing the item-item
diffs is exactly the kind of thing that MapReduce is good for, so,
writing a Hadoop job to do this seems like a no-brainer.

On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur <An...@corp.aol.com>
wrote:
> Hi Folks,
>       What's the status of hadoopifying Taste on Mahout ?
>  What's been done and what is in progress/pending ?
>
>  I am looking using a scalable version of Taste for my project.
>  So basically trying to figure out what's already done and where  I 
> can pitch in.
>
>  Thanks
>  -Ankur
>