Posted to user@mahout.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2008/08/28 22:20:30 UTC

Tasty Findory

I imagine everyone here is familiar with Findory.  If not, too bad for you! ;)
I loved Findory and used it on a daily basis until it shut down.

I'm wondering if there are lessons to be learned from what Greg Linden (Findory's papa - let's see if he has Google Ego Alert) wrote about Findory over the years:
  http://glinden.blogspot.com/search?q=findory

Here are a few excerpts (from the results of the above search) that seem important to me.  I am wondering if people could comment on how Taste compares:

1. Findory's personalization used a type of hybrid collaborative filtering
algorithm that recommended articles based on a combination of
similarity of content and articles that tended to interest other
Findory users with similar tastes.

One way to think of this is
that, when a person found and read an interesting article on Findory,
that article would be shared with any other Findory readers who likely
would be interested. Likewise, that person would benefit from
interesting articles other Findory readers found. All this sharing of
articles was done implicitly and anonymously without any effort from
readers by Findory's recommendation engine.
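
The excerpt does not say how Findory actually combined the two signals, so the following is only a minimal sketch of one possible blend. The linear mix, the `alpha` weight, and the function names are all assumptions made for illustration, not Findory's method.

```python
def hybrid_score(content_sim, collab_score, alpha=0.5):
    """Blend a content-similarity score with a collaborative score.

    alpha is a hypothetical mixing weight in [0, 1]:
    1.0 means content only, 0.0 means collaborative only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * content_sim + (1.0 - alpha) * collab_score


def rank_articles(candidates, alpha=0.5):
    """Rank article ids by blended score, best first.

    candidates maps article id -> (content_sim, collab_score).
    Ties are broken by article id so the order is deterministic.
    """
    scored = {
        article: hybrid_score(content, collab, alpha)
        for article, (content, collab) in candidates.items()
    }
    return sorted(scored, key=lambda article: (-scored[article], article))


if __name__ == "__main__":
    candidates = {"a": (0.9, 0.1), "b": (0.4, 0.8), "c": (0.2, 0.2)}
    print(rank_articles(candidates, alpha=0.5))
```

With `alpha` near 1.0 the ranking follows content similarity, which would matter most for a brand-new article nobody has read yet; with `alpha` near 0.0 it follows what similar readers clicked.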

Findory's news
recommendations were unusual in that they were primarily based on user
behavior (what articles other readers had found), worked from very
little data (starting after a single click on Findory), worked in
real-time (changed immediately when someone read an article), required
no set-up or configuration (worked just by watching articles read), and
did not require readers to identify themselves (no login necessary).


2. For most of Findory's four years, it ran on six servers.

Findory's six servers were all cheap commodity Linux boxes,
typically with a single-core low-end AMD processor, 1 GB of RAM, and a single
IDE disk. Findory was cheap, cheap, cheap.


3. So, when someone comes to your personalized site, you need to load
everything you need to know about them, find all the content that that
person might like, rank and layout that content, and serve up a pipin'
hot page. All while the customer is waiting.

Findory works hard
to do all that quickly, almost always in well under 100ms. Time is
money, after all, both in terms of customer satisfaction and the number
of servers Findory has to pay for.

The way Findory does this is
that it pre-computes as much of the expensive personalization as it
can. Much of the task of matching interests to content is moved to an
offline batch process. The online task of personalization, the part
while the user is waiting, is reduced to a few thousand data lookups.

Even a few thousand database accesses could be prohibitive given the time
constraints. However, much of the content and pre-computed data is
effectively read-only data. Findory replicates the read-only data out
to its webservers, making these thousands of lookups lightning fast
local accesses.
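
The split between an offline batch step and cheap online lookups could be sketched roughly as below. This assumes a simple co-read count as the precomputed signal; the table layout and function names are hypothetical and not taken from Findory.

```python
from collections import defaultdict
from itertools import combinations


def build_related_table(histories, top_n=3):
    """Offline batch step: for each article, precompute its top_n co-read articles.

    histories maps user id -> list of article ids that user read.
    The result is read-only and could be replicated to every web server.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for articles in histories.values():
        for a, b in combinations(sorted(set(articles)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    table = {}
    for article, neighbours in counts.items():
        ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
        table[article] = tuple(ranked[:top_n])
    return table


def recommend(table, read_articles, limit=5):
    """Online step: nothing but local dictionary lookups against the table."""
    seen = set(read_articles)
    scores = defaultdict(int)
    for article in read_articles:
        for related, weight in table.get(article, ()):
            if related not in seen:
                scores[related] += weight
    return sorted(scores, key=lambda a: (-scores[a], a))[:limit]


if __name__ == "__main__":
    histories = {"u1": ["a", "b", "c"], "u2": ["a", "b"], "u3": ["b", "d"]}
    table = build_related_table(histories)
    print(recommend(table, ["a", "b"]))
```

The online cost depends only on how many articles the reader has seen and on `top_n`, not on the total number of users, which matches the scaling argument in the excerpt.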

Read-write data, such as each reader's history
on Findory, is in MySQL. MyISAM works well for this task since the data
is not critical and speed is more important than transaction support.

The read-write user data in MySQL can be partitioned by user id, making the
database trivially scalable. The online personalization task scales
independently of the number of Findory users. Only the offline batch
process faced any issue of scaling as Findory grew, but that batch
process can be done in parallel. 
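
Partitioning by user id can be as simple as a modulo over the number of databases. The sketch below assumes integer user ids and a fixed shard count; the connection strings are made-up placeholders, not Findory's setup.

```python
# Hypothetical connection strings, one per MySQL instance.
SHARD_DSNS = (
    "mysql://db0.example.invalid/history",
    "mysql://db1.example.invalid/history",
    "mysql://db2.example.invalid/history",
)


def shard_for_user(user_id, num_shards):
    """Return the index of the shard that holds this user's history."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return user_id % num_shards


def dsn_for_user(user_id, dsns=SHARD_DSNS):
    """Pick the connection string for the shard owning this user."""
    return dsns[shard_for_user(user_id, len(dsns))]


if __name__ == "__main__":
    for uid in (7, 8, 9):
        print(uid, dsn_for_user(uid))
```

Every request touches exactly one shard, so adding users means adding shards rather than making any single database bigger. A plain modulo does force data to move when the shard count changes, which is a known trade-off of this simple scheme.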

In the end, it is blazingly
fast. Readers receive fully personalized pages in under 100ms. As they
read new articles, the page changes immediately, no delay. It all just
works.


4. (quoting a Google paper from WWW2007 on Google news personalization)
The paper tested three methods of making news recommendations on the Google News front page.  From the abstract:
We describe our approach to collaborative filtering for generating
personalized recommendations for users of Google News. We generate
recommendations using three approaches: collaborative filtering using
MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and
covisitation counts.

MinHash and PLSI are both clustering methods; a user is matched to a
cluster of similar users, then they look at the aggregate behavior of
users in that cluster to find recommendations.
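
To make the MinHash idea concrete, here is a minimal sketch, not the paper's implementation. It assumes seeded SHA-1 hashes stand in for random permutations and that users with identical signatures share a cluster; all function names are hypothetical.

```python
import hashlib
from collections import Counter


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=4):
    """For each seeded hash function, keep the minimum hash over the user's items."""
    items = list(items)
    if not items:
        raise ValueError("items must be non-empty")
    return tuple(
        min(_hash(seed, item) for item in items) for seed in range(num_hashes)
    )


def cluster_users(histories, num_hashes=2):
    """Group users whose signatures match exactly (signature -> set of users)."""
    clusters = {}
    for user, items in histories.items():
        clusters.setdefault(minhash_signature(items, num_hashes), set()).add(user)
    return clusters


def recommend_from_cluster(user, members, histories, limit=5):
    """Recommend what other cluster members read that this user has not."""
    seen = set(histories[user])
    counts = Counter(
        item
        for other in members
        if other != user
        for item in set(histories[other])
        if item not in seen
    )
    return sorted(counts, key=lambda item: (-counts[item], item))[:limit]
```

Two users agree on a single min-hash with probability equal to the Jaccard similarity of their click sets, so using more hashes per signature makes clusters tighter and smaller.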

OG: I think this is similar in principle to what I mentioned in an earlier email.


 
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Tasty Findory

Posted by Sean Owen <sr...@gmail.com>.
Oh, I think I misread this. You are right, the article similarity
metric is based on content, not on ratings. This is good -- probably a
better metric, and injects new information into the system.

In that case, this sounds more like an item-based recommender system.
Rather than finding similar users and then seeing what they liked
(user-based recommenders), you see which items are similar to the items
the user likes. It's really just a user-based recommender turned on its
side, but it is generally superior because you can inject some external
idea of item similarity rather than use correlation among ratings. That
external data is, well, more data to inform recommendations, and is
generally fixed over time and precomputable, whereas user-user
similarity doesn't have this property. This is roughly the approach
Amazon takes AFAIK.
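
A rough sketch of an item-based recommender with an external, content-derived similarity follows. It assumes cosine similarity over raw word counts as the external metric; that choice and the function names are illustrative only, and this is not the Taste API.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def item_based_recommend(liked, item_text, limit=3):
    """Score each unseen item by its summed similarity to the liked items.

    item_text maps item id -> text; similarity comes from content,
    not from correlation among ratings.
    """
    vectors = {i: Counter(text.lower().split()) for i, text in item_text.items()}
    scores = {}
    for candidate, vec in vectors.items():
        if candidate in liked:
            continue
        scores[candidate] = sum(
            cosine(vec, vectors[item]) for item in liked if item in vectors
        )
    ranked = sorted(
        (c for c in scores if scores[c] > 0), key=lambda c: (-scores[c], c)
    )
    return ranked[:limit]


if __name__ == "__main__":
    articles = {
        "a1": "hadoop cluster scaling",
        "a2": "hadoop cluster tuning",
        "a3": "cooking pasta recipes",
        "a4": "scaling mysql",
    }
    print(item_based_recommend({"a1"}, articles))
```

Because the item vectors change only when article text changes, the pairwise similarities can be computed once offline and cached.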

It may be they're just using different words for the same thing or in
fact it is something different and I am just not familiar with it.

On Fri, Aug 29, 2008 at 12:02 AM, Satish Dandu
<Sa...@melbourneitdbs.com> wrote:
> Hi Sean,
>
>>> 1. Findory's personalization used a type of hybrid collaborative filtering
>>> algorithm that recommended articles based on a combination of
>>> similarity of content and articles that tended to interest other
>>> Findory users with similar tastes.
>
>> Interesting -- yeah, that would be a hybrid of user-based and item-based approaches.
>
> When you say a hybrid of user-based and item-based approaches (both are forms of the collaborative approach), how do we get articles with similar content?
> From my understanding, Findory uses some kind of "content-based filtering" plus "collaborative filtering". Content-based filtering may be used to fetch documents with similar content. The best example would be some sort of Lucene "MoreLikeThis" or "Similar" query. Correct me if I am wrong.
>
> Regards,
> -Satish Dandu
>
>

RE: Tasty Findory

Posted by Satish Dandu <Sa...@melbourneitdbs.com>.
Hi Sean,

>> 1. Findory's personalization used a type of hybrid collaborative filtering
>> algorithm that recommended articles based on a combination of
>> similarity of content and articles that tended to interest other
>> Findory users with similar tastes.

> Interesting -- yeah, that would be a hybrid of user-based and item-based approaches.

When you say a hybrid of user-based and item-based approaches (both are forms of the collaborative approach), how do we get articles with similar content?
From my understanding, Findory uses some kind of "content-based filtering" plus "collaborative filtering". Content-based filtering may be used to fetch documents with similar content. The best example would be some sort of Lucene "MoreLikeThis" or "Similar" query. Correct me if I am wrong.

Regards,
-Satish Dandu



Re: Tasty Findory

Posted by Sean Owen <sr...@gmail.com>.
On Thu, Aug 28, 2008 at 9:20 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> 1. Findory's personalization used a type of hybrid collaborative filtering
> algorithm that recommended articles based on a combination of
> similarity of content and articles that tended to interest other
> Findory users with similar tastes.

Interesting -- yeah, that would be a hybrid of user-based and
item-based approaches.

Usually, in a user-based approach, you find similar users, and then
guess a rating for a new item by averaging the rating for that item of
similar users -- weighted by the user similarity of course.
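
As a minimal sketch of that weighted average (not the Taste implementation): the data layout, the pluggable `similarity` callable, and the choice to skip non-positive similarities are assumptions made here for illustration.

```python
def predict_rating(target, item, ratings, similarity):
    """Estimate target's rating for item from other users' ratings.

    ratings maps user id -> {item id: rating}.
    similarity(u, v) returns a user-user similarity. Only positive
    similarities are used as weights (an assumption of this sketch).
    Returns None when no similar user has rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == target or item not in their_ratings:
            continue
        weight = similarity(target, other)
        if weight <= 0:
            continue
        weighted_sum += weight * their_ratings[item]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


if __name__ == "__main__":
    ratings = {"u": {"x": 4}, "v": {"y": 5}, "w": {"y": 1}}
    sims = {("u", "v"): 0.75, ("u", "w"): 0.25}
    print(predict_rating("u", "y", ratings, lambda a, b: sims.get((a, b), 0.0)))
```

In practice the loop would run over a neighborhood of the most similar users rather than over everyone.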

Here, I imagine that in Findory you don't have a rating per se for
articles, just a boolean yes/no. So you substitute a similarity metric
between those items the user has read and a given new item.

Yeah... that does add up to an interesting new approach, likely. I'd
have to digest that a bit more to think about how to implement it
right.


> The way Findory does this is
> that it pre-computes as much of the expensive personalization as it
> can. Much of the task of matching interests to content is moved to an
> offline batch process. The online task of personalization, the part
> while the user is waiting, is reduced to a few thousand data lookups.

Ah-ha, yeah, computing offline is not surprising. Good news, because
that is the only option for the sorts of parallelization we are
considering via Hadoop.

There is a notion of "Rescorer" in the code which allows for injecting
arbitrary logic to re-rank recommendations. That maps to the "online
personalization" part, and indeed I think that is useful to allow for
some cheap, real-time logic to affect rankings, on top of
recommendations computed offline.
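
The rescoring idea could look roughly like the sketch below. It is not the Taste `Rescorer` interface, only the concept: a cheap callable adjusts or drops each precomputed recommendation at request time. The freshness boost and every name here are hypothetical.

```python
def rescore(recommendations, rescorer, limit=None):
    """Re-rank precomputed (item, score) pairs with cheap online logic.

    rescorer(item, score) returns a new score, or None to drop the item.
    """
    rescored = []
    for item, score in recommendations:
        new_score = rescorer(item, score)
        if new_score is not None:
            rescored.append((item, new_score))
    rescored.sort(key=lambda pair: (-pair[1], pair[0]))
    return rescored[:limit]


def make_freshness_rescorer(age_hours, already_read, boost=2.0, fresh_within=24):
    """Build a rescorer that drops read articles and boosts recent ones."""

    def rescorer(item, score):
        if item in already_read:
            return None
        if age_hours.get(item, float("inf")) <= fresh_within:
            return score * boost
        return score

    return rescorer


if __name__ == "__main__":
    offline = [("a", 0.9), ("b", 0.5), ("c", 0.4)]
    rescorer = make_freshness_rescorer({"a": 100, "b": 2, "c": 1}, {"c"})
    print(rescore(offline, rescorer))
```

The expensive ranking stays in the offline batch, while anything that must react immediately, such as hiding an article the reader just opened, lives in the rescorer.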