Posted to user@mahout.apache.org by Michael Wechner <mi...@wyona.com> on 2013/08/29 15:21:53 UTC

Recommender for news articles based on own user profile (URL history)

Hi

I am looking for a recommender example for news articles which makes 
suggestions based on a user profile (independent of other users/readers) 
or, more specifically, on the reading history of a user.

Let's say a specific user likes to read articles about cycling and 
international politics, and the content management system saves the 
URL history of all the articles this user has read.
When the editorial staff creates new articles/stories, the system 
should make recommendations to this user when she/he gets back 
online. Also, when a new story has been created, the recommender 
should check whether it would be a good fit/match for this 
particular user, and if so the system should send a notification.

I guess developing such a recommender is possible with Mahout, but since 
I am new to Mahout, I would appreciate any pointers to examples which are 
similar to the functionality described above.

I am currently looking at the examples shipped with Mahout

https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples

but if I understand correctly, these are based on what other people liked 
and not only on what the person himself/herself liked,
or do I misunderstand?

Thanks for your help

Michael

Re: Recommender for news articles based on own user profile (URL history)

Posted by Pat Ferrel <pa...@occamsmachete.com>.

In the simple case I’m not sure a collaborative filtering recommender is going to work here. The items change too quickly to gather significant preference data. Articles are your items, what is their lifetime? To do CF you need relatively long-lived items and enough user preference data about those items.

There are other ways to tackle this. Let’s take Google Alerts as an example. They start with search text. I created one with the text “machine learning” and got some silly alerts: http://occamsmachete.com/ml/2012/03/16/fun-with-google-alerts/

But what they do is track every time you follow a link from their recs email. Then they train a classifier with all of the text you read. The start is pretty awful but they get better very quickly. I’m sure they do some things to make this more scalable but that’s a longer story. There is a CF angle with enough technology (read on).

Can you do the same thing? If you can tell what articles people read you can use this collection as a content exemplar and recommend new news items based on similarity to this collection.

To use the Google Alerts (GA) template:
1) use Solr to recommend articles from a user’s tweets (they may be awful at first)
2) track what they read and keep it as an example of the type of thing they like
3) when new articles come in, find the people who like that sort of thing and make them aware of it. You do this by comparing the new article with each of the user’s collection of past reads. You can do this with Solr for ease and simplicity but batch classification will probably give better results.
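
A minimal sketch of step 3 in plain Python, not Solr or Mahout code. The function names, the raw term-count vectors, and the 0.2 threshold are assumptions made only for illustration. As a simplification, it pools each user's past reads into a single term-count profile and scores the new article against that profile by cosine similarity.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity of two term-count vectors (Counter objects)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def users_to_notify(new_article, read_history, threshold=0.2):
    """Return (user, score) pairs whose read profile resembles the new article."""
    article_vec = Counter(tokens(new_article))
    matches = []
    for user, articles in read_history.items():
        profile = Counter()
        for text in articles:  # pool everything the user has read
            profile.update(tokens(text))
        score = cosine(article_vec, profile)
        if score >= threshold:  # hypothetical cut-off, tune on real data
            matches.append((user, score))
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Toy data, invented for this example.
read_history = {
    "alice": ["cycling race in the alps", "new cycling gear for the tour"],
    "bob": ["election results and international politics"],
}
print(users_to_notify("cycling tour starts in the alps", read_history))
```

With real text you would likely want stop-word removal and TF-IDF weights rather than raw counts, since common words such as "the" inflate the score here.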

Some have used named entities in news and tweets to make CF-based recs. If you knew one named entity in an article was ‘Putin’ you could treat it as an item and gather CF data from people who read about him. With enough history like that you could build a CF-type recommender. It wouldn’t surprise me if Google were doing something like this in a lot of their search products, such as Alerts.
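
To make the named-entity idea concrete, here is a small sketch that turns read events into per-user entity counts, which could then be fed to a CF recommender as (user, item, strength) data. It assumes the entities for each article have already been extracted by some separate step; the function name and the data layout are hypothetical.

```python
from collections import defaultdict


def entity_preferences(read_events, article_entities):
    """Count how often each user read an article mentioning each entity.

    read_events: iterable of (user_id, article_id) pairs.
    article_entities: dict mapping article_id to a list of entity names
    (assumed to come from a named-entity extraction step not shown here).
    """
    prefs = defaultdict(lambda: defaultdict(int))
    for user, article_id in read_events:
        for entity in article_entities.get(article_id, ()):
            prefs[user][entity] += 1
    return {user: dict(counts) for user, counts in prefs.items()}


# Toy data, invented for this example.
read_events = [("u1", "a1"), ("u1", "a2"), ("u2", "a2")]
article_entities = {"a1": ["Putin", "Kremlin"], "a2": ["Putin"]}
print(entity_preferences(read_events, article_entities))
```

The point of the design is that entities live much longer than individual articles, so preference data can accumulate on them even when each article is read for only a day or two.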

In reply to Juanjo Ramos <jj...@gmail.com>, Feb 16, 2014, 11:51 AM.

Re: Recommender for news articles based on own user profile (URL history)

Posted by Juanjo Ramos <jj...@gmail.com>.

As per your question, we have not built anything yet, 
so we are still dealing with that problem: how to let the tweets 
drive the recommendation of the news to be viewed.

The original idea was to find item-item similarity between the 
user tweets and the news in order to deal with the cold-start 
problem and infer some initial preference of the users 
for the news based on that item-item similarity. This is where 
my original idea of using RowSimilarityJob to compute the matrix 
of similarities came into play. 
Later, as the user accesses different news, those preferences 
will be tuned as in a regular item-based recommender.

Since the system has not been built yet, our first goal is to design
the architecture of the system and how it should respond after
new tweets are produced, even if the performance is not the best
in this first version. Then, we will focus on the particular problem 
of using tweets to recommend news, for which the links you posted 
will be extremely helpful.

I am new to Mahout. I have just finished reading 'Mahout in Action'
and that is why I tried to use only Mahout for the implementation,
but the approach you suggest with Solr seems more reasonable
to deal with the problem of having the system responding and
adapting fast when new tweets are produced.

Thanks again.

Re: Recommender for news articles based on own user profile (URL history)

Posted by Pat Ferrel <pa...@occamsmachete.com>.

The solution you mention doesn’t sound right. You would usually not need to create a new ItemSimilarity class unless you have a new way to measure similarity.

Let’s see if I have this right:

1) you want to recommend news
2) recs are based on a user’s tweets
3) you have little metadata about either input or recommended items

You mention that you have previous tweets? Do you know which tweets led to which news being viewed? Are you collecting links in tweets? You can augment tweet text with text from the pages linked to.

There are many difficulties in using tweets to recommend news, I’d do some research before you start. A quick search got this article http://nlp.cs.rpi.edu/paper/tweetnews.pdf which references others.

Also Ken Krugler wrote a series of articles on techniques used to improve text to text similarity—make sure to read both. http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/

Can’t predict where this will end up, but an easy thing to do as a trial is to index the news in Solr and use scrubbed tweets as queries. You could probably set this up in an hour or so and try it with your own tweets to see how well it does. I suspect this won’t be your ultimate solution, but it’s easy to do while you get your mind around the research.
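
A rough sketch of what "scrubbed tweets as queries" might mean in practice. The scrubbing rules, the core name `news`, and the field name `body` are assumptions for illustration; `q`, `df`, `rows`, and `wt` are standard Solr select parameters. The code only builds the request URL and does not contact a server.

```python
import re
from urllib.parse import urlencode


def scrub_tweet(text):
    """Strip tweet-specific noise so the remainder can serve as a query."""
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = re.sub(r"\bRT\b", " ", text)        # drop the retweet marker
    text = text.replace("#", " ")              # keep hashtag words, drop '#'
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation and query syntax
    return " ".join(text.split())


def solr_select_url(base_url, tweet, rows=10):
    """Build a Solr select URL that uses the scrubbed tweet as the query."""
    params = {
        "q": scrub_tweet(tweet),
        "df": "body",  # hypothetical default search field holding article text
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


tweet = "RT @bob: Great stage at the #TourDeFrance today! http://t.co/abc"
print(scrub_tweet(tweet))
print(solr_select_url("http://localhost:8983/solr/news", tweet))
```

Removing punctuation also removes characters that the query parser would otherwise treat as syntax, which keeps the trial simple at the cost of losing phrases and operators.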

In reply to Juanjo Ramos <jj...@gmail.com>, Feb 16, 2014, 5:54 AM.

Re: Recommender for news articles based on own user profile (URL history)

Posted by Juanjo Ramos <jj...@gmail.com>.

Hi Pat,
Thanks so much for your detailed response.

At the moment we do not have any metadata 
about the articles but just their title & body. 
In addition, in the dataset we have tweets from the user which 
will never be in the output of the recommender 
(we never want to recommend a user to see a particular tweet) 
but we will use them to tune the users' 
preferences for different pieces of news based 
on the similarity between the tweets they have 
produced and the news that we have.

Would the approach you suggest with Solr 
still be valid in this particular scenario? We would need the 
user preferences to be updated as soon as they produce 
a new tweet, hence my urge to recompute 
item similarities as soon as a new tweet is produced. 
We do not need to recompute the matrix of 
similarities whenever a piece of news is produced, 
as you rightly mentioned.

I do not know if the approach I am about to suggest 
even makes sense, but my idea was to precompute the 
similarities between items (news + tweets) 
and store them along with the vectorized representation 
of every item. 
Then I would implement my own ItemSimilarity class, 
which would return the similarity for 
every pair of items (from the matrix if available, 
or calculated on the fly if not found). My main 
problem here is that I do not know how to calculate 
in Mahout the cosine distance between the 
vectorized representations of two particular items. 
Does this approach make sense in the first place?

Many thanks.
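
For readers following the idea in this message, here is a plain-Python sketch of the "look up the precomputed similarity, otherwise compute it on the fly" scheme. It is not Mahout code: the class and function names are hypothetical, and vectors are plain dicts of term weights. As far as I know, Mahout itself provides `CosineDistanceMeasure` in `org.apache.mahout.common.distance` for the distance calculation on its own `Vector` type. Cosine distance is simply one minus the cosine similarity computed below.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):  # iterate over the shorter vector
        a, b = b, a
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class CachedSimilarity:
    """Return a stored similarity if present, else compute and remember it."""

    def __init__(self, vectors, precomputed=None):
        self.vectors = vectors                   # item id -> sparse vector
        self.cache = dict(precomputed or {})     # keys are (smaller_id, larger_id)

    def similarity(self, id1, id2):
        key = (id1, id2) if id1 <= id2 else (id2, id1)
        if key not in self.cache:
            self.cache[key] = cosine_similarity(self.vectors[id1], self.vectors[id2])
        return self.cache[key]


# Toy data, invented for this example.
vectors = {
    "news1": {"cycling": 1.0, "tour": 1.0},
    "news2": {"politics": 1.0},
    "tweet1": {"cycling": 1.0},
}
sims = CachedSimilarity(vectors, precomputed={("news1", "news2"): 0.0})
print(sims.similarity("tweet1", "news1"))        # computed on the fly
print(1.0 - sims.similarity("tweet1", "news1"))  # the corresponding cosine distance
```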

Re: Recommender for news articles based on own user profile (URL history)

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Yes.

The batch training data should be updated as needed but for some length of time the RowSimilarity Model will be valid and useful even with brand new queries that are made from articles not in the model. Remember however that the only items you will get in results are ones in the training data so that will give you an indication of how often to update it.

For a content-based recommender you should look at Solr. The rest of the thread is missing but I think I also suggested that you could use it as the similarity engine, especially if you need immediacy of model updates. In this case you simply maintain an up-to-date Solr index of all articles and their metadata. The index can be maintained in realtime or very close to it.

Once the data is in a form that Solr can index you have a very flexible content based recommender. For instance you can create a query from articles read, along with their metadata, like category, location, etc. Or you may know something from the User’s profile, past usage, or browsing context that allows you to boost results by using this metadata.

The collaborative filtering recommender that uses Solr + Mahout can seamlessly include metadata (content based data) to calculate recs. For instance, in the demo site we have Videos with genre data. When a user is looking at a Video which has genre tags these can be included in the query. A simple CF query would be a list of the Videos the user preferred against the RSJ created model. With Solr we can add multiple fields to the query so by adding the current Video’s genre tags against other Videos' genres you get genre boosted CF recs.
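
The multi-field query described above might be assembled roughly as follows. This is a sketch under assumptions: the field names `indicators` (the IDs of similar videos stored per video from the RSJ model) and `genres` are hypothetical, as is the boost value. The `field:(...)` grouping and the `^boost` suffix are standard Lucene query syntax.

```python
def field_clause(field, values, boost=None):
    """Build a clause such as genres:("comedy" "drama")^2.0 with quoted terms."""
    quoted = " ".join(
        '"{}"'.format(v.replace("\\", "\\\\").replace('"', '\\"')) for v in values
    )
    clause = f"{field}:({quoted})"
    return f"{clause}^{boost}" if boost is not None else clause


def boosted_query(preferred_ids, genre_tags, genre_boost=2.0):
    """Combine the CF part (preferred video IDs) with a genre boost."""
    clauses = [field_clause("indicators", preferred_ids)]
    if genre_tags:
        clauses.append(field_clause("genres", genre_tags, genre_boost))
    return " ".join(clauses)


print(boosted_query(["v12", "v7"], ["comedy", "drama"]))
```

With the default OR operator both clauses are optional, so videos matching the genre tags score higher without videos of other genres being excluded.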

You should be able to use the same technique with a purely content based recommender.

In reply to Juanjo Ramos <jj...@gmail.com>, Feb 15, 2014, 1:37 PM.

Re: Recommender for news articles based on own user profile (URL history)

Posted by Juanjo Ramos <jj...@gmail.com>.

Hi Pat,
Thanks for your comment, I found it quite helpful. 
I'm also trying to build a content-based 
recommender. One question though:
How can I use RowSimilarityJob for online data? 
I mean, I have a dataset and the approach you describe 
works pretty well to precompute the similarity 
matrix. 
However, when I get new content in my dataset (it is a dataset of news), 
how can I compute the similarity 
of only that new item against the rest 
without computing the whole matrix again?

Many thanks.

Re: Recommender for news articles based on own user profile (URL history)

Posted by Michael Wechner <mi...@wyona.com>.

Hi Pat

Thanks very much for your suggestions. I will try to develop a 
"recommender" based on that, and if somebody
is interested in it, I could contribute it as another example.

Thanks

Michael

In reply to Pat Ferrel, 29.08.13 18:02.

Re: Recommender for news articles based on own user profile (URL history)

Posted by Pat Ferrel <pa...@occamsmachete.com>.

You can use the Mahout text pipeline, which will give you weighted vectors based on TF-IDF for each article. There is an example of this in Mahout in Action for clustering. Then run the RowSimilarityJob on them instead of clustering. This will give you a similarity strength for each article pair. RSJ produces a DRM (distributed row matrix), which is keyed by the article id, so each row lists how similar every other article is to the row article. The highest similarities will indicate the most similar text content in the articles. I've done this before and it works pretty well. There might be something in the new kNN (k-nearest-neighbors) framework that is more optimized.
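
To illustrate the shape of the result, here is a toy in-memory sketch in plain Python of TF-IDF vectors followed by an all-pairs cosine similarity, with one ranked row per article. It is not the Mahout pipeline: the tokenizer, the `tf * log(N / df)` weighting, and the function names are assumptions, and the brute-force pairwise loop stands in for what RowSimilarityJob does in a distributed way.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Map each doc id to a sparse {term: tf * log(N / df)} vector."""
    tokenized = {d: re.findall(r"[a-z]+", text.lower()) for d, text in docs.items()}
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    n = len(docs)
    vectors = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        # Terms present in every document get weight 0, so leave them out.
        vectors[doc_id] = {t: c * math.log(n / df[t]) for t, c in tf.items() if df[t] < n}
    return vectors


def row_similarities(vectors):
    """Return {doc id: [(other id, cosine similarity), ...]} sorted by similarity."""
    norms = {d: math.sqrt(sum(w * w for w in v.values())) for d, v in vectors.items()}
    rows = {d: [] for d in vectors}
    ids = sorted(vectors)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            dot = sum(w * vectors[b].get(t, 0.0) for t, w in vectors[a].items())
            if dot > 0:  # keep only pairs that share at least one weighted term
                sim = dot / (norms[a] * norms[b])
                rows[a].append((b, sim))
                rows[b].append((a, sim))
    for d in rows:
        rows[d].sort(key=lambda pair: pair[1], reverse=True)
    return rows


# Toy data, invented for this example.
docs = {
    "a1": "cycling race in the alps",
    "a2": "alps cycling tour",
    "a3": "election politics debate",
}
rows = row_similarities(tfidf_vectors(docs))
print(rows["a1"])
```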

Once you have the article similarities, you could combine the lists of articles most similar to those the user has read and show some number of the ones the user hasn't seen yet.
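
One simple way to do that combination, sketched in plain Python. Summing the similarities is only one possible choice (taking the maximum or a weighted sum would also work), and the function name and the input layout, a dict of ranked similarity rows keyed by article id, are assumptions.

```python
from collections import defaultdict


def recommend(read_ids, similarity_rows, top_n=5):
    """Rank unseen articles by their summed similarity to the articles read."""
    seen = set(read_ids)
    scores = defaultdict(float)
    for article_id in read_ids:
        for other_id, sim in similarity_rows.get(article_id, []):
            if other_id not in seen:  # never recommend what was already read
                scores[other_id] += sim
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [article_id for article_id, _ in ranked[:top_n]]


# Toy similarity rows, invented for this example.
similarity_rows = {
    "a1": [("a3", 0.6), ("a4", 0.2)],
    "a2": [("a1", 0.9), ("a5", 0.4), ("a3", 0.3)],
}
print(recommend(["a1", "a2"], similarity_rows))
```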

Content-based recommenders are good for avoiding the cold start problem because even if the user has no read history you can show articles similar to the one she is looking at.

Also content-based recs are good when your inventory changes a lot (new articles appear all the time and they go out of favor quickly). You may never generate enough read behavior to use collaborative filtering alone.

BTW you might also look at Solr where you can use an article as a query against all articles indexed. This will also produce a list of ranked similar articles. Use the user's read history as queries and combine the lists somehow.  

In reply to Gokhan Capan <gk...@gmail.com>, Aug 29, 2013, 7:53 AM.

Re: Recommender for news articles based on own user profile (URL history)

Posted by Michael Wechner <mi...@wyona.com>.

Hi Gokhan

Thanks very much for the keywords and hints about this topic.
Will do some more research and probably come back again at some later stage.

Thanks

Michael


In reply to Gokhan Capan, 29.08.13 16:53.

Re: Recommender for news articles based on own user profile (URL history)

Posted by Gokhan Capan <gk...@gmail.com>.

Hi Michael,

Those are collaborative filtering examples, which would recommend a news
article i, to a user u, based on:
- A weighted average of other users' ratings on i (where weight is the
similarity of two users' rating histories)
- A weighted average of u's ratings on other items (where weight is the
similarity of two items' rating histories, that is, which users rated the
item and how they rated it)
- A combination of the user and item vectors from user and item latent
factor matrices, which are obtained by decomposing the original rating
matrix.
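
As an illustration of the second bullet (the item-based case), here is a small Python sketch of a similarity-weighted average of a user's own ratings. It is a textbook-style formulation and not Mahout's implementation; the function name and the dict-based inputs are assumptions.

```python
def predict_item_based(user_ratings, item_sims, target):
    """Predict the user's rating of `target` from the user's other ratings.

    user_ratings: {item id: rating} for one user.
    item_sims: {(item_a, item_b): similarity}, looked up in either order.
    Returns None when no rated item has a known similarity to the target.
    """
    numerator = 0.0
    denominator = 0.0
    for item, rating in user_ratings.items():
        if item == target:
            continue
        sim = item_sims.get((target, item), item_sims.get((item, target), 0.0))
        numerator += sim * rating
        denominator += abs(sim)
    return numerator / denominator if denominator else None


# Toy data, invented for this example.
user_ratings = {"a": 5.0, "b": 1.0}
item_sims = {("x", "a"): 0.8, ("b", "x"): 0.2}
print(predict_item_based(user_ratings, item_sims, "x"))
```

The prediction lands close to the rating of item "a" because "a" is far more similar to the target than "b" is.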

If you expect the system to recommend to a user only the news articles
that have similar content to the older news articles in which the user had
shown a positive interest, this is content-based filtering.
Also, the example you mentioned (recommending brand new articles)
introduces a challenge called the cold-start problem, and content-based
filtering can generalize to cold-start articles, too.

A search of the user list for content-based filtering/recommendation can help
you (I am saying this because there have been some great discussions on how to
achieve this with Mahout, for example, with custom similarity measures). If
you can't find anything satisfying, we can discuss that further.

Best,
Gokhan


In reply to Michael Wechner <mi...@wyona.com>, Thu, Aug 29, 2013, 4:21 PM.