Posted to user@mahout.apache.org by Peter K <pk...@gmail.com> on 2016/01/03 16:01:09 UTC

User similarity in Mahout

Hi all,

I'm trying to implement a recommender based 
on Mahout to recommend jobs for users. 
There are 2 actions - a user applied for a job or 
viewed a job. In terms of weight I'm using 5 for 
an apply and 2 for a view.

Now I'm trying to find the best user similarity measure to capture 
these relations.
For example:
User1 applied to jobs: J1,J2,J3,J4,J5
User2 applied to jobs: J1,J2,J3,J4,J6
User3 applied to jobs: J1, J7

With Euclidean distance similarity, if I'm not mistaken, 
Users 2 and 3 are equally similar to User1. 
But I feel User2 is more similar, 
and thus J6 should rank 
higher in the recommendations than J7.
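
The observation above can be checked with a small sketch. It assumes the Euclidean similarity is computed only over jobs that both users have rated, and that every apply carries the weight 5 mentioned earlier. The function names are illustrative and are not Mahout API. A set-overlap measure (the Tanimoto, or Jaccard, coefficient) is included for contrast, since it does take the non-shared jobs into account.

```python
from math import sqrt

APPLY = 5.0  # weight used in the post for an "apply"

# Preferences from the example in the post: user -> {job: weight}.
prefs = {
    "User1": {j: APPLY for j in ("J1", "J2", "J3", "J4", "J5")},
    "User2": {j: APPLY for j in ("J1", "J2", "J3", "J4", "J6")},
    "User3": {j: APPLY for j in ("J1", "J7")},
}

def euclidean_similarity(a, b):
    """1 / (1 + distance), with distance taken over co-rated jobs only (assumption)."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    dist = sqrt(sum((a[j] - b[j]) ** 2 for j in shared))
    return 1.0 / (1.0 + dist)

def tanimoto_similarity(a, b):
    """Number of shared jobs divided by the number of jobs either user touched."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0

for other in ("User2", "User3"):
    print(other,
          euclidean_similarity(prefs["User1"], prefs[other]),
          tanimoto_similarity(prefs["User1"], prefs[other]))
```

Under these assumptions both users get the same Euclidean score, because every shared job has an identical weight, while the overlap-based score ranks User2 well above User3.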

Generally, I'm looking for suggestions on which algorithms 
might be best for this 
case.

Thank you very much for any suggestions.

P.

Re: User similarity in Mahout

Posted by Peter K <pk...@gmail.com>.

Thank you very much, Pat. Really appreciate it.
I think what you've described is exactly what I need.

I'm going to try Mahout-Samsara, it looks promising.
I originally wanted to include some hybrid method 
(user + item based similarity) but this 
cross-cooccurrence might be the solution.

Thanks again.

p.

Re: User similarity in Mahout

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Your problem will be that there isn’t enough cooccurrence between users since, well, how many jobs can any one user apply for and how likely is another user to apply for the same or overlapping jobs? Job descriptions (JDs) have a short lifetime and so don’t lend themselves to the older single-action recommenders. The cooccurrences you show below are probably optimistic. I know this from public statements made by CareerBuilder, not to mention direct experience with a similar use case.

I’d expect collaborative filtering based on any one action, like “applying for a job”, to give very poor results for you. CareerBuilder (CB) tried this and got some decent results only for people with a large number of applications, but this was a small % of cases.

So, their solution was a content-based recommender that basically matched resumes to job descriptions based on content similarity. To get this to work well you may need things like NLP to get named entities, or at least a robust gazetteer that knows a large number of brand and technology names. There are also parsing services that will extract info from resumes. This is a long and somewhat complicated path and has little to do with Mahout.

A much simpler path is to use cross-cooccurrence with the newer SimilarityAnalysis.cooccurrence part of Mahout-Samsara that runs on Spark. It will allow you to use many more user actions, ones that may give more overlap between user activity. This is collaborative filtering but can ingest user actions that are different from “apply”, and whose targets are not restricted to Job Descriptions.
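
To make the idea concrete, here is a minimal plain-Python sketch of cooccurrence and cross-cooccurrence counting. It is not Mahout code: the interaction logs are made up, the function names are hypothetical, and the log-likelihood ratio (LLR) helper uses the standard entropy-based form of the test. Any thresholding or down-sampling a real implementation applies is left out.

```python
from collections import defaultdict
from math import log

# Hypothetical interaction logs (user -> set of job ids), not data from this thread.
applies = {"u1": {"J1", "J2"}, "u2": {"J1", "J3"}, "u3": {"J2"}}
views = {"u1": {"J1", "J2", "J4"}, "u2": {"J1", "J3", "J4"},
         "u3": {"J2", "J5"}, "u4": {"J4", "J5"}}

def cross_cooccurrence(primary, secondary):
    """Count users who did the primary action on i and the secondary action on j."""
    counts = defaultdict(int)
    for user, primary_items in primary.items():
        for i in primary_items:
            for j in secondary.get(user, ()):
                counts[(i, j)] += 1
    return dict(counts)

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def llr_for_pair(primary, secondary, i, j):
    """LLR that 'primary action on i' and 'secondary action on j' go together."""
    users = primary.keys() | secondary.keys()
    did_i = {u for u in users if i in primary.get(u, ())}
    did_j = {u for u in users if j in secondary.get(u, ())}
    k11 = len(did_i & did_j)
    k12 = len(did_i - did_j)
    k21 = len(did_j - did_i)
    k22 = len(users) - k11 - k12 - k21
    return llr(k11, k12, k21, k22)

cooc = cross_cooccurrence(applies, applies)   # apply vs. apply
cross = cross_cooccurrence(applies, views)    # apply vs. view
print(cross[("J1", "J4")])
print(llr_for_pair(applies, views, "J1", "J4"))
```

Passing the same log as both arguments gives ordinary cooccurrence. Passing a different secondary log gives the cross-cooccurrence, which is how a “view” of one job can become evidence for an “apply” to another.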

In this case you have or may be able to collect the following indicators of user preference: 
1) user-id, “apply”, job-description-id: from an actual application. This is what you want people to do, so it’s the closest indicator of user preference, assuming you don’t have information about whether they were accepted for a job, which might be even better.
2) user-id, “view”, job-description-id: from when a user reads the details of a JD
3) user-id, “category-preference”, category-id: again taken when a user “view”s a JD but the target of the action is the category of the JD, not the JD itself
4) user-id, “job-title-preference”, job-title-token: Take the job title and tokenize it, then feed in each token (minus stop words) as if they were “tags”. This could be taken when a user “view”s a JD
5) user-id, “other-JD-meta”, metadata-id: this could be anything about the JD that you know and is collected for users that “view” the JD. If you have tags, this would be a good way to use them.

You may also have user profile info taken from their resume, for instance their current job title. This can be encoded as:
6) user-id, “current-title”, job-title: here it might be necessary to tokenize and feed each token in unless you have some standardized list of titles. This is taken when a user enters their information into your app.
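
As a rough illustration of how one “view” can fan out into several of the indicators listed above, here is a small sketch. The stop-word list, the shape of the job record, and the function names are all assumptions made for the example. Only the indicator names come from the list.

```python
import re

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "at"}

def tokenize_title(title):
    """Lowercase a job title and split it into tokens, minus stop words."""
    tokens = re.findall(r"[a-z0-9+#]+", title.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def indicators_for_view(user_id, job):
    """Turn one 'view' of a job into (user-id, indicator-name, target-id) triples."""
    events = [
        (user_id, "view", job["id"]),
        (user_id, "category-preference", job["category"]),
    ]
    events += [(user_id, "job-title-preference", tok)
               for tok in tokenize_title(job["title"])]
    events += [(user_id, "other-JD-meta", tag) for tag in job.get("tags", [])]
    return events

# Hypothetical job record.
job = {"id": "J1", "category": "engineering",
       "title": "Senior Java Developer for the Cloud", "tags": ["remote"]}
for event in indicators_for_view("u1", job):
    print(event)
```

Each triple has the same shape as the “apply” events, so every indicator can be fed into the same cooccurrence machinery.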

The idea is to find many ways that users of your system can have data that is in common with other users. Then the recommender (I’ll describe next) will use a signal like “job-title-preference” or “view” even in cases where the user has never applied for a job and so would have none of the data you mention.
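
A toy sketch of that last point: suppose a cross-cooccurrence step has already produced, for each job, the targets of each secondary action that correlate with applying to it. The `indicators` table below is invented for illustration, and plain overlap counting stands in for whatever weighting a real recommender applies. A user with no “apply” history can still be scored from views and title tokens alone.

```python
# Hypothetical output of a cross-cooccurrence step: for each job, the targets
# of each secondary action found to correlate with applying to that job.
indicators = {
    "J1": {"view": {"J1", "J4"}, "job-title-preference": {"java", "developer"}},
    "J2": {"view": {"J2", "J5"}, "job-title-preference": {"nurse"}},
    "J3": {"view": {"J4"}, "job-title-preference": {"java", "architect"}},
}

def recommend(user_history, indicators, top_n=2):
    """Rank jobs by how many of the user's recorded targets match each job's indicators."""
    scores = {}
    for job, per_action in indicators.items():
        score = sum(len(per_action.get(action, set()) & targets)
                    for action, targets in user_history.items())
        if score > 0:
            scores[job] = score
    # Highest score first; ties broken by job id for a stable order.
    return sorted(scores, key=lambda j: (-scores[j], j))[:top_n]

# A user who has viewed one job and has title tokens, but has never applied.
new_user = {"view": {"J4"}, "job-title-preference": {"java", "developer"}}
print(recommend(new_user, indicators))
```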

As far as I know the only end-to-end, mostly off-the-shelf implementation of this that uses Mahout is the Universal Recommender here: https://github.com/actionml/template-scala-parallel-universal-recommendation. It is built on the PredictionIO framework described here: https://prediction.io
It supports any number of the “secondary” indicators (things like #2 through #6 above), and is integrated with an event store and recommendation server. The Mahout docs for the command-line version of cooccurrence analysis are here (in case you want to build your own framework): http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html

I seriously doubt the older Mahout Hadoop-based recommenders will help, since they can only use one indicator.
