
Posted to user@spark.apache.org by jeremycod <zo...@gmail.com> on 2016/07/15 03:36:52 UTC

How to recommend most similar users using Spark ML

Hi,

I need to develop a service that recommends to each user other, similar
users that he or she can connect to. For each user I have data about the
user's preferences for specific items, in the form:

user, item, preference  
1,    75,   0.89  
2,    168,  0.478  
2,    99,   0.321  
3,    31,   0.012

So far, I have implemented an approach based on cosine similarity that
compares one user's feature vector with those of all other users:

import org.jblas.DoubleMatrix

// Cosine similarity between two factor vectors.
def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double =
    vec1.dot(vec2) / (vec1.norm2() * vec2.norm2())

// Prints the recNumber users most similar to userid.
// `model` is assumed to be a trained ALS MatrixFactorizationModel in scope.
def user2usersimilarity(userid: Int, recNumber: Int): Unit = {
    val userFactor = model.userFeatures.lookup(userid).head
    val userVector = new DoubleMatrix(userFactor)
    val sims = model.userFeatures.map { case (id, factor) =>
        val factorVector = new DoubleMatrix(factor)
        (id, cosineSimilarity(factorVector, userVector))
    }
    // Fetch one extra result because the best match is the query user itself.
    val sortedSims = sims.top(recNumber + 1)(
        Ordering.by[(Int, Double), Double] { case (_, similarity) => similarity })
    println(sortedSims.slice(1, recNumber + 1).mkString("\n"))
}

This approach works fine with the MovieLens dataset in terms of the quality
of recommendations. However, my concern is the performance of such an
algorithm. Since I have to generate recommendations for all users in the
system, with this approach I would compare each user with every other user,
so the number of comparisons grows quadratically with the number of users.

I would appreciate it if somebody could suggest how to limit the comparison
for each user to its top N neighbors, or some other algorithm that would
work better for my use case.

Thanks,
Zoran




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-recommend-most-similar-users-using-Spark-ML-tp27342.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to recommend most similar users using Spark ML

Posted by Karl Higley <km...@gmail.com>.

There are also some Spark packages for finding approximate nearest
neighbors using locality sensitive hashing:
https://spark-packages.org/?q=tags%3Alsh
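
For cosine similarity, one common form of locality-sensitive hashing uses random hyperplanes: each vector gets one bit per hyperplane (the side it falls on), and only vectors with the same bit signature are compared. The sketch below illustrates the idea on plain Scala collections standing in for `model.userFeatures`. It is not the API of any of the packages listed at the link, and all names in it (`LshSketch`, `randomPlanes`, `signature`, `buildBuckets`, `topSimilar`) are hypothetical.

```scala
import scala.util.Random

object LshSketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double =
    a.indices.map(i => a(i) * b(i)).sum

  def cosine(a: Vec, b: Vec): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  // Draw numBits random hyperplanes (Gaussian normal vectors).
  // At most 32 planes so that a signature fits in one Int.
  def randomPlanes(numBits: Int, dim: Int, seed: Long): Array[Vec] = {
    require(numBits >= 1 && numBits <= 32)
    val rnd = new Random(seed)
    Array.fill(numBits)(Array.fill(dim)(rnd.nextGaussian()))
  }

  // One bit per hyperplane: set when the vector lies on its non-negative side.
  def signature(v: Vec, planes: Array[Vec]): Int =
    planes.indices.foldLeft(0) { (sig, i) =>
      if (dot(v, planes(i)) >= 0) sig | (1 << i) else sig
    }

  // Group users by signature once; every query reuses these buckets.
  def buildBuckets(factors: Map[Int, Vec],
                   planes: Array[Vec]): Map[Int, Seq[(Int, Vec)]] =
    factors.toSeq.groupBy { case (_, v) => signature(v, planes) }

  // Score only the users that share the query user's bucket.
  def topSimilar(userId: Int, n: Int, factors: Map[Int, Vec],
                 buckets: Map[Int, Seq[(Int, Vec)]],
                 planes: Array[Vec]): Seq[(Int, Double)] = {
    val query = factors(userId)
    buckets.getOrElse(signature(query, planes), Seq.empty)
      .collect { case (id, v) if id != userId => (id, cosine(v, query)) }
      .sortBy(-_._2)
      .take(n)
  }
}
```

More bits give smaller buckets and faster queries but miss more true neighbors; real implementations usually build several hash tables to recover some of that recall.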

On Fri, Jul 15, 2016 at 7:45 AM nguyen duc Tuan <ne...@gmail.com>
wrote:

> Hi jeremycod,
> If you want to find the top N nearest neighbors for all users with an
> exact top-k algorithm, I recommend using the same approach as the one used
> in MLlib:
> https://github.com/apache/spark/blob/85d6b0db9f5bd425c36482ffcb1c3b9fd0fcdb31/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L272
>
> If the number of users is large, the exact top-k algorithm can be rather
> slow; try using an approximate nearest neighbors algorithm. There is a good
> benchmark of various libraries here:
> https://github.com/erikbern/ann-benchmarks

Re: How to recommend most similar users using Spark ML

Posted by nguyen duc Tuan <ne...@gmail.com>.

Hi jeremycod,
If you want to find the top N nearest neighbors for all users with an exact
top-k algorithm, I recommend using the same approach as the one used in
MLlib:
https://github.com/apache/spark/blob/85d6b0db9f5bd425c36482ffcb1c3b9fd0fcdb31/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L272
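
A minimal local sketch of the exact top-k idea, under two assumptions: factor vectors are scaled to unit length once, so cosine similarity reduces to a dot product, and each user keeps only a bounded heap of its k best matches instead of a fully sorted list. This is not the MLlib code behind the link; it runs on plain Scala collections, and the names `ExactTopK`, `normalize` and `topKForAll` are hypothetical.

```scala
import scala.collection.mutable

object ExactTopK {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Scale to unit length so cosine similarity becomes a plain dot product.
  def normalize(v: Vec): Vec = {
    val norm = math.sqrt(dot(v, v))
    if (norm == 0.0) v else v.map(_ / norm)
  }

  // For every user, return its k most similar other users, best first.
  def topKForAll(factors: Seq[(Int, Vec)], k: Int): Map[Int, Seq[(Int, Double)]] = {
    val unit = factors.map { case (id, v) => (id, normalize(v)) }
    unit.map { case (id, v) =>
      // Reversed ordering makes this a min-heap on similarity:
      // dequeue() removes the weakest of the candidates kept so far.
      val heap = mutable.PriorityQueue.empty[(Int, Double)](
        Ordering.by[(Int, Double), Double](_._2).reverse)
      for ((otherId, w) <- unit if otherId != id) {
        heap.enqueue((otherId, dot(v, w)))
        if (heap.size > k) heap.dequeue()
      }
      id -> heap.toSeq.sortBy(-_._2)
    }.toMap
  }
}
```

The work is still quadratic in the number of users, but memory per user stays at k entries and the norms are computed once rather than for every pair.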

If the number of users is large, the exact top-k algorithm can be rather
slow; try using an approximate nearest neighbors algorithm. There is a good
benchmark of various libraries here:
https://github.com/erikbern/ann-benchmarks
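
When comparing an approximate method against the exact one, a common quality measure is recall at k: the fraction of the true k nearest neighbors that the approximate result also returns. The helper below is a small illustrative sketch; `RecallSketch` and `recallAtK` are hypothetical names and not part of any benchmark's API.

```scala
object RecallSketch {
  // Fraction of the exact top-k ids that also appear in the approximate top-k.
  def recallAtK(exact: Seq[Int], approx: Seq[Int], k: Int): Double = {
    require(k > 0, "k must be positive")
    val truth = exact.take(k).toSet
    if (truth.isEmpty) 1.0
    else approx.take(k).distinct.count(truth.contains).toDouble / truth.size
  }
}
```

For example, if the exact neighbors of a user are 2, 5, 7, 9 and the approximate search returns 5, 2, 8, 9, three of the four true neighbors were found.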
