Posted to issues@spark.apache.org by "Debasish Das (JIRA)" <ji...@apache.org> on 2015/04/10 05:47:12 UTC

[jira] [Comment Edited] (SPARK-4823) rowSimilarities

    [ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14488842#comment-14488842 ] 

Debasish Das edited comment on SPARK-4823 at 4/10/15 3:46 AM:
--------------------------------------------------------------

I implemented the idea I mentioned above using level-1 BLAS, since I abstract the kernel out and I wanted the code to support two distributed matrix multiplies, the kernel abstraction, and both sparse and dense vectors. In the future, for the dense-dense case, we can use some level-3 BLAS. The code is written in blocked form.
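
To make the kernel abstraction concrete, here is a minimal sketch of the shape I have in mind (names like Kernel and CosineKernel are illustrative, not necessarily what the PR will use):

    import org.apache.spark.mllib.linalg.Vector

    // Illustrative kernel abstraction: a kernel reduces a pair of rows to a
    // scalar using only level-1 operations, so one code path can serve both
    // sparse and dense vectors.
    trait Kernel extends Serializable {
      def compute(vi: Vector, vj: Vector): Double
    }

    class CosineKernel extends Kernel {
      // Naive level-1 dot product; a real implementation would iterate only
      // the active entries of a sparse vector instead of calling toArray.
      private def dot(a: Array[Double], b: Array[Double]): Double = {
        var s = 0.0
        var i = 0
        while (i < a.length) { s += a(i) * b(i); i += 1 }
        s
      }
      override def compute(vi: Vector, vj: Vector): Double = {
        val (x, y) = (vi.toArray, vj.toArray)
        val denom = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
        if (denom == 0.0) 0.0 else dot(x, y) / denom
      }
    }

With this shape, the blocked row-similarity driver only sees Kernel.compute, so swapping in other kernels later is a one-line change.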

On the Netflix dataset we run rowSimilarity with a CosineKernel on 20 nodes (4 cores and 16 GB per node) in 500 seconds. If I go from the raw data to a reduced dimension and then run rowSimilarity with the CosineKernel, it finishes in 320 seconds. colSimilarity without DIMSUM sampling had run for 28 minutes when I killed the job. For matrices that are not Twitter-tall, say around 100M rows with 1-10M columns, I feel this code will work well.

The next tricks for this flow are LSH and Random Kitchen Sinks. The code is going through legal review, and I will open the PR soon for reviews. This code will also bring kernel generation to MLlib.
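
For context, Random Kitchen Sinks (Rahimi and Recht's random Fourier features) project each row into a small dense representation whose inner products approximate an RBF kernel, so the similarity pass can run on the reduced features. A rough sketch, with illustrative names and no claim that the PR will look like this:

    import scala.util.Random

    // Random Fourier features: z(x) . z(y) ~= exp(-gamma * ||x - y||^2).
    class RandomKitchenSink(inputDim: Int, numFeatures: Int, gamma: Double, seed: Long = 42L) {
      private val rng = new Random(seed)
      // w_j ~ N(0, 2 * gamma * I), b_j ~ Uniform[0, 2 * pi)
      private val w = Array.fill(numFeatures, inputDim)(rng.nextGaussian() * math.sqrt(2.0 * gamma))
      private val b = Array.fill(numFeatures)(rng.nextDouble() * 2.0 * math.Pi)

      def transform(x: Array[Double]): Array[Double] = {
        val scale = math.sqrt(2.0 / numFeatures)
        Array.tabulate(numFeatures) { j =>
          var s = 0.0
          var i = 0
          while (i < x.length) { s += w(j)(i) * x(i); i += 1 }
          scale * math.cos(s + b(j))
        }
      }
    }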

I will also add an examples.MovieLensSimilarity that compares colSimilarity, colSimilarity with DIMSUM sampling, rowSimilarity, and rowSimilarity with the dimension reduced by ALS. My experiments so far show a 40% intersection between raw similarity and ALS-implicit on MovieLens, and 24% on the Netflix dataset. That is surprising, but I think a larger rank, or LSH-based sampling, is the key to bridging the gap.
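
As one concrete reading of "intersection", here is a top-k overlap metric: for each row, take the top-k neighbors from each method and average the fraction they share (a sketch with illustrative names, not the exact PR code):

    // userId -> (neighborId, similarity score), from two different methods
    def topKOverlap(
        raw: Map[Int, Seq[(Int, Double)]],
        reduced: Map[Int, Seq[(Int, Double)]],
        k: Int): Double = {
      val overlaps = for {
        (user, rawSims) <- raw
        redSims <- reduced.get(user)
      } yield {
        val a = rawSims.sortBy(-_._2).take(k).map(_._1).toSet
        val b = redSims.sortBy(-_._2).take(k).map(_._1).toSet
        a.intersect(b).size.toDouble / k
      }
      if (overlaps.isEmpty) 0.0 else overlaps.sum / overlaps.size
    }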


> rowSimilarities
> ---------------
>
>                 Key: SPARK-4823
>                 URL: https://issues.apache.org/jira/browse/SPARK-4823
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Reza Zadeh
>
> RowMatrix has a columnSimilarities method to find cosine similarities between columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), brute force is unlikely to be feasible, since the output will be of order 10^12.


