
Posted to issues@spark.apache.org by "Reza Zadeh (JIRA)" <ji...@apache.org> on 2014/08/06 22:08:12 UTC

[jira] [Created] (SPARK-2885) All-pairs similarity via DIMSUM

Reza Zadeh created SPARK-2885:
---------------------------------

             Summary: All-pairs similarity via DIMSUM
                 Key: SPARK-2885
                 URL: https://issues.apache.org/jira/browse/SPARK-2885
             Project: Spark
          Issue Type: New Feature
            Reporter: Reza Zadeh


Build an all-pairs similarity algorithm via DIMSUM.

Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”.
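
To make the problem statement concrete, here is a minimal brute-force sketch in plain Python (not Spark code). It assumes each sparse vector is stored as a dict from index to value; the names `cosine` and `all_pairs_similar` are illustrative and do not come from the PR.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(x * v[i] for i, x in u.items() if i in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def all_pairs_similar(vectors, threshold):
    """Return (id_a, id_b, score) for every pair scoring at or above threshold.

    `vectors` maps a vector id to its sparse vector. Every pair is compared,
    so the work grows quadratically with the number of vectors.
    """
    result = []
    for (a, u), (b, v) in combinations(sorted(vectors.items()), 2):
        score = cosine(u, v)
        if score >= threshold:
            result.append((a, b, score))
    return result


if __name__ == "__main__":
    data = {
        "a": {0: 1.0, 1: 1.0},
        "b": {0: 2.0, 1: 2.0},
        "c": {2: 5.0},
    }
    for pair in all_pairs_similar(data, threshold=0.9):
        print(pair)
```

This is only the baseline that defines the expected output; it is the approach that stops scaling once the number of vectors gets large.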

The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically in the number of vectors. For example, for a million vectors, it is not feasible to check all of the roughly half a trillion (about 5 x 10^11) unordered pairs to see if they are above the similarity threshold. However, there exist clever sampling techniques that focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible.
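
The message does not spell out the sampling scheme, so the following is only a sketch of one norm-based sampling idea in the spirit of DIMSUM, written in plain Python. It makes several assumptions that are not in the text above: the input is a list of sparse rows, similarities are computed between columns, and an oversampling parameter (called `gamma` here) trades accuracy for work. All function names are hypothetical, and this is not the code in the linked PR.

```python
import math
import random
from collections import defaultdict


def column_norms(rows):
    """Euclidean norm of each column, given rows as {column: value} dicts."""
    squares = defaultdict(float)
    for row in rows:
        for j, v in row.items():
            squares[j] += v * v
    return {j: math.sqrt(s) for j, s in squares.items() if s > 0.0}


def sampled_column_similarities(rows, gamma, seed=0):
    """Estimate cosine similarity between columns by sampling entry pairs.

    Assumed scheme: an entry in column j is kept with probability
    min(1, sqrt(gamma) / norm_j), and each kept product is divided by
    min(sqrt(gamma), norm_j) for both columns. Under this scheme the
    expected value of each estimate equals the exact cosine similarity,
    while columns with large norms contribute fewer emitted pairs.
    """
    rng = random.Random(seed)
    norms = column_norms(rows)
    sqrt_gamma = math.sqrt(gamma)
    keep_prob = {j: min(1.0, sqrt_gamma / n) for j, n in norms.items()}
    scale = {j: min(sqrt_gamma, n) for j, n in norms.items()}

    sims = defaultdict(float)
    for row in rows:
        # Explicit zeros carry no information and are skipped.
        items = sorted((j, v) for j, v in row.items() if v != 0.0)
        for pos, (j, vj) in enumerate(items):
            if rng.random() >= keep_prob[j]:
                continue
            for k, vk in items[pos + 1:]:
                if rng.random() < keep_prob[k]:
                    sims[(j, k)] += (vj / scale[j]) * (vk / scale[k])
    return dict(sims)


if __name__ == "__main__":
    rows = [
        {0: 1.0, 1: 2.0},
        {0: 3.0, 2: 1.0},
        {1: 1.0, 2: 4.0},
    ]
    # A very large gamma keeps every entry, so the estimates become exact.
    print(sampled_column_similarities(rows, gamma=1e9))
    # A small gamma samples aggressively and gives noisier estimates.
    print(sampled_column_similarities(rows, gamma=4.0, seed=1))
```

In this sketch, when `gamma` is large relative to the squared column norms nothing is dropped and the result matches the brute-force answer; lowering `gamma` reduces the number of emitted pairs at the cost of higher variance in the estimates.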

The current PR for this is a work in progress:
https://github.com/apache/spark/pull/1778



--
This message was sent by Atlassian JIRA
(v6.2#6252)
