You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Mingjie Tang (JIRA)" <ji...@apache.org> on 2017/03/01 00:19:45 UTC

[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

    [ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889180#comment-15889180 ] 

Mingjie Tang commented on SPARK-19771:
--------------------------------------

If we follow the AND-OR framework, one optimization is here: 

At first, suppose we use BucketedRandomProjectionLSH, and  setNumHashTables(3) and setNumHashFunctions(5).

For one input vector with sparse format e.g., [(6,[0,1,2],[1.0,1.0,1.0])], 
we can compute its mapped hash vectors is
WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], [-1.0,-1.0,-1.0,-1.0,0.0])]

For the similarity-join, we map this computed vector into
+--------------------+-----+--------------------+
|            datasetA|entry|           hashValue|
+--------------------+-----+--------------------+
|[0,(6,[0,1,2],[1....|    0|[-1.0,-1.0,-1.0,0...|
|[0,(6,[0,1,2],[1....|    1|[-1.0,0.0,0.0,-1....|
|[0,(6,[0,1,2],[1....|    2|[-1.0,-1.0,-1.0,-...|

Then, based on AND-OR principle, we using the entry and hashValue for equip-join with other table . When we look at the the sql, it need to iterate through the hashValue vector for equal join. Thus, this computation and memory usage cost is very high. For example, if we apply the nest-loop for comparing two vectors with length of NumHashFunctions, the computation cost is (NumHashFunctions* NumHashFunctions) and memory overhead is (N* NumHashFunctions). 

Therefore, we can apply one optimization technique here. that is, we do not need to store the hash value for each hash table like [-1.0,-1.0,-1.0,0.0,0.0], instead, we use a simple hash function to improve this. 

Suppose h_l= {h_0, h_1., ... h_k}, where k is the number of setNumHashFunctions. 
then, the mapped hash function is 
     H(h_l)=sum_{i =0...k} (h_i*r_i) mod prime.
where r_i is a integer  and the prime can be set as 2^32-5 for less hash collision.
Then, we only need to store the hash value H(h_l) for AND operation.  

The similar approach also can be applied for the approxNeasetNeighbors searching. 

If the multi-probe approach does not need to read the hash value for one hash function (e.g., h_l mentioned above), we can applied this H(h_l) to the preprocessed data to save memory. I will double check the multi-probe approach and see whether it is work.  Then, I can submit a patch to improve the current way. Note, this hash function to reduce the vector-to-vector comparing is widely used in different places for example, the LSH implementation by Andoni . 
 
[~diefunction] [~mlnick] 


> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> ----------------------------------------------------------------
>
>                 Key: SPARK-19771
>                 URL: https://issues.apache.org/jira/browse/SPARK-19771
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to discuss the following questions before we goes to implementations:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org