You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:01:07 UTC

[jira] [Updated] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

     [ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-18392:
---------------------------------
    Labels: bulk-closed  (was: )

> LSH API, algorithm, and documentation follow-ups
> ------------------------------------------------
>
>                 Key: SPARK-18392
>                 URL: https://issues.apache.org/jira/browse/SPARK-18392
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Priority: Major
>              Labels: bulk-closed
>
> This JIRA summarizes discussions from the initial LSH PR [https://github.com/apache/spark/pull/15148] as well as the follow-up for hash distance [https://github.com/apache/spark/pull/15800].  This will be broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in the future
> ** Need to clarify which we support, including in each model function (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This is the "raw" output of all hash functions, i.e., with no aggregation for amplification.
> ** Clarify meaning of output in terms of multiple hash functions and amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe LSH paper
> Wikipedia
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org