Posted to reviews@spark.apache.org by kturgut <gi...@git.apache.org> on 2017/11/02 06:24:57 UTC

[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

Github user kturgut commented on the issue:

    https://github.com/apache/spark/pull/17092
  
    @jkbradley @MLnick @sethah @Yunni  @merlintang @akatz  
    It seems LSH would be a perfect fit for matching patient records, if only I could figure out how to assign a different weight to each column of the patient record being compared. For instance, each record may have zero to many identifiers. If the identifiers match exactly, we consider it a solid match. However, if the IDs do not strongly match, we also look at an additional set of fields, such as name, birthdate, and address, at different weights.
    For instance, if the names match exactly, that is a stronger signal than if they match with small typos.
    To give a different weight to each field being compared, would I have to write a custom distance calculator?
    Or should I instead do MinHashing and then LSH as a second step, as described in this document: http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf?
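    To sketch what I mean (a rough, untested sketch against the current MinHashLSH API; the column names, weights, threshold, and sample data below are all made up): one way to fold per-field weights into plain MinHash is to emit several distinct namespaced copies of each token, so a field with weight w contributes w positions to the binary feature vector and counts roughly w times toward the Jaccard similarity:

```scala
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WeightedLshSketch {
  // Pure helper: emit `weight` distinct namespaced copies of each token,
  // e.g. ("name", 2, Seq("smith")) -> Seq("name:smith#0", "name:smith#1"),
  // so each copy occupies its own position in the binary feature vector.
  def weightedTokens(field: String, weight: Int, tokens: Seq[String]): Seq[String] =
    tokens.flatMap(t => (0 until weight).map(i => s"$field:$t#$i"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]")
      .appName("WeightedLshSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical patient records: id, tokenized name, tokenized address.
    val patients = Seq(
      (0, Seq("john", "smith"), Seq("12", "main", "st")),
      (1, Seq("jon", "smith"), Seq("12", "main", "st"))
    ).toDF("id", "nameTokens", "addrTokens")

    def weighted(field: String, weight: Int) =
      udf((tokens: Seq[String]) => weightedTokens(field, weight, tokens))

    // Name weighted 3x, address 1x (weights are a placeholder, not a recommendation).
    val tokens = patients.withColumn("tokens",
      concat(weighted("name", 3)($"nameTokens"),
             weighted("addr", 1)($"addrTokens")))

    val tf = new HashingTF().setInputCol("tokens").setOutputCol("features")
      .setNumFeatures(1 << 18)
    val featurized = tf.transform(tokens)

    val mh = new MinHashLSH().setNumHashTables(5)
      .setInputCol("features").setOutputCol("hashes")
    val model = mh.fit(featurized)

    // Approximate self-join on Jaccard distance; drop self and mirror pairs.
    model.approxSimilarityJoin(featurized, featurized, 0.6, "jaccardDist")
      .filter($"datasetA.id" < $"datasetB.id")
      .show(truncate = false)

    spark.stop()
  }
}
```

    The 0.6 distance threshold and the number of hash tables are guesses that would need tuning, and the token replication obviously inflates the vocabulary, so numFeatures may need to grow with the weights.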
    It does not look like AND-OR amplification alone would help with that, since it only takes the number of hash functions as input, and we do not seem to have control over the sensitivity of the individual hash functions.
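    For context, the banding analysis in those slides says that with b bands of r rows each, two records with Jaccard similarity s become a candidate pair with probability

```latex
P(\text{candidate}) = 1 - \left(1 - s^{r}\right)^{b}
```

    If I understand the current API correctly, approxSimilarityJoin effectively fixes r = 1 and exposes only b (numHashTables), flattening the S-curve to 1 - (1 - s)^b, which is why per-field sensitivity does not seem reachable through the amplification parameters alone.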
    I would really appreciate your guidance.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org