Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2021/01/02 21:40:20 UTC

[GitHub] [systemds] Shafaq-Siddiqi commented on pull request #1139: [SYSTEMDS-2782] Built-in mdedup

Shafaq-Siddiqi commented on pull request #1139:
URL: https://github.com/apache/systemds/pull/1139#issuecomment-753533113


   > This PR adds a new built-in mdedup for detecting duplicates in frames using matching dependencies (like Street 0.95, City 0.90 -> ZIP 1.0).
   > @Shafaq-Siddiqi For simplicity, I used Jaccard similarity, but I found out that Levenshtein or Jaro distance could also be used; should I add them as well? To compute the Jaccard similarity between rows (strings) of an n x 1 vector, a two-argument form of map was added: dist = map(Xi, "(x, y) -> UtilFunctions.jaccardSim(x, y)").
   > 
   > I also modified the discoverFD built-in by setting diag to 1.
   
   Thanks @OlgaOvcharenko. For now it is fine to have only Jaccard similarity; we will come back to other methods when we extend the overall implementation.
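
For readers unfamiliar with the measure, here is a minimal sketch of a Jaccard similarity between two strings, assuming each string is treated as a set of characters. The class `JaccardSketch`, the helper `charSet`, and the empty-string convention are hypothetical and chosen for illustration; the actual `UtilFunctions.jaccardSim` in SystemDS may tokenize or handle edge cases differently.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: not the SystemDS implementation.
public class JaccardSketch {

    // Hypothetical helper: collect the distinct characters of a string.
    static Set<Character> charSet(String s) {
        Set<Character> chars = new HashSet<>();
        for (char c : s.toCharArray()) {
            chars.add(c);
        }
        return chars;
    }

    // Jaccard similarity |A ∩ B| / |A ∪ B| over character sets.
    static double jaccardSim(String x, String y) {
        Set<Character> a = charSet(x);
        Set<Character> b = charSet(y);
        // Assumed convention: two empty strings are considered identical.
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        // |A ∪ B| = |A| + |B| - |A ∩ B|
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Two spellings of the same street share most characters.
        System.out.println(jaccardSim("Main St", "Main Street"));
        // No characters in common.
        System.out.println(jaccardSim("ab", "cd"));
    }
}
```

In a matching dependency such as Street 0.95, City 0.90 -> ZIP 1.0, a score like this would be compared against the per-column threshold to decide whether two rows count as similar on that column.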


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org