Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2021/01/04 13:58:43 UTC

[GitHub] [systemds] Shafaq-Siddiqi commented on a change in pull request #1139: [SYSTEMDS-2782] Built-in mdedup

Shafaq-Siddiqi commented on a change in pull request #1139:
URL: https://github.com/apache/systemds/pull/1139#discussion_r551331715



##########
File path: scripts/builtin/mdedup.dml
##########
@@ -0,0 +1,114 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#------------------------------------------------------------------------------------------------------------------
+
+# Implements builtin for deduplication using matching dependencies (like Street 0.95, City 0.90 -> ZIP 1.0)
+# and Jaccard distance.
+# 
+# INPUT PARAMETERS:
+# -----------------------------------------------------------------------------------------------------------------
+# NAME            TYPE              DEFAULT     MEANING
+# -----------------------------------------------------------------------------------------------------------------
+# X               Frame               --       Input Frame X
+# LHSfeatures     Matrix[Integer]     --       A matrix 1xd with numbers of columns for MDs
+#                                              (like Street 0.95, City 0.90 -> ZIP 1.0)
+# LHSthreshold    Matrix[Double]      --       A matrix 1xd with threshold values in interval [0, 1] for MDs
+# RHSfeatures     Matrix[Integer]     --       A matrix 1xd with numbers of columns for MDs
+# RHSthreshold    Matrix[Double]      --       A matrix 1xd with threshold values in interval [0, 1] for MDs
+# verbose         Boolean             --       To print the output
+# -----------------------------------------------------------------------------------------------------------------
+#
+# Output(s)
+# -----------------------------------------------------------------------------------------------------------------
+# NAME                 TYPE         DEFAULT     MEANING
+# -----------------------------------------------------------------------------------------------------------------
+# MD              Matrix[Double]      ---       Matrix nx1 of duplicates
+
+s_mdedup = function(Frame[String] X, Matrix[Double] LHSfeatures, Matrix[Double] LHSthreshold,
+    Matrix[Double] RHSfeatures, Matrix[Double] RHSthreshold, Boolean verbose)
+  return(Matrix[Double] MD)
+{
+  n = nrow(X)
+  d = ncol(X)
+
+  if (0 > (ncol(LHSfeatures) + ncol(RHSfeatures)) > d)
+    stop("Invalid input: thresholds should in interval [0, " + d + "]")
+
+  if ((ncol(LHSfeatures) != ncol(LHSthreshold)) | (ncol(RHSfeatures) != ncol(RHSthreshold)))
+      stop("Invalid input: number of thresholds and columns to compare should be equal for LHS and RHS.")
+
+  if (max(LHSfeatures) > d | max(RHSfeatures) > d)
+    stop("Invalid input: feature values should be less than " + d)
+
+  if (sum(LHSthreshold > 1) > 0 | sum(RHSthreshold > 1) > 0)
+    stop("Invalid input: threshold values should be in the interval [0, 1].")
+
+  MD = matrix(0, n, 1)
+
+  LHS_MD = getMDAdjacency(X, LHSfeatures, LHSthreshold)
+
+  if (sum(LHS_MD) > 0) {
+    RHS_MD = getMDAdjacency(X, RHSfeatures, RHSthreshold)
+  }
+
+  MD = detectDuplicates(LHS_MD, LHS_MD)

Review comment:
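       Probably a typo: `LHS_MD` is passed for both arguments, so the `RHS_MD`
       computed just above is never used. The call was presumably meant to be
       `detectDuplicates(LHS_MD, RHS_MD)`.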
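
       A minimal Python sketch of the concern, not the DML code from the PR.
       It assumes that `getMDAdjacency` returns an n x n 0/1 matrix marking
       record pairs whose similarity meets every threshold, and that
       `detectDuplicates` flags a record when it matches some other record on
       both sides. Neither helper is shown in this diff, so both behaviors,
       the character-set Jaccard measure, and the sample rows are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of the character sets of two strings (assumed measure)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def md_adjacency(rows, features, thresholds):
    """Hypothetical stand-in for getMDAdjacency: n x n 0/1 matrix with a 1
    where two distinct rows meet every threshold on the given columns."""
    n = len(rows)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and all(
                jaccard(rows[i][f], rows[j][f]) >= t
                for f, t in zip(features, thresholds)
            ):
                adj[i][j] = 1
    return adj


def detect_duplicates(lhs, rhs):
    """Hypothetical stand-in for detectDuplicates: flag row i when some
    other row matches it on both the LHS and the RHS."""
    n = len(lhs)
    return [int(any(lhs[i][j] and rhs[i][j] for j in range(n))) for i in range(n)]


# Columns: street, city, zip (0-based here; the DML script is 1-based).
rows = [
    ("Main Street", "Graz", "8010"),
    ("Main Stret", "Graz", "8010"),
    ("Main Street", "Graz", "1010"),
]

lhs = md_adjacency(rows, [0, 1], [0.95, 0.90])  # Street 0.95, City 0.90
rhs = md_adjacency(rows, [2], [1.0])            # -> ZIP 1.0

intended = detect_duplicates(lhs, rhs)    # uses both sides of the MD
as_written = detect_duplicates(lhs, lhs)  # mirrors the call in the diff
print(intended, as_written)
```

       With the same matrix passed twice, the third row is flagged even though
       its ZIP differs, which is exactly what the RHS check would rule out.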



