You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@systemds.apache.org by ba...@apache.org on 2020/08/15 12:58:11 UTC

[systemds] branch master updated: [SYSTEMDS-2621] DBScan and dist builtin

This is an automated email from the ASF dual-hosted git repository.

baunsgaard pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/systemds.git


The following commit(s) were added to refs/heads/master by this push:
     new fc0e210  [SYSTEMDS-2621] DBScan and dist builtin
fc0e210 is described below

commit fc0e21059e06d1928020657b3882e986363f7fda
Author: Olga Ovcharenko <ol...@student.tugraz.at>
AuthorDate: Sat Aug 15 14:56:44 2020 +0200

    [SYSTEMDS-2621] DBScan and dist builtin
    
    - Implemented dist for euclidean distance matrix.
    - Implemented DBSCAN
    - Added R dbscan package
    - Fixed error for similar points and all noise points
    - Fixed DBSCAN
    - Updated DBSCAN and added documentation
    - Added tests for dist and DBScan.
    
    Closes #1003.
---
 docs/site/builtins-reference.md                    | 61 ++++++++++++++
 scripts/builtin/dbscan.dml                         | 72 +++++++++++++++++
 scripts/builtin/dist.dml                           | 29 +++++++
 .../java/org/apache/sysds/common/Builtins.java     |  4 +-
 .../test/functions/builtin/BuiltinDBSCANTest.java  | 93 ++++++++++++++++++++++
 .../test/functions/builtin/BuiltinDistTest.java    | 85 ++++++++++++++++++++
 src/test/scripts/functions/builtin/dbscan.R        | 31 ++++++++
 src/test/scripts/functions/builtin/dbscan.dml      | 26 ++++++
 src/test/scripts/functions/builtin/dist.R          | 28 +++++++
 src/test/scripts/functions/builtin/dist.dml        | 24 ++++++
 10 files changed, 452 insertions(+), 1 deletion(-)

diff --git a/docs/site/builtins-reference.md b/docs/site/builtins-reference.md
index 0ad39c4..6d772f3 100644
--- a/docs/site/builtins-reference.md
+++ b/docs/site/builtins-reference.md
@@ -29,7 +29,9 @@ limitations under the License.
   * [DML-Bodied Built-In functions](#dml-bodied-built-in-functions)
     * [`confusionMatrix`-Function](#confusionmatrix-function)
     * [`cvlm`-Function](#cvlm-function)
+    * [`DBSCAN`-Function](#DBSCAN-function)
     * [`discoverFD`-Function](#discoverFD-function)
+    * [`dist`-Function](#dist-function)
     * [`glm`-Function](#glm-function)
     * [`gridSearch`-Function](#gridSearch-function)
     * [`hyperband`-Function](#hyperband-function)
@@ -212,6 +214,37 @@ y = X %*% rand(rows = ncol(X), cols = 1)
 [predict, beta] = cvlm(X = X, y = y, k = 4)
 ```
 
+## `DBSCAN`-Function
+
+The dbscan() implements the DBSCAN Clustering algorithm using Euclidian distance.
+
+### Usage
+
+```r
+Y = dbscan(X = X, eps = 2.5, minPts = 5)
+```
+
+### Arguments
+
+| Name       | Type            | Default    | Description |
+| :--------- | :-------------- | :--------- | :---------- |
+| X          | Matrix[Double]  | required   | The input Matrix to do DBSCAN on. |
+| eps        | Double          | `0.5`      | Maximum distance between two points for one to be considered reachable for the other. |
+| minPts     | Int             | `5`        | Number of points in a neighborhood for a point to be considered as a core point (includes the point itself). |
+
+### Returns
+
+| Type        | Description |
+| :-----------| :---------- |
+| Matrix[Integer] | The mapping of records to clusters |
+
+### Example
+
+```r
+X = rand(rows=1780, cols=180, min=1, max=20) 
+dbscan(X = X, eps = 2.5, minPts = 360)
+```
+
 ## `discoverFD`-Function
 
 The `discoverFD`-function finds the functional dependencies.
@@ -236,6 +269,34 @@ discoverFD(X, Mask, threshold)
 | :----- | :---------- |
 | Double | matrix of functional dependencies |
 
+## `dist`-Function
+
+The `dist`-function is used to compute Euclidian distances between N d-dimensional points.
+
+### Usage
+
+```r
+dist(X)
+```
+
+### Arguments
+
+| Name | Type           | Default  | Description |
+| :--- | :------------- | :------- | :---------- |
+| X    | Matrix[Double] | required | (n x d) matrix of d-dimensional points  |
+
+### Returns
+
+| Type           | Description |
+| :------------- | :---------- |
+| Matrix[Double] | (n x n) symmetric matrix of Euclidian distances |
+
+### Example
+
+```r
+X = rand (rows = 5, cols = 5)
+Y = dist(X)
+```
 
 ## `glm`-Function
 
diff --git a/scripts/builtin/dbscan.dml b/scripts/builtin/dbscan.dml
new file mode 100644
index 0000000..74d4040
--- /dev/null
+++ b/scripts/builtin/dbscan.dml
@@ -0,0 +1,72 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+#
+# Implements the DBSCAN clustering algorithm using Euclidian distance matrix
+#
+# INPUT PARAMETERS:
+# ----------------------------------------------------------------------------
+# NAME      TYPE             DEFAULT    MEANING
+# ----------------------------------------------------------------------------
+# X         Matrix[Double]   ---        The input Matrix to do DBSCAN on.
+# eps       Double           0.5        Maximum distance between two points for one to be considered reachable for the other.
+# minPts    Int              5          Number of points in a neighborhood for a point to be considered as a core point (includes the point itself).
+#
+
+m_dbscan = function (Matrix[double] X, Double eps = 0.5, Integer minPts = 5)
+    return (Matrix[double] clusterMembers)
+{
+    #check input parameter assertions
+    if(minPts < 0) { stop("DBSCAN: Stopping due to invalid inputs: minPts should be greater than 0"); }
+    if(eps < 0) { stop("DBSCAN: Stopping due to invalid inputs: Epsilon (eps) should be greater than 0"); }
+
+    UNASSIGNED = 0;
+
+    num_records = nrow(X);
+    num_features = ncol(X);
+
+    neighbors = dist(X);
+
+    #find same pts and set their distance to the smallest double representation
+    neighbors = replace(target = neighbors, pattern = 0, replacement = 2.225e-307)
+    neighbors = neighbors - diag(diag(neighbors));
+
+    # neighbors within eps
+    withinEps = ((neighbors <= eps) * (0 < neighbors));
+    corePts = rowSums(withinEps) + 1 >= minPts;
+
+    clusterMembers = matrix(UNASSIGNED, num_records, 1);
+
+    if (sum(corePts) != 0) {
+        # leave only density reachable pts
+        neighbors = (neighbors * corePts * withinEps) > 0;
+
+        # border pts of multiple clusters
+        border = neighbors * (t(corePts) == 0 & colSums(neighbors) > 1) * seq(num_records, 1);
+        border = (border - colMaxs(border)) == 0;
+        neighbors = neighbors * border;
+
+        adjacency = (neighbors + t(neighbors)) > 0;
+
+        clusterMembers = components(G=adjacency, verbose=FALSE);
+        # noise to 0
+        clusterMembers = clusterMembers * (rowSums(adjacency) > 0);
+    }
+}
\ No newline at end of file
diff --git a/scripts/builtin/dist.dml b/scripts/builtin/dist.dml
new file mode 100644
index 0000000..9c473d8
--- /dev/null
+++ b/scripts/builtin/dist.dml
@@ -0,0 +1,29 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+# Returns Euclidian distance matrix (distances between N n-dimensional points)
+
+m_dist = function(Matrix[Double] X) return (Matrix[Double] Y) {
+    G = X %*% t(X);
+    I = matrix(1, rows = nrow(G), cols = ncol(G));
+    Y = -2 * (G) + t(I %*% diag(diag(G))) + t(diag(diag(G)) %*% I);
+    Y = sqrt(Y);
+    Y = replace(target = Y, pattern=0/0, replacement = 0);
+}
\ No newline at end of file
diff --git a/src/main/java/org/apache/sysds/common/Builtins.java b/src/main/java/org/apache/sysds/common/Builtins.java
index 1cd430c..0a05d15 100644
--- a/src/main/java/org/apache/sysds/common/Builtins.java
+++ b/src/main/java/org/apache/sysds/common/Builtins.java
@@ -84,9 +84,11 @@ public enum Builtins {
 	CUMSUMPROD("cumsumprod", false),
 	CONFUSIONMATRIX("confusionMatrix", true),
 	COR("cor", true),
+	DBSCAN("dbscan", true),
 	DETECTSCHEMA("detectSchema", false),
 	DIAG("diag", false),
 	DISCOVER_FD("discoverFD", true),
+	DIST("dist", true),
 	DROP_INVALID_TYPE("dropInvalidType", false),
 	DROP_INVALID_LENGTH("dropInvalidLength", false),
 	EIGEN("eigen", false, ReturnType.MULTI_RETURN),
@@ -291,7 +293,7 @@ public enum Builtins {
 	public static boolean contains(String name, boolean script, boolean parameterized) {
 		Builtins tmp = get(name);
 		return tmp != null && script == tmp.isScript()
-			&& parameterized == tmp.isParameterized();
+				&& parameterized == tmp.isParameterized();
 	}
 
 	public static Builtins get(String name) {
diff --git a/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDBSCANTest.java b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDBSCANTest.java
new file mode 100644
index 0000000..401229a
--- /dev/null
+++ b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDBSCANTest.java
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.builtin;
+
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+import org.apache.sysds.common.Types.ExecMode;
+import org.apache.sysds.lops.LopProperties.ExecType;
+import org.apache.sysds.runtime.matrix.data.MatrixValue.CellIndex;
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.apache.sysds.test.TestUtils;
+import org.junit.Test;
+import java.util.HashMap;
+
+public class BuiltinDBSCANTest extends AutomatedTestBase
+{
+	private final static String TEST_NAME = "dbscan";
+	private final static String TEST_DIR = "functions/builtin/";
+	private static final String TEST_CLASS_DIR = TEST_DIR + BuiltinDBSCANTest.class.getSimpleName() + "/";
+
+	private final static double eps = 1e-3;
+	private final static int rows = 1700;
+	private final static double spDense = 0.99;
+
+	private final static double epsDBSCAN = 1;
+	private final static int minPts = 5;
+
+	@Override
+	public void setUp() { addTestConfiguration(TEST_NAME,new TestConfiguration(TEST_CLASS_DIR, TEST_NAME,new String[]{"B"})); }
+
+	@Test
+	public void testDBSCANDefaultCP() { runDBSCAN(true, ExecType.CP); }
+
+	@Test
+	public void testDBSCANDefaultSP() { runDBSCAN(true, ExecType.SPARK); }
+
+	private void runDBSCAN(boolean defaultProb, ExecType instType)
+	{
+		ExecMode platformOld = setExecMode(instType);
+
+		try
+		{
+			loadTestConfiguration(getTestConfiguration(TEST_NAME));
+			String HOME = SCRIPT_DIR + TEST_DIR;
+
+			fullDMLScriptName = HOME + TEST_NAME + ".dml";
+			programArgs = new String[]{"-nvargs", "X=" + input("A"), "Y=" + output("B"), "eps=" + epsDBSCAN, "minPts=" + minPts};
+			fullRScriptName = HOME + TEST_NAME + ".R";
+			rCmd = getRCmd(inputDir(), Double.toString(epsDBSCAN), Integer.toString(minPts), expectedDir());
+
+			//generate actual dataset
+			double[][] A = getNonZeroRandomMatrix(rows, 3, -10, 10, 7);
+			writeInputMatrixWithMTD("A", A, true);
+
+			runTest(true, false, null, -1);
+			runRScript(true);
+
+			//compare matrices
+			HashMap<CellIndex, Double> dmlfile = readDMLMatrixFromHDFS("B");
+			HashMap<CellIndex, Double> rfile  = readRMatrixFromFS("B");
+
+			//map cluster ids
+			//NOTE: border points that are reachable from more than 1 cluster
+			// are assigned to lowest point id, not cluster id -> can fail in this case, but it's still correct
+			BiMap<Double, Double> merged = HashBiMap.create();
+			rfile.forEach((key, value) -> merged.put(value, dmlfile.get(key)));
+			dmlfile.replaceAll((k, v) -> merged.inverse().get(v));
+
+			TestUtils.compareMatrices(dmlfile, rfile, eps, "Stat-DML", "Stat-R");
+		}
+		finally {
+			rtplatform = platformOld;
+		}
+	}
+}
\ No newline at end of file
diff --git a/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDistTest.java b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDistTest.java
new file mode 100644
index 0000000..26970f6
--- /dev/null
+++ b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDistTest.java
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.builtin;
+
+import org.apache.sysds.common.Types.ExecMode;
+import org.apache.sysds.lops.LopProperties.ExecType;
+import org.apache.sysds.runtime.matrix.data.MatrixValue.CellIndex;
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.apache.sysds.test.TestUtils;
+import org.junit.Test;
+
+import java.util.HashMap;
+
+public class BuiltinDistTest extends AutomatedTestBase
+{
+	private final static String TEST_NAME = "dist";
+	private final static String TEST_DIR = "functions/builtin/";
+	private static final String TEST_CLASS_DIR = TEST_DIR + BuiltinDistTest.class.getSimpleName() + "/";
+	
+	private final static double eps = 1e-3;
+	private final static int rows = 1765;
+	private final static double spDense = 0.99;
+	
+	@Override
+	public void setUp() {
+		addTestConfiguration(TEST_NAME,new TestConfiguration(TEST_CLASS_DIR, TEST_NAME,new String[]{"B"})); 
+	}
+
+	@Test
+	public void testDistDefaultCP() { runDist(true, ExecType.CP); }
+	
+	@Test
+	public void testDistSP() {
+		runDist(true, ExecType.SPARK);
+	}
+
+	private void runDist(boolean defaultProb, ExecType instType)
+	{
+		ExecMode platformOld = setExecMode(instType);
+		
+		try
+		{
+			loadTestConfiguration(getTestConfiguration(TEST_NAME));
+			
+			String HOME = SCRIPT_DIR + TEST_DIR;
+			fullDMLScriptName = HOME + TEST_NAME + ".dml";
+			programArgs = new String[]{"-args", input("A"), output("B") };
+			fullRScriptName = HOME + TEST_NAME + ".R";
+			rCmd = "Rscript" + " " + fullRScriptName + " " + inputDir() + " " + expectedDir();
+			
+			//generate actual dataset 
+			double[][] A = getRandomMatrix(rows, 10, -1, 1, spDense, 7);
+			writeInputMatrixWithMTD("A", A, true);
+			
+			runTest(true, false, null, -1);
+			runRScript(true);
+			
+			//compare matrices
+			HashMap<CellIndex, Double> dmlfile = readDMLMatrixFromHDFS("B");
+			HashMap<CellIndex, Double> rfile  = readRMatrixFromFS("B");
+			TestUtils.compareMatrices(dmlfile, rfile, eps, "Stat-DML", "Stat-R");
+		}
+		finally {
+			rtplatform = platformOld;
+		}
+	}
+}
diff --git a/src/test/scripts/functions/builtin/dbscan.R b/src/test/scripts/functions/builtin/dbscan.R
new file mode 100644
index 0000000..e332528
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dbscan.R
@@ -0,0 +1,31 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+args<-commandArgs(TRUE)
+library("Matrix")
+options(digits=22)
+library("dbscan")
+
+X = as.matrix(readMM(paste(args[1], "A.mtx", sep="")));
+eps = as.double(args[2]);
+minPts = as.integer(args[3]);
+Ys = dbscan(X, eps, minPts);
+Y = as.matrix(Ys$cluster, FALSE);
+writeMM(as(Y, "CsparseMatrix"), paste(args[4], "B", sep=""));
\ No newline at end of file
diff --git a/src/test/scripts/functions/builtin/dbscan.dml b/src/test/scripts/functions/builtin/dbscan.dml
new file mode 100644
index 0000000..6d5e1eb
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dbscan.dml
@@ -0,0 +1,26 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($X);
+eps = as.double($eps);
+minPts = as.integer($minPts);
+Y = dbscan(X, eps, minPts);
+write(Y, $Y);
\ No newline at end of file
diff --git a/src/test/scripts/functions/builtin/dist.R b/src/test/scripts/functions/builtin/dist.R
new file mode 100644
index 0000000..dde54f4
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dist.R
@@ -0,0 +1,28 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+args<-commandArgs(TRUE)
+options(digits=22)
+library("Matrix")
+
+X = as.matrix(readMM(paste(args[1], "A.mtx", sep="")));
+R = round(as.matrix(dist(X)), 3);
+diag(R) = 0;
+writeMM(as(R, "CsparseMatrix"), paste(args[2], "B", sep=""));
\ No newline at end of file
diff --git a/src/test/scripts/functions/builtin/dist.dml b/src/test/scripts/functions/builtin/dist.dml
new file mode 100644
index 0000000..9039a97
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dist.dml
@@ -0,0 +1,24 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($1);
+Y = dist(X);
+write(Y, $2);