Posted to commits@systemds.apache.org by ba...@apache.org on 2020/08/15 12:58:11 UTC
[systemds] branch master updated: [SYSTEMDS-2621] DBScan and dist builtin
This is an automated email from the ASF dual-hosted git repository.
baunsgaard pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/systemds.git
The following commit(s) were added to refs/heads/master by this push:
new fc0e210 [SYSTEMDS-2621] DBScan and dist builtin
fc0e210 is described below
commit fc0e21059e06d1928020657b3882e986363f7fda
Author: Olga Ovcharenko <ol...@student.tugraz.at>
AuthorDate: Sat Aug 15 14:56:44 2020 +0200
[SYSTEMDS-2621] DBScan and dist builtin
- Implemented dist for euclidean distance matrix.
- Implemented DBSCAN
- Added R dbscan package
- Fixed error for similar points and all noise points
- Fixed DBSCAN
- Updated DBSCAN and added documentation
- Added tests for dist and DBScan.
Closes #1003.
---
docs/site/builtins-reference.md | 61 ++++++++++++++
scripts/builtin/dbscan.dml | 72 +++++++++++++++++
scripts/builtin/dist.dml | 29 +++++++
.../java/org/apache/sysds/common/Builtins.java | 4 +-
.../test/functions/builtin/BuiltinDBSCANTest.java | 93 ++++++++++++++++++++++
.../test/functions/builtin/BuiltinDistTest.java | 85 ++++++++++++++++++++
src/test/scripts/functions/builtin/dbscan.R | 31 ++++++++
src/test/scripts/functions/builtin/dbscan.dml | 26 ++++++
src/test/scripts/functions/builtin/dist.R | 28 +++++++
src/test/scripts/functions/builtin/dist.dml | 24 ++++++
10 files changed, 452 insertions(+), 1 deletion(-)
diff --git a/docs/site/builtins-reference.md b/docs/site/builtins-reference.md
index 0ad39c4..6d772f3 100644
--- a/docs/site/builtins-reference.md
+++ b/docs/site/builtins-reference.md
@@ -29,7 +29,9 @@ limitations under the License.
* [DML-Bodied Built-In functions](#dml-bodied-built-in-functions)
* [`confusionMatrix`-Function](#confusionmatrix-function)
* [`cvlm`-Function](#cvlm-function)
+ * [`DBSCAN`-Function](#DBSCAN-function)
* [`discoverFD`-Function](#discoverFD-function)
+ * [`dist`-Function](#dist-function)
* [`glm`-Function](#glm-function)
* [`gridSearch`-Function](#gridSearch-function)
* [`hyperband`-Function](#hyperband-function)
@@ -212,6 +214,37 @@ y = X %*% rand(rows = ncol(X), cols = 1)
[predict, beta] = cvlm(X = X, y = y, k = 4)
```
+## `DBSCAN`-Function
+
+The `dbscan`-function implements the DBSCAN clustering algorithm using Euclidean distance.
+
+### Usage
+
+```r
+Y = dbscan(X = X, eps = 2.5, minPts = 5)
+```
+
+### Arguments
+
+| Name | Type | Default | Description |
+| :--------- | :-------------- | :--------- | :---------- |
+| X | Matrix[Double] | required | The input Matrix to do DBSCAN on. |
+| eps | Double | `0.5` | Maximum distance between two points for one to be considered reachable for the other. |
+| minPts | Int | `5` | Number of points in a neighborhood for a point to be considered as a core point (includes the point itself). |
+
+### Returns
+
+| Type | Description |
+| :-----------| :---------- |
+| Matrix[Double] | The mapping of records to clusters (one cluster id per record; `0` marks noise) |
+
+### Example
+
+```r
+X = rand(rows=1780, cols=180, min=1, max=20)
+Y = dbscan(X = X, eps = 2.5, minPts = 360)
+```
+
## `discoverFD`-Function
The `discoverFD`-function finds the functional dependencies.
@@ -236,6 +269,34 @@ discoverFD(X, Mask, threshold)
| :----- | :---------- |
| Double | matrix of functional dependencies |
+## `dist`-Function
+
+The `dist`-function computes the pairwise Euclidean distances between n d-dimensional points.
+
+### Usage
+
+```r
+dist(X)
+```
+
+### Arguments
+
+| Name | Type | Default | Description |
+| :--- | :------------- | :------- | :---------- |
+| X | Matrix[Double] | required | (n x d) matrix of d-dimensional points |
+
+### Returns
+
+| Type | Description |
+| :------------- | :---------- |
+| Matrix[Double] | (n x n) symmetric matrix of Euclidean distances |
+
+### Example
+
+```r
+X = rand(rows = 5, cols = 5)
+Y = dist(X)
+```
## `glm`-Function
diff --git a/scripts/builtin/dbscan.dml b/scripts/builtin/dbscan.dml
new file mode 100644
index 0000000..74d4040
--- /dev/null
+++ b/scripts/builtin/dbscan.dml
@@ -0,0 +1,72 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+#
+# Implements the DBSCAN clustering algorithm using a Euclidean distance matrix
+#
+# INPUT PARAMETERS:
+# ----------------------------------------------------------------------------
+# NAME TYPE DEFAULT MEANING
+# ----------------------------------------------------------------------------
+# X Matrix[Double] --- The input Matrix to do DBSCAN on.
+# eps Double 0.5 Maximum distance between two points for one to be considered reachable for the other.
+# minPts Int 5 Number of points in a neighborhood for a point to be considered as a core point (includes the point itself).
+#
+
+m_dbscan = function (Matrix[double] X, Double eps = 0.5, Integer minPts = 5)
+ return (Matrix[double] clusterMembers)
+{
+ #check input parameter assertions
+ if(minPts < 0) { stop("DBSCAN: Stopping due to invalid inputs: minPts should not be negative"); }
+ if(eps < 0) { stop("DBSCAN: Stopping due to invalid inputs: Epsilon (eps) should not be negative"); }
+
+ UNASSIGNED = 0;
+
+ num_records = nrow(X);
+ num_features = ncol(X);
+
+ neighbors = dist(X);
+
+ #find identical pts and set their distance to a tiny positive value so they still count as neighbors
+ neighbors = replace(target = neighbors, pattern = 0, replacement = 2.225e-307)
+ neighbors = neighbors - diag(diag(neighbors));
+
+ # neighbors within eps
+ withinEps = ((neighbors <= eps) * (0 < neighbors));
+ corePts = rowSums(withinEps) + 1 >= minPts;
+
+ clusterMembers = matrix(UNASSIGNED, num_records, 1);
+
+ if (sum(corePts) != 0) {
+ # leave only density reachable pts
+ neighbors = (neighbors * corePts * withinEps) > 0;
+
+ # border pts of multiple clusters
+ border = neighbors * (t(corePts) == 0 & colSums(neighbors) > 1) * seq(num_records, 1);
+ border = (border - colMaxs(border)) == 0;
+ neighbors = neighbors * border;
+
+ adjacency = (neighbors + t(neighbors)) > 0;
+
+ clusterMembers = components(G=adjacency, verbose=FALSE);
+ # noise to 0
+ clusterMembers = clusterMembers * (rowSums(adjacency) > 0);
+ }
+}
\ No newline at end of file
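Note on the script above: the core-point rule and the handling of noise can be illustrated outside DML. What follows is a minimal Java sketch under stated assumptions, not the implementation from this commit. The class `DbscanSketch` and its methods are hypothetical names. It replaces the matrix operations and `components` with explicit loops and a union-find, numbers clusters 1..k in order of first appearance, and attaches a border point to the first core neighbor it finds, so cluster ids and border assignments can differ from what `dbscan.dml` returns.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of the SystemDS code base.
final class DbscanSketch {
    // Returns one label per row of x: 0 = noise, 1..k = cluster id.
    static int[] dbscan(double[][] x, double eps, int minPts) {
        int n = x.length;
        boolean[][] within = new boolean[n][n];
        boolean[] core = new boolean[n];
        for (int i = 0; i < n; i++) {
            int count = 1; // the point itself counts toward minPts
            for (int j = 0; j < n; j++) {
                if (i != j && euclidean(x[i], x[j]) <= eps) {
                    within[i][j] = true;
                    count++;
                }
            }
            core[i] = count >= minPts;
        }

        // Union-find: core points within eps of each other share a cluster.
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (core[i] && core[j] && within[i][j])
                    parent[find(parent, i)] = find(parent, j);

        // Number the clusters of core points in order of first appearance.
        int[] labels = new int[n];
        int[] idOfRoot = new int[n];
        int next = 0;
        for (int i = 0; i < n; i++) {
            if (!core[i]) continue;
            int root = find(parent, i);
            if (idOfRoot[root] == 0) idOfRoot[root] = ++next;
            labels[i] = idOfRoot[root];
        }

        // Border points join the first core neighbor found (arbitrary tie-break);
        // points with no core neighbor keep label 0 and are noise.
        for (int i = 0; i < n; i++) {
            if (core[i]) continue;
            for (int j = 0; j < n && labels[i] == 0; j++)
                if (core[j] && within[i][j]) labels[i] = labels[j];
        }
        return labels;
    }

    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]]; // path halving
            i = parent[i];
        }
        return i;
    }

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] x = {
            {0, 0}, {0, 1}, {1, 0}, {1, 1},          // dense group
            {10, 10}, {10, 11}, {11, 10}, {11, 11},  // second dense group
            {50, 50}                                  // isolated point
        };
        System.out.println(Arrays.toString(dbscan(x, 1.5, 3)));
    }
}
```

Because the loops skip only `i == j`, duplicate points count as neighbors of each other without the small-distance replacement the DML script needs.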
diff --git a/scripts/builtin/dist.dml b/scripts/builtin/dist.dml
new file mode 100644
index 0000000..9c473d8
--- /dev/null
+++ b/scripts/builtin/dist.dml
@@ -0,0 +1,29 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+# Returns the Euclidean distance matrix (distances between n d-dimensional points)
+
+m_dist = function(Matrix[Double] X) return (Matrix[Double] Y) {
+ G = X %*% t(X);
+ I = matrix(1, rows = nrow(G), cols = ncol(G));
+ Y = -2 * (G) + t(I %*% diag(diag(G))) + t(diag(diag(G)) %*% I);
+ Y = sqrt(Y);
+ Y = replace(target = Y, pattern=0/0, replacement = 0);
+}
\ No newline at end of file
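Note on the script above: `dist.dml` relies on the identity d(i,j)^2 = G[i][i] + G[j][j] - 2 * G[i][j], where G = X %*% t(X) is the Gram matrix. The Java sketch below spells that identity out with plain loops. The class `DistSketch` is a hypothetical name for illustration only, and clamping at zero stands in for the script's NaN replacement, on the assumption that NaN arises only from slightly negative rounding results under the square root.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of the SystemDS code base.
final class DistSketch {
    // Pairwise Euclidean distances via the Gram matrix G = X * X^T.
    static double[][] dist(double[][] x) {
        int n = x.length;
        double[][] g = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0;
                for (int k = 0; k < x[i].length; k++) s += x[i][k] * x[j][k];
                g[i][j] = s; // inner product of rows i and j
            }

        double[][] y = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sq = g[i][i] + g[j][j] - 2 * g[i][j];
                // Rounding can leave sq slightly below zero; clamp before sqrt.
                y[i][j] = Math.sqrt(Math.max(sq, 0.0));
            }
        return y;
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {3, 4}, {6, 8}};
        System.out.println(Arrays.deepToString(dist(x)));
    }
}
```

The Gram-matrix form trades a little numerical accuracy for a single matrix multiplication, which is why the result needs the clamp (or, in the DML script, the NaN replacement).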
diff --git a/src/main/java/org/apache/sysds/common/Builtins.java b/src/main/java/org/apache/sysds/common/Builtins.java
index 1cd430c..0a05d15 100644
--- a/src/main/java/org/apache/sysds/common/Builtins.java
+++ b/src/main/java/org/apache/sysds/common/Builtins.java
@@ -84,9 +84,11 @@ public enum Builtins {
CUMSUMPROD("cumsumprod", false),
CONFUSIONMATRIX("confusionMatrix", true),
COR("cor", true),
+ DBSCAN("dbscan", true),
DETECTSCHEMA("detectSchema", false),
DIAG("diag", false),
DISCOVER_FD("discoverFD", true),
+ DIST("dist", true),
DROP_INVALID_TYPE("dropInvalidType", false),
DROP_INVALID_LENGTH("dropInvalidLength", false),
EIGEN("eigen", false, ReturnType.MULTI_RETURN),
@@ -291,7 +293,7 @@ public enum Builtins {
public static boolean contains(String name, boolean script, boolean parameterized) {
Builtins tmp = get(name);
return tmp != null && script == tmp.isScript()
- && parameterized == tmp.isParameterized();
+ && parameterized == tmp.isParameterized();
}
public static Builtins get(String name) {
diff --git a/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDBSCANTest.java b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDBSCANTest.java
new file mode 100644
index 0000000..401229a
--- /dev/null
+++ b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDBSCANTest.java
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.builtin;
+
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+import org.apache.sysds.common.Types.ExecMode;
+import org.apache.sysds.lops.LopProperties.ExecType;
+import org.apache.sysds.runtime.matrix.data.MatrixValue.CellIndex;
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.apache.sysds.test.TestUtils;
+import org.junit.Test;
+import java.util.HashMap;
+
+public class BuiltinDBSCANTest extends AutomatedTestBase
+{
+ private final static String TEST_NAME = "dbscan";
+ private final static String TEST_DIR = "functions/builtin/";
+ private static final String TEST_CLASS_DIR = TEST_DIR + BuiltinDBSCANTest.class.getSimpleName() + "/";
+
+ private final static double eps = 1e-3;
+ private final static int rows = 1700;
+ private final static double spDense = 0.99;
+
+ private final static double epsDBSCAN = 1;
+ private final static int minPts = 5;
+
+ @Override
+ public void setUp() { addTestConfiguration(TEST_NAME,new TestConfiguration(TEST_CLASS_DIR, TEST_NAME,new String[]{"B"})); }
+
+ @Test
+ public void testDBSCANDefaultCP() { runDBSCAN(true, ExecType.CP); }
+
+ @Test
+ public void testDBSCANDefaultSP() { runDBSCAN(true, ExecType.SPARK); }
+
+ private void runDBSCAN(boolean defaultProb, ExecType instType)
+ {
+ ExecMode platformOld = setExecMode(instType);
+
+ try
+ {
+ loadTestConfiguration(getTestConfiguration(TEST_NAME));
+ String HOME = SCRIPT_DIR + TEST_DIR;
+
+ fullDMLScriptName = HOME + TEST_NAME + ".dml";
+ programArgs = new String[]{"-nvargs", "X=" + input("A"), "Y=" + output("B"), "eps=" + epsDBSCAN, "minPts=" + minPts};
+ fullRScriptName = HOME + TEST_NAME + ".R";
+ rCmd = getRCmd(inputDir(), Double.toString(epsDBSCAN), Integer.toString(minPts), expectedDir());
+
+ //generate actual dataset
+ double[][] A = getNonZeroRandomMatrix(rows, 3, -10, 10, 7);
+ writeInputMatrixWithMTD("A", A, true);
+
+ runTest(true, false, null, -1);
+ runRScript(true);
+
+ //compare matrices
+ HashMap<CellIndex, Double> dmlfile = readDMLMatrixFromHDFS("B");
+ HashMap<CellIndex, Double> rfile = readRMatrixFromFS("B");
+
+ //map cluster ids
+ //NOTE: border points that are reachable from more than 1 cluster
+ // are assigned to lowest point id, not cluster id -> can fail in this case, but it's still correct
+ BiMap<Double, Double> merged = HashBiMap.create();
+ rfile.forEach((key, value) -> merged.put(value, dmlfile.get(key)));
+ dmlfile.replaceAll((k, v) -> merged.inverse().get(v));
+
+ TestUtils.compareMatrices(dmlfile, rfile, eps, "Stat-DML", "Stat-R");
+ }
+ finally {
+ rtplatform = platformOld;
+ }
+ }
+}
\ No newline at end of file
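Note on the test above: DML and R may number the same clusters differently, so the test first maps one set of ids onto the other before comparing. The sketch below shows the same idea without Guava. It is a hedged illustration: `LabelMatchSketch` and `samePartition` are hypothetical names, and the method reports a mismatch by returning `false` rather than rewriting the labels as the test does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not part of the SystemDS test suite.
final class LabelMatchSketch {
    // True if a one-to-one renaming of labels turns a into b,
    // i.e. both arrays describe the same partition of the records.
    static boolean samePartition(int[] a, int[] b) {
        if (a.length != b.length) return false;
        Map<Integer, Integer> aToB = new HashMap<>();
        Map<Integer, Integer> bToA = new HashMap<>();
        for (int i = 0; i < a.length; i++) {
            // putIfAbsent returns the existing mapping, or null if none existed.
            Integer prevB = aToB.putIfAbsent(a[i], b[i]);
            Integer prevA = bToA.putIfAbsent(b[i], a[i]);
            if (prevB != null && prevB != b[i]) return false; // a-label split across b-labels
            if (prevA != null && prevA != a[i]) return false; // b-label split across a-labels
        }
        return true;
    }

    public static void main(String[] args) {
        int[] dml = {1, 1, 2, 0};
        int[] r = {2, 2, 1, 0};
        System.out.println(samePartition(dml, r));
    }
}
```

Keeping both directions of the mapping is what makes the renaming one-to-one; a single map would accept two different clusters collapsed into one.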
diff --git a/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDistTest.java b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDistTest.java
new file mode 100644
index 0000000..26970f6
--- /dev/null
+++ b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDistTest.java
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.builtin;
+
+import org.apache.sysds.common.Types.ExecMode;
+import org.apache.sysds.lops.LopProperties.ExecType;
+import org.apache.sysds.runtime.matrix.data.MatrixValue.CellIndex;
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.apache.sysds.test.TestUtils;
+import org.junit.Test;
+
+import java.util.HashMap;
+
+public class BuiltinDistTest extends AutomatedTestBase
+{
+ private final static String TEST_NAME = "dist";
+ private final static String TEST_DIR = "functions/builtin/";
+ private static final String TEST_CLASS_DIR = TEST_DIR + BuiltinDistTest.class.getSimpleName() + "/";
+
+ private final static double eps = 1e-3;
+ private final static int rows = 1765;
+ private final static double spDense = 0.99;
+
+ @Override
+ public void setUp() {
+ addTestConfiguration(TEST_NAME,new TestConfiguration(TEST_CLASS_DIR, TEST_NAME,new String[]{"B"}));
+ }
+
+ @Test
+ public void testDistDefaultCP() { runDist(true, ExecType.CP); }
+
+ @Test
+ public void testDistSP() {
+ runDist(true, ExecType.SPARK);
+ }
+
+ private void runDist(boolean defaultProb, ExecType instType)
+ {
+ ExecMode platformOld = setExecMode(instType);
+
+ try
+ {
+ loadTestConfiguration(getTestConfiguration(TEST_NAME));
+
+ String HOME = SCRIPT_DIR + TEST_DIR;
+ fullDMLScriptName = HOME + TEST_NAME + ".dml";
+ programArgs = new String[]{"-args", input("A"), output("B") };
+ fullRScriptName = HOME + TEST_NAME + ".R";
+ rCmd = "Rscript" + " " + fullRScriptName + " " + inputDir() + " " + expectedDir();
+
+ //generate actual dataset
+ double[][] A = getRandomMatrix(rows, 10, -1, 1, spDense, 7);
+ writeInputMatrixWithMTD("A", A, true);
+
+ runTest(true, false, null, -1);
+ runRScript(true);
+
+ //compare matrices
+ HashMap<CellIndex, Double> dmlfile = readDMLMatrixFromHDFS("B");
+ HashMap<CellIndex, Double> rfile = readRMatrixFromFS("B");
+ TestUtils.compareMatrices(dmlfile, rfile, eps, "Stat-DML", "Stat-R");
+ }
+ finally {
+ rtplatform = platformOld;
+ }
+ }
+}
diff --git a/src/test/scripts/functions/builtin/dbscan.R b/src/test/scripts/functions/builtin/dbscan.R
new file mode 100644
index 0000000..e332528
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dbscan.R
@@ -0,0 +1,31 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+args<-commandArgs(TRUE)
+library("Matrix")
+options(digits=22)
+library("dbscan")
+
+X = as.matrix(readMM(paste(args[1], "A.mtx", sep="")));
+eps = as.double(args[2]);
+minPts = as.integer(args[3]);
+Ys = dbscan(X, eps, minPts);
+Y = as.matrix(Ys$cluster, FALSE);
+writeMM(as(Y, "CsparseMatrix"), paste(args[4], "B", sep=""));
\ No newline at end of file
diff --git a/src/test/scripts/functions/builtin/dbscan.dml b/src/test/scripts/functions/builtin/dbscan.dml
new file mode 100644
index 0000000..6d5e1eb
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dbscan.dml
@@ -0,0 +1,26 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($X);
+eps = as.double($eps);
+minPts = as.integer($minPts);
+Y = dbscan(X, eps, minPts);
+write(Y, $Y);
\ No newline at end of file
diff --git a/src/test/scripts/functions/builtin/dist.R b/src/test/scripts/functions/builtin/dist.R
new file mode 100644
index 0000000..dde54f4
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dist.R
@@ -0,0 +1,28 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+args<-commandArgs(TRUE)
+options(digits=22)
+library("Matrix")
+
+X = as.matrix(readMM(paste(args[1], "A.mtx", sep="")));
+R = round(as.matrix(dist(X)), 3);
+diag(R) = 0;
+writeMM(as(R, "CsparseMatrix"), paste(args[2], "B", sep=""));
\ No newline at end of file
diff --git a/src/test/scripts/functions/builtin/dist.dml b/src/test/scripts/functions/builtin/dist.dml
new file mode 100644
index 0000000..9039a97
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dist.dml
@@ -0,0 +1,24 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($1);
+Y = dist(X);
+write(Y, $2);