Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2020/07/27 01:02:54 UTC

[GitHub] [systemds] OlgaOvcharenko opened a new pull request #1003: Density-Based Clustering (DBSCAN)

OlgaOvcharenko opened a new pull request #1003:
URL: https://github.com/apache/systemds/pull/1003


   Implemented DBSCAN and EDM (Euclidean Distance Matrix) to compute distances between data points.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] j143 commented on pull request #1003: Density-Based Clustering (DBSCAN)

Posted by GitBox <gi...@apache.org>.
j143 commented on pull request #1003:
URL: https://github.com/apache/systemds/pull/1003#issuecomment-669697363


   Hi @OlgaOvcharenko - Thanks a lot for working on DBSCAN.
   
   Is it possible to add documentation at [this place](https://github.com/apache/systemds/blob/master/docs/site/builtins-reference.md#discoverFD-function)?
   
   **Info that can be added:**
   - [ ] A description of what this function does, plus any further information you would like to add
   - [ ] Usage
   - [ ] Example
   
   Thank you.
   





[GitHub] [systemds] Baunsgaard commented on a change in pull request #1003: Density-Based Clustering (DBSCAN)

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #1003:
URL: https://github.com/apache/systemds/pull/1003#discussion_r470961287



##########
File path: scripts/builtin/dbscan.dml
##########
@@ -0,0 +1,71 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+#
+# Implements the DBSCAN clustering algorithm using Euclidian distance matrix
+#
+# INPUT PARAMETERS:
+# ----------------------------------------------------------------------------
+# NAME  TYPE   DEFAULT  MEANING
+# ----------------------------------------------------------------------------
+# X     String   ---    Location to read matrix X with the input data records
+# eps   Double   ---    Radius for core points
+# minPts Int     ---    Minimum number of points within eps
+#
+
+m_dbscan = function (Matrix[double] X, Double eps = 0.5, Integer minPts = 5)
+    return (Matrix[double] clusterMembers)
+{
+    assert(eps > 0);
+    assert(minPts > 0);

Review comment:
       I like that you have some assertions,
   but they could be improved with error messages.
   
   I would suggest taking a look at `/scripts/builtin/l2svm.dml`
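
   As a language-neutral illustration of the suggestion above, here is a minimal Python sketch of parameter checks that fail with a descriptive message instead of a bare assertion. The function name `validate_dbscan_params` and the message wording are assumptions made for this example; they are not taken from the pull request or from `l2svm.dml`.

```python
def validate_dbscan_params(eps, min_pts):
    """Reject invalid DBSCAN parameters with a descriptive error message.

    Hypothetical helper: the name and the messages are illustrative only.
    """
    if eps <= 0:
        raise ValueError(f"dbscan: eps must be > 0, but got {eps}")
    if min_pts <= 0:
        raise ValueError(f"dbscan: minPts must be > 0, but got {min_pts}")


validate_dbscan_params(0.5, 5)  # valid input passes silently

try:
    validate_dbscan_params(-1.0, 5)
except ValueError as err:
    message = str(err)  # "dbscan: eps must be > 0, but got -1.0"
```

   The point of the message is that a caller sees which parameter was rejected and what value it had, which a plain `assert(eps > 0)` does not report.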

##########
File path: scripts/builtin/dbscan.dml
##########
@@ -0,0 +1,71 @@
+#
+# Implements the DBSCAN clustering algorithm using Euclidian distance matrix
+#
+# INPUT PARAMETERS:
+# ----------------------------------------------------------------------------
+# NAME  TYPE   DEFAULT  MEANING
+# ----------------------------------------------------------------------------
+# X     String   ---    Location to read matrix X with the input data records
+# eps   Double   ---    Radius for core points
+# minPts Int     ---    Minimum number of points within eps

Review comment:
       This documentation does not correspond to the documentation in the builtin reference.
   I would like the addition of:
   
   - `eps`: Maximum distance between two points for one to be considered reachable from the other.
   - `minPts`: Number of points in a neighborhood for a point to be considered a core point (includes the point itself).
   
   Also, the default values are missing (though they could be removed in this specific case).
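
   To make the two parameter meanings concrete, here is a minimal Python sketch of DBSCAN under those definitions. It is not the DML implementation from this pull request: the function names, the `<= eps` comparison, and the use of label 0 for unassigned points are assumptions made for the illustration.

```python
import math

NOISE = 0  # label for points in no cluster (assumed to match UNASSIGNED = 0)


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unassigned points."""
    n = len(points)
    labels = [NOISE] * n
    neighbors = [region_query(points, i, eps) for i in range(n)]
    # A core point has at least min_pts points in its neighborhood,
    # counting the point itself.
    is_core = [len(nb) >= min_pts for nb in neighbors]
    cluster = 0
    for i in range(n):
        if labels[i] != NOISE or not is_core[i]:
            continue
        cluster += 1
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    frontier.append(q)
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=2))  # [1, 1, 1, 2, 2, 0]
print(dbscan(points, eps=1.5, min_pts=3))  # [1, 1, 1, 0, 0, 0]
```

   The second call shows why "includes the point itself" matters: the pair at (10, 10) and (10, 11) has a neighborhood of size 2, so it forms a cluster with `min_pts=2` but becomes noise with `min_pts=3`.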

##########
File path: scripts/builtin/dbscan.dml
##########
@@ -0,0 +1,71 @@
+#
+# Implements the DBSCAN clustering algorithm using Euclidian distance matrix
+#
+# INPUT PARAMETERS:
+# ----------------------------------------------------------------------------
+# NAME  TYPE   DEFAULT  MEANING
+# ----------------------------------------------------------------------------
+# X     String   ---    Location to read matrix X with the input data records
+# eps   Double   ---    Radius for core points
+# minPts Int     ---    Minimum number of points within eps
+#
+
+m_dbscan = function (Matrix[double] X, Double eps = 0.5, Integer minPts = 5)
+    return (Matrix[double] clusterMembers)
+{
+    assert(eps > 0);
+    assert(minPts > 0);
+
+    UNASSIGNED = 0;
+
+    num_records = nrow(X);
+    num_features = ncol(X);
+
+    neighbors = edm(X);
+
+    #find same pts and set their distance to the smallest double representation
+    neighbors = replace(target=neighbors, pattern=0, replacement=2.225e-307)

Review comment:
       Syntax: insert a space before and after each operator.
   
   This makes the code more readable for me, but it is up to you whether you want to change it.
   
   `neighbors = replace(target = neighbors, pattern = 0, replacement = 2.225e-307)`
   

##########
File path: docs/site/builtins-reference.md
##########
@@ -205,6 +206,33 @@ y = X %*% rand(rows = ncol(X), cols = 1)
 [predict, beta] = cvlm(X = X, y = y, k = 4)
 ```
 
+## `DBSCAN`-Function
+
+The dbscan() implements the DBSCAN Clustering algorithm using Euclidian distance.
+
+### Usage
+```r
+Y = dbscan(X = X, eps = 2.5, minPts = 5)
+```
+
+### Arguments
+| Name       | Type            | Default    | Description |

Review comment:
       To make the format work on our website, we need a newline after every heading.
   This was probably missed because the other examples were changed after you started this work.

##########
File path: scripts/builtin/edm.dml
##########
@@ -0,0 +1,29 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+# Returns Euclidian distance matrix (distances between N n-dimensional points)
+
+m_edm = function(Matrix[Double] X) return (Matrix[Double] Y) {

Review comment:
       We might consider changing the name to `dist`,
   and also adding tasks for extending the functionality to cover the same as R.
   
   <https://stat.ethz.ch/R-manual/R-devel/library/stats/html/dist.html>
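
   For readers unfamiliar with what such a function returns, here is a minimal Python sketch of a Euclidean distance matrix over the rows of `X`. It uses the Gram-matrix identity, which is one common way to compute it; the body of `edm.dml` is not shown in this thread, so this should not be read as the pull request's implementation. The name `dist` simply follows the renaming proposed above.

```python
import math


def dist(X):
    """Pairwise Euclidean distances between the rows of X (a list of lists).

    Illustrative sketch only; covers the Euclidean case, not the other
    methods offered by R's dist.
    """
    n = len(X)
    # Gram matrix: G[i][j] is the dot product of rows i and j.
    G = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)]
         for i in range(n)]
    # ||x_i - x_j||^2 = G[i][i] - 2 * G[i][j] + G[j][j].
    # Clamp at 0 so floating-point rounding cannot produce sqrt of a negative.
    return [[math.sqrt(max(G[i][i] - 2.0 * G[i][j] + G[j][j], 0.0))
             for j in range(n)]
            for i in range(n)]


D = dist([[0, 0], [3, 4], [6, 8]])
print(D[0][1], D[0][2], D[1][2])  # 5.0 10.0 5.0
```

   The result is a symmetric N x N matrix with zeros on the diagonal, which matches what `as.matrix(dist(X))` produces in R for the default Euclidean method.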

##########
File path: src/test/scripts/functions/builtin/edm.R
##########
@@ -0,0 +1,28 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+args<-commandArgs(TRUE)
+options(digits=22)
+library("Matrix")
+
+X = as.matrix(readMM(paste(args[1], "A.mtx", sep="")));
+R = round(as.matrix(dist(X)), 3);

Review comment:
       I see that you use the `dist` function from R here.
   Then we really need to change the function name of our new builtin.







[GitHub] [systemds] OlgaOvcharenko commented on pull request #1003: Density-Based Clustering (DBSCAN)

Posted by GitBox <gi...@apache.org>.
OlgaOvcharenko commented on pull request #1003:
URL: https://github.com/apache/systemds/pull/1003#issuecomment-674389413


   @Baunsgaard Could you have another look at the modifications? I'd be glad to receive your feedback.





[GitHub] [systemds] Baunsgaard commented on pull request #1003: Density-Based Clustering (DBSCAN)

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1003:
URL: https://github.com/apache/systemds/pull/1003#issuecomment-674379652


   Also, if possible, another entry for `dist` in builtins-reference.md would be nice.





[GitHub] [systemds] j143 edited a comment on pull request #1003: Density-Based Clustering (DBSCAN)

Posted by GitBox <gi...@apache.org>.
j143 edited a comment on pull request #1003:
URL: https://github.com/apache/systemds/pull/1003#issuecomment-669697363


   Hi @OlgaOvcharenko - Thanks a lot for working on DBSCAN.
   
   Is it possible to add documentation at [this place](https://github.com/apache/systemds/blob/master/docs/site/builtins-reference.md#discoverFD-function)?
   
   **Info that can be added:**
   - [x] A description of what this function does, plus any further information you would like to add
   - [x] Usage
   - [x] Example
   
   Thank you.
   





[GitHub] [systemds] Baunsgaard closed pull request #1003: Density-Based Clustering (DBSCAN)

Posted by GitBox <gi...@apache.org>.
Baunsgaard closed pull request #1003:
URL: https://github.com/apache/systemds/pull/1003


   





[GitHub] [systemds] j143 edited a comment on pull request #1003: Density-Based Clustering (DBSCAN)

Posted by GitBox <gi...@apache.org>.
j143 edited a comment on pull request #1003:
URL: https://github.com/apache/systemds/pull/1003#issuecomment-669697363





