You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Muhammad-Ali A'rabi (JIRA)" <ji...@apache.org> on 2015/01/15 20:33:36 UTC

[jira] [Comment Edited] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

    [ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279170#comment-14279170 ] 

Muhammad-Ali A'rabi edited comment on SPARK-5226 at 1/15/15 7:33 PM:
---------------------------------------------------------------------

This is DBSCAN algorithm:

{noformat}
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)
          
expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
      if P' is not visited
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      if P' is not yet member of any cluster
         add P' to cluster C
          
regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
{noformat}

As you can see, there are just two parameters. There is two ways of implementation. First one is faster (O(n log n), and requires more memory (O(n^2)). The other way is slower (O(n^2)) and requires less memory (O(‌n)). But I prefer the first one, as we are not short one memory.
There are two phases of running:
* Preprocessing. In this phase a distance matrix for all point is created and distances between every two points is calculated. Very parallel.
* Main Process. In this phase the algorithm will run, as described in pseudo-code, and two foreach's are parallelized. Region queries are done very fast (O(1)), because of preprocessing.


was (Author: angellandros):
This is DBSCAN algorithm:

{noformat}
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)
          
expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
      if P' is not visited
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      if P' is not yet member of any cluster
         add P' to cluster C
          
regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
{noformat}

As you can see, there are just two parameters. There is two ways of implementation. First one is faster (O(n log n), and requires more memory (O(n^2)). The other way is slower (O(n^2)) and requires less memory (O(n)). But I prefer the first one, as we are not short one memory.
There are two phases of running:
* Preprocessing. In this phase a distance matrix for all point is created and distances between every two points is calculated. Very parallel.
* Main Process. In this phase the algorithm will run, as described in pseudo-code, and two foreach's are parallelized. Region queries are done very fast (O(1)), because of preprocessing.

> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-5226
>                 URL: https://issues.apache.org/jira/browse/SPARK-5226
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Muhammad-Ali A'rabi
>            Priority: Minor
>              Labels: DBSCAN
>
> MLlib is all k-means now, and I think we should add some new clustering algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org