You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Muhammad-Ali A'rabi (JIRA)" <ji...@apache.org> on 2015/01/15 20:33:36 UTC
[jira] [Comment Edited] (SPARK-5226) Add DBSCAN Clustering
Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279170#comment-14279170 ]
Muhammad-Ali A'rabi edited comment on SPARK-5226 at 1/15/15 7:33 PM:
---------------------------------------------------------------------
This is DBSCAN algorithm:
{noformat}
DBSCAN(D, eps, MinPts)
C = 0
for each unvisited point P in dataset D
mark P as visited
NeighborPts = regionQuery(P, eps)
if sizeof(NeighborPts) < MinPts
mark P as NOISE
else
C = next cluster
expandCluster(P, NeighborPts, C, eps, MinPts)
expandCluster(P, NeighborPts, C, eps, MinPts)
add P to cluster C
for each point P' in NeighborPts
if P' is not visited
mark P' as visited
NeighborPts' = regionQuery(P', eps)
if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with NeighborPts'
if P' is not yet member of any cluster
add P' to cluster C
regionQuery(P, eps)
return all points within P's eps-neighborhood (including P)
{noformat}
As you can see, there are just two parameters. There is two ways of implementation. First one is faster (O(n log n), and requires more memory (O(n^2)). The other way is slower (O(n^2)) and requires less memory (O(n)). But I prefer the first one, as we are not short one memory.
There are two phases of running:
* Preprocessing. In this phase a distance matrix for all point is created and distances between every two points is calculated. Very parallel.
* Main Process. In this phase the algorithm will run, as described in pseudo-code, and two foreach's are parallelized. Region queries are done very fast (O(1)), because of preprocessing.
was (Author: angellandros):
This is DBSCAN algorithm:
{noformat}
DBSCAN(D, eps, MinPts)
C = 0
for each unvisited point P in dataset D
mark P as visited
NeighborPts = regionQuery(P, eps)
if sizeof(NeighborPts) < MinPts
mark P as NOISE
else
C = next cluster
expandCluster(P, NeighborPts, C, eps, MinPts)
expandCluster(P, NeighborPts, C, eps, MinPts)
add P to cluster C
for each point P' in NeighborPts
if P' is not visited
mark P' as visited
NeighborPts' = regionQuery(P', eps)
if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with NeighborPts'
if P' is not yet member of any cluster
add P' to cluster C
regionQuery(P, eps)
return all points within P's eps-neighborhood (including P)
{noformat}
As you can see, there are just two parameters. There is two ways of implementation. First one is faster (O(n log n), and requires more memory (O(n^2)). The other way is slower (O(n^2)) and requires less memory (O(n)). But I prefer the first one, as we are not short one memory.
There are two phases of running:
* Preprocessing. In this phase a distance matrix for all point is created and distances between every two points is calculated. Very parallel.
* Main Process. In this phase the algorithm will run, as described in pseudo-code, and two foreach's are parallelized. Region queries are done very fast (O(1)), because of preprocessing.
> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Muhammad-Ali A'rabi
> Priority: Minor
> Labels: DBSCAN
>
> MLlib is all k-means now, and I think we should add some new clustering algorithms to it. First candidate is DBSCAN as I think.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org