You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sedona.apache.org by GitBox <gi...@apache.org> on 2022/04/15 20:33:12 UTC

[GitHub] [incubator-sedona] krishnarb3 opened a new pull request, #609: Feature/dbscan

krishnarb3 opened a new pull request, #609:
URL: https://github.com/apache/incubator-sedona/pull/609

   ## Is this PR related to a proposed Issue?
   [SEDONA-46](https://issues.apache.org/jira/browse/SEDONA-46)
   
   ## What changes were proposed in this PR?
   ST_ClusterDBSCAN
   
   ## How was this patch tested?
   Added unit tests for DBScan and UnionFind
   
   ## Did this PR include necessary documentation updates?
   Waiting for confirmation of the approach


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-sedona] jiayuasu closed pull request #609: Feature/dbscan

Posted by GitBox <gi...@apache.org>.
jiayuasu closed pull request #609: Feature/dbscan
URL: https://github.com/apache/incubator-sedona/pull/609


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-sedona] krishnarb3 commented on a diff in pull request #609: Feature/dbscan

Posted by GitBox <gi...@apache.org>.
krishnarb3 commented on code in PR #609:
URL: https://github.com/apache/incubator-sedona/pull/609#discussion_r857571209


##########
core/src/main/java/org/apache/sedona/core/spatialOperator/DBScanQuery.java:
##########
@@ -0,0 +1,31 @@
+package org.apache.sedona.core.spatialOperator;
+
+import org.apache.sedona.core.dbscanJudgement.DBScanJudgement;
+import org.apache.sedona.core.knnJudgement.GeometryDistanceComparator;
+import org.apache.sedona.core.knnJudgement.KnnJudgementUsingIndex;
+import org.apache.sedona.core.spatialRDD.SpatialRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.locationtech.jts.geom.Geometry;
+
+import java.io.Serializable;
+import java.util.HashSet;
+import java.util.List;
+
+public class DBScanQuery
+        implements Serializable
+{
+    public static <T extends Geometry> List<Integer> SpatialDBScanQuery(SpatialRDD<T> spatialRDD, double eps, int minPoints, boolean useIndex)
+    {
+        if (useIndex) {
+            if (spatialRDD.indexedRawRDD == null) {
+                throw new NullPointerException("Need to invoke buildIndex() first, indexedRDDNoId is null");
+            }
+            JavaRDD<Integer> result = spatialRDD.getRawSpatialRDD().repartition(1).mapPartitions(new DBScanJudgement(eps, minPoints, new HashSet<>()), true);

Review Comment:
   Thank you @jiayuasu, I will work on this next week



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-sedona] jiayuasu commented on a diff in pull request #609: Feature/dbscan

Posted by GitBox <gi...@apache.org>.
jiayuasu commented on code in PR #609:
URL: https://github.com/apache/incubator-sedona/pull/609#discussion_r855462337


##########
core/src/main/java/org/apache/sedona/core/spatialOperator/DBScanQuery.java:
##########
@@ -0,0 +1,31 @@
+package org.apache.sedona.core.spatialOperator;
+
+import org.apache.sedona.core.dbscanJudgement.DBScanJudgement;
+import org.apache.sedona.core.knnJudgement.GeometryDistanceComparator;
+import org.apache.sedona.core.knnJudgement.KnnJudgementUsingIndex;
+import org.apache.sedona.core.spatialRDD.SpatialRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.locationtech.jts.geom.Geometry;
+
+import java.io.Serializable;
+import java.util.HashSet;
+import java.util.List;
+
+public class DBScanQuery
+        implements Serializable
+{
+    public static <T extends Geometry> List<Integer> SpatialDBScanQuery(SpatialRDD<T> spatialRDD, double eps, int minPoints, boolean useIndex)
+    {
+        if (useIndex) {
+            if (spatialRDD.indexedRawRDD == null) {
+                throw new NullPointerException("Need to invoke buildIndex() first, indexedRDDNoId is null");
+            }
+            JavaRDD<Integer> result = spatialRDD.getRawSpatialRDD().repartition(1).mapPartitions(new DBScanJudgement(eps, minPoints, new HashSet<>()), true);

Review Comment:
   If your implementation cannot work on multiple partitions, then this PR will NOT be accepted. We are looking for a distributed DBScan algorithm. For example, this one: https://github.com/irvingc/dbscan-on-spark
   
   You can incorporate this library into Sedona since this one is also under Apache 2.0 License



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-sedona] jiayuasu commented on pull request #609: Feature/dbscan

Posted by GitBox <gi...@apache.org>.
jiayuasu commented on PR #609:
URL: https://github.com/apache/incubator-sedona/pull/609#issuecomment-1100739916

   Thanks for your contribution. This PR is huge. I will take some to review it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-sedona] krishnarb3 commented on pull request #609: Feature/dbscan

Posted by GitBox <gi...@apache.org>.
krishnarb3 commented on PR #609:
URL: https://github.com/apache/incubator-sedona/pull/609#issuecomment-1100404580

   This follows the POSTGIS implementation of ST_DBScan.
   I have not added this as a ST_Function since that would not be able to support running DBScan on multiple partitions in the future.
   
   @jiayuasu What are your thoughts on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-sedona] krishnarb3 commented on pull request #609: Feature/dbscan

Posted by GitBox <gi...@apache.org>.
krishnarb3 commented on PR #609:
URL: https://github.com/apache/incubator-sedona/pull/609#issuecomment-1226131510

   I'll pick this PR back up this week


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-sedona] krishnarb3 commented on a diff in pull request #609: Feature/dbscan

Posted by GitBox <gi...@apache.org>.
krishnarb3 commented on code in PR #609:
URL: https://github.com/apache/incubator-sedona/pull/609#discussion_r851517297


##########
core/src/main/java/org/apache/sedona/core/spatialOperator/DBScanQuery.java:
##########
@@ -0,0 +1,31 @@
+package org.apache.sedona.core.spatialOperator;
+
+import org.apache.sedona.core.dbscanJudgement.DBScanJudgement;
+import org.apache.sedona.core.knnJudgement.GeometryDistanceComparator;
+import org.apache.sedona.core.knnJudgement.KnnJudgementUsingIndex;
+import org.apache.sedona.core.spatialRDD.SpatialRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.locationtech.jts.geom.Geometry;
+
+import java.io.Serializable;
+import java.util.HashSet;
+import java.util.List;
+
+public class DBScanQuery
+        implements Serializable
+{
+    public static <T extends Geometry> List<Integer> SpatialDBScanQuery(SpatialRDD<T> spatialRDD, double eps, int minPoints, boolean useIndex)
+    {
+        if (useIndex) {
+            if (spatialRDD.indexedRawRDD == null) {
+                throw new NullPointerException("Need to invoke buildIndex() first, indexedRDDNoId is null");
+            }
+            JavaRDD<Integer> result = spatialRDD.getRawSpatialRDD().repartition(1).mapPartitions(new DBScanJudgement(eps, minPoints, new HashSet<>()), true);

Review Comment:
   Since the implementation of DBScan is not parallel/distributed, it does not work when there are more than one partition(s).
   @jiayuasu Do you think this is acceptable?
   
   I will try to come up with an implementation that can be run on multiple partitions later in that case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-sedona] krishnarb3 commented on a diff in pull request #609: Feature/dbscan

Posted by GitBox <gi...@apache.org>.
krishnarb3 commented on code in PR #609:
URL: https://github.com/apache/incubator-sedona/pull/609#discussion_r851513402


##########
core/src/main/java/org/apache/sedona/core/dbscanJudgement/DBScanJudgement.java:
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sedona.core.dbscanJudgement;
+
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.locationtech.jts.geom.Envelope;
+import org.locationtech.jts.geom.Geometry;
+import org.locationtech.jts.geom.Point;
+import org.locationtech.jts.index.strtree.STRtree;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+// TODO: Auto-generated Javadoc
+
+/**
+ * The Class DBScanJudgement.
+ */
+public class DBScanJudgement<T extends Geometry>
+        implements FlatMapFunction<Iterator<T>, Integer> {
+
+    /**
+     * The eps distance
+     */
+    double eps;
+
+    /**
+     * The minimum number of points to form a cluster
+     */
+    int minPoints;
+
+    /**
+     * The indexes of geoms that are already part of a cluster
+     */
+    Set<Integer> isInCluster;
+
+    /**
+     * Instantiates a new geometry knn judgement.
+     *
+     * @param eps       the distance eps
+     * @param minPoints the minimum number of points to form a cluster
+     */
+    public DBScanJudgement(double eps, int minPoints, Set<Integer> isInCluster) {
+        this.eps = eps;
+        this.minPoints = minPoints;
+        this.isInCluster = isInCluster;
+    }
+
+
+    @Override
+    public Iterator<Integer> call(Iterator<T> input) throws Exception {
+        List<T> geoms = new ArrayList<>();
+        while (input.hasNext()) {
+            geoms.add(input.next());
+        }
+        if (geoms.size() < minPoints) {
+            return Collections.emptyIterator();
+        }
+        Set<Integer> isInCore = new HashSet<>();
+        Integer[] neighbors = new Integer[minPoints];
+        UnionFind unionFind = new UnionFindImpl(geoms.size());
+        STRtree strtree = new STRtree(geoms.size());
+        for (Geometry geom : geoms) {
+            strtree.insert(geom.getEnvelopeInternal(), geom);
+        }
+        for (int i = 0; i < geoms.size(); i++) {
+            int numNeighbors = 0;
+            List<Integer> geomsInEnvelope = getGeomsInEnvelope(geoms, i, eps, strtree);
+            if (geomsInEnvelope.size() < minPoints) {
+                continue;
+            }
+
+            for (Integer j : geomsInEnvelope) {
+                if (numNeighbors >= minPoints) {
+                    /*
+                     * If we've already identified p as a core point, and it's already

Review Comment:
   @jiayuasu Have added the comments from POSTGIS code for clarity



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-sedona] krishnarb3 commented on a diff in pull request #609: Feature/dbscan

Posted by GitBox <gi...@apache.org>.
krishnarb3 commented on code in PR #609:
URL: https://github.com/apache/incubator-sedona/pull/609#discussion_r851511902


##########
core/src/main/java/org/apache/sedona/core/dbscanJudgement/UnionFindImpl.java:
##########
@@ -0,0 +1,134 @@
+package org.apache.sedona.core.dbscanJudgement;
+
+import java.util.Arrays;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+
+public class UnionFindImpl implements UnionFind {
+    private List<Integer> clusters;
+    private List<Integer> clusterSizes;
+    private int numClusters;
+    private final int n;
+
+    public UnionFindImpl(int n) {
+        this.n = n;
+        this.numClusters = n;
+        this.clusters = new ArrayList<>();
+        this.clusterSizes = new ArrayList<>();
+        for (int i = 0; i < n; i++) {
+            clusters.add(i);
+            clusterSizes.add(1);
+        }
+    }
+
+    @Override
+    public int find(int i) {
+        int base = i;
+        while (!clusters.get(base).equals(base)) {
+            base = clusters.get(base);
+        }
+        while (i != base) {
+            int next = clusters.get(i);
+            clusters.set(i, base);
+            i = next;
+        }
+        return i;
+    }
+
+    @Override
+    public int size(int i) {
+        return clusterSizes.get(find(i));
+    }
+
+    @Override
+    public void union(int i, int j) {
+        int a = find(i);
+        int b = find(j);
+        if (a == b) {
+            return;
+        }
+        if (clusterSizes.get(a) < clusterSizes.get(b) ||
+                (Objects.equals(clusterSizes.get(a), clusterSizes.get(b)) && a > b)) {
+            clusters.set(a, clusters.get(b));
+            clusterSizes.set(b, clusterSizes.get(a) + clusterSizes.get(b));
+            clusterSizes.set(a, 0);
+        } else {
+            clusters.set(b, clusters.get(a));
+            clusterSizes.set(a, clusterSizes.get(a) + clusterSizes.get(b));
+            clusterSizes.set(b, 0);
+        }
+        numClusters--;
+    }
+
+    @Override
+    public Integer[] orderedByCluster() {
+        Integer[] clusterIdByElemId = new Integer[n];
+        for (int i = 0; i < n; i++) {
+            find(i);
+            clusterIdByElemId[i] = i;
+        }
+        Arrays.sort(clusterIdByElemId, Comparator.comparing(index -> clusters.get(index)));
+        return clusterIdByElemId;
+    }
+
+    @Override
+    public Integer[] getCollapsedClusterIds(Set<Integer> isInCluster) {
+        Integer[] orderedComponents = orderedByCluster();
+        Integer[] newIds = new Integer[n];
+        int lastOldId = 0, currentNewId = 0;
+        boolean encounteredCluster = false;
+
+        if (isInCluster == null) {
+            isInCluster = new HashSet<>();
+        }
+
+        for (int i = 0; i < n; i++) {

Review Comment:
   As per POSTGIS `getCollapsedClusterIds` will set the cluster id only for indexes that are part of isInCluster or all the indexes if isInCluster is empty.
   @jiayuasu Is this the expected behavior?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org