Posted to dev@madlib.apache.org by GitBox <gi...@apache.org> on 2021/09/15 23:51:18 UTC

[GitHub] [madlib] reductionista commented on a change in pull request #571: Fast parallel-optimized DBSCAN

reductionista commented on a change in pull request #571:
URL: https://github.com/apache/madlib/pull/571#discussion_r709652611



##########
File path: src/ports/postgres/modules/dbscan/dbscan.py_in
##########
@@ -164,76 +274,1271 @@ def dbscan(schema_madlib, source_table, output_table, id_column, expr_point, eps
         """.format(**locals())
         plpy.execute(sql)
 
+        # -- Generate output summary table ---
+
         output_summary_table = add_postfix(output_table, '_summary')
-        plpy.execute("DROP TABLE IF EXISTS {0}".format(output_summary_table))
 
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(output_summary_table))
         sql = """
             CREATE TABLE {output_summary_table} AS
             SELECT  '{id_column}'::VARCHAR AS id_column,
+                    '{expr_point}'::VARCHAR AS expr_point,
                     {eps}::DOUBLE PRECISION AS eps,
+                    {min_samples}::BIGINT AS min_points,

Review comment:
       Hmm... I guess we could do that.
   
   That reminds me, I was planning to ask @orhankislal about this. I'm guessing the main reason the summary table originally included only `eps` and `metric` is that both are required inputs for predict to work, while the others are not. But I'm not sure whether there were other motivations for keeping the summary table minimal (consistency with other modules?)
   
   I added `min_samples` to this branch when I noticed how difficult its absence made some tasks. Suppose you have several DBSCAN output tables lying around from a while back, and you want to run DBSCAN on a new table that's similar to the others (e.g. scaled up or down in size with random sampling, noise added, a test/validation split, etc.), but you never wrote down what inputs were used. You then have to regenerate all of the outputs, since there is no way of recovering what value of `min_samples` was used. Although different values of `min_samples` generally produce different outputs, you can't infer the value from the output alone: it could be that there simply weren't any clusters smaller than a certain size. That could be an important feature of your input data, or it could just be an artifact of the inputs used.
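   The ambiguity above is easy to reproduce. Below is a minimal, self-contained DBSCAN sketch (a naive O(n^2) illustration, not MADlib's parallel-optimized implementation) run on a dataset whose smallest cluster has four points: every `min_samples` value from 2 through 4 produces identical labels, so the value actually used cannot be inferred from the output alone.

```python
import math

def dbscan(points, eps, min_samples):
    """Naive DBSCAN: returns one cluster id per point, or -1 for noise.
    A point's eps-neighborhood includes the point itself, following the
    usual definition."""
    n = len(points)

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1              # noise (may become a border point later)
            continue
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:                    # expand the cluster from its core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise reachable from a core point -> border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                seeds.extend(j_nbrs)    # j is also a core point; keep expanding
        cluster += 1
    return labels

# Two well-separated 4-point clusters; within-cluster distances are <= sqrt(2).
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]

# min_samples = 2, 3, and 4 all yield exactly the same labeling...
results = [dbscan(pts, 1.5, m) for m in (2, 3, 4)]
assert results[0] == results[1] == results[2]
# ...while min_samples = 5 turns everything into noise.
assert dbscan(pts, 1.5, 5) == [-1] * 8
```

   So given only the output table `[0, 0, 0, 0, 1, 1, 1, 1]`, any of those three `min_samples` values is a consistent explanation, which is exactly why recording it in the summary table matters.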
   
   I didn't add `algorithm` or `max_depth` because they are not inputs to DBSCAN in the computational sense of mapping a set of inputs to a set of outputs. If you know the input table, `eps`, and `min_samples`, those three together determine what the output looks like (aside from some slight randomness in which cluster a point near multiple clusters gets assigned to, due to the order in which points are considered). Neither `max_depth` nor `algorithm` affects the output; they only affect the runtime. So my thinking was: there's probably no reason a data scientist would care later by what means the output was generated, as long as they have all the inputs.
   
   I don't have any strong objection to including `algorithm` in the summary, but if we do, then I guess we should also include `max_depth`, since it also affects the runtime and is sort of like choosing a sub-algorithm among different variants of `optimized`, all of which should give the same result. There is one use case I can think of for it: performance comparisons of DBSCAN runs on similar datasets, such as scale testing. That would at least be useful for developers, if someone wants to extend or improve DBSCAN, and possibly for some end users who want to do runtime calibration on smaller datasets before attempting something very large.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@madlib.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org