You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by orhankislal <gi...@git.apache.org> on 2017/10/24 00:33:58 UTC

[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

GitHub user orhankislal opened a pull request:

    https://github.com/apache/madlib/pull/191

    KNN: Fix optional parameters and ordering

    Additional Author : Frank McQuillan <fm...@pivotal.io>
    
    JIRA: MADLIB-1129
    
    Add the necessary interfaces for optional parameters.
    Add ordering for the nearest neighbors.
    Change the default for output_neighbors to True.
    Update the documentation to reflect the changes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/orhankislal/madlib knn_t2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/191.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #191
    
----
commit a32716466a87be1f47bc0c8aa4b370c91670e6e6
Author: Orhan Kislal <ok...@pivotal.io>
Date:   2017-10-23T23:51:27Z

    KNN: Fix optional parameters and ordering
    
    Additional Author : Frank McQuillan <fm...@pivotal.io>
    
    JIRA: MADLIB-1129
    
    Add the necessary interfaces for optional parameters.
    Add ordering for the nearest neighbors.
    Change the default for output_neighbors to True.
    Update the documentation to reflect the changes.

----


---

[GitHub] madlib issue #191: KNN: Fix optional parameters and ordering

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the issue:

    https://github.com/apache/madlib/pull/191
  
    jenkins ok to retest


---

[GitHub] madlib issue #191: KNN: Fix optional parameters and ordering

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/191
  
    LGTM based on some testing and docs review.
    I would ask other community folks to pls review code in more detail however.


---

[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/madlib/pull/191


---

[GitHub] madlib issue #191: KNN: Fix optional parameters and ordering

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/191
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/264/



---

[GitHub] madlib issue #191: KNN: Fix optional parameters and ordering

Posted by orhankislal <gi...@git.apache.org>.
Github user orhankislal commented on the issue:

    https://github.com/apache/madlib/pull/191
  
    Thanks for the comments @jingyimei and @iyerr3. I have made some significant changes to the code with the latest commit. I would greatly appreciate your comments on them.


---

[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/191#discussion_r146700914
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -160,20 +164,23 @@ def knn(schema_madlib, point_source, point_column_name, point_id, label_column_n
                             ) {x_temp_table}
                         ) {y_temp_table}
                     WHERE {y_temp_table}.r <= {k_val}
    -                """.format(**locals()))
    -            plpy.execute(
    -                """
    +                """.format(**locals())
    +            plpy.execute(sql1)
    +
    +            sql2 = """
                     CREATE TABLE {output_table} AS
                         SELECT {test_id_temp} AS id, {test_column_name} ,
                             CASE WHEN {output_neighbors}
    -                        THEN array_agg(knn_temp.train_id)
    +                        THEN array_agg(knn_temp.train_id
    +                            ORDER BY knn_temp.dist ASC)
    --- End diff --
    
    Can be in one line


---

[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/191#discussion_r146702329
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -135,13 +135,17 @@ def knn(schema_madlib, point_source, point_column_name, point_id, label_column_n
             test_id_temp = unique_string(desp='test_id_temp')
     
             if output_neighbors is None or '':
    -            output_neighbors = False
    +            output_neighbors = True
    --- End diff --
    
    On second read of the code: output_neighbors is already a boolean. Is there a reason to expect it to be `''`? 


---

[GitHub] madlib issue #191: KNN: Fix optional parameters and ordering

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/191
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/260/



---

[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/191#discussion_r146702427
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -135,13 +135,17 @@ def knn(schema_madlib, point_source, point_column_name, point_id, label_column_n
             test_id_temp = unique_string(desp='test_id_temp')
     
             if output_neighbors is None or '':
    -            output_neighbors = False
    +            output_neighbors = True
     
             interim_table = unique_string(desp='interim_table')
     
             if label_column_name is None or label_column_name == '':
    --- End diff --
    
    `if not label_column_name`


---

[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/191#discussion_r146702725
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -135,13 +135,17 @@ def knn(schema_madlib, point_source, point_column_name, point_id, label_column_n
             test_id_temp = unique_string(desp='test_id_temp')
     
             if output_neighbors is None or '':
    -            output_neighbors = False
    +            output_neighbors = True
     
             interim_table = unique_string(desp='interim_table')
     
             if label_column_name is None or label_column_name == '':
    -            plpy.execute(
    -                """
    +
    +            if output_neighbors is False:
    +                plpy.error("kNN error: Either label_column_name or "
    +                       "output_neighbors has to be non-NULL.")
    +
    +            sql1 = """
    --- End diff --
    
    if `sql1` and `sql2` (below) are not reused, then I would prefer to just directly call the `plpy.execute` with the string. 


---

[GitHub] madlib issue #191: KNN: Fix optional parameters and ordering

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/191
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/263/



---

[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/191#discussion_r146646897
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -135,13 +135,17 @@ def knn(schema_madlib, point_source, point_column_name, point_id, label_column_n
             test_id_temp = unique_string(desp='test_id_temp')
     
             if output_neighbors is None or '':
    -            output_neighbors = False
    +            output_neighbors = True
    --- End diff --
    
    We should give this flag a different name from the input string variable. 


---

[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/191#discussion_r146703574
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -215,7 +222,8 @@ def knn(schema_madlib, point_source, point_column_name, point_id, label_column_n
             knn_pred_class = schema_madlib + \
                 '.mode(' + label_col_temp + ') AS prediction'
             knn_pred_reg = 'avg(' + label_col_temp + ') AS prediction'
    -        knn_neighbours = ', array_agg(knn_temp.train_id) AS k_nearest_neighbours '
    +        knn_neighbours = ', array_agg(knn_temp.train_id ORDER BY knn_temp.dist ASC)'\
    --- End diff --
    
    Preferred to break long lines by wrapping expressions in parentheses over using a backslash. ([PEP8 link](https://www.python.org/dev/peps/pep-0008/#maximum-line-length)) 


---

[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/191#discussion_r146700815
  
    --- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
    @@ -96,19 +95,19 @@ in a column of type <tt>DOUBLE PRECISION[]</tt>.
     
     <dt>point_id</dt>
     <dd>TEXT. Name of the column in 'point_source’ containing source data ids.
    -The ids are of type INTEGER with no duplicates. They do not need to be contiguous. 
    -This parameter must be used if the list of nearest neighbors are to be output, i.e., 
    +The ids are of type INTEGER with no duplicates. They do not need to be contiguous.
    +This parameter must be used if the list of nearest neighbors are to be output, i.e.,
     if the parameter 'output_neighbors' below is TRUE or if 'label_column_name' is NULL.
     
     <dt>label_column_name</dt>
     <dd>TEXT. Name of the column with labels/values of training data points.
    -If Boolean, integer or text types will run knn classification, else if 
    -double precision values will run knn regression.  
    -If you set this to NULL will return neighbors only without doing classification or regression.</dd>
    +If Boolean, integer or text types will run KNN classification, else if
    +double precision values will run KNN regression.
    +If you set this to NULL, will return the set of neighbors only without
    +actually doing classification or regression.</dd>
    --- End diff --
    
    When I read this, I had some problem understanding the sentence. Can we modify this sentence like "If you set this to NULL, it will only return the set of neighbors without actually doing classification or regression"?


---

[GitHub] madlib issue #191: KNN: Fix optional parameters and ordering

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/191
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/261/



---

[GitHub] madlib issue #191: KNN: Fix optional parameters and ordering

Posted by orhankislal <gi...@git.apache.org>.
Github user orhankislal commented on the issue:

    https://github.com/apache/madlib/pull/191
  
    jenkins ok to test


---

[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/191#discussion_r146646995
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -135,13 +135,17 @@ def knn(schema_madlib, point_source, point_column_name, point_id, label_column_n
             test_id_temp = unique_string(desp='test_id_temp')
     
             if output_neighbors is None or '':
    -            output_neighbors = False
    +            output_neighbors = True
     
             interim_table = unique_string(desp='interim_table')
     
             if label_column_name is None or label_column_name == '':
    -            plpy.execute(
    -                """
    +
    +            if output_neighbors is False:
    --- End diff --
    
    'if not output_neighbors:'


---

[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/191#discussion_r146701746
  
    --- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
    @@ -218,56 +215,57 @@ INSERT INTO knn_test_data VALUES
     
     -#  Run KNN for classification:
     <pre class="example">
    -DROP TABLE IF EXISTS madlib_knn_result_classification;
    -SELECT * FROM madlib.knn( 
    +DROP TABLE IF EXISTS knn_result_classification;
    +SELECT * FROM madlib.knn(
                     'knn_train_data',      -- Table of training data
                     'data',                -- Col name of training data
    -                'id',                  -- Col Name of id in train data 
    +                'id',                  -- Col Name of id in train data
                     'label',               -- Training labels
                     'knn_test_data',       -- Table of test data
                     'data',                -- Col name of test data
    -                'id',                  -- Col name of id in test data 
    -                'madlib_knn_result_classification',  -- Output table
    -                 3                     -- Number of nearest neighbours
    +                'id',                  -- Col name of id in test data
    +                'knn_result_classification',  -- Output table
    +                 3,                    -- Number of nearest neighbors
                      True                  -- True if you want to show Nearest-Neighbors, False otherwise
    --- End diff --
    
    We can say "True if you want to show Nearest-Neighbors by id" to make it clearer


---