You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by hpandeycodeit <gi...@git.apache.org> on 2018/08/28 06:48:30 UTC

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

GitHub user hpandeycodeit opened a pull request:

    https://github.com/apache/madlib/pull/315

     JIRA:1060 - Modified KNN to accept expressions in point_column_name …

    JIRA: MADLIB-1060
    
    This PR contains changes for KNN: 
    
    1. Modified code - knn.py_in,  to support regular PostgreSQL expressions for point_column_name and test_column_name parameters. 
    2. Added test cases for the above scenarios. 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hpandeycodeit/incubator-madlib MADLIB_1060

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/315.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #315
    
----
commit 6ec93a2becbe32d37e1fd4f01ad59ac70d71dab0
Author: hpandeycodeit <hp...@...>
Date:   2018-08-28T06:40:13Z

     JIRA:1060 - Modified KNN to accept expressions in point_column_name and test_column_name

----


---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/671/



---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    (1)
    expression for test data array:
    ```
    DROP TABLE IF EXISTS knn_result_classification;
    
    SELECT * FROM madlib.knn(
                    'knn_train_data',      -- Table of training data
                    'data',                -- Col name of training data
                    'id',                  -- Col name of id in train data
                    'label',               -- Training labels
                    'knn_test_data',       -- Table of test data
                    '3 || ARRAY[4]',                -- Col name of test data
                    'id',                  -- Col name of id in test data
                    'knn_result_classification',  -- Output table
                     3,                    -- Number of nearest neighbors
                     True,                 -- True to list nearest-neighbors by id
                     'madlib.squared_dist_norm2' -- Distance function
                    );
    
    SELECT * from knn_result_classification ORDER BY id;
    ```
    produces
    ```
     id | 3 || ARRAY[4] | prediction | k_nearest_neighbours 
    ----+---------------+------------+----------------------
      1 | {3,4}         |          1 | {3,4,5}
      2 | {3,4}         |          1 | {3,4,5}
      3 | {3,4}         |          1 | {3,4,5}
      4 | {3,4}         |          1 | {4,3,5}
      5 | {3,4}         |          1 | {3,4,5}
      6 | {3,4}         |          1 | {4,3,5}
    (6 rows)
    ```
    
    
    (2)
    another expression for test data array:
    ```
    DROP TABLE IF EXISTS knn_result_classification;
    
    SELECT * FROM madlib.knn(
                    'knn_train_data',      -- Table of training data
                    'data',                -- Col name of training data
                    'id',                  -- Col name of id in train data
                    'label',               -- Training labels
                    'knn_test_data',       -- Table of test data
                    'array[3.]::int[] || array[4]',                -- Col name of test data
                    'id',                  -- Col name of id in test data
                    'knn_result_classification',  -- Output table
                     3,                    -- Number of nearest neighbors
                     True,                 -- True to list nearest-neighbors by id
                     'madlib.squared_dist_norm2' -- Distance function
                    );
    
    SELECT * from knn_result_classification ORDER BY id;
    ```
    produces
    ```
     id | array[3.]::int[] || array[4] | prediction | k_nearest_neighbours 
    ----+------------------------------+------------+----------------------
      1 | {3,4}                        |          1 | {3,4,5}
      2 | {3,4}                        |          1 | {3,4,5}
      3 | {3,4}                        |          1 | {4,3,5}
      4 | {3,4}                        |          1 | {3,4,5}
      5 | {3,4}                        |          1 | {4,3,5}
      6 | {3,4}                        |          1 | {4,3,5}
    (6 rows)
    ```
    so this bit seems to work



---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    
    load data:
    ```
    DROP TABLE IF EXISTS knn_train_data;
    
    CREATE TABLE knn_train_data (
                        id integer, 
                        data integer[], 
                        label integer  -- Integer label means for classification
                        );
    
    INSERT INTO knn_train_data VALUES
    (1, '{1,1}', 1),
    (2, '{2,2}', 1),
    (3, '{3,3}', 1),
    (4, '{4,4}', 1),
    (5, '{4,5}', 1),
    (6, '{20,50}', 0),
    (7, '{10,31}', 0),
    (8, '{81,13}', 0),
    (9, '{1,111}', 0);
    ```
    
    run knn to list nearest neighbors only
    (without doing classification or regression). 
    ```
    DROP TABLE IF EXISTS knn_result_list_neighbors;
    
    SELECT * FROM madlib.knn(
                    'knn_train_data_reg',  -- Table of training data
                    'data',                -- Col name of training data
                    'id',                  -- Col Name of id in train data
                    NULL,                  -- NULL training labels means just list neighbors
                    'knn_test_data',       -- Table of test data
                    'data',                -- Col name of test data
                    'id',                  -- Col name of id in test data
                    'knn_result_list_neighbors', -- Output table
                    3                      -- Number of nearest neighbors
                    );
    ```
    results in error
    ```
    ERROR:  plpy.SPIError: column "none" does not exist
    LINE 22: ...b_temp_p_col_name68224549_1536338162_43630832__ , None from ...
                                                                  ^
    QUERY:  
                CREATE TEMP TABLE __madlib_temp_interim_table80328381_1536338162_13726874__ AS
                    SELECT * FROM (
                        SELECT row_number() over
                                (partition by __madlib_temp_test_id_temp32938499_1536338162_42836514__ order by __madlib_temp_dist75319423_1536338162_4342866__) AS __madlib_temp_r77948314_1536338162_302550__,
                                __madlib_temp_test_id_temp32938499_1536338162_42836514__,
                                __madlib_temp_train_id51898468_1536338162_48880352__,
                                CASE WHEN __madlib_temp_dist75319423_1536338162_4342866__ = 0.0 THEN 1000000.0
                                     ELSE 1.0 / __madlib_temp_dist75319423_1536338162_4342866__
                                END AS __madlib_temp_dist_inverse21927322_1536338162_30562226__
                                
                        FROM (
                            SELECT __madlib_temp_test58595915_1536338162_24414359__.id AS __madlib_temp_test_id_temp32938499_1536338162_42836514__,
                                __madlib_temp_train73645570_1536338162_46994036__.id as __madlib_temp_train_id51898468_1536338162_48880352__,
                                madlib.squared_dist_norm2(
                                    __madlib_temp_p_col_name68224549_1536338162_43630832__,
                                    __madlib_temp_t_col_name86464547_1536338162_93305987__)
                                AS __madlib_temp_dist75319423_1536338162_4342866__
                                
                                FROM
                                (
                                SELECT id , data as __madlib_temp_p_col_name68224549_1536338162_43630832__ , None from knn_train_data_reg
                                ) __madlib_temp_train73645570_1536338162_46994036__,
                                (
                                SELECT id ,data as __madlib_temp_t_col_name86464547_1536338162_93305987__ from knn_test_data
                                ) __madlib_temp_test58595915_1536338162_24414359__
                            ) __madlib_temp_x_temp_table56760627_1536338162_7182722__
                        ) __madlib_temp_y_temp_table96170617_1536338162_37029044__
                WHERE __madlib_temp_y_temp_table96170617_1536338162_37029044__.__madlib_temp_r77948314_1536338162_302550__ <= 3
                
    CONTEXT:  Traceback (most recent call last):
      PL/Python function "knn", line 33, in <module>
        weighted_avg
      PL/Python function "knn", line 305, in knn
    PL/Python function "knn"
    SQL statement "SELECT  madlib.knn( $1 , $2 , $3 , $4 , $5 , $6 , $7 , $8 , $9 ,TRUE, 'madlib.squared_dist_norm2', FALSE)"
    PL/pgSQL function "knn" line 4 at assignment
    ```



---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r213792484
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -243,6 +237,23 @@ def knn(schema_madlib, point_source, point_column_name, point_id,
                 label_out = ""
                 comma_label_out_alias = ""
     
    +        # CTAS temporary table from the main table(s) to select the
    +        # given expression from.
    +
    +        point_source_temp_table = unique_string(desp='point_source')
    +        test_source_temp_table = unique_string(desp='test_source')
    +        point_col_name_temp = unique_string(desp='point_col_name_temp')
    +        test_col_name_temp = unique_string(desp='test_col_name_temp')
    +
    +        plpy.execute("""
    +            create table {point_source_temp_table} as
    +            (select {point_id} , {point_column_name} as {point_col_name_temp} , {label_column_name} from {point_source})"""
    +                     .format(**locals()))
    +        plpy.execute("""
    +                create table {test_source_temp_table} as
    +                (select {test_id}, {test_column_name} as {test_col_name_temp} from {test_source})"""
    +                     .format(**locals()))
    --- End diff --
    
    Did you mean to create a temp table for these?
    Either way, we don't have to create a table for these. We could directly run them as subqueries  instead.


---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    I'm not sure what this is doing:
    ```
    %%sql
    DROP TABLE IF EXISTS knn_result_classification;
    
    SELECT * FROM madlib.knn(
                    'knn_train_data',      -- Table of training data
                    'array[99.]::int[] || array[99]',                -- Col name of training data
                    'id',                  -- Col name of id in train data
                    'label',               -- Training labels
                    'knn_test_data',       -- Table of test data
                    'data',                -- Col name of test data
                    'id',                  -- Col name of id in test data
                    'knn_result_classification',  -- Output table
                     1,                    -- Number of nearest neighbors
                     True,                 -- True to list nearest-neighbors by id
                     'madlib.squared_dist_norm2' -- Distance function
                    );
    
    SELECT * from knn_result_classification ORDER BY id;
    ``` 
    produces
    ```
     id |  data   | prediction | k_nearest_neighbours 
    ----+---------+------------+----------------------
      1 | {2,1}   |          0 | {8}
      2 | {2,6}   |          0 | {8}
      3 | {15,40} |          0 | {8}
      4 | {12,1}  |          0 | {8}
      5 | {2,90}  |          1 | {1}
      6 | {50,45} |          1 | {1}
    (6 rows)
    ```
    
    I get the same result if I do:
    ```
    DROP TABLE IF EXISTS knn_result_classification;
    
    SELECT * FROM madlib.knn(
                    'knn_train_data',      -- Table of training data
                    'array[0.]::int[] || array[0]',                -- Col name of training data
                    'id',                  -- Col name of id in train data
                    'label',               -- Training labels
                    'knn_test_data',       -- Table of test data
                    'data',                -- Col name of test data
                    'id',                  -- Col name of id in test data
                    'knn_result_classification',  -- Output table
                     1,                    -- Number of nearest neighbors
                     True,                 -- True to list nearest-neighbors by id
                     'madlib.squared_dist_norm2' -- Distance function
                    );
    
    SELECT * from knn_result_classification ORDER BY id;
    ```
    gives
    ```
     id |  data   | prediction | k_nearest_neighbours 
    ----+---------+------------+----------------------
      1 | {2,1}   |          0 | {8}
      2 | {2,6}   |          0 | {8}
      3 | {15,40} |          0 | {8}
      4 | {12,1}  |          0 | {8}
      5 | {2,90}  |          1 | {1}
      6 | {50,45} |          1 | {1}
    (6 rows)
    ```
    



---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r215149356
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    --- End diff --
    
    Added these functions back. 


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r214485621
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    -
    -    if not is_col_array(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table is not"
    -                   " an array.".format(point_column_name))
    -    if not is_col_array(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table is not"
    -                   " an array.".format(test_column_name))
    -
    -    if not array_col_has_no_null(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table has some"
    -                   " NULL values.".format(point_column_name))
    -    if not array_col_has_no_null(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table has some"
    -                   " NULL values.".format(test_column_name))
    --- End diff --
    
    I think we will need to keep these asserts too. They seem to be checking for a different validation. If the helper function `array_col_has_no_null()` does not work for expressions, please go ahead and change that, or create a new helper function to handle the scenario (both expressions and column names).


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r213791885
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,10 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    -
    -    if not is_col_array(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table is not"
    -                   " an array.".format(point_column_name))
    -    if not is_col_array(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table is not"
    -                   " an array.".format(test_column_name))
    -
    -    if not array_col_has_no_null(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table has some"
    -                   " NULL values.".format(point_column_name))
    -    if not array_col_has_no_null(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table has some"
    -                   " NULL values.".format(test_column_name))
    +
    +    _assert(point_column_name, "KNN error: Invalid point_column_name expression")
    +
    +    _assert(test_column_name, "KNN error: Invalid test_column_name expression")
    --- End diff --
    
    Since the original asserts are removed, this results in the function call not exiting gracefully when we have incorrect param values. You may have to use function `is_var_valid()` in `validate_args.py_in` to validate `point_column_name` and `test_column_name`.


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r214482318
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -264,12 +260,17 @@ def knn(schema_madlib, point_source, point_column_name, point_id,
                             SELECT test.{test_id} AS {test_id_temp},
                                 train.{point_id} as train_id,
                                 {fn_dist}(
    -                                train.{point_column_name},
    -                                test.{test_column_name})
    +                                p_col_name,
    +                                t_col_name)
                                 AS dist
                                 {label_out}
    -                            FROM {point_source} AS train,
    -                                 {test_source} AS test
    +                            FROM
    +                            (
    +                            SELECT {point_id} , {point_column_name} as p_col_name , {label_column_name} from {point_source}
    +                            ) train,
    +                            (
    +                            SELECT {test_id} ,{test_column_name} as t_col_name from {test_source}
    +                            ) test
    --- End diff --
    
    Can you please use variables with unique strings for `train`, `test`, `p_col_name` and `t_col_name`. If the train or test table is named any of those, the query would fail I guess.
    While you are at it, could you also do the same for other variables in this query: `train_id`, `r`, `dist_inverse` and others I may have missed listing out?


---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/670/



---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r214487281
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    -
    -    if not is_col_array(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table is not"
    -                   " an array.".format(point_column_name))
    -    if not is_col_array(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table is not"
    -                   " an array.".format(test_column_name))
    -
    -    if not array_col_has_no_null(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table has some"
    -                   " NULL values.".format(point_column_name))
    -    if not array_col_has_no_null(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table has some"
    -                   " NULL values.".format(test_column_name))
    +
    +    _assert(is_var_valid(point_source, point_column_name),
    +            "kNN error: Invalid point_column_name expression")
    --- End diff --
    
    Can we change the error message to indicate invalid column name or expression? How about 
    `"kNN error: {0} is an invalid column name or expression for point_column_name param".format(point_column_name)`?


---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    @fmcquillan99 thanks for testing this out. I can have a look at this issue.


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r215458345
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    -
    -    if not is_col_array(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table is not"
    -                   " an array.".format(point_column_name))
    -    if not is_col_array(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table is not"
    -                   " an array.".format(test_column_name))
    --- End diff --
    
    I was just playing with the function to see how user-friendly the error message would be. Found that, with the older kNN version, if a column was not of array type, the error was informative:
    `kNN Error: Feature column 'y' in train table is not an array.`
    With the code in this PR, the error for the same failure is:
    `function array_upper(double precision, integer) does not exist`
    
    I don't think this is informative enough for users. I would surely like to continue the discussion that has already happened on this. @fmcquillan99 and @hpandeycodeit , any thoughts?


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r213806554
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,10 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    -
    -    if not is_col_array(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table is not"
    -                   " an array.".format(point_column_name))
    -    if not is_col_array(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table is not"
    -                   " an array.".format(test_column_name))
    -
    -    if not array_col_has_no_null(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table has some"
    -                   " NULL values.".format(point_column_name))
    -    if not array_col_has_no_null(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table has some"
    -                   " NULL values.".format(test_column_name))
    +
    +    _assert(point_column_name, "KNN error: Invalid point_column_name expression")
    +
    +    _assert(test_column_name, "KNN error: Invalid test_column_name expression")
    --- End diff --
    
    `KNN` in the error message is different from `kNN` used in other error messages (capital `K`). Please keep it consistent as `kNN`.


---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/659/



---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r214415227
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -290,11 +303,11 @@ def knn(schema_madlib, point_source, point_column_name, point_id,
                     ON knn_temp.{test_id_temp} = knn_test.{test_id}
                         {view_join}
                     GROUP BY knn_temp.{test_id_temp},
    -                         knn_test.{test_column_name}
    +                    {test_column_name}
                              {view_grp_by}
                 """
             plpy.execute(sql.format(**locals()))
    -        plpy.execute("DROP TABLE IF EXISTS {0}".format(interim_table))
    +        # plpy.execute("DROP TABLE IF EXISTS {0}".format(interim_table))
    --- End diff --
    
    I pushed by mistake. I have reverted the changes. 


---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    Thanks for the update @fmcquillan99! I have fixed the error above and added a test case as well. 


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r214487426
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    -
    -    if not is_col_array(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table is not"
    -                   " an array.".format(point_column_name))
    -    if not is_col_array(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table is not"
    -                   " an array.".format(test_column_name))
    -
    -    if not array_col_has_no_null(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table has some"
    -                   " NULL values.".format(point_column_name))
    -    if not array_col_has_no_null(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table has some"
    -                   " NULL values.".format(test_column_name))
    +
    +    _assert(is_var_valid(point_source, point_column_name),
    +            "kNN error: Invalid point_column_name expression")
    +
    +    _assert(is_var_valid(test_source, test_column_name),
    +            "kNN error: Invalid test_column_name expression")
    --- End diff --
    
    Similar to the above comment, can we change the error message to:
    `"kNN error: {0} is an invalid column name or expression for test_column_name param".format(test_column_name)`?


---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    re-running the failed test, seems to pass now:
    
    ```
    SELECT * FROM knn_result_list_neighbors ORDER BY id;
    ```
    produces
    ```
     id |  data   | k_nearest_neighbours 
    ----+---------+----------------------
      1 | {2,1}   | {1,2,3}
      2 | {2,6}   | {5,4,3}
      3 | {15,40} | {7,6,5}
      4 | {12,1}  | {4,5,3}
      5 | {2,90}  | {9,6,7}
      6 | {50,45} | {6,7,8}
    (6 rows)
    ```
    
    LGTM


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/madlib/pull/315


---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    Actually the earlier issue above ^^^ is OK, where I said `I'm not sure what this is doing` because forcing all training data to be a single point means that the distance to all test points is identical.  So a random nearest neighbor could be picked.  Which it is what seems to be happening.  
    
    So I think just fix the error above and this should be good to go.


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r215464110
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    -
    -    if not is_col_array(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table is not"
    -                   " an array.".format(point_column_name))
    -    if not is_col_array(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table is not"
    -                   " an array.".format(test_column_name))
    --- End diff --
    
    @fmcquillan99 yes, I was looking at the alternatives. @hpandeycodeit I can take care of this. I will push a commit to this PR (you may have to merge it, since I am not a co-author on your branch).


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r213792935
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -264,12 +275,14 @@ def knn(schema_madlib, point_source, point_column_name, point_id,
                             SELECT test.{test_id} AS {test_id_temp},
                                 train.{point_id} as train_id,
                                 {fn_dist}(
    -                                train.{point_column_name},
    -                                test.{test_column_name})
    +                                train.{point_col_name_temp},
    +                                test.{test_col_name_temp})
                                 AS dist
                                 {label_out}
    -                            FROM {point_source} AS train,
    -                                 {test_source} AS test
    +                            FROM
    +                                 {point_source_temp_table} as train,
    +                                 {test_source_temp_table} as test
    --- End diff --
    
    Please use subqueries, instead of tables:
    ```
    (select {point_id} , {point_column_name} as {point_col_name_temp} , {label_column_name} from {point_source}) train,
    (select {test_id}, {test_column_name} as {test_col_name_temp} from {test_source}) test
    ```


---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/672/



---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r214415305
  
    --- Diff: src/ports/postgres/modules/knn/test/knn.sql_in ---
    @@ -122,5 +122,13 @@ select knn('knn_train_data_reg','data','id','label','knn_test_data','data','id',
     select assert(array_agg(prediction::numeric order by id)='{1,1,0.0408728591876018,1,0,0}', 'Wrong output in regression') from madlib_knn_result_regression;
     
     
    +drop table if exists madlib_knn_result_classification;
    +select knn('knn_train_data','data[1:1]||data[2:2]','id','label','knn_test_data','data[1:1]||data[2:2]','id','madlib_knn_result_classification',3,False,'MADLIB_SCHEMA.squared_dist_norm2', True);
    +select assert(array_agg(prediction::numeric order by id)='{1,1,0,1,0,0}', 'Wrong output in classification') from madlib_knn_result_classification;
    +
    +drop table if exists madlib_knn_result_regression;
    +select knn('knn_train_data_reg','data[1:1]||data[2:2]','id','label','knn_test_data','data[1:1]||data[2:2]','id','madlib_knn_result_regression',3,False,'MADLIB_SCHEMA.squared_dist_norm2', True);
    +select assert(array_agg(prediction::numeric order by id)='{1,1,0.0408728591876018,1,0,0}', 'Wrong output in regression') from madlib_knn_result_regression;
    +
    --- End diff --
    
    Added this test as well along with other minor changes as suggested. 


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r213807744
  
    --- Diff: src/ports/postgres/modules/knn/test/knn.sql_in ---
    @@ -122,5 +122,13 @@ select knn('knn_train_data_reg','data','id','label','knn_test_data','data','id',
     select assert(array_agg(prediction::numeric order by id)='{1,1,0.0408728591876018,1,0,0}', 'Wrong output in regression') from madlib_knn_result_regression;
     
     
    +drop table if exists madlib_knn_result_classification;
    +select knn('knn_train_data','data[1:1]||data[2:2]','id','label','knn_test_data','data[1:1]||data[2:2]','id','madlib_knn_result_classification',3,False,'MADLIB_SCHEMA.squared_dist_norm2', True);
    +select assert(array_agg(prediction::numeric order by id)='{1,1,0,1,0,0}', 'Wrong output in classification') from madlib_knn_result_classification;
    +
    +drop table if exists madlib_knn_result_regression;
    +select knn('knn_train_data_reg','data[1:1]||data[2:2]','id','label','knn_test_data','data[1:1]||data[2:2]','id','madlib_knn_result_regression',3,False,'MADLIB_SCHEMA.squared_dist_norm2', True);
    +select assert(array_agg(prediction::numeric order by id)='{1,1,0.0408728591876018,1,0,0}', 'Wrong output in regression') from madlib_knn_result_regression;
    +
    --- End diff --
    
    We could add another test case, where we use `ARRAY(d1, d2)` as the expression, say for `point_column_name`. Of course, you will then have to add two new columns d1 and d2 while creating the input table `knn_train_data` in that case.


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r214484071
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    --- End diff --
    
    `point_id` and `test_id` params are not validated anymore. This should still be done, since the new asserts only check for `point_column_name` and `test_column_name` being valid expressions. This validation is required to catch invalid values such as `NULL`, `''`, invalid or non-existing column name etc. for `point_id` and `test_id` params.


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r214485090
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    -
    -    if not is_col_array(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table is not"
    -                   " an array.".format(point_column_name))
    -    if not is_col_array(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table is not"
    -                   " an array.".format(test_column_name))
    --- End diff --
    
    `point_column_name` and `test_column_name` params must be an array as this if check suggests. If it's not an array it fails further down when the distance function (such as `squared_dist_norm2`) is called.
    
    I don't think the function `is_var_valid()` checks for these being arrays. You may have to check them after the new asserts, using a new helper function (`is_col_array()` cannot be used as is for expressions, and `is_var_valid()` does not check for an array, but just the validity of the expression)


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r215462116
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    -
    -    if not is_col_array(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table is not"
    -                   " an array.".format(point_column_name))
    -    if not is_col_array(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table is not"
    -                   " an array.".format(test_column_name))
    --- End diff --
    
    Yes I agree that this error message is too vague.  So in retrospect I suggest we do check that a valid array type is generated by the expression.   Sorry if I mislead earlier @hpandeycodeit 
    
    @njayaram2 do you have an existing function that we can use for this check?


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r213794697
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -290,11 +303,11 @@ def knn(schema_madlib, point_source, point_column_name, point_id,
                     ON knn_temp.{test_id_temp} = knn_test.{test_id}
                         {view_join}
                     GROUP BY knn_temp.{test_id_temp},
    -                         knn_test.{test_column_name}
    +                    {test_column_name}
                              {view_grp_by}
                 """
             plpy.execute(sql.format(**locals()))
    -        plpy.execute("DROP TABLE IF EXISTS {0}".format(interim_table))
    +        # plpy.execute("DROP TABLE IF EXISTS {0}".format(interim_table))
    --- End diff --
    
    Please bring this back to delete intermediate tables.
    In general, if we create any table that is not part of the output, it has to be dropped. So, if you create `point_source_temp_table` and `test_source_temp_table`, they should also be dropped (in this PR, we don't need to create these tables as mentioned in an earlier comment).


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r215148640
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    -
    -    if not is_col_array(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table is not"
    -                   " an array.".format(point_column_name))
    -    if not is_col_array(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table is not"
    -                   " an array.".format(test_column_name))
    -
    -    if not array_col_has_no_null(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table has some"
    -                   " NULL values.".format(point_column_name))
    -    if not array_col_has_no_null(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table has some"
    -                   " NULL values.".format(test_column_name))
    --- End diff --
    
    Added it back. Had to do a minor change in array_col_has_no_null() for the expressions. 


---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    @fmcquillan99 @njayaram2 
    
    Issue is here 
    
    `{point_id} , {point_column_name} as {p_col_name} {label_column_name} from {point_source}`
    
    I will handle it and add a test case as well. 


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r215148966
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id,
     
         if label_column_name and label_column_name.strip():
             cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
    -    cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
    -    cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
    -
    -    if not is_col_array(point_source, point_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in train table is not"
    -                   " an array.".format(point_column_name))
    -    if not is_col_array(test_source, test_column_name):
    -        plpy.error("kNN Error: Feature column '{0}' in test table is not"
    -                   " an array.".format(test_column_name))
    --- End diff --
    
    As per the Jira and our discussion, it is okay to fail at distance function if the point_column_name and test_column_name params are not arrays. 


---

[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/315
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/664/



---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r214414914
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -264,12 +275,14 @@ def knn(schema_madlib, point_source, point_column_name, point_id,
                             SELECT test.{test_id} AS {test_id_temp},
                                 train.{point_id} as train_id,
                                 {fn_dist}(
    -                                train.{point_column_name},
    -                                test.{test_column_name})
    +                                train.{point_col_name_temp},
    +                                test.{test_col_name_temp})
                                 AS dist
                                 {label_out}
    -                            FROM {point_source} AS train,
    -                                 {test_source} AS test
    +                            FROM
    +                                 {point_source_temp_table} as train,
    +                                 {test_source_temp_table} as test
    --- End diff --
    
    Replaced the temp tables with subqueries. 


---

[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/315#discussion_r215149235
  
    --- Diff: src/ports/postgres/modules/knn/knn.py_in ---
    @@ -264,12 +260,17 @@ def knn(schema_madlib, point_source, point_column_name, point_id,
                             SELECT test.{test_id} AS {test_id_temp},
                                 train.{point_id} as train_id,
                                 {fn_dist}(
    -                                train.{point_column_name},
    -                                test.{test_column_name})
    +                                p_col_name,
    +                                t_col_name)
                                 AS dist
                                 {label_out}
    -                            FROM {point_source} AS train,
    -                                 {test_source} AS test
    +                            FROM
    +                            (
    +                            SELECT {point_id} , {point_column_name} as p_col_name , {label_column_name} from {point_source}
    +                            ) train,
    +                            (
    +                            SELECT {test_id} ,{test_column_name} as t_col_name from {test_source}
    +                            ) test
    --- End diff --
    
    Done. 


---