Posted to dev@madlib.apache.org by njayaram2 <gi...@git.apache.org> on 2017/01/24 23:34:46 UTC

[GitHub] incubator-madlib pull request #81: JIRA: MADLIB-927 Changes made in KNN-help...

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/81#discussion_r97676430
  
    --- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
    @@ -0,0 +1,440 @@
    +/* ----------------------------------------------------------------------- *//**
    + *
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + *
    + *//* ----------------------------------------------------------------------- */
    +
    +
    +/* ----------------------------------------------------------------------- *//**
    + *
    + * @file knn.sql_in
    + * @brief Set of functions for k-nearest neighbors.
    + * @sa For a brief introduction to k-nearest neighbors algorithm for regression and classification,
    + * see the module description \ref grp_knn.
    + *
    + *
    + *//* ----------------------------------------------------------------------- */
    +
    +m4_include(`SQLCommon.m4')
    +
    +
    +/**
    +@addtogroup grp_knn
    +
    +<div class="toc"><b>Contents</b>
    +<ul>
    +<li class="level1"><a href="#knnfunction">KNN Function</a></li>
    +<li class="level1"><a href="#output">Output Format</a></li>
    +<li class="level1"><a href="#examples">Examples</a></li>
    +<li class="level1"><a href="#background">Technical Background</a></li>
    +<li class="level1"><a href="#literature">Literature</a></li>
    +<li class="level1"><a href="#related">Related Topics</a></li>
    +</ul>
    +</div>
    +
    +@brief Finds the k nearest data points to a given data point, and outputs the majority vote of their classes for classification or the average of their target values for regression.
    +
    +k-Nearest Neighbors is a method for finding the \f$k\f$ closest points to a
    +given data point in terms of a given metric. Its input consists of
    +data points as features from testing examples. For a given \f$k\f$, it
    +looks for the \f$k\f$ closest points in the training set for each of the data
    +points in the test set. The algorithm generates one output per testing example.
    +The output of KNN depends on the type of task:
    +for classification, the output is the majority vote of the classes of
    +the \f$k\f$ nearest data points, i.e. the testing example is assigned the
    +most popular class among its nearest neighbors;
    +for regression, the output is the average of the values of the \f$k\f$
    +nearest neighbors of the given testing example.
    +
    +@anchor knnfunction
    +@par KNN Function
    +
    +The k-nearest neighbors function can be invoked in three ways, depending
    +on the number and values of its arguments:
    +
    +- The method that explicitly specifies the value of k.
    +<pre class="syntax">
    +knn( point_source,
    +     point_column_name,
    +     label_column_name,
    +     test_source,
    +     test_column_name,
    +     id_column_name,
    +     output_table,
    +     operation,
    +     k
    +   )
    +</pre>
    +
    +- The method that defaults to k = 1 when k is not passed as the ninth argument.
    +<pre class="syntax">
    +knn( point_source,
    +     point_column_name,
    +     label_column_name,
    +     test_source,
    +     test_column_name,
    +     id_column_name,
    +     output_table,
    +     operation
    +   )
    +</pre>
    +
    +- The method that describes the arguments required by the algorithm.
    +<pre class="syntax">
    +knn( 'help' )
    +</pre>
    +
    +\b Arguments
    +<dl class="arglist">
    +<dt>point_source</dt>
    +<dd>TEXT. The name of the table containing the training data points.
    +
    +Training data points are expected to be stored row-wise,
    +in a column of type <tt>DOUBLE PRECISION[]</tt>.
    +</dd>
    +
    +<dt>point_column_name</dt>
    +<dd>TEXT. The name of the column with training data points.</dd>
    +
    +<dt>label_column_name</dt>
    +<dd>TEXT. The name of the column with labels/values of training data points.</dd>
    +
    +<dt>test_source</dt>
    +<dd>TEXT. The name of the table containing the test data points.
    +
    +Testing data points are expected to be stored row-wise,
    +in a column of type <tt>DOUBLE PRECISION[]</tt>.
    +</dd>
    +
    +<dt>test_column_name</dt>
    +<dd>TEXT. The name of the column with testing data points.</dd>
    +
    +<dt>id_column_name</dt>
    +<dd>TEXT. Name of the column having ids of data points in test data table.</dd>
    + 
    +<dt>output_table</dt>
    +<dd>TEXT. Name of the table to store final results.</dd>
    +
    +<dt>operation</dt>
    +<dd>TEXT. The type of task: <tt>r</tt> for regression and <tt>c</tt> for classification.</dd>
    +
    +<dt>k (optional)</dt>
    +<dd>INTEGER. The number of nearest neighbors to consider.</dd>
    +
    +</dl>
    +
    +
    +@anchor output
    +@par Output Format
    +
    +The output of the KNN module is a table with the following columns:
    +<table class="output">
    +    <tr>
    +        <th>id</th>
    +        <td>INTEGER. The ids of test data points.</td>
    +    </tr>
    +    <tr>
    +        <th>test_column_name</th>
    +        <td>DOUBLE PRECISION[]. The test data points.</td>
    +    </tr>
    +    <tr>
    +        <th>predLabel</th>
    +        <td>DOUBLE PRECISION. The output of KNN: the predicted label for classification, or the average of the neighbors' values for regression.</td>
    +    </tr>
    +</table>
    +
    +
    +@anchor examples
    +@examp
    +
    +-#  Prepare some training data.
    +<pre class="example">
    +CREATE TABLE knn_train_data (id integer, data integer[], label float);
    +COPY knn_train_data (id, data, label) from stdin delimiter '|';
    +1|{1,1}|1.0
    +2|{2,2}|1.0
    +3|{3,3}|1.0
    +4|{4,4}|1.0
    +5|{4,5}|1.0
    +6|{20,50}|0.0
    +7|{10,31}|0.0
    +8|{81,13}|0.0
    +9|{1,111}|0.0
    +\\.
    +</pre>
    +
    +-#  Prepare some testing data.
    +<pre class="example">
    +CREATE TABLE knn_test_data (id integer, data integer[]);
    +COPY knn_test_data (id, data) from stdin delimiter '|';
    +1|{2,1}
    +2|{2,6}
    +3|{15,40}
    +4|{12,1}
    +5|{2,90}
    +6|{50,45}
    +\\.
    +</pre>
    +
    +-#  Run KNN for classification:
    +<pre class="example">
    +SELECT * FROM madlib.knn( 'knn_train_data',
    +                          'data',
    +                          'label',
    +                          'knn_test_data',
    +                          'data',
    +                          'id',
    +                          'madlib_knn_result_classification',
    +                          'c',
    +                          3
    +                        );
    +</pre>
    +Result:
    +<pre class="result">
    +SELECT * from madlib_knn_result_classification;
    +
    + id |  data   | predLabel
    +----+---------+-----------
    +  1 | {2,1}   |         1
    +  2 | {2,6}   |         1
    +  3 | {15,40} |         0
    +  4 | {12,1}  |         1
    +  5 | {2,90}  |         0
    +  6 | {50,45} |         0
    +</pre>
    +
    +-#  Run KNN for regression:
    +<pre class="example">
    +SELECT * FROM madlib.knn( 'knn_train_data',
    +                          'data',
    +                          'label',
    +                          'knn_test_data',
    +                          'data',
    +                          'id',
    +                          'madlib_knn_result_regression',
    +                          'r',
    +                          3
    +                        );
    +</pre>
    +Result:
    +<pre class="result">
    +SELECT * from madlib_knn_result_regression;
    +
    + id |  data   | predLabel
    +----+---------+-----------
    +  1 | {2,1}   |         1
    +  2 | {2,6}   |         1
    +  3 | {15,40} |       0.5
    +  4 | {12,1}  |         1
    +  5 | {2,90}  |      0.25
    +  6 | {50,45} |      0.25
    +
    +</pre>
    +
    +
    +
    +@anchor background
    +@par Technical Background
    +
    +The training data points are vectors in a multidimensional feature space, 
    +each with a class label. The training phase of the algorithm consists 
    +only of storing the feature vectors and class labels of the training points.
    +
    +In the classification phase, k is a user-defined constant, and an unlabeled 
    +vector (a test point) is classified by assigning the label which is most 
    +frequent among the k training samples nearest to that test point.
    +In case of regression, average of the values of these k training samples
    +is assigned to the test point.
    +
    +A commonly used distance metric for continuous variables is Euclidean distance.
    +
    +
    +
    +@anchor literature
    +@literature
    +
    +@anchor knn-lit-1
    +[1] Wikipedia, k-nearest neighbors algorithm,
    +    https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
    +
    +@anchor knn-lit-2
    +[2] N. S. Altman: An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression
    +    http://www.stat.washington.edu/courses/stat527/s13/readings/Altman_AmStat_1992.pdf
    +
    +@anchor knn-lit-3
    +[3] Gongde Guo, Hui Wang, David Bell, Yaxin Bi, Kieran Greer: KNN Model-Based Approach in Classification,
    +    https://ai2-s2-pdfs.s3.amazonaws.com/a7e2/814ec5db800d2f8c4313fd436e9cf8273821.pdf
    +
    +
    +@anchor related
    +@par Related Topics
    +
    +File knn.sql_in documenting the k-nearest neighbors SQL functions
    +
    +
    +
    +
    +@internal
    +@sa namespace knn (documenting the implementation in Python)
    +@endinternal
    +*/
    +
    +
    +
    +
    +
    +
    +
    +
    +
    +
    +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_src(
    +rel_source VARCHAR
    +) RETURNS VOID AS $$
    +    PythonFunction(knn, knn, knn_validate_src)
    +$$ LANGUAGE plpythonu
    +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
    +
    +
    +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
    +    arg1 VARCHAR
    +) RETURNS VOID AS $$
    +BEGIN
    +    IF arg1 = 'help' THEN
    +	RAISE NOTICE 'You need to enter the following arguments in order:
    +	Argument 1: Training data table having training features as a vector column and labels
    +	Argument 2: Name of the column having feature vectors in the training data table
    +	Argument 3: Name of the column having the actual label/value for the corresponding feature vector in the training data table
    +	Argument 4: Test data table having features as a vector column. An id column for the feature vectors is mandatory
    +	Argument 5: Name of the column having feature vectors in the test data table
    +	Argument 6: Name of the column having feature vector ids in the test data table
    +	Argument 7: Name of the output table
    +	Argument 8: c for classification task, r for regression task
    +	Argument 9: Value of k. Defaults to 1 if not provided';
    +    END IF;
    +END;
    +$$ LANGUAGE plpgsql VOLATILE
    +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
    +
    +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
    +) RETURNS VOID AS $$
    +BEGIN    
    +    EXECUTE $sql$ select * from MADLIB_SCHEMA.knn('help') $sql$;
    +END;
    +$$ LANGUAGE plpgsql VOLATILE
    +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
    +
    +
    +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
    +    point_source VARCHAR,
    +    point_column_name VARCHAR,
    +    label_column_name VARCHAR,
    +    test_source VARCHAR,
    +    test_column_name VARCHAR,
    +    id_column_name VARCHAR,
    +    output_table VARCHAR,
    +    operation VARCHAR,
    +    k INTEGER
    +) RETURNS VARCHAR AS $$
    +DECLARE
    +    class_test_source REGCLASS;
    +    class_point_source REGCLASS;
    +    l FLOAT;
    +    outputTableFlag INTEGER;
    +    id INTEGER;
    +    vector DOUBLE PRECISION[];
    +    cur_pid integer;
    +    oldClientMinMessages VARCHAR;
    +    returnstring VARCHAR;
    +BEGIN
    +    oldClientMinMessages :=
    +        (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
    +    EXECUTE 'SET client_min_messages TO warning';
    +    PERFORM MADLIB_SCHEMA.__knn_validate_src(test_source);
    +    PERFORM MADLIB_SCHEMA.__knn_validate_src(point_source);
    +    class_test_source := test_source;
    +    class_point_source := point_source;
    +    --checks
    +    IF (k <= 0) THEN
    +        RAISE EXCEPTION 'KNN error: Number of neighbors k must be a positive integer.';
    +    END IF;
    +    IF (operation != 'c' AND operation != 'r') THEN
    +        RAISE EXCEPTION 'KNN error: operation must be r for regression or c for classification.';
    +    END IF;
    +    PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
    +    
    +    EXECUTE 
    +	$sql$ 
    +	SELECT count(*) FROM information_schema.tables WHERE table_name = '$sql$ || output_table || $sql$'$sql$ into outputTableFlag; 
    +    IF (outputTableFlag != 0) THEN
    +	RAISE EXCEPTION 'KNN error: table % already exists. Drop it or use a different name.', output_table;
    +    END IF;
    +    --EXECUTE format('DROP TABLE IF EXISTS %I',output_table);
    +    --EXECUTE format('CREATE TABLE %I(%I integer, %I DOUBLE PRECISION[], predlabel float)',output_table,id_column_name,test_column_name);
    +	
    +    EXECUTE
    +        $sql$
    +	DROP TABLE IF EXISTS pg_temp.madlib_knn_interm;
    +	CREATE TABLE pg_temp.madlib_knn_interm AS    
    +	select * from ( select row_number() over (partition by test_id order by dist) as r, x.* from (select test. $sql$ || id_column_name || $sql$ as test_id, madlib.squared_dist_norm2(train.$sql$ || point_column_name || $sql$,test.$sql$ || test_column_name || $sql$) as dist, $sql$ || label_column_name || $sql$ from $sql$ || textin(regclassout(point_source)) || $sql$ as train, $sql$ || textin(regclassout(test_source)) || $sql$ as test)x)y where y.r <= $sql$ || k;
    +	IF (operation = 'c') THEN
    --- End diff --
    
    I was playing around with knn a bit, and observed that using `x` and `y` in `|| $sql$ as test)x)y where y.r` could cause some trouble if the column names corresponding to `label_column_name`, `point_column_name` or `test_column_name` are also named `x` or `y` (or even if there is another column called `y` in `test_source`). The query would fail to execute due to ambiguity.
    
    In general, when you are using temporary names in a SQL query, it's always best to use a unique string instead of something like `x` or `y`. MADlib has a Python function named `unique_string()` in module `utilities.utilities` which will generate a unique string for you. I would suggest applying this wherever you use a temp name/placeholder in the SQL queries in knn.
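    To illustrate the idea, here is a rough sketch of how collision-proof aliases could replace the hard-coded `x` and `y`. Note this is not MADlib's actual `unique_string()` implementation — the uuid-based helper below is a stand-in, and the real function's prefix and output format may differ:

    ```python
    # Sketch: generate unique subquery aliases instead of fixed `x`/`y`.
    # The helper below imitates MADlib's utilities.utilities.unique_string()
    # with a uuid; it is an illustration, not the MADlib implementation.
    import uuid

    def unique_string(prefix='__madlib_temp_'):
        # hex digest of a random UUID -> only characters valid in a SQL identifier
        return prefix + uuid.uuid4().hex + '__'

    inner = unique_string()   # replaces the hard-coded alias `x`
    outer = unique_string()   # replaces the hard-coded alias `y`

    # The "..." inside the string stands for the inner SELECT from the diff.
    query = (
        "SELECT * FROM ("
        "SELECT row_number() OVER (PARTITION BY test_id ORDER BY dist) AS r, "
        "{inner}.* FROM (SELECT ...) {inner}"
        ") {outer} WHERE {outer}.r <= {k}"
    ).format(inner=inner, outer=outer, k=3)
    ```

    Since the generated aliases cannot coincide with any user-supplied column name, the ambiguity goes away regardless of what the training or test tables contain.
    
    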

