
Posted to dev@madlib.apache.org by "Kazmi,Auon H" <ak...@ufl.edu> on 2016/11/15 06:41:34 UTC

Adding KNN to madlib

Hi,

I am a first-year Computer Science graduate student at the University of Florida working on implementing KNN in MADlib. I have a first version ready, but I don't know how to proceed with testing it and adding it to the MADlib platform. I am also not clear on which standards the final implementation has to follow. My current version takes the name of a table and the name of the column holding the vectors among which the neighbours are searched. A second input table holds the vector whose k nearest neighbours need to be found. It assumes the Euclidean distance metric. It would really help if somebody could share ideas on what can be added to this functionality.





Regards,

Auon Haidar Kazmi

Re: Adding KNN to madlib

Posted by "Kazmi,Auon H" <ak...@ufl.edu>.

Sure NJ.

Thanks!




Auon

________________________________
From: Nandish Jayaram <nj...@pivotal.io>
Sent: Tuesday, December 13, 2016 12:22:50 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi Auon,

I do see the pull request, thank you! Folks in the community should also be
able to comment on it! :)
I too will have a look at it sometime soon and comment on the PR if need be.

NJ

On Mon, Dec 12, 2016 at 6:30 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi NJ,
>
> I have done that. Please check if it is rightly done.
>
>
>
>
> Thanks,
>
> Auon
>
> ________________________________
> From: Nandish Jayaram <nj...@pivotal.io>
> Sent: Monday, December 12, 2016 6:28:38 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> Please push all the changes you have made in your branch for KNN to your
> incubator-madlib repo, and open a PR on that push.
>
> NJ
>
> On Mon, Dec 12, 2016 at 1:58 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> > Hi NJ,
> >
> > Where should I git push my code? I am doing that in my github id. Also,
> > should I push just KNN folder or the whole src/ folder of madlib?
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Kazmi,Auon H <ak...@ufl.edu>
> > Sent: Monday, December 5, 2016 8:32:38 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi NJ,
> >
> > Thanks!
> >
> > I will do that.
> >
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Nandish Jayaram <nj...@pivotal.io>
> > Sent: Sunday, December 4, 2016 1:39:53 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi Auon,
> >
> > That's great!
> > I think the best way to share your code with the community is by opening a
> > pull request on GitHub. Please do that and a lot of folks will be able to
> > comment and give suggestions to you.
> >
> > NJ
> >
> > On Sat, Dec 3, 2016 at 2:13 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> >
> > > Hi NJ,
> > >
> > > I got the solution to my problem.
> > >
> > > So, I might be done with my first version of interface of KNN for
> > > classification as suggested by you, by Monday or so. I will generalise it
> > > for regression and then please let me know how to share it with you guys.
> > > After that, I can start making required changes as and when needed.
> > >
> > >
> > >
> > > regards,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Kazmi,Auon H <ak...@ufl.edu>
> > > Sent: Thursday, December 1, 2016 2:59:21 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi NJ,
> > >
> > > No, this is just an example I gave. In a Postgres function, I want to
> > > iterate over the rows of a table whose name is given as a VARCHAR argument.
> > >
> > > FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
> > >
> > > will do that. Now, r is a record, i.e. a row of table 'point_source'. I
> > > want to store a particular column of that row r in a variable, and the
> > > name of this column is also passed to the function as a VARCHAR argument.
> > > I am not able to figure out how to access this particular column from the
> > > current row 'r'.
> > >
> > >
> > > Basically, I am trying to iterate over my testing data one row at a time
> > > and pass its vector column to a function that finds its label.
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon
> > >
> > >
> > > ________________________________
> > > From: Nandish Jayaram <nj...@pivotal.io>
> > > Sent: Thursday, December 1, 2016 2:51:47 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi Auon,
> > >
> > > My apologies for the late reply.
> > > Can you please give me more information regarding the design approach you
> > > have taken? Information like what files you have created so far would be
> > > helpful. I am not sure I understand your approach correctly yet. Is the
> > > above snippet of code the only code you have, or do you have some other
> > > files too?
> > >
> > > NJ
> > >
> > > On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > >
> > > > Hi NJ,
> > > >
> > > > I got stuck at a place. Need a little help.
> > > >
> > > > Suppose I have a function that receives table_name and column_name as
> > > > varchar.
> > > >
> > > > Now I would like to iterate through each row of this table, while
> > > > accessing the value of this column. I am doing something like this:
> > > >
> > > >
> > > > CREATE OR REPLACE FUNCTION Foo(
> > > >     table_name VARCHAR,
> > > >     column_name VARCHAR
> > > > ) RETURNS VOID AS
> > > > $BODY$
> > > > DECLARE
> > > >     r record;
> > > >     b integer;
> > > > BEGIN
> > > >     -- Iterate over every row of the table named by table_name.
> > > >     FOR r IN EXECUTE format('SELECT * FROM %I', table_name)
> > > >     LOOP
> > > >         -- Problem line: this looks for a field literally called
> > > >         -- "column_name", not the column named by the argument.
> > > >         b := r.column_name;
> > > >     END LOOP;
> > > > END;
> > > > $BODY$ LANGUAGE plpgsql;
> > > >
> > > > So, everything works except that column_name is a varchar, so
> > > > r.column_name won't give me the corresponding column's value in the
> > > > extracted row r. Suppose the column is 'pid' in the given table: then
> > > > b := r.pid will give the right result, but I want to get the same effect
> > > > from
> > > > b := r.column_name;
> > > >
> > > >
> > > > Could you please help.
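
One way around the question above, offered here as an assumption rather than as the solution the author later found, is to put the column name into the dynamic query itself and give it a fixed alias, for example format('SELECT %I AS val FROM %I', column_name, table_name), so that the loop body can read r.val. The Python sketch below only builds the equivalent query string; quote_ident and build_column_query are hypothetical helpers written for this illustration, not MADlib or PostgreSQL functions.

```python
def quote_ident(name):
    """Quote an SQL identifier: wrap it in double quotes and double any
    embedded double quote. Hypothetical helper for this sketch."""
    return '"' + name.replace('"', '""') + '"'


def build_column_query(table_name, column_name):
    """Return a query that selects one named column under the fixed alias
    'val', so the caller never needs to know the original column name."""
    return "SELECT {col} AS val FROM {tbl}".format(
        col=quote_ident(column_name),
        tbl=quote_ident(table_name),
    )


# The same table and column names used in the question above.
print(build_column_query("point_source", "pid"))
# SELECT "pid" AS val FROM "point_source"
```

Because the alias is fixed, the code that consumes the rows stays the same whichever column is requested. Note that this helper always quotes, so the identifier becomes case-sensitive, and a schema-qualified name would need each part quoted separately.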
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon
> > > >
> > > > ________________________________
> > > > From: Kazmi,Auon H <ak...@ufl.edu>
> > > > Sent: Friday, November 25, 2016 3:23:46 PM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Thanks NJ,
> > > >
> > > > I will move forward in the suggested way.
> > > >
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon
> > > >
> > > > ________________________________
> > > > From: Nandish Jayaram <nj...@pivotal.io>
> > > > Sent: Wednesday, November 23, 2016 12:20:35 PM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Hey Auon,
> > > >
> > > > Starting with only classification for now sounds like a good idea!
> > > > Yes, the output should be just the predicted label for each row.
> > > > If the table you want to run the classification task on is like the
> > > > following:
> > > >
> > > >  id |  x |    y
> > > > ----+----+------
> > > >   1 | 10 | 10.5
> > > >   2 | 30 | 31.5
> > > >   3 | 20 | 22.5
> > > >
> > > > then the output table could be something like the following:
> > > >
> > > >  id |  x |    y | predicted_label
> > > > ----+----+------+-----------------
> > > >   1 | 10 | 10.5 | true
> > > >   2 | 30 | 31.5 | false
> > > >   3 | 20 | 22.5 | true
> > > >
> > > > You are basically adding a new column called "predicted_label" to the
> > > > input table, and assigning the label for each row based on k-NN.
> > > >
> > > > We can certainly make it better by modifying the kNN function interface.
> > > > But let's just keep it simple for now and work on that later.
> > > >
> > > > NJ
> > > >
> > > > On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > > >
> > > > >
> > > > > Hi NJ,
> > > > >
> > > > > I have implemented a first version of interface as suggested by you.
> > > > > Right now, I am just looking at classification task. I will generalize it
> > > > > to work for regression task as well. I have a question regarding output
> > > > > of the function. Should it just be the predicted label (or prediction
> > > > > value in case of regression)? Can you give an example of output?
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Auon Haidar
> > > > >
> > > > > ________________________________
> > > > > From: Kazmi,Auon H <ak...@ufl.edu>
> > > > > Sent: Friday, November 18, 2016 3:16:00 AM
> > > > > To: dev@madlib.incubator.apache.org
> > > > > Subject: Re: Adding KNN to madlib
> > > > >
> > > > > Hi NJ,
> > > > >
> > > > > Thanks for your inputs!
> > > > >
> > > > > I will go through everyone of them and try to incorporate them.
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Auon Haidar
> > > > >
> > > > > ________________________________
> > > > > From: Nandish Jayaram <nj...@pivotal.io>
> > > > > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > > > > To: dev@madlib.incubator.apache.org
> > > > > Subject: Re: Adding KNN to madlib
> > > > >
> > > > > Hi Auon,
> > > > >
> > > > > > Defining the interface is a good start for k-NN. I have slightly modified
> > > > > > your interface to help it conform with other MADlib algorithms' interfaces.
> > > > > > Note that the output for each new data point is not its 'k' nearest
> > > > > > neighbors, but the result of a classification or regression task on the
> > > > > > data point based on its 'k' nearest neighbors. Every data point in the
> > > > > > training data will have an associated class label (or regression value) in
> > > > > > a different column. Normally, the column containing the data point itself
> > > > > > is called the independent variable, and the column containing the class
> > > > > > label is called the dependent variable. For classification, you take a
> > > > > > majority vote of the class labels of the 'k' nearest neighbors; for
> > > > > > regression, you average the dependent variable values of the 'k' nearest
> > > > > > neighbors. Here is a preliminary interface we could start with:
> > > > > >
> > > > > > knn(
> > > > > >     source_table,        -- TEXT, name of the table containing training data.
> > > > > >     new_data_table,      -- TEXT, name of the table containing new data on
> > > > > >                          -- which classification or regression has to be
> > > > > >                          -- performed. Classification or regression can be
> > > > > >                          -- performed based on the type of
> > > > > >                          -- "dependent_varname".
> > > > > >     output_table,        -- TEXT, name of the table where output predictions
> > > > > >                          -- are written. If this table is already present,
> > > > > >                          -- an error is returned.
> > > > > >     dependent_varname,   -- TEXT, name of the dependent variable column. If
> > > > > >                          -- this column is of type boolean/integer, we could
> > > > > >                          -- probably perform k-NN classification, and perform
> > > > > >                          -- k-NN regression if it is of type double.
> > > > > >     independent_varname, -- TEXT, column defining data points. Data points
> > > > > >                          -- can be of type SVEC or any type convertible to
> > > > > >                          -- SVEC, such as float[] or integer[].
> > > > > >     k,                   -- INTEGER, (optional, default value could be some
> > > > > >                          -- odd number, say 5) number of neighbors to consider.
> > > > > >     metric               -- TEXT, (optional, default value could be what you
> > > > > >                          -- are using now for distance) the distance metric
> > > > > >                          -- to use.
> > > > > > );
> > > > > >
> > > > > > For now you can just use the distance metric you had mentioned in an
> > > > > > earlier email. Note that the source_table and new_data_table are tables in
> > > > > > the database and not files.
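
The majority-vote and averaging rules described above can be sketched in a few lines of plain Python. This is only an in-memory illustration under stated assumptions: knn_predict, squared_distance and the task argument are hypothetical names, the training data is a list of (vector, dependent value) pairs instead of a database table, and nothing here reflects how the MADlib module is actually implemented.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(training, point, k=5, task="classification"):
    """Predict a value for `point` from its k nearest training rows.

    `training` is a list of (vector, dependent_value) pairs.
    """
    # Rank the training rows by distance to the new point; keep the k closest.
    neighbours = sorted(
        training, key=lambda row: squared_distance(row[0], point)
    )[:k]
    values = [value for _, value in neighbours]
    if task == "classification":
        # Majority vote over the neighbours' class labels.
        return Counter(values).most_common(1)[0][0]
    # Regression: average of the neighbours' dependent values.
    return sum(values) / len(values)


labelled = [
    ([1.0, 1.0], True),
    ([1.5, 2.0], True),
    ([2.0, 1.0], True),
    ([8.0, 8.0], False),
    ([9.0, 8.5], False),
]
print(knn_predict(labelled, [1.2, 1.1], k=3))  # True

numeric = [([0.0], 1.0), ([1.0], 2.0), ([2.0], 3.0), ([10.0], 100.0)]
print(knn_predict(numeric, [1.0], k=3, task="regression"))  # 2.0
```

An odd k, as suggested for the default, avoids ties in two-class voting. With an even k or more than two classes, ties are possible and this sketch simply returns the tied label that occurs first among the sorted neighbours.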
> > > > >
> > > > > > Some pointers to help you start off with the implementation:
> > > > > > - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> > > > > > is a very useful resource with a great hello-world example. It gives you
> > > > > > details about how to add a new module (k-NN would be a new module) to
> > > > > > MADlib.
> > > > > > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> > > > > > Defined Aggregates) in your implementation. This will require you to add a
> > > > > > C++ layer too, along with the SQL and python layers. Feel free to ask
> > > > > > specific questions about this after you have tried out the hello world
> > > > > > example.
> > > > > > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you more
> > > > > > information regarding the C++ abstraction layer in MADlib.
> > > > >
> > > > > Feel free to shout out for help if you are stuck! Cheers. :)
> > > > >
> > > > > NJ
> > > > >
> > > > > > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > > > > >
> > > > > > > Hi Frank and NJ,
> > > > > > >
> > > > > > > Thanks for your comments. I will go through the suggestions provided by
> > > > > > > NJ.
> > > > > > >
> > > > > > > Current interface of KNN is as follows:
> > > > > > >
> > > > > > > 1) Input:
> > > > > > >
> > > > > > >        - Name of table having all the data points in n-dimensional vector
> > > > > > >          form (Double Precision[])
> > > > > > >
> > > > > > >        - Column-name of these data points
> > > > > > >
> > > > > > >        - Name of file having that n-dim vector (v, say) whose k-nearest
> > > > > > >          neighbours need to be found from first table (Double Precision[])
> > > > > > >
> > > > > > >        - Column name having this vector
> > > > > > >
> > > > > > >        - value of 'k'
> > > > > > >
> > > > > > >
> > > > > > > It returns 'k' nearest neighbours of vector v from first table having
> > > > > > > data points.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > For now, I am using madlib's squared norm function to calculate distance
> > > > > > > between any two vectors. I will try to generalise that.
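
A squared norm is enough for finding nearest neighbours under the Euclidean metric: the square root is monotonic, so ranking by squared distance gives the same order as ranking by the distance itself. The short Python check below illustrates this on made-up points; squared_dist and euclidean_dist are hypothetical helpers for this sketch, not MADlib functions.

```python
import math


def squared_dist(a, b):
    """Squared Euclidean distance (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean_dist(a, b):
    """Euclidean distance, i.e. the square root of the squared distance."""
    return math.sqrt(squared_dist(a, b))


points = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [6.0, 8.0]]
query = [1.0, 2.0]

# Indices of the points, ordered from nearest to farthest under each measure.
by_squared = sorted(
    range(len(points)), key=lambda i: squared_dist(points[i], query)
)
by_euclidean = sorted(
    range(len(points)), key=lambda i: euclidean_dist(points[i], query)
)

print(by_squared)    # [2, 0, 1, 3]
print(by_euclidean)  # [2, 0, 1, 3]
```

Skipping the square root saves a little work per comparison. The equivalence holds only for the neighbour ranking, so any step that needs the actual distance value, such as distance-weighted voting, would still need the root.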
> > > > > >
> > > > > >
> > > > > > Please suggest any other improvements.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Auon Haidar
> > > > > >
> > > > > > ________________________________
> > > > > > From: Frank McQuillan <fm...@pivotal.io>
> > > > > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > > > > To: dev@madlib.incubator.apache.org
> > > > > > Subject: Re: Adding KNN to madlib
> > > > > >
> > > > > > Auon,
> > > > > >
> > > > > > > Thanks for working on kNN for MADlib. Can you expand a little bit on your
> > > > > > > note, and post the interface that you are thinking about and a description
> > > > > > > of the arguments? Then people can comment on that.
> > > > > >
> > > > > > Thanks,
> > > > > > Frank
> > > > > >
> > > > > > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <njayaram@pivotal.io> wrote:
> > > > > > >
> > > > > > > > Hi Auon,
> > > > > > > >
> > > > > > > > Great going with your first version of the k-NN implementation.
> > > > > > > > Some useful links for coding guidelines are at (see Developer
> > > > > > > > Documentation):
> > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > > > > > MADlib has something called install-checks for basic testing. You can
> > > > > > > > look at any existing module for an example. For instance, check out the
> > > > > > > > install-check code for k-means at:
> > > > > > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > > > > > >
> > > > > > > > I am sure others will pitch in to help you more with your other
> > > > > > > > questions, but these are some starters you can consider! Good luck!
> > > > > > >
> > > > > > > NJ
> > > > > > >
> > > > > > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <akazmi@ufl.edu> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I am a first-year Computer Science graduate student at the University
> > > > > > > > > of Florida working on implementing KNN in MADlib. I have a first
> > > > > > > > > version ready, but I don't know how to proceed with testing it and
> > > > > > > > > adding it to the MADlib platform. I am also not clear on which
> > > > > > > > > standards the final implementation has to follow. My current version
> > > > > > > > > takes the name of a table and the name of the column holding the
> > > > > > > > > vectors among which the neighbours are searched. A second input table
> > > > > > > > > holds the vector whose k nearest neighbours need to be found. It
> > > > > > > > > assumes the Euclidean distance metric. It would really help if
> > > > > > > > > somebody could share ideas on what can be added to this
> > > > > > > > > functionality.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Auon Haidar Kazmi
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Adding KNN to madlib

Posted by Nandish Jayaram <nj...@pivotal.io>.

Hi Auon,

I do see the pull request, thank you! Folks in the community should also be
able to comment on it! :)
I too will have a look at it sometime soon and comment on the PR if need be.

NJ

On Mon, Dec 12, 2016 at 6:30 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi NJ,
>
> I have done that. Please check if it is rightly done.
>
>
>
>
> Thanks,
>
> Auon
>
> ________________________________
> From: Nandish Jayaram <nj...@pivotal.io>
> Sent: Monday, December 12, 2016 6:28:38 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> Please push all the changes you have made in your branch for KNN to your
> incubator-madlib repo, and open a PR on that push.
>
> NJ
>
> On Mon, Dec 12, 2016 at 1:58 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> > Hi NJ,
> >
> > Where should I git push my code? I am doing that in my github id. Also,
> > should I push just KNN folder or the whole src/ folder of madlib?
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Kazmi,Auon H <ak...@ufl.edu>
> > Sent: Monday, December 5, 2016 8:32:38 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi NJ,
> >
> > Thanks!
> >
> > I will do that.
> >
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Nandish Jayaram <nj...@pivotal.io>
> > Sent: Sunday, December 4, 2016 1:39:53 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi Auon,
> >
> > That's great!
> > I think the best way to share your code with the community is by opening
> a
> > pull request on github. Please do that and a lot of folks will be able to
> > comment and give suggestions to you.
> >
> > NJ
> >
> > On Sat, Dec 3, 2016 at 2:13 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> >
> > > Hi NJ,
> > >
> > > I got the solution to my problem.
> > >
> > > So, I might be done with my first version of interface of KNN for
> > > classification as suggested by you, by Monday or so. I will generalise
> it
> > > for regression and then please let me know how to share it with you
> guys.
> > > After that, I can start making required changes as and when needed.
> > >
> > >
> > >
> > > regards,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Kazmi,Auon H <ak...@ufl.edu>
> > > Sent: Thursday, December 1, 2016 2:59:21 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi NJ,
> > >
> > > No, this is just an example I gave. So, I want in a postgres function
> to
> > > iterate over the rows of a table given as a VARCHAR argument.
> > >
> > > FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
> > >
> > > will do that. Now, r is a record, i.e. a row of table 'point_source'. I
> > > want to store a particular column of that row r in a variable. Now,
> this
> > > column name is also passed as VARCHAR argument to function. I am not
> able
> > > to figure out the way to access this particular column from the current
> > row
> > > 'r'.
> > >
> > >
> > > Basically, I am trying to iterate over my testing data one by one and
> > pass
> > > its vector column to a function that finds its label.
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon
> > >
> > >
> > > ________________________________
> > > From: Nandish Jayaram <nj...@pivotal.io>
> > > Sent: Thursday, December 1, 2016 2:51:47 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi Auon,
> > >
> > > My apologies for the late reply.
> > > Can you please give me more information regarding the design approach
> you
> > > have taken. Information like
> > > what files you have created so far would be helpful. I am not sure I
> > > understand your approach correctly
> > > yet. Is the above snippet of code the only code you have, or do you
> have
> > > some other files too?
> > >
> > > NJ
> > >
> > > On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > >
> > > > Hi NJ,
> > > >
> > > > I got stuck at a place. Need a little help.
> > > >
> > > > Suppose I have a function that receives table_name and column_name as
> > > > varchar.
> > > >
> > > > Now I would like to iterate through each rows of this table, while
> > > > accessing the value of this column. I am doing something like this:
> > > >
> > > >
> > > > CREATE OR REPLACE FUNCTION Foo(
> > > > table_name VARCHAR,
> > > > column_name VARCHAR
> > > > ) RETURNS VOID AS
> > > > $BODY$
> > > > DECLARE
> > > >     r record;
> > > >     b integer;
> > > > BEGIN
> > > >
> > > >     FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
> > > >     LOOP
> > > >
> > > >         b := r.column_name;
> > > >
> > > >    END LOOP
> > > > END
> > > >
> > > > So, everything works except column_name is a varchar. So,
> r.column_name
> > > > won't give me the correponding column's value in extracted row r. So,
> > > > suppose it is 'pid' in the given table, then b:= r.pid will give the
> > > right
> > > > result, but I want to get this effective statement from
> > > > b := r.column_name;
> > > >
> > > >
> > > > Could you please help.
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon
> > > >
> > > > ________________________________
> > > > From: Kazmi,Auon H <ak...@ufl.edu>
> > > > Sent: Friday, November 25, 2016 3:23:46 PM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Thanks NJ,
> > > >
> > > > I will move forward in the suggested way.
> > > >
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon
> > > >
> > > > ________________________________
> > > > From: Nandish Jayaram <nj...@pivotal.io>
> > > > Sent: Wednesday, November 23, 2016 12:20:35 PM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Hey Auon,
> > > >
> > > > Starting with only classification for now sounds like a good idea!
> > > > Yes, the output should be just the predicted label for each row.
> > > > If the table you want to run the classification task on is like the
> > > > following:
> > > > *id |   x   |  y*
> > > > 1    10     10.5
> > > > 2    30     31.5
> > > > 3    20     22.5
> > > >
> > > > then the output table could be something like the following:
> > > > *id |   x   |    y     |  predicted_label*
> > > > 1    10     10.5          true
> > > > 2    30     31.5          false
> > > > 3    20     22.5          true
> > > >
> > > > You are basically adding a new column to the input table called
> > > > "predicted_label", and assign the label for each row based on the
> k-NN.
> > > >
> > > > We can certainly make it better, by modifying the kNN function
> > interface.
> > > > But let's just keep it simple for now and work on that later.
> > > >
> > > > NJ
> > > >
> > > > On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <ak...@ufl.edu>
> wrote:
> > > >
> > > > >
> > > > > Hi NJ,
> > > > >
> > > > > I have implemented a first version of interface as suggested by
> you.
> > > > Right
> > > > > now, I am just looking at classification task. I will generalize it
> > to
> > > > work
> > > > > for regression task as well. I have a question regarding output of
> > the
> > > > > function. Should it just be the predicted label (or prediction
> value
> > in
> > > > > case of regression)? Can you give an example of output?
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Auon Haidar
> > > > >
> > > > > ________________________________
> > > > > From: Kazmi,Auon H <ak...@ufl.edu>
> > > > > Sent: Friday, November 18, 2016 3:16:00 AM
> > > > > To: dev@madlib.incubator.apache.org
> > > > > Subject: Re: Adding KNN to madlib
> > > > >
> > > > > Hi NJ,
> > > > >
> > > > > Thanks for your inputs!
> > > > >
> > > > > I will go through everyone of them and try to incorporate them.
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Auon Haidar
> > > > >
> > > > > ________________________________
> > > > > From: Nandish Jayaram <nj...@pivotal.io>
> > > > > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > > > > To: dev@madlib.incubator.apache.org
> > > > > Subject: Re: Adding KNN to madlib
> > > > >
> > > > > Hi Auon,
> > > > >
> > > > > Defining the interface is a good start for k-NN. I have slightly
> > > modified
> > > > > your interface to help it conform with other MADlib algorithms'
> > > > interfaces.
> > > > > Note that the output for each new data point is not the 'k' nearest
> > > > > neighbors, but either a classification or regression task on the
> data
> > > > point
> > > > > based on its 'k' nearest neighbors. Every data point in the
> training
> > > data
> > > > > will have an associated class label (regression value) in a
> different
> > > > > column. Normally, the column containing the data point itself is
> > called
> > > > the
> > > > > independent variable, and the column containing the class label is
> > > called
> > > > > the dependent variable. If it is classification, you take a
> majority
> > > vote
> > > > > of the class labels of the 'k' nearest neighbors, and if it is
> > > > regression,
> > > > > you average the dependent variable values of the 'k' nearest
> > neighbors.
> > > > > Here is a preliminary interface we could start with:
> > > > >
> > > > > *knn*(
> > > > > source_table, -- *TEXT, name of table containing training data.*
> > > > > new_data_table, -- *TEXT, name of table containing new data on
> which
> > > > > classification or regression has to be performed. Classification or
> > > > > regression can be performed based on the type of
> > "dependent_varname".*
> > > > > output_table, -- *TEXT, name of the table where output predictors
> are
> > > > > written. If this table is already present, an error is returned.*
> > > > > dependent_varname, -- *TEXT, name of the independent variable
> column.
> > > If
> > > > > this column is of type boolean/integer, we could probably perform
> > k-NN
> > > > > classification, and perform k-NN regression if this is of type
> > double.*
> > > > > independent_varname, -- *TEXT, column defining data points. Data
> > points
> > > > can
> > > > > be of type SVEC or any type convertible to SVEC such as float[] or
> > > > > integer[].*
> > > > > k, --* INTEGER, (optional, default value could be some odd number,
> > say
> > > 5)
> > > > > number of neighbors to consider*
> > > > > metric, -- *TEXT, (optional, default value could be what you are
> > using
> > > > now
> > > > > for distance) the distance metric to use.*
> > > > > );
> > > > >
> > > > > For now you can just use the distance metric you had mentioned in
> an
> > > > > earlier email. Note that the source_table and new_data_table are
> > tables
> > > > in
> > > > > the database and not files.
> > > > >
> > > > > Some pointers to help you start off with the implementation:
> > > > > -
> > > > > https://cwiki.apache.org/confluence/display/MADLIB/
> > > > Quick+Start+Guide+for+
> > > > > Developers
> > > > > is a very useful resource with a great hello-world example. It
> gives
> > > you
> > > > > details about how to add a new module (k-NN would be a new module)
> to
> > > > > MADlib.
> > > > > - k-NN is a great candidate for parallelizing. Do try to use UDA
> > (User
> > > > > Defined Aggregates) in your implementation. This will require you
> to
> > > add
> > > > a
> > > > > C++ layer too, along with the SQL and python layers. Feel free to
> ask
> > > > > specific questions about this after you have tried out the hello
> > world
> > > > > example.
> > > > > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives
> > you
> > > > > more
> > > > > Design Document - Apache MADlib<http://madlib.
> > > > incubator.apache.org/design.
> > > > > pdf>
> > > > > madlib.incubator.apache.org
> > > > > 1 AbstractionLayers Author FlorianSchoppmann Historyv0.6
> > > > > ReplacedUML?gure[RahulIyer] v0.5 Initialrevisionofdesigndocument
> > v0.4
> > > > > Supportforfunctionpointersandsparse ...
> > > > >
> > > > >
> > > > >
> > > > > information regarding the C++ abstraction layer in MADlib.
> > > > >
> > > > > Feel free to shout out for help if you are stuck! Cheers. :)
> > > > >
> > > > > NJ
> > > > >
> > > > > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu>
> > wrote:
> > > > >
> > > > > > Hi Frank and NJ,
> > > > > >
> > > > > > Thanks for your comments. I will go through the suggestions
> > provided
> > > by
> > > > > NJ.
> > > > > >
> > > > > > Current interface of KNN is as follows:
> > > > > >
> > > > > > 1) Input:
> > > > > >
> > > > > >        - Name of table having all the data points in
> n-dimensional
> > > > vector
> > > > > > form (Double                              Precision[ ])
> > > > > >
> > > > > >        - Column-name of these data points
> > > > > >
> > > > > >        - Name of file having that n-dim vector (v, say) whose
> > > k-nearest
> > > > > > neighbours need to be               found from first table
> (Double
> > > > > > Precision[ ])
> > > > > >
> > > > > >        - Column name having this vector
> > > > > >
> > > > > >        - value of 'k'
> > > > > >
> > > > > >
> > > > > > It returns 'k' nearest neighbours of vector v from first table
> > having
> > > > > data
> > > > > > points.
> > > > > >
> > > > > >
> > > > > >
> > > > > > For now, I am using madlib's squared norm function to calculate
> > > > distance
> > > > > > between any two vectors. I will try to generalise that.
> > > > > >
> > > > > >
> > > > > > Please suggest any other improvements.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Auon Haidar
> > > > > >
> > > > > > ________________________________
> > > > > > From: Frank McQuillan <fm...@pivotal.io>
> > > > > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > > > > To: dev@madlib.incubator.apache.org
> > > > > > Subject: Re: Adding KNN to madlib
> > > > > >
> > > > > > Auon,
> > > > > >
> > > > > > > Thanks for working on kNN for MADlib. Can you expand a little bit on your
> > > > > > > note, and post the interface that you are thinking about and description of
> > > > > > > the arguments? Then people can comment on that.
> > > > > >
> > > > > > Thanks,
> > > > > > Frank
> > > > > >
> > > > > > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io> wrote:
> > > > > >
> > > > > > > Hi Auon,
> > > > > > >
> > > > > > > Great going with your first version of k-NN implementation.
> > > > > > > > Some useful links for coding guidelines are at (see Developer
> > > > > > > > Documentation):
> > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > > > > > MADlib has something called install-checks for basic testing. You can
> > > > > > > > look at any existing module for an example of the same. For instance,
> > > > > > > > check out the install check code for k-means at:
> > > > > > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > > > > > >
> > > > > > > > I am sure others will pitch in to help you more with your other questions,
> > > > > > > > but these are some starters you can consider! Good luck!
> > > > > > >
> > > > > > > NJ
> > > > > > >
> > > > > > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > > I am a first year Computer Science graduate student at University of
> > > > > > > > > Florida working on implementing KNN in Madlib. I am ready with a first
> > > > > > > > > version of it but I don't know how to proceed with testing and adding it to
> > > > > > > > > Madlib platform. Also, I am not clear on what standards do I have to choose
> > > > > > > > > in the final implementation. My current version asks for the table name and
> > > > > > > > > column name having vectors in which I have to find the neighbours. The
> > > > > > > > > other table given as input holds the vector whose K-NN needs to be found.
> > > > > > > > > It is assuming euclidean distance metric for distance calculation. It would
> > > > > > > > > really help if somebody can share ideas on what can be added to this
> > > > > > > > > functionality.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Auon Haidar Kazmi
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
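
As an illustration of the first-version behaviour described in the quoted messages above (given a set of n-dimensional points and a query vector v, return the k points closest to v under squared Euclidean distance), here is a minimal Python sketch. It is not MADlib code and not Auon's implementation: the names `squared_dist` and `k_nearest`, and the in-memory list standing in for the table of data points, are assumptions made purely for illustration.

```python
import heapq
from typing import List, Sequence, Tuple


def squared_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(points: List[Sequence[float]],
              v: Sequence[float],
              k: int) -> List[Tuple[float, Sequence[float]]]:
    """Return the k (distance, point) pairs closest to v, nearest first."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Hypothetical stand-in for scanning the table of data points.
    scored = [(squared_dist(p, v), p) for p in points]
    # nsmallest avoids fully sorting the list when k is small.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    table = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    for dist, point in k_nearest(table, [0.9, 0.9], k=2):
        print(dist, point)
```

Because squaring preserves the ordering of non-negative distances, ranking by the squared norm selects the same neighbours as ranking by the Euclidean distance itself, so no square root is needed.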

Re: Adding KNN to madlib

Posted by "Kazmi,Auon H" <ak...@ufl.edu>.

Hi NJ,

I have done that. Please check whether it is done correctly.




Thanks,

Auon

________________________________
From: Nandish Jayaram <nj...@pivotal.io>
Sent: Monday, December 12, 2016 6:28:38 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi Auon,

Please push all the changes you have made in your branch for KNN to your
incubator-madlib repo, and open a PR on that push.

NJ


Re: Adding KNN to madlib

Posted by Nandish Jayaram <nj...@pivotal.io>.

Hi Auon,

Please push all the changes you have made in your branch for KNN to your
incubator-madlib repo, and open a PR on that push.

NJ

On Mon, Dec 12, 2016 at 1:58 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi NJ,
>
> Where should I git push my code? I am doing that in my github id. Also,
> should I push just KNN folder or the whole src/ folder of madlib?
>
>
>
> Regards,
>
> Auon
>
> ________________________________
> From: Kazmi,Auon H <ak...@ufl.edu>
> Sent: Monday, December 5, 2016 8:32:38 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi NJ,
>
> Thanks!
>
> I will do that.
>
>
>
>
> Regards,
>
> Auon
>
> ________________________________
> From: Nandish Jayaram <nj...@pivotal.io>
> Sent: Sunday, December 4, 2016 1:39:53 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> That's great!
> I think the best way to share your code with the community is by opening a
> pull request on github. Please do that and a lot of folks will be able to
> comment and give suggestions to you.
>
> NJ
>
> On Sat, Dec 3, 2016 at 2:13 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> > Hi NJ,
> >
> > I got the solution to my problem.
> >
> > So, I might be done with my first version of interface of KNN for
> > classification as suggested by you, by Monday or so. I will generalise it
> > for regression and then please let me know how to share it with you guys.
> > After that, I can start making required changes as and when needed.
> >
> >
> >
> > regards,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Kazmi,Auon H <ak...@ufl.edu>
> > Sent: Thursday, December 1, 2016 2:59:21 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi NJ,
> >
> > No, this is just an example I gave. So, I want in a postgres function to
> > iterate over the rows of a table given as a VARCHAR argument.
> >
> > FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
> >
> > will do that. Now, r is a record, i.e. a row of table 'point_source'. I
> > want to store a particular column of that row r in a variable. Now, this
> > column name is also passed as VARCHAR argument to function. I am not able
> > to figure out the way to access this particular column from the current
> > row 'r'.
> >
> >
> > Basically, I am trying to iterate over my testing data one by one and pass
> > its vector column to a function that finds its label.
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> >
> > ________________________________
> > From: Nandish Jayaram <nj...@pivotal.io>
> > Sent: Thursday, December 1, 2016 2:51:47 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi Auon,
> >
> > My apologies for the late reply.
> > Can you please give me more information regarding the design approach you
> > have taken. Information like
> > what files you have created so far would be helpful. I am not sure I
> > understand your approach correctly
> > yet. Is the above snippet of code the only code you have, or do you have
> > some other files too?
> >
> > NJ
> >
> > On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> >
> > > Hi NJ,
> > >
> > > I got stuck at a place. Need a little help.
> > >
> > > Suppose I have a function that receives table_name and column_name as
> > > varchar.
> > >
> > > Now I would like to iterate through each row of this table, while
> > > accessing the value of this column. I am doing something like this:
> > >
> > >
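> > > CREATE OR REPLACE FUNCTION Foo(
> > > table_name VARCHAR,
> > > column_name VARCHAR
> > > ) RETURNS VOID AS
> > > $BODY$
> > > DECLARE
> > >     r record;
> > >     b integer;
> > > BEGIN
> > >
> > >     -- Loop over every row of the table whose name was passed in.
> > >     FOR r IN EXECUTE format('SELECT * FROM %I', table_name)
> > >     LOOP
> > >
> > >         -- Problem line: this looks for a field literally named
> > >         -- "column_name" in r, not the column named by the argument.
> > >         b := r.column_name;
> > >
> > >     END LOOP;
> > > END;
> > > $BODY$ LANGUAGE plpgsql;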
> > >
> > > So, everything works except column_name is a varchar. So, r.column_name
> > > won't give me the corresponding column's value in extracted row r. So,
> > > suppose it is 'pid' in the given table, then b := r.pid will give the right
> > > result, but I want to get this effective statement from
> > > b := r.column_name;
> > >
> > >
> > > Could you please help.
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon
> > >
> > > ________________________________
> > > From: Kazmi,Auon H <ak...@ufl.edu>
> > > Sent: Friday, November 25, 2016 3:23:46 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Thanks NJ,
> > >
> > > I will move forward in the suggested way.
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon
> > >
> > > ________________________________
> > > From: Nandish Jayaram <nj...@pivotal.io>
> > > Sent: Wednesday, November 23, 2016 12:20:35 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hey Auon,
> > >
> > > Starting with only classification for now sounds like a good idea!
> > > Yes, the output should be just the predicted label for each row.
> > > If the table you want to run the classification task on is like the
> > > following:
> > > id |  x |    y
> > >  1 | 10 | 10.5
> > >  2 | 30 | 31.5
> > >  3 | 20 | 22.5
> > >
> > > then the output table could be something like the following:
> > > id |  x |    y | predicted_label
> > >  1 | 10 | 10.5 | true
> > >  2 | 30 | 31.5 | false
> > >  3 | 20 | 22.5 | true
> > >
> > > You are basically adding a new column to the input table called
> > > "predicted_label", and assign the label for each row based on the k-NN.
> > >
> > > We can certainly make it better, by modifying the kNN function interface.
> > > But let's just keep it simple for now and work on that later.
> > >
> > > NJ
> > >
> > > On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > >
> > > >
> > > > Hi NJ,
> > > >
> > > > I have implemented a first version of interface as suggested by you.
> > > > Right now, I am just looking at classification task. I will generalize it
> > > > to work for regression task as well. I have a question regarding output of
> > > > the function. Should it just be the predicted label (or prediction value
> > > > in case of regression)? Can you give an example of output?
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon Haidar
> > > >
> > > > ________________________________
> > > > From: Kazmi,Auon H <ak...@ufl.edu>
> > > > Sent: Friday, November 18, 2016 3:16:00 AM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Hi NJ,
> > > >
> > > > Thanks for your inputs!
> > > >
> > > > I will go through every one of them and try to incorporate them.
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon Haidar
> > > >
> > > > ________________________________
> > > > From: Nandish Jayaram <nj...@pivotal.io>
> > > > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Hi Auon,
> > > >
> > > > Defining the interface is a good start for k-NN. I have slightly modified
> > > > your interface to help it conform with other MADlib algorithms' interfaces.
> > > > Note that the output for each new data point is not the 'k' nearest
> > > > neighbors, but either a classification or regression task on the data point
> > > > based on its 'k' nearest neighbors. Every data point in the training data
> > > > will have an associated class label (regression value) in a different
> > > > column. Normally, the column containing the data point itself is called the
> > > > independent variable, and the column containing the class label is called
> > > > the dependent variable. If it is classification, you take a majority vote
> > > > of the class labels of the 'k' nearest neighbors, and if it is regression,
> > > > you average the dependent variable values of the 'k' nearest neighbors.
> > > > Here is a preliminary interface we could start with:
> > > >
> > > > *knn*(
> > > > source_table, -- *TEXT, name of table containing training data.*
> > > > new_data_table, -- *TEXT, name of table containing new data on which
> > > > classification or regression has to be performed. Classification or
> > > > regression can be performed based on the type of "dependent_varname".*
> > > > output_table, -- *TEXT, name of the table where output predictors are
> > > > written. If this table is already present, an error is returned.*
> > > > dependent_varname, -- *TEXT, name of the dependent variable column. If
> > > > this column is of type boolean/integer, we could probably perform k-NN
> > > > classification, and perform k-NN regression if this is of type double.*
> > > > independent_varname, -- *TEXT, column defining data points. Data points can
> > > > be of type SVEC or any type convertible to SVEC such as float[] or
> > > > integer[].*
> > > > k, -- *INTEGER, (optional, default value could be some odd number, say 5)
> > > > number of neighbors to consider*
> > > > metric, -- *TEXT, (optional, default value could be what you are using now
> > > > for distance) the distance metric to use.*
> > > > );
> > > >
> > > > For now you can just use the distance metric you had mentioned in an
> > > > earlier email. Note that the source_table and new_data_table are tables in
> > > > the database and not files.
> > > >
> > > > Some pointers to help you start off with the implementation:
> > > > -
> > > > https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> > > > is a very useful resource with a great hello-world example. It gives you
> > > > details about how to add a new module (k-NN would be a new module) to
> > > > MADlib.
> > > > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> > > > Defined Aggregates) in your implementation. This will require you to add a
> > > > C++ layer too, along with the SQL and python layers. Feel free to ask
> > > > specific questions about this after you have tried out the hello world
> > > > example.
> > > > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you more
> > > > information regarding the C++ abstraction layer in MADlib.
> > > >
> > > > Feel free to shout out for help if you are stuck! Cheers. :)
> > > >
> > > > NJ
> > > >
> > > > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > > >
> > > > > Hi Frank and NJ,
> > > > >
> > > > > Thanks for your comments. I will go through the suggestions provided by
> > > > > NJ.
> > > > >
> > > > > Current interface of KNN is as follows:
> > > > >
> > > > > 1) Input:
> > > > >
> > > > >        - Name of table having all the data points in n-dimensional vector
> > > > > form (Double Precision[ ])
> > > > >
> > > > >        - Column-name of these data points
> > > > >
> > > > >        - Name of file having that n-dim vector (v, say) whose k-nearest
> > > > > neighbours need to be found from first table (Double Precision[ ])
> > > > >
> > > > >        - Column name having this vector
> > > > >
> > > > >        - value of 'k'
> > > > >
> > > > >
> > > > > It returns 'k' nearest neighbours of vector v from first table having data
> > > > > points.
> > > > >
> > > > >
> > > > >
> > > > > For now, I am using madlib's squared norm function to calculate distance
> > > > > between any two vectors. I will try to generalise that.
> > > > >
> > > > >
> > > > > Please suggest any other improvements.
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Auon Haidar
> > > > >
> > > > > ________________________________
> > > > > From: Frank McQuillan <fm...@pivotal.io>
> > > > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > > > To: dev@madlib.incubator.apache.org
> > > > > Subject: Re: Adding KNN to madlib
> > > > >
> > > > > Auon,
> > > > >
> > > > > Thanks for working on kNN for MADlib. Can you expand a little bit on your
> > > > > note, and post the interface that you are thinking about and description of
> > > > > the arguments? Then people can comment on that.
> > > > >
> > > > > Thanks,
> > > > > Frank
> > > > >
> > > > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io> wrote:
> > > > >
> > > > > > Hi Auon,
> > > > > >
> > > > > > Great going with your first version of k-NN implementation.
> > > > > > Some useful links for coding guidelines are at (see Developer
> > > > > > Documentation):
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > > > MADlib has something called install-checks for basic testing. You can
> > > > > > look at any existing module for an example of the same. For instance,
> > > > > > check out the install check code for k-means at:
> > > > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > > > >
> > > > > > I am sure others will pitch in to help you more with your other questions,
> > > > > > but these are some starters you can consider! Good luck!
> > > > > >
> > > > > > NJ
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > > I am a first year Computer Science graduate student at University of
> > > > > > > > Florida working on implementing KNN in Madlib. I am ready with a first
> > > > > > > > version of it but I don't know how to proceed with testing and adding it
> > > > > > > > to Madlib platform. Also, I am not clear on what standards do I have to
> > > > > > > > choose in the final implementation. My current version asks for the table
> > > > > > > > name and column name having vectors in which I have to find the
> > > > > > > > neighbours. The other table given as input holds the vector whose K-NN
> > > > > > > > needs to be found. It is assuming euclidean distance metric for distance
> > > > > > > > calculation. It would really help if somebody can share ideas on what can
> > > > > > > > be added to this functionality.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Auon Haidar Kazmi
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
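
The k-NN behaviour discussed in the quoted messages above (squared Euclidean distance, then a majority vote for classification or an average for regression over the 'k' nearest rows) can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the code from the pull request: the names squared_dist and knn_predict are made up for the example, the training data is a small in-memory list rather than a database table, and MADlib's own squared norm function is not called.

```python
from collections import Counter


def squared_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, point, k=5, task="classification"):
    """Predict a label (or value) for `point` from its k nearest rows.

    `train` is a list of (vector, dependent_value) pairs, standing in for
    the independent and dependent columns of a source table (hypothetical
    in-memory stand-in, not a MADlib structure).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort training rows by distance to the query point, keep the k closest.
    neighbours = sorted(train, key=lambda row: squared_dist(row[0], point))[:k]
    values = [dep for _, dep in neighbours]
    if task == "classification":
        # Majority vote; on a tie the label seen first (closest row) wins.
        return Counter(values).most_common(1)[0][0]
    if task == "regression":
        # Average of the dependent values of the k nearest rows.
        return sum(values) / len(values)
    raise ValueError("task must be 'classification' or 'regression'")


train = [([10.0, 10.5], True), ([20.0, 22.5], True), ([30.0, 31.5], False)]
print(knn_predict(train, [12.0, 12.0], k=1))
```

Sorting every training row for each query point is the simplest possible approach; a database implementation would more likely compute distances in SQL and keep only the top k per query row.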

Re: Adding KNN to madlib

Posted by "Kazmi,Auon H" <ak...@ufl.edu>.

Hi NJ,

Where should I git push my code? Right now I am pushing to my own GitHub account. Also, should I push just the KNN folder or the whole src/ folder of MADlib?



Regards,

Auon

________________________________
From: Kazmi,Auon H <ak...@ufl.edu>
Sent: Monday, December 5, 2016 8:32:38 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi NJ,

Thanks!

I will do that.




Regards,

Auon

________________________________
From: Nandish Jayaram <nj...@pivotal.io>
Sent: Sunday, December 4, 2016 1:39:53 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi Auon,

That's great!
I think the best way to share your code with the community is by opening a
pull request on github. Please do that and a lot of folks will be able to
comment and give suggestions to you.

NJ

On Sat, Dec 3, 2016 at 2:13 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi NJ,
>
> I got the solution to my problem.
>
> So, I might be done with my first version of interface of KNN for
> classification as suggested by you, by Monday or so. I will generalise it
> for regression and then please let me know how to share it with you guys.
> After that, I can start making required changes as and when needed.
>
>
>
> regards,
>
> Auon Haidar
>
> ________________________________
> From: Kazmi,Auon H <ak...@ufl.edu>
> Sent: Thursday, December 1, 2016 2:59:21 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi NJ,
>
> No, this is just an example I gave. So, I want in a postgres function to
> iterate over the rows of a table given as a VARCHAR argument.
>
> FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
>
> will do that. Now, r is a record, i.e. a row of table 'point_source'. I
> want to store a particular column of that row r in a variable. Now, this
> column name is also passed as VARCHAR argument to function. I am not able
> to figure out the way to access this particular column from the current row
> 'r'.
>
>
> Basically, I am trying to iterate over my testing data one by one and pass
> its vector column to a function that finds its label.
>
>
>
> Regards,
>
> Auon
>
>
> ________________________________
> From: Nandish Jayaram <nj...@pivotal.io>
> Sent: Thursday, December 1, 2016 2:51:47 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> My apologies for the late reply.
> Can you please give me more information regarding the design approach you
> have taken. Information like
> what files you have created so far would be helpful. I am not sure I
> understand your approach correctly
> yet. Is the above snippet of code the only code you have, or do you have
> some other files too?
>
> NJ
>
> On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> > Hi NJ,
> >
> > I got stuck at a place. Need a little help.
> >
> > Suppose I have a function that receives table_name and column_name as
> > varchar.
> >
> > Now I would like to iterate through each row of this table, while
> > accessing the value of this column. I am doing something like this:
> >
> >
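> > CREATE OR REPLACE FUNCTION Foo(
> >     table_name VARCHAR,
> >     column_name VARCHAR
> > ) RETURNS VOID AS
> > $BODY$
> > DECLARE
> >     r record;
> >     b integer;
> > BEGIN
> >     -- Iterate over every row of the table named by table_name.
> >     FOR r IN EXECUTE format('SELECT * FROM %I', table_name)
> >     LOOP
> >         -- Problem line: this looks for a field literally called
> >         -- "column_name", not the column named by the argument.
> >         b := r.column_name;
> >     END LOOP;
> > END;
> > $BODY$ LANGUAGE plpgsql;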
> >
> > Everything works except that column_name is a varchar, so r.column_name
> > won't give me the corresponding column's value in the extracted row r.
> > If the column is 'pid' in the given table, then b := r.pid gives the
> > right result, but I want to get the same effect from
> > b := r.column_name;
> >
> >
> > Could you please help.
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Kazmi,Auon H <ak...@ufl.edu>
> > Sent: Friday, November 25, 2016 3:23:46 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Thanks NJ,
> >
> > I will move forward in the suggested way.
> >
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Nandish Jayaram <nj...@pivotal.io>
> > Sent: Wednesday, November 23, 2016 12:20:35 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hey Auon,
> >
> > Starting with only classification for now sounds like a good idea!
> > Yes, the output should be just the predicted label for each row.
> > If the table you want to run the classification task on is like the
> > following:
> > id |  x |    y
> >  1 | 10 | 10.5
> >  2 | 30 | 31.5
> >  3 | 20 | 22.5
> >
> > then the output table could be something like the following:
> > id |  x |    y | predicted_label
> >  1 | 10 | 10.5 | true
> >  2 | 30 | 31.5 | false
> >  3 | 20 | 22.5 | true
> >
> > You are basically adding a new column called "predicted_label" to the
> > input table, and assigning the label for each row based on the k-NN.
> >
> > We can certainly make it better, by modifying the kNN function interface.
> > But let's just keep it simple for now and work on that later.
> >
> > NJ
> >
> > On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> >
> > >
> > > Hi NJ,
> > >
> > > I have implemented a first version of the interface as suggested by you.
> > > Right now, I am just looking at the classification task. I will
> > > generalize it to work for the regression task as well. I have a question
> > > regarding the output of the function. Should it just be the predicted
> > > label (or prediction value in case of regression)? Can you give an
> > > example of the output?
> > >
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Kazmi,Auon H <ak...@ufl.edu>
> > > Sent: Friday, November 18, 2016 3:16:00 AM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi NJ,
> > >
> > > Thanks for your inputs!
> > >
> > > I will go through every one of them and try to incorporate them.
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Nandish Jayaram <nj...@pivotal.io>
> > > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi Auon,
> > >
> > > Defining the interface is a good start for k-NN. I have slightly modified
> > > your interface to help it conform with other MADlib algorithms'
> > > interfaces. Note that the output for each new data point is not the 'k'
> > > nearest neighbors, but the result of a classification or regression task
> > > on the data point based on its 'k' nearest neighbors. Every data point in
> > > the training data will have an associated class label (regression value)
> > > in a different column. Normally, the column containing the data point
> > > itself is called the independent variable, and the column containing the
> > > class label is called the dependent variable. If it is classification,
> > > you take a majority vote of the class labels of the 'k' nearest
> > > neighbors, and if it is regression, you average the dependent variable
> > > values of the 'k' nearest neighbors.
> > > Here is a preliminary interface we could start with:
> > >
> > > knn(
> > > source_table, -- TEXT, name of table containing training data.
> > > new_data_table, -- TEXT, name of table containing new data on which
> > >     classification or regression has to be performed. Classification or
> > >     regression can be performed based on the type of "dependent_varname".
> > > output_table, -- TEXT, name of the table where output predictors are
> > >     written. If this table is already present, an error is returned.
> > > dependent_varname, -- TEXT, name of the dependent variable column. If
> > >     this column is of type boolean/integer, we could probably perform k-NN
> > >     classification, and perform k-NN regression if this is of type double.
> > > independent_varname, -- TEXT, column defining data points. Data points can
> > >     be of type SVEC or any type convertible to SVEC such as float[] or
> > >     integer[].
> > > k, -- INTEGER, (optional, default value could be some odd number, say 5)
> > >     number of neighbors to consider.
> > > metric, -- TEXT, (optional, default value could be what you are using now
> > >     for distance) the distance metric to use.
> > > );
> > >
> > > For now you can just use the distance metric you had mentioned in an
> > > earlier email. Note that the source_table and new_data_table are tables in
> > > the database and not files.
> > >
> > > Some pointers to help you start off with the implementation:
> > > - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> > > is a very useful resource with a great hello-world example. It gives you
> > > details about how to add a new module (k-NN would be a new module) to
> > > MADlib.
> > > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> > > Defined Aggregates) in your implementation. This will require you to add a
> > > C++ layer too, along with the SQL and python layers. Feel free to ask
> > > specific questions about this after you have tried out the hello world
> > > example.
> > > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you
> > > more information regarding the C++ abstraction layer in MADlib.
> > >
> > > Feel free to shout out for help if you are stuck! Cheers. :)
> > >
> > > NJ
> > >
> > > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > >
> > > > Hi Frank and NJ,
> > > >
> > > > Thanks for your comments. I will go through the suggestions provided by
> > > > NJ.
> > > >
> > > > Current interface of KNN is as follows:
> > > >
> > > > 1) Input:
> > > >
> > > >        - Name of table having all the data points in n-dimensional
> > > >          vector form (Double Precision[ ])
> > > >
> > > >        - Column-name of these data points
> > > >
> > > >        - Name of file having that n-dim vector (v, say) whose k-nearest
> > > >          neighbours need to be found from first table (Double
> > > >          Precision[ ])
> > > >
> > > >        - Column name having this vector
> > > >
> > > >        - value of 'k'
> > > >
> > > >
> > > > It returns 'k' nearest neighbours of vector v from first table having
> > > > data points.
> > > >
> > > >
> > > >
> > > > For now, I am using madlib's squared norm function to calculate
> > > > distance between any two vectors. I will try to generalise that.
> > > >
> > > >
> > > > Please suggest any other improvements.
> > > >
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Auon Haidar
> > > >
> > > > ________________________________
> > > > From: Frank McQuillan <fm...@pivotal.io>
> > > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Auon,
> > > >
> > > > Thanks for working on kNN for MADlib. Can you expand a little bit on
> > > > your note, and post the interface that you are thinking about and
> > > > description of the arguments? Then people can comment on that.
> > > >
> > > > Thanks,
> > > > Frank
> > > >
> > > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <njayaram@pivotal.io> wrote:
> > > >
> > > > > Hi Auon,
> > > > >
> > > > > Great going with your first version of k-NN implementation.
> > > > > Some useful links for coding guidelines are at (see Developer
> > > > > Documentation):
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > > MADlib has basic tests called install-checks. You can look at
> > > > > any existing module for an example. For instance, check out
> > > > > the install-check code for k-means at:
> > > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > > >
> > > > > I am sure others will pitch in to help you more with your other
> > > > > questions, but these are some starters you can consider! Good luck!
> > > > >
> > > > > NJ
> > > > >
> > > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am a first year Computer Science graduate student at University of
> > > > > > Florida working on implementing KNN in Madlib. I am ready with a first
> > > > > > version of it but I don't know how to proceed with testing and adding it
> > > > > > to Madlib platform. Also, I am not clear on what standards do I have to
> > > > > > choose in the final implementation. My current version asks for the table
> > > > > > name and column name having vectors in which I have to find the
> > > > > > neighbours. The other table given as input holds the vector whose K-NN
> > > > > > needs to be found. It is assuming euclidean distance metric for distance
> > > > > > calculation. It would really help if somebody can share ideas on what can
> > > > > > be added to this functionality.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Auon Haidar Kazmi
> > > > > >
> > > > >
> > > >
> > >
> >
>
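
The quoted question above, about reading a column whose name arrives as a VARCHAR argument, is never answered in the thread itself (the later message only says a solution was found). One common approach, offered here as an assumption rather than as the solution that was actually used, is to put the column name into the dynamic query and give it a fixed alias, so the loop can read a known field name. In PL/pgSQL that would look like format('SELECT %I AS val FROM %I', column_name, table_name) followed by b := r.val; inside the loop. The Python sketch below only builds the equivalent query text; quote_ident and build_column_query are hypothetical helper names, and schema-qualified table names are not handled.

```python
def quote_ident(name):
    """Double-quote an SQL identifier, doubling any embedded double quotes."""
    if not name:
        raise ValueError("identifier must not be empty")
    return '"' + name.replace('"', '""') + '"'


def build_column_query(table_name, column_name, alias="val"):
    """Build a query that exposes the requested column under a fixed alias.

    The caller can then always read the field named `alias` from each row,
    whatever the original column was called.
    """
    return "SELECT {col} AS {alias} FROM {tbl}".format(
        col=quote_ident(column_name),
        alias=quote_ident(alias),
        tbl=quote_ident(table_name),
    )


print(build_column_query("point_source", "pid"))
```

Quoting every identifier keeps the sketch short, at the cost of making names case-sensitive; the %I specifier of PostgreSQL's format() only adds quotes when they are needed.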

Re: Adding KNN to madlib

Posted by "Kazmi,Auon H" <ak...@ufl.edu>.

Hi NJ,

Thanks!

I will do that.




Regards,

Auon

________________________________
From: Nandish Jayaram <nj...@pivotal.io>
Sent: Sunday, December 4, 2016 1:39:53 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi Auon,

That's great!
I think the best way to share your code with the community is by opening a
pull request on github. Please do that and a lot of folks will be able to
comment and give suggestions to you.

NJ


Re: Adding KNN to madlib

Posted by Nandish Jayaram <nj...@pivotal.io>.

Hi Auon,

That's great!
I think the best way to share your code with the community is by opening a
pull request on github. Please do that and a lot of folks will be able to
comment and give suggestions to you.

NJ

On Sat, Dec 3, 2016 at 2:13 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi NJ,
>
> I got the solution to my problem.
>
> So, I might be done with my first version of interface of KNN for
> classification as suggested by you, by Monday or so. I will generalise it
> for regression and then please let me know how to share it with you guys.
> After that, I can start making required changes as and when needed.
>
>
>
> regards,
>
> Auon Haidar
>
> ________________________________
> From: Kazmi,Auon H <ak...@ufl.edu>
> Sent: Thursday, December 1, 2016 2:59:21 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi NJ,
>
> No, this is just an example I gave. So, I want in a postgres function to
> iterate over the rows of a table given as a VARCHAR argument.
>
> FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
>
> will do that. Now, r is a record, i.e. a row of table 'point_source'. I
> want to store a particular column of that row r in a variable. Now, this
> column name is also passed as VARCHAR argument to function. I am not able
> to figure out the way to access this particular column from the current row
> 'r'.
>
>
> Basically, I am trying to iterate over my testing data one by one and pass
> its vector column to a function that finds its label.
>
>
>
> Regards,
>
> Auon
>
>
> ________________________________
> From: Nandish Jayaram <nj...@pivotal.io>
> Sent: Thursday, December 1, 2016 2:51:47 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> My apologies for the late reply.
> Can you please give me more information regarding the design approach you
> have taken. Information like
> what files you have created so far would be helpful. I am not sure I
> understand your approach correctly
> yet. Is the above snippet of code the only code you have, or do you have
> some other files too?
>
> NJ
>
> On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> > Hi NJ,
> >
> > I got stuck at a place. Need a little help.
> >
> > Suppose I have a function that receives table_name and column_name as
> > varchar.
> >
> > Now I would like to iterate through each row of this table, while
> > accessing the value of this column. I am doing something like this:
> >
> >
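> > CREATE OR REPLACE FUNCTION Foo(
> >     table_name VARCHAR,
> >     column_name VARCHAR
> > ) RETURNS VOID AS
> > $BODY$
> > DECLARE
> >     r record;
> >     b integer;
> > BEGIN
> >     -- Iterate over every row of the table named by table_name.
> >     FOR r IN EXECUTE format('SELECT * FROM %I', table_name)
> >     LOOP
> >         -- Problem line: this looks for a field literally called
> >         -- "column_name", not the column named by the argument.
> >         b := r.column_name;
> >     END LOOP;
> > END;
> > $BODY$ LANGUAGE plpgsql;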
> >
> > Everything works except that column_name is a varchar, so r.column_name
> > won't give me the corresponding column's value in the extracted row r.
> > If the column is 'pid' in the given table, then b := r.pid gives the
> > right result, but I want to get the same effect from
> > b := r.column_name;
> >
> >
> > Could you please help.
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Kazmi,Auon H <ak...@ufl.edu>
> > Sent: Friday, November 25, 2016 3:23:46 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Thanks NJ,
> >
> > I will move forward in the suggested way.
> >
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Nandish Jayaram <nj...@pivotal.io>
> > Sent: Wednesday, November 23, 2016 12:20:35 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hey Auon,
> >
> > Starting with only classification for now sounds like a good idea!
> > Yes, the output should be just the predicted label for each row.
> > If the table you want to run the classification task on is like the
> > following:
> > *id |   x   |  y*
> > 1    10     10.5
> > 2    30     31.5
> > 3    20     22.5
> >
> > then the output table could be something like the following:
> > *id |   x   |    y     |  predicted_label*
> > 1    10     10.5          true
> > 2    30     31.5          false
> > 3    20     22.5          true
> >
> > You are basically adding a new column to the input table called
> > "predicted_label", and assign the label for each row based on the k-NN.
> >
> > We can certainly make it better, by modifying the kNN function interface.
> > But let's just keep it simple for now and work on that later.
> >
> > NJ
> >
> > On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> >
> > >
> > > Hi NJ,
> > >
> > > I have implemented a first version of interface as suggested by you.
> > Right
> > > now, I am just looking at classification task. I will generalize it to
> > work
> > > for regression task as well. I have a question regarding output of the
> > > function. Should it just be the predicted label (or prediction value in
> > > case of regression)? Can you give an example of output?
> > >
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Kazmi,Auon H <ak...@ufl.edu>
> > > Sent: Friday, November 18, 2016 3:16:00 AM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi NJ,
> > >
> > > Thanks for your inputs!
> > >
> > > > I will go through every one of them and try to incorporate them.
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Nandish Jayaram <nj...@pivotal.io>
> > > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi Auon,
> > >
> > > Defining the interface is a good start for k-NN. I have slightly
> modified
> > > your interface to help it conform with other MADlib algorithms'
> > interfaces.
> > > Note that the output for each new data point is not the 'k' nearest
> > > neighbors, but either a classification or regression task on the data
> > point
> > > based on its 'k' nearest neighbors. Every data point in the training
> data
> > > will have an associated class label (regression value) in a different
> > > column. Normally, the column containing the data point itself is called
> > the
> > > independent variable, and the column containing the class label is
> called
> > > the dependent variable. If it is classification, you take a majority
> vote
> > > of the class labels of the 'k' nearest neighbors, and if it is
> > regression,
> > > you average the dependent variable values of the 'k' nearest neighbors.
> > > Here is a preliminary interface we could start with:
> > >
> > > > *knn*(
> > > > source_table, -- *TEXT, name of table containing training data.*
> > > > new_data_table, -- *TEXT, name of table containing new data on which
> > > > classification or regression has to be performed. Classification or
> > > > regression can be performed based on the type of "dependent_varname".*
> > > > output_table, -- *TEXT, name of the table where output predictors are
> > > > written. If this table is already present, an error is returned.*
> > > > dependent_varname, -- *TEXT, name of the dependent variable column. If
> > > > this column is of type boolean/integer, we could probably perform k-NN
> > > > classification, and perform k-NN regression if this is of type double.*
> > > > independent_varname, -- *TEXT, column defining data points. Data points can
> > > > be of type SVEC or any type convertible to SVEC such as float[] or
> > > > integer[].*
> > > > k, -- *INTEGER, (optional, default value could be some odd number, say 5)
> > > > number of neighbors to consider*
> > > > metric, -- *TEXT, (optional, default value could be what you are using now
> > > > for distance) the distance metric to use.*
> > > > );
> > >
> > > For now you can just use the distance metric you had mentioned in an
> > > earlier email. Note that the source_table and new_data_table are tables
> > in
> > > the database and not files.
> > >
> > > > Some pointers to help you start off with the implementation:
> > > > - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> > > > is a very useful resource with a great hello-world example. It gives you
> > > > details about how to add a new module (k-NN would be a new module) to
> > > > MADlib.
> > > > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> > > > Defined Aggregates) in your implementation. This will require you to add a
> > > > C++ layer too, along with the SQL and python layers. Feel free to ask
> > > > specific questions about this after you have tried out the hello world
> > > > example.
> > > > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you more
> > > > information regarding the C++ abstraction layer in MADlib.
> > >
> > > Feel free to shout out for help if you are stuck! Cheers. :)
> > >
> > > NJ
> > >
> > > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > >
> > > > Hi Frank and NJ,
> > > >
> > > > Thanks for your comments. I will go through the suggestions provided
> by
> > > NJ.
> > > >
> > > > Current interface of KNN is as follows:
> > > >
> > > > 1) Input:
> > > >
> > > >        - Name of table having all the data points in n-dimensional
> > vector
> > > > form (Double                              Precision[ ])
> > > >
> > > >        - Column-name of these data points
> > > >
> > > >        - Name of file having that n-dim vector (v, say) whose
> k-nearest
> > > > neighbours need to be               found from first table (Double
> > > > Precision[ ])
> > > >
> > > >        - Column name having this vector
> > > >
> > > >        - value of 'k'
> > > >
> > > >
> > > > It returns 'k' nearest neighbours of vector v from first table having
> > > data
> > > > points.
> > > >
> > > >
> > > >
> > > > For now, I am using madlib's squared norm function to calculate
> > distance
> > > > between any two vectors. I will try to generalise that.
> > > >
> > > >
> > > > Please suggest any other improvements.
> > > >
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Auon Haidar
> > > >
> > > > ________________________________
> > > > From: Frank McQuillan <fm...@pivotal.io>
> > > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Auon,
> > > >
> > > > Thanks for working on kNN for MADlib.   Can you expand a little bit
> on
> > > your
> > > > note, and post the interface that you are thinking about and
> > description
> > > of
> > > > the arguments?  Then people can comment on that.
> > > >
> > > > Thanks,
> > > > Frank
> > > >
> > > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <
> njayaram@pivotal.io>
> > > > wrote:
> > > >
> > > > > Hi Auon,
> > > > >
> > > > > Great going with your first version of k-NN implementation.
> > > > > > Some useful links for coding guidelines are at (see Developer
> > > > > > Documentation):
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > > > MADlib has something called install-checks for basic testing. You can
> > > > > > look at any existing module for an example of the same. For instance, check
> > > > > > out the install check code for k-means at:
> > > > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > > >
> > > > > I am sure others will pitch in to help you more with your other
> > > > questions,
> > > > > but these are some starters you can consider! Good luck!
> > > > >
> > > > > NJ
> > > > >
> > > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu>
> > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am a first year Computer Science graduate student at University
> > of
> > > > > > Florida working on implementing KNN in Madlib. I am ready with a
> > > first
> > > > > > version of it but I don't know how to proceed with testing and
> > adding
> > > > it
> > > > > to
> > > > > > Madlib platform. Also, I am not clear on what standards do I have
> > to
> > > > > choose
> > > > > > in the final implementation. My current version asks for the
> table
> > > name
> > > > > and
> > > > > > column name having vectors in which I have to find the
> neighbours.
> > > The
> > > > > > other table given as input holds the vector whose K-NN needs to
> be
> > > > found.
> > > > > > It is assuming euclidean distance metric for distance
> calculation.
> > It
> > > > > would
> > > > > > really help if somebody can share ideas on what can be added to
> > this
> > > > > > functionality.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Auon Haidar Kazmi
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Adding KNN to madlib

Posted by "Kazmi,Auon H" <ak...@ufl.edu>.

Hi NJ,

I got the solution to my problem.

I should be done with the first version of the kNN interface for classification, as you suggested, by Monday or so. I will then generalise it for regression. Please let me know how to share it with you all. After that, I can make further changes as and when needed.
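
For reference, the classification step discussed in this thread (a majority vote among the k nearest training points) can be sketched in plain SQL. This is only an illustration under stated assumptions, not the code that was submitted: the tables knn_train(id, features, label) and knn_test(id, features) are hypothetical, k is fixed at 5, features is assumed to be DOUBLE PRECISION[], and the distance is MADlib's squared_dist_norm2 from the linear algebra module.

```sql
-- Hypothetical inputs (names are assumptions for this sketch):
--   knn_train(id INTEGER, features DOUBLE PRECISION[], label INTEGER)
--   knn_test (id INTEGER, features DOUBLE PRECISION[])
WITH neighbours AS (
    -- Rank every training point by its distance to each test point.
    SELECT t.id AS test_id,
           s.label,
           row_number() OVER (
               PARTITION BY t.id
               ORDER BY madlib.squared_dist_norm2(s.features, t.features)
           ) AS dist_rank
    FROM knn_test AS t
    CROSS JOIN knn_train AS s
),
votes AS (
    -- Keep the k = 5 nearest neighbours and count votes per label.
    SELECT test_id, label, count(*) AS n_votes
    FROM neighbours
    WHERE dist_rank <= 5
    GROUP BY test_id, label
),
ranked_votes AS (
    -- Order labels by vote count; ties are broken by the smaller label.
    SELECT test_id, label,
           row_number() OVER (
               PARTITION BY test_id
               ORDER BY n_votes DESC, label
           ) AS vote_rank
    FROM votes
)
SELECT test_id, label AS predicted_label
FROM ranked_votes
WHERE vote_rank = 1;
```

For regression, the same neighbours step could be followed by avg(label) grouped by test_id instead of a vote. The cross join compares every test point with every training point, so this sketch is meant to show the logic rather than to scale.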



regards,

Auon Haidar

________________________________
From: Kazmi,Auon H <ak...@ufl.edu>
Sent: Thursday, December 1, 2016 2:59:21 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi NJ,

No, this is just an example I gave. In a PostgreSQL function, I want to iterate over the rows of a table whose name is passed as a VARCHAR argument.

FOR r IN EXECUTE format('SELECT * FROM %I', table_name)

will do that. Here r is a record, i.e. one row of the table named by table_name. I want to store a particular column of that row in a variable, but the column name is also passed to the function as a VARCHAR argument, and I cannot figure out how to access that column from the current row r.


Basically, I am trying to iterate over my testing data row by row and pass each row's vector column to a function that finds its label.



Regards,

Auon


________________________________
From: Nandish Jayaram <nj...@pivotal.io>
Sent: Thursday, December 1, 2016 2:51:47 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi Auon,

My apologies for the late reply.
Can you please give me more information regarding the design approach you
have taken. Information like
what files you have created so far would be helpful. I am not sure I
understand your approach correctly
yet. Is the above snippet of code the only code you have, or do you have
some other files too?

NJ

On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi NJ,
>
> I got stuck at a place. Need a little help.
>
> Suppose I have a function that receives table_name and column_name as
> varchar.
>
> Now I would like to iterate through each row of this table, while
> accessing the value of this column. I am doing something like this:
>
>
> CREATE OR REPLACE FUNCTION Foo(
>     table_name VARCHAR,
>     column_name VARCHAR
> ) RETURNS VOID AS
> $BODY$
> DECLARE
>     r record;
>     b integer;
> BEGIN
>     -- table_name holds the name of the table to scan
>     FOR r IN EXECUTE format('SELECT * FROM %I', table_name)
>     LOOP
>         -- Problem: this looks for a field literally named "column_name"
>         b := r.column_name;
>     END LOOP;
> END;
> $BODY$ LANGUAGE plpgsql;
>
> Everything works except that column_name is a varchar, so r.column_name
> won't give me the corresponding column's value in the extracted row r.
> Suppose the column is 'pid' in the given table: then b := r.pid gives the
> right result, but I want to get the same effect from
> b := r.column_name;
>
>
> Could you please help.
>
>
>
> Regards,
>
> Auon
>
> ________________________________
> From: Kazmi,Auon H <ak...@ufl.edu>
> Sent: Friday, November 25, 2016 3:23:46 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Thanks NJ,
>
> I will move forward in the suggested way.
>
>
>
>
> Regards,
>
> Auon
>
> ________________________________
> From: Nandish Jayaram <nj...@pivotal.io>
> Sent: Wednesday, November 23, 2016 12:20:35 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hey Auon,
>
> Starting with only classification for now sounds like a good idea!
> Yes, the output should be just the predicted label for each row.
> If the table you want to run the classification task on is like the
> following:
> *id |   x   |  y*
> 1    10     10.5
> 2    30     31.5
> 3    20     22.5
>
> then the output table could be something like the following:
> *id |   x   |    y     |  predicted_label*
> 1    10     10.5          true
> 2    30     31.5          false
> 3    20     22.5          true
>
> You are basically adding a new column to the input table called
> "predicted_label", and assign the label for each row based on the k-NN.
>
> We can certainly make it better, by modifying the kNN function interface.
> But let's just keep it simple for now and work on that later.
>
> NJ
>
> On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> >
> > Hi NJ,
> >
> > I have implemented a first version of interface as suggested by you.
> Right
> > now, I am just looking at classification task. I will generalize it to
> work
> > for regression task as well. I have a question regarding output of the
> > function. Should it just be the predicted label (or prediction value in
> > case of regression)? Can you give an example of output?
> >
> >
> >
> >
> >
> > Regards,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Kazmi,Auon H <ak...@ufl.edu>
> > Sent: Friday, November 18, 2016 3:16:00 AM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi NJ,
> >
> > Thanks for your inputs!
> >
> > I will go through every one of them and try to incorporate them.
> >
> >
> >
> > Regards,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Nandish Jayaram <nj...@pivotal.io>
> > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi Auon,
> >
> > Defining the interface is a good start for k-NN. I have slightly modified
> > your interface to help it conform with other MADlib algorithms'
> interfaces.
> > Note that the output for each new data point is not the 'k' nearest
> > neighbors, but either a classification or regression task on the data
> point
> > based on its 'k' nearest neighbors. Every data point in the training data
> > will have an associated class label (regression value) in a different
> > column. Normally, the column containing the data point itself is called
> the
> > independent variable, and the column containing the class label is called
> > the dependent variable. If it is classification, you take a majority vote
> > of the class labels of the 'k' nearest neighbors, and if it is
> regression,
> > you average the dependent variable values of the 'k' nearest neighbors.
> > Here is a preliminary interface we could start with:
> >
> > *knn*(
> > source_table, -- *TEXT, name of table containing training data.*
> > new_data_table, -- *TEXT, name of table containing new data on which
> > classification or regression has to be performed. Classification or
> > regression can be performed based on the type of "dependent_varname".*
> > output_table, -- *TEXT, name of the table where output predictors are
> > written. If this table is already present, an error is returned.*
> > dependent_varname, -- *TEXT, name of the dependent variable column. If
> > this column is of type boolean/integer, we could probably perform k-NN
> > classification, and perform k-NN regression if this is of type double.*
> > independent_varname, -- *TEXT, column defining data points. Data points can
> > be of type SVEC or any type convertible to SVEC such as float[] or
> > integer[].*
> > k, -- *INTEGER, (optional, default value could be some odd number, say 5)
> > number of neighbors to consider*
> > metric, -- *TEXT, (optional, default value could be what you are using now
> > for distance) the distance metric to use.*
> > );
> >
> > For now you can just use the distance metric you had mentioned in an
> > earlier email. Note that the source_table and new_data_table are tables
> in
> > the database and not files.
> >
> > Some pointers to help you start off with the implementation:
> > - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> > is a very useful resource with a great hello-world example. It gives you
> > details about how to add a new module (k-NN would be a new module) to
> > MADlib.
> > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> > Defined Aggregates) in your implementation. This will require you to add a
> > C++ layer too, along with the SQL and python layers. Feel free to ask
> > specific questions about this after you have tried out the hello world
> > example.
> > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you more
> > information regarding the C++ abstraction layer in MADlib.
> >
> > Feel free to shout out for help if you are stuck! Cheers. :)
> >
> > NJ
> >
> > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> >
> > > Hi Frank and NJ,
> > >
> > > Thanks for your comments. I will go through the suggestions provided by
> > NJ.
> > >
> > > Current interface of KNN is as follows:
> > >
> > > 1) Input:
> > >
> > >        - Name of table having all the data points in n-dimensional
> vector
> > > form (Double                              Precision[ ])
> > >
> > >        - Column-name of these data points
> > >
> > >        - Name of file having that n-dim vector (v, say) whose k-nearest
> > > neighbours need to be               found from first table (Double
> > > Precision[ ])
> > >
> > >        - Column name having this vector
> > >
> > >        - value of 'k'
> > >
> > >
> > > It returns 'k' nearest neighbours of vector v from first table having
> > data
> > > points.
> > >
> > >
> > >
> > > For now, I am using madlib's squared norm function to calculate
> distance
> > > between any two vectors. I will try to generalise that.
> > >
> > >
> > > Please suggest any other improvements.
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Frank McQuillan <fm...@pivotal.io>
> > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Auon,
> > >
> > > Thanks for working on kNN for MADlib.   Can you expand a little bit on
> > your
> > > note, and post the interface that you are thinking about and
> description
> > of
> > > the arguments?  Then people can comment on that.
> > >
> > > Thanks,
> > > Frank
> > >
> > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io>
> > > wrote:
> > >
> > > > Hi Auon,
> > > >
> > > > Great going with your first version of k-NN implementation.
> > > > Some useful links for coding guidelines are at (see Developer
> > > > Documentation):
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > MADlib has something called install-checks for basic testing. You can
> > > > look at any existing module for an example of the same. For instance, check
> > > > out the install check code for k-means at:
> > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > >
> > > > I am sure others will pitch in to help you more with your other
> > > questions,
> > > > but these are some starters you can consider! Good luck!
> > > >
> > > > NJ
> > > >
> > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu>
> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am a first year Computer Science graduate student at University
> of
> > > > > Florida working on implementing KNN in Madlib. I am ready with a
> > first
> > > > > version of it but I don't know how to proceed with testing and
> adding
> > > it
> > > > to
> > > > > Madlib platform. Also, I am not clear on what standards do I have
> to
> > > > choose
> > > > > in the final implementation. My current version asks for the table
> > name
> > > > and
> > > > > column name having vectors in which I have to find the neighbours.
> > The
> > > > > other table given as input holds the vector whose K-NN needs to be
> > > found.
> > > > > It is assuming euclidean distance metric for distance calculation.
> It
> > > > would
> > > > > really help if somebody can share ideas on what can be added to
> this
> > > > > functionality.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Auon Haidar Kazmi
> > > > >
> > > >
> > >
> >
>

Re: Adding KNN to madlib

Posted by "Kazmi,Auon H" <ak...@ufl.edu>.

Hi NJ,

No, this is just an example I gave. In a PostgreSQL function, I want to iterate over the rows of a table whose name is passed as a VARCHAR argument.

FOR r IN EXECUTE format('SELECT * FROM %I', table_name)

will do that. Here r is a record, i.e. one row of the table named by table_name. I want to store a particular column of that row in a variable, but the column name is also passed to the function as a VARCHAR argument, and I cannot figure out how to access that column from the current row r.


Basically, I am trying to iterate over my testing data row by row and pass each row's vector column to a function that finds its label.
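
One common way to handle this kind of dynamic column access in PL/pgSQL is to put the column name into the dynamic query itself and give it a fixed alias, so that the loop body never needs a variable field name. The sketch below is an illustration under assumptions, not necessarily the solution adopted in this thread: it assumes PostgreSQL 9.1 or later (for format()), an integer-valued column, and a hypothetical alias val.

```sql
CREATE OR REPLACE FUNCTION foo(
    table_name  VARCHAR,
    column_name VARCHAR
) RETURNS VOID AS
$BODY$
DECLARE
    r record;
    b integer;
BEGIN
    -- Select only the requested column under a fixed alias ("val", a name
    -- chosen for this sketch), so the field name is known when the
    -- function is written. %I quotes each argument as an identifier.
    FOR r IN EXECUTE format('SELECT %I AS val FROM %I',
                            column_name, table_name)
    LOOP
        b := r.val;
        RAISE NOTICE 'value: %', b;
    END LOOP;
END;
$BODY$ LANGUAGE plpgsql;

-- Hypothetical usage, assuming a table "points" with an integer column "pid":
-- SELECT foo('points', 'pid');
```

Note that %I quotes its argument as a single identifier, so a schema-qualified name such as myschema.points would have to be passed as two separate arguments. For a vector column, b would be declared with the matching array type instead of integer.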



Regards,

Auon


________________________________
From: Nandish Jayaram <nj...@pivotal.io>
Sent: Thursday, December 1, 2016 2:51:47 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi Auon,

My apologies for the late reply.
Can you please give me more information regarding the design approach you
have taken. Information like
what files you have created so far would be helpful. I am not sure I
understand your approach correctly
yet. Is the above snippet of code the only code you have, or do you have
some other files too?

NJ

On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi NJ,
>
> I got stuck at a place. Need a little help.
>
> Suppose I have a function that receives table_name and column_name as
> varchar.
>
> Now I would like to iterate through each row of this table, while
> accessing the value of this column. I am doing something like this:
>
>
> CREATE OR REPLACE FUNCTION Foo(
>     table_name VARCHAR,
>     column_name VARCHAR
> ) RETURNS VOID AS
> $BODY$
> DECLARE
>     r record;
>     b integer;
> BEGIN
>     -- table_name holds the name of the table to scan
>     FOR r IN EXECUTE format('SELECT * FROM %I', table_name)
>     LOOP
>         -- Problem: this looks for a field literally named "column_name"
>         b := r.column_name;
>     END LOOP;
> END;
> $BODY$ LANGUAGE plpgsql;
>
> Everything works except that column_name is a varchar, so r.column_name
> won't give me the corresponding column's value in the extracted row r.
> Suppose the column is 'pid' in the given table: then b := r.pid gives the
> right result, but I want to get the same effect from
> b := r.column_name;
>
>
> Could you please help.
>
>
>
> Regards,
>
> Auon
>
> ________________________________
> From: Kazmi,Auon H <ak...@ufl.edu>
> Sent: Friday, November 25, 2016 3:23:46 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Thanks NJ,
>
> I will move forward in the suggested way.
>
>
>
>
> Regards,
>
> Auon
>
> ________________________________
> From: Nandish Jayaram <nj...@pivotal.io>
> Sent: Wednesday, November 23, 2016 12:20:35 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hey Auon,
>
> Starting with only classification for now sounds like a good idea!
> Yes, the output should be just the predicted label for each row.
> If the table you want to run the classification task on is like the
> following:
> *id |   x   |  y*
> 1    10     10.5
> 2    30     31.5
> 3    20     22.5
>
> then the output table could be something like the following:
> *id |   x   |    y     |  predicted_label*
> 1    10     10.5          true
> 2    30     31.5          false
> 3    20     22.5          true
>
> You are basically adding a new column to the input table called
> "predicted_label", and assign the label for each row based on the k-NN.
>
> We can certainly make it better, by modifying the kNN function interface.
> But let's just keep it simple for now and work on that later.
>
> NJ
>
> On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> >
> > Hi NJ,
> >
> > I have implemented a first version of interface as suggested by you.
> Right
> > now, I am just looking at classification task. I will generalize it to
> work
> > for regression task as well. I have a question regarding output of the
> > function. Should it just be the predicted label (or prediction value in
> > case of regression)? Can you give an example of output?
> >
> >
> >
> >
> >
> > Regards,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Kazmi,Auon H <ak...@ufl.edu>
> > Sent: Friday, November 18, 2016 3:16:00 AM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi NJ,
> >
> > Thanks for your inputs!
> >
> > I will go through every one of them and try to incorporate them.
> >
> >
> >
> > Regards,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Nandish Jayaram <nj...@pivotal.io>
> > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi Auon,
> >
> > Defining the interface is a good start for k-NN. I have slightly modified
> > your interface to help it conform with other MADlib algorithms'
> interfaces.
> > Note that the output for each new data point is not the 'k' nearest
> > neighbors, but either a classification or regression task on the data
> point
> > based on its 'k' nearest neighbors. Every data point in the training data
> > will have an associated class label (regression value) in a different
> > column. Normally, the column containing the data point itself is called
> the
> > independent variable, and the column containing the class label is called
> > the dependent variable. If it is classification, you take a majority vote
> > of the class labels of the 'k' nearest neighbors, and if it is
> regression,
> > you average the dependent variable values of the 'k' nearest neighbors.
> > Here is a preliminary interface we could start with:
> >
> > *knn*(
> > source_table, -- *TEXT, name of table containing training data.*
> > new_data_table, -- *TEXT, name of table containing new data on which
> > classification or regression has to be performed. Classification or
> > regression can be performed based on the type of "dependent_varname".*
> > output_table, -- *TEXT, name of the table where output predictors are
> > written. If this table is already present, an error is returned.*
> > dependent_varname, -- *TEXT, name of the dependent variable column. If
> > this column is of type boolean/integer, we could probably perform k-NN
> > classification, and perform k-NN regression if this is of type double.*
> > independent_varname, -- *TEXT, column defining data points. Data points can
> > be of type SVEC or any type convertible to SVEC such as float[] or
> > integer[].*
> > k, -- *INTEGER, (optional, default value could be some odd number, say 5)
> > number of neighbors to consider*
> > metric, -- *TEXT, (optional, default value could be what you are using now
> > for distance) the distance metric to use.*
> > );
> >
> > For now you can just use the distance metric you had mentioned in an
> > earlier email. Note that the source_table and new_data_table are tables
> in
> > the database and not files.
> >
> > Some pointers to help you start off with the implementation:
> > - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> > is a very useful resource with a great hello-world example. It gives you
> > details about how to add a new module (k-NN would be a new module) to
> > MADlib.
> > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> > Defined Aggregates) in your implementation. This will require you to add a
> > C++ layer too, along with the SQL and python layers. Feel free to ask
> > specific questions about this after you have tried out the hello world
> > example.
> > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you more
> > information regarding the C++ abstraction layer in MADlib.
> >
> > Feel free to shout out for help if you are stuck! Cheers. :)
> >
> > NJ
> >
> > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> >
> > > Hi Frank and NJ,
> > >
> > > Thanks for your comments. I will go through the suggestions provided by
> > NJ.
> > >
> > > Current interface of KNN is as follows:
> > >
> > > 1) Input:
> > >
> > >        - Name of table having all the data points in n-dimensional
> vector
> > > form (Double                              Precision[ ])
> > >
> > >        - Column-name of these data points
> > >
> > >        - Name of file having that n-dim vector (v, say) whose k-nearest
> > > neighbours need to be               found from first table (Double
> > > Precision[ ])
> > >
> > >        - Column name having this vector
> > >
> > >        - value of 'k'
> > >
> > >
> > > It returns 'k' nearest neighbours of vector v from first table having
> > data
> > > points.
> > >
> > >
> > >
> > > For now, I am using madlib's squared norm function to calculate
> distance
> > > between any two vectors. I will try to generalise that.
> > >
> > >
> > > Please suggest any other improvements.
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Frank McQuillan <fm...@pivotal.io>
> > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Auon,
> > >
> > > Thanks for working on kNN for MADlib.   Can you expand a little bit on
> > your
> > > note, and post the interface that you are thinking about and
> description
> > of
> > > the arguments?  Then people can comment on that.
> > >
> > > Thanks,
> > > Frank
> > >
> > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io>
> > > wrote:
> > >
> > > > Hi Auon,
> > > >
> > > > Great going with your first version of k-NN implementation.
> > > > Some useful links for coding guidelines are at (see Developer
> > > > Documentation):
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > MADlib has something called install-checks for basic testing. You can
> > > > look at any existing module for an example of the same. For instance,
> > > > check out the install-check code for k-means at:
> > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > >
> > > > I am sure others will pitch in to help you more with your other
> > > > questions, but these are some starters you can consider! Good luck!
> > > >
> > > > NJ
> > > >
> > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu>
> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am a first year Computer Science graduate student at University
> of
> > > > > Florida working on implementing KNN in Madlib. I am ready with a
> > first
> > > > > version of it but I don't know how to proceed with testing and
> adding
> > > it
> > > > to
> > > > > Madlib platform. Also, I am not clear on what standards do I have
> to
> > > > choose
> > > > > in the final implementation. My current version asks for the table
> > name
> > > > and
> > > > > column name having vectors in which I have to find the neighbours.
> > The
> > > > > other table given as input holds the vector whose K-NN needs to be
> > > found.
> > > > > It is assuming euclidean distance metric for distance calculation.
> It
> > > > would
> > > > > really help if somebody can share ideas on what can be added to
> this
> > > > > functionality.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Auon Haidar Kazmi
> > > > >
> > > >
> > >
> >
>

Re: Adding KNN to madlib

Posted by Nandish Jayaram <nj...@pivotal.io>.

Hi Auon,

My apologies for the late reply.
Can you please give me more information regarding the design approach you
have taken? Information such as which files you have created so far would
be helpful. I am not sure I understand your approach correctly yet. Is the
snippet of code in your email the only code you have, or do you have other
files too?

NJ

On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi NJ,
>
> I am stuck at one point and need a little help.
>
> Suppose I have a function that receives table_name and column_name as
> varchar.
>
> Now I would like to iterate through each row of this table, accessing
> the value of this column. I am doing something like this:
>
>
> CREATE OR REPLACE FUNCTION Foo(
> table_name VARCHAR,
> column_name VARCHAR
> ) RETURNS VOID AS
> $BODY$
> DECLARE
>     r record;
>     b integer;
> BEGIN
>
>     FOR r IN EXECUTE format('SELECT * FROM %I', table_name)
>     LOOP
>
>         -- problem: this reads a field literally named column_name
>         b := r.column_name;
>
>     END LOOP;
> END;
> $BODY$ LANGUAGE plpgsql;
>
> So, everything works except that column_name is a varchar, so
> r.column_name won't give me the corresponding column's value in the
> extracted row r. Suppose the column is 'pid' in the given table: then
> b := r.pid gives the right result, but I want to get the same effect from
> b := r.column_name;
>
>
> Could you please help.
>
>
>
> Regards,
>
> Auon
>
> ________________________________
> From: Kazmi,Auon H <ak...@ufl.edu>
> Sent: Friday, November 25, 2016 3:23:46 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Thanks NJ,
>
> I will move forward in the suggested way.
>
>
>
>
> Regards,
>
> Auon
>
> ________________________________
> From: Nandish Jayaram <nj...@pivotal.io>
> Sent: Wednesday, November 23, 2016 12:20:35 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hey Auon,
>
> Starting with only classification for now sounds like a good idea!
> Yes, the output should be just the predicted label for each row.
> If the table you want to run the classification task on is like the
> following:
>
>  id | x  |  y
> ----+----+------
>   1 | 10 | 10.5
>   2 | 30 | 31.5
>   3 | 20 | 22.5
>
> then the output table could be something like the following:
>
>  id | x  |  y   | predicted_label
> ----+----+------+-----------------
>   1 | 10 | 10.5 | true
>   2 | 30 | 31.5 | false
>   3 | 20 | 22.5 | true
>
> You are basically adding a new column called "predicted_label" to the
> input table, and assigning the label for each row based on the k-NN.
>
> We can certainly make it better, by modifying the kNN function interface.
> But let's just keep it simple for now and work on that later.
>
> NJ
>
> On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> >
> > Hi NJ,
> >
> > I have implemented a first version of interface as suggested by you.
> Right
> > now, I am just looking at classification task. I will generalize it to
> work
> > for regression task as well. I have a question regarding output of the
> > function. Should it just be the predicted label (or prediction value in
> > case of regression)? Can you give an example of output?
> >
> >
> >
> >
> >
> > Regards,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Kazmi,Auon H <ak...@ufl.edu>
> > Sent: Friday, November 18, 2016 3:16:00 AM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi NJ,
> >
> > Thanks for your inputs!
> >
> > I will go through every one of them and try to incorporate them.
> >
> >
> >
> > Regards,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Nandish Jayaram <nj...@pivotal.io>
> > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi Auon,
> >
> > Defining the interface is a good start for k-NN. I have slightly modified
> > your interface to help it conform with other MADlib algorithms'
> interfaces.
> > Note that the output for each new data point is not the 'k' nearest
> > neighbors, but either a classification or regression task on the data
> point
> > based on its 'k' nearest neighbors. Every data point in the training data
> > will have an associated class label (regression value) in a different
> > column. Normally, the column containing the data point itself is called
> the
> > independent variable, and the column containing the class label is called
> > the dependent variable. If it is classification, you take a majority vote
> > of the class labels of the 'k' nearest neighbors, and if it is
> regression,
> > you average the dependent variable values of the 'k' nearest neighbors.
> > Here is a preliminary interface we could start with:
> >
> > *knn*(
> > source_table, -- *TEXT, name of table containing training data.*
> > new_data_table, -- *TEXT, name of table containing new data on which
> > classification or regression has to be performed. Classification or
> > regression can be performed based on the type of "dependent_varname".*
> > output_table, -- *TEXT, name of the table where output predictors are
> > written. If this table is already present, an error is returned.*
> > dependent_varname, -- *TEXT, name of the dependent variable column. If
> > this column is of type boolean/integer, we could probably perform k-NN
> > classification, and perform k-NN regression if this is of type double.*
> > independent_varname, -- *TEXT, column defining data points. Data points
> can
> > be of type SVEC or any type convertible to SVEC such as float[] or
> > integer[].*
> > k, --* INTEGER, (optional, default value could be some odd number, say 5)
> > number of neighbors to consider*
> > metric, -- *TEXT, (optional, default value could be what you are using
> now
> > for distance) the distance metric to use.*
> > );
> >
> > For now you can just use the distance metric you had mentioned in an
> > earlier email. Note that the source_table and new_data_table are tables
> in
> > the database and not files.
> >
> > Some pointers to help you start off with the implementation:
> > - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> > is a very useful resource with a great hello-world example. It gives you
> > details about how to add a new module (k-NN would be a new module) to
> > MADlib.
> > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> > Defined Aggregates) in your implementation. This will require you to add
> > a C++ layer too, along with the SQL and python layers. Feel free to ask
> > specific questions about this after you have tried out the hello world
> > example.
> > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you
> > more information regarding the C++ abstraction layer in MADlib.
> >
> > Feel free to shout out for help if you are stuck! Cheers. :)
> >
> > NJ
> >
> > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> >
> > > Hi Frank and NJ,
> > >
> > > Thanks for your comments. I will go through the suggestions provided by
> > NJ.
> > >
> > > Current interface of KNN is as follows:
> > >
> > > 1) Input:
> > >
> > >        - Name of table having all the data points in n-dimensional
> vector
> > > form (Double                              Precision[ ])
> > >
> > >        - Column-name of these data points
> > >
> > >        - Name of file having that n-dim vector (v, say) whose k-nearest
> > > neighbours need to be               found from first table (Double
> > > Precision[ ])
> > >
> > >        - Column name having this vector
> > >
> > >        - value of 'k'
> > >
> > >
> > > It returns 'k' nearest neighbours of vector v from first table having
> > data
> > > points.
> > >
> > >
> > >
> > > For now, I am using madlib's squared norm function to calculate
> distance
> > > between any two vectors. I will try to generalise that.
> > >
> > >
> > > Please suggest any other improvements.
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Frank McQuillan <fm...@pivotal.io>
> > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Auon,
> > >
> > > Thanks for working on kNN for MADlib.   Can you expand a little bit on
> > your
> > > note, and post the interface that you are thinking about and
> description
> > of
> > > the arguments?  Then people can comment on that.
> > >
> > > Thanks,
> > > Frank
> > >
> > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io>
> > > wrote:
> > >
> > > > Hi Auon,
> > > >
> > > > Great going with your first version of k-NN implementation.
> > > > Some useful links for coding guidelines are at (see Developer
> > > > Documentation):
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > MADlib has something called install-checks for basic testing. You can
> > > > look at any existing module for an example of the same. For instance,
> > > > check out the install-check code for k-means at:
> > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > >
> > > > I am sure others will pitch in to help you more with your other
> > > > questions, but these are some starters you can consider! Good luck!
> > > >
> > > > NJ
> > > >
> > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu>
> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am a first year Computer Science graduate student at University
> of
> > > > > Florida working on implementing KNN in Madlib. I am ready with a
> > first
> > > > > version of it but I don't know how to proceed with testing and
> adding
> > > it
> > > > to
> > > > > Madlib platform. Also, I am not clear on what standards do I have
> to
> > > > choose
> > > > > in the final implementation. My current version asks for the table
> > name
> > > > and
> > > > > column name having vectors in which I have to find the neighbours.
> > The
> > > > > other table given as input holds the vector whose K-NN needs to be
> > > found.
> > > > > It is assuming euclidean distance metric for distance calculation.
> It
> > > > would
> > > > > really help if somebody can share ideas on what can be added to
> this
> > > > > functionality.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Auon Haidar Kazmi
> > > > >
> > > >
> > >
> >
>

Re: Adding KNN to madlib

Posted by "Kazmi,Auon H" <ak...@ufl.edu>.

Hi NJ,

I am stuck at one point and need a little help.

Suppose I have a function that receives table_name and column_name as varchar.

Now I would like to iterate through each row of this table, accessing the value of this column. I am doing something like this:


CREATE OR REPLACE FUNCTION Foo(
table_name VARCHAR,
column_name VARCHAR
) RETURNS VOID AS
$BODY$
DECLARE
    r record;
    b integer;
BEGIN

    FOR r IN EXECUTE format('SELECT * FROM %I', table_name)
    LOOP

        -- problem: this reads a field literally named column_name
        b := r.column_name;

    END LOOP;
END;
$BODY$ LANGUAGE plpgsql;

So, everything works except that column_name is a varchar, so r.column_name won't give me the corresponding column's value in the extracted row r. Suppose the column is 'pid' in the given table: then b := r.pid gives the right result, but I want to get the same effect from
b := r.column_name;


Could you please help.



Regards,

Auon
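
One common way around this in PL/pgSQL is to put the column name into the dynamic query itself instead of into the record access: with format('SELECT %I AS val FROM %I', column_name, table_name), the loop body can read r.val no matter which column was requested. The Python sketch below builds the same kind of query string, as a driver function might before handing it to the database. The helper names quote_ident and build_column_query are hypothetical and introduced only for illustration. The quoting here only mimics %I: it always quotes (so names become case-sensitive) and treats a schema-qualified name as a single identifier.

```python
def quote_ident(name):
    """Quote a SQL identifier by wrapping it in double quotes.

    Hypothetical helper: embedded double quotes are doubled, which is the
    standard SQL escaping rule for quoted identifiers.
    """
    if not name:
        raise ValueError("identifier must be a non-empty string")
    return '"' + name.replace('"', '""') + '"'


def build_column_query(table_name, column_name, alias="val"):
    """Build a query that exposes the requested column under a fixed alias.

    Because the alias is fixed, the code that consumes each row can always
    read the same field name, whatever column the caller asked for.
    """
    return "SELECT {col} AS {alias} FROM {tbl}".format(
        col=quote_ident(column_name),
        alias=quote_ident(alias),
        tbl=quote_ident(table_name),
    )


if __name__ == "__main__":
    # Assumed example names: a table "points" with a column "pid".
    print(build_column_query("points", "pid"))
```

Aliasing the column keeps the loop body independent of the caller's column name, which is the property the original b := r.column_name was trying to get.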

________________________________
From: Kazmi,Auon H <ak...@ufl.edu>
Sent: Friday, November 25, 2016 3:23:46 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Thanks NJ,

I will move forward in the suggested way.




Regards,

Auon

________________________________
From: Nandish Jayaram <nj...@pivotal.io>
Sent: Wednesday, November 23, 2016 12:20:35 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hey Auon,

Starting with only classification for now sounds like a good idea!
Yes, the output should be just the predicted label for each row.
If the table you want to run the classification task on is like the
following:

 id | x  |  y
----+----+------
  1 | 10 | 10.5
  2 | 30 | 31.5
  3 | 20 | 22.5

then the output table could be something like the following:

 id | x  |  y   | predicted_label
----+----+------+-----------------
  1 | 10 | 10.5 | true
  2 | 30 | 31.5 | false
  3 | 20 | 22.5 | true

You are basically adding a new column called "predicted_label" to the
input table, and assigning the label for each row based on the k-NN.

We can certainly make it better, by modifying the kNN function interface.
But let's just keep it simple for now and work on that later.

NJ

On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

>
> Hi NJ,
>
> I have implemented a first version of interface as suggested by you. Right
> now, I am just looking at classification task. I will generalize it to work
> for regression task as well. I have a question regarding output of the
> function. Should it just be the predicted label (or prediction value in
> case of regression)? Can you give an example of output?
>
>
>
>
>
> Regards,
>
> Auon Haidar
>
> ________________________________
> From: Kazmi,Auon H <ak...@ufl.edu>
> Sent: Friday, November 18, 2016 3:16:00 AM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi NJ,
>
> Thanks for your inputs!
>
> I will go through every one of them and try to incorporate them.
>
>
>
> Regards,
>
> Auon Haidar
>
> ________________________________
> From: Nandish Jayaram <nj...@pivotal.io>
> Sent: Wednesday, November 16, 2016 2:29:05 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> Defining the interface is a good start for k-NN. I have slightly modified
> your interface to help it conform with other MADlib algorithms' interfaces.
> Note that the output for each new data point is not the 'k' nearest
> neighbors, but either a classification or regression task on the data point
> based on its 'k' nearest neighbors. Every data point in the training data
> will have an associated class label (regression value) in a different
> column. Normally, the column containing the data point itself is called the
> independent variable, and the column containing the class label is called
> the dependent variable. If it is classification, you take a majority vote
> of the class labels of the 'k' nearest neighbors, and if it is regression,
> you average the dependent variable values of the 'k' nearest neighbors.
> Here is a preliminary interface we could start with:
>
> *knn*(
> source_table, -- *TEXT, name of table containing training data.*
> new_data_table, -- *TEXT, name of table containing new data on which
> classification or regression has to be performed. Classification or
> regression can be performed based on the type of "dependent_varname".*
> output_table, -- *TEXT, name of the table where output predictors are
> written. If this table is already present, an error is returned.*
> dependent_varname, -- *TEXT, name of the dependent variable column. If
> this column is of type boolean/integer, we could probably perform k-NN
> classification, and perform k-NN regression if this is of type double.*
> independent_varname, -- *TEXT, column defining data points. Data points can
> be of type SVEC or any type convertible to SVEC such as float[] or
> integer[].*
> k, --* INTEGER, (optional, default value could be some odd number, say 5)
> number of neighbors to consider*
> metric, -- *TEXT, (optional, default value could be what you are using now
> for distance) the distance metric to use.*
> );
>
> For now you can just use the distance metric you had mentioned in an
> earlier email. Note that the source_table and new_data_table are tables in
> the database and not files.
>
> Some pointers to help you start off with the implementation:
> - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> is a very useful resource with a great hello-world example. It gives you
> details about how to add a new module (k-NN would be a new module) to
> MADlib.
> - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> Defined Aggregates) in your implementation. This will require you to add a
> C++ layer too, along with the SQL and python layers. Feel free to ask
> specific questions about this after you have tried out the hello world
> example.
> - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you
> more information regarding the C++ abstraction layer in MADlib.
>
> Feel free to shout out for help if you are stuck! Cheers. :)
>
> NJ
>
> On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> > Hi Frank and NJ,
> >
> > Thanks for your comments. I will go through the suggestions provided by
> > NJ.
> >
> > Current interface of KNN is as follows:
> >
> > 1) Input:
> >
> >        - Name of table having all the data points in n-dimensional vector
> >          form (Double Precision[ ])
> >
> >        - Column-name of these data points
> >
> >        - Name of file having that n-dim vector (v, say) whose k-nearest
> >          neighbours need to be found from first table (Double Precision[ ])
> >
> >        - Column name having this vector
> >
> >        - value of 'k'
> >
> >
> > It returns 'k' nearest neighbours of vector v from first table having data
> > points.
> >
> >
> >
> > For now, I am using madlib's squared norm function to calculate distance
> > between any two vectors. I will try to generalise that.
> >
> >
> > Please suggest any other improvements.
> >
> >
> >
> > Thanks,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Frank McQuillan <fm...@pivotal.io>
> > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Auon,
> >
> > Thanks for working on kNN for MADlib. Can you expand a little bit on your
> > note, and post the interface that you are thinking about and description
> > of the arguments? Then people can comment on that.
> >
> > Thanks,
> > Frank
> >
> > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io>
> > wrote:
> >
> > > Hi Auon,
> > >
> > > Great going with your first version of k-NN implementation.
> > > Some useful links for coding guidelines are at (see Developer
> > > Documentation):
> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > MADlib has something called install-checks for basic testing. You can
> > > look at any existing module for an example of the same. For instance,
> > > check out the install-check code for k-means at:
> > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > >
> > > I am sure others will pitch in to help you more with your other
> > > questions, but these are some starters you can consider! Good luck!
> > >
> > > NJ
> > >
> > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am a first year Computer Science graduate student at University of
> > > > Florida working on implementing KNN in Madlib. I am ready with a first
> > > > version of it but I don't know how to proceed with testing and adding
> > > > it to Madlib platform. Also, I am not clear on what standards do I
> > > > have to choose in the final implementation. My current version asks
> > > > for the table name and column name having vectors in which I have to
> > > > find the neighbours. The other table given as input holds the vector
> > > > whose K-NN needs to be found. It is assuming euclidean distance metric
> > > > for distance calculation. It would really help if somebody can share
> > > > ideas on what can be added to this functionality.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon Haidar Kazmi
> > > >
> > >
> >
>

Re: Adding KNN to madlib

Posted by "Kazmi,Auon H" <ak...@ufl.edu>.

Thanks NJ,

I will move forward in the suggested way.




Regards,

Auon

________________________________
From: Nandish Jayaram <nj...@pivotal.io>
Sent: Wednesday, November 23, 2016 12:20:35 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hey Auon,

Starting with only classification for now sounds like a good idea!
Yes, the output should be just the predicted label for each row.
If the table you want to run the classification task on is like the
following:

 id | x  |  y
----+----+------
  1 | 10 | 10.5
  2 | 30 | 31.5
  3 | 20 | 22.5

then the output table could be something like the following:

 id | x  |  y   | predicted_label
----+----+------+-----------------
  1 | 10 | 10.5 | true
  2 | 30 | 31.5 | false
  3 | 20 | 22.5 | true

You are basically adding a new column called "predicted_label" to the
input table, and assigning the label for each row based on the k-NN.

We can certainly make it better, by modifying the kNN function interface.
But let's just keep it simple for now and work on that later.

NJ

On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

>
> Hi NJ,
>
> I have implemented a first version of interface as suggested by you. Right
> now, I am just looking at classification task. I will generalize it to work
> for regression task as well. I have a question regarding output of the
> function. Should it just be the predicted label (or prediction value in
> case of regression)? Can you give an example of output?
>
>
>
>
>
> Regards,
>
> Auon Haidar
>
> ________________________________
> From: Kazmi,Auon H <ak...@ufl.edu>
> Sent: Friday, November 18, 2016 3:16:00 AM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi NJ,
>
> Thanks for your inputs!
>
> I will go through every one of them and try to incorporate them.
>
>
>
> Regards,
>
> Auon Haidar
>
> ________________________________
> From: Nandish Jayaram <nj...@pivotal.io>
> Sent: Wednesday, November 16, 2016 2:29:05 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> Defining the interface is a good start for k-NN. I have slightly modified
> your interface to help it conform with other MADlib algorithms' interfaces.
> Note that the output for each new data point is not the 'k' nearest
> neighbors, but either a classification or regression task on the data point
> based on its 'k' nearest neighbors. Every data point in the training data
> will have an associated class label (regression value) in a different
> column. Normally, the column containing the data point itself is called the
> independent variable, and the column containing the class label is called
> the dependent variable. If it is classification, you take a majority vote
> of the class labels of the 'k' nearest neighbors, and if it is regression,
> you average the dependent variable values of the 'k' nearest neighbors.
> Here is a preliminary interface we could start with:
>
> *knn*(
> source_table, -- *TEXT, name of table containing training data.*
> new_data_table, -- *TEXT, name of table containing new data on which
> classification or regression has to be performed. Classification or
> regression can be performed based on the type of "dependent_varname".*
> output_table, -- *TEXT, name of the table where output predictors are
> written. If this table is already present, an error is returned.*
> dependent_varname, -- *TEXT, name of the dependent variable column. If
> this column is of type boolean/integer, we could probably perform k-NN
> classification, and perform k-NN regression if this is of type double.*
> independent_varname, -- *TEXT, column defining data points. Data points can
> be of type SVEC or any type convertible to SVEC such as float[] or
> integer[].*
> k, --* INTEGER, (optional, default value could be some odd number, say 5)
> number of neighbors to consider*
> metric, -- *TEXT, (optional, default value could be what you are using now
> for distance) the distance metric to use.*
> );
>
> For now you can just use the distance metric you had mentioned in an
> earlier email. Note that the source_table and new_data_table are tables in
> the database and not files.
>
> Some pointers to help you start off with the implementation:
> - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> is a very useful resource with a great hello-world example. It gives you
> details about how to add a new module (k-NN would be a new module) to
> MADlib.
> - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> Defined Aggregates) in your implementation. This will require you to add a
> C++ layer too, along with the SQL and python layers. Feel free to ask
> specific questions about this after you have tried out the hello world
> example.
> - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you
> more information regarding the C++ abstraction layer in MADlib.
>
> Feel free to shout out for help if you are stuck! Cheers. :)
>
> NJ
>
> On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> > Hi Frank and NJ,
> >
> > Thanks for your comments. I will go through the suggestions provided by
> > NJ.
> >
> > Current interface of KNN is as follows:
> >
> > 1) Input:
> >
> >        - Name of table having all the data points in n-dimensional vector
> >          form (Double Precision[ ])
> >
> >        - Column-name of these data points
> >
> >        - Name of file having that n-dim vector (v, say) whose k-nearest
> >          neighbours need to be found from first table (Double Precision[ ])
> >
> >        - Column name having this vector
> >
> >        - value of 'k'
> >
> >
> > It returns 'k' nearest neighbours of vector v from first table having data
> > points.
> >
> >
> >
> > For now, I am using madlib's squared norm function to calculate distance
> > between any two vectors. I will try to generalise that.
> >
> >
> > Please suggest any other improvements.
> >
> >
> >
> > Thanks,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Frank McQuillan <fm...@pivotal.io>
> > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Auon,
> >
> > Thanks for working on kNN for MADlib. Can you expand a little bit on your
> > note, and post the interface that you are thinking about and description
> > of the arguments? Then people can comment on that.
> >
> > Thanks,
> > Frank
> >
> > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io>
> > wrote:
> >
> > > Hi Auon,
> > >
> > > Great going with your first version of k-NN implementation.
> > > Some useful links for coding guidelines are at (see Developer
> > > Documentation):
> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > MADlib has something called install-checks for basic testing. You can
> > > look at any existing module for an example of the same. For instance,
> > > check out the install-check code for k-means at:
> > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > >
> > > I am sure others will pitch in to help you more with your other
> > > questions, but these are some starters you can consider! Good luck!
> > >
> > > NJ
> > >
> > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am a first year Computer Science graduate student at University of
> > > > Florida working on implementing KNN in Madlib. I am ready with a first
> > > > version of it but I don't know how to proceed with testing and adding
> > > > it to Madlib platform. Also, I am not clear on what standards do I
> > > > have to choose in the final implementation. My current version asks
> > > > for the table name and column name having vectors in which I have to
> > > > find the neighbours. The other table given as input holds the vector
> > > > whose K-NN needs to be found. It is assuming euclidean distance metric
> > > > for distance calculation. It would really help if somebody can share
> > > > ideas on what can be added to this functionality.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon Haidar Kazmi
> > > >
> > >
> >
>

Re: Adding KNN to madlib

Posted by Nandish Jayaram <nj...@pivotal.io>.

Hey Auon,

Starting with only classification for now sounds like a good idea!
Yes, the output should be just the predicted label for each row.
If the table you want to run the classification task on is like the
following:

 id | x  |  y
----+----+------
  1 | 10 | 10.5
  2 | 30 | 31.5
  3 | 20 | 22.5

then the output table could be something like the following:

 id | x  |  y   | predicted_label
----+----+------+-----------------
  1 | 10 | 10.5 | true
  2 | 30 | 31.5 | false
  3 | 20 | 22.5 | true

You are basically adding a new column called "predicted_label" to the
input table, and assigning the label for each row based on the k-NN.

We can certainly make it better, by modifying the kNN function interface.
But let's just keep it simple for now and work on that later.
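
As a rough sketch of the output shape described above (plain Python rather
than MADlib SQL): the training rows, `add_predicted_label` and
`nearest_label` are all made up for illustration, and 1-NN is used only to
keep the example short.

```python
def add_predicted_label(new_rows, predict):
    """Return copies of new_rows with an extra 'predicted_label' key.

    The input rows are left untouched, mirroring the idea of writing a
    separate output table instead of modifying the input table.
    """
    return [dict(row, predicted_label=predict((row["x"], row["y"])))
            for row in new_rows]


# Assumed training data: (point, label) pairs. Not taken from the thread.
train = [((10.0, 10.0), True), ((20.0, 20.0), True), ((30.0, 30.0), False)]


def nearest_label(point):
    """Label of the single closest training point (k = 1, squared distance)."""
    closest = min(
        train,
        key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], point)),
    )
    return closest[1]


new_rows = [
    {"id": 1, "x": 10, "y": 10.5},
    {"id": 2, "x": 30, "y": 31.5},
    {"id": 3, "x": 20, "y": 22.5},
]

for row in add_predicted_label(new_rows, nearest_label):
    print(row)
```

With this particular (assumed) training set, the three rows come out labelled
true, false, true, matching the example table.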

NJ

On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

>
> Hi NJ,
>
> I have implemented a first version of interface as suggested by you. Right
> now, I am just looking at classification task. I will generalize it to work
> for regression task as well. I have a question regarding output of the
> function. Should it just be the predicted label (or prediction value in
> case of regression)? Can you give an example of output?
>
>
>
>
>
> Regards,
>
> Auon Haidar
>
> ________________________________
> From: Kazmi,Auon H <ak...@ufl.edu>
> Sent: Friday, November 18, 2016 3:16:00 AM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi NJ,
>
> Thanks for your inputs!
>
> I will go through everyone of them and try to incorporate them.
>
>
>
> Regards,
>
> Auon Haidar
>
> ________________________________
> From: Nandish Jayaram <nj...@pivotal.io>
> Sent: Wednesday, November 16, 2016 2:29:05 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> Defining the interface is a good start for k-NN. I have slightly modified
> your interface to help it conform with other MADlib algorithms' interfaces.
> Note that the output for each new data point is not the 'k' nearest
> neighbors, but either a classification or regression task on the data point
> based on its 'k' nearest neighbors. Every data point in the training data
> will have an associated class label (regression value) in a different
> column. Normally, the column containing the data point itself is called the
> independent variable, and the column containing the class label is called
> the dependent variable. If it is classification, you take a majority vote
> of the class labels of the 'k' nearest neighbors, and if it is regression,
> you average the dependent variable values of the 'k' nearest neighbors.
> Here is a preliminary interface we could start with:
>
> *knn*(
> source_table, -- *TEXT, name of table containing training data.*
> new_data_table, -- *TEXT, name of table containing new data on which
> classification or regression has to be performed. Classification or
> regression can be performed based on the type of "dependent_varname".*
> output_table, -- *TEXT, name of the table where output predictors are
> written. If this table is already present, an error is returned.*
> dependent_varname, -- *TEXT, name of the independent variable column. If
> this column is of type boolean/integer, we could probably perform k-NN
> classification, and perform k-NN regression if this is of type double.*
> independent_varname, -- *TEXT, column defining data points. Data points can
> be of type SVEC or any type convertible to SVEC such as float[] or
> integer[].*
> k, --* INTEGER, (optional, default value could be some odd number, say 5)
> number of neighbors to consider*
> metric, -- *TEXT, (optional, default value could be what you are using now
> for distance) the distance metric to use.*
> );
>
> For now you can just use the distance metric you had mentioned in an
> earlier email. Note that the source_table and new_data_table are tables in
> the database and not files.
>
> Some pointers to help you start off with the implementation:
> -
> https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+
> Developers
> is a very useful resource with a great hello-world example. It gives you
> details about how to add a new module (k-NN would be a new module) to
> MADlib.
> - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> Defined Aggregates) in your implementation. This will require you to add a
> C++ layer too, along with the SQL and python layers. Feel free to ask
> specific questions about this after you have tried out the hello world
> example.
> - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you
> more
> information regarding the C++ abstraction layer in MADlib.
>
> Feel free to shout out for help if you are stuck! Cheers. :)
>
> NJ
>
> On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> > Hi Frank and NJ,
> >
> > Thanks for your comments. I will go through the suggestions provided by
> NJ.
> >
> > Current interface of KNN is as follows:
> >
> > 1) Input:
> >
> >        - Name of table having all the data points in n-dimensional vector
> > form (Double                              Precision[ ])
> >
> >        - Column-name of these data points
> >
> >        - Name of file having that n-dim vector (v, say) whose k-nearest
> > neighbours need to be               found from first table (Double
> > Precision[ ])
> >
> >        - Column name having this vector
> >
> >        - value of 'k'
> >
> >
> > It returns 'k' nearest neighbours of vector v from first table having
> data
> > points.
> >
> >
> >
> > For now, I am using madlib's squared norm function to calculate distance
> > between any two vectors. I will try to generalise that.
> >
> >
> > Please suggest any other improvements.
> >
> >
> >
> > Thanks,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Frank McQuillan <fm...@pivotal.io>
> > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Auon,
> >
> > Thanks for working on kNN for MADlib.   Can you expand a little bit on
> your
> > note, and post the interface that you are thinking about and description
> of
> > the arguments?  Then people can comment on that.
> >
> > Thanks,
> > Frank
> >
> > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io>
> > wrote:
> >
> > > Hi Auon,
> > >
> > > Great going with your first version of k-NN implementation.
> > > Some useful links for coding guidelines are at (see Developer
> > > Documentation):
> > > https://cwiki.apache.org/confluence/pages/viewpage.
> > action?pageId=61319606
> > > MADilb has something called as install-checks for basic testing. You
> can
> > > look at any existing module for an example of the same. For instance,
> > check
> > > out the install check code for k-means at:
> > > https://github.com/apache/incubator-madlib/tree/master/
> > > src/ports/postgres/modules/kmeans/test
> > >
> > > I am sure others will pitch in to help you more with your other
> > questions,
> > > but these are some starters you can consider! Good luck!
> > >
> > > NJ
> > >
> > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am a first year Computer Science graduate student at University of
> > > > Florida working on implementing KNN in Madlib. I am ready with a
> first
> > > > version of it but I don't know how to proceed with testing and adding
> > it
> > > to
> > > > Madlib platform. Also, I am not clear on what standards do I have to
> > > choose
> > > > in the final implementation. My current version asks for the table
> name
> > > and
> > > > column name having vectors in which I have to find the neighbours.
> The
> > > > other table given as input holds the vector whose K-NN needs to be
> > found.
> > > > It is assuming euclidean distance metric for distance calculation. It
> > > would
> > > > really help if somebody can share ideas on what can be added to this
> > > > functionality.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon Haidar Kazmi
> > > >
> > >
> >
>

Re: Adding KNN to madlib

Posted by "Kazmi,Auon H" <ak...@ufl.edu>.

Hi NJ,

I have implemented a first version of the interface as you suggested. Right now, I am looking only at the classification task; I will generalize it to work for the regression task as well. I have a question regarding the output of the function: should it just be the predicted label (or the predicted value in the case of regression)? Can you give an example of the output?

Regards,

Auon Haidar

________________________________
From: Kazmi,Auon H <ak...@ufl.edu>
Sent: Friday, November 18, 2016 3:16:00 AM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi NJ,

Thanks for your inputs!

I will go through everyone of them and try to incorporate them.

Regards,

Auon Haidar

________________________________
From: Nandish Jayaram <nj...@pivotal.io>
Sent: Wednesday, November 16, 2016 2:29:05 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi Auon,

Defining the interface is a good start for k-NN. I have slightly modified
your interface to help it conform with other MADlib algorithms' interfaces.
Note that the output for each new data point is not the 'k' nearest
neighbors, but either a classification or regression task on the data point
based on its 'k' nearest neighbors. Every data point in the training data
will have an associated class label (regression value) in a different
column. Normally, the column containing the data point itself is called the
independent variable, and the column containing the class label is called
the dependent variable. If it is classification, you take a majority vote
of the class labels of the 'k' nearest neighbors, and if it is regression,
you average the dependent variable values of the 'k' nearest neighbors.
Here is a preliminary interface we could start with:

*knn*(
source_table, -- *TEXT, name of table containing training data.*
new_data_table, -- *TEXT, name of table containing new data on which
classification or regression has to be performed. Classification or
regression can be performed based on the type of "dependent_varname".*
output_table, -- *TEXT, name of the table where output predictors are
written. If this table is already present, an error is returned.*
dependent_varname, -- *TEXT, name of the independent variable column. If
this column is of type boolean/integer, we could probably perform k-NN
classification, and perform k-NN regression if this is of type double.*
independent_varname, -- *TEXT, column defining data points. Data points can
be of type SVEC or any type convertible to SVEC such as float[] or
integer[].*
k, --* INTEGER, (optional, default value could be some odd number, say 5)
number of neighbors to consider*
metric, -- *TEXT, (optional, default value could be what you are using now
for distance) the distance metric to use.*
);

For now you can just use the distance metric you had mentioned in an
earlier email. Note that the source_table and new_data_table are tables in
the database and not files.

Some pointers to help you start off with the implementation:
-
https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
is a very useful resource with a great hello-world example. It gives you
details about how to add a new module (k-NN would be a new module) to
MADlib.
- k-NN is a great candidate for parallelizing. Do try to use UDA (User
Defined Aggregates) in your implementation. This will require you to add a
C++ layer too, along with the SQL and python layers. Feel free to ask
specific questions about this after you have tried out the hello world
example.
- Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you more
information regarding the C++ abstraction layer in MADlib.

Feel free to shout out for help if you are stuck! Cheers. :)

NJ

On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi Frank and NJ,
>
> Thanks for your comments. I will go through the suggestions provided by NJ.
>
> Current interface of KNN is as follows:
>
> 1) Input:
>
>        - Name of table having all the data points in n-dimensional vector
> form (Double                              Precision[ ])
>
>        - Column-name of these data points
>
>        - Name of file having that n-dim vector (v, say) whose k-nearest
> neighbours need to be               found from first table (Double
> Precision[ ])
>
>        - Column name having this vector
>
>        - value of 'k'
>
>
> It returns 'k' nearest neighbours of vector v from first table having data
> points.
>
>
>
> For now, I am using madlib's squared norm function to calculate distance
> between any two vectors. I will try to generalise that.
>
>
> Please suggest any other improvements.
>
>
>
> Thanks,
>
> Auon Haidar
>
> ________________________________
> From: Frank McQuillan <fm...@pivotal.io>
> Sent: Tuesday, November 15, 2016 1:30:53 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Auon,
>
> Thanks for working on kNN for MADlib.   Can you expand a little bit on your
> note, and post the interface that you are thinking about and description of
> the arguments?  Then people can comment on that.
>
> Thanks,
> Frank
>
> On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io>
> wrote:
>
> > Hi Auon,
> >
> > Great going with your first version of k-NN implementation.
> > Some useful links for coding guidelines are at (see Developer
> > Documentation):
> > https://cwiki.apache.org/confluence/pages/viewpage.
> action?pageId=61319606
> > MADilb has something called as install-checks for basic testing. You can
> > look at any existing module for an example of the same. For instance,
> check
> > out the install check code for k-means at:
> > https://github.com/apache/incubator-madlib/tree/master/
> > src/ports/postgres/modules/kmeans/test
> >
> > I am sure others will pitch in to help you more with your other
> questions,
> > but these are some starters you can consider! Good luck!
> >
> > NJ
> >
> > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> >
> > > Hi,
> > >
> > > I am a first year Computer Science graduate student at University of
> > > Florida working on implementing KNN in Madlib. I am ready with a first
> > > version of it but I don't know how to proceed with testing and adding
> it
> > to
> > > Madlib platform. Also, I am not clear on what standards do I have to
> > choose
> > > in the final implementation. My current version asks for the table name
> > and
> > > column name having vectors in which I have to find the neighbours. The
> > > other table given as input holds the vector whose K-NN needs to be
> found.
> > > It is assuming euclidean distance metric for distance calculation. It
> > would
> > > really help if somebody can share ideas on what can be added to this
> > > functionality.
> > >
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon Haidar Kazmi
> > >
> >
>

Re: Adding KNN to madlib

Posted by "Kazmi,Auon H" <ak...@ufl.edu>.

Hi NJ,

Thanks for your inputs!

I will go through every one of them and try to incorporate them.

Regards,

Auon Haidar

________________________________
From: Nandish Jayaram <nj...@pivotal.io>
Sent: Wednesday, November 16, 2016 2:29:05 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi Auon,

Defining the interface is a good start for k-NN. I have slightly modified
your interface to help it conform with other MADlib algorithms' interfaces.
Note that the output for each new data point is not the 'k' nearest
neighbors, but either a classification or regression task on the data point
based on its 'k' nearest neighbors. Every data point in the training data
will have an associated class label (regression value) in a different
column. Normally, the column containing the data point itself is called the
independent variable, and the column containing the class label is called
the dependent variable. If it is classification, you take a majority vote
of the class labels of the 'k' nearest neighbors, and if it is regression,
you average the dependent variable values of the 'k' nearest neighbors.
Here is a preliminary interface we could start with:

*knn*(
source_table, -- *TEXT, name of table containing training data.*
new_data_table, -- *TEXT, name of table containing new data on which
classification or regression has to be performed. Classification or
regression can be performed based on the type of "dependent_varname".*
output_table, -- *TEXT, name of the table where output predictors are
written. If this table is already present, an error is returned.*
dependent_varname, -- *TEXT, name of the independent variable column. If
this column is of type boolean/integer, we could probably perform k-NN
classification, and perform k-NN regression if this is of type double.*
independent_varname, -- *TEXT, column defining data points. Data points can
be of type SVEC or any type convertible to SVEC such as float[] or
integer[].*
k, --* INTEGER, (optional, default value could be some odd number, say 5)
number of neighbors to consider*
metric, -- *TEXT, (optional, default value could be what you are using now
for distance) the distance metric to use.*
);

For now you can just use the distance metric you had mentioned in an
earlier email. Note that the source_table and new_data_table are tables in
the database and not files.

Some pointers to help you start off with the implementation:
-
https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
is a very useful resource with a great hello-world example. It gives you
details about how to add a new module (k-NN would be a new module) to
MADlib.
- k-NN is a great candidate for parallelizing. Do try to use UDA (User
Defined Aggregates) in your implementation. This will require you to add a
C++ layer too, along with the SQL and python layers. Feel free to ask
specific questions about this after you have tried out the hello world
example.
- Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you more
information regarding the C++ abstraction layer in MADlib.

Feel free to shout out for help if you are stuck! Cheers. :)

NJ

On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi Frank and NJ,
>
> Thanks for your comments. I will go through the suggestions provided by NJ.
>
> Current interface of KNN is as follows:
>
> 1) Input:
>
>        - Name of table having all the data points in n-dimensional vector
> form (Double                              Precision[ ])
>
>        - Column-name of these data points
>
>        - Name of file having that n-dim vector (v, say) whose k-nearest
> neighbours need to be               found from first table (Double
> Precision[ ])
>
>        - Column name having this vector
>
>        - value of 'k'
>
>
> It returns 'k' nearest neighbours of vector v from first table having data
> points.
>
>
>
> For now, I am using madlib's squared norm function to calculate distance
> between any two vectors. I will try to generalise that.
>
>
> Please suggest any other improvements.
>
>
>
> Thanks,
>
> Auon Haidar
>
> ________________________________
> From: Frank McQuillan <fm...@pivotal.io>
> Sent: Tuesday, November 15, 2016 1:30:53 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Auon,
>
> Thanks for working on kNN for MADlib.   Can you expand a little bit on your
> note, and post the interface that you are thinking about and description of
> the arguments?  Then people can comment on that.
>
> Thanks,
> Frank
>
> On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io>
> wrote:
>
> > Hi Auon,
> >
> > Great going with your first version of k-NN implementation.
> > Some useful links for coding guidelines are at (see Developer
> > Documentation):
> > https://cwiki.apache.org/confluence/pages/viewpage.
> action?pageId=61319606
> > MADilb has something called as install-checks for basic testing. You can
> > look at any existing module for an example of the same. For instance,
> check
> > out the install check code for k-means at:
> > https://github.com/apache/incubator-madlib/tree/master/
> > src/ports/postgres/modules/kmeans/test
> >
> > I am sure others will pitch in to help you more with your other
> questions,
> > but these are some starters you can consider! Good luck!
> >
> > NJ
> >
> > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> >
> > > Hi,
> > >
> > > I am a first year Computer Science graduate student at University of
> > > Florida working on implementing KNN in Madlib. I am ready with a first
> > > version of it but I don't know how to proceed with testing and adding
> it
> > to
> > > Madlib platform. Also, I am not clear on what standards do I have to
> > choose
> > > in the final implementation. My current version asks for the table name
> > and
> > > column name having vectors in which I have to find the neighbours. The
> > > other table given as input holds the vector whose K-NN needs to be
> found.
> > > It is assuming euclidean distance metric for distance calculation. It
> > would
> > > really help if somebody can share ideas on what can be added to this
> > > functionality.
> > >
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon Haidar Kazmi
> > >
> >
>

Re: Adding KNN to madlib

Posted by Nandish Jayaram <nj...@pivotal.io>.

Hi Auon,

Defining the interface is a good start for k-NN. I have slightly modified
your interface to help it conform with other MADlib algorithms' interfaces.
Note that the output for each new data point is not the 'k' nearest
neighbors, but either a classification or regression task on the data point
based on its 'k' nearest neighbors. Every data point in the training data
will have an associated class label (regression value) in a different
column. Normally, the column containing the data point itself is called the
independent variable, and the column containing the class label is called
the dependent variable. If it is classification, you take a majority vote
of the class labels of the 'k' nearest neighbors, and if it is regression,
you average the dependent variable values of the 'k' nearest neighbors.
Here is a preliminary interface we could start with:

*knn*(
source_table, -- *TEXT, name of table containing training data.*
new_data_table, -- *TEXT, name of table containing new data on which
classification or regression has to be performed. Classification or
regression can be performed based on the type of "dependent_varname".*
output_table, -- *TEXT, name of the table where output predictors are
written. If this table is already present, an error is returned.*
dependent_varname, -- *TEXT, name of the dependent variable column. If
this column is of type boolean/integer, we could probably perform k-NN
classification, and perform k-NN regression if this is of type double.*
independent_varname, -- *TEXT, column defining data points. Data points can
be of type SVEC or any type convertible to SVEC such as float[] or
integer[].*
k, -- *INTEGER, (optional, default value could be some odd number, say 5)
number of neighbors to consider*
metric, -- *TEXT, (optional, default value could be what you are using now
for distance) the distance metric to use.*
);

For now you can just use the distance metric you had mentioned in an
earlier email. Note that the source_table and new_data_table are tables in
the database and not files.
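
To make the two tasks concrete, here is a minimal in-memory Python sketch,
not MADlib code. `knn_predict` and `squared_dist` are hypothetical names; the
rule "float labels mean regression, boolean/integer labels mean
classification" simply follows the suggestion for "dependent_varname" above,
and squared Euclidean distance is assumed as the metric.

```python
from collections import Counter


def squared_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, point, k=5):
    """Predict a value for `point` from its k nearest training rows.

    train: list of (vector, label) pairs, i.e. the independent and
    dependent variable of each training row.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    neighbors = sorted(train, key=lambda row: squared_dist(row[0], point))[:k]
    labels = [label for _, label in neighbors]
    if all(isinstance(label, float) for label in labels):
        # Regression: average the dependent values of the neighbors.
        return sum(labels) / len(labels)
    # Classification: majority vote over the neighbors' class labels.
    return Counter(labels).most_common(1)[0][0]


classes = [([1.0, 1.0], True), ([1.5, 2.0], True), ([2.0, 1.0], True),
           ([8.0, 8.0], False), ([9.0, 7.5], False)]
print(knn_predict(classes, [1.2, 1.4], k=3))   # True

values = [([1.0], 10.0), ([2.0], 20.0), ([3.0], 30.0), ([10.0], 100.0)]
print(knn_predict(values, [2.1], k=3))         # 20.0
```

An odd default for k, as suggested above, avoids tied votes when there are
only two classes.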

Some pointers to help you start off with the implementation:
-
https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
is a very useful resource with a great hello-world example. It gives you
details about how to add a new module (k-NN would be a new module) to
MADlib.
- k-NN is a great candidate for parallelizing. Do try to use UDA (User
Defined Aggregates) in your implementation. This will require you to add a
C++ layer too, along with the SQL and python layers. Feel free to ask
specific questions about this after you have tried out the hello world
example.
- Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you more
information regarding the C++ abstraction layer in MADlib.
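
As a rough illustration of why k-NN fits the aggregate pattern: a
user-defined aggregate is typically built from a transition function (fold
one row into a state), a merge function (combine the states of two
segments), and a final function. The Python below only mimics that shape for
a single query point; the function names, the fixed K, and the
(distance, label) state layout are assumptions for this sketch, not MADlib's
actual C++ or SQL interface.

```python
import heapq
from collections import Counter

K = 3  # assumed number of neighbors


def transition(state, dist, label):
    """Fold one training row into the state: the K closest rows seen so far."""
    return heapq.nsmallest(K, state + [(dist, label)], key=lambda p: p[0])


def merge(state_a, state_b):
    """Combine the partial states computed on two different segments."""
    return heapq.nsmallest(K, state_a + state_b, key=lambda p: p[0])


def final(state):
    """Majority vote over the labels of the K nearest rows."""
    return Counter(label for _, label in state).most_common(1)[0][0]


def run_segment(rows):
    """Apply the transition function to every row held by one segment."""
    state = []
    for dist, label in rows:
        state = transition(state, dist, label)
    return state


# (distance to the query point, class label) per training row, split
# across two hypothetical segments.
segment_1 = [(4.0, "a"), (0.5, "b"), (9.0, "a")]
segment_2 = [(0.2, "b"), (1.0, "a"), (7.0, "a")]

merged = merge(run_segment(segment_1), run_segment(segment_2))
print(final(merged))  # b
```

Each segment only ever keeps K rows, and merging the per-segment states gives
the same K nearest rows as a single pass over all the data.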

Feel free to shout out for help if you are stuck! Cheers. :)

NJ

On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi Frank and NJ,
>
> Thanks for your comments. I will go through the suggestions provided by NJ.
>
> Current interface of KNN is as follows:
>
> 1) Input:
>
>        - Name of table having all the data points in n-dimensional vector
> form (Double                              Precision[ ])
>
>        - Column-name of these data points
>
>        - Name of file having that n-dim vector (v, say) whose k-nearest
> neighbours need to be               found from first table (Double
> Precision[ ])
>
>        - Column name having this vector
>
>        - value of 'k'
>
>
> It returns 'k' nearest neighbours of vector v from first table having data
> points.
>
>
>
> For now, I am using madlib's squared norm function to calculate distance
> between any two vectors. I will try to generalise that.
>
>
> Please suggest any other improvements.
>
>
>
> Thanks,
>
> Auon Haidar
>
> ________________________________
> From: Frank McQuillan <fm...@pivotal.io>
> Sent: Tuesday, November 15, 2016 1:30:53 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Auon,
>
> Thanks for working on kNN for MADlib.   Can you expand a little bit on your
> note, and post the interface that you are thinking about and description of
> the arguments?  Then people can comment on that.
>
> Thanks,
> Frank
>
> On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io>
> wrote:
>
> > Hi Auon,
> >
> > Great going with your first version of k-NN implementation.
> > Some useful links for coding guidelines are at (see Developer
> > Documentation):
> > https://cwiki.apache.org/confluence/pages/viewpage.
> action?pageId=61319606
> > MADilb has something called as install-checks for basic testing. You can
> > look at any existing module for an example of the same. For instance,
> check
> > out the install check code for k-means at:
> > https://github.com/apache/incubator-madlib/tree/master/
> > src/ports/postgres/modules/kmeans/test
> >
> > I am sure others will pitch in to help you more with your other
> questions,
> > but these are some starters you can consider! Good luck!
> >
> > NJ
> >
> > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
> >
> > > Hi,
> > >
> > > I am a first year Computer Science graduate student at University of
> > > Florida working on implementing KNN in Madlib. I am ready with a first
> > > version of it but I don't know how to proceed with testing and adding
> it
> > to
> > > Madlib platform. Also, I am not clear on what standards do I have to
> > choose
> > > in the final implementation. My current version asks for the table name
> > and
> > > column name having vectors in which I have to find the neighbours. The
> > > other table given as input holds the vector whose K-NN needs to be
> found.
> > > It is assuming euclidean distance metric for distance calculation. It
> > would
> > > really help if somebody can share ideas on what can be added to this
> > > functionality.
> > >
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon Haidar Kazmi
> > >
> >
>

Re: Adding KNN to madlib

Posted by "Kazmi,Auon H" <ak...@ufl.edu>.

Hi Frank and NJ,

Thanks for your comments. I will go through the suggestions provided by NJ.

Current interface of KNN is as follows:

1) Input:

       - Name of table having all the data points in n-dimensional vector form (Double Precision[ ])

       - Column-name of these data points

       - Name of file having that n-dim vector (v, say) whose k-nearest neighbours need to be found from first table (Double Precision[ ])

       - Column name having this vector

       - value of 'k'


It returns 'k' nearest neighbours of vector v from first table having data points.



For now, I am using madlib's squared norm function to calculate distance between any two vectors. I will try to generalise that.
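
A minimal sketch of the behaviour described above, in plain Python rather
than the actual implementation: `squared_dist` stands in for the squared
norm of the difference between two vectors, and `k_nearest` is a
hypothetical helper name. Squared distance gives the same neighbour ordering
as Euclidean distance, so no square root is needed.

```python
import heapq


def squared_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(points, v, k):
    """Return the k points closest to v, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: squared_dist(p, v))


points = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [2.0, 2.0]]
print(k_nearest(points, [1.2, 0.9], 2))  # [[1.0, 1.0], [2.0, 2.0]]
```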


Please suggest any other improvements.



Thanks,

Auon Haidar

________________________________
From: Frank McQuillan <fm...@pivotal.io>
Sent: Tuesday, November 15, 2016 1:30:53 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Auon,

Thanks for working on kNN for MADlib.   Can you expand a little bit on your
note, and post the interface that you are thinking about and description of
the arguments?  Then people can comment on that.

Thanks,
Frank

On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io>
wrote:

> Hi Auon,
>
> Great going with your first version of k-NN implementation.
> Some useful links for coding guidelines are at (see Developer
> Documentation):
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> MADilb has something called as install-checks for basic testing. You can
> look at any existing module for an example of the same. For instance, check
> out the install check code for k-means at:
> https://github.com/apache/incubator-madlib/tree/master/
> src/ports/postgres/modules/kmeans/test
>
> I am sure others will pitch in to help you more with your other questions,
> but these are some starters you can consider! Good luck!
>
> NJ
>
> On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> > Hi,
> >
> > I am a first year Computer Science graduate student at University of
> > Florida working on implementing KNN in Madlib. I am ready with a first
> > version of it but I don't know how to proceed with testing and adding it
> to
> > Madlib platform. Also, I am not clear on what standards do I have to
> choose
> > in the final implementation. My current version asks for the table name
> and
> > column name having vectors in which I have to find the neighbours. The
> > other table given as input holds the vector whose K-NN needs to be found.
> > It is assuming euclidean distance metric for distance calculation. It
> would
> > really help if somebody can share ideas on what can be added to this
> > functionality.
> >
> >
> >
> >
> >
> > Regards,
> >
> > Auon Haidar Kazmi
> >
>

Re: Adding KNN to madlib

Posted by Frank McQuillan <fm...@pivotal.io>.

Auon,

Thanks for working on kNN for MADlib. Can you expand a little bit on your
note, and post the interface that you are thinking about along with a
description of the arguments? Then people can comment on that.

Thanks,
Frank

On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <nj...@pivotal.io>
wrote:

> Hi Auon,
>
> Great going with your first version of k-NN implementation.
> Some useful links for coding guidelines are at (see Developer
> Documentation):
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> MADilb has something called as install-checks for basic testing. You can
> look at any existing module for an example of the same. For instance, check
> out the install check code for k-means at:
> https://github.com/apache/incubator-madlib/tree/master/
> src/ports/postgres/modules/kmeans/test
>
> I am sure others will pitch in to help you more with your other questions,
> but these are some starters you can consider! Good luck!
>
> NJ
>
> On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:
>
> > Hi,
> >
> > I am a first year Computer Science graduate student at University of
> > Florida working on implementing KNN in Madlib. I am ready with a first
> > version of it but I don't know how to proceed with testing and adding it
> to
> > Madlib platform. Also, I am not clear on what standards do I have to
> choose
> > in the final implementation. My current version asks for the table name
> and
> > column name having vectors in which I have to find the neighbours. The
> > other table given as input holds the vector whose K-NN needs to be found.
> > It is assuming euclidean distance metric for distance calculation. It
> would
> > really help if somebody can share ideas on what can be added to this
> > functionality.
> >
> >
> >
> >
> >
> > Regards,
> >
> > Auon Haidar Kazmi
> >
>

Re: Adding KNN to madlib

Posted by Nandish Jayaram <nj...@pivotal.io>.

Hi Auon,

Great going with your first version of the k-NN implementation.
Some useful links for coding guidelines are at (see Developer
Documentation):
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
MADlib has something called install-checks for basic testing. You can
look at any existing module for an example. For instance, check
out the install-check code for k-means at:
https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test

I am sure others will pitch in to help you more with your other questions,
but these are some starters you can consider! Good luck!

NJ

On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <ak...@ufl.edu> wrote:

> Hi,
>
> I am a first year Computer Science graduate student at University of
> Florida working on implementing KNN in Madlib. I am ready with a first
> version of it but I don't know how to proceed with testing and adding it to
> Madlib platform. Also, I am not clear on what standards do I have to choose
> in the final implementation. My current version asks for the table name and
> column name having vectors in which I have to find the neighbours. The
> other table given as input holds the vector whose K-NN needs to be found.
> It is assuming euclidean distance metric for distance calculation. It would
> really help if somebody can share ideas on what can be added to this
> functionality.
>
>
>
>
>
> Regards,
>
> Auon Haidar Kazmi
>