Posted to dev@madlib.apache.org by Babak Alipour <ba...@gmail.com> on 2016/03/29 00:34:03 UTC

kNN implementation

Greetings everyone,

I am Babak Alipour, a student at the University of Florida. I have been
using MADlib and was hoping to use kNN classification, which is
unfortunately not available, so I decided to give implementing it a shot.

Looking at the issue tracker, I found two JIRAs regarding kNN: MADLIB-409
and MADLIB-927.
The more recent JIRA mentions that a naive implementation, i.e. linearly
searching through the data, is expected.
I have a few questions regarding details the JIRA doesn't specify.
Generally, what is the interface of the module? This involves questions
such as: where is the user expected to provide k, whether to use distance
weighting, and which distance metric to use (Manhattan, Euclidean,
Minkowski with some p > 2)?
Another question is how the user should specify the data points whose
k-nearest neighbors are desired. Is it some subset of the original data
table, or points from another table with the same schema as the original
data table?
Also, are the output points to be kept in a separate table?

I'd love to hear some feedback from the community so that I can move
forward with the implementation.

Thanks in advance for your time.


Best regards,
*Babak Alipour,*
*University of Florida*

Re: kNN implementation

Posted by Frank McQuillan <fm...@pivotal.io>.
Hi Babak,

I noticed the KNN poster
http://dsr.cise.ufl.edu/wp-content/uploads/2016/05/MADlib_Combined.pptx.pdf
and was wondering if you have plans to make a pull request?

Frank


Re: kNN implementation

Posted by Xiaocheng Tang <xt...@pivotal.io>.
Thanks Babak. We will review the interface and get back to you soon.

Meanwhile, you may go ahead with the implementation, for which I would
suggest that you describe your approach in a design doc and share it here
first before actually coding it up. The design doc should include the
necessary math formulations and relevant references, if there are any.



Re: kNN implementation

Posted by Babak Alipour <ba...@gmail.com>.
Thank you so much, Xiaocheng, for your reply.

I have actually been using the k-means interface and modifying it to fit
kNN's needs.

The function definitions as I have been thinking about them are:

FUNCTION MADLIB_SCHEMA.knn(
        rel_train VARCHAR,
        rel_test VARCHAR,
        ignore_col_train VARCHAR,
        fn_dist VARCHAR,
        fn_dist_weighting VARCHAR,
        k INTEGER
) RETURNS knn_result

@param rel_train   Name of the relation containing the training input points.

@param rel_test    Name of the relation containing the test input points
(these are the points whose k-nearest neighbors are to be found in
rel_train).

@param ignore_col_train  Name of a column to be ignored (e.g. the class
column, containing labels that can be used for supervised learning
elsewhere).

@param fn_dist  Name of a function with signature <tt>DOUBLE PRECISION[] x
DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that returns the distance
between two points. The default is the \ref
squared_dist_norm2(float8[],float8[]) "squared Euclidean distance" (based
on the k-means example: take two points, output their distance).
I'd rather use the Minkowski distance and add another parameter, P, with a
default of 2, so that the function is more general (P=1 gives Manhattan,
P=2 Euclidean, etc.).
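
A minimal sketch of such a UDF, assuming PostgreSQL 9.4+ for multi-array
unnest() (minkowski_dist is a hypothetical name, not an existing MADlib
function):

CREATE OR REPLACE FUNCTION minkowski_dist(
    a DOUBLE PRECISION[], b DOUBLE PRECISION[], p DOUBLE PRECISION)
RETURNS DOUBLE PRECISION AS $$
    -- Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p);
    -- p = 1 gives Manhattan, p = 2 gives Euclidean.
    SELECT power(sum(power(abs(x - y), p)), 1.0 / p)
    FROM unnest(a, b) AS t(x, y);
$$ LANGUAGE sql IMMUTABLE STRICT;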

@param fn_dist_weighting  Name of a function with signature <tt>DOUBLE
PRECISION -> DOUBLE PRECISION</tt> that, for each distance, returns a
weighted distance value (e.g. none, 1-d, and 1/d weighting schemes).
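
For example, the 1/d scheme could be a one-line UDF along these lines
(weight_inverse is a hypothetical name; the epsilon is my own addition to
avoid division by zero on exact matches):

CREATE OR REPLACE FUNCTION weight_inverse(d DOUBLE PRECISION)
RETURNS DOUBLE PRECISION AS $$
    SELECT 1.0 / (d + 1e-12);  -- small epsilon guards against d = 0
$$ LANGUAGE sql IMMUTABLE STRICT;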

@param k  Number of nearest neighbors.


@returns a composite value

Type knn_result
    indexes INTEGER[][],
    distances DOUBLE PRECISION[][]
- <tt>indexes</tt> - an n-by-k matrix of the k-nearest neighbors' indexes,
where n is the size of rel_test (i.e. |rel_test|) and k is the input
parameter specifying the number of nearest neighbors. Row i holds, in
columns 0 to k-1, the indexes of the k-nearest neighbors of the i-th data
point of rel_test.
- <tt>distances</tt> - an n-by-k matrix of the corresponding distances.
In other words, the row numbers of the k-nearest neighbors in rel_train,
for the i-th data point of rel_test, are in indexes[i][0:k-1], and the
distances to those data points are in distances[i][0:k-1].
For the implementation, I'm not quite sure how to store the results so
that I can efficiently update the k nearest points while a pass over the
data is being made. Since kNN is non-iterative, I think it is possible to
implement it completely in SQL, without calling external driver functions,
as sketched below. It might also be a good idea to store the results in
another table instead of returning a type like this, or to put the results
in a temp table and then return them; feedback is always welcome.
I'm deliberately not doing majority voting, because I do not want to limit
this implementation to kNN classification; rather, it should be a general
kNN primitive, on top of which we could later build a method that runs a
majority vote for classification or an average for regression.
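
A minimal pure-SQL sketch of the brute-force search, assuming each
relation has an integer id column and a point column pt (table and column
names are illustrative) and reusing MADlib's existing squared_dist_norm2:

-- Cross join test x train, rank the training points by distance for each
-- test point, and keep the k closest (here k = 3).
SELECT test_id, train_id, dist
FROM (
    SELECT t.id AS test_id,
           r.id AS train_id,
           madlib.squared_dist_norm2(t.pt, r.pt) AS dist,
           row_number() OVER (
               PARTITION BY t.id
               ORDER BY madlib.squared_dist_norm2(t.pt, r.pt)
           ) AS rn
    FROM rel_test t CROSS JOIN rel_train r
) ranked
WHERE rn <= 3;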


FUNCTION MADLIB_SCHEMA.__init_validate_args  (similar to the k-means
function __seeding_validate_args: k should be positive, k should be less
than the number of points, fn_dist should have the correct signature, and
fn_dist_weighting should have the correct signature)
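
A minimal sketch of the first two checks in PL/pgSQL (the function name is
hypothetical, the signature checks are omitted, and a real implementation
should also quote/validate the relation name):

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_args(
    rel_train VARCHAR, k INTEGER)
RETURNS VOID AS $$
DECLARE
    n BIGINT;
BEGIN
    IF k IS NULL OR k <= 0 THEN
        RAISE EXCEPTION 'kNN error: k must be a positive integer, got %', k;
    END IF;
    EXECUTE 'SELECT count(*) FROM ' || rel_train INTO n;
    IF k >= n THEN
        RAISE EXCEPTION 'kNN error: k (%) must be less than the number of training points (%)', k, n;
    END IF;
END;
$$ LANGUAGE plpgsql;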



I look forward to the community's feedback as I work on this module.


Best regards,
Babak




-- 
*Babak Alipour,*
*University of Florida*

Re: kNN implementation

Posted by Xiaocheng Tang <xt...@pivotal.io>.
Hi Babak,

Thank you for your interest in k-NN!
https://issues.apache.org/jira/browse/MADLIB-927

The interface of new modules should be consistent with
the existing ones in MADlib. In this case I would suggest
studying the k-means module first:
https://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html
which in a similar way requires user input on k (the number of centroids)
as well as a distance function, such as Manhattan, Euclidean, or a general
UDF with a specified signature.
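
For instance, a typical k-means call looks along these lines (table and
column names are illustrative; see the linked documentation for the
authoritative signature):

SELECT * FROM madlib.kmeanspp(
    'my_points',                  -- source relation
    'pt',                         -- point column (DOUBLE PRECISION[])
    5,                            -- k: number of centroids
    'madlib.squared_dist_norm2',  -- distance function
    'madlib.avg',                 -- centroid aggregate
    20,                           -- max number of iterations
    0.001                         -- min fraction of points reassigned
);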

As a first step, may I suggest that you send a proposal of the function
definitions, with the parameters and return values, as well as a
description of the functions and what they do.

Based on that we can discuss the design of the interface, and once it looks
good you can start working on the actual implementation.
When you get to the implementation, we can help you with technical challenges.

Finally, check out the previous discussions about kNN in the forum:
https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/201603.mbox/%3CCAKBQfzSQWZzUmhAEZQ3jgHLkZV0pvwucMi1Dqg6YHVEQdEbjcw%40mail.gmail.com%3E

Feel free to ask questions. I look forward to collaborating with you.

Best,
Xiaocheng


