Posted to dev@madlib.apache.org by Frank McQuillan <fm...@pivotal.io> on 2016/09/19 19:11:46 UTC

Re: kNN implementation

Hi Babak,

I noticed the KNN poster
http://dsr.cise.ufl.edu/wp-content/uploads/2016/05/MADlib_Combined.pptx.pdf
and was wondering if you have plans to make a pull request?

Frank

On Fri, Apr 8, 2016 at 11:35 AM, Xiaocheng Tang <xt...@pivotal.io> wrote:

> Thanks Babak. We will review the interface and get back to you soon.
>
> Meanwhile, you may go ahead with the implementation, for which I would
> suggest that you describe your approach in a design doc and share it here
> before actually coding it up. The design doc should include the necessary
> math formulations and relevant references, if there are any.
>
> > On Apr 8, 2016, at 8:19 AM, Babak Alipour <ba...@gmail.com> wrote:
> >
> > Thank you so much Xiaocheng for your reply.
> >
> > I was actually using the k-means interface and modifying it to fit kNN
> > needs.
> >
> > The function definitions I have in mind are:
> >
> > FUNCTION MADLIB_SCHEMA.knn(
> >     rel_train VARCHAR,
> >     rel_test VARCHAR,
> >     ignore_col_train VARCHAR,
> >     fn_dist VARCHAR,
> >     fn_dist_weighting VARCHAR,
> >     k INTEGER
> > ) RETURNS knn_result
> >
> > @param rel_train  Name of the relation containing the training input
> > points.
> >
> > @param rel_test  Name of the relation containing the test input points
> > (these are the points whose k-nearest neighbors are to be found in
> > rel_train).
> >
> > @param ignore_col_train  Name of a column to be ignored (e.g. the class
> > column containing the class labels, which can be used for supervised
> > learning elsewhere).
> >
> > @param fn_dist  Name of a function with signature <tt>DOUBLE
> > PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that returns
> > the distance between two points. The default is the \ref
> > squared_dist_norm2(float8[],float8[]) "squared Euclidean distance"
> > (following the k-means example: take two points, output their
> > distance). I'd rather use the Minkowski distance and add another
> > parameter, p, with a default of 2, so that the function is more general
> > (p=1 gives Manhattan, p=2 gives Euclidean, etc.); a sketch of such a
> > function follows the parameter list.
> >
> > @param fn_dist_weighting  Name of a function with signature <tt>DOUBLE
> > PRECISION -> DOUBLE PRECISION</tt> that, for each distance, returns a
> > weighted distance value (e.g. no weighting, 1-d, or 1/d schemes).
> >
> > @param k  Number of nearest neighbors.
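> >
> > To make fn_dist and fn_dist_weighting concrete, here is a minimal
> > sketch in plain PostgreSQL (the function names are illustrative, not
> > final):
> >
> > CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.dist_minkowski(
> >     a DOUBLE PRECISION[],
> >     b DOUBLE PRECISION[],
> >     p DOUBLE PRECISION DEFAULT 2
> > ) RETURNS DOUBLE PRECISION AS $$
> >     -- Sum |a_i - b_i|^p over all dimensions, then take the p-th root.
> >     -- p=1 gives Manhattan distance, p=2 gives Euclidean distance.
> >     SELECT power(sum(power(abs(a[i] - b[i]), p)), 1.0 / p)
> >     FROM generate_subscripts(a, 1) AS i;
> > $$ LANGUAGE sql IMMUTABLE STRICT;
> >
> > CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.weight_inverse(
> >     d DOUBLE PRECISION
> > ) RETURNS DOUBLE PRECISION AS $$
> >     -- 1/d weighting: closer neighbors get larger weights; guard
> >     -- against division by zero when a test point exactly matches a
> >     -- training point.
> >     SELECT CASE WHEN d = 0 THEN 'infinity'::DOUBLE PRECISION
> >                 ELSE 1.0 / d END;
> > $$ LANGUAGE sql IMMUTABLE STRICT;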
> >
> >
> > @returns a composite value
> >
> > Type knn_result
> >     indexes INTEGER[][],
> >     distances DOUBLE PRECISION[][]
> >
> > - <tt>indexes</tt> - an n x k matrix of the k-nearest neighbors'
> > indexes, where n is the size of rel_test (i.e. |rel_test|) and k is the
> > input parameter specifying the number of nearest neighbors. Row i holds
> > the indexes of the k-nearest neighbors of the i-th test point, in
> > columns 0 to k-1.
> > - <tt>distances</tt> - an n x k matrix of the k-nearest neighbors'
> > distances.
> > In other words, for the i-th data point of rel_test, the row numbers of
> > its k-nearest neighbors in rel_train are in indexes[i][0:k-1], and the
> > distances to those data points are in distances[i][0:k-1].
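> >
> > To illustrate, a call to this proposed interface might look like the
> > following (the table names are hypothetical):
> >
> > SELECT (MADLIB_SCHEMA.knn(
> >             'iris_train',                    -- rel_train
> >             'iris_test',                     -- rel_test
> >             'class',                         -- ignore_col_train
> >             'MADLIB_SCHEMA.dist_minkowski',  -- fn_dist
> >             'MADLIB_SCHEMA.weight_inverse',  -- fn_dist_weighting
> >             5                                -- k
> >         )).*;
> >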
> > For the implementation, I'm not quite sure how to store the results so
> > that I can efficiently update the k-nearest points while a single pass
> > over the data is being made. Since kNN is non-iterative, I think it is
> > possible to implement it completely in SQL, without calling external
> > driver functions; a sketch of this idea follows below. It might also be
> > better to store the results in another table instead of returning a
> > type like this, or to put the results in a temp table and then return
> > them. Feedback is always welcome.
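> >
> > To make the pure-SQL idea concrete, here is a minimal sketch. It
> > assumes each relation has an integer id column and a DOUBLE
> > PRECISION[] points column (both column names are assumptions for
> > illustration); the naive approach is a cross join plus a window
> > function:
> >
> > CREATE TABLE knn_result_table AS
> > SELECT test_id, train_id, dist
> > FROM (
> >     SELECT t.id AS test_id,
> >            s.id AS train_id,
> >            MADLIB_SCHEMA.squared_dist_norm2(t.points, s.points) AS dist,
> >            row_number() OVER (
> >                PARTITION BY t.id
> >                ORDER BY MADLIB_SCHEMA.squared_dist_norm2(t.points,
> >                                                          s.points)
> >            ) AS rnk
> >     FROM rel_test t CROSS JOIN rel_train s  -- naive O(|test|*|train|)
> > ) q
> > WHERE rnk <= 5;  -- keep the k = 5 nearest neighbors per test point
> >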
> > I'm not specifically using a majority vote, because I do not want to
> > limit this implementation to kNN classification; rather, it should be a
> > general kNN algorithm. We could then come up with another method that
> > uses this kNN implementation and runs a majority vote for
> > classification or takes an average for regression, as sketched below.
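> >
> > For illustration, using the result table from the sketch above and an
> > assumed class column in rel_train, a majority vote would then be a
> > follow-up query (regression would take avg() over a numeric target
> > instead):
> >
> > SELECT test_id, class AS predicted_class
> > FROM (
> >     SELECT r.test_id, s.class,
> >            -- rank classes by how many of the k neighbors carry them;
> >            -- ties are broken arbitrarily here
> >            row_number() OVER (PARTITION BY r.test_id
> >                               ORDER BY count(*) DESC) AS rnk
> >     FROM knn_result_table r
> >     JOIN rel_train s ON s.id = r.train_id
> >     GROUP BY r.test_id, s.class
> > ) v
> > WHERE rnk = 1;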
> >
> >
> > FUNCTION MADLIB_SCHEMA.__init_validate_args  (similar to the k-means
> > function __seeding_validate_args: k should be positive, k should be
> > less than the number of points, and fn_dist and fn_dist_weighting
> > should have the correct signatures)
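> >
> > A minimal sketch of such checks (a hypothetical helper, not the actual
> > k-means code):
> >
> > CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_args(
> >     rel_train VARCHAR,
> >     k INTEGER
> > ) RETURNS VOID AS $$
> > DECLARE
> >     n BIGINT;
> > BEGIN
> >     IF k IS NULL OR k <= 0 THEN
> >         RAISE EXCEPTION 'kNN error: k must be a positive integer';
> >     END IF;
> >     EXECUTE 'SELECT count(*) FROM ' || rel_train INTO n;
> >     IF k >= n THEN
> >         RAISE EXCEPTION 'kNN error: k (%) not less than number of points (%)', k, n;
> >     END IF;
> >     -- Signature checks for fn_dist and fn_dist_weighting could query
> >     -- pg_proc in a similar fashion.
> > END;
> > $$ LANGUAGE plpgsql;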
> >
> >
> >
> > I look forward to the community's feedback as I work on this module.
> >
> >
> > Best regards,
> > Babak
> >
> >
> > On Mon, Apr 4, 2016 at 1:28 AM, Xiaocheng Tang <xtang@pivotal.io> wrote:
> >
> >> Hi Babak,
> >>
> >> Thank you for your interest in k-NN!
> >> https://issues.apache.org/jira/browse/MADLIB-927
> >>
> >> The interface of new modules should be consistent with
> >> the existing ones in MADlib. In this case I would suggest
> >> studying the K-means module first:
> >> https://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html
> >> which in a similar way requires user input for k (the number of
> >> centroids), as well as a distance function: Manhattan, Euclidean, or a
> >> general UDF with a specified signature.
> >>
> >> As a first step, may I suggest that you send a proposal of the function
> >> definitions, including the parameters and return values, as well as a
> >> description of the functions and what they do.
> >>
> >> Based on that, we can discuss the design of the interface, and once it
> >> looks good you can start working on the actual implementation. When you
> >> get to the implementation, we can help you with technical challenges.
> >>
> >> Finally, check out the previous discussions about kNN in the forum
> >>
> >> https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/201603.mbox/%3CCAKBQfzSQWZzUmhAEZQ3jgHLkZV0pvwucMi1Dqg6YHVEQdEbjcw%40mail.gmail.com%3E
> >>
> >> Feel free to ask questions. Look forward to collaborating with you.
> >>
> >> Best,
> >> Xiaocheng
> >>
> >>
> >>
> >>> On Mar 28, 2016, at 3:34 PM, Babak Alipour <ba...@gmail.com> wrote:
> >>>
> >>> Greetings everyone,
> >>>
> >>> I am Babak Alipour, a student at the University of Florida. I have
> >>> been using MADlib and was hoping to use kNN classification, which is
> >>> unfortunately not available, so I decided to give implementing it a
> >>> shot.
> >>>
> >>> Looking at the issue tracker, I found two JIRAs regarding kNN:
> >>> MADLIB-409 and MADLIB-927.
> >>> The more recent JIRA mentions that a naive implementation, i.e.
> >>> linearly searching through the data, is expected.
> >>> I have a few questions regarding the details the JIRA doesn't specify:
> >>> Generally, what is the interface of the module? This involves
> >>> questions such as: where is the user expected to provide k, whether to
> >>> use distance weighting, and which distance metric to use (Manhattan,
> >>> Euclidean, Minkowski with some p > 2)?
> >>> Another question is: how should the user specify the data points whose
> >>> k-nearest neighbors are desired? Is it some subset of the original
> >>> data table, or points from another data table with the same schema as
> >>> the original data table?
> >>> Also, are the output points to be kept in a separate table?
> >>>
> >>> I'd love to hear some feedback from the community so that I can move
> >>> forward with the implementation.
> >>>
> >>> Thanks in advance for your time.
> >>>
> >>>
> >>> Best regards,
> >>> *Babak Alipour ,*
> >>> *University of Florida*
> >>
> >>
> >
> >
> > --
> > *Babak Alipour ,*
> > *University of Florida*
>
>