Posted to dev@madlib.apache.org by Frank McQuillan <fm...@pivotal.io> on 2016/09/19 19:13:37 UTC

Re: Contributing GMM and Perceptron to MADLib

Hi Aditya,

I noticed the KNN poster
http://dsr.cise.ufl.edu/wp-content/uploads/2016/05/MADlib_Combined.pptx.pdf
and was wondering if you have plans to make a pull request?

Frank


On Mon, Mar 28, 2016 at 9:37 PM, Roman Shaposhnik <rv...@apache.org> wrote:

> Awesome!
>
> On Mon, Mar 28, 2016 at 9:18 PM, Frank McQuillan <fm...@pivotal.io>
> wrote:
> > Thanks Roman.  I was able to do it just now.
> >
> > Frank
> >
> > On Mon, Mar 28, 2016 at 9:12 PM, Roman Shaposhnik <rv...@apache.org>
> > wrote:
> >>
> >> I can help with that -- stay tuned.
> >>
> >> On Mon, Mar 28, 2016 at 8:29 PM, Frank McQuillan <fmcquillan@pivotal.io>
> >> wrote:
> >> > Let me figure out how to do this and add Aditya as the owner of that
> >> > JIRA.
> >> > My initial attempts in ASF infra-land were not quite successful.
> >> >
> >> > Frank
> >> >
> >> > On Mon, Mar 28, 2016 at 4:54 PM, Rahul Iyer <ri...@pivotal.io> wrote:
> >> >>
> >> >> @Frank, Roman: I believe Aditya needs to be added as a developer to the
> >> >> MADlib project before a JIRA can be assigned to him. Is this only
> >> >> available to the lead/owner?
> >> >>
> >> >> On Mon, Mar 28, 2016 at 3:49 PM, Aditya Nain <ad...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> Hi Rahul,
> >> >>>
> >> >>> I didn't have an id, so I created one now.
> >> >>> My id is : Aditya Nain
> >> >>>
> >> >>> Thanks,
> >> >>> Aditya
> >> >>>
> >> >>> On Mon, Mar 28, 2016 at 6:40 PM, Rahul Iyer <ri...@pivotal.io>
> >> >>> wrote:
> >> >>>
> >> >>> > I can assign this to you, but you need to have an account in
> >> >>> > https://issues.apache.org.
> >> >>> > If you already have an account, then please send your id - I wasn't
> >> >>> > able to find you just using your name.
> >> >>> >
> >> >>> > On Mon, Mar 28, 2016 at 3:31 PM, Aditya Nain <adityanain1@gmail.com>
> >> >>> > wrote:
> >> >>> >
> >> >>> > > Hi Rahul,
> >> >>> > >
> >> >>> > > Thanks for the reply!
> >> >>> > >
> >> >>> > > I am working on implementing a Gaussian Mixture Model, assuming that
> >> >>> > > the covariance matrix is the same for all the Gaussians.
> >> >>> > > The JIRA that deals with GMM is MADLIB-410:
> >> >>> > >
> >> >>> > > https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
> >> >>> > >
> >> >>> > > Can this be assigned to me, or how do I get it assigned to me?
> >> >>> > >
> >> >>> > > Thanks,
> >> >>> > > Aditya
> >> >>> > >
> >> >>> > > On Mon, Mar 21, 2016 at 3:41 PM, Rahul Iyer <ri...@pivotal.io>
> >> >>> > > wrote:
> >> >>> > >
> >> >>> > > > Hi Aditya,
> >> >>> > > >
> >> >>> > > > Welcome to the MADlib community!
> >> >>> > > >
> >> >>> > > > Gaussian mixture models are extremely useful, and we would heartily
> >> >>> > > > welcome a contribution for them. The SQLEM paper might be
> >> >>> > > > oversimplifying the capabilities of the database (e.g. assuming
> >> >>> > > > there is no array type is unnecessary for PostgreSQL). You could
> >> >>> > > > speed things up (both dev time and execution time) by writing some
> >> >>> > > > of the functions in C++. K-means is an example of how clustering is
> >> >>> > > > implemented.
> >> >>> > > > IMO, assuming the same covariance matrix is reasonable. We could
> >> >>> > > > extend the capabilities after the initial implementation is
> >> >>> > > > complete.
> >> >>> > > >
> >> >>> > > > There was some work started a long time ago that built perceptrons
> >> >>> > > > using the convex framework (link
> >> >>> > > > <https://github.com/iyerr3/madlib/tree/mlp>).
> >> >>> > > > There are still some bugs in that code, since the trained network
> >> >>> > > > isn't converging. You could start there or build a new module;
> >> >>> > > > either way, an MLP module is frequently requested by the data
> >> >>> > > > science community.
> >> >>> > > >
> >> >>> > > > I would suggest starting with Gaussian mixtures and then moving to
> >> >>> > > > perceptrons once the GMM work is completed.
> >> >>> > > >
> >> >>> > > > Feel free to ask questions on this forum. Looking forward to
> >> >>> > > > collaborating with you.
> >> >>> > > >
> >> >>> > > > Best,
> >> >>> > > > Rahul
> >> >>> > > >
> >> >>> > > > On Thu, Mar 17, 2016 at 2:08 PM, Aditya Nain
> >> >>> > > > <ad...@gmail.com>
> >> >>> > > > wrote:
> >> >>> > > >
> >> >>> > > > > Hi,
> >> >>> > > > >
> >> >>> > > > > My name is Aditya Nain, and I am a graduate student at the
> >> >>> > > > > University of Florida.
> >> >>> > > > > I have been learning MADlib for a while and want to contribute to
> >> >>> > > > > it. I went through some of the open stories in JIRA and started
> >> >>> > > > > working on MADLIB-410:
> >> >>> > > > >
> >> >>> > > > > https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
> >> >>> > > > >
> >> >>> > > > > which is about implementing a Gaussian Mixture Model using the
> >> >>> > > > > Expectation Maximization (EM) algorithm.
> >> >>> > > > >
> >> >>> > > > > I came across the following paper while searching for a distributed
> >> >>> > > > > EM algorithm that can be implemented in MADlib.
> >> >>> > > > >
> >> >>> > > > > Carlos Ordonez, Paul Cereghini, "SQLEM: fast clustering in SQL
> >> >>> > > > > using the EM algorithm," ACM SIGMOD Record, Volume 29, Issue 2,
> >> >>> > > > > June 2000, Pages 559-570.
> >> >>> > > > >
> >> >>> > > > > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.7564
> >> >>> > > > >
> >> >>> > > > > I thought of implementing the approach discussed in the paper, but
> >> >>> > > > > the paper assumes that the covariance matrix is the same for all
> >> >>> > > > > the clusters (i.e. the covariance matrix is the same for all the
> >> >>> > > > > Gaussian distributions). So, I wanted to know whether the community
> >> >>> > > > > thinks it is fine to go with the assumption made in the paper and
> >> >>> > > > > implement it in MADlib.
> >> >>> > > > >
> >> >>> > > > > Also, MADlib currently doesn't have an implementation of a
> >> >>> > > > > perceptron, nor did I find any open story related to it in JIRA. I
> >> >>> > > > > came across the following paper, which describes a distributed
> >> >>> > > > > algorithm for the perceptron:
> >> >>> > > > >
> >> >>> > > > > Ryan McDonald, Keith Hall, Gideon Mann, "Distributed training
> >> >>> > > > > strategies for the structured perceptron"
> >> >>> > > > > http://dl.acm.org/citation.cfm?id=1858068
> >> >>> > > > >
> >> >>> > > > > Would it be useful to have a distributed implementation of the
> >> >>> > > > > perceptron in MADlib?
> >> >>> > > > >
> >> >>> > > > > Thanks,
> >> >>> > > > > Aditya
> >> >>> > > > >
> >> >>> > > >
> >> >>> > >
> >> >>> >
> >> >>
> >> >>
> >> >
> >
> >
>