Posted to dev@madlib.apache.org by Gautam Muralidhar <ga...@gmail.com> on 2015/12/24 21:29:37 UTC

Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

> Hi Team MADlib,
> 
> I managed to complete the implementation of the Bayesian analysis of the binary Probit regression model on MPP. The code has been tested on the Greenplum sandbox VM and seems to work fine. You can find the code here:
> 
> https://github.com/gautamsm/data-science-on-mpp/tree/master/BayesianAnalysis
> 
> In the git repo, probit_regression.ipynb is the stand-alone Python implementation. To verify correctness, I compared against R's MCMCpack library, which can also be run in the Jupyter notebook!
> 
> probit_regression_mpp.ipynb is the distributed implementation for Greenplum or HAWQ. It uses the MADlib matrix operations heavily and also depends on numpy and scipy. If you look at the Gibbs Probit Driver function, you will see that the only in-memory operations are inverting a matrix (in this case, the covariance matrix or the X_T_X matrix, whose dimensions equal the number of features and are hence hopefully reasonable), sampling from a multivariate normal, and handling the coefficients.
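> 
> For a sense of the math involved, here is a minimal in-memory sketch of one Gibbs iteration for the probit model, in the style of the Albert-Chib data-augmentation scheme. The names below are illustrative, not the notebook's actual ones, and in the MPP version the X'X and X'z products come from MADlib rather than numpy:
> 
>     import numpy as np
>     from scipy.stats import truncnorm
> 
>     def gibbs_probit_step(X, y, beta, rng):
>         n, p = X.shape
>         mu = X @ beta
>         # Latent utilities z_i ~ N(mu_i, 1), truncated to match y_i:
>         # positive when y_i = 1, non-positive when y_i = 0.
>         lower = np.where(y == 1, -mu, -np.inf)
>         upper = np.where(y == 1, np.inf, -mu)
>         z = mu + truncnorm.rvs(lower, upper, size=n, random_state=rng)
>         # With a flat prior, beta | z ~ N((X'X)^-1 X'z, (X'X)^-1); only
>         # this p x p inverse and the draw below happen in memory.
>         xtx_inv = np.linalg.inv(X.T @ X)
>         mean = xtx_inv @ (X.T @ z)
>         return rng.multivariate_normal(mean, xtx_inv)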
> 
> A few observations based on my experience with the MADlib matrix operations:
> 
> 1. First of all, they are a real boon! Last year, we implemented an autoencoder in MPP and had to write our own matrix operations, which was painful. So kudos to you guys! The matrix operations meant that it took me only ~4 hours to complete the implementation in MPP. That is significant, though I do have prior experience with SQL and PL/Python.
> 
> 2. It would be great if we could get the following matrix functionality in MADlib at some point:
>     a. Creating an identity matrix
>     b. Creating a zero matrix
>     c. Sampling matrices and scalars from certain distributions. We could start with Gaussian (multi-variate), truncated normal, Wishart, Inverse-Wishart, Gamma, and Beta.
> 
> 3. I still think that, as developers using the MADlib matrix operations, we have to write a lot of code, mainly because we need to create SQL tables at each step of a pipeline. We should look at reducing this and see whether we can pipeline operations efficiently.
> 
> 4. Lastly, I would like to see this end up in MADlib at some point. To get there, we will need to implement the truncated normal and multivariate normal samplers. If we can carve out a numpy- and scipy-dependent section in MADlib and make it clear that those functions work only when numpy and scipy are installed, that might accelerate MADlib contributions.
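> 
> As a concrete sketch of what one such sampler could look like (assuming numpy and scipy are present on the database hosts, per the caveat above; the function name is hypothetical):
> 
>     import numpy as np
>     from scipy.stats import truncnorm
> 
>     def sample_truncated_normal(mu, sigma, lo, hi, n, seed=None):
>         """Draw n samples from N(mu, sigma^2) truncated to [lo, hi]."""
>         # truncnorm takes its bounds in standardized units.
>         a, b = (lo - mu) / sigma, (hi - mu) / sigma
>         rng = np.random.default_rng(seed)
>         return truncnorm.rvs(a, b, loc=mu, scale=sigma, size=n,
>                              random_state=rng)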

Sent from my iPhone

Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

Posted by Gautam Muralidhar <ga...@gmail.com>.
Following up on the suggestions made on this thread, I re-implemented the
MCMC Probit and Logit regressions using the MADlib crossprod function. The
resulting implementations are orders of magnitude faster than the first
implementation, which used the matrix_mult function. The new implementations
can be found here (look for the cell containing madlib.crossprod):

https://github.com/gautamsm/data-science-on-mpp/blob/master/BayesianAnalysis/probit_regression_mpp.ipynb

https://github.com/gautamsm/data-science-on-mpp/blob/master/BayesianAnalysis/logit_regression_mpp.ipynb

The new implementation has now given us a couple of MCMC-based algorithms
that we can actually use.

Best,
Gautam

-- 
==========================================================
Gautam Muralidhar
PhD, The University of Texas at Austin, USA.
==========================================================

Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

Posted by Gautam Muralidhar <ga...@gmail.com>.
Thanks for the comments, Rahul and Caleb.

@Rahul - I will look to implement your suggestions.

Best,
Gautam




-- 
==========================================================
Gautam Muralidhar
PhD, The University of Texas at Austin, USA.
==========================================================

Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

Posted by Rahul Iyer <ri...@pivotal.io>.
Thanks for your comments, Caleb.

@Gautam: as I mentioned in the community call today, we have an
aggregate function, crossprod(float8[], float8[]), that can be used to
perform the X'X and X'Y operations:
- for X'X, the row_vec column would be both vector inputs
- for X'Y, the row_vec column of X would be the first input and the Y
value as an array would be the second input (crossprod needs to treat
Y as a 1x1 vector).
You would, however, have to be careful with the X'X output - it is the
matrix flattened into an array, so you would have to reshape it.

As Caleb said, we would benefit from inspecting the distribution of the
two input matrices in matrix_mult and switching between the currently
implemented inner product and this crossprod aggregate (outer
product).
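
A rough sketch of what this looks like from PL/Python (the table and
column names are made up, and the call would live inside a UDF where
plpy is available; this is an illustration, not tested code):

    import numpy as np

    def xtx_xty_via_crossprod(plpy, table, p):
        # crossprod returns each result flattened into a float8[] array,
        # so the p x p X'X output must be reshaped on the way out.
        row = plpy.execute(
            "SELECT madlib.crossprod(row_vec, row_vec) AS xtx, "
            "       madlib.crossprod(row_vec, ARRAY[y]) AS xty "
            "FROM " + table)[0]
        return (np.array(row["xtx"]).reshape(p, p),
                np.array(row["xty"]))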


Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

Posted by Caleb Welton <cw...@pivotal.io>.
Sorry I missed the community call this morning.  I heard that this was
discussed in more detail, but haven't seen the minutes of the call posted
yet.  Here are a couple more thoughts on this:

The matrix-operation-based implementation offered by Gautam is an intuitive
and logical way of describing the algorithm. If we had an efficient way of
expressing algorithms like this, it would greatly simplify the process of
adding new algorithms and lower the barrier to entry for contributions to
the project. That would be a good thing, so I wanted to spend a bit more
thought on what it would take and why this solution is not efficient
today.

The main point is that the existing implementation we have for calculating
X_T_X in MADlib is significantly more efficient than the implementation
within madlib.matrix_mult(), but the implementation in madlib.matrix_mult()
is much more general purpose. The existing implementation is hard-coded
around the fact that both X and t(X) refer to the same matrix, and that in
this specific calculation each row of the matrix becomes the column of the
transpose that it is multiplied with. This means that once we have all the
data for a row, that row's contributions can be calculated without any
additional redistribution of data. Further, since X and t(X) are the same
table, we don't have to join two tables together to get the data, and we
can complete the entire operation with a single scan of one table. We do
not seem to have the optimization for this very special case enabled in
madlib.matrix_mult(), which makes its implementation of the multiplication
substantially slower.

Similarly for X_T_Y: in our typical cases, X and Y are both in the same
initial input table, and in some ways we can think of "XY" as a single
matrix that we have simply sliced vertically to produce X and Y as separate
matrices. This means that even though X and Y are different matrices in the
mathematical expression of the model, we can still use the same in-place
logic that we used for X_T_X. As expressed in the current
madlib.matrix_mult() API, there is no easy way for matrix_mult to recognize
this relationship, so we would end up forced down the inefficient route
even if we added the special-case optimization for when the left and right
sides of the multiplication are transpositions of the same matrix.
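
To make the single-scan point concrete, here is a toy numpy check of the
identity being exploited (illustrative only, not MADlib code): X'X is a
sum of per-row outer products, and because Y lives in the same rows, X'Y
can accumulate in the same pass.

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.random((1000, 5)), rng.random(1000)

    xtx, xty = np.zeros((5, 5)), np.zeros(5)
    for i, x in enumerate(X):        # a single pass over the rows
        xtx += np.outer(x, x)        # this row's contribution to X'X
        xty += x * y[i]              # same row's contribution to X'Y
    assert np.allclose(xtx, X.T @ X)
    assert np.allclose(xty, X.T @ y)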

One path forward that would help make this type of implementation viable
would be to add some of these optimizations, and possibly API enhancements,
to the matrix_mult code. Going this route, we could probably get from a 30X
performance hit down to only a 2X performance hit - the remaining 2X comes
from having to make separate scans for X_T_X and X_T_Y rather than
combining both calculations in a single scan of the data. Reducing that
last 2X would take more effort and a greater level of sophistication in
our optimization routines; the general case would likely require some
amount of code generation.

Regards,
  Caleb


Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

Posted by Caleb Welton <cw...@pivotal.io>.
Great to see the prototype work here; I'm sure there is something from
this work that we can bring into MADlib.

However... it is a very different implementation from the existing
algorithms, calling into the MADlib matrix functions directly rather than
having the majority of the work done within the abstraction layer.
Unfortunately, this leads to a very inefficient implementation.

As a demonstration, I ran this test case:

Dataset: 1 dependent variable, 4 independent variables + intercept,
10,000,000 observations

Run using Postgres 9.4 on a MacBook Pro:

Creating the X matrix from source table: 13.9s
Creating the Y matrix from source table: 9.1s
Computing X_T_X via matrix_mult: 169.2s
Computing X_T_Y via matrix_mult: 114.8s

Calling madlib.linregr_train directly (implicitly calculates all of the
above as well as inverting the X_T_X matrix and calculating some other
statistics): 10.3s
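
For reference, the direct call timed above looks roughly like the
following (shown via a hypothetical PL/Python helper; the table and
column names are illustrative):

    def run_linregr(plpy):
        # madlib.linregr_train(source_table, out_table,
        #                      dependent_varname, independent_varname)
        plpy.execute(
            "SELECT madlib.linregr_train("
            "'obs', 'obs_linregr', 'y', 'ARRAY[1, x1, x2, x3, x4]')")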

So in total, this is about 30X slower than our existing methodology for
doing the same calculations.  I would expect this delta to potentially get
even larger moving from Postgres to Greenplum or HAWQ, where we would be
able to start applying parallelism.  (The specialized XtX multiplication
in linregr parallelizes perfectly, but the more general matrix_mult
functionality may not.)

As performance has been a key aspect of our development, I'm not sure that
we want to go down the architectural path outlined in this example code.

That said... I can certainly see how this layer of abstraction could be a
valuable way of expressing things from a development perspective.  So the
question for the development community is whether there is a way we can
enable people to write code more similar to what Gautam has expressed while
preserving the performance of our existing implementations.

One idea that comes to mind would be to take an API abstraction approach
more akin to what we see in Theano, where we can express a series of
matrix transformations abstractly and then let the framework work out the
best way to calculate the pipeline.  That would be a large project... but
it could be one answer to the long-held question "how should we define our
Python abstraction layer?".

As a whole, I'd be pretty resistant to adding dependencies on numpy/scipy
unless there was a compelling use case where the performance overhead of
implementing the MATH (instead of the control flow) in Python was not
unacceptably large.

-Caleb


Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

Posted by Frank McQuillan <fm...@pivotal.io>.
Gautam,

Thank you for working on this, it can be a great addition to MADlib.  A
couple of comments below:

0) Dependencies on numpy and scipy.  Currently the platforms PostgreSQL,
GPDB and HAWQ do not ship with numpy or scipy by default, so we may need to
look at this dependency more closely.

2a,b) The following creation methods will exist in MADlib 1.9; they are
already in the MADlib code base:

-- Create a matrix initialized with ones of given row and column dimension
  matrix_ones( row_dim, col_dim, matrix_out, out_args)

-- Create a matrix initialized with zeros of given row and column dimension
  matrix_zeros( row_dim, col_dim, matrix_out, out_args)

-- Create a square identity matrix of size dim x dim
  matrix_identity( dim, matrix_out, out_args)

-- Create a diagonal matrix initialized with the given diagonal elements
  matrix_diag( diag_elements, matrix_out, out_args)

2c) As for “Sampling matrices and scalars from certain distributions. We
could start with Gaussian (multi-variate), truncated normal, Wishart,
Inverse-Wishart, Gamma, and Beta.”  I created a JIRA for that here:
https://issues.apache.org/jira/browse/MADLIB-940
I agree with your recommendation.

3) Pipelining
* It's an architecture question that I agree we need to address, to reduce
disk I/O between steps
* It could be a platform implementation, or we can think about whether
MADlib can do something on top of the existing platform by coming up with
a way to chain operations in memory

4) I would *strongly* encourage you to go the next/last mile and get this
into MADlib.  The community can help you do it.  And as you say, we need
to figure out how/if to support numpy and scipy, or alternatively
implement these as MADlib functions via Eigen or Boost.

Frank
