Posted to dev@madlib.apache.org by Gautam Muralidhar <ga...@gmail.com> on 2016/02/10 07:12:08 UTC

Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

Following up on the suggestions made on this thread, I re-implemented the
MCMC Probit and Logit regression using the MADlib crossprod function. The
resulting implementations are orders of magnitude faster than the first
implementation that used the matrix_mult function. The new implementation
can be found here (look for the cell containing madlib.crossprod):

https://github.com/gautamsm/data-science-on-mpp/blob/master/BayesianAnalysis/probit_regression_mpp.ipynb

https://github.com/gautamsm/data-science-on-mpp/blob/master/BayesianAnalysis/logit_regression_mpp.ipynb

The new implementation has now given us a couple of MCMC-based algorithms
that we can actually use.
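For readers following along, the speedup makes sense given what crossprod computes: it is an aggregate that accumulates the outer product of its two array arguments over the rows of a table, returning the result as a flattened array. A minimal NumPy re-creation of that semantics (illustrative only, not MADlib code):

```python
import numpy as np

def crossprod(rows_a, rows_b):
    """Single-pass stand-in for the MADlib crossprod aggregate:
    accumulates sum_i outer(a_i, b_i) over paired rows, flattened."""
    total = np.zeros(len(rows_a[0]) * len(rows_b[0]))
    for a, b in zip(rows_a, rows_b):
        total += np.outer(a, b).ravel()  # flattened, as the aggregate returns it
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
xtx = crossprod(X, X).reshape(5, 5)  # X'X, reshaped back to a matrix
assert np.allclose(xtx, X.T @ X)
```

Because each row's contribution is computed independently and summed, the aggregate needs only one scan of the table and no data redistribution, which is where the speedup over the general matrix_mult path comes from.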

Best,
Gautam

On Fri, Jan 15, 2016 at 4:27 PM, Gautam Muralidhar <
gautam.s.muralidhar@gmail.com> wrote:

> Thanks for the comments, Rahul and Caleb.
>
> @Rahul - I will look to implement your suggestions.
>
> Best,
> Gautam
>
> On Fri, Jan 15, 2016 at 3:27 PM, Rahul Iyer <ri...@pivotal.io> wrote:
>
>> Thanks for your comments, Caleb.
>>
>> @Gautam: as I mentioned in the community call today, we have an
>> aggregate function, crossprod(float8[], float8[]), that can be used to
>> perform the X'X and X'Y operations.
>> - for X'X, the row_vec column would serve as both vector inputs
>> - for X'Y, the row_vec column of X would be the first input and the Y
>> value, wrapped as an array, would be the second input (crossprod needs
>> to treat the Y as a 1x1 vector).
>> You would, however, have to be careful with the X'X output - it is the
>> matrix flattened into an array, so you would have to reshape it.
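A small sketch of the X'Y recipe described here, using NumPy to stand in for the aggregate (the code is illustrative, not the MADlib API): each row of X is paired with its y value wrapped as a 1-element array, and the accumulated outer products sum to X'Y.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# Feed each row_vec of X as the first input and the scalar y wrapped in
# a 1-element array as the second; the aggregate accumulates
# sum_i outer(x_i, [y_i]), which equals X'Y.
xty = np.zeros(4)
for x_i, y_i in zip(X, y):
    xty += np.outer(x_i, [y_i]).ravel()
assert np.allclose(xty, X.T @ y)
```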
>>
>> As Caleb said, we would benefit from inspecting the distribution of
>> the two input matrices in matrix_mult and switching between the
>> currently implemented inner product and this crossprod aggregate
>> (outer product).
>>
>> On Fri, Jan 15, 2016 at 2:52 PM, Caleb Welton <cw...@pivotal.io> wrote:
>> > Sorry I missed the community call this morning.  I heard that this was
>> > discussed in more detail, but haven't seen the minutes of the call
>> posted
>> > yet.  Here are a couple more thoughts on this:
>> >
>> > The matrix-operation-based implementation offered by Gautam is an
>> > intuitive and logical way of describing the algorithm. If we had an
>> > efficient way of expressing algorithms like this, it would greatly
>> > simplify the process of adding new algorithms and lower the barrier
>> > to entry for contributions to the project, which would be a good
>> > thing. So I wanted to spend a bit more thought on what this would
>> > take and why this solution is not efficient today.
>> >
>> > Primarily, the existing implementation we have for calculating X_T_X
>> > in MADlib is significantly more efficient than the one within
>> > madlib.matrix_mult(), but the implementation in madlib.matrix_mult()
>> > is much more general purpose. The existing implementation is
>> > hard-coded to exploit the fact that X and t(X) are the same matrix,
>> > and that in this specific calculation each row of the matrix is
>> > multiplied with the corresponding column of the transpose. This means
>> > that once we have all the data for a row, that row's contribution can
>> > be calculated without any additional redistribution of data. Further,
>> > since they are the same table, we don't have to join two tables
>> > together to get the data, and we can complete the entire operation
>> > with a single scan of one table. We do not seem to have this
>> > very-special-case optimization enabled in madlib.matrix_mult(), which
>> > makes that implementation of the multiplication substantially slower.
>> >
>> > Similarly for X_T_Y: in our typical cases, X and Y are both in the
>> > same initial input table, and in some ways we can think of "XY" as a
>> > single matrix that we have simply sliced vertically to produce X and
>> > Y as separate matrices. This means that even though X and Y are
>> > different matrices in the mathematical expression of the model, we
>> > can still use the same in-place logic that we used for X_T_X. As
>> > expressed in the current madlib.matrix_mult() API, there is no easy
>> > way for matrix_mult to recognize this relationship, so we end up
>> > forced down the inefficient route even if we added the special-case
>> > optimization for when the left and right sides of the multiplication
>> > are transpositions of the same matrix.
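Caleb's point can be made concrete with a small sketch: if we treat "XY" as one augmented table and accumulate the outer product of each full row with itself, a single scan yields X'X and X'Y (and y'y) as blocks of one result. NumPy stand-in, illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.normal(size=200)
XY = np.hstack([X, y[:, None]])  # the "XY" table: X and Y side by side

# One scan: accumulate outer(row, row) over the augmented rows. The
# result contains X'X and X'Y (and y'y) as blocks, so both products
# come out of a single pass over the data.
G = np.zeros((4, 4))
for row in XY:
    G += np.outer(row, row)
xtx, xty = G[:3, :3], G[:3, 3]
assert np.allclose(xtx, X.T @ X) and np.allclose(xty, X.T @ y)
```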
>> >
>> > One path forward that would help make this type of implementation
>> > viable would be adding some of these optimizations, and possibly API
>> > enhancements, to the matrix_mult code. Going this route, we could
>> > probably get from a 30X performance hit down to only a 2X hit - the
>> > remaining 2X coming from having to make separate scans for X_T_X and
>> > X_T_Y rather than being able to combine both calculations in a single
>> > scan of the data. Reducing that last 2X would take more effort and a
>> > greater level of sophistication in our optimization routines; the
>> > general case would likely require some amount of code generation.
>> >
>> > Regards,
>> >   Caleb
>> >
>> > On Thu, Jan 14, 2016 at 5:32 PM, Caleb Welton <cw...@pivotal.io>
>> wrote:
>> >
>> >> Great seeing the prototype work here, I'm sure that there is something
>> >> that we can find from this work that we can bring into MADlib.
>> >>
>> >> However... It is a very different implementation from the existing
>> >> algorithms, calling into the madlib matrix functions directly rather
>> than
>> >> having the majority of the work done within the abstraction layer.
>> >> Unfortunately this leads to a very inefficient implementation.
>> >>
>> >> As a demonstration of this, I ran the following test case:
>> >>
>> >> Dataset: 1 dependent variable, 4 independent variables + intercept,
>> >> 10,000,00 observations
>> >>
>> >> Run using Postgres 9.4 on a Macbook Pro:
>> >>
>> >> Creating the X matrix from source table: 13.9s
>> >> Creating the Y matrix from source table: 9.1s
>> >> Computing X_T_X via matrix_mult: 169.2s
>> >> Computing X_T_Y via matrix_mult: 114.8s
>> >>
>> >> Calling madlib.linregr_train directly (implicitly calculates all of the
>> >> above as well as inverting the X_T_X matrix and calculating some other
>> >> statistics): 10.3s
>> >>
>> >> So in total about 30X slower than our existing methodology for doing
>> the
>> >> same calculations.  I would expect this delta to potentially get even
>> >> larger if it was to move from Postgres to Greenplum or HAWQ where we
>> would
>> >> be able to start applying parallelism.  (the specialized XtX
>> multiplication
>> >> in linregr parallelizes perfectly, but the more general matrix_mult
>> >> functionality may not)
>> >>
>> >> As performance has been a key aspect of our development, I'm not
>> >> sure that we want to go down the architectural path outlined in this
>> >> example code.
>> >>
>> >> That said... I can certainly see how this layer of abstraction could
>> >> be a valuable way of expressing things from a development
>> >> perspective. So the question for the development community is
>> >> whether there is a way we can enable people to write code more
>> >> similar to what Gautam has expressed while preserving the
>> >> performance of our existing implementations.
>> >>
>> >> One idea that comes to mind would be to take an API abstraction
>> >> approach more akin to what we see in Theano, where we express a
>> >> series of matrix transformations abstractly and then let the
>> >> framework work out the best way to calculate the pipeline. It would
>> >> be a large project to do that... but it could be one answer to the
>> >> long-held question "how should we define our Python abstraction
>> >> layer?".
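To make the Theano-style idea concrete, here is a toy deferred-evaluation sketch (entirely hypothetical, not MADlib code): expressions build a graph instead of computing immediately, and eval() is where an optimizer could recognize patterns such as t(X) @ X and dispatch to a specialized single-scan implementation.

```python
import numpy as np

class Expr:
    def __matmul__(self, other):
        return MatMul(self, other)
    @property
    def T(self):
        return Transpose(self)

class Mat(Expr):
    def __init__(self, data):
        self.data = np.asarray(data)
    def eval(self):
        return self.data

class Transpose(Expr):
    def __init__(self, child):
        self.child = child
    def eval(self):
        return self.child.eval().T

class MatMul(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self):
        # This is where an optimizer would special-case t(X) @ X and
        # dispatch to the in-place, single-scan implementation.
        if isinstance(self.left, Transpose) and self.left.child is self.right:
            x = self.right.eval()
            return x.T @ x  # stand-in for the specialized path
        return self.left.eval() @ self.right.eval()

X = Mat(np.arange(6.0).reshape(3, 2))
assert np.allclose((X.T @ X).eval(), X.data.T @ X.data)
```

The class and method names here are invented for illustration; the point is only that deferring evaluation gives the framework a place to choose an execution strategy before any work is done.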
>> >>
>> >> As a whole, I'd be pretty resistant to adding dependencies on
>> >> numpy/scipy unless there was a compelling use case where the
>> >> performance overhead of implementing the MATH (rather than the
>> >> control flow) in Python was acceptable.
>> >>
>> >> -Caleb
>> >>
>> >> On Thu, Dec 24, 2015 at 12:51 PM, Frank McQuillan <
>> fmcquillan@pivotal.io>
>> >> wrote:
>> >>
>> >>> Gautam,
>> >>>
>> >>> Thank you for working on this - it could be a great addition to
>> >>> MADlib. A couple of comments below:
>> >>>
>> >>> 0) Dependencies on numpy and scipy.  Currently the platforms
>> PostgreSQL,
>> >>> GPDB and HAWQ do not ship with numpy or scipy by default, so we may
>> need
>> >>> to
>> >>> look at this dependency more closely.
>> >>>
>> >>> 2a,b) The following creation methods will exist in MADlib 1.9; they
>> >>> are already in the MADlib code base:
>> >>>
>> >>> -- Create a matrix initialized with ones of given row and column
>> dimension
>> >>>   matrix_ones( row_dim, col_dim, matrix_out, out_args)
>> >>>
>> >>> -- Create a matrix initialized with zeros of given row and column
>> >>> dimension
>> >>>   matrix_zeros( row_dim, col_dim, matrix_out, out_args)
>> >>>
>> >>> -- Create a square identity matrix of size dim x dim
>> >>>   matrix_identity( dim, matrix_out, out_args)
>> >>>
>> >>> -- Create a diagonal matrix initialized with given diagonal elements
>> >>>   matrix_diag( diag_elements, matrix_out, out_args)
>> >>>
>> >>> 2c) As for “Sampling matrices and scalars from certain distributions.
>> We
>> >>> could start with Gaussian (multi-variate), truncated normal, Wishart,
>> >>> Inverse-Wishart, Gamma, and Beta.”  I created a JIRA for that here:
>> >>> https://issues.apache.org/jira/browse/MADLIB-940
>> >>> I agree with your recommendation.
>> >>>
>> >>> 3) Pipelining
>> >>> * it’s an architecture question that I agree we need to address, to
>> reduce
>> >>> disk I/O between steps
>> >>> * Could be a platform implementation, or we can think about if MADlib
>> can
>> >>> do something on top of the existing platform by coming up with a way
>> to
>> >>> chain operations in-memory
>> >>>
>> >>> 4) I would *strongly* encourage you to go the next/last mile and
>> >>> get this into MADlib.  The community can help you do it.  And as
>> >>> you say, we need to figure out how/if to support numpy and scipy,
>> >>> or alternatively implement the needed functions in MADlib via Eigen
>> >>> or Boost.
>> >>>
>> >>> Frank
>> >>>
>> >>> On Thu, Dec 24, 2015 at 12:29 PM, Gautam Muralidhar <
>> >>> gautam.s.muralidhar@gmail.com> wrote:
>> >>>
>> >>> > > Hi Team MADlib,
>> >>> > >
>> >>> > > I managed to complete the implementation of the Bayesian analysis
>> of
>> >>> the
>> >>> > binary Probit regression model on MPP. The code has been tested on
>> the
>> >>> > greenplum sandbox VM and seems to work fine. You can find the code
>> here:
>> >>> > >
>> >>> > >
>> >>> >
>> >>>
>> https://github.com/gautamsm/data-science-on-mpp/tree/master/BayesianAnalysis
>> >>> > >
>> >>> > > In the git repo, probit_regression.ipynb is the stand-alone
>> >>> > Python implementation. To verify correctness, I compared it
>> >>> > against R's MCMCpack library, which can also be run in the Jupyter
>> >>> > notebook!
>> >>> > >
>> >>> > > probit_regression_mpp.ipynb is the distributed implementation for
>> >>> > Greenplum or HAWQ. This uses the MADlib matrix operations heavily.
>> In
>> >>> > addition, it also has dependencies on numpy and scipy. If you look
>> at
>> >>> the
>> >>> > Gibbs Probit Driver function, you will see that the only operations
>> in
>> >>> > memory are those that involve inverting a matrix (in this case, the
>> >>> > covariance matrix or the X_T_X matrix, whose dimensions equal the
>> >>> number of
>> >>> > features and hence, hopefully reasonable), sampling from a
>> multivariate
>> >>> > normal, and handling the coefficients.
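For reference, the in-memory steps described here correspond to the Albert-Chib Gibbs sampler for probit regression. A compact NumPy/SciPy sketch with a flat prior on beta (an illustrative reconstruction, not the notebook's actual code):

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_probit(X, y, n_iter=200, seed=0):
    """Albert-Chib Gibbs sampler for probit regression, flat prior on beta."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    # The only matrix inverted in memory is k x k (number of features),
    # as noted above, so this stays cheap even for large n.
    xtx_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(xtx_inv)
    beta = np.zeros(k)
    draws = np.empty((n_iter, k))
    for t in range(n_iter):
        mu = X @ beta
        # Latent z_i ~ Normal(mu_i, 1), truncated to (0, inf) when
        # y_i = 1 and to (-inf, 0) when y_i = 0.
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # beta | z ~ Normal((X'X)^-1 X'z, (X'X)^-1)
        mean = xtx_inv @ (X.T @ z)
        beta = mean + chol @ rng.standard_normal(k)
        draws[t] = beta
    return draws
```

The function name and signature are invented for this sketch; the distributed version in the notebook pushes the X'X and X'z products down to the database and keeps only these k-dimensional steps in memory.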
>> >>> > >
>> >>> > > A couple of observations based on my experience with the MADlib
>> matrix
>> >>> > operations:
>> >>> > >
>> >>> > > 1. First of all, they are a real boon! Last year, we
>> >>> > implemented the autoencoder in MPP and had to write our own matrix
>> >>> > operations, which was painful. So kudos to you guys! The matrix
>> >>> > operations meant that it took me ~4 hours to complete the
>> >>> > implementation in MPP. That is significant, though admittedly I
>> >>> > have experience with SQL and PL/Python.
>> >>> > >
>> >>> > > 2. It would be great if we can get the following matrix
>> functionality
>> >>> in
>> >>> > MADlib at some point:
>> >>> > >     a. Creating an identity matrix
>> >>> > >     b. Creating a zero matrix
>> >>> > >     c. Sampling matrices and scalars from certain distributions.
>> We
>> >>> > could start with Gaussian (multi-variate), truncated normal,
>> Wishart,
>> >>> > Inverse-Wishart, Gamma, and Beta.
>> >>> > >
>> >>> > > 3. I still do think that, as developers using the MADlib matrix
>> >>> > operations, we need to write a lot of code, mainly because we need
>> >>> > to create SQL tables in a pipeline. We should probably look to
>> >>> > reduce this and see if we can efficiently pipeline operations.
>> >>> > >
>> >>> > > 4. Lastly, I would like to see if this can end up in MADlib at
>> some
>> >>> > point. But to end up in MADlib, we will need to implement the
>> truncated
>> >>> > normal and multi-variate normal samplers. If we can perhaps carve
>> out a
>> >>> > numpy and scipy dependent section in MADlib and make it clear that
>> these
>> >>> > functions work only if numpy and scipy are installed, then that
>> might
>> >>> > accelerate MADlib contributions from committers.
>> >>> >
>> >>> > Sent from my iPhone
>> >>>
>> >>
>> >>
>>
>
>
>
> --
> ==========================================================
> Gautam Muralidhar
> PhD, The University of Texas at Austin, USA.
> ==========================================================
>



-- 
==========================================================
Gautam Muralidhar
PhD, The University of Texas at Austin, USA.
==========================================================