Posted to dev@madlib.apache.org by Frank McQuillan <fm...@pivotal.io> on 2017/02/09 01:11:18 UTC

Re: Madlib Feature Improvement Proposal: Update SVD with Improved Eigen Function

Thanks for the question, Aaron.

MADlib actually makes little use of the Eigen SVD, only in single-node
situations, so while moving to a better version is a good idea, it probably
won't materially impact operations on large data sets.

For most cases (e.g., PCA), the distributed version of SVD is used:
http://madlib.incubator.apache.org/docs/latest/group__grp__svd.html

It is a custom implementation using Lanczos bidiagonalization, which is
described in Chapter 5 of the MADlib design document:
http://madlib.incubator.apache.org/design.pdf
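
To give a rough idea of what Lanczos bidiagonalization does, here is a minimal
single-node sketch of the textbook Golub-Kahan recurrence in plain Python. It
is only an illustration, not MADlib's code: the names lanczos_bidiag, matvec
and rmatvec are made up for this example, there is no reorthogonalization, and
the distributed implementation in the design document differs in its details.
The recurrence only needs products with A and its transpose, which is why the
approach suits large distributed matrices.

```python
import math

def matvec(A, x):
    # Compute A x for a dense row-major matrix A.
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rmatvec(A, y):
    # Compute A^T y without forming the transpose.
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def norm(x):
    return math.sqrt(sum(t * t for t in x))

def lanczos_bidiag(A, v0, k, tol=1e-12):
    """Run up to k steps of Golub-Kahan-Lanczos bidiagonalization.

    Returns (U, V, alphas, betas) with A v_j = alpha_j u_j + beta_{j-1} u_{j-1},
    i.e. A V = U B where B is upper bidiagonal (alphas on the diagonal,
    betas on the superdiagonal). Stops early if a norm falls below tol.
    """
    nv = norm(v0)
    v = [t / nv for t in v0]
    U, V, alphas, betas = [], [v], [], []
    u_prev, beta = [0.0] * len(A), 0.0
    for j in range(k):
        # u_j = A v_j - beta_{j-1} u_{j-1}, then normalize.
        u = [a - beta * b for a, b in zip(matvec(A, v), u_prev)]
        alpha = norm(u)
        if alpha < tol:
            break
        u = [t / alpha for t in u]
        U.append(u)
        alphas.append(alpha)
        if j == k - 1:
            break
        # v_{j+1} = A^T u_j - alpha_j v_j, then normalize.
        w = [a - alpha * b for a, b in zip(rmatvec(A, u), v)]
        beta = norm(w)
        if beta < tol:
            break
        v = [t / beta for t in w]
        V.append(v)
        betas.append(beta)
        u_prev = u
    return U, V, alphas, betas
```

The singular values of the small bidiagonal matrix built from alphas and betas
then approximate the leading singular values of A, and that small problem can
be solved on a single node.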

Now, if we were to improve the performance of distributed SVD, that would
help with large data sets, which is our focus.  Perhaps you have some
suggestions on that aspect?

Frank

On Wed, Dec 28, 2016 at 10:17 PM, Aaron Gokaslan <Aa...@gmail.com>
wrote:

> Hello, this is my first time using an email-based forum, so let me know if
> there is anything else I need to do.
>
> I was reading the most recent survey
> <https://madlib.incubator.apache.org/community-artifacts/Apache-MADlib-user-survey-results-Oct-2016.pdf>
> results and one of the features I really agreed on is more scalable SVD. I
> happened to look into that issue and found an interesting Stack Overflow
> post
> <https://stackoverflow.com/questions/36959506/eigen-library-svd-is-slow-compared-to-gsl>
> about a new SVD algorithm that has just been officially added to the latest
> version of Eigen. According to the documentation
> <https://eigen.tuxfamily.org/dox/classEigen_1_1BDCSVD.html> the new
> algorithm is much more scalable than the previous one. This would obviously
> bump the requirements of Eigen to the latest version, 3.3.1, but the much
> faster SVD algorithm would be worth it. I am interested in helping to
> implement the feature, but I wanted to have a JIRA issue opened and discuss
> how to best proceed as this is my first time contributing to an Apache
> project.
>
> TLDR: A new version of Eigen has been released with a more scalable SVD, and
> I would like to see it implemented in MADlib.
>
> Aaron Gokaslan
>