Posted to dev@mahout.apache.org by DB Tsai <db...@dbtsai.com> on 2013/06/12 01:56:18 UTC

PCA in Mahout

Hi folks,

I'm trying to use Mahout's SSVD-based PCA implementation in our
application. I understand that, to avoid densifying the sparse
input, Mahout provides an option to pass the column means as a
parameter to the algorithm. However, the scale of each axis often
differs. Is there any way to pass the variance into the algorithm
as well, without generating a new data set?

i.e., x' = (x - \mu) / \sigma
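
To make the question concrete, here is a minimal sketch in plain Python (not Mahout code; all function names are hypothetical) of why this looks feasible in principle. Standardizing a row is the same as dividing each stored entry by its column's standard deviation, which keeps the row sparse, and then subtracting the vector of scaled means (mean divided by standard deviation). That second vector could in principle be handed over the same way the column means are today. Whether SSVD can apply the per-column scaling internally is exactly what I'm asking; the sketch assumes population standard deviation and treats zero-variance columns as all zeros.

```python
import math

def column_stats(rows, n_cols):
    """Per-column mean and population std; rows are sparse dicts {col: value}."""
    n = len(rows)
    sums = [0.0] * n_cols
    sq_sums = [0.0] * n_cols
    for row in rows:
        for j, v in row.items():  # implicit zeros contribute nothing
            sums[j] += v
            sq_sums[j] += v * v
    means = [s / n for s in sums]
    stds = [math.sqrt(max(q / n - m * m, 0.0)) for q, m in zip(sq_sums, means)]
    return means, stds

def scale_sparse_row(row, stds):
    """Divide stored entries by the column std; zeros stay zeros, so sparsity is kept."""
    return {j: v / stds[j] for j, v in row.items() if stds[j] > 0}

def scaled_means(means, stds):
    """The vector to subtract after scaling: mean / std for each column."""
    return [m / s if s > 0 else 0.0 for m, s in zip(means, stds)]

def standardized_dense(row, means, stds, n_cols):
    """Reference result: the dense row (x - mean) / std, for comparison only."""
    return [
        (row.get(j, 0.0) - means[j]) / stds[j] if stds[j] > 0 else 0.0
        for j in range(n_cols)
    ]
```

For any row, `scale_sparse_row(row, stds)` minus `scaled_means(means, stds)` equals `standardized_dense(row, means, stds, n_cols)` entry by entry, so the dense standardized matrix never needs to be written out.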

Also, our original data set may be in CSV format, and we have a cleanup
method that can generate a clean data set in the mapper. My initial
plan is to generate an intermediate result in Mahout row matrix
format and then pass it to SSVD. However, it would be nice, and would
save a lot of storage, if we could do this step on the fly while
running the algorithm. Could you give me some feedback on this?
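
As a rough illustration of what I mean by cleaning up on the fly, here is a plain-Python sketch (again not Mahout or Hadoop code; `clean_rows` and its cleanup rules are made up for the example). It turns CSV lines into sparse rows lazily, one at a time, so a consumer could read cleaned rows without an intermediate matrix ever being stored.

```python
import csv
import io

def clean_rows(lines, n_cols):
    """Yield one sparse row {col: value} per valid CSV line, lazily."""
    for fields in csv.reader(lines):
        if len(fields) != n_cols:
            continue  # hypothetical cleanup rule: skip rows with the wrong width
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue  # hypothetical cleanup rule: skip rows with non-numeric fields
        # Store only non-zero entries so the row stays sparse.
        yield {j: v for j, v in enumerate(values) if v != 0.0}

if __name__ == "__main__":
    raw = io.StringIO("1,0,2\nbad,row\n0,0,3\n")
    for sparse_row in clean_rows(raw, 3):
        print(sparse_row)
```

In a real job the same logic would presumably live in the mapper that feeds SSVD, which is the part I'm unsure how to wire up.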

Thank you very much. Have a good day.

Sincerely,

DB Tsai
-----------------------------------
Web: http://www.dbtsai.com