Posted to user@mahout.apache.org by Vijay B <b....@gmail.com> on 2014/03/19 17:45:26 UTC

Fwd: Using SSVD for dimensionality reduction on Mahout

Hi All,
I have a CSV file on which I have to perform dimensionality reduction. I'm
new to Mahout; from some searching I understood that SSVD can be used for
this. I'm not sure of the steps that have to be executed before running
SSVD. Please help me.

Thanks,
Vijay

Re: Using SSVD for dimensionality reduction on Mahout

Posted by Vijay B <b....@gmail.com>.
Yes, I agree that the dimensionality of my dataset is low; I intended only
to experiment with this data and then apply SSVD to a huge dataset.

I was actually interested in finding out how my original variables
contribute to every principal component, and your reply answered that.
Many thanks!

Thanks,
Vijay.

Re: Using SSVD for dimensionality reduction on Mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Vijay,

What Ted said. It doesn't make much sense to reduce from 12 dimensions to
7, because 12 is already a low dimensionality.

But suppose we accept the rationale of reducing 12 dimensions to 7. Your
original points are rotated into a 7-dimensional PCA space where they
retain as much of the variance of the original data as possible, i.e. they
basically retain the proportions of the Euclidean distances between each
other, and so are still suitable for clustering, regression, or whatever
else you want to do with them on that basis.

Your U*Sigma output should have the same keys as the input.
If you want to analyze the contribution of your original variables to
every principal component, you need to examine the V output, which in your
case will be really tiny: 12 x 7.
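
As a concrete next step -- your command passed -V false, so you would first
need to rerun ssvd with -V true -- you could then dump V the same way you
dumped USigma (assuming the standard ssvd output layout; exact flags may
vary by Mahout version):

bin/mahout vectordump -i /user/cloudera/reduced_dimensions1/V

Each of the 12 rows of V corresponds to one original variable, and its 7
entries are that variable's loadings on the 7 principal components.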

Re: Using SSVD for dimensionality reduction on Mahout

Posted by Ted Dunning <te...@gmail.com>.
Vijay,

SSVD is not really appropriate with 12 columns. You aren't going to see
any savings at all.

It would be much better if you were to look at extracting the 7 most
interesting columns out of 1000.

The problem is not that SSVD will fail, but rather that you will have to
include all the columns in the computation, so the whole random projection
step is simply wasted effort.

If you want to compute the SVD of a tall, skinny matrix, you can instead do
this:

     X = A' A
     R' R = X
     U_x D V' = R

     U = A V D^{-1}

The first step is a simple map-reduce. The second and third steps are
in-memory. The fourth step is a map-only parallel computation (and is
optional in many cases).
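
To make those four steps concrete, here is a minimal in-memory sketch. It
uses Apache Commons Math rather than Mahout's distributed machinery, the
tiny matrix A stands in for what would be a map-reduce input, and the class
name is made up:

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.CholeskyDecomposition;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public class TallSkinnySvd {
  public static void main(String[] args) {
    // A tiny tall-skinny matrix standing in for the distributed input A.
    RealMatrix a = new Array2DRowRealMatrix(new double[][] {
        {1, 2}, {3, 4}, {5, 6}, {7, 8}
    });

    // Step 1: X = A' A. In Mahout this would be one map-reduce pass;
    // X is only n x n, so everything after this point fits in memory.
    RealMatrix x = a.transpose().multiply(a);

    // Step 2: Cholesky decomposition X = R' R (R is upper triangular).
    RealMatrix r = new CholeskyDecomposition(x).getLT();

    // Step 3: SVD of the small matrix R: R = U_x D V'. D and V are the
    // singular values and right singular vectors of A itself.
    SingularValueDecomposition svd = new SingularValueDecomposition(r);
    RealMatrix d = svd.getS();
    RealMatrix v = svd.getV();

    // Step 4 (optional): U = A V D^{-1}, a map-only projection.
    RealMatrix u = a.multiply(v).multiply(MatrixUtils.inverse(d));

    // Sanity check: U D V' reconstructs A, so the printed norm is ~0.
    System.out.println(u.multiply(d).multiply(v.transpose()).subtract(a).getNorm());
  }
}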

Re: Using SSVD for dimensionality reduction on Mahout

Posted by Vijay B <b....@gmail.com>.
Thanks a lot for the reply.

To gain an understanding of how SSVD works, I have taken a sample CSV file
with 12 columns, and I want to perform dimensionality reduction on it by
asking SSVD to give me the 7 most significant columns.

Snippet of my input csv:

22,2,44,36,5,9,2824,2,4,733,285,169
25,1,150,175,3,9,4037,2,18,1822,254,171

Here's what I have done.
Step 1: Converted the csv to a sequence file; below is a snippet of the
output:
Key: 1: Value:
1:{0:22.0,1:2.0,2:44.0,3:36.0,4:5.0,5:9.0,6:2824.0,7:2.0,8:4.0,9:733.0,10:285.0,11:169.0}
Key: 2: Value:
2:{0:25.0,1:1.0,2:150.0,3:175.0,4:3.0,5:9.0,6:4037.0,7:2.0,8:18.0,9:1822.0,10:254.0,11:171.0}

Step 2: Passed this sequence file as input to the SSVD command; below is
the command I used:

bin/mahout ssvd -i /user/cloudera/seq-data.seq -o
/user/cloudera/reduced_dimensions1 --rank 7 -us true -V false -U false -pca
true -ow -t 1

I then executed vectordump on the contents of the USigma folder; below is
a snippet of the output:

{0:191.5917217160858,1:-349.96930149831184,2:-78.21082086351002,3:98.73075808083476,4:-122.89919847376068,5:4.160343860343885,6:1.4336136023933244}
{0:1293.9486625354516,1:697.7408635015182,2:24.0653800270275,3:60.79480738654566,4:11.733624175113523,5:6.479815864873287,6:-0.9269136621845396}

Please help me interpret the above results in the USigma folder.

Thanks,
Vijay.

Re: Using SSVD for dimensionality reduction on Mahout

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Vijay, how many columns do you have in the CSV? That is the number you will be reducing.

csv:
1,22,33,44,55
13,23,34,45,56

would be dense vectors:
Key: 1: Value:{1:1,2:22,3:33,4:44,5:55}
Key: 2: Value:{1:13,2:23,3:34,4:45,5:56}

Unless you have some reason to assign different dimension indexes, the row and column numbers from your csv should be used in Mahout. Internal to Mahout, the dimensions are assumed to be ordinal. If you do have reasons to say column 1 corresponds to something with an id of 12 (your example below), then you handle that in the output phase of your problem. In other words, if you get an answer corresponding to the Mahout column index of 1, you look up its association to 12 in some dictionary you keep outside of Mahout, same with the row keys. Don't put external IDs in the matrix unless they really are ordinal dimensions.

As Dmitriy said, this sounds like a dense-matrix problem. Usually when I've used SSVD it was on a matrix with 80,000-500,000 columns in a very sparse matrix, so reduction yields big benefits. Also remember that the output is always a dense matrix, so ops performed on it tend to be more heavyweight.
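
As a toy sketch of that dictionary idea (plain Java; the mapping echoes the
example indices from this thread, and all names here are mine):

import java.util.HashMap;
import java.util.Map;

public class ExternalIdLookup {
  public static void main(String[] args) {
    // Kept outside of Mahout: ordinal Mahout column index -> external id.
    Map<Integer, Integer> externalIdByColumn = new HashMap<Integer, Integer>();
    externalIdByColumn.put(1, 12);
    externalIdByColumn.put(2, 1);
    externalIdByColumn.put(3, 14);
    externalIdByColumn.put(4, 8);
    externalIdByColumn.put(5, 15);

    // Output phase: translate a Mahout answer back to the external id.
    int mahoutColumn = 1;
    System.out.println("Mahout column " + mahoutColumn
        + " -> external id " + externalIdByColumn.get(mahoutColumn));
  }
}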

Re: Using SSVD for dimensionality reduction on Mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Wed, Mar 19, 2014 at 11:00 AM, Vijay B <b....@gmail.com> wrote:

> Thanks a lot for the detailed explanation, it was very helpful.
> I will write a CSV to sequence converter, just needed some clarity on the
> key/value pairs in the sequence file.
>
> Suppose my csv file contains the below values
> 11,22,33,44,55
> 13,23,34,45,56
>
> I assume that the sequence file would look like this, where 12, 1, 14, 8,
> 15 are indices which hold the values
> Key:1: Value:{12:11,1:22,14:33,8:44,15:55}
> Key: 2: Value:{12:13,1:23,14:34,8:45,15:56}
>

I am not sure -- why are you remapping the ordinal positions into index
positions? Obviously, DRM supports sparse computations (i.e. you can use
either SequentialAccessSparseVector or RandomAccessSparseVector as vector
values, as long as they have the same cardinality). However, if you imply
that all data point ordinal positions map into the same sparse vector
index, then there's no true sparsity here and you could just form dense
vectors in the ordinal order of your data, it seems.

Other than that, I don't see any issues with your assumptions.
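
To make the dense-versus-sparse point concrete, here is a small sketch
using Mahout's in-memory vector classes (values taken from your example
rows; the class name is made up):

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class VectorShapes {
  public static void main(String[] args) {
    // Every ordinal position holds a value, so a dense vector is natural:
    // index i is simply the i-th CSV column.
    Vector dense = new DenseVector(new double[] {11, 22, 33, 44, 55});

    // A sparse vector of the same cardinality works too, but buys nothing
    // here because no entries are actually zero or missing.
    Vector sparse = new RandomAccessSparseVector(5);
    for (int i = 0; i < dense.size(); i++) {
      sparse.set(i, dense.get(i));
    }

    // Both have cardinality 5 and are valid DRM row values.
    System.out.println(dense.size() + " == " + sparse.size());
  }
}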

Re: Using SSVD for dimensionality reduction on Mahout

Posted by Vijay B <b....@gmail.com>.
Thanks a lot for the detailed explanation; it was very helpful.
I will write a CSV-to-sequence-file converter; I just needed some clarity
on the key/value pairs in the sequence file.

Suppose my csv file contains the values below
11,22,33,44,55
13,23,34,45,56

I assume that the sequence file would look like this, where 12, 1, 14, 8,
and 15 are the indices which hold the values
Key: 1: Value:{12:11,1:22,14:33,8:44,15:55}
Key: 2: Value:{12:13,1:23,14:34,8:45,15:56}

Please confirm if my understanding is correct.

Thanks,
Vijay

Re: Using SSVD for dimensionality reduction on Mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS. The dspca method, which is an almost exact replica of SSVD --pca true,
is also available on Spark, running on exactly the same sequence-file DRM
(there's no CLI though; it needs to be wrapped in Scala code) [1]. It may
potentially perform a bit better than the MR version, although it is new.
If you are in the Scala world and looking for an embedded API, this may be
a better option for you to try. It is new code, though, and we haven't
collected data on its application yet, so it would be awesome if you could
try it.

[1] http://mahout.apache.org/users/sparkbindings/home.html

Re: Using SSVD for dimensionality reduction on Mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I am not sure if we have direct CSV converters to do that; CSV is not that
expressive anyway. But it is not difficult to write up such a converter on
your own, I suppose (a minimal sketch follows the steps below).

The steps you need to take are these:

(1) Prepare a set of data points in the form of (unique vector key,
n-vector) tuples. The vector key can be anything that can be adapted into
a WritableComparable, notably Long or String. The vector key also has to
be unique to make sense for you.
(2) Save the above tuples into a set of sequence files, so that the
sequence file key is the unique vector key and the sequence file value is
an o.a.m.math.VectorWritable.
(3) Decide how many dimensions there will be in the reduced space. The key
is that it is reduced, i.e. you don't need too many; say 50.
(4) Run mahout ssvd --pca true --us true --v false -k <k> .... The
reduced-dimensionality output will be in the folder USigma. The output
will have the same keys, bound to vectors in the reduced space of k
dimensions.
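
A minimal sketch of such a converter, covering steps (1) and (2): this is
only an illustration under assumptions of mine (the CSV is purely numeric,
the row number serves as the unique key, the class name is made up, and
error handling is omitted).

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class CsvToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // args[0]: CSV input path; args[1]: sequence file output path.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), IntWritable.class, VectorWritable.class);

    BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    String line;
    int row = 1;
    while ((line = reader.readLine()) != null) {
      String[] fields = line.split(",");
      double[] values = new double[fields.length];
      for (int i = 0; i < fields.length; i++) {
        values[i] = Double.parseDouble(fields[i]);
      }
      // Key: unique row number; value: the row as a dense Mahout vector.
      writer.append(new IntWritable(row++),
          new VectorWritable(new DenseVector(values)));
    }
    reader.close();
    writer.close();
  }
}

Once the sequence file is written, it can be fed directly to mahout ssvd
as in step (4).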
