Posted to user@mahout.apache.org by Vijaya Pratap <bv...@gmail.com> on 2014/03/18 06:22:45 UTC

Fwd: Need help in executing SSVD for dimensionality reduction on Mahout

Hi,

I am trying to use SSVD for dimensionality reduction in Mahout. The input
is sample data in CSV format. Below is a snippet of the input:

22,2,44,36,5,9,2824,2,4,733,285,169
25,1,150,175,3,9,4037,2,18,1822,254,171

I executed the following steps.

1. Loaded the CSV file and vectorized the data by following the steps
described at https://github.com/tdunning/pig-vector, with the key as
TextConverter and the value as VectorWritable. The output of this step is
listed below. I believe the values 420468, 279945, etc. are indices; please
correct me if I am wrong.
Key: 1: Value:
{420468:733.0,279945:2.0,607618:285.0,107323:4.0,88330:2.0,263605:9.0,975378:169.0,796003:2824.0,899937:44.0,422862:5.0,723271:22.0,508675:36.0}
Key: 1: Value:
{420468:1822.0,279945:2.0,607618:254.0,107323:18.0,88330:1.0,263605:9.0,975378:171.0,796003:4037.0,899937:150.0,422862:3.0,723271:25.0,508675:175.0}

2. Passed the output of the above step to SSVD as follows:
bin/mahout ssvd -i /user/cloudera/vectorized_data/ -o
/user/cloudera/reduced_dimensions --rank 7 -us true -V false -U false -pca
true -ow -t 1

Below is a snippet of the output in the USigma folder:
Key: 1: Value:
{0:190.78376981262613,1:350.30406212052424,2:78.24932121461198,3:98.67283686605012,4:-122.95056058078157,5:-4.201436498582381,6:-1.4370820809434337}
Key: 1: Value:
{0:1295.933786837574,1:-698.5629072274602,2:-24.15996813349674,3:60.936737740013946,4:11.859426028893711,5:-6.379057682687426,6:0.9356299409590896}
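
A quick way to check the index question is to parse the dumped text and
compare it with the CSV rows. The sketch below is a minimal, standalone
Python illustration, not part of the workflow above. The helper name
parse_sparse_vector is hypothetical (it is not a Mahout or pig-vector
function), and the code assumes the dump format is exactly the
{index:value,...} text shown above.

```python
# Parse the "{index:value,...}" text printed for a sparse vector in a dump.
def parse_sparse_vector(text):
    """Return {index: value} for a dump string such as '{3:1.5,7:2.0}'."""
    body = text.strip().strip("{}")
    pairs = (item.split(":") for item in body.split(",") if item)
    return {int(index): float(value) for index, value in pairs}


# The two CSV rows quoted in the message.
csv_rows = [
    [22, 2, 44, 36, 5, 9, 2824, 2, 4, 733, 285, 169],
    [25, 1, 150, 175, 3, 9, 4037, 2, 18, 1822, 254, 171],
]

# The two vectorized rows quoted in the message (output of step 1).
dumped = [
    "{420468:733.0,279945:2.0,607618:285.0,107323:4.0,88330:2.0,263605:9.0,"
    "975378:169.0,796003:2824.0,899937:44.0,422862:5.0,723271:22.0,508675:36.0}",
    "{420468:1822.0,279945:2.0,607618:254.0,107323:18.0,88330:1.0,263605:9.0,"
    "975378:171.0,796003:4037.0,899937:150.0,422862:3.0,723271:25.0,508675:175.0}",
]

vectors = [parse_sparse_vector(d) for d in dumped]

# If the large numbers are indices, both rows should use the same positions.
same_indices = set(vectors[0]) == set(vectors[1])

# Each vector should hold exactly the values of its CSV row, in some order.
values_match = all(
    sorted(vec.values()) == sorted(float(x) for x in row)
    for vec, row in zip(vectors, csv_rows)
)

print(same_indices, values_match)
```

For the two rows shown, both checks print True: the vectors share one set
of 12 positions and carry the CSV values in a different order, which is
consistent with the large numbers being indices rather than data.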

Please let me know whether my approach is correct, and help me interpret
the output in the USigma folder.


Thanks in advance
Pratap

Re: Need help in executing SSVD for dimensionality reduction on Mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
If the rows of the SSVD input are the data points you are trying to create
a reduced space for, then the rows of USigma represent the same points in
the PCA (reduced) space. Input rows and output rows are matched by the keys
in the sequence files: a USigma row has the same key as the input row it
came from. However, your input does not appear to use distinct keys (every
row shown has key 1), which is not recommended.
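
To make the first point concrete, the relationship can be sketched in
matrix terms. This assumes the usual PCA convention that the column means
are subtracted before the decomposition when -pca true is given; the
symbols are notation introduced here, not Mahout identifiers. A is the
input matrix, a_i its i-th row, mu the vector of column means, and U,
Sigma, V the rank-k factors (k = 7 for --rank 7).

```latex
A - \mathbf{1}\mu^{\top} \approx U \Sigma V^{\top},
\qquad
(U\Sigma)_{i,:} \approx (a_i - \mu)^{\top} V
```

Under that reading, each USigma row holds the k coordinates of one centered
input row along the principal directions in V, which is why each row in the
USigma snippet has seven components, indexed 0 through 6.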

SSVD will also propagate names if NamedVector is used for the rows of the
input. That is another possible way to map input rows to PCA-space rows in
USigma. However, the input does not appear to use NamedVector in this case.
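
The row mapping described above can be sketched in a few lines of Python.
This is an illustration only: it works on plain (key, row) pairs rather
than real sequence files, the function names group_by_key and join_by_key
are hypothetical, and the row contents are placeholder labels rather than
actual vectors. It shows why a repeated key such as 1 makes the mapping
ambiguous, while distinct keys (or distinct vector names) do not.

```python
from collections import defaultdict


def group_by_key(records):
    """Group (key, row) records, as read from a sequence file, by key."""
    groups = defaultdict(list)
    for key, row in records:
        groups[key].append(row)
    return dict(groups)


def join_by_key(input_rows, usigma_rows):
    """Pair each input row with its USigma row; reject non-unique keys."""
    left = group_by_key(input_rows)
    right = group_by_key(usigma_rows)
    ambiguous = sorted(
        key
        for key in set(left) | set(right)
        if len(left.get(key, [])) > 1 or len(right.get(key, [])) > 1
    )
    if ambiguous:
        raise ValueError(f"keys are not unique, cannot map rows: {ambiguous}")
    return {key: (left[key][0], right[key][0]) for key in left if key in right}


# Keys as in the posted dumps: every record has key "1".
posted_input = [("1", "input row A"), ("1", "input row B")]
posted_usigma = [("1", "usigma row A"), ("1", "usigma row B")]

# Hypothetical distinct keys, e.g. a row id assigned during vectorization.
distinct_input = [("row-0", "input row A"), ("row-1", "input row B")]
distinct_usigma = [("row-0", "usigma row A"), ("row-1", "usigma row B")]

try:
    join_by_key(posted_input, posted_usigma)
except ValueError as error:
    print("posted keys:", error)

print("distinct keys:", join_by_key(distinct_input, distinct_usigma))
```

With the posted keys the join is refused, because nothing distinguishes one
record with key 1 from another. With distinct keys, each input row pairs
with exactly one USigma row.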

