Posted to user@mahout.apache.org by Donni Khan <pr...@googlemail.com> on 2015/03/30 09:39:20 UTC

Text clustering with SVD

Hello Mahout users,

I'm working on text clustering and would like to reduce the number of
features to improve the clustering process.
I would like to apply Singular Value Decomposition before the clustering
step. I would be thankful to hear from anyone who has used this before. Is it
a good idea for clustering?
Is there any other method in Mahout to reduce the text features before
clustering?
Does anyone have an idea how I can apply SVD using Java code?

Thanks in advance,
Donni

Re: Text clustering with SVD

Posted by Fernando Fernández <fe...@gmail.com>.

SSVD is just one of many ways to compute a partial SVD. In Mahout you also
have the Lanczos method, which I have found faster and more reliable in some
applications, but most people here seem to prefer SSVD; in fact, I think
Lanczos is (or has been) planned to be deprecated. This may also have
changed; it's been two years since I last used Mahout's Lanczos.

As for the optimal k, it's a good idea to compute, for example, k = 300-400
and plot the eigenvalues. The resulting chart usually has an elbow that
indicates a good k for many situations, though in practice you may find
that for your application a lower or a greater k yields better results.
It depends on your application and data, so you might want to experiment a
bit with different values of k.
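
A minimal sketch of the elbow heuristic described above, assuming the leading
eigenvalues (or singular values) have already been exported to a plain list in
descending order. The function name elbow_index and the sample spectrum are
hypothetical, and the rule used here (the point farthest below the straight
line joining the first and last values) is just one common way to locate an
elbow, not something Mahout provides.

```python
def elbow_index(values):
    """Return the 0-based index of the elbow in a descending list of values.

    The elbow is taken to be the point with the largest vertical gap below
    the straight line (chord) joining the first and last values.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values to locate an elbow")
    first, last = values[0], values[-1]
    best_index, best_gap = 0, 0.0
    for i, v in enumerate(values):
        # Height of the chord at position i.
        chord = first + (last - first) * i / (n - 1)
        gap = chord - v
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index


# Hypothetical spectrum: steep decay followed by a long flat tail.
spectrum = [100.0, 60.0, 30.0, 12.0, 10.0, 9.0, 8.5, 8.0, 7.5, 7.0]
k = elbow_index(spectrum) + 1  # keep components up to and including the elbow
```

The value found this way is only a starting point; as noted above, a lower or
greater k may still work better for a given data set.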

Best,
Fernando.

2015-03-30 11:19 GMT+02:00 Donni Khan <pr...@googlemail.com>:

> Hello Suneel,
> Thanks for the fast reply.
> Is SSVD like SVD? Which one is better?
> I ran SSVD from Java code on my data, but how do I compute U*Sigma? Can
> I do that with Mahout?
> Is there an optimal method to determine k?
>
> Another question: how do I relate the SSVD output to the word
> dictionary (real words)?
>
> Thank you
> Donni
>

Re: Text clustering with SVD

Posted by Donni Khan <pr...@googlemail.com>.

Hello again,

I have run SSVD on the textual data as follows.
1. Run ssvd:
bin/mahout ssvd -i  outputTV/tfidf/tfidf-vectors/part-r-00000  -o svdOutput
-k 100   -us true -U false -V false   -t 1   -ow   -pca true
2. Run kmeans:
 bin/mahout kmeans -i svdOutput/USigma/  -c work/kmeans/kmeans-centroids
-cl -o work/kmeans/cluster -k 10 -ow -x 1000 -dm
org.apache.mahout.common.distance.CosineDistanceMeasure
3. Dumping:
bin/mahout clusterdump  -d outputTV/dictionary.file-0   -dt sequencefile -i
work/kmeans/cluster/clusters-1-final -n 20 -b 100 -o work/kmeans/cDump.txt
-p work/kmeans/cluster/clusteredPoints/

Am I right in the above steps?

I got bad results. In the clustering output, all words start with the
letter "a*". Does anyone have an idea why?

Thanks in advance,
Donni
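
One possible explanation, offered as an assumption rather than something
confirmed in this thread: after SSVD each vector has only k = 100 components,
and those components are latent dimensions, not term ids. If clusterdump looks
up indices 0 to 99 in the seq2sparse dictionary, it prints whatever terms
happen to hold those ids, and if the ids follow sorted term order these are
simply the alphabetically first terms. The toy dictionary and the helper
top_terms below are hypothetical and only illustrate the index mismatch.

```python
# Hypothetical term dictionary; ids are assumed to follow sorted term order.
terms = sorted(["zebra", "apple", "about", "cluster",
                "able", "matrix", "above", "vector"])
dictionary = {term_id: term for term_id, term in enumerate(terms)}


def top_terms(vector, dictionary, n):
    """Label the n largest components of a vector using the dictionary."""
    ranked = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)
    return [dictionary[i] for i in ranked[:n]]


# A tf-idf centroid has one component per term, so the labels are meaningful.
tfidf_centroid = [0.0, 0.1, 0.0, 0.2, 0.9, 0.8, 0.7, 0.0]
# A centroid in U*Sigma space has only k components (here k = 3), and index i
# is a latent dimension, not term i.
reduced_centroid = [0.5, -0.2, 0.3]

tfidf_labels = top_terms(tfidf_centroid, dictionary, 3)
reduced_labels = top_terms(reduced_centroid, dictionary, 3)
```

If this is what is happening, the cluster assignments themselves may still be
fine; only the term labels printed from the reduced vectors are meaningless.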

On Mon, Mar 30, 2015 at 11:07 PM, Ted Dunning <te...@gmail.com> wrote:

> Lanczos may be more accurate than SSVD, but if you use a power step or
> three, this difference goes away as well.
>
> The best way to select k is actually to pick a value k_max larger than you
> expect to need and then pick random vectors instead of singular vectors.
> To evaluate how many singular vectors you really need, substitute more and
> more of the components of the random vectors with values from the singular
> vectors.  It is common that the best k_max will be 100-300 for text
> applications, but it is also common that the best k < k_max is much, much
> smaller.
>
> The reason that this is a better selection method is because a) random word
> vectors actually work pretty well because they maintain approximate
> independence of words and b) after k gets to a certain (pretty darned
> small) size, all the SVD is doing is acting as a very fancy and slow random
> number generator.
>

Re: Text clustering with SVD

Posted by Ted Dunning <te...@gmail.com>.

Lanczos may be more accurate than SSVD, but if you use a power step or
three, this difference goes away as well.

The best way to select k is actually to pick a value k_max larger than you
expect to need and then pick random vectors instead of singular vectors.
To evaluate how many singular vectors you really need, substitute more and
more of the components of the random vectors with values from the singular
vectors.  It is common that the best k_max will be 100-300 for text
applications, but it is also common that the best k < k_max is much, much
smaller.

The reason that this is a better selection method is because a) random word
vectors actually work pretty well because they maintain approximate
independence of words and b) after k gets to a certain (pretty darned
small) size, all the SVD is doing is acting as a very fancy and slow random
number generator.
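
A rough sketch of one way to read the procedure above, assuming each item
already has a k_max-dimensional row from the singular-vector output and a
random row of the same length. The helper substitute_components, the sample
rows, and the choice of Gaussian random values are all hypothetical; the
quality measure used to compare the candidates is left to the application.

```python
import random


def substitute_components(random_row, singular_row, k):
    """Replace the first k components of a random vector with the
    corresponding components of the singular-vector representation."""
    if len(random_row) != len(singular_row):
        raise ValueError("both rows must have k_max components")
    if not 0 <= k <= len(random_row):
        raise ValueError("k must lie between 0 and k_max")
    return list(singular_row[:k]) + list(random_row[k:])


rng = random.Random(42)  # fixed seed so the sweep is repeatable
k_max = 6
# Hypothetical row of the singular-vector output for one item.
singular_row = [3.2, -1.5, 0.8, 0.3, -0.1, 0.05]
random_row = [rng.gauss(0.0, 1.0) for _ in range(k_max)]

# One candidate representation per k; each would be fed to the clustering
# step and scored with whatever quality measure the application uses.
candidates = {k: substitute_components(random_row, singular_row, k)
              for k in range(k_max + 1)}
```

The smallest k beyond which the score stops improving is then the number of
singular vectors that are actually doing useful work.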



On Mon, Mar 30, 2015 at 12:00 PM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:

> I am not aware of _any_ scenario under which Lanczos would be faster (see
> N. Halko's dissertation for comparisons), although admittedly I did not
> study all possible cases.
>
> Having -k=100 is probably enough for anything. I would not recommend
> running -q>0 for k>100, as it would become quite slow in the power
> iterations step.
>
> As to your other questions, e.g. the U*Sigma output, see the "overview and
> usage" link given here:
> http://mahout.apache.org/users/dim-reduction/ssvd.html
>

Re: Text clustering with SVD

Posted by Suneel Marthi <su...@gmail.com>.

Lanczos has since been deprecated and will be removed in the upcoming
release, so please desist from using/suggesting Lanczos.


On Mon, Mar 30, 2015 at 3:00 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I am not aware of _any_ scenario under which Lanczos would be faster (see
> N. Halko's dissertation for comparisons), although admittedly I did not
> study all possible cases.
>
> Having -k=100 is probably enough for anything. I would not recommend
> running -q>0 for k>100, as it would become quite slow in the power
> iterations step.
>
> As to your other questions, e.g. the U*Sigma output, see the "overview and
> usage" link given here:
> http://mahout.apache.org/users/dim-reduction/ssvd.html
>

Re: Text clustering with SVD

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

I am not aware of _any_ scenario under which Lanczos would be faster (see
N. Halko's dissertation for comparisons), although admittedly I did not
study all possible cases.

Having -k=100 is probably enough for anything. I would not recommend
running -q>0 for k>100, as it would become quite slow in the power
iterations step.
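
To illustrate what power iterations buy in exchange for the extra passes, here
is a small arithmetic sketch. In the randomized SVD scheme, q power iterations
work with (A A^T)^q A instead of A, which raises every singular value to the
power 2q + 1 and so widens the gap between the leading and the trailing ones.
The function name and the sample singular values are hypothetical.

```python
def sharpened_ratio(sigma_small, sigma_large, q):
    """Ratio of two singular values as seen after q power iterations.

    (A A^T)^q A has singular values sigma ** (2 * q + 1), so the ratio of
    any two of them is raised to the same power.
    """
    if q < 0:
        raise ValueError("q must be non-negative")
    return (sigma_small / sigma_large) ** (2 * q + 1)


# Two nearly equal singular values, 0.9 and 1.0, for q = 0, 1, 2, 3.
ratios = [sharpened_ratio(0.9, 1.0, q) for q in range(4)]
```

With a slowly decaying spectrum like this one, each extra iteration pushes the
ratio further below 1, which is why a power step or two tends to improve
accuracy.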

As to your other questions, e.g. the U*Sigma output, see the "overview and
usage" link given here:
http://mahout.apache.org/users/dim-reduction/ssvd.html

On Mon, Mar 30, 2015 at 2:19 AM, Donni Khan <pr...@googlemail.com>
wrote:

> Hello Suneel,
> Thanks for the fast reply.
> Is SSVD like SVD? Which one is better?
> I ran SSVD from Java code on my data, but how do I compute U*Sigma? Can
> I do that with Mahout?
> Is there an optimal method to determine k?
>
> Another question: how do I relate the SSVD output to the word
> dictionary (real words)?
>
> Thank you
> Donni
>

Re: Text clustering with SVD

Posted by Donni Khan <pr...@googlemail.com>.

Hello Suneel,
Thanks for the fast reply.
Is SSVD like SVD? Which one is better?
I ran SSVD from Java code on my data, but how do I compute U*Sigma? Can
I do that with Mahout?
Is there an optimal method to determine k?

Another question: how do I relate the SSVD output to the word
dictionary (real words)?

Thank you
Donni

On Mon, Mar 30, 2015 at 10:04 AM, Suneel Marthi <su...@gmail.com>
wrote:

> Here are the steps if you are using Mahout-mrlegacy in the present Mahout
> trunk:
>
> 1. Generate tf-idf vectors from the input corpus using seq2sparse (I am
> assuming you have done this before, so I am skipping the details).
>
> 2. Run SSVD on the tf-idf vectors generated in (1):
>
>       ./bin/mahout ssvd -i <tfidf vectors> -o <svd output> -k 80 -pca true \
>           -us true -U false -V false
>
>      k = number of reduced basis vectors
>
>     You need the U*Sigma output of the PCA flow for the next
> clustering step.
>
> 3. Run KMeans (or any other clustering algo) with the U*Sigma from (2) as
> input.
>

Re: Text clustering with SVD

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Note that these instructions (with -pca true) actually run PCA, not plain
SVD, but that's probably the intention here. I don't think just running SVD
helps.
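
The difference between the two comes down to mean-centering: PCA is an SVD of
the data after each column's mean has been subtracted. A minimal sketch of
that centering step on a toy dense matrix follows; the function name and the
data are hypothetical, and this says nothing about how Mahout handles the
mean internally when -pca true is set.

```python
def center_columns(rows):
    """Subtract each column's mean, so that an SVD of the result is a PCA
    of the original rows."""
    if not rows:
        raise ValueError("need at least one row")
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


# Toy document-term matrix: 3 documents, 3 terms.
docs = [[1.0, 0.0, 2.0],
        [3.0, 0.0, 0.0],
        [2.0, 3.0, 1.0]]
centered = center_columns(docs)
```

Without centering, the leading singular vector of a tf-idf matrix tends to
describe the average document rather than the variation between documents.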

On Mon, Mar 30, 2015 at 1:04 AM, Suneel Marthi <su...@gmail.com>
wrote:

> Here are the steps if you are using Mahout-mrlegacy in the present Mahout
> trunk:
>
> 1. Generate tf-idf vectors from the input corpus using seq2sparse (I am
> assuming you have done this before, so I am skipping the details).
>
> 2. Run SSVD on the tf-idf vectors generated in (1):
>
>       ./bin/mahout ssvd -i <tfidf vectors> -o <svd output> -k 80 -pca true \
>           -us true -U false -V false
>
>      k = number of reduced basis vectors
>
>     You need the U*Sigma output of the PCA flow for the next
> clustering step.
>
> 3. Run KMeans (or any other clustering algo) with the U*Sigma from (2) as
> input.
>

Re: Text clustering with SVD

Posted by Suneel Marthi <su...@gmail.com>.

Here are the steps if you are using Mahout-mrlegacy in the present Mahout trunk:

1. Generate tf-idf vectors from the input corpus using seq2sparse (I am
assuming you have done this before, so I am skipping the details).

2. Run SSVD on the tf-idf vectors generated in (1):

      ./bin/mahout ssvd -i <tfidf vectors> -o <svd output> -k 80 -pca true \
          -us true -U false -V false

     k = number of reduced basis vectors

    You need the U*Sigma output of the PCA flow for the next
clustering step.

3. Run KMeans (or any other clustering algo) with the U*Sigma from (2) as
input.
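
For readers wondering what the U*Sigma output in step 2 contains: it is U with
column j scaled by the j-th singular value, which gives each document's
coordinates in the reduced space. A minimal sketch on toy values, assuming U
and the singular values are already available as plain Python lists (the
function name and the numbers are hypothetical):

```python
def u_sigma(u_rows, sigma):
    """Scale column j of U by the j-th singular value.

    Each returned row holds one document's coordinates in the reduced
    k-dimensional space, which is what the clustering step consumes.
    """
    if any(len(row) != len(sigma) for row in u_rows):
        raise ValueError("every row of U needs one entry per singular value")
    return [[u * s for u, s in zip(row, sigma)] for row in u_rows]


# Toy example: 2 documents, k = 2.
u = [[0.6, 0.8],
     [0.8, -0.6]]
sigma = [5.0, 2.0]
reduced = u_sigma(u, sigma)
```

Clustering on U*Sigma rather than on U alone keeps the leading directions
weighted by how much of the data they explain.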

On Mon, Mar 30, 2015 at 3:39 AM, Donni Khan <pr...@googlemail.com>
wrote:

> Hello Mahout users,
>
> I'm working on text clustering and would like to reduce the number of
> features to improve the clustering process.
> I would like to apply Singular Value Decomposition before the clustering
> step. I would be thankful to hear from anyone who has used this before. Is
> it a good idea for clustering?
> Is there any other method in Mahout to reduce the text features before
> clustering?
> Does anyone have an idea how I can apply SVD using Java code?
>
> Thanks in advance,
> Donni
>