Posted to user@mahout.apache.org by syed kather <in...@gmail.com> on 2012/10/19 05:03:23 UTC

K-Means generates only one cluster

Team

    Version Used : Mahout 0.6
    Hadoop : 5 Nodes(1 Master + 4 Slaves)

    We generated k-means clusters for 600000 documents. I then ran
clusterdump, which extracts the top terms from each cluster, and noticed
that only one cluster was produced even though we had set the number of
clusters to 10. To cross-check the commands, I applied the same clustering
to some 1000 documents, and there Mahout was able to generate 10 clusters.

Some observations on the 600000-document data:
    In clusterdump I added "--pointDir <path>", because this option
reports the results document by document. In that output I noticed that
some of the documents do not have a distance:
1.0 : [distance=NaN]: /0_6_1343_504071_6198107.txt =]
  0_6_1343_504071_6198107.txt ==> File Name
1.0 : [distance=NaN]: /0_6_1343_504071_6198108.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198109.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198110.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198111.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198112.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198113.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198114.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198115.txt =]

Here are the commands I executed, one set for the large data (600000
documents) and one for the small data (1000 documents):

# Sequence file generation
# (600000 documents)
bin/mahout seqdirectory -i /hugeData/hugeData/ -o /hugeData/SequenceFiles/ -c UTF-8 -chunk 64
# (1000 documents)
bin/mahout seqdirectory -i /blrdata/blrdata/ -o /blrdata/SequenceFiles/ -c UTF-8 -chunk 64

# Term vector creation
# (600000 documents)
bin/mahout seq2sparse -i /hugeData/SequenceFiles/ -o /hugeData/SequenceFiles-sparse --maxDFPercent 85 --namedVector --minDF 15
# (1000 documents)
bin/mahout seq2sparse -i /blrdata/SequenceFiles/ -o /blrdata/SequenceFiles-sparse --maxDFPercent 85 --namedVector --minDF 15

# Clustering
# (600000 documents)
bin/mahout kmeans -i /hugeData/SequenceFiles-sparse/tfidf-vectors/ -c /hugeData/kmeans-clusters -o /hugeData/kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 10 -ow --clustering
# (1000 documents)
bin/mahout kmeans -i /blrdata/SequenceFiles-sparse/tfidf-vectors/ -c /blrdata/kmeans-clusters -o /blrdata/kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 10 -ow --clustering

# Cluster dump
# (600000 documents)
bin/mahout clusterdump -s hdfs://localhost:9000/hugeData/kmeans/clusters-2-final/ -d hdfs://localhost:9000/hugeData/SequenceFiles-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 100
# (1000 documents)
bin/mahout clusterdump -s hdfs://localhost:9000/blrdata/kmeans/clusters-2-final/ -d hdfs://localhost:9000/blrdata/SequenceFiles-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 10

I am using the MapReduce method for calculating k-means.

I have no clue what is going wrong. Please help me find what I have
missed, and suggest how I can check what went wrong.


Let me know if any further information is required.

Thanks in advance
 S SYED ABDUL KATHER
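
One thing the seq2sparse options above make possible is that short documents lose every term. The following is a minimal Python sketch of document-frequency pruning, written only to illustrate that effect. It is a simplified model and not Mahout's implementation: the function name, the tiny corpus, and the thresholds (min_df=2 instead of 15) are all hypothetical.

```python
from collections import Counter

def prune_by_document_frequency(docs, min_df, max_df_percent):
    """Drop terms seen in fewer than min_df docs or in more than max_df_percent of docs."""
    n_docs = len(docs)
    # Document frequency: number of documents containing the term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    keep = {
        term for term, count in df.items()
        if count >= min_df and 100.0 * count / n_docs <= max_df_percent
    }
    return [[term for term in doc if term in keep] for doc in docs]

# Hypothetical toy corpus; the last document is short and mostly garbage.
docs = [
    ["hadoop", "cluster", "node"],
    ["hadoop", "cluster", "mahout"],
    ["hadoop", "mahout", "vector"],
    ["hadoop", "xq7z"],
]
pruned = prune_by_document_frequency(docs, min_df=2, max_df_percent=85)
empty = [i for i, doc in enumerate(pruned) if not doc]
print(pruned)
print("empty documents:", empty)
```

In this toy run "hadoop" is dropped for being too common and "xq7z" for being too rare, so the last document ends up with no terms at all. Whether the same thing happens on the 600000-document set has to be checked against the real vectors.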

Re: K-Means generates only one cluster

Posted by DAN HELM <da...@verizon.net>.

To look at vectors you can check out the data in the "clusteredPoints" folder generated by k-means.  You can write the data out in text format via the seqdumper command (as shown in step 5 here): http://amgadmadkour.blogspot.com/2012/07/kmeans-clustering-using-apache-mahout.html
 
The clusteredPoints output shows the topics each document was assigned including the distance score and I believe it also lists the document vectors (term:weight pairs).
 
You could also dump out the sparse vectors used as input to k-means via the seqdumper command running against part files in your tfidf-vectors folder, e.g., 
 
mahout seqdumper -s ..../reuters-vectors/tfidf-vectors/part-r-00000 > vectors.txt
 
I believe there is also a vectordump command that can be used to dump out vectors in text format.
 
It is always good to know what kind of input (vectors) you were feeding k-means, to make sure that was not causing the problems.
 
Dan
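
Once the vectors are in a text file such as vectors.txt, counting the empty ones can be scripted. The Python sketch below assumes each record is printed on one line as "Key: <name>: Value: <name>:{index:weight,...}", with an empty vector shown as "{}". That line shape is an assumption and may differ between Mahout versions, so check a few lines of your own dump and adjust the pattern before relying on it. The function name and sample lines are hypothetical.

```python
import re

# Assumed line shape (verify against your own dump):
#   Key: /doc1.txt: Value: /doc1.txt:{12:0.53,40:1.27}
LINE = re.compile(r"^Key: (?P<key>.*?): Value: .*\{(?P<body>[^{}]*)\}\s*$")

def find_empty_vectors(lines):
    """Return (number of vector lines parsed, keys whose vector has no entries)."""
    total, empty = 0, []
    for line in lines:
        match = LINE.match(line)
        if match is None:
            continue  # skip header/summary lines that are not vector records
        total += 1
        if not match.group("body").strip():
            empty.append(match.group("key"))
    return total, empty

# Hypothetical sample standing in for the contents of vectors.txt.
sample = [
    "Input Path: tfidf-vectors/part-r-00000",
    "Key: /doc1.txt: Value: /doc1.txt:{12:0.53,40:1.27}",
    "Key: /doc2.txt: Value: /doc2.txt:{}",
    "Count: 2",
]
total, empty = find_empty_vectors(sample)
print(total, empty)
```

To run it on a real dump, pass an open file instead of the sample list, for example find_empty_vectors(open("vectors.txt")).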
 


Re: K-Means generates only one cluster

Posted by syed kather <in...@gmail.com>.

 Thanks, Dan.
 Yes, I tried Tanimoto, and that gives 6 clusters.


" It appeared for our data after our custom
lucene analyzer and the tfidf filtering was applied (in seq2sparse command)
all terms for many of our documents were removed.  These were documents
that had minimal (and/or garbage) text to begin "
  We did the same: we cleared the junk from the original documents and
even removed the stop words. But in our case it made no difference.

 How can I verify the vectors? Can you please suggest a way?

            Thanks and Regards,
        S SYED ABDUL KATHER




Re: K-Means generates only one cluster

Posted by DAN HELM <da...@verizon.net>.

We previously did some k-means clustering runs on different sized
collections and noticed that one large cluster was often created along with
some smaller ones. In digging deeper it turned out a lot of the document
vectors (produced via the seq2sparse command) were null (empty). k-means
apparently put these together in one large cluster. I also saw NaN for
computed distances for these vectors. And in the “clusteredPoints” file, it
was clear many vectors were empty. It appeared that, for our data, after our
custom lucene analyzer and the tfidf filtering were applied (in the
seq2sparse command), all terms for many of our documents were removed. These
were documents that had minimal (and/or garbage) text to begin with.

So, maybe first verify that you are getting proper vectors for the input to
k-means. We ended up cleaning up the vectors before clustering them (tossing
out the null ones). You can also experiment with different similarity
measures in k-means (e.g., tanimoto).
 Dan  
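
The link between empty vectors and the NaN distances can be seen in the textbook cosine distance formula: an empty vector has norm 0, so the similarity becomes 0/0, which is NaN in IEEE 754 floating-point arithmetic. The Python sketch below illustrates this and the "toss out the null ones" cleanup. It is not Mahout's code: the dict-based vectors, the names, and the sample data are hypothetical, and the NaN is produced explicitly because Python raises an exception on 0.0 / 0.0 instead of returning NaN.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(a, b) for sparse vectors given as {index: weight} dicts."""
    dot = sum(weight * b.get(index, 0.0) for index, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    denom = norm_a * norm_b
    # IEEE 754 defines 0.0 / 0.0 as NaN; Python raises ZeroDivisionError,
    # so the NaN result is written out explicitly here.
    similarity = dot / denom if denom != 0.0 else float("nan")
    return 1.0 - similarity

# Hypothetical document vectors; /doc2.txt lost all of its terms.
vectors = {
    "/doc1.txt": {3: 0.5, 7: 1.2},
    "/doc2.txt": {},
    "/doc3.txt": {3: 0.9},
}
centroid = {3: 1.0, 7: 0.4}

for name, vec in vectors.items():
    print(name, cosine_distance(vec, centroid))

# Cleanup step: drop empty vectors before clustering.
non_empty = {name: vec for name, vec in vectors.items() if vec}
print(sorted(non_empty))
```

Only the empty vector produces NaN here; the other two get ordinary distances between 0 and 1.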
