Posted to user@mahout.apache.org by "Periya.Data" <pe...@gmail.com> on 2011/12/31 01:07:37 UTC

number of clusters

Hi all,
    I am a newbie to Mahout. I am running a basic k-means clustering on a
sample txt file. No matter what number I give to the --numClusters
parameter, I always get only one cluster (VL-0). Can someone please point
out any mistake and suggest what I should do to see a decent number of
clusters?

I successfully converted the txt file into a sequence file and then into
vectorized format.

The command I use is the following:

$MAHOUT_HOME/bin/mahout kmeans \
    --input        /input/mahout/vectorized/tfidf-vectors \
    --output       $HDFS_OUTPUT_DIR/clusters \
    --clusters     $HDFS_OUTPUT_DIR/initialclusters \
    --maxIter      10 \
    --numClusters  20 \
    --clustering \
    --overwrite


Here is the console output:
=====================

pd@PeriyaData:~/Mahout/examples/bin$ ./bigdata_kmeans.sh
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB:
/home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/30 15:59:23 INFO common.AbstractJob: Command line arguments:
{--clustering=null, --clusters=/output/mahout/kmeans/initialclusters,
--convergenceDelta=0.5,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --input=/input/mahout/vectorized/tfidf-vectors,
--maxIter=10, --method=mapreduce, --numClusters=20,
--output=/output/mahout/kmeans/clusters, --overwrite=null, --startPhase=0,
--tempDir=temp}
11/12/30 15:59:23 INFO common.HadoopUtil: Deleting
/output/mahout/kmeans/clusters
11/12/30 15:59:24 INFO common.HadoopUtil: Deleting
/output/mahout/kmeans/initialclusters
11/12/30 15:59:25 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
11/12/30 15:59:25 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
11/12/30 15:59:25 INFO compress.CodecPool: Got brand-new compressor
11/12/30 15:59:25 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to
/output/mahout/kmeans/initialclusters/part-randomSeed
11/12/30 15:59:25 INFO kmeans.KMeansDriver: Input:
/input/mahout/vectorized/tfidf-vectors Clusters In:
/output/mahout/kmeans/initialclusters/part-randomSeed Out:
/output/mahout/kmeans/clusters Distance:
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
11/12/30 15:59:25 INFO kmeans.KMeansDriver: convergence: 0.5 max
Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable
Input Vectors: {}
11/12/30 15:59:25 INFO kmeans.KMeansDriver: K-Means Iteration 1
11/12/30 15:59:25 INFO input.FileInputFormat: Total input paths to process
: 1
11/12/30 15:59:26 INFO mapred.JobClient: Running job: job_201112301129_0029
11/12/30 15:59:27 INFO mapred.JobClient:  map 0% reduce 0%
11/12/30 15:59:30 INFO mapred.JobClient:  map 100% reduce 0%
11/12/30 15:59:39 INFO mapred.JobClient:  map 100% reduce 100%
11/12/30 15:59:39 INFO mapred.JobClient: Job complete: job_201112301129_0029
11/12/30 15:59:39 INFO mapred.JobClient: Counters: 23
11/12/30 15:59:39 INFO mapred.JobClient:   Job Counters
11/12/30 15:59:39 INFO mapred.JobClient:     Launched reduce tasks=1
11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3074
11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
11/12/30 15:59:39 INFO mapred.JobClient:     Launched map tasks=1
11/12/30 15:59:39 INFO mapred.JobClient:     Data-local map tasks=1
11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8299
11/12/30 15:59:39 INFO mapred.JobClient:   Clustering
11/12/30 15:59:39 INFO mapred.JobClient:     Converged Clusters=1
11/12/30 15:59:39 INFO mapred.JobClient:   FileSystemCounters
11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_READ=185593
11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_READ=139801
11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=477505
11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92991
11/12/30 15:59:39 INFO mapred.JobClient:   Map-Reduce Framework
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input groups=1
11/12/30 15:59:39 INFO mapred.JobClient:     Combine output records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Map input records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce output records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Spilled Records=2
11/12/30 15:59:39 INFO mapred.JobClient:     Map output bytes=185582
11/12/30 15:59:39 INFO mapred.JobClient:     Combine input records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Map output records=1
11/12/30 15:59:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input records=1
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Clustering data
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Running Clustering
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Input:
/input/mahout/vectorized/tfidf-vectors Clusters In:
/output/mahout/kmeans/clusters/clusters-1-final Out:
/output/mahout/kmeans/clusters/clusteredPoints Distance:
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@14e4e31
11/12/30 15:59:39 INFO kmeans.KMeansDriver: convergence: 0.5 Input Vectors:
org.apache.mahout.math.VectorWritable
11/12/30 15:59:40 INFO input.FileInputFormat: Total input paths to process
: 1
11/12/30 15:59:40 INFO mapred.JobClient: Running job: job_201112301129_0030
11/12/30 15:59:41 INFO mapred.JobClient:  map 0% reduce 0%
11/12/30 15:59:45 INFO mapred.JobClient:  map 100% reduce 0%
11/12/30 15:59:45 INFO mapred.JobClient: Job complete: job_201112301129_0030
11/12/30 15:59:45 INFO mapred.JobClient: Counters: 13
11/12/30 15:59:45 INFO mapred.JobClient:   Job Counters
11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3815
11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
11/12/30 15:59:45 INFO mapred.JobClient:     Launched map tasks=1
11/12/30 15:59:45 INFO mapred.JobClient:     Data-local map tasks=1
11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
11/12/30 15:59:45 INFO mapred.JobClient:   FileSystemCounters
11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_READ=186054
11/12/30 15:59:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=52059
11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92956
11/12/30 15:59:45 INFO mapred.JobClient:   Map-Reduce Framework
11/12/30 15:59:45 INFO mapred.JobClient:     Map input records=1
11/12/30 15:59:45 INFO mapred.JobClient:     Spilled Records=0
11/12/30 15:59:45 INFO mapred.JobClient:     Map output records=1
11/12/30 15:59:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
11/12/30 15:59:45 INFO driver.MahoutDriver: Program took 21888 ms (Minutes:
0.3648)
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB:
/home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/30 15:59:48 INFO common.AbstractJob: Command line arguments:
{--dictionary=/input/mahout/vectorized/dictionary.file-0,
--dictionaryType=sequencefile,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --numWords=30,
--output=/home/pd/Mahout/examples/output/clusteranalyze.txt,
--outputFormat=TEXT,
--pointsDir=/output/mahout/kmeans/clusters/clusteredPoints,
--seqFileDir=/output/mahout/kmeans/clusters/clusters-*-final,
--startPhase=0, --tempDir=temp}
*11/12/30 15:59:49 INFO clustering.ClusterDumper: Wrote 1 clusters*
11/12/30 15:59:49 INFO driver.MahoutDriver: Program took 1171 ms (Minutes:
0.01951666666666667)
pd@PeriyaData:~/Mahout/examples/bin$


pd@PeriyaData:~/Mahout/examples/bin$ hadoop fs -ls
/output/mahout/kmeans/clusters
Found 2 items
drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59
/output/mahout/kmeans/clusters/clusteredPoints
drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59
/output/mahout/kmeans/clusters/clusters-1-final
pd@PeriyaData:~/Mahout/rabi/examples/bin$ hadoop fs -ls
/output/mahout/kmeans/clusters/clusters-1-final
Found 3 items
-rw-r--r--   1 pd supergroup          0 2011-12-30 15:59
/output/mahout/kmeans/clusters/clusters-1-final/_SUCCESS
drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59
/output/mahout/kmeans/clusters/clusters-1-final/_logs
-rw-r--r--   1 pd supergroup      92991 2011-12-30 15:59
/output/mahout/kmeans/clusters/clusters-1-final/part-r-00000
pd@PeriyaData:~/Mahout/examples/bin$


Thanks,
PD

RE: number of clusters

Posted by Paritosh Ranjan <pr...@xebia.com>.
There can be two reasons for only one cluster being found.

1) The vectors are really close to each other and the clusters converge.
2) The distance measure you are using is not appropriate for your vector values.
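
To make the second reason concrete, here is a minimal sketch in plain Python (not the Mahout API) that compares the two measures on one pair of vectors. The dict-based sparse vectors, the function names, and the sample values are all assumptions made for illustration; only the 0.5 threshold comes from the --convergenceDelta value shown in the log.

```python
import math

def squared_euclidean(a, b):
    """Squared Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)

def euclidean(a, b):
    """Euclidean distance: the square root of the squared distance."""
    return math.sqrt(squared_euclidean(a, b))

# Hypothetical TF-IDF-like vectors (index -> weight), not taken from the thread.
doc_a = {0: 0.5, 1: 0.25}
doc_b = {1: 0.25, 2: 0.5}

# Same value as --convergenceDelta in the posted log.
threshold = 0.5

sq = squared_euclidean(doc_a, doc_b)
eu = euclidean(doc_a, doc_b)

# For distances below 1, squaring shrinks the value, so the same pair can
# fall inside a fixed threshold under one measure and outside it under the other.
print("squared euclidean:", sq, "within threshold:", sq <= threshold)
print("euclidean:", eu, "within threshold:", eu <= threshold)
```

The point of the sketch is only that a fixed threshold means different things under the two measures when the weights are small, which is typical of TF-IDF values.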

Try the following:
1) Analyze the vectors and the distances between them. Are they good candidates for different clusters?
2) Run CanopyClustering first to estimate the number of clusters (experiment a bit with the values of t1 and t2).
3) Then provide the clusters returned by CanopyClustering to KMeans.
4) Use EuclideanDistance instead of Squared...
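
Step 1 can be sketched as a small check on the input vectors, again in plain Python rather than the Mahout API. The dict representation, the `distance_summary` helper, and the sample data are hypothetical. One thing worth checking this way: the console output in the original post reports `Map input records=1`, which may mean the input holds a single vector. That is an inference from the log, not something the thread confirms, but with one input vector no setting of --numClusters can yield more than one populated cluster.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def distance_summary(vectors):
    """Count the vectors, count the distinct ones, and report the
    smallest and largest pairwise distance (None if fewer than two vectors)."""
    distinct = {tuple(sorted(v.items())) for v in vectors}
    dists = [euclidean(a, b) for a, b in combinations(vectors, 2)]
    return {
        "count": len(vectors),
        "distinct": len(distinct),
        "min_distance": min(dists) if dists else None,
        "max_distance": max(dists) if dists else None,
    }

# Hypothetical data: a single vector, as a whole file treated as one document would give.
single = [{0: 1.0, 3: 2.0}]
print(distance_summary(single))

# Hypothetical data: three vectors, two of them identical.
several = [{0: 3.0}, {1: 4.0}, {0: 3.0}]
print(distance_summary(several))
```

If "distinct" comes back smaller than the requested number of clusters, or the largest pairwise distance is tiny compared with the convergence threshold, a single cluster is the expected outcome rather than a bug.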

Paritosh
