Posted to user@mahout.apache.org by Oisin Boydell <oi...@ucd.ie> on 2014/07/30 12:42:32 UTC

Clusterdump output format

Hi,

We have been using K-Means to cluster a fairly large dataset (just under ten million 128-dimensional vectors of floating-point values, about 9.2GB in space-delimited file format). We’re using Hadoop 2.2.0 and Mahout 0.9. The dataset is first converted from the simple space-delimited format into RandomAccessSparseVector format for K-Means using the org.apache.mahout.clustering.conversion.InputDriver utility.

We’re not using Canopy clustering to determine the initial clusters because we want a specific number of clusters (100,000), so we let K-Means create the initial 100,000 random centroids:

./mahout kmeans -i /lookandlearn/vectors_all -c /data/initial_centres -o /data/clusters_output -k 100000 -x 20 -ow -xm mapreduce

It all runs fine and we then extract the computed centroids using the clusterdump utility:

./mahout clusterdump -i /data/clusters_output/clusters-1-final/ -o ./clusters.txt -of TEXT

The clusters.txt output file contains the expected 100,000 lines (one cluster per line); however, there seem to be some idiosyncrasies in the output format…

If we add up the values of n for each cluster, which should be the number of data points assigned to that cluster, we get a total of 39,160,754. We expected this to equal the number of input points (9,769,004), since each input point should belong to exactly one cluster, so we are not sure why the sum of the n values is nearly four times the number of input points.
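For reference, this is roughly how we sum the n values from the TEXT dump. The assumed line shape ("VL-123{n=45 c=[…] r=[…]}") is a sketch of what we see in clusters.txt, not a guarantee of the format:

```python
import re

def total_n(lines):
    """Sum the n= counts from clusterdump TEXT output lines.
    Assumes each cluster line contains an "n=<count>" token, e.g.
    "VL-123{n=45 c=[...] r=[...]}" (VL-/CL- prefix shape is an assumption)."""
    total = 0
    for line in lines:
        m = re.search(r"\bn=(\d+)", line)
        if m:
            total += int(m.group(1))
    return total

# Example with two toy cluster lines:
print(total_n(["VL-0{n=3 c=[0.1]}", "CL-1{n=7 c=[0.2]}"]))  # 10
```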
We also notice that the vector output for the cluster centroids and radii seems to come in a couple of different formats. The majority use a simple comma-separated array format, e.g.

c=[0.008, 0.006, 0.009, 0.014, 0.006, 0.003, 0.007, 0.005, 0.032, 0.004, 0.001, 0.003, 0.002, 0.002, 0.007, 0.017, 0.011, 0.002, 0.001, 0.014, 0.032, 0.015, 0.001, 0.002, 0.025, 0.007, 0.001, 0.007, 0.031, 0.004, 0.000, 0.005, 0.006, 0.003, 0.005, 0.029, 0.023, 0.001, 0.000, 0.005, 0.032, 0.007, 0.001, 0.009, 0.014, 0.002, 0.000, 0.004, 0.011, 0.001, 0.002, 0.010, 0.032, 0.017, 0.000, 0.002, 0.013, 0.019, 0.008, 0.009, 0.017, 0.005, 0.001, 0.003, 0.007, 0.005, 0.002, 0.014, 0.021, 0.002, 0.001, 0.005, 0.032, 0.006, 0.005, 0.014, 0.016, 0.003, 0.001, 0.004, 0.006, 0.000, 0.001, 0.005, 0.031, 0.026, 0.001, 0.002, 0.009, 0.002, 0.003, 0.004, 0.006, 0.015, 0.004, 0.006, 0.006, 0.002, 0.002, 0.006, 0.003, 0.001, 0.003, 0.009, 0.004, 0.002, 0.005, 0.018, 0.012, 0.001, 0.000, 0.002, 0.001, 0.000, 0.007, 0.016, 0.021, 0.006, 0.001, 0.000, 0.006, 0.003, 0.013, 0.012, 0.003, 0.002, 0.000, 0.001]

But there is also a significant number of clusters where the format appears to be a sparse array representation, with each value prefixed by its position index, e.g.

c=[0:0.056, 1:0.006, 2:0.000, 3:0.000, 4:0.000, 5:0.000, 6:0.000, 7:0.004, 8:0.057, 9:0.002, 10:0.000, 11:0.000, 12:0.000, 13:0.000, 14:0.000, 15:0.005, 16:0.056, 17:0.004, 18:0.000, 19:0.000, 20:0.000, 23:0.002, 24:0.024, 25:0.009, 26:0.013, 27:0.005, 28:0.001, 29:0.001, 30:0.000, 31:0.000, 32:0.057, 33:0.006, 34:0.000, 35:0.000, 36:0.000, 37:0.000, 38:0.000, 39:0.002, 40:0.057, 41:0.007, 42:0.000, 43:0.000, 44:0.000, 45:0.000, 46:0.000, 47:0.004, 48:0.057, 49:0.008, 50:0.000, 51:0.000, 52:0.000, 55:0.001, 56:0.050, 57:0.007, 58:0.000, 59:0.000, 60:0.000, 61:0.000, 62:0.000, 63:0.001, 64:0.057, 65:0.003, 66:0.000, 67:0.000, 68:0.000, 69:0.000, 70:0.000, 71:0.006, 72:0.057, 73:0.004, 74:0.000, 75:0.000, 76:0.000, 77:0.000, 78:0.000, 79:0.009, 80:0.057, 81:0.003, 82:0.000, 83:0.000, 84:0.000, 87:0.006, 88:0.047, 89:0.004, 90:0.000, 91:0.000, 92:0.000, 93:0.000, 94:0.000, 95:0.006, 96:0.056, 97:0.005, 98:0.000, 99:0.000, 100:0.000, 101:0.000, 102:0.000, 103:0.003, 104:0.057, 105:0.003, 106:0.000, 107:0.000, 108:0.000, 109:0.000, 110:0.000, 111:0.006, 112:0.056, 113:0.000, 114:0.000, 115:0.000, 116:0.000, 117:0.000, 118:0.000, 119:0.008, 120:0.038, 121:0.001, 122:0.000, 123:0.000, 124:0.000, 125:0.000, 126:0.000, 127:0.006]

In this case, should values missing from the sparse vector format be interpreted as 0.0, e.g. the value for dimension 21 in the example above? And why are zero values still included in this output format (e.g. dimensions 2, 3, 4 etc. above)? It also seems awkward to us that the clusterdump output mixes different vector formats, as this makes it more complex to parse. Finally, we find that if we set the clusterdump output format to CSV instead of TEXT ("-of CSV") no output file is produced.
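To cope with the mixed formats for now, we parse the vector strings with something like the sketch below, treating any index missing from the sparse form as 0.0 (an assumption on our part, since the documentation does not say):

```python
def parse_vector(text, dim=128):
    """Parse a clusterdump vector string into a dense list of floats.
    Handles both forms we see in the output:
      dense:  "[0.008, 0.006, ...]"        (positional values)
      sparse: "[0:0.056, 1:0.006, ...]"    (index:value pairs)
    Indices absent from the sparse form are assumed to be 0.0."""
    values = [0.0] * dim
    body = text.strip().lstrip("[").rstrip("]")
    tokens = [t.strip() for t in body.split(",") if t.strip()]
    for i, token in enumerate(tokens):
        if ":" in token:                  # sparse "index:value" pair
            idx, val = token.split(":")
            values[int(idx)] = float(val)
        else:                             # dense positional value
            values[i] = float(token)
    return values

# Toy examples with dim=4:
print(parse_vector("[0.1, 0.2, 0.3]", dim=4))   # [0.1, 0.2, 0.3, 0.0]
print(parse_vector("[0:0.1, 2:0.3]", dim=4))    # [0.1, 0.0, 0.3, 0.0]
```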

Any information or feedback on the above would be greatly appreciated.

Regards,
Oisin.




Re: Clusterdump output format

Posted by Arian Pasquali <ar...@arianpasquali.com>.
Actually, I'm having the same situation with Mahout 0.9. The TEXT dump works fine, printing the clusters and their representative points, but if I change the output format to CSV it dumps an empty file :/
I'm stuck here too.
Arian Pasquali



On Wed, Jul 30, 2014 at 3:38 AM -0700, "Oisin Boydell" <oi...@ucd.ie> wrote: