Posted to user@mahout.apache.org by DAN HELM <da...@verizon.net> on 2012/12/01 02:55:54 UTC

Re: command line input dataset format for k-means and USCensus dataset

Eduard,
 
My guess is you will need to convert your CSV vectors to Mahout vector format and then run that through k-means.
 
I believe the seqdirectory program just converts a collection of individual text files to sequence file format, which can then be transformed to Mahout vectors via the seq2sparse command.
 
I have never used it myself, but I see there is a CSVVectorIterator class that could be used in your own custom program:
 
 https://cwiki.apache.org/MAHOUT/file-format-integrations.html 
 
This thread talks more about the topic: http://comments.gmane.org/gmane.comp.apache.mahout.user/11310
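Conceptually, the conversion step is just parsing each comma-separated line into a dense numeric vector; in a real Mahout program you would (I believe) wrap each parsed array in a DenseVector/VectorWritable and write the results to a SequenceFile, or let CSVVectorIterator do the parsing for you. Here is a minimal sketch of just the parsing part in plain Java, with no Mahout dependency -- the class and method names are made up for illustration:

```java
import java.util.Arrays;

// Illustrative only: parse one comma-separated census record into a
// dense double[] -- the same kind of transformation CSVVectorIterator
// would perform before handing you a Mahout Vector.
public class CsvRecordParser {

    public static double[] parseLine(String line) {
        String[] fields = line.split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // First few fields of a US Census record, as in the dataset.
        double[] v = parseLine("10000,5,0,1,0");
        System.out.println(Arrays.toString(v));
        // prints [10000.0, 5.0, 0.0, 1.0, 0.0]
    }
}
```

From there, writing each double[] out as a VectorWritable keyed by a Text record id should give you input that kmeans accepts directly, skipping seqdirectory/seq2sparse entirely.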
 
Dan
 

________________________________
 From: Eduard Gamonal <ed...@gmail.com>
To: user@mahout.apache.org 
Sent: Friday, November 30, 2012 4:51 PM
Subject: command line input dataset format for k-means and USCensus dataset
  
Hi,
I have a text file that contains a few thousand lines. Each line is a
set of features, like this:

10000,5,0,1,0,0,5,3,2,2,1,0,1,0,4,3,0,2,0,0,1,0,0,0,0,10,0,1,0,1,0,1,4,2,2,3,0,2,0,2,1,4,3,0,0,0,3,1,0,3,22,0,3,0,1,0,1,0,0,0,5,0,2,1,1,0,11,1,0
source: http://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29

My goal is to cluster all this data with k-means using the command line
interface.
I read the Reuters k-means tutorial, but I guess I can't apply the same
procedure in a straightforward manner: the Reuters example is for
analyzing text, whereas I want to analyze records.

This is what I did:
$ mahout seqdirectory --input uscensus --output uscensus-seq
$ mahout seq2sparse -i uscensus-seq -o uscensus-vec
$ mahout kmeans -i uscensus-vec/tfidf-vectors -o uscensus-kmeans-clusters
-c uscensus-kmeans-centroids -dm
org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -ow -cl -k 25

I still haven't guessed a good starting k and x, though.

I get an empty result:

edu@hadoop:~/kmeans-mahout-uscensus$ cat cdump.txt
CL-1{n=2 c=[] r=[]}
    Top Terms:
    Weight : [props - optional]:  Point:
    1.0: []
    1.0: []
CL-0{n=1 c=[] r=[]}
    Top Terms:
edu@hadoop:~/kmeans-mahout-uscensus$


Questions:
* Do you think my vectors are created correctly? I guess they should
look like <1000, 5, 0, ... 1, 0>, but since I'm following the Reuters
example I can't see why they would be correct.

* Why should I be using the TF-IDF vectors?