Posted to user@mahout.apache.org by "Allen, Ronald L." <al...@ornl.gov> on 2014/02/11 21:28:33 UTC
seqdumper output?
Hello,
I have done something wrong with clustering a CSV file and can't quite figure it out. I am using Mahout 0.9 on a local machine only. Below is the output from seqdumper, and I am not sure how to interpret it. Can anyone help?
Input Path: file:/home/r9r/seqTest/seqKmeans/clusters-1-final/_policy
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.clustering.iterator.ClusteringPolicyWritable
Key: : Value: org.apache.mahout.clustering.iterator.ClusteringPolicyWritable@78be9eb3
Count: 1
Input Path: file:/home/r9r/seqTest/seqKmeans/clusters-1-final/part-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.iterator.ClusterWritable
Key: 0: Value: org.apache.mahout.clustering.iterator.ClusterWritable@592ea0f8
Count: 1
Input Path: file:/home/r9r/seqTest/seqKmeans/clusters-1-final/part-00001
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.iterator.ClusterWritable
Key: 1: Value: org.apache.mahout.clustering.iterator.ClusterWritable@44a2786
Count: 1
There's a good chance I am still not getting my CSV data into a usable form. I can get it into a sequence file, but this is the output I get.
Thanks,
Ronald
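A note on reading the dump above: values such as ClusterWritable@592ea0f8 have the shape produced by Java's default Object.toString(), which is the class name, an "@", and the hash code in hex. That suggests the dumper is printing a value whose class does not override toString(), rather than that the sequence files are empty or corrupt. The sketch below shows the behavior using only the JDK. OpaqueValue and ReadableValue are hypothetical stand-ins for a value type, not Mahout classes.

```java
import java.util.Arrays;

public class ToStringDemo {
    // Hypothetical value type with no toString() override: Object.toString() is used.
    static class OpaqueValue {
        final double[] center;
        OpaqueValue(double[] center) { this.center = center; }
    }

    // Same data, but with a toString() that exposes the contents.
    static class ReadableValue {
        final double[] center;
        ReadableValue(double[] center) { this.center = center; }
        @Override
        public String toString() {
            return "center=" + Arrays.toString(center);
        }
    }

    public static void main(String[] args) {
        double[] c = {2.4135, 1.1120};
        // Prints the class name, '@', and a hex hash code.
        System.out.println("Value: " + new OpaqueValue(c));
        // Prints "Value: center=[2.4135, 1.112]".
        System.out.println("Value: " + new ReadableValue(c));
    }
}
```

So the hash-like output says nothing either way about whether the clusters themselves are sensible; a tool that formats the cluster contents is needed to judge that.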
RE: seqdumper output?
Posted by "Allen, Ronald L." <al...@ornl.gov>.
Hello again, and sorry to bother you with this once more,
I'm having a bit of trouble. My CSV files are just full of numbers (doubles). Each line looks something like this: 2.4135,1.1120. I'm not sure whether this makes a big difference, but when I try to do step #2, I can't figure out what I should put for the field and idField input options. What would I put for these options? Or how could I find out what they are if they already exist?
Thanks very much for your help,
Ronald
Oh, and if it helps, this is the Java code that I came up with to turn the CSV file into text files. I then tried to use Lucene to get the text files into an index. I did this because I couldn't quite follow the code from the link you gave me. I don't think I needed to use a HashMap, but I just wanted to learn to use one.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CsvToTextFiles {
    public static void main(String[] args) throws IOException {
        String inputFile = "/home/r9r/seqTest/seqTestData.csv";
        String outputPath = "/home/r9r/seqTest/seqTestOut/";
        // try-with-resources closes the reader even if a line fails to process
        try (BufferedReader reader = new BufferedReader(new FileReader(new File(inputFile)))) {
            String text;
            int j = 0; // line counter, used as both the map key and the output file name
            while ((text = reader.readLine()) != null) {
                List<String> line = new ArrayList<String>();
                line.add(text);
                Map<String, List<String>> aHashMap = new HashMap<String, List<String>>();
                aHashMap.put(Integer.toString(j), line);
                File newFile = new File(outputPath + Integer.toString(j));
                // Closing the writer flushes its buffer; without this the output files can end up empty
                try (PrintWriter writer = new PrintWriter(newFile)) {
                    for (Map.Entry<String, List<String>> me : aHashMap.entrySet()) {
                        // List.toString() yields "[a,b]"; replace the brackets and commas with spaces
                        String out = me.getKey() + " " + me.getValue().toString().replace("[", " ").replace(",", " ").replace("]", " ");
                        writer.println(out);
                        System.out.println(out);
                    }
                }
                j++;
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
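Since each row is already numeric, one alternative to writing text files would be to parse each line straight into an array of doubles, which is closer to the form a numeric vector needs. This is only a minimal sketch using the JDK. The names CsvRows, parseLine, and parseAll are made up for illustration, and wrapping the arrays in Mahout vectors and writing them to a sequence file is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvRows {
    // Parses one CSV line such as "2.4135,1.1120" into its numeric components.
    static double[] parseLine(String line) {
        String[] fields = line.split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    // Parses every non-blank line; blank lines are skipped.
    static List<double[]> parseAll(List<String> lines) {
        List<double[]> rows = new ArrayList<double[]>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                rows.add(parseLine(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        double[] row = parseLine("2.4135,1.1120");
        System.out.println(row.length + " values: " + row[0] + ", " + row[1]);
    }
}
```

A non-numeric field makes Double.parseDouble throw NumberFormatException, so a header row or stray text would need to be handled before this step.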
________________________________________
From: Suneel Marthi [suneel_marthi@yahoo.com]
Sent: Tuesday, February 11, 2014 5:44 PM
To: user@mahout.apache.org
Subject: Re: seqdumper output?
You should run clusterdump on /home/r9r/seqTest/seqKmeans/clusters-1-final/part-xxxxx to see the points that are in the cluster.
But you need a dictionary for that, which wouldn't be available if the vectors were generated from CSV.
So one way to generate a dictionary for a CSV and verify the clustering output would be to go through the process below:
1. Convert the CSV file to a Lucene index (see http://glaforge.appspot.com/article/lucene-s-fun for sample code).
2. Run the Lucene index from (1) through Mahout's lucene2seq utility - this converts the Lucene indexes into sequence files.
3. Run the output of (2) through seq2sparse - this should generate tf-idf vectors, dictionary, tf-vectors, wordcounts.
4. Run the output of (3) through KMeans Driver.
Please give this a try.
RE: seqdumper output?
Posted by "Allen, Ronald L." <al...@ornl.gov>.
Hey again,
I was able to figure out a way to get my CSV file clustered. For now it is a very rough process. I will refine the steps I took and post what I did on the list hopefully in a week or so.
Thanks for all the help!
Ronald
RE: seqdumper output?
Posted by "Allen, Ronald L." <al...@ornl.gov>.
Thank you Suneel! I will give this a try and let you know how it goes!
Ronald
Re: seqdumper output?
Posted by Suneel Marthi <su...@yahoo.com>.
You should run clusterdump on /home/r9r/seqTest/seqKmeans/clusters-1-final/part-xxxxx to see the points that are in the cluster.
But you need a dictionary for that, which wouldn't be available if the vectors were generated from CSV.
So one way to generate a dictionary for a CSV and verify the clustering output would be to go through the process below:
1. Convert the CSV file to a Lucene index (see http://glaforge.appspot.com/article/lucene-s-fun for sample code).
2. Run the Lucene index from (1) through Mahout's lucene2seq utility - this converts the Lucene indexes into sequence files.
3. Run the output of (2) through seq2sparse - this should generate tf-idf vectors, dictionary, tf-vectors, wordcounts.
4. Run the output of (3) through KMeans Driver.
Please give this a try.
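For readers unsure what the dictionary in the steps above refers to: conceptually it is a mapping between terms and the integer indices used in the vectors, so that a dump tool can turn an index back into a readable term. The following is only a toy illustration of that idea with JDK collections. It is not Mahout's seq2sparse implementation, and DictionarySketch and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionarySketch {
    // Assigns each distinct term the next free integer index, in order of first appearance.
    static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> dictionary = new LinkedHashMap<String, Integer>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                if (!dictionary.containsKey(term)) {
                    dictionary.put(term, dictionary.size());
                }
            }
        }
        return dictionary;
    }

    // Counts how often each dictionary term occurs in one document.
    static int[] termFrequencies(List<String> doc, Map<String, Integer> dictionary) {
        int[] tf = new int[dictionary.size()];
        for (String term : doc) {
            Integer index = dictionary.get(term);
            if (index != null) {
                tf[index]++;
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("red", "blue", "red"),
                Arrays.asList("blue", "green"));
        Map<String, Integer> dictionary = buildDictionary(docs);
        System.out.println(dictionary);                                             // {red=0, blue=1, green=2}
        System.out.println(Arrays.toString(termFrequencies(docs.get(0), dictionary))); // [2, 1, 0]
    }
}
```

Without such a mapping, a vector like [2, 1, 0] can still be clustered, but its dimensions can only be reported by index.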