Posted to user@mahout.apache.org by Aleksander Sadecki <al...@pi.esisar.grenoble-inp.fr> on 2014/05/20 15:44:29 UTC

How to print data after canopy clustering

Hi Experts, 

Here is a simple piece of code that I wrote:


/****************************************************************************/
import java.io.BufferedReader; 
import java.io.FileReader; 
import java.io.IOException; 

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.SequenceFile; 
import org.apache.mahout.clustering.Cluster; 
import org.apache.mahout.clustering.canopy.CanopyDriver; 
import org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable; 
import org.apache.mahout.common.distance.EuclideanDistanceMeasure; 
import org.apache.mahout.math.RandomAccessSparseVector; 
import org.apache.mahout.math.Vector; 
import org.apache.mahout.math.VectorWritable; 


public class Clustering {

  private final static String root = "C:\\root\\BI\\";
  private final static String dataDir = root + "synthetic_control.data";
  private final static String seqDir = root + "synthetic_control.seq";
  private final static String outputDir = root + "output";
  private final static String partMDir = outputDir + "\\"
      + Cluster.CLUSTERED_POINTS_DIR + "\\part-m-0";

  private final static String SEPARATOR = " ";

  private final static int NUMBER_OF_ELEMENTS = 2;

  private Configuration conf;
  private FileSystem fs;

  public Clustering() throws IOException {
    conf = new Configuration();
    fs = FileSystem.get(conf);
  }

  // Reads the space-separated input file and writes each line as a
  // VectorWritable keyed by its line number.
  public void convertToVectorFile() throws IOException {

    BufferedReader reader = new BufferedReader(new FileReader(dataDir));
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        new Path(seqDir), LongWritable.class, VectorWritable.class);

    String line;
    long counter = 0;
    while ((line = reader.readLine()) != null) {
      String[] c = line.split(SEPARATOR);
      double[] d = new double[c.length];
      // Only the first NUMBER_OF_ELEMENTS columns are parsed; anything
      // that fails to parse is stored as 0.
      for (int i = 0; i < NUMBER_OF_ELEMENTS; i++) {
        try {
          d[i] = Double.parseDouble(c[i]);
        } catch (Exception ex) {
          d[i] = 0;
        }
      }

      Vector vec = new RandomAccessSparseVector(c.length);
      vec.assign(d);

      VectorWritable writable = new VectorWritable();
      writable.set(vec);
      writer.append(new LongWritable(counter++), writable);
    }
    reader.close();
    writer.close();
  }

  // Runs canopy clustering on the sequence file and writes the clusters
  // (and the clustered points) under outputDir.
  public void createClusters(double t1, double t2,
      double clusterClassificationThreshold, boolean runSequential)
      throws ClassNotFoundException, IOException, InterruptedException {

    EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
    Path inputPath = new Path(seqDir);
    Path outputPath = new Path(outputDir);

    CanopyDriver.run(inputPath, outputPath, measure, t1, t2, runSequential,
        clusterClassificationThreshold, runSequential);
  }

  // Reads the clustered points back and prints each vector together with
  // the id of the cluster it was assigned to.
  public void printClusters() throws IOException {
    SequenceFile.Reader readerSequence = new SequenceFile.Reader(fs,
        new Path(partMDir), conf);

    IntWritable key = new IntWritable();
    WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
    while (readerSequence.next(key, value)) {
      System.out.println(value.toString() + " belongs to cluster "
          + key.toString());
    }
    readerSequence.close();
  }
}

/****************************************************************************/

Here we have three different methods.

A. convertToVectorFile() 

This method takes the file C:\root\BI\synthetic_control.data and converts it into a sequence file of vectors (I was following the book Mahout in Action).

For the input file:

0.01 1.0 
0.1 0.9 
0.1 0.95 
12.0 13.0 
12.5 12.8 

it generated the following directory structure:

>tree /F 
C:. 
.synthetic_control.seq.crc 
synthetic_control.data 
synthetic_control.seq 

with this log in Eclipse:

DEBUG Groups - Creating new Groups object 
DEBUG Groups - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000 
DEBUG UserGroupInformation - hadoop login 
DEBUG UserGroupInformation - hadoop login commit 
DEBUG UserGroupInformation - using local user:NTUserPrincipal : xxxxxxxx 
DEBUG UserGroupInformation - UGI loginUser:xxxxxxxx 
DEBUG FileSystem - Creating filesystem for file:/// 
DEBUG NativeCodeLoader - Trying to load the custom-built native-hadoop library... 
DEBUG NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path 
DEBUG NativeCodeLoader - java.library.path=C:\Program Files\Java\jre7\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\MATLAB\R2009b\runtime\win64;C:\Program Files\MATLAB\R2009b\bin;C:\Program Files\TortoiseSVN\bin;C:\Users\xxxxxxxx\Documents\apache-maven-3.1.1\bin;. 
WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
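
To double-check what actually ended up in synthetic_control.seq, a small reader like the one below can dump the file back to the console. This is only a sketch: it reuses the conf, fs and seqDir fields from the class above and assumes the file was written with LongWritable keys and VectorWritable values, as in convertToVectorFile().

  // Sketch: dump the generated sequence file to verify the conversion.
  // Reuses the conf/fs/seqDir fields of the Clustering class above.
  public void printSequenceFile() throws IOException {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(seqDir), conf);
    try {
      LongWritable key = new LongWritable();
      VectorWritable value = new VectorWritable();
      while (reader.next(key, value)) {
        // Each record is one line of the input file, keyed by its line number.
        System.out.println(key.get() + " -> " + value.get());
      }
    } finally {
      reader.close();
    }
  }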


B. createClusters() 

The next method generates the clusters. When I run it, it gives me this log:

INFO CanopyDriver - Build Clusters Input: C:/Users/xxxxxxxx/Desktop/BI/synthetic_control.seq Out: C:/Users/xxxxxxxx/Desktop/BI/output Measure: org.apache.mahout.common.distance.EuclideanDistanceMeasure@2224ece4 t1: 2.0 t2: 3.0 
DEBUG CanopyClusterer - Created new Canopy:0 at center:[0.010, 1.000] 
DEBUG CanopyClusterer - Added point: [0.100, 0.900] to canopy: C-0 
DEBUG CanopyClusterer - Added point: [0.100, 0.950] to canopy: C-0 
DEBUG CanopyClusterer - Created new Canopy:1 at center:[12.000, 13.000] 
DEBUG CanopyClusterer - Added point: [12.500, 12.800] to canopy: C-1 
DEBUG CanopyDriver - Writing Canopy:C-0 center:[0.070, 0.950] numPoints:3 radius:[0.042, 0.041] 
DEBUG CanopyDriver - Writing Canopy:C-1 center:[12.250, 12.900] numPoints:2 radius:[0.250, 0.100] 
DEBUG FileSystem - Starting clear of FileSystem cache with 1 elements. 
DEBUG FileSystem - Removing filesystem for file:/// 
DEBUG FileSystem - Removing filesystem for file:/// 
DEBUG FileSystem - Done clearing cache 

and I can see more files in my directory: 

>tree /F 
C:. 
│ .synthetic_control.seq.crc 
│ synthetic_control.data 
│ synthetic_control.seq 
│ 
└───output 
├───clusteredPoints 
│ .part-m-0.crc 
│ part-m-0 
│ 
└───clusters-0-final 
.part-r-00000.crc 
._policy.crc 
part-r-00000 
_policy 

Reading the log, we can see that everything worked: we have two clusters containing the expected points.
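
One thing I am not completely sure about is the CanopyDriver.run() call in createClusters(). As far as I can tell from the Mahout API, the boolean argument before the classification threshold is runClustering (whether the points are written to clusteredPoints at all) and only the last argument is runSequential, so my call passes runSequential for both. A more explicit version of the call, based only on my reading of the signature, would be:

  // Sketch: the same call as in createClusters(), with the booleans named.
  // My understanding of the signature is:
  //   CanopyDriver.run(input, output, measure, t1, t2, runClustering,
  //                    clusterClassificationThreshold, runSequential)
  boolean runClustering = true;   // write the clusteredPoints directory
  boolean runSequential = true;   // run in-process instead of as a MapReduce job
  CanopyDriver.run(inputPath, outputPath, measure, t1, t2, runClustering,
      clusterClassificationThreshold, runSequential);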

C. printClusters() 

Here is my problem.

There are no errors, but I cannot see any results on the console: the code never enters the while loop in printClusters().
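
For debugging, I was thinking of something along these lines to check what is actually inside the clusteredPoints directory before assuming the key/value types. It is just a sketch using the plain Hadoop SequenceFile API (it needs an extra import of org.apache.hadoop.fs.FileStatus) and again reuses the conf, fs and outputDir fields from the class above:

  // Sketch: list the clusteredPoints files and print the key/value classes
  // each part file was written with, to rule out an empty file or a type
  // mismatch in printClusters().
  // Needs: import org.apache.hadoop.fs.FileStatus;
  public void inspectClusteredPoints() throws IOException {
    Path clusteredPoints = new Path(outputDir, Cluster.CLUSTERED_POINTS_DIR);
    for (FileStatus status : fs.listStatus(clusteredPoints)) {
      System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
      if (status.getPath().getName().startsWith("part-")) {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
        try {
          System.out.println("  key class:   " + reader.getKeyClassName());
          System.out.println("  value class: " + reader.getValueClassName());
        } finally {
          reader.close();
        }
      }
    }
  }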

Thank you for any help