Posted to user@mahout.apache.org by Alex Luya <al...@gmail.com> on 2010/09/07 15:57:37 UTC

How to analyze the result of clustering based on mahout 0.4?

Hello:
       I found that the directory output/clusteredPoints doesn't exist, and the result of the dump looks like this:
---------------------------------------------------------------------------------------
VL-21569{n=760 c=[0.4:0.009, 0.68:0.012, 0.75:0.011, 0.79:0.013, 00:0.062, 00.11:0.012,
  Top Terms: 
		quarter                                 =>  2.8133782223651282
		share                                   =>   2.619699128050553
		earnings                                =>   2.210144190411819
		dlrs                                    =>  2.1388998663739156
		cts                                     =>  2.0921635480303515
		dividend                                =>     2.0305285077346
		company                                 =>  1.9935854278112712
		said                                    =>  1.9911234617233275
		its                                     =>  1.8312319523409792
		year                                    =>  1.6385857475431342
---------------------------------------------------------------------------------------

What does the first line mean?

Re: How to analyze the result of clustering based on mahout 0.4?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

  It's a little cryptic, I suppose. This looks like ClusterDumper output. 
The first line is a formatted representation of a converged k-Means 
cluster (VL) with id = 21569. It observed (was assigned) 760 points (n=760) 
during the last iteration. It has a center vector (c=[...]) with several 
terms, and it looks to be sparse. Sparse vector terms print as index:value, 
and it looks like the term dictionary you provided contains some floating 
point coefficients at the beginning. From the top terms printout that 
follows, it looks like only some of your terms are numeric, and indeed the 
top 10 for this cluster all have textual values (quarter, share, 
earnings, ..., year). Buried in the other term printouts you should see 
"quarter:2.813", "share:2.62" and so forth. The cluster also has a 
radius vector (r=[...]), which is the standard deviation of the 760 
observed data points.
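
To make the layout described above concrete, here is a minimal Python sketch that pulls the cluster prefix, id, point count and center terms out of such a header line. It is not part of Mahout: the function name parse_cluster_line and the regular expression are hypothetical, and the sketch assumes that terms contain no commas and that the line may be cut off before the closing bracket, as it is in the dump above.

```python
import re

def parse_cluster_line(line):
    """Parse a ClusterDumper-style header line (hypothetical helper, not a Mahout API)."""
    # Expected shape: <PREFIX>-<id>{n=<count> c=[term:value, term:value, ...
    m = re.match(r"\s*([A-Z]+)-(\d+)\{n=(\d+)\s+c=\[(.*)", line)
    if m is None:
        raise ValueError("not a cluster header line")
    kind, cluster_id, n, rest = m.groups()
    # Keep only the center vector; the closing ']' may be missing if the line is truncated.
    center_text = rest.split("]", 1)[0]
    center = {}
    for entry in center_text.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last ':' so that numeric-looking terms such as "0.4" stay intact.
        term, sep, value = entry.rpartition(":")
        if not sep:
            continue  # entry without a term label; skipped in this sketch
        center[term] = float(value)
    return {"kind": kind, "id": int(cluster_id), "n": int(n), "center": center}

line = "VL-21569{n=760 c=[0.4:0.009, 0.68:0.012, 0.75:0.011, 0.79:0.013, 00:0.062, 00.11:0.012,"
cluster = parse_cluster_line(line)
print(cluster["kind"], cluster["id"], cluster["n"])
print(sorted(cluster["center"].items(), key=lambda kv: -kv[1])[:3])
```

Sorting the center entries by weight, as in the last line, is roughly what the "Top Terms" section reports for the full vector.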
