Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/07/05 17:20:34 UTC
Cluster Labels
I'm running ClusterLabels and it seems to be outputting the same values for every centroid [1]. When I run the cluster dumper, the top terms are fairly different for those same vectors.
Have I hit a vagary of LLR or is this a bug?
Thanks,
Grant
[1]
<snip>
Top labels for Cluster 129062 containing 22710 vectors
Term LLR In-ClusterDF Out-ClusterDF
a 43269.00830466254 0 72060
his 7185.503760070074 0 17203
has 7028.243643655442 0 16855
from 6415.739411605988 0 15488
year 5930.141497239005 0 14391
state 5858.43069797568 0 14228
said 5616.422720833216 0 13676
it 5545.207108973991 0 13513
he 5239.340392438695 0 12810
new 4830.124521905556 0 11862
Top labels for Cluster 129145 containing 11188 vectors
Term LLR In-ClusterDF Out-ClusterDF
a 19576.26998734614 0 72060
his 3352.5135342599824 0 17203
has 3279.466228939127 0 16855
from 2994.8128935270943 0 15488
year 2768.974903047085 0 14391
state 2735.612128134351 0 14228
said 2622.997358441353 0 13676
it 2589.8515553446487 0 13513
he 2447.4579147226177 0 12810
new 2256.8640938592143 0 11862
Top labels for Cluster 129201 containing 13040 vectors
Term LLR In-ClusterDF Out-ClusterDF
a 23110.173012922285 0 72060
his 3940.4691014224663 0 17203
has 3854.554399965331 0 16855
from 3519.784154796507 0 15488
year 3254.2127395244315 0 14391
state 3214.9822960514575 0 14228
said 3082.565408431459 0 13676
it 3043.5924300444312 0 13513
he 2876.171367166564 0 12810
new 2652.0934832417406 0 11862
Top labels for Cluster 129211 containing 14053 vectors
Term LLR In-ClusterDF Out-ClusterDF
a 25083.46391701023 0 72060
his 4266.378291217145 0 17203
has 4173.323467798065 0 16855
from 3810.7467373879626 0 15488
year 3523.1337431534193 0 14391
state 3480.648573280778 0 14228
said 3337.2482196930796 0 13676
it 3295.0432900944725 0 13513
he 3113.741967030335 0 12810
new 2871.0957860480994 0 11862
Top labels for Cluster 129242 containing 12861 vectors
Term LLR In-ClusterDF Out-ClusterDF
a 22764.503256496973 0 72060
his 3883.2002838114277 0 17203
has 3798.5396822127514 0 16855
from 3468.6536546614952 0 15488
year 3206.954131908249 0 14391
state 3168.2954448102973 0 14228
said 3037.808057511691 0 13676
it 2999.402857856825 0 13513
he 2834.4202939094976 0 12810
new 2613.604658874683 0 11862
Top labels for Cluster 129245 containing 6443 vectors
Term LLR In-ClusterDF Out-ClusterDF
a 10925.268199045677 0 72060
his 1890.511348863598 0 17203
has 1849.385320336558 0 16855
from 1689.0946326381527 0 15488
year 1561.8904545903206 0 14391
state 1543.096286157146 0 14228
said 1479.652662154287 0 13676
it 1460.9780013803393 0 13513
he 1380.745082413312 0 12810
new 1273.3357145632617 0 11862
Top labels for Cluster 129255 containing 11390 vectors
Term LLR In-ClusterDF Out-ClusterDF
a 19957.211259535048 0 72060
his 3416.1555761522613 0 17203
has 3341.7163103362545 0 16855
from 3051.6410844950005 0 15488
year 2821.504116652999 0 14391
state 2787.5064550531097 0 14228
said 2672.7490201727487 0 13676
it 2638.972676954698 0 13513
he 2493.870809029322 0 12810
new 2299.653438703157 0 11862
Top labels for Cluster 129265 containing 9461 vectors
Term LLR In-ClusterDF Out-ClusterDF
a 16362.85457371641 0 72060
his 2813.167819214519 0 17203
has 2751.908798408229 0 16855
from 2513.176188033074 0 15488
year 2323.752471229993 0 14391
state 2295.767774611246 0 14228
said 2201.3039346230216 0 13676
it 2173.4997256915085 0 13513
he 2054.0495802331716 0 12810
new 1894.1558320098557 0 11862
Top labels for Cluster 129279 containing 14559 vectors
Term LLR In-ClusterDF Out-ClusterDF
a 26080.197364640888 0 72060
his 4430.338072712999 0 17203
has 4333.689091425855 0 16855
from 3957.116204748396 0 15488
year 3658.40981121175 0 14391
state 3614.286633652635 0 14228
said 3465.358771919273 0 13676
it 3421.527382406406 0 13513
he 3233.2411222746596 0 12810
new 2981.251407010015 0 11862
Top labels for Cluster 129290 containing 13592 vectors
Term LLR In-ClusterDF Out-ClusterDF
a 24181.82589298836 0 72060
his 4117.6785482652485 0 17203
has 4027.8821644652635 0 16855
from 3677.9947950267233 0 15488
year 3400.440033295192 0 14391
state 3359.4400672735646 0 14228
said 3221.0516651300713 0 13676
it 3180.321518546436 0 13513
he 3005.353873868007 0 12810
new 2771.180380204227 0 11862
</snip>
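One way to read the numbers above: every term has an In-ClusterDF of 0, and the Out-ClusterDF column is identical across clusters. The Python sketch below assumes the score is the usual log-likelihood ratio over a 2x2 table of (in cluster / out of cluster) by (has term / lacks term). Under that assumption the ranking depends only on Out-ClusterDF, so every cluster would list the same terms in the same order, and only the magnitudes would change with cluster size. The function names are hypothetical, and the corpus size is a guess (the sum of the ten cluster sizes listed, although the output was snipped). This is an illustration, not the actual ClusterLabels code.

```python
import math

def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy of a list of counts: N ln N - sum(k ln k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G statistic) for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def score(in_df, out_df, cluster_size, corpus_size):
    """Hypothetical per-term score for one cluster, from document frequencies."""
    k12 = cluster_size - in_df                 # cluster docs without the term
    k22 = corpus_size - cluster_size - out_df  # other docs without the term
    return llr(in_df, k12, out_df, k22)

# Out-ClusterDF values copied from the output above; In-ClusterDF is 0 throughout.
OUT_DF = {"a": 72060, "his": 17203, "has": 16855, "from": 15488, "year": 14391}
CORPUS_SIZE = 129297  # assumption: sum of the ten cluster sizes listed

def ranking(cluster_size):
    """Terms ordered by descending score for a cluster of the given size."""
    scores = {t: score(0, df, cluster_size, CORPUS_SIZE) for t, df in OUT_DF.items()}
    return sorted(scores, key=scores.get, reverse=True)

for size in (22710, 6443):
    print(size, ranking(size))
```

If the in-cluster counts were populated, terms concentrated in a single cluster would score differently from cluster to cluster, which is closer to what the cluster dumper shows.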
Re: Cluster Labels
Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:
> Can't say just off-hand.
>
> What is the data?
Small docs, title and description, taken from RSS feeds from 20 or so news sites. Hmm, looks like I created my docs from the wrong field (there shouldn't be stopwords like those below). Let me re-run and I'll report back.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Cluster Labels
Posted by Grant Ingersoll <gs...@apache.org>.
MAHOUT-434 solves the problem.
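For readers following along, here is a toy Python sketch of why losing the vector names could produce an In-ClusterDF of 0 everywhere. It assumes, and this is only an assumption, that the in-cluster counts come from matching point names against document ids. All function names and data are made up, and none of this is Mahout's API.

```python
def ids_from_points(points):
    """Collect document ids from clustered points; unnamed points contribute nothing."""
    return {name for name in points if name is not None}

def in_cluster_df(postings, cluster_ids):
    """For each term, count documents that contain it and belong to the cluster."""
    return {term: len(docs & cluster_ids) for term, docs in postings.items()}

# Toy postings: term -> set of document ids containing it (hypothetical data).
POSTINGS = {"game": {"d1", "d2", "d3"}, "bank": {"d3", "d4"}}

named = ["d1", "d2"]    # points that kept their names through clustering
unnamed = [None, None]  # the same points with the names lost

print(in_cluster_df(POSTINGS, ids_from_points(named)))
print(in_cluster_df(POSTINGS, ids_from_points(unnamed)))
```

With no names to match, the cluster's id set is empty, so every term gets an in-cluster count of 0 regardless of what the cluster actually contains.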
On Jul 5, 2010, at 2:34 PM, Grant Ingersoll wrote:
> https://issues.apache.org/jira/browse/MAHOUT-433
>
> On Mon, Jul 5, 2010 at 2:28 PM, Grant Ingersoll <gs...@apache.org> wrote:
> OK, it seems the problem is that ClusterLabels was never updated when we switched over to WeightedVectorWritable. It also seems that somewhere in the KMeans run we lost the NamedVector again: the clusteredPoints directory does not contain NamedVectors, even though that is what I created the original points as.
>
>
> On Mon, Jul 5, 2010 at 1:55 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Hmmm, different field, more or less the same result, i.e. the labels are the same for every cluster [1]. I also included the cluster dump [2]. I suspect a bug.
>
> [1]
> Top labels for Cluster 129022 containing 19186 vectors
> Term LLR In-ClusterDF Out-ClusterDF
> has 5836.5347257247195 0 16867
> from 5328.54616727354 0 15499
> year 4925.276801970322 0 14400
> state 4866.91887763422 0 14240
> new 4011.6858639516868 0 11867
> after 3882.1740732807666 0 11503
> first 3002.5827110484242 0 8998
> two 2984.1892275922 0 8945
> unit 2930.794111499563 0 8791
> one 2686.95768492762 0 8085
>
> Top labels for Cluster 129119 containing 16043 vectors
> Term LLR In-ClusterDF Out-ClusterDF
> has 4808.386086146813 0 16867
> from 4390.346637147013 0 15499
> year 4058.4180186586455 0 14400
> state 4010.379176544491 0 14240
> new 3306.234930681996 0 11867
> after 3199.5810555517673 0 11503
> first 2475.079962851014 0 8998
> two 2459.926843432244 0 8945
> unit 2415.9376569474116 0 8791
> one 2215.042654468678 0 8085
>
> Top labels for Cluster 129191 containing 7770 vectors
> Term LLR In-ClusterDF Out-ClusterDF
> has 2243.2657141932286 0 16867
> from 2048.755412856117 0 15499
> year 1894.2384706358425 0 14400
> state 1871.8704557279125 0 14240
> new 1543.8513879175298 0 11867
> after 1494.1429192917421 0 11503
> first 1156.303048826754 0 8998
> two 1149.2339147529565 0 8945
> unit 1128.711646862328 0 8791
> one 1034.9745452422649 0 8085
>
> Top labels for Cluster 129302 containing 9426 vectors
> Term LLR In-ClusterDF Out-ClusterDF
> has 2741.316972494591 0 16867
> from 2503.501101480797 0 15499
> year 2314.5996575923637 0 14400
> state 2287.255346294027 0 14240
> new 1886.2961270781234 0 11867
> after 1825.5399498036131 0 11503
> first 1412.654560342431 0 8998
> two 1404.0158626483753 0 8945
> unit 1378.9371921028942 0 8791
> one 1264.391515379306 0 8085
>
> Top labels for Cluster 129360 containing 13092 vectors
> Term LLR In-ClusterDF Out-ClusterDF
> has 3870.8181769265793 0 16867
> from 3534.623348234687 0 15499
> year 3267.633215776179 0 14400
> state 3228.989259615075 0 14240
> new 2662.4551618834957 0 11867
> after 2576.628638952039 0 11503
> first 1993.499155438505 0 8998
> two 1981.3008509986103 0 8945
> unit 1945.8889682726003 0 8791
> one 1784.1570986662991 0 8085
>
> Top labels for Cluster 129371 containing 23944 vectors
> Term LLR In-ClusterDF Out-ClusterDF
> has 7455.31941217836 0 16867
> from 6805.274207816925 0 15499
> year 6289.398677708115 0 14400
> state 6214.757351316046 0 14240
> new 5121.23683049297 0 11867
> after 4955.695805796888 0 11503
> first 3831.788851835765 0 8998
> two 3808.2933898111805 0 8945
> unit 3740.0891623105854 0 8791
> one 3428.6551325367764 0 8085
>
> Top labels for Cluster 129373 containing 9885 vectors
> Term LLR In-ClusterDF Out-ClusterDF
> has 2880.6778563517146 0 16867
> from 2630.736483251676 0 15499
> year 2432.208566541318 0 14400
> state 2403.4711471684277 0 14240
> new 1982.0948037123308 0 11867
> after 1918.2465800205246 0 11503
> first 1484.359997350257 0 8998
> two 1475.282112147659 0 8945
> unit 1448.9285028181039 0 8791
> one 1328.560536378529 0 8085
>
> Top labels for Cluster 129377 containing 11303 vectors
> Term LLR In-ClusterDF Out-ClusterDF
> has 3314.8890487886965 0 16867
> from 3027.14497121796 0 15499
> year 2798.608615776524 0 14400
> state 2765.528720188886 0 14240
> new 2280.5166378575377 0 11867
> after 2207.0322705539875 0 11503
> first 1707.7044410486706 0 8998
> two 1697.2581536169164 0 8945
> unit 1666.932174641639 0 8791
> one 1528.4241032432765 0 8085
>
> Top labels for Cluster 129381 containing 11411 vectors
> Term LLR In-ClusterDF Out-ClusterDF
> has 3348.190782570746 0 16867
> from 3057.545994592365 0 15499
> year 2826.7072093421593 0 14400
> state 2793.2941474220715 0 14240
> new 2303.4001871203072 0 11867
> after 2229.176642407663 0 11503
> first 1724.8293614634313 0 8998
> two 1714.2781240069307 0 8945
> unit 1683.6474849330261 0 8791
> one 1543.7481994605623 0 8085
>
> Top labels for Cluster 129391 containing 7334 vectors
> Term LLR In-ClusterDF Out-ClusterDF
> has 2113.35227333894 0 16867
> from 1930.1305988361128 0 15499
> year 1784.577833758667 0 14400
> state 1763.5072347805835 0 14240
> new 1454.5072316131555 0 11867
> after 1407.6797917694785 0 11503
> first 1089.4127462548204 0 8998
> two 1082.7530186888762 0 8945
> unit 1063.4192575318739 0 8791
> one 975.1101242941804 0 8085
>
> [2]
> :C-129022: [0:0.001, 000:0.003, 004:0.000, 0040:0.000, 0060:0.000, 01:0.000, 0100:0.000, 0110:0.000,
> Top Terms:
> from =>0.022236135215980328
> u => 0.01589135359475966
> busi =>0.014789942880805335
> bank =>0.014395075820558541
> us => 0.01402954110138604
> presid => 0.01341952961319183
> month =>0.012118726267037198
> about =>0.011986047971260612
> compani =>0.011201454374207618
> obama => 0.01105482429336391
> :C-129119: [0:0.001, 00:0.000, 000:0.003, 03:0.000, 04:0.000, 05:0.000, 0656:0.000, 07:0.000, 09:0.00
> Top Terms:
> citi => 0.04119064757467011
> former =>0.030966538725529232
> home =>0.029642735534519644
> player => 0.02879703136878369
> soccer => 0.01847372541986708
> has =>0.015236681440174855
> mark =>0.015185164518720528
> new => 0.01266468154720074
> polic => 0.01253454821409647
> world =>0.011803315296178046
> :C-129191: [0:0.013, 00:0.002, 000:0.004, 000000:0.000, 001:0.000, 0011:0.000, 0022:0.000, 003:0.000,
> Top Terms:
> 4 =>0.027636996760550075
> 3 =>0.026093296145846434
> 1 => 0.02570191540464146
> 5 =>0.024807189589701305
> 2 =>0.023669513631826157
> were =>0.021134415210709086
> sunday =>0.017928504766147838
> play =>0.017243683740808733
> through =>0.017133336974828554
> game =>0.017027790192043733
> :C-129302: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 002:0.000, 008:0.000, 01:0.000, 011:0.000, 0112:
> Top Terms:
> new =>0.039501149799390206
> peopl => 0.01933397797740685
> world =>0.017478792605253438
> could =>0.013495142418778704
> has =>0.012987326502897916
> more =>0.012585724039194569
> from =>0.012242682917236177
> face => 0.0117046220661272
> leader =>0.011579584625370691
> presid =>0.011192085113854965
> :C-129360: [0:0.000, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 005:0.000, 007:0.000, 008:
> Top Terms:
> state =>0.044732720259456946
> unit =>0.032493582810588666
> year =>0.025651340609304542
> san =>0.025617706557963606
> after =>0.022019046306438913
> francisco =>0.020771004252363168
> california => 0.01847124801606253
> day =>0.015514125170527842
> wednesday =>0.014587851421509652
> citi =>0.012973538756014369
> :C-129371: [0:0.002, 00:0.000, 000:0.001, 01:0.000, 010:0.000, 0134:0.000, 016:0.000, 02:0.000, 03:0.
> Top Terms:
> game => 0.04311022785679375
> has => 0.03059922226267673
> all =>0.027605073346921877
> leagu => 0.0267627245855276
> star => 0.02206632764439995
> final =>0.020017765794918686
> season => 0.01534931562714024
> start => 0.01450896856938099
> week =>0.014407234069110549
> nation => 0.01429746391305699
> :C-129373: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 01:0.000, 012:0.000, 016:0.000, 03:0.000, 034:0.
> Top Terms:
> coach => 0.05209277512761816
> team =>0.031773971685165554
> charg =>0.024246280249912454
> from => 0.02093643936347752
> has => 0.02057631329905952
> week =>0.016848920922797363
> last => 0.01674320150844955
> program =>0.016023081209070564
> former =>0.015872337289314063
> after => 0.01341825692502786
> :C-129377: [0:0.002, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 006:0.000, 0065:0.000, 007
> Top Terms:
> been => 0.03757994091979662
> time => 0.03591307497544333
> first => 0.03422461795380875
> has =>0.029800513863644906
> feder =>0.027382680342986195
> monday =>0.022174840523045594
> sinc => 0.02185219249613946
> year => 0.01933420097135394
> from => 0.01162537888358458
> state =>0.009756869426688311
> :C-129381: [0:0.004, 00:0.000, 000:0.002, 00000000235:0.000, 001:0.000, 0011:0.000, 002:0.000, 0051:0
> Top Terms:
> win => 0.03267669747239372
> one =>0.031009191445456212
> second =>0.028066582472705007
> three =>0.026147346665631184
> out => 0.0226123748207931
> shot =>0.020446190395276405
> last =>0.019624841184867056
> night =>0.019103407305052604
> over =>0.017376642133669604
> year =>0.016475201865715022
> :C-129391: [0:0.003, 00:0.000, 000:0.001, 002:0.000, 01:0.000, 0112:0.000, 0123:0.000, 02:0.000, 0213
> Top Terms:
> championship =>0.035449579372280104
> run =>0.026446073370591447
> art => 0.02489330236372834
> open => 0.02282619503375418
> place =>0.022410914360311056
> grand => 0.0169734705340118
> reuter =>0.015895311339829302
> 6 =>0.015700075983436933
> continu =>0.015418929721703813
> slam =>0.012102435338420274
>
>
> -Grant
Re: Cluster Labels
Posted by Grant Ingersoll <gs...@apache.org>.
https://issues.apache.org/jira/browse/MAHOUT-433
>> 0123:0.000, 02:0.000, 0213
>> Top Terms:
>> championship
>> =>0.035449579372280104
>> run
>> =>0.026446073370591447
>> art =>
>> 0.02489330236372834
>> open =>
>> 0.02282619503375418
>> place
>> =>0.022410914360311056
>> grand =>
>> 0.0169734705340118
>> reuter
>> =>0.015895311339829302
>> 6
>> =>0.015700075983436933
>> continu
>> =>0.015418929721703813
>> slam
>> =>0.012102435338420274
>>
>>
>> -Grant
>> On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:
>>
>> > Can't say just off-hand.
>> >
>> > What is the data?
>> >
>> > On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll <gs...@apache.org>
>> wrote:
>> >
>> >> I'm running ClusterLabels and it seems to be outputting the same values
>> for
>> >> every centroid [1]. When I run the cluster dumper, the top terms are
>> fairly
>> >> different for those same vectors.
>> >>
>> >> Have I hit a vagary of LLR or is this a bug?
>> >>
>> >>
>> >> Thanks,
>> >> Grant
>> >>
>> >>
>> >> [1]
>> >> <snip>
>> >> Top labels for Cluster 129062 containing 22710 vectors
>> >> Term LLR In-ClusterDF Out-ClusterDF
>> >> a 43269.00830466254 0 72060
>> >> his 7185.503760070074 0 17203
>> >> has 7028.243643655442 0 16855
>> >> from 6415.739411605988 0 15488
>> >> year 5930.141497239005 0 14391
>> >> state 5858.43069797568 0 14228
>> >> said 5616.422720833216 0 13676
>> >> it 5545.207108973991 0 13513
>> >> he 5239.340392438695 0 12810
>> >> new 4830.124521905556 0 11862
>> >>
>> >> Top labels for Cluster 129145 containing 11188 vectors
>> >> Term LLR In-ClusterDF Out-ClusterDF
>> >> a 19576.26998734614 0 72060
>> >> his 3352.5135342599824 0 17203
>> >> has 3279.466228939127 0 16855
>> >> from 2994.8128935270943 0 15488
>> >> year 2768.974903047085 0 14391
>> >> state 2735.612128134351 0 14228
>> >> said 2622.997358441353 0 13676
>> >> it 2589.8515553446487 0 13513
>> >> he 2447.4579147226177 0 12810
>> >> new 2256.8640938592143 0 11862
>> >>
>> >> Top labels for Cluster 129201 containing 13040 vectors
>> >> Term LLR In-ClusterDF Out-ClusterDF
>> >> a 23110.173012922285 0 72060
>> >> his 3940.4691014224663 0 17203
>> >> has 3854.554399965331 0 16855
>> >> from 3519.784154796507 0 15488
>> >> year 3254.2127395244315 0 14391
>> >> state 3214.9822960514575 0 14228
>> >> said 3082.565408431459 0 13676
>> >> it 3043.5924300444312 0 13513
>> >> he 2876.171367166564 0 12810
>> >> new 2652.0934832417406 0 11862
>> >>
>> >> Top labels for Cluster 129211 containing 14053 vectors
>> >> Term LLR In-ClusterDF Out-ClusterDF
>> >> a 25083.46391701023 0 72060
>> >> his 4266.378291217145 0 17203
>> >> has 4173.323467798065 0 16855
>> >> from 3810.7467373879626 0 15488
>> >> year 3523.1337431534193 0 14391
>> >> state 3480.648573280778 0 14228
>> >> said 3337.2482196930796 0 13676
>> >> it 3295.0432900944725 0 13513
>> >> he 3113.741967030335 0 12810
>> >> new 2871.0957860480994 0 11862
>> >>
>> >> Top labels for Cluster 129242 containing 12861 vectors
>> >> Term LLR In-ClusterDF Out-ClusterDF
>> >> a 22764.503256496973 0 72060
>> >> his 3883.2002838114277 0 17203
>> >> has 3798.5396822127514 0 16855
>> >> from 3468.6536546614952 0 15488
>> >> year 3206.954131908249 0 14391
>> >> state 3168.2954448102973 0 14228
>> >> said 3037.808057511691 0 13676
>> >> it 2999.402857856825 0 13513
>> >> he 2834.4202939094976 0 12810
>> >> new 2613.604658874683 0 11862
>> >>
>> >> Top labels for Cluster 129245 containing 6443 vectors
>> >> Term LLR In-ClusterDF Out-ClusterDF
>> >> a 10925.268199045677 0 72060
>> >> his 1890.511348863598 0 17203
>> >> has 1849.385320336558 0 16855
>> >> from 1689.0946326381527 0 15488
>> >> year 1561.8904545903206 0 14391
>> >> state 1543.096286157146 0 14228
>> >> said 1479.652662154287 0 13676
>> >> it 1460.9780013803393 0 13513
>> >> he 1380.745082413312 0 12810
>> >> new 1273.3357145632617 0 11862
>> >>
>> >> Top labels for Cluster 129255 containing 11390 vectors
>> >> Term LLR In-ClusterDF Out-ClusterDF
>> >> a 19957.211259535048 0 72060
>> >> his 3416.1555761522613 0 17203
>> >> has 3341.7163103362545 0 16855
>> >> from 3051.6410844950005 0 15488
>> >> year 2821.504116652999 0 14391
>> >> state 2787.5064550531097 0 14228
>> >> said 2672.7490201727487 0 13676
>> >> it 2638.972676954698 0 13513
>> >> he 2493.870809029322 0 12810
>> >> new 2299.653438703157 0 11862
>> >>
>> >> Top labels for Cluster 129265 containing 9461 vectors
>> >> Term LLR In-ClusterDF Out-ClusterDF
>> >> a 16362.85457371641 0 72060
>> >> his 2813.167819214519 0 17203
>> >> has 2751.908798408229 0 16855
>> >> from 2513.176188033074 0 15488
>> >> year 2323.752471229993 0 14391
>> >> state 2295.767774611246 0 14228
>> >> said 2201.3039346230216 0 13676
>> >> it 2173.4997256915085 0 13513
>> >> he 2054.0495802331716 0 12810
>> >> new 1894.1558320098557 0 11862
>> >>
>> >> Top labels for Cluster 129279 containing 14559 vectors
>> >> Term LLR In-ClusterDF Out-ClusterDF
>> >> a 26080.197364640888 0 72060
>> >> his 4430.338072712999 0 17203
>> >> has 4333.689091425855 0 16855
>> >> from 3957.116204748396 0 15488
>> >> year 3658.40981121175 0 14391
>> >> state 3614.286633652635 0 14228
>> >> said 3465.358771919273 0 13676
>> >> it 3421.527382406406 0 13513
>> >> he 3233.2411222746596 0 12810
>> >> new 2981.251407010015 0 11862
>> >>
>> >> Top labels for Cluster 129290 containing 13592 vectors
>> >> Term LLR In-ClusterDF Out-ClusterDF
>> >> a 24181.82589298836 0 72060
>> >> his 4117.6785482652485 0 17203
>> >> has 4027.8821644652635 0 16855
>> >> from 3677.9947950267233 0 15488
>> >> year 3400.440033295192 0 14391
>> >> state 3359.4400672735646 0 14228
>> >> said 3221.0516651300713 0 13676
>> >> it 3180.321518546436 0 13513
>> >> he 3005.353873868007 0 12810
>> >> new 2771.180380204227 0 11862
>> >> </snip>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>
Re: Cluster Labels
Posted by Grant Ingersoll <gs...@apache.org>.
OK, it seems the problem is that ClusterLabels was never updated when we
switched over to WeightedVectorWritable. It also looks like we lost the
NamedVector again somewhere in the KMeans run: the clusteredPoints directory
does not contain NamedVectors, even though that is how I created the
original points.
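For anyone following along, here is a minimal sketch (plain Python, not Mahout's code) of why an In-ClusterDF of 0 for every term yields the same label list for every cluster. With the in-cluster count stuck at 0, the LLR of a term grows strictly with its out-of-cluster document frequency, so every cluster ends up ranked by corpus-wide DF and only the magnitude of the scores changes with cluster size. The document frequencies and cluster sizes are taken from the second run in this thread; CORPUS_SIZE and the in-cluster counts in HYPOTHETICAL_IN_DF are assumptions made up for illustration, and the function names are mine.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is taken to be 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def rank_terms(corpus_df, in_df, cluster_size, corpus_size):
    # Rank terms by LLR, highest first.
    scored = []
    for term, df in corpus_df.items():
        k11 = in_df.get(term, 0)                 # in cluster, has term
        k12 = cluster_size - k11                 # in cluster, lacks term
        k21 = df - k11                           # outside cluster, has term
        k22 = corpus_size - cluster_size - k21   # outside cluster, lacks term
        scored.append((llr(k11, k12, k21, k22), term))
    return [term for _, term in sorted(scored, reverse=True)]


# Out-ClusterDF values reported in the thread (equal to corpus DF when in-DF is 0).
CORPUS_DF = {"has": 16867, "from": 15499, "year": 14400,
             "state": 14240, "new": 11867}
CORPUS_SIZE = 130000  # ASSUMPTION: the thread does not state the corpus size.

# Broken case: every in-cluster DF is 0, for two different cluster sizes.
broken_a = rank_terms(CORPUS_DF, {}, 19186, CORPUS_SIZE)
broken_b = rank_terms(CORPUS_DF, {}, 7334, CORPUS_SIZE)
print(broken_a)
print(broken_b)

# HYPOTHETICAL in-cluster counts for a 13092-vector cluster where "state"
# is over-represented and the other terms sit near their expected counts.
HYPOTHETICAL_IN_DF = {"state": 6000, "has": 1700, "from": 1560,
                      "year": 1450, "new": 1195}
fixed = rank_terms(CORPUS_DF, HYPOTHETICAL_IN_DF, 13092, CORPUS_SIZE)
print(fixed)
```

Both "broken" rankings come out in plain descending-DF order, which is the pattern in the ClusterLabels output above. Once the in-cluster counts are actually populated, the term that is over-represented in the cluster moves to the top.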
Re: Cluster Labels
Posted by Grant Ingersoll <gs...@apache.org>.
Hmmm, I tried a different field and got more or less the same result, i.e. the labels are the same for every centroid [1]. I also included the cluster dump [2]. I suspect a bug.
[1]
Top labels for Cluster 129022 containing 19186 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 5836.5347257247195 0 16867
from 5328.54616727354 0 15499
year 4925.276801970322 0 14400
state 4866.91887763422 0 14240
new 4011.6858639516868 0 11867
after 3882.1740732807666 0 11503
first 3002.5827110484242 0 8998
two 2984.1892275922 0 8945
unit 2930.794111499563 0 8791
one 2686.95768492762 0 8085
Top labels for Cluster 129119 containing 16043 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 4808.386086146813 0 16867
from 4390.346637147013 0 15499
year 4058.4180186586455 0 14400
state 4010.379176544491 0 14240
new 3306.234930681996 0 11867
after 3199.5810555517673 0 11503
first 2475.079962851014 0 8998
two 2459.926843432244 0 8945
unit 2415.9376569474116 0 8791
one 2215.042654468678 0 8085
Top labels for Cluster 129191 containing 7770 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 2243.2657141932286 0 16867
from 2048.755412856117 0 15499
year 1894.2384706358425 0 14400
state 1871.8704557279125 0 14240
new 1543.8513879175298 0 11867
after 1494.1429192917421 0 11503
first 1156.303048826754 0 8998
two 1149.2339147529565 0 8945
unit 1128.711646862328 0 8791
one 1034.9745452422649 0 8085
Top labels for Cluster 129302 containing 9426 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 2741.316972494591 0 16867
from 2503.501101480797 0 15499
year 2314.5996575923637 0 14400
state 2287.255346294027 0 14240
new 1886.2961270781234 0 11867
after 1825.5399498036131 0 11503
first 1412.654560342431 0 8998
two 1404.0158626483753 0 8945
unit 1378.9371921028942 0 8791
one 1264.391515379306 0 8085
Top labels for Cluster 129360 containing 13092 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 3870.8181769265793 0 16867
from 3534.623348234687 0 15499
year 3267.633215776179 0 14400
state 3228.989259615075 0 14240
new 2662.4551618834957 0 11867
after 2576.628638952039 0 11503
first 1993.499155438505 0 8998
two 1981.3008509986103 0 8945
unit 1945.8889682726003 0 8791
one 1784.1570986662991 0 8085
Top labels for Cluster 129371 containing 23944 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 7455.31941217836 0 16867
from 6805.274207816925 0 15499
year 6289.398677708115 0 14400
state 6214.757351316046 0 14240
new 5121.23683049297 0 11867
after 4955.695805796888 0 11503
first 3831.788851835765 0 8998
two 3808.2933898111805 0 8945
unit 3740.0891623105854 0 8791
one 3428.6551325367764 0 8085
Top labels for Cluster 129373 containing 9885 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 2880.6778563517146 0 16867
from 2630.736483251676 0 15499
year 2432.208566541318 0 14400
state 2403.4711471684277 0 14240
new 1982.0948037123308 0 11867
after 1918.2465800205246 0 11503
first 1484.359997350257 0 8998
two 1475.282112147659 0 8945
unit 1448.9285028181039 0 8791
one 1328.560536378529 0 8085
Top labels for Cluster 129377 containing 11303 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 3314.8890487886965 0 16867
from 3027.14497121796 0 15499
year 2798.608615776524 0 14400
state 2765.528720188886 0 14240
new 2280.5166378575377 0 11867
after 2207.0322705539875 0 11503
first 1707.7044410486706 0 8998
two 1697.2581536169164 0 8945
unit 1666.932174641639 0 8791
one 1528.4241032432765 0 8085
Top labels for Cluster 129381 containing 11411 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 3348.190782570746 0 16867
from 3057.545994592365 0 15499
year 2826.7072093421593 0 14400
state 2793.2941474220715 0 14240
new 2303.4001871203072 0 11867
after 2229.176642407663 0 11503
first 1724.8293614634313 0 8998
two 1714.2781240069307 0 8945
unit 1683.6474849330261 0 8791
one 1543.7481994605623 0 8085
Top labels for Cluster 129391 containing 7334 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 2113.35227333894 0 16867
from 1930.1305988361128 0 15499
year 1784.577833758667 0 14400
state 1763.5072347805835 0 14240
new 1454.5072316131555 0 11867
after 1407.6797917694785 0 11503
first 1089.4127462548204 0 8998
two 1082.7530186888762 0 8945
unit 1063.4192575318739 0 8791
one 975.1101242941804 0 8085
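The pattern in [1] can also be checked mechanically. Below is a rough sketch that parses ClusterLabels text output in the layout shown above and flags the symptom: more than one cluster, identical term lists, and an In-ClusterDF of 0 everywhere. The function names and the regular expression are my own, and the parser assumes exactly the four-column layout pasted in this message.

```python
import re

HEADER = re.compile(r"Top labels for Cluster (\d+) containing (\d+) vectors")
COLUMNS = ["Term", "LLR", "In-ClusterDF", "Out-ClusterDF"]


def parse_cluster_labels(text):
    # Map cluster id -> list of (term, llr, in_df, out_df) rows.
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = HEADER.match(line)
        if match:
            current = clusters.setdefault(int(match.group(1)), [])
            continue
        parts = line.split()
        if current is None or len(parts) != 4 or parts == COLUMNS:
            continue
        term, score, in_df, out_df = parts
        try:
            current.append((term, float(score), int(in_df), int(out_df)))
        except ValueError:
            continue  # not a data row in the expected layout
    return clusters


def looks_broken(clusters):
    # True when every cluster has the same term list and no in-cluster hits.
    term_lists = {tuple(row[0] for row in rows) for rows in clusters.values()}
    all_zero = all(row[2] == 0 for rows in clusters.values() for row in rows)
    return len(clusters) > 1 and len(term_lists) == 1 and all_zero


# First two rows of two clusters, copied from the output above.
SAMPLE = """\
Top labels for Cluster 129022 containing 19186 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 5836.5347257247195 0 16867
from 5328.54616727354 0 15499
Top labels for Cluster 129119 containing 16043 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 4808.386086146813 0 16867
from 4390.346637147013 0 15499
"""

parsed = parse_cluster_labels(SAMPLE)
print(sorted(parsed), looks_broken(parsed))
```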
[2]
:C-129022: [0:0.001, 000:0.003, 004:0.000, 0040:0.000, 0060:0.000, 01:0.000, 0100:0.000, 0110:0.000,
Top Terms:
from =>0.022236135215980328
u => 0.01589135359475966
busi =>0.014789942880805335
bank =>0.014395075820558541
us => 0.01402954110138604
presid => 0.01341952961319183
month =>0.012118726267037198
about =>0.011986047971260612
compani =>0.011201454374207618
obama => 0.01105482429336391
:C-129119: [0:0.001, 00:0.000, 000:0.003, 03:0.000, 04:0.000, 05:0.000, 0656:0.000, 07:0.000, 09:0.00
Top Terms:
citi => 0.04119064757467011
former =>0.030966538725529232
home =>0.029642735534519644
player => 0.02879703136878369
soccer => 0.01847372541986708
has =>0.015236681440174855
mark =>0.015185164518720528
new => 0.01266468154720074
polic => 0.01253454821409647
world =>0.011803315296178046
:C-129191: [0:0.013, 00:0.002, 000:0.004, 000000:0.000, 001:0.000, 0011:0.000, 0022:0.000, 003:0.000,
Top Terms:
4 =>0.027636996760550075
3 =>0.026093296145846434
1 => 0.02570191540464146
5 =>0.024807189589701305
2 =>0.023669513631826157
were =>0.021134415210709086
sunday =>0.017928504766147838
play =>0.017243683740808733
through =>0.017133336974828554
game =>0.017027790192043733
:C-129302: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 002:0.000, 008:0.000, 01:0.000, 011:0.000, 0112:
Top Terms:
new =>0.039501149799390206
peopl => 0.01933397797740685
world =>0.017478792605253438
could =>0.013495142418778704
has =>0.012987326502897916
more =>0.012585724039194569
from =>0.012242682917236177
face => 0.0117046220661272
leader =>0.011579584625370691
presid =>0.011192085113854965
:C-129360: [0:0.000, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 005:0.000, 007:0.000, 008:
Top Terms:
state =>0.044732720259456946
unit =>0.032493582810588666
year =>0.025651340609304542
san =>0.025617706557963606
after =>0.022019046306438913
francisco =>0.020771004252363168
california => 0.01847124801606253
day =>0.015514125170527842
wednesday =>0.014587851421509652
citi =>0.012973538756014369
:C-129371: [0:0.002, 00:0.000, 000:0.001, 01:0.000, 010:0.000, 0134:0.000, 016:0.000, 02:0.000, 03:0.
Top Terms:
game => 0.04311022785679375
has => 0.03059922226267673
all =>0.027605073346921877
leagu => 0.0267627245855276
star => 0.02206632764439995
final =>0.020017765794918686
season => 0.01534931562714024
start => 0.01450896856938099
week =>0.014407234069110549
nation => 0.01429746391305699
:C-129373: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 01:0.000, 012:0.000, 016:0.000, 03:0.000, 034:0.
Top Terms:
coach => 0.05209277512761816
team =>0.031773971685165554
charg =>0.024246280249912454
from => 0.02093643936347752
has => 0.02057631329905952
week =>0.016848920922797363
last => 0.01674320150844955
program =>0.016023081209070564
former =>0.015872337289314063
after => 0.01341825692502786
:C-129377: [0:0.002, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 006:0.000, 0065:0.000, 007
Top Terms:
been => 0.03757994091979662
time => 0.03591307497544333
first => 0.03422461795380875
has =>0.029800513863644906
feder =>0.027382680342986195
monday =>0.022174840523045594
sinc => 0.02185219249613946
year => 0.01933420097135394
from => 0.01162537888358458
state =>0.009756869426688311
:C-129381: [0:0.004, 00:0.000, 000:0.002, 00000000235:0.000, 001:0.000, 0011:0.000, 002:0.000, 0051:0
Top Terms:
win => 0.03267669747239372
one =>0.031009191445456212
second =>0.028066582472705007
three =>0.026147346665631184
out => 0.0226123748207931
shot =>0.020446190395276405
last =>0.019624841184867056
night =>0.019103407305052604
over =>0.017376642133669604
year =>0.016475201865715022
:C-129391: [0:0.003, 00:0.000, 000:0.001, 002:0.000, 01:0.000, 0112:0.000, 0123:0.000, 02:0.000, 0213
Top Terms:
championship =>0.035449579372280104
run =>0.026446073370591447
art => 0.02489330236372834
open => 0.02282619503375418
place =>0.022410914360311056
grand => 0.0169734705340118
reuter =>0.015895311339829302
6 =>0.015700075983436933
continu =>0.015418929721703813
slam =>0.012102435338420274
-Grant
On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:
> Can't say just off-hand.
>
> What is the data?
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Cluster Labels
Posted by Ted Dunning <te...@gmail.com>.
Can't say just off-hand.
What is the data?
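For anyone reading along, here is a minimal Python sketch of the log-likelihood ratio (G^2) on a 2x2 contingency table. It is meant to illustrate why an In-ClusterDF of 0 for every term would give every cluster the same term ordering. The table layout in term_llr, the helper names, and CORPUS_SIZE are assumptions made for illustration; none of them are taken from the ClusterLabels source, and the sketch says nothing about why the in-cluster counts come out as 0.

```python
import math


def x_log_x(x):
    """x * ln(x), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def term_llr(in_df, out_df, cluster_size, corpus_size):
    """Score one term for one cluster.

    Assumed layout (hypothetical): rows are term present / term absent,
    columns are inside / outside the cluster.
    """
    return llr(in_df,
               out_df,
               cluster_size - in_df,
               corpus_size - cluster_size - out_df)


# Hypothetical corpus size; the thread does not state the real one.
CORPUS_SIZE = 130_000

# Out-ClusterDF values and cluster sizes copied from the posted output.
out_df = {"a": 72060, "his": 17203, "has": 16855}
cluster_sizes = {129062: 22710, 129145: 11188}

for cluster_id, size in cluster_sizes.items():
    ranked = sorted(out_df,
                    key=lambda t: term_llr(0, out_df[t], size, CORPUS_SIZE),
                    reverse=True)
    print(cluster_id, ranked)
```

With in_df fixed at 0, the score grows with out_df and with cluster size, so the ranking is decided by Out-ClusterDF alone and only the magnitudes differ between clusters. That matches the shape of the posted output, whatever the underlying cause turns out to be.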