Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/07/05 17:20:34 UTC

Cluster Labels

I'm running ClusterLabels and it seems to be outputting the same values for every centroid [1].  When I run the cluster dumper, the top terms are fairly different for those same vectors.

Have I hit a vagary of LLR or is this a bug?


Thanks,
Grant


[1] 
<snip>
Top labels for Cluster 129062 containing 22710 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
a               43269.00830466254               0               72060
his             7185.503760070074               0               17203
has             7028.243643655442               0               16855
from            6415.739411605988               0               15488
year            5930.141497239005               0               14391
state           5858.43069797568                0               14228
said            5616.422720833216               0               13676
it              5545.207108973991               0               13513
he              5239.340392438695               0               12810
new             4830.124521905556               0               11862

Top labels for Cluster 129145 containing 11188 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
a               19576.26998734614               0               72060
his             3352.5135342599824              0               17203
has             3279.466228939127               0               16855
from            2994.8128935270943              0               15488
year            2768.974903047085               0               14391
state           2735.612128134351               0               14228
said            2622.997358441353               0               13676
it              2589.8515553446487              0               13513
he              2447.4579147226177              0               12810
new             2256.8640938592143              0               11862

Top labels for Cluster 129201 containing 13040 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
a               23110.173012922285              0               72060
his             3940.4691014224663              0               17203
has             3854.554399965331               0               16855
from            3519.784154796507               0               15488
year            3254.2127395244315              0               14391
state           3214.9822960514575              0               14228
said            3082.565408431459               0               13676
it              3043.5924300444312              0               13513
he              2876.171367166564               0               12810
new             2652.0934832417406              0               11862

Top labels for Cluster 129211 containing 14053 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
a               25083.46391701023               0               72060
his             4266.378291217145               0               17203
has             4173.323467798065               0               16855
from            3810.7467373879626              0               15488
year            3523.1337431534193              0               14391
state           3480.648573280778               0               14228
said            3337.2482196930796              0               13676
it              3295.0432900944725              0               13513
he              3113.741967030335               0               12810
new             2871.0957860480994              0               11862

Top labels for Cluster 129242 containing 12861 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
a               22764.503256496973              0               72060
his             3883.2002838114277              0               17203
has             3798.5396822127514              0               16855
from            3468.6536546614952              0               15488
year            3206.954131908249               0               14391
state           3168.2954448102973              0               14228
said            3037.808057511691               0               13676
it              2999.402857856825               0               13513
he              2834.4202939094976              0               12810
new             2613.604658874683               0               11862

Top labels for Cluster 129245 containing 6443 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
a               10925.268199045677              0               72060
his             1890.511348863598               0               17203
has             1849.385320336558               0               16855
from            1689.0946326381527              0               15488
year            1561.8904545903206              0               14391
state           1543.096286157146               0               14228
said            1479.652662154287               0               13676
it              1460.9780013803393              0               13513
he              1380.745082413312               0               12810
new             1273.3357145632617              0               11862

Top labels for Cluster 129255 containing 11390 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
a               19957.211259535048              0               72060
his             3416.1555761522613              0               17203
has             3341.7163103362545              0               16855
from            3051.6410844950005              0               15488
year            2821.504116652999               0               14391
state           2787.5064550531097              0               14228
said            2672.7490201727487              0               13676
it              2638.972676954698               0               13513
he              2493.870809029322               0               12810
new             2299.653438703157               0               11862

Top labels for Cluster 129265 containing 9461 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
a               16362.85457371641               0               72060
his             2813.167819214519               0               17203
has             2751.908798408229               0               16855
from            2513.176188033074               0               15488
year            2323.752471229993               0               14391
state           2295.767774611246               0               14228
said            2201.3039346230216              0               13676
it              2173.4997256915085              0               13513
he              2054.0495802331716              0               12810
new             1894.1558320098557              0               11862

Top labels for Cluster 129279 containing 14559 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
a               26080.197364640888              0               72060
his             4430.338072712999               0               17203
has             4333.689091425855               0               16855
from            3957.116204748396               0               15488
year            3658.40981121175                0               14391
state           3614.286633652635               0               14228
said            3465.358771919273               0               13676
it              3421.527382406406               0               13513
he              3233.2411222746596              0               12810
new             2981.251407010015               0               11862

Top labels for Cluster 129290 containing 13592 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
a               24181.82589298836               0               72060
his             4117.6785482652485              0               17203
has             4027.8821644652635              0               16855
from            3677.9947950267233              0               15488
year            3400.440033295192               0               14391
state           3359.4400672735646              0               14228
said            3221.0516651300713              0               13676
it              3180.321518546436               0               13513
he              3005.353873868007               0               12810
new             2771.180380204227               0               11862
</snip>
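
For reference, the LLR column above can be sanity-checked by hand. What follows is a minimal Python sketch (not Mahout's Java code) of the standard log-likelihood ratio for a 2x2 contingency table. The table layout in `label_score`, the helper names, and the corpus size are assumptions made for illustration: the corpus size is taken to be the sum of the ten cluster sizes listed above (129297), which nothing in this thread confirms.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a collection of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def label_score(in_df, out_df, cluster_size, corpus_size):
    """Hypothetical helper. Assumed table layout:
    k11 = docs in the cluster containing the term
    k12 = docs in the cluster without the term
    k21 = docs outside the cluster containing the term
    k22 = docs outside the cluster without the term
    """
    k12 = cluster_size - in_df
    k22 = corpus_size - cluster_size - out_df
    return llr(in_df, k12, out_df, k22)


# Assumption: corpus size = sum of the ten cluster sizes listed above.
CORPUS_SIZE = 129297

# Term "a" in Cluster 129062: In-ClusterDF 0, Out-ClusterDF 72060, 22710 vectors.
score = label_score(0, 72060, 22710, CORPUS_SIZE)
print(score)
```

Under those assumptions the first row ("a" in Cluster 129062) comes out close to the reported 43269. LLR measures departure from independence in either direction, so a frequent term with an In-ClusterDF of 0 still scores high, and the score grows with cluster size. The scores alone therefore do not say whether the zero in-cluster counts are genuine.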

Re: Cluster Labels

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:

> Can't say just off-hand.
> 
> What is the data?

Small docs, title and description, taken from RSS feeds from 20 or so news sites.  Hmm, looks like I created my docs from the wrong field (there shouldn't be stopwords like those below).  Let me re-run and I'll report back.



--------------------------
Grant Ingersoll
http://www.lucidimagination.com/



Re: Cluster Labels

Posted by Grant Ingersoll <gs...@apache.org>.
MAHOUT-434 solves the problem.


On Jul 5, 2010, at 2:34 PM, Grant Ingersoll wrote:

> https://issues.apache.org/jira/browse/MAHOUT-433
> 
> On Mon, Jul 5, 2010 at 2:28 PM, Grant Ingersoll <gs...@apache.org> wrote:
> OK, it seems the problem is that ClusterLabels was never updated when we switched over to WeightedVectorWritable. It also seems that somewhere in the KMeans run we lost the NamedVector again: the clusteredPoints directory does not contain NamedVectors, even though that is what I created the original points as.
> 
> 
> On Mon, Jul 5, 2010 at 1:55 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Hmmm, different field, more or less the same result, i.e. the labels are the same for every cluster [1].  I also included the cluster dump [2].  I suspect a bug.
> 
> [1]
> Top labels for Cluster 129022 containing 19186 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             5836.5347257247195              0               16867
> from            5328.54616727354                0               15499
> year            4925.276801970322               0               14400
> state           4866.91887763422                0               14240
> new             4011.6858639516868              0               11867
> after           3882.1740732807666              0               11503
> first           3002.5827110484242              0               8998
> two             2984.1892275922                 0               8945
> unit            2930.794111499563               0               8791
> one             2686.95768492762                0               8085
> 
> Top labels for Cluster 129119 containing 16043 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             4808.386086146813               0               16867
> from            4390.346637147013               0               15499
> year            4058.4180186586455              0               14400
> state           4010.379176544491               0               14240
> new             3306.234930681996               0               11867
> after           3199.5810555517673              0               11503
> first           2475.079962851014               0               8998
> two             2459.926843432244               0               8945
> unit            2415.9376569474116              0               8791
> one             2215.042654468678               0               8085
> 
> Top labels for Cluster 129191 containing 7770 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2243.2657141932286              0               16867
> from            2048.755412856117               0               15499
> year            1894.2384706358425              0               14400
> state           1871.8704557279125              0               14240
> new             1543.8513879175298              0               11867
> after           1494.1429192917421              0               11503
> first           1156.303048826754               0               8998
> two             1149.2339147529565              0               8945
> unit            1128.711646862328               0               8791
> one             1034.9745452422649              0               8085
> 
> Top labels for Cluster 129302 containing 9426 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2741.316972494591               0               16867
> from            2503.501101480797               0               15499
> year            2314.5996575923637              0               14400
> state           2287.255346294027               0               14240
> new             1886.2961270781234              0               11867
> after           1825.5399498036131              0               11503
> first           1412.654560342431               0               8998
> two             1404.0158626483753              0               8945
> unit            1378.9371921028942              0               8791
> one             1264.391515379306               0               8085
> 
> Top labels for Cluster 129360 containing 13092 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             3870.8181769265793              0               16867
> from            3534.623348234687               0               15499
> year            3267.633215776179               0               14400
> state           3228.989259615075               0               14240
> new             2662.4551618834957              0               11867
> after           2576.628638952039               0               11503
> first           1993.499155438505               0               8998
> two             1981.3008509986103              0               8945
> unit            1945.8889682726003              0               8791
> one             1784.1570986662991              0               8085
> 
> Top labels for Cluster 129371 containing 23944 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             7455.31941217836                0               16867
> from            6805.274207816925               0               15499
> year            6289.398677708115               0               14400
> state           6214.757351316046               0               14240
> new             5121.23683049297                0               11867
> after           4955.695805796888               0               11503
> first           3831.788851835765               0               8998
> two             3808.2933898111805              0               8945
> unit            3740.0891623105854              0               8791
> one             3428.6551325367764              0               8085
> 
> Top labels for Cluster 129373 containing 9885 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2880.6778563517146              0               16867
> from            2630.736483251676               0               15499
> year            2432.208566541318               0               14400
> state           2403.4711471684277              0               14240
> new             1982.0948037123308              0               11867
> after           1918.2465800205246              0               11503
> first           1484.359997350257               0               8998
> two             1475.282112147659               0               8945
> unit            1448.9285028181039              0               8791
> one             1328.560536378529               0               8085
> 
> Top labels for Cluster 129377 containing 11303 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             3314.8890487886965              0               16867
> from            3027.14497121796                0               15499
> year            2798.608615776524               0               14400
> state           2765.528720188886               0               14240
> new             2280.5166378575377              0               11867
> after           2207.0322705539875              0               11503
> first           1707.7044410486706              0               8998
> two             1697.2581536169164              0               8945
> unit            1666.932174641639               0               8791
> one             1528.4241032432765              0               8085
> 
> Top labels for Cluster 129381 containing 11411 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             3348.190782570746               0               16867
> from            3057.545994592365               0               15499
> year            2826.7072093421593              0               14400
> state           2793.2941474220715              0               14240
> new             2303.4001871203072              0               11867
> after           2229.176642407663               0               11503
> first           1724.8293614634313              0               8998
> two             1714.2781240069307              0               8945
> unit            1683.6474849330261              0               8791
> one             1543.7481994605623              0               8085
> 
> Top labels for Cluster 129391 containing 7334 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2113.35227333894                0               16867
> from            1930.1305988361128              0               15499
> year            1784.577833758667               0               14400
> state           1763.5072347805835              0               14240
> new             1454.5072316131555              0               11867
> after           1407.6797917694785              0               11503
> first           1089.4127462548204              0               8998
> two             1082.7530186888762              0               8945
> unit            1063.4192575318739              0               8791
> one             975.1101242941804               0               8085
> 
> [2]
> :C-129022: [0:0.001, 000:0.003, 004:0.000, 0040:0.000, 0060:0.000, 01:0.000, 0100:0.000, 0110:0.000,
>        Top Terms:
>                from                                    =>0.022236135215980328
>                u                                       => 0.01589135359475966
>                busi                                    =>0.014789942880805335
>                bank                                    =>0.014395075820558541
>                us                                      => 0.01402954110138604
>                presid                                  => 0.01341952961319183
>                month                                   =>0.012118726267037198
>                about                                   =>0.011986047971260612
>                compani                                 =>0.011201454374207618
>                obama                                   => 0.01105482429336391
> :C-129119: [0:0.001, 00:0.000, 000:0.003, 03:0.000, 04:0.000, 05:0.000, 0656:0.000, 07:0.000, 09:0.00
>        Top Terms:
>                citi                                    => 0.04119064757467011
>                former                                  =>0.030966538725529232
>                home                                    =>0.029642735534519644
>                player                                  => 0.02879703136878369
>                soccer                                  => 0.01847372541986708
>                has                                     =>0.015236681440174855
>                mark                                    =>0.015185164518720528
>                new                                     => 0.01266468154720074
>                polic                                   => 0.01253454821409647
>                world                                   =>0.011803315296178046
> :C-129191: [0:0.013, 00:0.002, 000:0.004, 000000:0.000, 001:0.000, 0011:0.000, 0022:0.000, 003:0.000,
>        Top Terms:
>                4                                       =>0.027636996760550075
>                3                                       =>0.026093296145846434
>                1                                       => 0.02570191540464146
>                5                                       =>0.024807189589701305
>                2                                       =>0.023669513631826157
>                were                                    =>0.021134415210709086
>                sunday                                  =>0.017928504766147838
>                play                                    =>0.017243683740808733
>                through                                 =>0.017133336974828554
>                game                                    =>0.017027790192043733
> :C-129302: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 002:0.000, 008:0.000, 01:0.000, 011:0.000, 0112:
>        Top Terms:
>                new                                     =>0.039501149799390206
>                peopl                                   => 0.01933397797740685
>                world                                   =>0.017478792605253438
>                could                                   =>0.013495142418778704
>                has                                     =>0.012987326502897916
>                more                                    =>0.012585724039194569
>                from                                    =>0.012242682917236177
>                face                                    =>  0.0117046220661272
>                leader                                  =>0.011579584625370691
>                presid                                  =>0.011192085113854965
> :C-129360: [0:0.000, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 005:0.000, 007:0.000, 008:
>        Top Terms:
>                state                                   =>0.044732720259456946
>                unit                                    =>0.032493582810588666
>                year                                    =>0.025651340609304542
>                san                                     =>0.025617706557963606
>                after                                   =>0.022019046306438913
>                francisco                               =>0.020771004252363168
>                california                              => 0.01847124801606253
>                day                                     =>0.015514125170527842
>                wednesday                               =>0.014587851421509652
>                citi                                    =>0.012973538756014369
> :C-129371: [0:0.002, 00:0.000, 000:0.001, 01:0.000, 010:0.000, 0134:0.000, 016:0.000, 02:0.000, 03:0.
>        Top Terms:
>                game                                    => 0.04311022785679375
>                has                                     => 0.03059922226267673
>                all                                     =>0.027605073346921877
>                leagu                                   =>  0.0267627245855276
>                star                                    => 0.02206632764439995
>                final                                   =>0.020017765794918686
>                season                                  => 0.01534931562714024
>                start                                   => 0.01450896856938099
>                week                                    =>0.014407234069110549
>                nation                                  => 0.01429746391305699
> :C-129373: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 01:0.000, 012:0.000, 016:0.000, 03:0.000, 034:0.
>        Top Terms:
>                coach                                   => 0.05209277512761816
>                team                                    =>0.031773971685165554
>                charg                                   =>0.024246280249912454
>                from                                    => 0.02093643936347752
>                has                                     => 0.02057631329905952
>                week                                    =>0.016848920922797363
>                last                                    => 0.01674320150844955
>                program                                 =>0.016023081209070564
>                former                                  =>0.015872337289314063
>                after                                   => 0.01341825692502786
> :C-129377: [0:0.002, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 006:0.000, 0065:0.000, 007
>        Top Terms:
>                been                                    => 0.03757994091979662
>                time                                    => 0.03591307497544333
>                first                                   => 0.03422461795380875
>                has                                     =>0.029800513863644906
>                feder                                   =>0.027382680342986195
>                monday                                  =>0.022174840523045594
>                sinc                                    => 0.02185219249613946
>                year                                    => 0.01933420097135394
>                from                                    => 0.01162537888358458
>                state                                   =>0.009756869426688311
> :C-129381: [0:0.004, 00:0.000, 000:0.002, 00000000235:0.000, 001:0.000, 0011:0.000, 002:0.000, 0051:0
>        Top Terms:
>                win                                     => 0.03267669747239372
>                one                                     =>0.031009191445456212
>                second                                  =>0.028066582472705007
>                three                                   =>0.026147346665631184
>                out                                     =>  0.0226123748207931
>                shot                                    =>0.020446190395276405
>                last                                    =>0.019624841184867056
>                night                                   =>0.019103407305052604
>                over                                    =>0.017376642133669604
>                year                                    =>0.016475201865715022
> :C-129391: [0:0.003, 00:0.000, 000:0.001, 002:0.000, 01:0.000, 0112:0.000, 0123:0.000, 02:0.000, 0213
>        Top Terms:
>                championship                            =>0.035449579372280104
>                run                                     =>0.026446073370591447
>                art                                     => 0.02489330236372834
>                open                                    => 0.02282619503375418
>                place                                   =>0.022410914360311056
>                grand                                   =>  0.0169734705340118
>                reuter                                  =>0.015895311339829302
>                6                                       =>0.015700075983436933
>                continu                                 =>0.015418929721703813
>                slam                                    =>0.012102435338420274
> 
> 
> -Grant
> On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:
> 
> > Can't say just off-hand.
> >
> > What is the data?
> >
> > On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll <gs...@apache.org> wrote:
> >
> >> I'm running ClusterLabels and it seems to be outputting the same values for
> >> every centroid [1].  When I run the cluster dumper, the top terms are fairly
> >> different for those same vectors.
> >>
> >> Have I hit a vagary of LLR or is this a bug?
> >>
> >>
> >> Thanks,
> >> Grant
> >>
> >>
> >> [1]
> >> <snip>
> >> Top labels for Cluster 129062 containing 22710 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               43269.00830466254               0               72060
> >> his             7185.503760070074               0               17203
> >> has             7028.243643655442               0               16855
> >> from            6415.739411605988               0               15488
> >> year            5930.141497239005               0               14391
> >> state           5858.43069797568                0               14228
> >> said            5616.422720833216               0               13676
> >> it              5545.207108973991               0               13513
> >> he              5239.340392438695               0               12810
> >> new             4830.124521905556               0               11862
> >>
> >> Top labels for Cluster 129145 containing 11188 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               19576.26998734614               0               72060
> >> his             3352.5135342599824              0               17203
> >> has             3279.466228939127               0               16855
> >> from            2994.8128935270943              0               15488
> >> year            2768.974903047085               0               14391
> >> state           2735.612128134351               0               14228
> >> said            2622.997358441353               0               13676
> >> it              2589.8515553446487              0               13513
> >> he              2447.4579147226177              0               12810
> >> new             2256.8640938592143              0               11862
> >>
> >> Top labels for Cluster 129201 containing 13040 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               23110.173012922285              0               72060
> >> his             3940.4691014224663              0               17203
> >> has             3854.554399965331               0               16855
> >> from            3519.784154796507               0               15488
> >> year            3254.2127395244315              0               14391
> >> state           3214.9822960514575              0               14228
> >> said            3082.565408431459               0               13676
> >> it              3043.5924300444312              0               13513
> >> he              2876.171367166564               0               12810
> >> new             2652.0934832417406              0               11862
> >>
> >> Top labels for Cluster 129211 containing 14053 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               25083.46391701023               0               72060
> >> his             4266.378291217145               0               17203
> >> has             4173.323467798065               0               16855
> >> from            3810.7467373879626              0               15488
> >> year            3523.1337431534193              0               14391
> >> state           3480.648573280778               0               14228
> >> said            3337.2482196930796              0               13676
> >> it              3295.0432900944725              0               13513
> >> he              3113.741967030335               0               12810
> >> new             2871.0957860480994              0               11862
> >>
> >> Top labels for Cluster 129242 containing 12861 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               22764.503256496973              0               72060
> >> his             3883.2002838114277              0               17203
> >> has             3798.5396822127514              0               16855
> >> from            3468.6536546614952              0               15488
> >> year            3206.954131908249               0               14391
> >> state           3168.2954448102973              0               14228
> >> said            3037.808057511691               0               13676
> >> it              2999.402857856825               0               13513
> >> he              2834.4202939094976              0               12810
> >> new             2613.604658874683               0               11862
> >>
> >> Top labels for Cluster 129245 containing 6443 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               10925.268199045677              0               72060
> >> his             1890.511348863598               0               17203
> >> has             1849.385320336558               0               16855
> >> from            1689.0946326381527              0               15488
> >> year            1561.8904545903206              0               14391
> >> state           1543.096286157146               0               14228
> >> said            1479.652662154287               0               13676
> >> it              1460.9780013803393              0               13513
> >> he              1380.745082413312               0               12810
> >> new             1273.3357145632617              0               11862
> >>
> >> Top labels for Cluster 129255 containing 11390 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               19957.211259535048              0               72060
> >> his             3416.1555761522613              0               17203
> >> has             3341.7163103362545              0               16855
> >> from            3051.6410844950005              0               15488
> >> year            2821.504116652999               0               14391
> >> state           2787.5064550531097              0               14228
> >> said            2672.7490201727487              0               13676
> >> it              2638.972676954698               0               13513
> >> he              2493.870809029322               0               12810
> >> new             2299.653438703157               0               11862
> >>
> >> Top labels for Cluster 129265 containing 9461 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               16362.85457371641               0               72060
> >> his             2813.167819214519               0               17203
> >> has             2751.908798408229               0               16855
> >> from            2513.176188033074               0               15488
> >> year            2323.752471229993               0               14391
> >> state           2295.767774611246               0               14228
> >> said            2201.3039346230216              0               13676
> >> it              2173.4997256915085              0               13513
> >> he              2054.0495802331716              0               12810
> >> new             1894.1558320098557              0               11862
> >>
> >> Top labels for Cluster 129279 containing 14559 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               26080.197364640888              0               72060
> >> his             4430.338072712999               0               17203
> >> has             4333.689091425855               0               16855
> >> from            3957.116204748396               0               15488
> >> year            3658.40981121175                0               14391
> >> state           3614.286633652635               0               14228
> >> said            3465.358771919273               0               13676
> >> it              3421.527382406406               0               13513
> >> he              3233.2411222746596              0               12810
> >> new             2981.251407010015               0               11862
> >>
> >> Top labels for Cluster 129290 containing 13592 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               24181.82589298836               0               72060
> >> his             4117.6785482652485              0               17203
> >> has             4027.8821644652635              0               16855
> >> from            3677.9947950267233              0               15488
> >> year            3400.440033295192               0               14391
> >> state           3359.4400672735646              0               14228
> >> said            3221.0516651300713              0               13676
> >> it              3180.321518546436               0               13513
> >> he              3005.353873868007               0               12810
> >> new             2771.180380204227               0               11862
> >> </snip>
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Cluster Labels

Posted by Grant Ingersoll <gs...@apache.org>.
https://issues.apache.org/jira/browse/MAHOUT-433

On Mon, Jul 5, 2010 at 2:28 PM, Grant Ingersoll <gs...@apache.org> wrote:

> OK, it seems the problem is that ClusterLabels was never updated when we
> switched over to WeightedVectorWritable. It also seems that somewhere in
> the KMeans run we lost the NamedVector again: the clusteredPoints
> directory does not contain NamedVectors, even though that is what I
> created the original points as.
>
>
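A note on why a zero In-ClusterDF yields the same label list for every cluster: the log-likelihood ratio for a 2x2 contingency table is two-sided, so a term that is frequent outside a cluster but counted as absent inside it still scores high. With the in-cluster count stuck at 0, the score grows with the out-cluster document frequency alone, so each cluster gets the same ranking. The Python sketch below illustrates this with the textbook G^2 statistic. It is not Mahout's code: the table layout (cluster membership by term presence) is an assumption about how ClusterLabels builds its counts, the corpus size of 130000 is a made-up value, and the names llr and rank_terms are hypothetical.

```python
import math

# Out-cluster document frequencies copied from the ClusterLabels output above.
OUT_DF = {"a": 72060, "his": 17203, "has": 16855, "from": 15488}


def llr(k11, k12, k21, k22):
    """Textbook G^2 log-likelihood ratio for a 2x2 table of counts.

    Assumed layout: rows are in-cluster / out-of-cluster documents,
    columns are documents with / without the term.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g = 0.0
    for k, i, j in cells:
        if k > 0:  # an empty cell contributes nothing (0 * log 0 := 0)
            g += k * math.log(k * total / (rows[i] * cols[j]))
    return 2.0 * g


def rank_terms(cluster_size, corpus_size, in_df, out_df):
    """Return terms sorted by descending LLR for one cluster."""
    scores = {}
    for term, d_out in out_df.items():
        d_in = in_df.get(term, 0)  # 0 reproduces the symptom in this thread
        scores[term] = llr(
            d_in,
            cluster_size - d_in,
            d_out,
            corpus_size - cluster_size - d_out,
        )
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Two cluster sizes from the output above, with every in-cluster DF at 0.
    for size in (22710, 11188):
        print(size, rank_terms(size, 130000, {}, OUT_DF))
```

Both clusters print the terms in the same order, which is simply descending out-cluster DF. That matches the symptom reported in this thread: the LLR values differ from cluster to cluster because the cluster sizes differ, but the ordering cannot. Once real in-cluster counts are passed in, terms concentrated in a single cluster would be expected to move up.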
> On Mon, Jul 5, 2010 at 1:55 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
>> Hmmm, different field, more or less the same result, i.e. all labels are
>> the same for each vector [1].  I also included the Cluster dump [2].  I'm
>> suspecting a bug.
>>
>> [1]
>> Top labels for Cluster 129022 containing 19186 vectors
>> Term             LLR             In-ClusterDF            Out-ClusterDF
>> has             5836.5347257247195              0               16867
>> from            5328.54616727354                0               15499
>> year            4925.276801970322               0               14400
>> state           4866.91887763422                0               14240
>> new             4011.6858639516868              0               11867
>> after           3882.1740732807666              0               11503
>> first           3002.5827110484242              0               8998
>> two             2984.1892275922                 0               8945
>> unit            2930.794111499563               0               8791
>> one             2686.95768492762                0               8085
>>
>> Top labels for Cluster 129119 containing 16043 vectors
>> Term             LLR             In-ClusterDF            Out-ClusterDF
>> has             4808.386086146813               0               16867
>> from            4390.346637147013               0               15499
>> year            4058.4180186586455              0               14400
>> state           4010.379176544491               0               14240
>> new             3306.234930681996               0               11867
>> after           3199.5810555517673              0               11503
>> first           2475.079962851014               0               8998
>> two             2459.926843432244               0               8945
>> unit            2415.9376569474116              0               8791
>> one             2215.042654468678               0               8085
>>
>> Top labels for Cluster 129191 containing 7770 vectors
>> Term             LLR             In-ClusterDF            Out-ClusterDF
>> has             2243.2657141932286              0               16867
>> from            2048.755412856117               0               15499
>> year            1894.2384706358425              0               14400
>> state           1871.8704557279125              0               14240
>> new             1543.8513879175298              0               11867
>> after           1494.1429192917421              0               11503
>> first           1156.303048826754               0               8998
>> two             1149.2339147529565              0               8945
>> unit            1128.711646862328               0               8791
>> one             1034.9745452422649              0               8085
>>
>> Top labels for Cluster 129302 containing 9426 vectors
>> Term             LLR             In-ClusterDF            Out-ClusterDF
>> has             2741.316972494591               0               16867
>> from            2503.501101480797               0               15499
>> year            2314.5996575923637              0               14400
>> state           2287.255346294027               0               14240
>> new             1886.2961270781234              0               11867
>> after           1825.5399498036131              0               11503
>> first           1412.654560342431               0               8998
>> two             1404.0158626483753              0               8945
>> unit            1378.9371921028942              0               8791
>> one             1264.391515379306               0               8085
>>
>> Top labels for Cluster 129360 containing 13092 vectors
>> Term             LLR             In-ClusterDF            Out-ClusterDF
>> has             3870.8181769265793              0               16867
>> from            3534.623348234687               0               15499
>> year            3267.633215776179               0               14400
>> state           3228.989259615075               0               14240
>> new             2662.4551618834957              0               11867
>> after           2576.628638952039               0               11503
>> first           1993.499155438505               0               8998
>> two             1981.3008509986103              0               8945
>> unit            1945.8889682726003              0               8791
>> one             1784.1570986662991              0               8085
>>
>> Top labels for Cluster 129371 containing 23944 vectors
>> Term             LLR             In-ClusterDF            Out-ClusterDF
>> has             7455.31941217836                0               16867
>> from            6805.274207816925               0               15499
>> year            6289.398677708115               0               14400
>> state           6214.757351316046               0               14240
>> new             5121.23683049297                0               11867
>> after           4955.695805796888               0               11503
>> first           3831.788851835765               0               8998
>> two             3808.2933898111805              0               8945
>> unit            3740.0891623105854              0               8791
>> one             3428.6551325367764              0               8085
>>
>> Top labels for Cluster 129373 containing 9885 vectors
>> Term             LLR             In-ClusterDF            Out-ClusterDF
>> has             2880.6778563517146              0               16867
>> from            2630.736483251676               0               15499
>> year            2432.208566541318               0               14400
>> state           2403.4711471684277              0               14240
>> new             1982.0948037123308              0               11867
>> after           1918.2465800205246              0               11503
>> first           1484.359997350257               0               8998
>> two             1475.282112147659               0               8945
>> unit            1448.9285028181039              0               8791
>> one             1328.560536378529               0               8085
>>
>> Top labels for Cluster 129377 containing 11303 vectors
>> Term             LLR             In-ClusterDF            Out-ClusterDF
>> has             3314.8890487886965              0               16867
>> from            3027.14497121796                0               15499
>> year            2798.608615776524               0               14400
>> state           2765.528720188886               0               14240
>> new             2280.5166378575377              0               11867
>> after           2207.0322705539875              0               11503
>> first           1707.7044410486706              0               8998
>> two             1697.2581536169164              0               8945
>> unit            1666.932174641639               0               8791
>> one             1528.4241032432765              0               8085
>>
>> Top labels for Cluster 129381 containing 11411 vectors
>> Term             LLR             In-ClusterDF            Out-ClusterDF
>> has             3348.190782570746               0               16867
>> from            3057.545994592365               0               15499
>> year            2826.7072093421593              0               14400
>> state           2793.2941474220715              0               14240
>> new             2303.4001871203072              0               11867
>> after           2229.176642407663               0               11503
>> first           1724.8293614634313              0               8998
>> two             1714.2781240069307              0               8945
>> unit            1683.6474849330261              0               8791
>> one             1543.7481994605623              0               8085
>>
>> Top labels for Cluster 129391 containing 7334 vectors
>> Term             LLR             In-ClusterDF            Out-ClusterDF
>> has             2113.35227333894                0               16867
>> from            1930.1305988361128              0               15499
>> year            1784.577833758667               0               14400
>> state           1763.5072347805835              0               14240
>> new             1454.5072316131555              0               11867
>> after           1407.6797917694785              0               11503
>> first           1089.4127462548204              0               8998
>> two             1082.7530186888762              0               8945
>> unit            1063.4192575318739              0               8791
>> one             975.1101242941804               0               8085
>>
>> [2]
>> :C-129022: [0:0.001, 000:0.003, 004:0.000, 0040:0.000, 0060:0.000, 01:0.000, 0100:0.000, 0110:0.000,
>>        Top Terms:
>>                from                                    =>0.022236135215980328
>>                u                                       => 0.01589135359475966
>>                busi                                    =>0.014789942880805335
>>                bank                                    =>0.014395075820558541
>>                us                                      => 0.01402954110138604
>>                presid                                  => 0.01341952961319183
>>                month                                   =>0.012118726267037198
>>                about                                   =>0.011986047971260612
>>                compani                                 =>0.011201454374207618
>>                obama                                   => 0.01105482429336391
>> :C-129119: [0:0.001, 00:0.000, 000:0.003, 03:0.000, 04:0.000, 05:0.000, 0656:0.000, 07:0.000, 09:0.00
>>        Top Terms:
>>                citi                                    => 0.04119064757467011
>>                former                                  =>0.030966538725529232
>>                home                                    =>0.029642735534519644
>>                player                                  => 0.02879703136878369
>>                soccer                                  => 0.01847372541986708
>>                has                                     =>0.015236681440174855
>>                mark                                    =>0.015185164518720528
>>                new                                     => 0.01266468154720074
>>                polic                                   => 0.01253454821409647
>>                world                                   =>0.011803315296178046
>> :C-129191: [0:0.013, 00:0.002, 000:0.004, 000000:0.000, 001:0.000, 0011:0.000, 0022:0.000, 003:0.000,
>>        Top Terms:
>>                4                                       =>0.027636996760550075
>>                3                                       =>0.026093296145846434
>>                1                                       => 0.02570191540464146
>>                5                                       =>0.024807189589701305
>>                2                                       =>0.023669513631826157
>>                were                                    =>0.021134415210709086
>>                sunday                                  =>0.017928504766147838
>>                play                                    =>0.017243683740808733
>>                through                                 =>0.017133336974828554
>>                game                                    =>0.017027790192043733
>> :C-129302: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 002:0.000, 008:0.000, 01:0.000, 011:0.000, 0112:
>>        Top Terms:
>>                new                                     =>0.039501149799390206
>>                peopl                                   => 0.01933397797740685
>>                world                                   =>0.017478792605253438
>>                could                                   =>0.013495142418778704
>>                has                                     =>0.012987326502897916
>>                more                                    =>0.012585724039194569
>>                from                                    =>0.012242682917236177
>>                face                                    =>  0.0117046220661272
>>                leader                                  =>0.011579584625370691
>>                presid                                  =>0.011192085113854965
>> :C-129360: [0:0.000, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 005:0.000, 007:0.000, 008:
>>        Top Terms:
>>                state                                   =>0.044732720259456946
>>                unit                                    =>0.032493582810588666
>>                year                                    =>0.025651340609304542
>>                san                                     =>0.025617706557963606
>>                after                                   =>0.022019046306438913
>>                francisco                               =>0.020771004252363168
>>                california                              => 0.01847124801606253
>>                day                                     =>0.015514125170527842
>>                wednesday                               =>0.014587851421509652
>>                citi                                    =>0.012973538756014369
>> :C-129371: [0:0.002, 00:0.000, 000:0.001, 01:0.000, 010:0.000, 0134:0.000, 016:0.000, 02:0.000, 03:0.
>>        Top Terms:
>>                game                                    => 0.04311022785679375
>>                has                                     => 0.03059922226267673
>>                all                                     =>0.027605073346921877
>>                leagu                                   =>  0.0267627245855276
>>                star                                    => 0.02206632764439995
>>                final                                   =>0.020017765794918686
>>                season                                  => 0.01534931562714024
>>                start                                   => 0.01450896856938099
>>                week                                    =>0.014407234069110549
>>                nation                                  => 0.01429746391305699
>> :C-129373: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 01:0.000, 012:0.000, 016:0.000, 03:0.000, 034:0.
>>        Top Terms:
>>                coach                                   => 0.05209277512761816
>>                team                                    =>0.031773971685165554
>>                charg                                   =>0.024246280249912454
>>                from                                    => 0.02093643936347752
>>                has                                     => 0.02057631329905952
>>                week                                    =>0.016848920922797363
>>                last                                    => 0.01674320150844955
>>                program                                 =>0.016023081209070564
>>                former                                  =>0.015872337289314063
>>                after                                   => 0.01341825692502786
>> :C-129377: [0:0.002, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 006:0.000, 0065:0.000, 007
>>        Top Terms:
>>                been                                    => 0.03757994091979662
>>                time                                    => 0.03591307497544333
>>                first                                   => 0.03422461795380875
>>                has                                     =>0.029800513863644906
>>                feder                                   =>0.027382680342986195
>>                monday                                  =>0.022174840523045594
>>                sinc                                    => 0.02185219249613946
>>                year                                    => 0.01933420097135394
>>                from                                    => 0.01162537888358458
>>                state                                   =>0.009756869426688311
>> :C-129381: [0:0.004, 00:0.000, 000:0.002, 00000000235:0.000, 001:0.000, 0011:0.000, 002:0.000, 0051:0
>>        Top Terms:
>>                win                                     => 0.03267669747239372
>>                one                                     =>0.031009191445456212
>>                second                                  =>0.028066582472705007
>>                three                                   =>0.026147346665631184
>>                out                                     =>  0.0226123748207931
>>                shot                                    =>0.020446190395276405
>>                last                                    =>0.019624841184867056
>>                night                                   =>0.019103407305052604
>>                over                                    =>0.017376642133669604
>>                year                                    =>0.016475201865715022
>> :C-129391: [0:0.003, 00:0.000, 000:0.001, 002:0.000, 01:0.000, 0112:0.000, 0123:0.000, 02:0.000, 0213
>>        Top Terms:
>>                championship                            =>0.035449579372280104
>>                run                                     =>0.026446073370591447
>>                art                                     => 0.02489330236372834
>>                open                                    => 0.02282619503375418
>>                place                                   =>0.022410914360311056
>>                grand                                   =>  0.0169734705340118
>>                reuter                                  =>0.015895311339829302
>>                6                                       =>0.015700075983436933
>>                continu                                 =>0.015418929721703813
>>                slam                                    =>0.012102435338420274
>>
>>
>> -Grant
>> On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:
>>
>> > Can't say just off-hand.
>> >
>> > What is the data?
>> >
>> > On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> >
>> >> I'm running ClusterLabels and it seems to be outputting the same values for
>> >> every centroid [1].  When I run the cluster dumper, the top terms are fairly
>> >> different for those same vectors.
>> >>
>> >> Have I hit a vagary of LLR or is this a bug?
>> >>
>> >>
>> >> Thanks,
>> >> Grant
>> >>
>> >>
>> >> [1]
>> >> <snip>
>> >> Top labels for Cluster 129062 containing 22710 vectors
>> >> Term             LLR             In-ClusterDF            Out-ClusterDF
>> >> a               43269.00830466254               0               72060
>> >> his             7185.503760070074               0               17203
>> >> has             7028.243643655442               0               16855
>> >> from            6415.739411605988               0               15488
>> >> year            5930.141497239005               0               14391
>> >> state           5858.43069797568                0               14228
>> >> said            5616.422720833216               0               13676
>> >> it              5545.207108973991               0               13513
>> >> he              5239.340392438695               0               12810
>> >> new             4830.124521905556               0               11862
>> >>
>> >> Top labels for Cluster 129145 containing 11188 vectors
>> >> Term             LLR             In-ClusterDF            Out-ClusterDF
>> >> a               19576.26998734614               0               72060
>> >> his             3352.5135342599824              0               17203
>> >> has             3279.466228939127               0               16855
>> >> from            2994.8128935270943              0               15488
>> >> year            2768.974903047085               0               14391
>> >> state           2735.612128134351               0               14228
>> >> said            2622.997358441353               0               13676
>> >> it              2589.8515553446487              0               13513
>> >> he              2447.4579147226177              0               12810
>> >> new             2256.8640938592143              0               11862
>> >>
>> >> Top labels for Cluster 129201 containing 13040 vectors
>> >> Term             LLR             In-ClusterDF            Out-ClusterDF
>> >> a               23110.173012922285              0               72060
>> >> his             3940.4691014224663              0               17203
>> >> has             3854.554399965331               0               16855
>> >> from            3519.784154796507               0               15488
>> >> year            3254.2127395244315              0               14391
>> >> state           3214.9822960514575              0               14228
>> >> said            3082.565408431459               0               13676
>> >> it              3043.5924300444312              0               13513
>> >> he              2876.171367166564               0               12810
>> >> new             2652.0934832417406              0               11862
>> >>
>> >> Top labels for Cluster 129211 containing 14053 vectors
>> >> Term             LLR             In-ClusterDF            Out-ClusterDF
>> >> a               25083.46391701023               0               72060
>> >> his             4266.378291217145               0               17203
>> >> has             4173.323467798065               0               16855
>> >> from            3810.7467373879626              0               15488
>> >> year            3523.1337431534193              0               14391
>> >> state           3480.648573280778               0               14228
>> >> said            3337.2482196930796              0               13676
>> >> it              3295.0432900944725              0               13513
>> >> he              3113.741967030335               0               12810
>> >> new             2871.0957860480994              0               11862
>> >>
>> >> Top labels for Cluster 129242 containing 12861 vectors
>> >> Term             LLR             In-ClusterDF            Out-ClusterDF
>> >> a               22764.503256496973              0               72060
>> >> his             3883.2002838114277              0               17203
>> >> has             3798.5396822127514              0               16855
>> >> from            3468.6536546614952              0               15488
>> >> year            3206.954131908249               0               14391
>> >> state           3168.2954448102973              0               14228
>> >> said            3037.808057511691               0               13676
>> >> it              2999.402857856825               0               13513
>> >> he              2834.4202939094976              0               12810
>> >> new             2613.604658874683               0               11862
>> >>
>> >> Top labels for Cluster 129245 containing 6443 vectors
>> >> Term             LLR             In-ClusterDF            Out-ClusterDF
>> >> a               10925.268199045677              0               72060
>> >> his             1890.511348863598               0               17203
>> >> has             1849.385320336558               0               16855
>> >> from            1689.0946326381527              0               15488
>> >> year            1561.8904545903206              0               14391
>> >> state           1543.096286157146               0               14228
>> >> said            1479.652662154287               0               13676
>> >> it              1460.9780013803393              0               13513
>> >> he              1380.745082413312               0               12810
>> >> new             1273.3357145632617              0               11862
>> >>
>> >> Top labels for Cluster 129255 containing 11390 vectors
>> >> Term             LLR             In-ClusterDF            Out-ClusterDF
>> >> a               19957.211259535048              0               72060
>> >> his             3416.1555761522613              0               17203
>> >> has             3341.7163103362545              0               16855
>> >> from            3051.6410844950005              0               15488
>> >> year            2821.504116652999               0               14391
>> >> state           2787.5064550531097              0               14228
>> >> said            2672.7490201727487              0               13676
>> >> it              2638.972676954698               0               13513
>> >> he              2493.870809029322               0               12810
>> >> new             2299.653438703157               0               11862
>> >>
>> >> Top labels for Cluster 129265 containing 9461 vectors
>> >> Term             LLR             In-ClusterDF            Out-ClusterDF
>> >> a               16362.85457371641               0               72060
>> >> his             2813.167819214519               0               17203
>> >> has             2751.908798408229               0               16855
>> >> from            2513.176188033074               0               15488
>> >> year            2323.752471229993               0               14391
>> >> state           2295.767774611246               0               14228
>> >> said            2201.3039346230216              0               13676
>> >> it              2173.4997256915085              0               13513
>> >> he              2054.0495802331716              0               12810
>> >> new             1894.1558320098557              0               11862
>> >>
>> >> Top labels for Cluster 129279 containing 14559 vectors
>> >> Term             LLR             In-ClusterDF            Out-ClusterDF
>> >> a               26080.197364640888              0               72060
>> >> his             4430.338072712999               0               17203
>> >> has             4333.689091425855               0               16855
>> >> from            3957.116204748396               0               15488
>> >> year            3658.40981121175                0               14391
>> >> state           3614.286633652635               0               14228
>> >> said            3465.358771919273               0               13676
>> >> it              3421.527382406406               0               13513
>> >> he              3233.2411222746596              0               12810
>> >> new             2981.251407010015               0               11862
>> >>
>> >> Top labels for Cluster 129290 containing 13592 vectors
>> >> Term             LLR             In-ClusterDF            Out-ClusterDF
>> >> a               24181.82589298836               0               72060
>> >> his             4117.6785482652485              0               17203
>> >> has             4027.8821644652635              0               16855
>> >> from            3677.9947950267233              0               15488
>> >> year            3400.440033295192               0               14391
>> >> state           3359.4400672735646              0               14228
>> >> said            3221.0516651300713              0               13676
>> >> it              3180.321518546436               0               13513
>> >> he              3005.353873868007               0               12810
>> >> new             2771.180380204227               0               11862
>> >> </snip>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>

Re: Cluster Labels

Posted by Grant Ingersoll <gs...@apache.org>.
OK, it seems there are two problems. First, ClusterLabels was never updated
when we switched over to WeightedVectorWritable. Second, somewhere in the
KMeans run we lose the NamedVector again: the clusteredPoints directory does
not contain NamedVectors, even though that is what I created the original
points as.
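
To illustrate why an In-ClusterDF of 0 for every term would give every cluster the same label list, here is a minimal Python sketch of a log-likelihood ratio score over a 2x2 document-frequency table. The table layout used in score() and the corpus size of 129394 (the sum of the ten cluster sizes in the output quoted below) are assumptions made for illustration; they are not taken from the ClusterLabels source. With the in-cluster DF stuck at 0, the score grows with the out-cluster DF whatever the cluster size is, so the ranking collapses to plain corpus-wide document frequency.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2 statistic) for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def score(in_df, out_df, cluster_size, corpus_size):
    """Assumed layout: rows = inside/outside the cluster,
    columns = documents with/without the term."""
    return llr(in_df, cluster_size - in_df,
               out_df, corpus_size - cluster_size - out_df)

# Out-ClusterDF values copied from the quoted output; the corpus size is
# an assumption (sum of the ten cluster sizes shown there).
OUT_DF = {"has": 16867, "from": 15499, "year": 14400,
          "state": 14240, "new": 11867}
CORPUS_SIZE = 129394

def top_labels(cluster_size):
    """Rank terms by LLR when every in-cluster DF is (wrongly) 0."""
    scored = {term: score(0, df, cluster_size, CORPUS_SIZE)
              for term, df in OUT_DF.items()}
    return sorted(scored, key=scored.get, reverse=True)

if __name__ == "__main__":
    # Two cluster sizes taken from the quoted output.
    for size in (19186, 7334):
        print(size, top_labels(size))
```

Both cluster sizes produce the same term order; only the magnitudes of the scores change, which is the pattern in the output quoted below.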

On Mon, Jul 5, 2010 at 1:55 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Hmmm, different field, more or less the same result, i.e. all labels are
> the same for each vector [1].  I also included the Cluster dump [2].  I'm
> suspecting a bug.
>
> [1]
> Top labels for Cluster 129022 containing 19186 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             5836.5347257247195              0               16867
> from            5328.54616727354                0               15499
> year            4925.276801970322               0               14400
> state           4866.91887763422                0               14240
> new             4011.6858639516868              0               11867
> after           3882.1740732807666              0               11503
> first           3002.5827110484242              0               8998
> two             2984.1892275922                 0               8945
> unit            2930.794111499563               0               8791
> one             2686.95768492762                0               8085
>
> Top labels for Cluster 129119 containing 16043 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             4808.386086146813               0               16867
> from            4390.346637147013               0               15499
> year            4058.4180186586455              0               14400
> state           4010.379176544491               0               14240
> new             3306.234930681996               0               11867
> after           3199.5810555517673              0               11503
> first           2475.079962851014               0               8998
> two             2459.926843432244               0               8945
> unit            2415.9376569474116              0               8791
> one             2215.042654468678               0               8085
>
> Top labels for Cluster 129191 containing 7770 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2243.2657141932286              0               16867
> from            2048.755412856117               0               15499
> year            1894.2384706358425              0               14400
> state           1871.8704557279125              0               14240
> new             1543.8513879175298              0               11867
> after           1494.1429192917421              0               11503
> first           1156.303048826754               0               8998
> two             1149.2339147529565              0               8945
> unit            1128.711646862328               0               8791
> one             1034.9745452422649              0               8085
>
> Top labels for Cluster 129302 containing 9426 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2741.316972494591               0               16867
> from            2503.501101480797               0               15499
> year            2314.5996575923637              0               14400
> state           2287.255346294027               0               14240
> new             1886.2961270781234              0               11867
> after           1825.5399498036131              0               11503
> first           1412.654560342431               0               8998
> two             1404.0158626483753              0               8945
> unit            1378.9371921028942              0               8791
> one             1264.391515379306               0               8085
>
> Top labels for Cluster 129360 containing 13092 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             3870.8181769265793              0               16867
> from            3534.623348234687               0               15499
> year            3267.633215776179               0               14400
> state           3228.989259615075               0               14240
> new             2662.4551618834957              0               11867
> after           2576.628638952039               0               11503
> first           1993.499155438505               0               8998
> two             1981.3008509986103              0               8945
> unit            1945.8889682726003              0               8791
> one             1784.1570986662991              0               8085
>
> Top labels for Cluster 129371 containing 23944 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             7455.31941217836                0               16867
> from            6805.274207816925               0               15499
> year            6289.398677708115               0               14400
> state           6214.757351316046               0               14240
> new             5121.23683049297                0               11867
> after           4955.695805796888               0               11503
> first           3831.788851835765               0               8998
> two             3808.2933898111805              0               8945
> unit            3740.0891623105854              0               8791
> one             3428.6551325367764              0               8085
>
> Top labels for Cluster 129373 containing 9885 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2880.6778563517146              0               16867
> from            2630.736483251676               0               15499
> year            2432.208566541318               0               14400
> state           2403.4711471684277              0               14240
> new             1982.0948037123308              0               11867
> after           1918.2465800205246              0               11503
> first           1484.359997350257               0               8998
> two             1475.282112147659               0               8945
> unit            1448.9285028181039              0               8791
> one             1328.560536378529               0               8085
>
> Top labels for Cluster 129377 containing 11303 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             3314.8890487886965              0               16867
> from            3027.14497121796                0               15499
> year            2798.608615776524               0               14400
> state           2765.528720188886               0               14240
> new             2280.5166378575377              0               11867
> after           2207.0322705539875              0               11503
> first           1707.7044410486706              0               8998
> two             1697.2581536169164              0               8945
> unit            1666.932174641639               0               8791
> one             1528.4241032432765              0               8085
>
> Top labels for Cluster 129381 containing 11411 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             3348.190782570746               0               16867
> from            3057.545994592365               0               15499
> year            2826.7072093421593              0               14400
> state           2793.2941474220715              0               14240
> new             2303.4001871203072              0               11867
> after           2229.176642407663               0               11503
> first           1724.8293614634313              0               8998
> two             1714.2781240069307              0               8945
> unit            1683.6474849330261              0               8791
> one             1543.7481994605623              0               8085
>
> Top labels for Cluster 129391 containing 7334 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2113.35227333894                0               16867
> from            1930.1305988361128              0               15499
> year            1784.577833758667               0               14400
> state           1763.5072347805835              0               14240
> new             1454.5072316131555              0               11867
> after           1407.6797917694785              0               11503
> first           1089.4127462548204              0               8998
> two             1082.7530186888762              0               8945
> unit            1063.4192575318739              0               8791
> one             975.1101242941804               0               8085
>
> [2]
> :C-129022: [0:0.001, 000:0.003, 004:0.000, 0040:0.000, 0060:0.000, 01:0.000, 0100:0.000, 0110:0.000,
>        Top Terms:
>                from                                    =>0.022236135215980328
>                u                                       =>0.01589135359475966
>                busi                                    =>0.014789942880805335
>                bank                                    =>0.014395075820558541
>                us                                      =>0.01402954110138604
>                presid                                  =>0.01341952961319183
>                month                                   =>0.012118726267037198
>                about                                   =>0.011986047971260612
>                compani                                 =>0.011201454374207618
>                obama                                   =>0.01105482429336391
> :C-129119: [0:0.001, 00:0.000, 000:0.003, 03:0.000, 04:0.000, 05:0.000, 0656:0.000, 07:0.000, 09:0.00
>        Top Terms:
>                citi                                    =>0.04119064757467011
>                former                                  =>0.030966538725529232
>                home                                    =>0.029642735534519644
>                player                                  =>0.02879703136878369
>                soccer                                  =>0.01847372541986708
>                has                                     =>0.015236681440174855
>                mark                                    =>0.015185164518720528
>                new                                     =>0.01266468154720074
>                polic                                   =>0.01253454821409647
>                world                                   =>0.011803315296178046
> :C-129191: [0:0.013, 00:0.002, 000:0.004, 000000:0.000, 001:0.000, 0011:0.000, 0022:0.000, 003:0.000,
>        Top Terms:
>                4                                       =>0.027636996760550075
>                3                                       =>0.026093296145846434
>                1                                       =>0.02570191540464146
>                5                                       =>0.024807189589701305
>                2                                       =>0.023669513631826157
>                were                                    =>0.021134415210709086
>                sunday                                  =>0.017928504766147838
>                play                                    =>0.017243683740808733
>                through                                 =>0.017133336974828554
>                game                                    =>0.017027790192043733
> :C-129302: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 002:0.000, 008:0.000, 01:0.000, 011:0.000, 0112:
>        Top Terms:
>                new                                     =>0.039501149799390206
>                peopl                                   =>0.01933397797740685
>                world                                   =>0.017478792605253438
>                could                                   =>0.013495142418778704
>                has                                     =>0.012987326502897916
>                more                                    =>0.012585724039194569
>                from                                    =>0.012242682917236177
>                face                                    =>0.0117046220661272
>                leader                                  =>0.011579584625370691
>                presid                                  =>0.011192085113854965
> :C-129360: [0:0.000, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 005:0.000, 007:0.000, 008:
>        Top Terms:
>                state                                   =>0.044732720259456946
>                unit                                    =>0.032493582810588666
>                year                                    =>0.025651340609304542
>                san                                     =>0.025617706557963606
>                after                                   =>0.022019046306438913
>                francisco                               =>0.020771004252363168
>                california                              =>0.01847124801606253
>                day                                     =>0.015514125170527842
>                wednesday                               =>0.014587851421509652
>                citi                                    =>0.012973538756014369
> :C-129371: [0:0.002, 00:0.000, 000:0.001, 01:0.000, 010:0.000, 0134:0.000, 016:0.000, 02:0.000, 03:0.
>        Top Terms:
>                game                                    =>0.04311022785679375
>                has                                     =>0.03059922226267673
>                all                                     =>0.027605073346921877
>                leagu                                   =>0.0267627245855276
>                star                                    =>0.02206632764439995
>                final                                   =>0.020017765794918686
>                season                                  =>0.01534931562714024
>                start                                   =>0.01450896856938099
>                week                                    =>0.014407234069110549
>                nation                                  =>0.01429746391305699
> :C-129373: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 01:0.000, 012:0.000, 016:0.000, 03:0.000, 034:0.
>        Top Terms:
>                coach                                   =>0.05209277512761816
>                team                                    =>0.031773971685165554
>                charg                                   =>0.024246280249912454
>                from                                    =>0.02093643936347752
>                has                                     =>0.02057631329905952
>                week                                    =>0.016848920922797363
>                last                                    =>0.01674320150844955
>                program                                 =>0.016023081209070564
>                former                                  =>0.015872337289314063
>                after                                   =>0.01341825692502786
> :C-129377: [0:0.002, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 006:0.000, 0065:0.000, 007
>        Top Terms:
>                been                                    =>0.03757994091979662
>                time                                    =>0.03591307497544333
>                first                                   =>0.03422461795380875
>                has                                     =>0.029800513863644906
>                feder                                   =>0.027382680342986195
>                monday                                  =>0.022174840523045594
>                sinc                                    =>0.02185219249613946
>                year                                    =>0.01933420097135394
>                from                                    =>0.01162537888358458
>                state                                   =>0.009756869426688311
> :C-129381: [0:0.004, 00:0.000, 000:0.002, 00000000235:0.000, 001:0.000, 0011:0.000, 002:0.000, 0051:0
>        Top Terms:
>                win                                     =>0.03267669747239372
>                one                                     =>0.031009191445456212
>                second                                  =>0.028066582472705007
>                three                                   =>0.026147346665631184
>                out                                     =>0.0226123748207931
>                shot                                    =>0.020446190395276405
>                last                                    =>0.019624841184867056
>                night                                   =>0.019103407305052604
>                over                                    =>0.017376642133669604
>                year                                    =>0.016475201865715022
> :C-129391: [0:0.003, 00:0.000, 000:0.001, 002:0.000, 01:0.000, 0112:0.000, 0123:0.000, 02:0.000, 0213
>        Top Terms:
>                championship                            =>0.035449579372280104
>                run                                     =>0.026446073370591447
>                art                                     =>0.02489330236372834
>                open                                    =>0.02282619503375418
>                place                                   =>0.022410914360311056
>                grand                                   =>0.0169734705340118
>                reuter                                  =>0.015895311339829302
>                6                                       =>0.015700075983436933
>                continu                                 =>0.015418929721703813
>                slam                                    =>0.012102435338420274
>
>
> -Grant
> On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:
>
> > Can't say just off-hand.
> >
> > What is the data?
> >
> > On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll <gs...@apache.org> wrote:
> >
> >> I'm running ClusterLabels and it seems to be outputting the same values for
> >> every centroid [1].  When I run the cluster dumper, the top terms are fairly
> >> different for those same vectors.
> >>
> >> Have I hit a vagary of LLR or is this a bug?
> >>
> >>
> >> Thanks,
> >> Grant
> >>
> >>
> >> [1]
> >> <snip>
> >> Top labels for Cluster 129062 containing 22710 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               43269.00830466254               0               72060
> >> his             7185.503760070074               0               17203
> >> has             7028.243643655442               0               16855
> >> from            6415.739411605988               0               15488
> >> year            5930.141497239005               0               14391
> >> state           5858.43069797568                0               14228
> >> said            5616.422720833216               0               13676
> >> it              5545.207108973991               0               13513
> >> he              5239.340392438695               0               12810
> >> new             4830.124521905556               0               11862
> >>
> >> Top labels for Cluster 129145 containing 11188 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               19576.26998734614               0               72060
> >> his             3352.5135342599824              0               17203
> >> has             3279.466228939127               0               16855
> >> from            2994.8128935270943              0               15488
> >> year            2768.974903047085               0               14391
> >> state           2735.612128134351               0               14228
> >> said            2622.997358441353               0               13676
> >> it              2589.8515553446487              0               13513
> >> he              2447.4579147226177              0               12810
> >> new             2256.8640938592143              0               11862
> >>
> >> Top labels for Cluster 129201 containing 13040 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               23110.173012922285              0               72060
> >> his             3940.4691014224663              0               17203
> >> has             3854.554399965331               0               16855
> >> from            3519.784154796507               0               15488
> >> year            3254.2127395244315              0               14391
> >> state           3214.9822960514575              0               14228
> >> said            3082.565408431459               0               13676
> >> it              3043.5924300444312              0               13513
> >> he              2876.171367166564               0               12810
> >> new             2652.0934832417406              0               11862
> >>
> >> Top labels for Cluster 129211 containing 14053 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               25083.46391701023               0               72060
> >> his             4266.378291217145               0               17203
> >> has             4173.323467798065               0               16855
> >> from            3810.7467373879626              0               15488
> >> year            3523.1337431534193              0               14391
> >> state           3480.648573280778               0               14228
> >> said            3337.2482196930796              0               13676
> >> it              3295.0432900944725              0               13513
> >> he              3113.741967030335               0               12810
> >> new             2871.0957860480994              0               11862
> >>
> >> Top labels for Cluster 129242 containing 12861 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               22764.503256496973              0               72060
> >> his             3883.2002838114277              0               17203
> >> has             3798.5396822127514              0               16855
> >> from            3468.6536546614952              0               15488
> >> year            3206.954131908249               0               14391
> >> state           3168.2954448102973              0               14228
> >> said            3037.808057511691               0               13676
> >> it              2999.402857856825               0               13513
> >> he              2834.4202939094976              0               12810
> >> new             2613.604658874683               0               11862
> >>
> >> Top labels for Cluster 129245 containing 6443 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               10925.268199045677              0               72060
> >> his             1890.511348863598               0               17203
> >> has             1849.385320336558               0               16855
> >> from            1689.0946326381527              0               15488
> >> year            1561.8904545903206              0               14391
> >> state           1543.096286157146               0               14228
> >> said            1479.652662154287               0               13676
> >> it              1460.9780013803393              0               13513
> >> he              1380.745082413312               0               12810
> >> new             1273.3357145632617              0               11862
> >>
> >> Top labels for Cluster 129255 containing 11390 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               19957.211259535048              0               72060
> >> his             3416.1555761522613              0               17203
> >> has             3341.7163103362545              0               16855
> >> from            3051.6410844950005              0               15488
> >> year            2821.504116652999               0               14391
> >> state           2787.5064550531097              0               14228
> >> said            2672.7490201727487              0               13676
> >> it              2638.972676954698               0               13513
> >> he              2493.870809029322               0               12810
> >> new             2299.653438703157               0               11862
> >>
> >> Top labels for Cluster 129265 containing 9461 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               16362.85457371641               0               72060
> >> his             2813.167819214519               0               17203
> >> has             2751.908798408229               0               16855
> >> from            2513.176188033074               0               15488
> >> year            2323.752471229993               0               14391
> >> state           2295.767774611246               0               14228
> >> said            2201.3039346230216              0               13676
> >> it              2173.4997256915085              0               13513
> >> he              2054.0495802331716              0               12810
> >> new             1894.1558320098557              0               11862
> >>
> >> Top labels for Cluster 129279 containing 14559 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               26080.197364640888              0               72060
> >> his             4430.338072712999               0               17203
> >> has             4333.689091425855               0               16855
> >> from            3957.116204748396               0               15488
> >> year            3658.40981121175                0               14391
> >> state           3614.286633652635               0               14228
> >> said            3465.358771919273               0               13676
> >> it              3421.527382406406               0               13513
> >> he              3233.2411222746596              0               12810
> >> new             2981.251407010015               0               11862
> >>
> >> Top labels for Cluster 129290 containing 13592 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               24181.82589298836               0               72060
> >> his             4117.6785482652485              0               17203
> >> has             4027.8821644652635              0               16855
> >> from            3677.9947950267233              0               15488
> >> year            3400.440033295192               0               14391
> >> state           3359.4400672735646              0               14228
> >> said            3221.0516651300713              0               13676
> >> it              3180.321518546436               0               13513
> >> he              3005.353873868007               0               12810
> >> new             2771.180380204227               0               11862
> >> </snip>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Cluster Labels

Posted by Grant Ingersoll <gs...@apache.org>.
Hmmm, I tried a different field and got more or less the same result, i.e. the labels are the same for every cluster [1].  I've also included the cluster dump [2].  I suspect a bug.

[1]
Top labels for Cluster 129022 containing 19186 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
has             5836.5347257247195              0               16867
from            5328.54616727354                0               15499
year            4925.276801970322               0               14400
state           4866.91887763422                0               14240
new             4011.6858639516868              0               11867
after           3882.1740732807666              0               11503
first           3002.5827110484242              0               8998
two             2984.1892275922                 0               8945
unit            2930.794111499563               0               8791
one             2686.95768492762                0               8085

Top labels for Cluster 129119 containing 16043 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
has             4808.386086146813               0               16867
from            4390.346637147013               0               15499
year            4058.4180186586455              0               14400
state           4010.379176544491               0               14240
new             3306.234930681996               0               11867
after           3199.5810555517673              0               11503
first           2475.079962851014               0               8998
two             2459.926843432244               0               8945
unit            2415.9376569474116              0               8791
one             2215.042654468678               0               8085

Top labels for Cluster 129191 containing 7770 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
has             2243.2657141932286              0               16867
from            2048.755412856117               0               15499
year            1894.2384706358425              0               14400
state           1871.8704557279125              0               14240
new             1543.8513879175298              0               11867
after           1494.1429192917421              0               11503
first           1156.303048826754               0               8998
two             1149.2339147529565              0               8945
unit            1128.711646862328               0               8791
one             1034.9745452422649              0               8085

Top labels for Cluster 129302 containing 9426 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
has             2741.316972494591               0               16867
from            2503.501101480797               0               15499
year            2314.5996575923637              0               14400
state           2287.255346294027               0               14240
new             1886.2961270781234              0               11867
after           1825.5399498036131              0               11503
first           1412.654560342431               0               8998
two             1404.0158626483753              0               8945
unit            1378.9371921028942              0               8791
one             1264.391515379306               0               8085

Top labels for Cluster 129360 containing 13092 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
has             3870.8181769265793              0               16867
from            3534.623348234687               0               15499
year            3267.633215776179               0               14400
state           3228.989259615075               0               14240
new             2662.4551618834957              0               11867
after           2576.628638952039               0               11503
first           1993.499155438505               0               8998
two             1981.3008509986103              0               8945
unit            1945.8889682726003              0               8791
one             1784.1570986662991              0               8085

Top labels for Cluster 129371 containing 23944 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
has             7455.31941217836                0               16867
from            6805.274207816925               0               15499
year            6289.398677708115               0               14400
state           6214.757351316046               0               14240
new             5121.23683049297                0               11867
after           4955.695805796888               0               11503
first           3831.788851835765               0               8998
two             3808.2933898111805              0               8945
unit            3740.0891623105854              0               8791
one             3428.6551325367764              0               8085

Top labels for Cluster 129373 containing 9885 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
has             2880.6778563517146              0               16867
from            2630.736483251676               0               15499
year            2432.208566541318               0               14400
state           2403.4711471684277              0               14240
new             1982.0948037123308              0               11867
after           1918.2465800205246              0               11503
first           1484.359997350257               0               8998
two             1475.282112147659               0               8945
unit            1448.9285028181039              0               8791
one             1328.560536378529               0               8085

Top labels for Cluster 129377 containing 11303 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
has             3314.8890487886965              0               16867
from            3027.14497121796                0               15499
year            2798.608615776524               0               14400
state           2765.528720188886               0               14240
new             2280.5166378575377              0               11867
after           2207.0322705539875              0               11503
first           1707.7044410486706              0               8998
two             1697.2581536169164              0               8945
unit            1666.932174641639               0               8791
one             1528.4241032432765              0               8085

Top labels for Cluster 129381 containing 11411 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
has             3348.190782570746               0               16867
from            3057.545994592365               0               15499
year            2826.7072093421593              0               14400
state           2793.2941474220715              0               14240
new             2303.4001871203072              0               11867
after           2229.176642407663               0               11503
first           1724.8293614634313              0               8998
two             1714.2781240069307              0               8945
unit            1683.6474849330261              0               8791
one             1543.7481994605623              0               8085

Top labels for Cluster 129391 containing 7334 vectors
Term             LLR             In-ClusterDF            Out-ClusterDF 
has             2113.35227333894                0               16867
from            1930.1305988361128              0               15499
year            1784.577833758667               0               14400
state           1763.5072347805835              0               14240
new             1454.5072316131555              0               11867
after           1407.6797917694785              0               11503
first           1089.4127462548204              0               8998
two             1082.7530186888762              0               8945
unit            1063.4192575318739              0               8791
one             975.1101242941804               0               8085
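
For anyone following along, here is a minimal sketch of why an In-ClusterDF of 0 for every term would produce the same label list for every cluster. It assumes the score is the standard 2x2 log-likelihood ratio over (in-cluster DF, cluster size - in-cluster DF, out-cluster DF, remaining docs); that table layout, the helper names, and the corpus size (taken here as the sum of the ten cluster sizes above, 129394) are assumptions for illustration, not something read out of the ClusterLabels source.

```python
import math

def xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

def score_term(in_df, out_df, cluster_size, corpus_size):
    # Assumed layout (hypothetical helper):
    #   k11 = docs in the cluster containing the term
    #   k12 = docs in the cluster without the term
    #   k21 = docs outside the cluster containing the term
    #   k22 = docs outside the cluster without the term
    k12 = cluster_size - in_df
    k22 = corpus_size - cluster_size - out_df
    return llr(in_df, k12, out_df, k22)

# Out-ClusterDF values copied from the output above.
OUT_DF = {"has": 16867, "from": 15499, "year": 14400, "state": 14240,
          "new": 11867, "after": 11503, "first": 8998, "two": 8945,
          "unit": 8791, "one": 8085}

# Assumption: every document landed in one of the ten clusters listed.
CORPUS_SIZE = 129394

def top_labels(cluster_size):
    """Rank terms by LLR with In-ClusterDF forced to 0, as in the output."""
    scores = {t: score_term(0, df, cluster_size, CORPUS_SIZE)
              for t, df in OUT_DF.items()}
    return sorted(scores, key=scores.get, reverse=True)

by_out_df = sorted(OUT_DF, key=OUT_DF.get, reverse=True)
print(top_labels(19186) == top_labels(7334) == by_out_df)  # prints True
```

With In-ClusterDF stuck at 0, the score for a given cluster grows with the out-cluster DF alone, so every cluster ends up ranked by plain corpus frequency and only the magnitudes change with cluster size. Under the corpus-size assumption above, score_term(0, 16867, 19186, CORPUS_SIZE) also comes out close to the value reported for "has" in Cluster 129022, which suggests the LLR arithmetic itself is behaving and the zeros in the In-ClusterDF column are the thing to chase.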

[2]
:C-129022: [0:0.001, 000:0.003, 004:0.000, 0040:0.000, 0060:0.000, 01:0.000, 0100:0.000, 0110:0.000, 
        Top Terms: 
                from                                    =>0.022236135215980328
                u                                       => 0.01589135359475966
                busi                                    =>0.014789942880805335
                bank                                    =>0.014395075820558541
                us                                      => 0.01402954110138604
                presid                                  => 0.01341952961319183
                month                                   =>0.012118726267037198
                about                                   =>0.011986047971260612
                compani                                 =>0.011201454374207618
                obama                                   => 0.01105482429336391
:C-129119: [0:0.001, 00:0.000, 000:0.003, 03:0.000, 04:0.000, 05:0.000, 0656:0.000, 07:0.000, 09:0.00
        Top Terms: 
                citi                                    => 0.04119064757467011
                former                                  =>0.030966538725529232
                home                                    =>0.029642735534519644
                player                                  => 0.02879703136878369
                soccer                                  => 0.01847372541986708
                has                                     =>0.015236681440174855
                mark                                    =>0.015185164518720528
                new                                     => 0.01266468154720074
                polic                                   => 0.01253454821409647
                world                                   =>0.011803315296178046
:C-129191: [0:0.013, 00:0.002, 000:0.004, 000000:0.000, 001:0.000, 0011:0.000, 0022:0.000, 003:0.000,
        Top Terms: 
                4                                       =>0.027636996760550075
                3                                       =>0.026093296145846434
                1                                       => 0.02570191540464146
                5                                       =>0.024807189589701305
                2                                       =>0.023669513631826157
                were                                    =>0.021134415210709086
                sunday                                  =>0.017928504766147838
                play                                    =>0.017243683740808733
                through                                 =>0.017133336974828554
                game                                    =>0.017027790192043733
:C-129302: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 002:0.000, 008:0.000, 01:0.000, 011:0.000, 0112:
        Top Terms: 
                new                                     =>0.039501149799390206
                peopl                                   => 0.01933397797740685
                world                                   =>0.017478792605253438
                could                                   =>0.013495142418778704
                has                                     =>0.012987326502897916
                more                                    =>0.012585724039194569
                from                                    =>0.012242682917236177
                face                                    =>  0.0117046220661272
                leader                                  =>0.011579584625370691
                presid                                  =>0.011192085113854965
:C-129360: [0:0.000, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 005:0.000, 007:0.000, 008:
        Top Terms: 
                state                                   =>0.044732720259456946
                unit                                    =>0.032493582810588666
                year                                    =>0.025651340609304542
                san                                     =>0.025617706557963606
                after                                   =>0.022019046306438913
                francisco                               =>0.020771004252363168
                california                              => 0.01847124801606253
                day                                     =>0.015514125170527842
                wednesday                               =>0.014587851421509652
                citi                                    =>0.012973538756014369
:C-129371: [0:0.002, 00:0.000, 000:0.001, 01:0.000, 010:0.000, 0134:0.000, 016:0.000, 02:0.000, 03:0.
        Top Terms: 
                game                                    => 0.04311022785679375
                has                                     => 0.03059922226267673
                all                                     =>0.027605073346921877
                leagu                                   =>  0.0267627245855276
                star                                    => 0.02206632764439995
                final                                   =>0.020017765794918686
                season                                  => 0.01534931562714024
                start                                   => 0.01450896856938099
                week                                    =>0.014407234069110549
                nation                                  => 0.01429746391305699
:C-129373: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 01:0.000, 012:0.000, 016:0.000, 03:0.000, 034:0.
        Top Terms: 
                coach                                   => 0.05209277512761816
                team                                    =>0.031773971685165554
                charg                                   =>0.024246280249912454
                from                                    => 0.02093643936347752
                has                                     => 0.02057631329905952
                week                                    =>0.016848920922797363
                last                                    => 0.01674320150844955
                program                                 =>0.016023081209070564
                former                                  =>0.015872337289314063
                after                                   => 0.01341825692502786
:C-129377: [0:0.002, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000, 006:0.000, 0065:0.000, 007
        Top Terms: 
                been                                    => 0.03757994091979662
                time                                    => 0.03591307497544333
                first                                   => 0.03422461795380875
                has                                     =>0.029800513863644906
                feder                                   =>0.027382680342986195
                monday                                  =>0.022174840523045594
                sinc                                    => 0.02185219249613946
                year                                    => 0.01933420097135394
                from                                    => 0.01162537888358458
                state                                   =>0.009756869426688311
:C-129381: [0:0.004, 00:0.000, 000:0.002, 00000000235:0.000, 001:0.000, 0011:0.000, 002:0.000, 0051:0
        Top Terms: 
                win                                     => 0.03267669747239372
                one                                     =>0.031009191445456212
                second                                  =>0.028066582472705007
                three                                   =>0.026147346665631184
                out                                     =>  0.0226123748207931
                shot                                    =>0.020446190395276405
                last                                    =>0.019624841184867056
                night                                   =>0.019103407305052604
                over                                    =>0.017376642133669604
                year                                    =>0.016475201865715022
:C-129391: [0:0.003, 00:0.000, 000:0.001, 002:0.000, 01:0.000, 0112:0.000, 0123:0.000, 02:0.000, 0213
        Top Terms: 
                championship                            =>0.035449579372280104
                run                                     =>0.026446073370591447
                art                                     => 0.02489330236372834
                open                                    => 0.02282619503375418
                place                                   =>0.022410914360311056
                grand                                   =>  0.0169734705340118
                reuter                                  =>0.015895311339829302
                6                                       =>0.015700075983436933
                continu                                 =>0.015418929721703813
                slam                                    =>0.012102435338420274


-Grant
On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:

> Can't say just off-hand.
> 
> What is the data?
> 
> On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> I'm running ClusterLabels and it seems to be outputting the same values for
>> every centroid [1].  When I run the cluster dumper, the top terms are fairly
>> different for those same vectors.
>> 
>> Have I hit a vagary of LLR or is this a bug?
>> 
>> 
>> Thanks,
>> Grant
>> 
>> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Cluster Labels

Posted by Ted Dunning <te...@gmail.com>.
Can't say just off-hand.

What is the data?

On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll <gs...@apache.org> wrote:

> I'm running ClusterLabels and it seems to be outputting the same values for
> every centroid [1].  When I run the cluster dumper, the top terms are fairly
> different for those same vectors.
>
> Have I hit a vagary of LLR or is this a bug?
>
>
> Thanks,
> Grant
>
>
> [1]
> <snip>
> Top labels for Cluster 129062 containing 22710 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> a               43269.00830466254               0               72060
> his             7185.503760070074               0               17203
> has             7028.243643655442               0               16855
> from            6415.739411605988               0               15488
> year            5930.141497239005               0               14391
> state           5858.43069797568                0               14228
> said            5616.422720833216               0               13676
> it              5545.207108973991               0               13513
> he              5239.340392438695               0               12810
> new             4830.124521905556               0               11862
> </snip>