Posted to user@mahout.apache.org by Matt Spitz <ms...@meebo-inc.com> on 2010/11/02 15:53:25 UTC
Re: clusterdump + document ids?

You're totally right. The short answer is that I'm an idiot: seq2sparse has
a -nv option to generate NamedVectors, which I had failed to realize.
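
For anyone who lands on this thread later, here is a rough, self-contained
Java sketch of the check Jeff describes below. The classes are hypothetical
stand-ins made up for illustration (they are not Mahout's Vector or
NamedVector), and the document name is invented. The point is only that the
"<name> = " prefix appears when the vector carries a name, which is why my
output had no IDs before I passed -nv.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

public class NamedVectorFormatSketch {

    /** Hypothetical stand-in for a sparse term vector (not Mahout's Vector). */
    static class SimpleVector {
        final Map<String, Double> terms = new LinkedHashMap<>();

        void put(String term, double weight) {
            terms.put(term, weight);
        }
    }

    /** Hypothetical stand-in for a vector that also carries a document name. */
    static class NamedSimpleVector extends SimpleVector {
        final String name;

        NamedSimpleVector(String name) {
            this.name = name;
        }
    }

    /** Prefixes "<name> = " only when the vector is a named one. */
    static String asFormatString(SimpleVector v) {
        StringBuilder buf = new StringBuilder();
        if (v instanceof NamedSimpleVector) {
            buf.append(((NamedSimpleVector) v).name).append(" = ");
        }
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        for (Map.Entry<String, Double> e : v.terms.entrySet()) {
            joiner.add(e.getKey() + ":" + String.format(Locale.ROOT, "%.3f", e.getValue()));
        }
        return buf.append(joiner.toString()).toString();
    }

    public static void main(String[] args) {
        SimpleVector plain = new SimpleVector();
        plain.put("16", 3.076);
        plain.put("35", 4.400);

        // "/example-doc.txt" is an invented name, used only for illustration.
        NamedSimpleVector named = new NamedSimpleVector("/example-doc.txt");
        named.put("16", 3.076);
        named.put("35", 4.400);

        System.out.println(asFormatString(plain));
        System.out.println(asFormatString(named));
    }
}
```

Running main prints the unnamed vector as a bare bag of words and the named
one with its name in front, which mirrors the difference between vectors
generated without and with -nv.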

Thanks, Jeff!
On Oct 29, 2010 10:07 PM, "Jeff Eastman" <jd...@windwardsolutions.com> wrote:
> Hi Matt,
>
> K-means passes NamedVectors transparently through its processing,
> including the clustering output. Looking over the relevant code:
>
> ClusterDumper.print() calls AbstractVector.asFormatString(v,bindings)
> AbstractVector.asFormatString(v,bindings) evaluates
>   if (v instanceof NamedVector) {
>     buf.append(((NamedVector) v).getName()).append(" = ");
>   }
> ...before printing the formatted vector. Given that the sequence files
> generated by seq2sparse and presented to kmeans contain NamedVector
> wrappers (which they appear to do), the output should look like <name> =
> [<vector>]
>
> I don't know why you aren't seeing that. Can you please investigate?
> Jeff
>
> On 10/29/10 11:30 AM, Matt Spitz wrote:
>> Hey, folks.
>>
>> If I run kmeans-clustering with the -cl option, I get
>> <kmeans_output>/clusteredPoints.
>>
>> Running clusterdump with -p <kmeans_output>/clusteredPoints, I get output
>> that looks like this for a given cluster (running on the reuters corpus):
>>
>> ...
>> * Top Terms:*
>> * said => 1.421944826722092*
>> * 3 => 0.9007495669006188*
>> * reuter => 0.8924866335531932*
>> ...
>> * Weight: Point:*
>> * 1.0: [srd:7.671, 20.00:8.269, bp:13.510, co:3.989, financial:5.109,
>> under:3.297, activities:5.452, called:4.288, investment:3.831, owns:5.457,
>> interest:3.438, market:2.977, companies:3.894, plan:4.041, 02:4.706,
>> joint:4.573, both:4.004, manage:6.697, oversight:8.364, trading:3.842,
>> also:2.737, venture:7.009, 15:2.866, money:4.178, borrowing:5.718,
>> inc:2.282, committee:4.438, north:6.854, 26-feb-1987:5.496, 55:4.719,
>> form:7.126, management:4.015, standard:10.665, subsidiary:4.040, plc:4.462,
>> america:6.863, 3:1.124, which:2.445, petroleum:4.716, oil:7.377,
>> operated:6.224, british:4.646, reuter:1.133, pct:2.233, said:1.330,
>> unit:3.541]*
>> * 1.0: [16:3.076, 35:4.400, 26-feb-1987:5.496]*
>> * 1.0: [54.20:9.057, 16:3.076, 36:4.670, 26-feb-1987:5.496]*
>> ...
>>
>> I get the bags of words that end up in a given cluster, but I don't see the
>> original document ID from which each bag of words was generated (e.g.
>> reut2-111.sgm-211.txt).
>>
>> In the sequence file generated by 'seqdirectory', we get the following:
>>
>> *[mspitz@wowzers mahout-distribution-0.4]$ hadoop dfs -cat
>> examples/bin/work/reuters-out-seqdir/chunk-0 | head*
>> *SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
>> �$�]Ƥ��S@7A/reut2-020.sgm-237.txt�'20-OCT-1987
>> 10:29:21.69*
>>
>> *HUTTON<EFH> REITERATES STATEMENT OF SOLVENCY*
>> ...
>>
>> In the sparse vectors (which are passed into kmeans), we get:
>> *[mspitz@wowzers mahout-distribution-0.4]$ hadoop dfs -cat
>> examples/bin/work/reuters-out-seqdir-sparse/tf-vectors/part-r-00000 | head*
>> *1�e�>�C/reut2-000.sgm-0.txt [binary vector data omitted]*
>> ...
>>
>> It looks like the document IDs are being passed on through the data
>> wrangling but then unused by kmeans and/or not reported in clusteredPoints.
>> It seems to me like that'd be super useful to have them in the final
>> output. Are they easy to get at?
>>
>> Thanks,
>> Matt
>