Posted to user@mahout.apache.org by Kidong Lee <my...@gmail.com> on 2011/02/15 01:32:18 UTC

How to get the Id List of items which belong to a cluster.

Hi,

My situation is almost the same as '12.1 Finding similar users on Twitter'
in the book Mahout in Action.

My input is a CSV file in which each line holds an item id and its contents,
separated by a comma (itemId, itemContents), for example:
1223, sports
1344, football nike
...

First I converted this CSV file to a sequence file and vectorized the
sequence file with SparseVectorsFromSequenceFiles.
Then I ran k-means clustering and got the clusters. Up to this point,
everything works fine.

Now I want to get the list of items that belong to each cluster, but I have
no idea how.
I printed the entries using cluster-dumper, but the output contains no
information about the item ids.

Any idea how to get the list of item ids that belong to a cluster?

- Kidong.

Re: How to get the Id List of items which belong to a cluster.

Posted by Robin Anil <ro...@gmail.com>.

+user so that this gets indexed. It's a typical mistake when you start out
with Mahout.

@Kidong You are welcome


Robin

On Mon, Feb 21, 2011 at 6:24 AM, Kidong Lee <my...@gmail.com> wrote:

> To vectorize, I had used SparseVectorsFromSequenceFiles *without the
> '-nv' (named vector) flag*.
> With the '-nv' flag, I got the correct clustered points, like this:
> ...
> Input Path:
> /user/root/item-contents-sample/cluster/out/clusteredPoints/part-m-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.clustering.WeightedVectorWritable
> Key: 45: Value: 1.0: 1102204 = [120:3.211]
> Key: 35: Value: 1.0: 1102939 = [93:5.394, 120:3.211]
> Key: 45: Value: 1.0: 1102945 = [120:3.211]
> Key: 35: Value: 1.0: 1102946 = [93:5.394, 120:3.211]
> ....
>
> Now the item id is included in the value's vector representation.
>
> Thank you Robin for correcting me!
>
> - Kidong.
>
>
> 2011/2/20 Robin Anil <ro...@gmail.com>
>
>> Hi Kidong, here the key is the nearest cluster id and the value is the
>> vector. I am guessing the identifier is getting dropped somehow. It looks
>> like a bug. Can you confirm that you created ids for the vectors you used
>> and wrapped them in a named vector?
>>
>>
>> Robin

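Once the vectors are named, the dumped clusteredPoints lines quoted above
("Key: 45: Value: 1.0: 1102204 = [120:3.211]") can be turned into a
per-cluster id list with a few lines of Python. This is only a minimal
sketch and not part of Mahout: the regular expression is inferred from the
sample output in this thread alone, and the function name items_by_cluster
is made up for illustration.

```python
import re
from collections import defaultdict

# Matches dump lines such as "Key: 45: Value: 1.0: 1102204 = [120:3.211]".
# Assumption: the layout is exactly as in the sample output of this thread.
LINE = re.compile(r"^Key: (\d+): Value: [\d.]+: (\S+) = \[")


def items_by_cluster(lines):
    """Group item ids by cluster id from dumped clusteredPoints lines."""
    clusters = defaultdict(list)
    for line in lines:
        match = LINE.match(line.strip())
        if match:  # headers, "..." and unnamed vectors do not match
            cluster_id, item_id = match.groups()
            clusters[int(cluster_id)].append(item_id)
    return dict(clusters)


sample = [
    "Key: 45: Value: 1.0: 1102204 = [120:3.211]",
    "Key: 35: Value: 1.0: 1102939 = [93:5.394, 120:3.211]",
    "Key: 45: Value: 1.0: 1102945 = [120:3.211]",
    "Key: 35: Value: 1.0: 1102946 = [93:5.394, 120:3.211]",
]
print(items_by_cluster(sample))
```

Lines dumped without '-nv' carry no "id = " part, so they are skipped
rather than guessed at.
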
Re: How to get the Id List of items which belong to a cluster.

Posted by Robin Anil <ro...@gmail.com>.

Hi Kidong, here the key is the nearest cluster id and the value is the
vector. I am guessing the identifier is getting dropped somehow. It looks
like a bug. Can you confirm that you created ids for the vectors you used
and wrapped them in a named vector?


Robin

On Wed, Feb 16, 2011 at 6:31 AM, Kidong Lee <my...@gmail.com> wrote:

> Thank you for your reply, Robin.
>
> I actually got the sequence file in the clusteredPoints directory like
> this:
>
> Input Path:
> /user/root/item-contents-sample/cluster/out/clusteredPoints/part-m-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.clustering.WeightedVectorWritable
> Key: 45: Value: 1.0: [120:3.211]
> Key: 35: Value: 1.0: [93:5.394, 120:3.211]
> Key: 45: Value: 1.0: [120:3.211]
> Key: 35: Value: 1.0: [93:5.394, 120:3.211]
> ...
>
> The key is the cluster id, and I think the value is not a mapping to the
> item id but a mapping from each token's value in the dictionary file to
> the tf-idf weight calculated during vectorization.
>
> Since I could not find a simple API in Mahout to get the item ids in a
> cluster, I worked around it as follows:
>
> First, I wrote a Hadoop M/R job to parse the vector sequence file and
> produce a CSV file (item-id, dic-token-value:tf-idf-weight).
> Second, I wrote another Hadoop M/R job to parse the clustered points
> sequence file and produce a CSV file (cluster-id,
> dic-token-value:tf-idf-weight).
> Finally, using Pig, I joined the vector CSV file and the cluster CSV file
> on dic-token-value:tf-idf-weight and grouped by cluster-id and item-id,
> which gave me the pairs of cluster-id and item-id in the output.
>
> - Kidong.

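Why the id can vanish is easier to see in a toy model. The sketch below is
not Mahout code: ToyVector, assign and toy_nearest are hypothetical names,
and the only assumption taken from this thread is that the clustering
output is keyed by cluster id and carries the vector as its value. Under
that assumption the input key is not carried along, so the item id survives
only if it travels inside the vector as its name.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyVector:
    weights: Tuple[Tuple[int, float], ...]  # (dictionary index, tf-idf weight)
    name: Optional[str] = None              # item id, present only if "named"


def assign(keyed_vectors, nearest_cluster):
    """Emit (cluster id, vector) pairs; the input key is not carried along."""
    return [(nearest_cluster(vec), vec) for _key, vec in keyed_vectors]


def toy_nearest(vec):
    # Hypothetical stand-in for a real distance computation.
    return 35 if len(vec.weights) > 1 else 45


w = ((120, 3.211),)
unnamed = assign([("1102204", ToyVector(w))], toy_nearest)
named = assign([("1102204", ToyVector(w, name="1102204"))], toy_nearest)
print(unnamed[0][1].name)  # None: the id was only in the discarded key
print(named[0][1].name)    # 1102204: the id travels with the vector
```
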
Re: How to get the Id List of items which belong to a cluster.

Posted by Kidong Lee <my...@gmail.com>.

Thank you for your reply, Robin.

I actually got the sequence file in the clusteredPoints directory like this:

Input Path:
/user/root/item-contents-sample/cluster/out/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.WeightedVectorWritable
Key: 45: Value: 1.0: [120:3.211]
Key: 35: Value: 1.0: [93:5.394, 120:3.211]
Key: 45: Value: 1.0: [120:3.211]
Key: 35: Value: 1.0: [93:5.394, 120:3.211]
...

The key is the cluster id, and I think the value is not a mapping to the
item id but a mapping from each token's value in the dictionary file to the
tf-idf weight calculated during vectorization.

Since I could not find a simple API in Mahout to get the item ids in a
cluster, I worked around it as follows:

First, I wrote a Hadoop M/R job to parse the vector sequence file and
produce a CSV file (item-id, dic-token-value:tf-idf-weight).
Second, I wrote another Hadoop M/R job to parse the clustered points
sequence file and produce a CSV file (cluster-id,
dic-token-value:tf-idf-weight).
Finally, using Pig, I joined the vector CSV file and the cluster CSV file
on dic-token-value:tf-idf-weight and grouped by cluster-id and item-id,
which gave me the pairs of cluster-id and item-id in the output.

- Kidong.

2011/2/16 Robin Anil <ro...@gmail.com>

> The clustering code has a parameter that controls whether the
> cluster-point assignments are generated. If it is set, the job creates a
> folder called clusteredPoints in the output directory, containing a
> sequence file with the mappings.
>
> Robin

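The join that Kidong describes (two CSV files matched on the vector text,
then grouped by cluster-id and item-id) can be sketched in memory as
follows. This is a minimal illustration, not the original M/R jobs or Pig
script: join_on_vector is a made-up name, the rows are small hand-written
samples in the spirit of the output shown in this thread, and it assumes
both files print a vector as exactly the same string.

```python
from collections import defaultdict


def join_on_vector(item_rows, cluster_rows):
    """Pair cluster ids with item ids whose vector text is identical."""
    items_for_vector = defaultdict(set)
    for item_id, vector in item_rows:
        items_for_vector[vector].add(item_id)

    pairs = set()  # a set gives the "group by cluster-id and item-id" effect
    for cluster_id, vector in cluster_rows:
        for item_id in items_for_vector.get(vector, ()):
            pairs.add((cluster_id, item_id))
    return sorted(pairs)


# (item-id, vector text) rows, as the first job would produce them.
item_rows = [
    ("1102204", "120:3.211"),
    ("1102939", "93:5.394,120:3.211"),
    ("1102945", "120:3.211"),
]
# (cluster-id, vector text) rows, as the second job would produce them.
cluster_rows = [
    (45, "120:3.211"),
    (35, "93:5.394,120:3.211"),
    (45, "120:3.211"),
]
print(join_on_vector(item_rows, cluster_rows))
```

Items with identical vectors are indistinguishable in such a join, so it
only works because identical vectors should end up in the same cluster.
Named vectors avoid the detour altogether.
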
Re: How to get the Id List of items which belong to a cluster.

Posted by Robin Anil <ro...@gmail.com>.

The clustering code has a parameter that controls whether the cluster-point
assignments are generated. If it is set, the job creates a folder called
clusteredPoints in the output directory, containing a sequence file with
the mappings.

Robin

On Tue, Feb 15, 2011 at 6:02 AM, Kidong Lee <my...@gmail.com> wrote:

> Hi,
>
> My situation is almost the same as '12.1 Finding similar users on Twitter'
> in the book Mahout in Action.
>
> My input is a CSV file in which each line holds an item id and its
> contents, separated by a comma (itemId, itemContents), for example:
> 1223, sports
> 1344, football nike
> ...
>
> First I converted this CSV file to a sequence file and vectorized the
> sequence file with SparseVectorsFromSequenceFiles.
> Then I ran k-means clustering and got the clusters. Up to this point,
> everything works fine.
>
> Now I want to get the list of items that belong to each cluster, but I
> have no idea how.
> I printed the entries using cluster-dumper, but the output contains no
> information about the item ids.
>
> Any idea how to get the list of item ids that belong to a cluster?
>
> - Kidong.
>