Posted to user@mahout.apache.org by Sameer Tilak <ss...@live.com> on 2013/12/03 01:22:03 UTC

Mahout for clustering



Hi All,

We are using Apache Pig to build our data pipeline. Our data looks like this:
userid, age, items {code 1, code 2, ….}, few other features...
Each item has a unique alphanumeric code. I would like to use Mahout to cluster this data. Based on my reading so far, I see the following options:
1. Map each alphanumeric item code to a numeric code -- AAAAA1 -> 0, AAAAA2 -> 1, AAAAA3 -> 2, etc. Then run the clustering algorithm on the reformatted data and map the results back onto the real item codes.
2. Represent the item codes as a 1 x M matrix in which each column represents an item (1 if a given user has viewed that item, 0 otherwise). This matrix will have millions of columns. Each user will then have an id, an age, and this matrix. Not sure if this will work…..
We also want to do frequent pattern mining etc. on the same data. Any thoughts on data representation and clustering would be great.
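
The two options above can be combined: the code-to-integer mapping from option 1 can supply the column positions for the 1 x M row in option 2, and only the non-zero columns need to be stored. Below is a minimal Python sketch of that idea, not Mahout code. The record layout (dicts with "userid", "age", "items") and the helper names build_item_index and to_sparse_row are assumptions made up for illustration.

```python
from typing import Dict, Iterable, List


def build_item_index(users: Iterable[dict]) -> Dict[str, int]:
    """Assign each distinct item code the next free integer column index."""
    index: Dict[str, int] = {}
    for user in users:
        for code in user["items"]:
            if code not in index:
                index[code] = len(index)
    return index


def to_sparse_row(user: dict, index: Dict[str, int]) -> List[int]:
    """Return the sorted column indices whose value is 1 for this user.

    Columns that are absent are implicitly 0, so a row costs memory
    proportional to the items viewed, not to the total number of items.
    """
    return sorted({index[code] for code in user["items"]})


# Hypothetical sample records in the shape described in the post.
users = [
    {"userid": "u1", "age": 34, "items": ["AAAAA1", "AAAAA2"]},
    {"userid": "u2", "age": 27, "items": ["AAAAA2", "AAAAA3"]},
]

index = build_item_index(users)
rows = {u["userid"]: to_sparse_row(u, index) for u in users}

# Reverse mapping, used to translate cluster output back to real item codes.
reverse = {col: code for code, col in index.items()}
```

The reverse dictionary is what lets results expressed in column numbers be mapped back onto the original alphanumeric codes.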

 		 	   		  

Re: Mahout for clustering

Posted by Ted Dunning <te...@gmail.com>.
Do you want to cluster users or items?

For items, the vectorization that you suggest will work reasonably well,
especially if you use TF.IDF weighting and normalize the resulting vectors.
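
As a rough illustration of the weighting and normalization mentioned above, here is a small Python sketch (not Mahout code). It treats each item as a vector over the users who viewed it, uses a term frequency of 1 for every view, and applies a smoothed IDF of log((1 + N) / (1 + df)) + 1. That particular IDF formula, the function name item_vectors_tfidf, and the sample data are assumptions chosen for the example; other TF-IDF variants would work as well.

```python
import math
from collections import defaultdict


def item_vectors_tfidf(views):
    """Build L2-normalized, IDF-weighted item vectors.

    views maps a user id to an iterable of item codes that user viewed.
    The result maps an item code to a sparse vector {user id: weight}.
    """
    item_users = defaultdict(set)
    user_df = {}  # number of distinct items each user viewed
    for user, items in views.items():
        distinct = set(items)
        user_df[user] = len(distinct)
        for item in distinct:
            item_users[item].add(user)

    n_items = len(item_users)
    vectors = {}
    for item, users in item_users.items():
        # tf is 1 for every (item, user) pair, so the weight is just the IDF.
        # Users who viewed few items get a higher weight than users who
        # viewed almost everything.
        weights = {
            u: math.log((1 + n_items) / (1 + user_df[u])) + 1.0 for u in users
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors[item] = {u: w / norm for u, w in weights.items()}
    return vectors


# Hypothetical sample data.
views = {
    "u1": ["AAAAA1", "AAAAA2"],
    "u2": ["AAAAA2", "AAAAA3"],
    "u3": ["AAAAA2"],
}
vectors = item_vectors_tfidf(views)
```

After normalization every item vector has unit length, so Euclidean distance and cosine similarity rank neighbours the same way.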

You can also use one of the matrix decomposition techniques and cluster the
resulting vectors.  The spectral clustering system that is part of Mahout
will do all of this in one step.  SVD + streaming k-means + ball k-means
should also work well.
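
To make the final clustering step concrete, here is a small Python sketch of plain Lloyd's k-means on short dense vectors, such as the ones a decomposition would produce. This is a swapped-in textbook algorithm, not Mahout's streaming k-means or ball k-means, and the SVD step is skipped entirely. The function names, the sample points, and the choice of initial centroids are assumptions for illustration only.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's k-means with caller-supplied initial centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        # A centroid with no members is left where it is.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(p, centroids) for p in points]


# Hypothetical 2-dimensional vectors standing in for reduced item vectors.
points = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
centroids, labels = kmeans(points, centroids=[points[0], points[2]])
```

With these sample points the first two vectors end up in one cluster and the last two in the other.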






Re: Mahout for clustering

Posted by Andrew Musselman <an...@gmail.com>.
I would probably write a script to parse that out and stream to it from Pig.

http://pig.apache.org/docs/r0.11.0/basic.html#stream
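
A script of the kind described above might look roughly like the following Python sketch. It assumes each record reaches the script as one tab-delimited line with the fields in the order userid, age, items, and that the items field is a bag written like {(AAAAA1),(AAAAA2)}. Both the input layout and the output layout (userid, age, space-separated codes) are assumptions, as are the function names; adjust them to whatever the Pig relation actually emits.

```python
import re
import sys


def parse_line(line):
    """Split one tab-delimited record into (userid, age, item codes)."""
    userid, age, items_field = line.rstrip("\n").split("\t")[:3]
    # Pull out the alphanumeric codes and ignore braces, parentheses, commas.
    codes = re.findall(r"[A-Za-z0-9]+", items_field)
    return userid, int(age), codes


def format_line(userid, age, codes):
    """Emit one tab-delimited record with de-duplicated, sorted codes."""
    return "\t".join([userid, str(age), " ".join(sorted(set(codes)))])


def transform(lines):
    """Yield one reformatted record per non-empty input line."""
    for line in lines:
        if line.strip():
            yield format_line(*parse_line(line))


# Hypothetical input lines, shaped as assumed above.
sample = ["u1\t34\t{(AAAAA2),(AAAAA1)}\n", "\n", "u2\t27\t{(AAAAA3)}\n"]
output = list(transform(sample))

# When run as a streaming script, the same function would read standard input:
#     for record in transform(sys.stdin):
#         print(record)
```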


On Mon, Dec 2, 2013 at 4:30 PM, Sameer Tilak <ss...@live.com> wrote:

> I am looking for some input on how to vectorize my data.

RE: Mahout for clustering

Posted by Sameer Tilak <ss...@live.com>.
I am looking for some input on how to vectorize my data. 
