Posted to user@pig.apache.org by Sameer Tilak <ss...@live.com> on 2013/12/10 19:36:27 UTC

Data vectorization and pig bag datatype

Hi All,

We are using Apache Pig for building our data pipeline. We have data in the following fashion:

userid, items {code 1, code 2, ….}, few other features...

Each item has a unique alphanumeric code. I would like to use Mahout to cluster the data. To vectorize it, we represent each user's item codes as a 1 x M matrix in which each column corresponds to one item (1 if the user has viewed that item, 0 otherwise); the matrix will have millions of columns. So each user will have an id and this matrix. I am generating the matrix in a Pig UDF.
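To make the encoding concrete, here is a minimal sketch in plain Python. It is not the actual UDF: the item codes and the names build_index and vectorize are made up for illustration, and it assumes the full set of item codes is known up front so that every code gets a fixed column.

```python
def build_index(all_codes):
    """Assign each distinct item code a fixed column position (0..M-1)."""
    return {code: col for col, code in enumerate(sorted(set(all_codes)))}


def vectorize(viewed_codes, index):
    """Return the 1 x M row: 1 where the user viewed the item, 0 otherwise."""
    row = [0] * len(index)
    for code in viewed_codes:
        row[index[code]] = 1
    return row


# Hypothetical catalogue and user; real codes would come from the input data.
index = build_index(["A1", "B7", "C3", "D9"])
row = vectorize(["A1", "D9"], index)
print(row)  # [1, 0, 0, 1]
```

With millions of columns almost every entry in a row is 0, so the real version would probably be better off keeping only the column positions of the 1s rather than the full row.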

AU = FOREACH A GENERATE FLATTEN(myparser.myUDF(key, values)); 

/*Data I get back from my UDF should have the following format: {(userid,1,0,0,1,0,.........)} */ 

STORE AU into 'vector.out' using $SEQFILE_STORAGE ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');

/* Use mahout for analyzing the data */

I am returning a bag from my UDF because the data can potentially have hundreds of millions of items and, from my understanding, everything in a tuple needs to fit into memory. Is there a better way of doing this? I want to make sure that I am on the right track.