Posted to user@pig.apache.org by Sameer Tilak <ss...@live.com> on 2013/12/10 19:36:27 UTC
Data vectorization and pig bag datatype
Hi All,
We are using Apache Pig for building our data pipeline. We have data in the following fashion:
userid, items {code 1, code 2, ….}, and a few other features...
Each item has a unique alphanumeric code. I would like to use Mahout to cluster this data. To vectorize it, we represent the item codes as a 1 x M matrix in which each column represents an item (1 if a given user has viewed that item, 0 otherwise); the matrix will have millions of columns. So each user will have an id and this matrix. I am generating the matrix in a Pig UDF.
AU = FOREACH A GENERATE FLATTEN(myparser.myUDF(key, values));
/*Data I get back from my UDF should have the following format: {(userid,1,0,0,1,0,.........)} */
STORE AU into 'vector.out' using $SEQFILE_STORAGE ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');
/* Use mahout for analyzing the data */
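To make the encoding described above concrete, here is a minimal sketch in plain Java, independent of Pig and Mahout. It is not the actual myUDF; the class name, method names, and the five sample item codes are all hypothetical and exist only for illustration. It builds the 1 x M indicator row for one user and, for comparison, the same row stored as just the indices of the columns that are 1.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration only; not the real UDF from this pipeline.
public class ItemVectorSketch {

    // Assign each distinct item code a column index in order of first appearance.
    public static Map<String, Integer> buildIndex(List<String> allCodes) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String code : allCodes) {
            index.putIfAbsent(code, index.size());
        }
        return index;
    }

    // Dense 1 x M row: 1 if the user viewed the item in that column, 0 otherwise.
    // Codes that are missing from the index are skipped.
    public static int[] denseRow(Map<String, Integer> index, List<String> viewed) {
        int[] row = new int[index.size()];
        for (String code : viewed) {
            Integer col = index.get(code);
            if (col != null) {
                row[col] = 1;
            }
        }
        return row;
    }

    // Same information as denseRow, kept as the sorted indices of the 1 columns.
    public static int[] sparseRow(Map<String, Integer> index, List<String> viewed) {
        return viewed.stream()
                .map(index::get)
                .filter(Objects::nonNull)
                .distinct()
                .sorted()
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        // Made-up item codes; a real catalogue would have millions of entries.
        Map<String, Integer> index =
                buildIndex(Arrays.asList("A1", "B2", "C3", "D4", "E5"));
        List<String> viewed = Arrays.asList("A1", "D4");

        System.out.println(Arrays.toString(denseRow(index, viewed)));  // [1, 0, 0, 1, 0]
        System.out.println(Arrays.toString(sparseRow(index, viewed))); // [0, 3]
    }
}
```

The dense row grows with the total number of items M, while the index form grows only with the number of items a user has actually viewed.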
I am returning a bag from my UDF because the data can potentially have hundreds of millions of items and, from my understanding, everything in a tuple needs to fit into memory. Is there a better way of doing this? I want to make sure that I am on the right track.