Posted to user@mahout.apache.org by Sameer Tilak <ss...@live.com> on 2013/12/23 21:04:52 UTC

Vectorizing data in mapreduce mode

Hi everyone,

My Pig script generates the following output; the results are stored in the files part-m-00000 through part-m-00004.

-bash-4.1$ hadoop dfs -ls /scratch/ItemIds

Found 7 items
-rw-r--r--   1 userid supergroup          0 2013-12-23 11:13 /scratch/ItemIds/_SUCCESS
drwxr-xr-x   - userid supergroup          0 2013-12-23 11:12 /scratch/ItemIds/_logs
-rw-r--r--   1 userid supergroup     276019 2013-12-23 11:12 /scratch/ItemIds/part-m-00000
-rw-r--r--   1 userid supergroup     272188 2013-12-23 11:12 /scratch/ItemIds/part-m-00001
-rw-r--r--   1 userid supergroup     252597 2013-12-23 11:12 /scratch/ItemIds/part-m-00002
-rw-r--r--   1 userid supergroup     236508 2013-12-23 11:12 /scratch/ItemIds/part-m-00003
-rw-r--r--   1 userid supergroup     270658 2013-12-23 11:12 /scratch/ItemIds/part-m-00004

The output is stored as tab-separated values:

userid1 itemid1 itemid2 itemid3 ......
userid2 itemid1 itemid2 itemid3 ......
......

I have the following questions:

1. Is there a Mahout utility that I can point at /scratch/ItemIds and that will generate one file out of these 5 part files?

2. What is the recommended way of parsing these tab-separated files in MapReduce mode? I want to vectorize this data and would like to do that in parallel. I know how to vectorize the data correctly and how to run K-means on the result.
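
To make the parsing step in question 2 concrete, here is a minimal sketch in plain Java of splitting one line of the format shown above into a user id and its item ids. It rests on assumptions: the class name ItemLineParser, the nested UserItems type, and the parse method are hypothetical names chosen for illustration, and no Hadoop or Mahout API is used. In a real job this logic would presumably sit inside a mapper, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout or Hadoop.
public class ItemLineParser {

    /** One parsed record: a user id followed by that user's item ids. */
    static final class UserItems {
        final String userId;
        final List<String> itemIds;

        UserItems(String userId, List<String> itemIds) {
            this.userId = userId;
            this.itemIds = itemIds;
        }
    }

    /**
     * Parses a line of the form "userid<TAB>itemid1<TAB>itemid2...".
     * The first field is taken as the user id; the remaining non-empty
     * fields are taken as item ids.
     */
    static UserItems parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length == 0 || fields[0].isEmpty()) {
            throw new IllegalArgumentException("Line has no user id: '" + line + "'");
        }
        List<String> items = new ArrayList<>();
        for (int i = 1; i < fields.length; i++) {
            if (!fields[i].isEmpty()) { // skip empty fields from doubled tabs
                items.add(fields[i]);
            }
        }
        return new UserItems(fields[0], items);
    }

    public static void main(String[] args) {
        UserItems record = parse("userid1\titemid1\titemid2\titemid3");
        // prints: userid1 -> [itemid1, itemid2, itemid3]
        System.out.println(record.userId + " -> " + record.itemIds);
    }
}
```

Because each line is parsed independently of every other line, the same method could be applied to each part file separately, which is what makes the step a candidate for parallel execution.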

I have been using the following command to run my clustering algorithm on dummy data. Now, I want to ingest real data.

hadoop jar /apps/analytics/myanalytics.jar myanalytics.SimpleKMeansClustering -libjars /apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT.jar /:/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT-job.jar:/apps/mahout/trunk/math/target/mahout-math-0.9-SNAPSHOT.jar

However, I am not sure about one thing: if I write the code to vectorize the data in my SimpleKMeansClustering class, will the above command run it in MapReduce mode?