Posted to common-user@hadoop.apache.org by Chris Fellows <ch...@yahoo.com> on 2007/12/14 19:41:33 UTC

advanced map/reduce tutorials?

Hello,

The map/reduce tutorials in the hadoop src are great for getting started. Are there any similar tutorials for more advanced use cases, especially complicated ones that involve subclassing RecordReader, InputFormat, and other classes?

In particular I want to write a job that does a Cartesian product of a file, i.e. it takes each row in the file and compares it against every other row in the file. My first pass involved writing a NonSplittableInputFormat and a RecordReader that composes 2 LineRecordReaders, one outerReader and one innerReader. This hands the map task two rows merged into one record, and the map task does the comparison.
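(For anyone following along, here is a rough, non-Hadoop sketch of what the composed outer/inner reader is doing. Names like cartesianPairs are just illustrative; the real version would live inside a RecordReader, not a plain method.)

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the composed-reader idea: an "outer" cursor and an "inner"
// cursor over the same lines, emitting one merged record per (outer, inner)
// pair -- the full Cartesian product, n^2 records for n lines.
public class CartesianReaderSketch {
    static List<String> cartesianPairs(List<String> lines) {
        List<String> merged = new ArrayList<>();
        for (String outer : lines) {           // outerReader position
            for (String inner : lines) {       // innerReader rewinds per outer row
                merged.add(outer + "\t" + inner); // one record handed to map()
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> out = cartesianPairs(List.of("a", "b", "c"));
        System.out.println(out.size() + " merged records"); // 9 for 3 lines
    }
}
```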

It seems there must be a better way to do this. Additionally, no matter how many map tasks I request, the job tracker only creates and assigns one map task. Any ideas on a better approach? Has anyone done anything similar?

Thanks!


RE: advanced map/reduce tutorials?

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Brute force: let the input be splittable. In each map task, open the original file and, for each line in the split, iterate over all preceding lines in the input file. This will at least get you the parallelism.
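A rough sketch of that plan, with the split and the "re-open the whole file" step simulated in plain Java (mapTask and its arguments are illustrative, not Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

// Brute-force sketch: the input stays splittable, each map task sees only
// its slice of the lines, but also reads the whole file and compares every
// line in its slice against all lines that precede it. Across all splits
// this covers each unordered pair exactly once: n*(n-1)/2 comparisons.
public class BruteForceSketch {
    // One simulated map task: 'split' is the slice [start, end) of 'all'.
    static List<String> mapTask(List<String> all, int start, int end) {
        List<String> comparisons = new ArrayList<>();
        for (int i = start; i < end; i++) {
            for (int j = 0; j < i; j++) {      // all preceding lines in the file
                comparisons.add(all.get(j) + " vs " + all.get(i));
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a", "b", "c", "d");
        // Two "map tasks" over two splits together cover all 4*3/2 = 6 pairs.
        int total = mapTask(lines, 0, 2).size() + mapTask(lines, 2, 4).size();
        System.out.println(total + " comparisons");
    }
}
```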

But a better approach would be to try and cast your problem as a sorting/grouping problem. Do all lines really have to be compared against each other, or is it possible to bucketize lines that might match? (Then use map/reduce to group the lines, and do the matching within reduce.)
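The group-then-match idea in miniature, with the shuffle simulated in plain Java. The bucket key here (first character) is purely a placeholder; in a real job you would pick a key such that any two lines that could possibly match share a bucket:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of bucketize-then-compare: map emits (bucketKey(line), line), the
// framework groups by key, and reduce compares lines only within a bucket.
public class BucketizeSketch {
    static String bucketKey(String line) {      // map-side key (placeholder)
        return line.isEmpty() ? "" : line.substring(0, 1);
    }

    // Simulated shuffle: group lines by key, as the framework would.
    static Map<String, List<String>> shuffle(List<String> lines) {
        Map<String, List<String>> buckets = new TreeMap<>();
        for (String line : lines) {
            buckets.computeIfAbsent(bucketKey(line), k -> new ArrayList<>()).add(line);
        }
        return buckets;
    }

    // Simulated reduce: number of pairwise comparisons inside one bucket.
    static int comparisonsIn(List<String> bucket) {
        return bucket.size() * (bucket.size() - 1) / 2;
    }

    public static void main(String[] args) {
        int total = 0;
        for (List<String> b : shuffle(List.of("apple", "avocado", "banana", "cherry")).values()) {
            total += comparisonsIn(b);
        }
        // Only "apple"/"avocado" share a bucket: 1 comparison instead of 6.
        System.out.println(total + " comparisons");
    }
}
```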

