Posted to common-user@hadoop.apache.org by Shane Butler <sh...@gmail.com> on 2009/02/22 23:52:44 UTC

Optimal job design?

Hi all,

I have a problem where I need to compare each input record against one or
more large files (the large files are loaded into memory). Which files a
record needs to be compared against depends on the record itself.

Currently I have it working, but I doubt it is the optimal approach. I run
a preliminary job before the main job: the mapper emits each record one or
more times, keyed by the file it will later be compared against. The reduce
phase then sorts the records by file, so that the main job is not
constantly swapping these large files in and out of memory.
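
Roughly, the pre-processing job looks something like the sketch below
(heavily simplified; the class names, record types and the file-selection
logic are just placeholders, not my actual code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TagByFileJob {

  // Mapper: tag each record with the name of the reference file it
  // must be compared against; the shuffle/sort then groups records
  // by file name.
  public static class TagMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      // chooseFiles() stands in for the application-specific logic
      // that decides which large file(s) this record needs.
      for (String fileName : chooseFiles(record.toString())) {
        context.write(new Text(fileName), record);
      }
    }

    private Iterable<String> chooseFiles(String record) {
      // placeholder for illustration only
      return java.util.Collections.singletonList("fileA");
    }
  }

  // Reducer: records arrive grouped by file name, so the main job can
  // later read them back in file order and load each large file only once.
  public static class GroupReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text fileName, Iterable<Text> records,
        Context context) throws IOException, InterruptedException {
      for (Text record : records) {
        context.write(fileName, record);
      }
    }
  }
}

The main job then reads this output sequentially, loading each large file
into memory only when the file name in the key changes.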

This method works well; however, the first job writes a lot of data to
HDFS and takes a long time to run. Given that it's a relatively simple
task, I was wondering if there is a better way to do it? Comments
appreciated!

Kind Regards,
Shane