Posted to common-user@hadoop.apache.org by Dan Benjamin <db...@amazon.com> on 2008/11/18 19:53:47 UTC

Performing a Lookup in Multiple MapFiles?

I've got a Hadoop process that creates a MapFile as its output.  Using one
reducer this is very slow (as the map is large), but with 150 reducers (on a
cluster of 80 nodes) it runs quickly.  The problem is that it also produces
150 output files.  In a subsequent process I need to perform lookups on this
map - what is the recommended way to do that, given that I may not know the
number of MapFiles or their names?  Is there a cleaner solution than listing
the contents of the directory containing all of the MapFiles and then
querying each one in sequence?


Re: Performing a Lookup in Multiple MapFiles?

Posted by lohit <lo...@yahoo.com>.
Hi Dan,

You could do one of a few things to get around this.
1. In a subsequent step, merge all of your MapFile outputs into a single file. This works if the combined MapFile output is small.
2. Otherwise, use the same partition function Hadoop used, to recompute the partition ID for each key you look up. The partition ID tells you which of the 150 output files the key lives in (a sketch follows this list).
E.g. if the partition ID is 23, the file to look in is part-00023 in the generated output.
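Here is a minimal sketch of option 2 in Java. It assumes the job that wrote the MapFiles used the default HashPartitioner, produced Text keys and IntWritable values, and ran with a known number of reduce tasks (150 in your case); adjust the types, partitioner and reducer count to match your job.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class SingleMapFileLookup {

  // Look up one key in the output of a multi-reducer job that wrote MapFiles.
  // The partitioner and the reducer count must match the job that wrote them.
  public static IntWritable lookup(Path outputDir, Text key, int numReduceTasks)
      throws IOException {
    JobConf conf = new JobConf();
    FileSystem fs = FileSystem.get(conf);

    // Recompute the partition ID exactly as the writing job did.
    HashPartitioner<Text, IntWritable> partitioner =
        new HashPartitioner<Text, IntWritable>();
    int partition = partitioner.getPartition(key, null, numReduceTasks);

    // Reducer 23 wrote its MapFile under part-00023, and so on.
    Path mapFileDir = new Path(outputDir, String.format("part-%05d", partition));

    MapFile.Reader reader = new MapFile.Reader(fs, mapFileDir.toString(), conf);
    try {
      IntWritable value = new IntWritable();
      return (IntWritable) reader.get(key, value);  // null if the key is absent
    } finally {
      reader.close();
    }
  }
}

This opens only the one part file that can contain the key, so there is no directory listing and no probing of the other 149 files.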

You can write your own Partitioner class (make sure you use the same one for your first job as well as the second) or reuse the one Hadoop already used. http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/mapred/Partitioner.html has the details.
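If you go the custom route, here is a rough illustration (the class name, key/value types and partitioning rule are all made up; the only real requirement is that the writing job and the lookup code use the same class and the same number of partitions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative only: partitions keys by their first character.
public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // No job-specific setup needed for this example.
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    int c = s.length() == 0 ? 0 : s.charAt(0);
    return (c & Integer.MAX_VALUE) % numPartitions;
  }
}

Set it on the job that writes the MapFiles with conf.setPartitionerClass(FirstCharPartitioner.class), then call new FirstCharPartitioner().getPartition(key, null, 150) in your lookup code to pick the part file, as in the sketch above.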

I think http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/examples/SleepJob.html has a usage example (look for SleepJob.java).
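Also, if your version of MapFileOutputFormat has the getReaders()/getEntry() helpers (I believe the mapred API of that era does), they do the directory listing and partition arithmetic for you, so you never need to know how many part files there are. Again assuming Text keys, IntWritable values and the default HashPartitioner:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class GetEntryLookup {

  // getReaders() opens one MapFile.Reader per part file in the directory;
  // getEntry() uses the partitioner to pick the right reader for the key.
  public static Writable lookup(Path outputDir, Text key) throws IOException {
    JobConf conf = new JobConf();
    FileSystem fs = FileSystem.get(conf);

    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, outputDir, conf);
    try {
      IntWritable value = new IntWritable();
      return MapFileOutputFormat.getEntry(
          readers, new HashPartitioner<Text, IntWritable>(), key, value);
    } finally {
      for (MapFile.Reader reader : readers) {
        reader.close();
      }
    }
  }
}

The trade-off is that getReaders() opens a reader for every part file up front, which is convenient for repeated lookups but heavier than opening just the one file you need.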

-Lohit



