Posted to common-user@hadoop.apache.org by DoomUs <dd...@nmt.edu> on 2011/04/17 00:06:28 UTC

Map Result Caching

I'd like to see if caching Map outputs is worth the time.  The idea is that
for certain jobs, many of the Map tasks will do the same thing they did the
last time they were run, for instance a monthly report over a vast data set
in which very little data changes.  So every month the job runs, and some
percentage of the Map tasks do exactly the same work they did last month.

What if the Mapper first checked HBase, or HDFS, for Map results for the
input it has?  That is, the Map input would act as a "key" to the "value"
that is the output we got from the Map last month.

Would it be faster to search for these cached outputs, rather than re-run
the Map?  That's the question I'm looking to answer.

Here are my questions:
  Do you have any great HBase tutorials? I haven't used it.
  Should I use HBase?
  What would the Mapper code look like if it first checked the cache and,
when a result is found, skipped processing the input and sent the cached
value straight to output?  (I've sketched below roughly what I have in mind.)
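
Here's a rough sketch, just to make the question concrete.  I haven't used
HBase, so the table name ("map_cache"), the column family, and the idea of
keying the cache on the raw input record are all assumptions on my part, not
a worked-out design:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CachingMapper extends Mapper<LongWritable, Text, Text, Text> {

  private HTable cacheTable;
  private static final byte[] FAMILY = Bytes.toBytes("r");
  private static final byte[] QUALIFIER = Bytes.toBytes("v");

  @Override
  protected void setup(Context context) throws IOException {
    // One table handle per map task; "map_cache" is a made-up table name.
    cacheTable = new HTable(
        HBaseConfiguration.create(context.getConfiguration()), "map_cache");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumption: the raw input record uniquely identifies the work the
    // Mapper would do, so it doubles as the cache row key.
    byte[] rowKey = Bytes.toBytes(value.toString());

    Result cached = cacheTable.get(new Get(rowKey));
    if (!cached.isEmpty()) {
      // Cache hit: emit last month's result without re-processing.
      byte[] oldOutput = cached.getValue(FAMILY, QUALIFIER);
      context.write(new Text(value.toString()),
                    new Text(Bytes.toString(oldOutput)));
      return;
    }

    // Cache miss: do the real work (placeholder transformation here),
    // then store the result for next month's run.
    String computed = expensiveTransform(value.toString());
    Put put = new Put(rowKey);
    put.add(FAMILY, QUALIFIER, Bytes.toBytes(computed));
    cacheTable.put(put);

    context.write(new Text(value.toString()), new Text(computed));
  }

  // Stand-in for whatever the job actually does per record.
  private String expensiveTransform(String record) {
    return record.toUpperCase();
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    cacheTable.close();
  }
}

The worry, of course, is whether an HBase Get per input record ends up
costing more than just re-running the Map function.  That's the trade-off
I'd want to measure.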

Any input is greatly appreciated.


Re: Map Result Caching

Posted by Robert Evans <ev...@yahoo-inc.com>.
DoomUs,

To me this seems like something that belongs at the application level rather than the Hadoop level.  If there really is very little delta between runs, the application could save the output of a map-only job, and the next time do a union of that saved output and the output of processing a delta file, rather than trying to detect and cache results automatically behind the scenes.  Or, if the application is just doing aggregation, it only has to save a little extra information with the aggregated data to be able to process the delta on its own and then combine it with the previous result.  I have seen delta processing work quite well in production.
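
Just to make the union idea concrete, here is a minimal sketch of that
second step, assuming last month's saved map output and the freshly mapped
delta are both plain tab-separated text on HDFS.  The paths and the identity
reducer are placeholders for illustration, not a real report job:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineWithPreviousMonth {

  /** Re-parses the tab-separated "key<TAB>value" lines written by the
   *  earlier map-only job so they can be regrouped by key. */
  public static class ParseMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2) {
        context.write(new Text(parts[0]), new Text(parts[1]));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "combine-with-previous-month");
    job.setJarByClass(CombineWithPreviousMonth.class);
    job.setMapperClass(ParseMapper.class);
    // Identity reducer just regroups records by key; a real report would
    // re-aggregate the old and new records here instead.
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // The "union" is simply adding both last month's saved map output and
    // the freshly processed delta as inputs to the same job.
    FileInputFormat.addInputPath(job, new Path("/reports/mapped/2011-03"));
    FileInputFormat.addInputPath(job, new Path("/reports/mapped/2011-04-delta"));
    FileOutputFormat.setOutputPath(job, new Path("/reports/output/2011-04"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The point is that the "cache" is just the previous run's output sitting in
HDFS, and the only new map work each month is processing the delta file.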

--Bobby Evans
