Posted to mapreduce-issues@hadoop.apache.org by "Ram Manohar Bheemana (JIRA)" <ji...@apache.org> on 2015/07/01 16:09:04 UTC

[jira] [Created] (MAPREDUCE-6423) MapOutput Sampler

Ram Manohar Bheemana created MAPREDUCE-6423:
-----------------------------------------------

             Summary: MapOutput Sampler
                 Key: MAPREDUCE-6423
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6423
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
            Reporter: Ram Manohar Bheemana
            Priority: Minor


We need a sampler based on the map output keys. The current InputSampler implementation has a major drawback: it assumes that a mapper's input keys and output keys are the same, which generally isn't the case.

Approach:
1. Create a sampler that samples the input data.
2. Run a small MapReduce job in uber task mode, using the original job's mapper and an identity reducer, to generate the required map output sample keys.
3. Optionally, allow specifying which input files to sample. For example, given input files A and B, we should be able to specify that only file A is used for sampling.
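
To illustrate the idea, here is a minimal plain-Java sketch with no Hadoop dependencies. It is not the proposed implementation: the class name MapOutputSamplerSketch and its methods (sampleInput, mapOutputKeys, splitPoints) are hypothetical, the mapper is modeled as a simple function from one input record to one output key, and interval sampling is assumed for step 1. The in-memory sort stands in for the small job with an identity reducer in step 2. Step 3 corresponds to passing in only the records of the chosen file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch: sample map OUTPUT keys rather than input keys. */
final class MapOutputSamplerSketch {

    /** Step 1: take every interval-th input record (simple interval sampling). */
    public static <T> List<T> sampleInput(List<T> input, int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        List<T> sample = new ArrayList<>();
        for (int i = 0; i < input.size(); i += interval) {
            sample.add(input.get(i));
        }
        return sample;
    }

    /** Step 2: apply the job's mapper to the sample and return the sorted output keys. */
    public static <I, K extends Comparable<K>> List<K> mapOutputKeys(
            List<I> sample, Function<I, K> mapper) {
        List<K> keys = new ArrayList<>();
        for (I record : sample) {
            keys.add(mapper.apply(record));
        }
        Collections.sort(keys);
        return keys;
    }

    /** Pick numPartitions - 1 evenly spaced split points from the sorted keys. */
    public static <K> List<K> splitPoints(List<K> sortedKeys, int numPartitions) {
        List<K> splits = new ArrayList<>();
        if (sortedKeys.isEmpty()) {
            return splits; // nothing sampled, so no split points can be chosen
        }
        for (int i = 1; i < numPartitions; i++) {
            splits.add(sortedKeys.get(i * sortedKeys.size() / numPartitions));
        }
        return splits;
    }
}
```

In this sketch the split points are computed from the mapper's output keys, so they remain meaningful even when the mapper changes the key type (for example, from a text line to an integer field).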



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)