Posted to mapreduce-issues@hadoop.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2009/11/05 03:42:32 UTC

[jira] Commented: (MAPREDUCE-1183) Serializable job components: Mapper, Reducer, InputFormat, OutputFormat et al

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773756#action_12773756 ] 

Arun C Murthy commented on MAPREDUCE-1183:
------------------------------------------

The current Configuration-based system has issues in a couple of use-cases:
# The primary drawback: difficulty in implementing a Composite{Input|Output}Format.
Pig is in the middle of a re-write of its Load/Store interfaces (http://wiki.apache.org/pig/LoadStoreRedesignProposal), where it wants to be able to take an arbitrary InputFormat or OutputFormat and wrap it for use within Pig. Similarly, a 'CompositeInputFormat' that works with multiple InputFormats (say, a map-side merge between data in multiple SequenceFiles and TFiles) pushes each {Input|Output}Format into dealing with, and managing, multiple copies of Configuration. This is necessary because using a single Configuration results in the same configuration key being overwritten by multiple instances of {Input|Output}Format (say, mapred.input.dir overwritten by both SequenceFileInputFormat and TFileInputFormat); see the sketch after this list.
# Annoyance: an application that needs only a very small amount of state in the Mapper/Reducer (say, a small map of metadata) is forced to use the DistributedCache; it is much more natural to store that state in the Mapper/Reducer itself and have it serialized from the client to the compute nodes.
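
To make the first problem concrete, here is a minimal sketch of the key collision, assuming a hypothetical composite wrapper around two concrete InputFormats; the paths are made up for illustration:
{noformat}
// Both wrapped formats publish their input paths under the same
// Configuration key, so the second set() silently clobbers the first.
Configuration conf = new Configuration();

// The SequenceFile half of the map-side merge sets its input paths...
conf.set("mapred.input.dir", "/data/seqfiles");

// ...and the TFile half overwrites the very same key.
conf.set("mapred.input.dir", "/data/tfiles");

// A composite wrapper is thus pushed into cloning and managing a separate
// Configuration per wrapped format just to keep their settings apart.
{noformat}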

The proposal, therefore, is to move to a model where the actual Mapper/Reducer/InputFormat/OutputFormat object is serialized by the framework. This eliminates the need to store the requisite information in Configuration; instead, the object itself keeps the necessary state, e.g. FileInputFormat would have a member holding the list of input paths to be processed.
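
For illustration, a rough sketch of what such a state-carrying FileInputFormat might look like under this model; the java.io.Serializable base and the constructor shape are assumptions for the sketch, not committed API:
{noformat}
// Sketch only: the exact serialization mechanism (java.io.Serializable here)
// is an open design choice, not part of this proposal.
public abstract class FileInputFormat<K, V>
    extends InputFormat<K, V> implements java.io.Serializable {

  // Instance state, serialized from the client to the compute nodes;
  // no shared Configuration key such as mapred.input.dir is involved.
  private final List<String> inputPaths = new ArrayList<String>();

  public FileInputFormat(String path) {
    inputPaths.add(path);
  }

  public void addInputPath(String path) {
    inputPaths.add(path);
  }

  // getSplits()/createRecordReader() would read inputPaths directly.
}
{noformat}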

The new API would look like:
{noformat}
Job job = new Job();
job.setMapper(new WordCountMapper());
job.setReducer(new WordCountReducer());
InputFormat in = new TextInputFormat("in");
in.addInputPath("in2");
OutputFormat out = new TextOutputFormat("out");
job.setInputFormat(in);
job.setOutputFormat(out);
job.waitForCompletion();
{noformat}
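
The second use-case falls out of the same model: small per-job state can live directly in the Mapper and ride along with it, instead of going through DistributedCache. A sketch, with a hypothetical StopWordMapper whose name, constructor, and Serializable mechanism are all illustrative assumptions:
{noformat}
public class StopWordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
    implements java.io.Serializable {

  // Small metadata kept in the Mapper itself and serialized from the
  // client to the compute nodes, with no DistributedCache round-trip.
  private final Set<String> stopWords = new HashSet<String>();

  public StopWordMapper(Set<String> stopWords) {
    this.stopWords.addAll(stopWords);
  }

  // map() would consult stopWords directly.
}

job.setMapper(new StopWordMapper(stopWords));
{noformat}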

----

Thoughts?

> Serializable job components: Mapper, Reducer, InputFormat, OutputFormat et al
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1183
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1183
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 0.21.0
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>
> Currently the Map-Reduce framework uses Configuration to pass information about the various aspects of a job, such as the Mapper, Reducer, InputFormat, OutputFormat, OutputCommitter etc., and application developers use the org.apache.hadoop.mapreduce.Job.set*Class APIs to set them at job-submission time:
> {noformat}
> Job.setMapperClass(IdentityMapper.class);
> Job.setReducerClass(IdentityReducer.class);
> Job.setInputFormatClass(TextInputFormat.class);
> Job.setOutputFormatClass(TextOutputFormat.class);
> ...
> {noformat}
> The proposal is that we move to a model where end-users interact with org.apache.hadoop.mapreduce.Job via actual objects, which are then serialized by the framework:
> {noformat}
> Job.setMapper(new IdentityMapper());
> Job.setReducer(new IdentityReducer());
> Job.setInputFormat(new TextInputFormat("in"));
> Job.setOutputFormat(new TextOutputFormat("out"));
> ...
> {noformat}
