Posted to common-dev@hadoop.apache.org by Owen O'Malley <ow...@yahoo-inc.com> on 2006/02/10 18:06:15 UTC

Adding MapperBase and ReduceBase

Looking over the examples with Michel's addition of close, I'd like to 
suggest creating abstract classes MapperBase and ReducerBase that 
implement Mapper and Reducer interfaces respectively and have empty 
configure and close methods.

By providing the default methods, developers will only have to 
implement map or reduce unless they need the additional functionality.
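
As a rough illustration of the proposal, the sketch below uses a simplified
stand-in for the Mapper interface. The method signatures, the Map-based
configuration, and UpperCaseMapper are assumptions made for this example
only, not the actual Hadoop API. A ReducerBase would follow the same shape.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the Mapper interface (signatures are assumptions).
interface Mapper {
  void configure(Map<String, String> job);
  void map(String key, String value, List<String> output);
  void close();
}

// Proposed convenience base class: configure() and close() do nothing.
abstract class MapperBase implements Mapper {
  public void configure(Map<String, String> job) {}
  public void close() {}
}

// Hypothetical mapper: only map() has to be written.
class UpperCaseMapper extends MapperBase {
  public void map(String key, String value, List<String> output) {
    output.add(value.toUpperCase());
  }
}

class MapperBaseDemo {
  public static void main(String[] args) {
    Mapper m = new UpperCaseMapper();
    List<String> out = new ArrayList<>();
    m.configure(new HashMap<>());  // inherited no-op
    m.map("k", "hello", out);
    m.close();                     // inherited no-op
    System.out.println(out);       // prints [HELLO]
  }
}
```

A mapper that does need setup or cleanup would simply override configure()
or close().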

Thoughts?

-- Owen


Re: Adding MapperBase and ReduceBase

Posted by Doug Cutting <cu...@apache.org>.
Owen O'Malley wrote:
> Is the Closeable interface useful?

On further thought, I think a Closeable interface is very useful.  The 
need for a close() method comes up in many interfaces.  For example, 
this just came up in Nutch:

http://issues.apache.org/jira/browse/NUTCH-211

I also just noticed that a Closeable interface has been added to Java 
1.5.  So let's move Closeable to org.apache.hadoop.io, with a note that 
it should be replaced by java.io.Closeable when we move to Java 1.5, okay?
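
One way such an interim interface might look is sketched below. It assumes
that close() declares IOException, as java.io.Closeable does, so that the
later swap is mechanical; the thread does not specify this. TrackedResource
and closeQuietly are hypothetical names used only for the demonstration.

```java
import java.io.IOException;

// Interim interface, shaped like java.io.Closeable.
// Assumption: close() declares IOException, matching the Java 1.5 interface.
interface Closeable {
  void close() throws IOException;
}

// Hypothetical resource used only to show the interface in use.
class TrackedResource implements Closeable {
  boolean closed = false;

  // An implementation may declare fewer exceptions than the interface.
  public void close() {
    closed = true;
  }
}

class CloseableDemo {
  // Closes anything Closeable and reports whether close() succeeded.
  static boolean closeQuietly(Closeable c) {
    try {
      c.close();
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    TrackedResource r = new TrackedResource();
    System.out.println(closeQuietly(r) && r.closed); // prints true
  }
}
```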

Doug

Re: Adding MapperBase and ReduceBase

Posted by Doug Cutting <cu...@apache.org>.
Owen O'Malley wrote:
> Is the Closeable interface useful? How about a little renaming and 
> simplifying to do:
> 
> public interface UserTask  extends Configurable {
>    void close();
> }
> 
> public class UserTaskBase extends Configured implements UserTask {
>    ... default methods ...
> }
> 
> public interface Mapper extends UserTask {
>   void map(...);
> }
> 
> public interface Reducer extends UserTask {
>   void reduce(...);
> }
> 
> public class WordCount extends UserTaskBase implements Mapper, Reducer {
>   public void map(...)
>   public void reduce(...)
> 
>   public static void main(...)
> }

I like this.

> When looking through the code, the auto configuration in 
> JobConf.newInstance is pretty confusing. Reading through the code, it 
> looks like the Reducer objects are configured twice.

What's confusing?  The method in SVN looks simple to me.  HADOOP-29 
would make this more complicated, but I'm not convinced that is 
required.  Can you elaborate?

>> It would be nice to even remove the need for the calls to setMapper() 
>> and setReducer() above, i.e., to have JobConf default the mapper, 
>> reducer, etc. to things that are implemented by the class passed to 
>> its constructor.
> 
> Which constructor is doing this? The JobConfigured?

No, the JobConf.  One could construct a JobConf, as before, with 
JobConf(Configuration,UserTask), but, in addition to using the UserTask 
to determine the default jar, it could also use the UserTask to 
determine the default mapper, reducer, etc.

So a user application could be written as simply as:

public class MyApp extends UserTaskBase implements Mapper, Reducer {
   public void map(...) { ... }
   public void reduce(...) { ... }

   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     JobConf job = new JobConf(conf, MyApp.class);
     job.setInputDir(args[0]);
     job.setOutputDir(args[1]);
     JobClient.run(job);
   }
}
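
The defaulting described above could be done with a reflection check on the
class passed to the constructor. The sketch below is only an illustration:
MiniJobConf, its accessors, and the empty Mapper and Reducer interfaces are
hypothetical stand-ins, not the real JobConf API.

```java
// Empty stand-ins; the real interfaces carry map(), reduce(), and so on.
interface Mapper {}
interface Reducer {}

// Hypothetical miniature of JobConf showing only the defaulting logic.
class MiniJobConf {
  private Class<?> mapperClass;
  private Class<?> reducerClass;

  MiniJobConf(Class<?> appClass) {
    // Default each role to the application class if it implements that role.
    if (Mapper.class.isAssignableFrom(appClass)) {
      mapperClass = appClass;
    }
    if (Reducer.class.isAssignableFrom(appClass)) {
      reducerClass = appClass;
    }
  }

  Class<?> getMapperClass() { return mapperClass; }
  Class<?> getReducerClass() { return reducerClass; }
}

// Hypothetical applications: one fills both roles, one only maps.
class MyApp implements Mapper, Reducer {}
class MapOnlyApp implements Mapper {}

class DefaultingDemo {
  public static void main(String[] args) {
    MiniJobConf job = new MiniJobConf(MyApp.class);
    System.out.println(job.getMapperClass().getSimpleName());  // prints MyApp
    System.out.println(job.getReducerClass().getSimpleName()); // prints MyApp

    MiniJobConf mapOnly = new MiniJobConf(MapOnlyApp.class);
    System.out.println(mapOnly.getReducerClass());             // prints null
  }
}
```

Because the check works on the Class object alone, nothing has to be
instantiated in the driver process.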

Does that make sense?

Doug

Re: Adding MapperBase and ReduceBase

Posted by Owen O'Malley <ow...@yahoo-inc.com>.
On Feb 10, 2006, at 10:28 AM, Doug Cutting wrote:

> Owen O'Malley wrote:
>> Looking over the examples with Michel's addition of close, I'd like 
>> to suggest creating abstract classes MapperBase and ReducerBase that 
>> implement Mapper and Reducer interfaces respectively and have empty 
>> configure and close methods.
>> By providing the default methods, developers will only have to 
>> implement map or reduce unless they need the additional 
>> functionality.
>> Thoughts?
>
> There are cases where I've used a single class to implement both map() 
> and reduce().  For these a base class that implements Closeable and 
> JobConfigurable would be better than a MapperBase and ReducerBase.  It 
> could also extend Configured, implementing Configurable.  We might 
> call it JobConfigured:

Is the Closeable interface useful? How about a little renaming and 
simplifying to do:

public interface UserTask  extends Configurable {
    void close();
}

public class UserTaskBase extends Configured implements UserTask {
    ... default methods ...
}

public interface Mapper extends UserTask {
   void map(...);
}

public interface Reducer extends UserTask {
   void reduce(...);
}

public class WordCount extends UserTaskBase implements Mapper, Reducer {
   public void map(...)
   public void reduce(...)

   public static void main(...)
}

When looking through the code, the auto configuration in 
JobConf.newInstance is pretty confusing. Reading through the code, it 
looks like the Reducer objects are configured twice.

> It would be nice to even remove the need for the calls to setMapper() 
> and setReducer() above, i.e., to have JobConf default the mapper, 
> reducer, etc. to things that are implemented by the class passed to 
> its constructor.

Which constructor is doing this? The JobConfigured? I'd be worried 
about the different contexts that the JobConfs are created in. In 
particular, the only place they could meaningfully be set is the 
JobConf in the driver process, which doesn't have any Mapper or Reducer 
objects instantiated.

-- Owen


Re: Adding MapperBase and ReduceBase

Posted by Doug Cutting <cu...@apache.org>.
Owen O'Malley wrote:
> Looking over the examples with Michel's addition of close, I'd like to 
> suggest creating abstract classes MapperBase and ReducerBase that 
> implement Mapper and Reducer interfaces respectively and have empty 
> configure and close methods.
> 
> By providing the default methods, developers will only have to implement 
> map or reduce unless they need the additional functionality.
> 
> Thoughts?

There are cases where I've used a single class to implement both map() 
and reduce().  For these a base class that implements Closeable and 
JobConfigurable would be better than a MapperBase and ReducerBase.  It 
could also extend Configured, implementing Configurable.  We might call 
it JobConfigured:

public abstract class JobConfigured
   extends Configured
   implements Closeable, JobConfigurable {

   public JobConfigured() { super(null); }

   public JobConfigured(Configuration conf) { super(conf); }

   public void configure(JobConf conf) { setConf(conf); }

   public void close() {}
}

Then one can define mapred applications as simply as:

public class MyMapredApplication extends JobConfigured {
   public void map(...) { ... }
   public void reduce(...) { ... }

   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     JobConf job = new JobConf(conf, MyMapredApplication.class);
     job.setMapper(MyMapredApplication.class);
     job.setReducer(MyMapredApplication.class);
     job.setInputDir(args[0]);
     job.setOutputDir(args[1]);
     JobClient.run(job);
   }
}

It would be nice to even remove the need for the calls to setMapper() 
and setReducer() above, i.e., to have JobConf default the mapper, 
reducer, etc. to things that are implemented by the class passed to its 
constructor.

Doug