Posted to common-dev@hadoop.apache.org by Owen O'Malley <ow...@yahoo-inc.com> on 2006/02/10 18:06:15 UTC
Adding MapperBase and ReduceBase
Looking over the examples with Michel's addition of close, I'd like to
suggest creating abstract classes MapperBase and ReducerBase that
implement Mapper and Reducer interfaces respectively and have empty
configure and close methods.
By providing the default methods, developers will only have to
implement map or reduce unless they need the additional functionality.
Thoughts?
-- Owen
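A minimal sketch of the proposal above, for illustration only. The Mapper interface here is a simplified stand-in with hypothetical signatures (a String job name, a List as output collector); it is not Hadoop's actual API. The point is only that a base class with empty configure() and close() leaves map() as the single method a user has to write.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the Mapper interface discussed in the thread.
// The parameter types are assumptions made for this sketch.
interface Mapper {
    void configure(String jobName);
    void map(String key, String value, List<String> output);
    void close();
}

// The proposed base class: configure() and close() do nothing by default,
// and map() stays abstract.
abstract class MapperBase implements Mapper {
    public void configure(String jobName) {}
    public void close() {}
}

// A user mapper now only supplies map().
class UpperCaseMapper extends MapperBase {
    public void map(String key, String value, List<String> output) {
        output.add(key + "=" + value.toUpperCase());
    }
}

public class MapperBaseDemo {
    // Drives one mapper through configure, map, close, as a framework might.
    static List<String> runOnce(Mapper m, String key, String value) {
        List<String> out = new ArrayList<String>();
        m.configure("demo");
        try {
            m.map(key, value, out);
        } finally {
            m.close();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runOnce(new UpperCaseMapper(), "k", "hello"));
    }
}
```

A ReducerBase would follow the same pattern with reduce() left abstract.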
Re: Adding MapperBase and ReduceBase
Posted by Doug Cutting <cu...@apache.org>.
Owen O'Malley wrote:
> Is the Closeable interface useful?
On further thought, I think a Closeable interface is very useful. The
need for a close() method comes up in many interfaces. For example,
this just came up in Nutch:
http://issues.apache.org/jira/browse/NUTCH-211
I also just noticed that a Closeable interface has been added to Java
1.5. So let's move Closeable to org.apache.hadoop.io, with a note that
it should be replaced by java.io.Closeable when we move to Java 1.5, okay?
Doug
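As a rough illustration of why a shared Closeable interface is convenient, the sketch below uses the Java 1.5 java.io.Closeable mentioned above. CountingTask and runAndClose are hypothetical names invented for this example; nothing here is taken from the Hadoop or Nutch sources.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical task-side object; the name and fields are illustrative only.
class CountingTask implements Closeable {
    private int records = 0;
    private boolean closed = false;

    void process(String record) {
        if (closed) {
            throw new IllegalStateException("task already closed");
        }
        records++;
    }

    // java.io.Closeable declares close() as throwing IOException.
    public void close() throws IOException {
        closed = true;
    }

    int getRecords() { return records; }
    boolean isClosed() { return closed; }
}

public class CloseableDemo {
    // close() runs in finally, so cleanup happens even if processing fails.
    static int runAndClose(CountingTask task, String[] records) {
        try {
            for (String r : records) {
                task.process(r);
            }
        } finally {
            try {
                task.close();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
        return task.getRecords();
    }

    public static void main(String[] args) {
        CountingTask task = new CountingTask();
        int n = runAndClose(task, new String[] {"a", "b", "c"});
        System.out.println(n + " records, closed=" + task.isClosed());
    }
}
```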
Re: Adding MapperBase and ReduceBase
Posted by Doug Cutting <cu...@apache.org>.
Owen O'Malley wrote:
> Is the Closeable interface useful? How about a little renaming and
> simplifying to do:
>
> public interface UserTask extends Configurable {
>   void close();
> }
>
> public class UserTaskBase extends Configured implements UserTask {
>   ... default methods ...
> }
>
> public interface Mapper extends UserTask {
>   void map(...);
> }
>
> public interface Reducer extends UserTask {
>   void reduce(...);
> }
>
> public class WordCount extends UserTaskBase implements Mapper, Reducer {
>   public void map(...) { ... }
>   public void reduce(...) { ... }
>
>   public static void main(...) { ... }
> }
I like this.
> When looking through the code, the auto configuration in
> JobConf.newInstance is pretty confusing. Reading through the code, it
> looks like the Reducer objects are configured twice.
What's confusing? The method in SVN looks simple to me. HADOOP-29
would make this more complicated, but I'm not convinced that is
required. Can you elaborate?
>> It would be nice to even remove the need for the calls to setMapper()
>> and setReducer() above, i.e., to have JobConf default the mapper,
>> reducer, etc. to things that are implemented by the class passed to
>> its constructor.
>
> Which constructor is doing this? The JobConfigured?
No, the JobConf. One could construct a JobConf, as before, with
JobConf(Configuration,UserTask), but, in addition to using the UserTask
to determine the default jar, it could also use the UserTask to
determine the default mapper, reducer, etc.
So a user application could be written as simply as:
  public class MyApp extends UserTaskBase implements Mapper, Reducer {
    public void map(...) { ... }
    public void reduce(...) { ... }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      JobConf job = new JobConf(conf, MyApp.class);
      job.setInputDir(args[0]);
      job.setOutputDir(args[1]);
      JobClient.run(job);
    }
  }
Does that make sense?
Doug
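One way the defaulting described above could work, sketched under stated assumptions: JobConfSketch is a hypothetical stand-in for JobConf, and Mapper and Reducer are empty marker interfaces defined only for this example. The sketch inspects the Class object alone, so no Mapper or Reducer instance has to exist in the driver process.

```java
// Empty stand-ins for the real interfaces; assumptions for this sketch only.
interface Mapper {}
interface Reducer {}

// Hypothetical stand-in for JobConf showing only the defaulting logic.
class JobConfSketch {
    private Class<?> mapperClass;   // null means "no default found"
    private Class<?> reducerClass;  // null means "no default found"

    JobConfSketch(Class<?> appClass) {
        // Default each role to the application class if it implements
        // the matching interface.
        if (Mapper.class.isAssignableFrom(appClass)) {
            mapperClass = appClass;
        }
        if (Reducer.class.isAssignableFrom(appClass)) {
            reducerClass = appClass;
        }
    }

    // Explicit setters still override the defaults.
    void setMapper(Class<? extends Mapper> c) { mapperClass = c; }
    void setReducer(Class<? extends Reducer> c) { reducerClass = c; }

    Class<?> getMapperClass() { return mapperClass; }
    Class<?> getReducerClass() { return reducerClass; }
}

// Implements both roles, like the single-class applications in the thread.
class MyApp implements Mapper, Reducer {}

// Implements only the map side.
class MapOnlyApp implements Mapper {}

public class JobConfDefaultsDemo {
    public static void main(String[] args) {
        JobConfSketch both = new JobConfSketch(MyApp.class);
        System.out.println(both.getMapperClass() + " / " + both.getReducerClass());

        JobConfSketch mapOnly = new JobConfSketch(MapOnlyApp.class);
        System.out.println(mapOnly.getMapperClass() + " / " + mapOnly.getReducerClass());
    }
}
```

A class that implements neither interface simply leaves both defaults unset, so the caller would still need setMapper() and setReducer() in that case.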
Re: Adding MapperBase and ReduceBase
Posted by Owen O'Malley <ow...@yahoo-inc.com>.
On Feb 10, 2006, at 10:28 AM, Doug Cutting wrote:
> Owen O'Malley wrote:
>> Looking over the examples with Michel's addition of close, I'd like
>> to suggest creating abstract classes MapperBase and ReducerBase that
>> implement Mapper and Reducer interfaces respectively and have empty
>> configure and close methods.
>> By providing the default methods, developers will only have to
>> implement map or reduce unless they need the additional
>> functionality.
>> Thoughts?
>
> There are cases where I've used a single class to implement both map()
> and reduce(). For these a base class that implements Closeable and
> JobConfigurable would be better than a MapperBase and ReducerBase. It
> could also extend Configured, implementing Configurable. We might
> call it JobConfigured:
Is the Closeable interface useful? How about a little renaming and
simplifying to do:

  public interface UserTask extends Configurable {
    void close();
  }

  public class UserTaskBase extends Configured implements UserTask {
    ... default methods ...
  }

  public interface Mapper extends UserTask {
    void map(...);
  }

  public interface Reducer extends UserTask {
    void reduce(...);
  }

  public class WordCount extends UserTaskBase implements Mapper, Reducer {
    public void map(...) { ... }
    public void reduce(...) { ... }

    public static void main(...) { ... }
  }

When looking through the code, the auto configuration in
JobConf.newInstance is pretty confusing. Reading through the code, it
looks like the Reducer objects are configured twice.
> It would be nice to even remove the need for the calls to setMapper()
> and setReducer() above, i.e., to have JobConf default the mapper,
> reducer, etc. to things that are implemented by the class passed to
> its constructor.
Which constructor is doing this? The JobConfigured? I'd be worried
about the different contexts that the JobConfs are created in. In
particular, the only place they could meaningfully be set is the
JobConf in the driver process, which doesn't have any Mapper or Reducer
objects instantiated.
-- Owen
Re: Adding MapperBase and ReduceBase
Posted by Doug Cutting <cu...@apache.org>.
Owen O'Malley wrote:
> Looking over the examples with Michel's addition of close, I'd like to
> suggest creating abstract classes MapperBase and ReducerBase that
> implement Mapper and Reducer interfaces respectively and have empty
> configure and close methods.
>
> By providing the default methods, developers will only have to implement
> map or reduce unless they need the additional functionality.
>
> Thoughts?
There are cases where I've used a single class to implement both map()
and reduce(). For these a base class that implements Closeable and
JobConfigurable would be better than a MapperBase and ReducerBase. It
could also extend Configured, implementing Configurable. We might call
it JobConfigured:
  public abstract class JobConfigured
    extends Configured
    implements Closeable, JobConfigurable {

    public JobConfigured() { super(null); }
    public JobConfigured(Configuration conf) { super(conf); }

    public void configure(JobConf conf) { setConf(conf); }
    public void close() {}
  }
Then one can define mapred applications as simply as:
  public class MyMapredApplication extends JobConfigured
    implements Mapper, Reducer {

    public void map(...) { ... }
    public void reduce(...) { ... }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      JobConf job = new JobConf(conf, MyMapredApplication.class);
      job.setMapper(MyMapredApplication.class);
      job.setReducer(MyMapredApplication.class);
      job.setInputDir(args[0]);
      job.setOutputDir(args[1]);
      JobClient.run(job);
    }
  }
It would be nice to even remove the need for the calls to setMapper()
and setReducer() above, i.e., to have JobConf default the mapper,
reducer, etc. to things that are implemented by the class passed to its
constructor.
Doug