Posted to mapreduce-dev@hadoop.apache.org by Nicolae Marasoiu <nm...@adobe.com> on 2013/06/19 09:53:37 UTC

InputFormat to regroup splits of underlying InputFormat to control number of map tasks

Hi,

When running map-reduce jobs with many splits, it would be nice from a performance perspective to have fewer splits while maintaining data locality, so that the overhead of starting a map task (JVM creation, map executor ramp-up, e.g. a Spring context, etc.) is less impactful when frequently running map-reduce jobs with little data and processing.

I created such an AggregatingInputFormat: it simply groups input splits that share a location into composite splits, and creates a record reader that iterates over the record readers created by the underlying InputFormat for the underlying raw splits.
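To make the grouping step concrete, here is a minimal plain-Java sketch. Hadoop's InputSplit type is replaced by plain collections, and every name below is illustrative, not taken from the actual patch: raw splits are bucketed by their preferred host, so each bucket becomes one locality-preserving composite split.

```java
import java.util.*;

// Sketch of the grouping behind such an AggregatingInputFormat:
// raw splits are bucketed by their (single) preferred host, so each
// composite split keeps data locality. Plain collections stand in
// for Hadoop's InputSplit; all names here are hypothetical.
public class GroupSplits {
    // splitToHost maps a raw split's name to its preferred host.
    static Map<String, List<String>> groupByHost(Map<String, String> splitToHost) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> e : splitToHost.entrySet()) {
            // Every split landing in the same bucket becomes part of one
            // composite split, to be scheduled on that host.
            grouped.computeIfAbsent(e.getValue(), h -> new ArrayList<>()).add(e.getKey());
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, String> splits = new LinkedHashMap<>();
        splits.put("split-0", "hostA");
        splits.put("split-1", "hostB");
        splits.put("split-2", "hostA");
        // Three raw splits collapse into two composite, host-local splits.
        System.out.println(groupByHost(splits));
    }
}
```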

Currently we intend to use it for HBase sharding, but I would also like to implement an optimal algorithm that ensures both fair distribution and locality, which I can describe if you find it useful to apply to multi-location inputs such as replicated Kafka or HDFS.

Thanks,
looking forward to your feedback,
Nicu Marasoiu
Adobe

Re: InputFormat to regroup splits of underlying InputFormat to control number of map tasks

Posted by Nicolae Marasoiu <nm...@adobe.com>.
Hi,

Our intention is to solve this in a generic context, not just for file input.
Thus the split class should be generic (very similar to CompositeInputSplit from mapred).

We also already implement getRecordReader by iterating over the record readers created by the decorated input format (this method is not implemented in MultiFileInputFormat).
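For illustration, that getRecordReader strategy can be sketched with plain iterators standing in for Hadoop record readers. Names and structure here are assumptions for the sketch, not the patch itself:

```java
import java.util.*;

// Sketch of reading a composite split: the reader walks through the
// readers of the underlying raw splits, lazily switching to the next
// one once the current is exhausted. Iterator stands in for a Hadoop
// RecordReader; all names are illustrative.
public class ChainedReader<T> implements Iterator<T> {
    private final Iterator<Iterator<T>> parts; // one underlying reader per raw split
    private Iterator<T> current = Collections.emptyIterator();

    public ChainedReader(List<Iterator<T>> underlyingReaders) {
        this.parts = underlyingReaders.iterator();
    }

    @Override
    public boolean hasNext() {
        // Skip over exhausted (or empty) underlying readers.
        while (!current.hasNext() && parts.hasNext()) {
            current = parts.next();
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current.next();
    }
}
```

In the real format each underlying reader would additionally be opened lazily and closed when exhausted, since record readers hold file or network resources.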

Regarding the allocation of such a composite split to a location in a generic context (not just file input), a better job can be done at finding a sweet spot between data locality and fair workload distribution than the algorithm used in MultiFileInputFormat does.

On the other hand, I will dive into FileInputFormat.getSplitHosts to evaluate how rack locality could feed into a generic, close-to-optimal allocation algorithm.

Please check https://issues.apache.org/jira/browse/MAPREDUCE-5287 for Java code, already tested, which works in a general context but is optimal only for single-location splits, such as the HBase case.

A configuration option is passed to the map-reduce job to specify the desired number of map tasks for the job.
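For illustration only (the actual property name is in the patch and is not reproduced here), capping the number of composite splits at such a configured target could look like this round-robin sketch:

```java
import java.util.*;

// Sketch of honoring a configured target number of map tasks: splits are
// packed round-robin into at most maxGroups composite splits.
// Hypothetical helper, not the code from the patch.
public class TargetSplits {
    static <T> List<List<T>> capGroups(List<T> splits, int maxGroups) {
        List<List<T>> out = new ArrayList<>();
        if (splits.isEmpty()) return out;
        // Never create more groups than splits, and always at least one.
        int n = Math.max(1, Math.min(maxGroups, splits.size()));
        for (int i = 0; i < n; i++) out.add(new ArrayList<>());
        // Round-robin keeps the composite splits close in size.
        for (int i = 0; i < splits.size(); i++) out.get(i % n).add(splits.get(i));
        return out;
    }
}
```

Presumably the cap interacts with the per-host grouping so that merging does not undo locality; the tested code on the JIRA is authoritative.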

Thanks,
Nicu


Re: InputFormat to regroup splits of underlying InputFormat to control number of map tasks

Posted by Robert Evans <ev...@yahoo-inc.com>.
This sounds similar to MultiFileInputFormat

http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MultiFileInputFormat.java?revision=1239482&view=markup

It would be nice if you could take a look at it and see if there is
something we can do here to improve it/combine the two.

--Bobby
