Posted to mapreduce-user@hadoop.apache.org by Yang <te...@gmail.com> on 2012/01/17 22:01:49 UTC

controlling where a map task runs?

I understand that map tasks are normally scheduled close to their input files.

But in my application, the input file is a text file with many lines of
query params. The mapper reads each line and uses the params in it to
query a local db file (for example sqlite3), so the query itself takes a
lot of time, while the input query params themselves are very small. In
this case the time to fetch the input file is negligible. The db file
already sits on every box in the cluster, so no time is spent copying
the db.
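
For reference, the mapper is roughly of this shape (just a sketch, not
my real code; the sqlite-jdbc driver, connection string, table name,
and db path are all placeholders):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input value is one line of query params; the expensive work is
// the lookup against a sqlite db file already present on every node.
public class QueryMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Connection db;

    @Override
    public void configure(JobConf job) {
        try {
            // assumes the sqlite-jdbc driver is on the classpath;
            // the path is a placeholder for the local db file
            db = DriverManager.getConnection("jdbc:sqlite:/data/local.db");
        } catch (SQLException e) {
            throw new RuntimeException("cannot open local db", e);
        }
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        try {
            PreparedStatement ps =
                db.prepareStatement("SELECT result FROM lookup WHERE param = ?");
            ps.setString(1, line.toString().trim());
            ResultSet rs = ps.executeQuery();
            if (rs.next()) {
                out.collect(line, new Text(rs.getString(1)));
            }
            rs.close();
            ps.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        try {
            if (db != null) db.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}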


The problem is, when I have an idle cluster (100 nodes) and a job with
only 4 mappers, Hadoop schedules all 4 mappers on the same node,
likely close to where the input data is. But since the run time here is
mostly determined by CPU and disk seeks,
I would like to spread them out as much as possible.

Given that the input data is present on only 1 node, how is it possible
to spread out my mappers?

Thanks
Yang

Re: controlling where a map task runs?

Posted by Rohit Kelkar <ro...@gmail.com>.
You could try NLineInputFormat. With N = 1, the number of mappers
equals the number of lines in your file. If the number of mappers
required is greater than the max number of mappers that can run on a
node, then I think the remaining mappers get scheduled on the other
nodes in the cluster without obeying data locality. Of course N = 1 is
an extreme case; you could tune the value of N based on the number of
lines in your input file.
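
Something like this in the driver should do it (untested sketch using
the old mapred API; QueryMapper and the paths are placeholders for your
own classes and files):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class QueryDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(QueryDriver.class);
        conf.setJobName("spread-queries");
        conf.setMapperClass(QueryMapper.class);   // your mapper class
        conf.setNumReduceTasks(0);                // map-only job
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        conf.setInputFormat(NLineInputFormat.class);
        // N = 1: one split (hence one map task) per input line
        conf.setInt("mapred.line.input.format.linespermap", 1);

        FileInputFormat.setInputPaths(conf, new Path("/input/queries.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("/output/results"));
        JobClient.runJob(conf);
    }
}

With your 4 input lines you would get 4 separate map tasks; once the
map slots on the local node fill up, the rest have to run elsewhere.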

- Rohit Kelkar
