Posted to user@hadoop.apache.org by Sergey Sova <se...@gmail.com> on 2018/08/10 09:03:24 UTC

Overcoming mapper heavy init problem

Hello.

There's a known problem when the mapper's #setup phase is heavy (e.g. loading
files from HDFS) and the #map operations are fast, so a task spends about
5 minutes in #setup and 30 seconds in #map. I have an HBase MR job with
1000 regions => 1000 mappers, and I see that it spends most of its time in
the setup phase.

The obvious solution is to initialize the data once and reuse it until the
end of the job, but I'm not sure how to implement that within the current
framework's restrictions.

Is it theoretically possible to assign several local input splits to the
same mapper (e.g. return a bigger multi-region split from custom
TableInputFormat)? Or maybe there are other best practices for this
problem? I'm asking here because I feel that there could be hidden
problems I'm not aware of, and it would be better to find or avoid them at
the beginning.
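
To make the multi-region split idea concrete, here is a minimal sketch of only the grouping step, written in plain Java with no Hadoop or HBase classes. Everything in it is an assumption for illustration: RegionSplit, CombinedSplit, groupByHost and the maxRegionsPerSplit cap are hypothetical names, not part of any existing API. It groups per-region splits by the host that serves them and cuts each group into chunks, so that one mapper (and one #setup) would cover several local regions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the grouping logic only. No Hadoop/HBase types are used,
// and every class and method name here is hypothetical.
class SplitGrouping {

    /** Stand-in for one per-region split: the region name and the host serving it. */
    static final class RegionSplit {
        final String region;
        final String host;

        RegionSplit(String region, String host) {
            this.region = region;
            this.host = host;
        }
    }

    /** Stand-in for a multi-region split: several regions on one host for one mapper. */
    static final class CombinedSplit {
        final String host;
        final List<String> regions;

        CombinedSplit(String host, List<String> regions) {
            this.host = host;
            this.regions = regions;
        }
    }

    /**
     * Groups splits by host, then cuts each host's group into chunks of at most
     * maxRegionsPerSplit regions. Each chunk would be handled by a single mapper.
     */
    static List<CombinedSplit> groupByHost(List<RegionSplit> splits, int maxRegionsPerSplit) {
        if (maxRegionsPerSplit < 1) {
            throw new IllegalArgumentException("maxRegionsPerSplit must be >= 1");
        }
        // LinkedHashMap keeps hosts in first-seen order, so the result is deterministic.
        Map<String, List<String>> regionsByHost = new LinkedHashMap<>();
        for (RegionSplit split : splits) {
            regionsByHost.computeIfAbsent(split.host, h -> new ArrayList<>()).add(split.region);
        }

        List<CombinedSplit> combined = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : regionsByHost.entrySet()) {
            List<String> regions = entry.getValue();
            for (int start = 0; start < regions.size(); start += maxRegionsPerSplit) {
                int end = Math.min(start + maxRegionsPerSplit, regions.size());
                // Copy the sublist so each combined split owns its own list.
                combined.add(new CombinedSplit(entry.getKey(),
                        new ArrayList<>(regions.subList(start, end))));
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        List<RegionSplit> splits = new ArrayList<>();
        splits.add(new RegionSplit("r1", "hostA"));
        splits.add(new RegionSplit("r2", "hostB"));
        splits.add(new RegionSplit("r3", "hostA"));
        splits.add(new RegionSplit("r4", "hostA"));

        for (CombinedSplit c : groupByHost(splits, 2)) {
            System.out.println(c.host + " -> " + c.regions);
        }
    }
}
```

In a real job this grouping would presumably live in a custom input format's split calculation, with a record reader that walks the regions of a combined split one after another; that part is not shown here. The cap matters because fewer, larger splits trade parallelism for less repeated setup work, so each mapper runs longer and a failed task costs more to retry.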

Thanks.
Sergey