Posted to user@hive.apache.org by Igor Tatarinov <ig...@decide.com> on 2011/03/09 20:50:38 UTC

optimizing Hive/Hadoop for latency

I understand that Hive and Hadoop are meant to run many jobs at once. As a
result, most tuning parameters are aimed at increasing the throughput of a
Hadoop cluster rather than reducing latency. In our case, we use Elastic
MapReduce to run a single Hive script on a daily basis, so our top priority is
to make that script run faster. So far, it's been a pretty frustrating
experience. I am curious whether there are workarounds for the things that are
not easy to tune:

1) Hadoop lets you configure mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum individually, but there is no way to
limit the total of the two. Hive mappers always seem to finish before the
reducers, and I wish I could run one more reducer once no mappers are running.
That doesn't seem to be possible.
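
For reference, the two limits in question are per-TaskTracker settings in
mapred-site.xml (on EMR they are typically applied through a bootstrap
action); a sketch with illustrative values:

    <!-- mapred-site.xml: per-TaskTracker slot limits (values are illustrative) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
    <!-- There is no third property that caps map + reduce slots combined. -->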

2) Similarly, there is only one parameter to control memory allocation:
mapred.child.java.opts. So if my box is configured for 4 mappers and 2
reducers, I have to set that parameter to less than 1/6 of the total memory
available. The problem is that once the mappers are done, 4/6ths (two thirds)
of that memory sits essentially unused. Is there something I can do about
that?
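
The arithmetic I mean, as a sketch (the heap value is illustrative and assumes
roughly 7 GB of RAM usable for tasks with 4 map + 2 reduce slots, i.e. about
1/6 of that per task):

    <!-- mapred-site.xml: one heap size shared by every map and reduce task.
         Example: ~7 GB usable / (4 map + 2 reduce slots) ~= 1.1 GB per task. -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1100m</value>
    </property>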

3) Another odd thing is not being able to run a single wave of reducers
easily. As I understand it, that's the optimal scenario in most cases. To make
this work, I have to know the total number of reducer slots in the cluster and
then set mapred.reduce.tasks accordingly. EMR seems to have a solution for
this (mapred.reduce.tasksperslot) but it doesn't seem to work.
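
What sizing a single wave looks like, as a sketch (node and slot counts are
illustrative):

    <!-- Single reduce wave: mapred.reduce.tasks = task nodes x reduce slots per node.
         Illustrative example: 10 nodes x 2 reduce slots = 20 reducers.
         The per-script equivalent in Hive would be: set mapred.reduce.tasks=20; -->
    <property>
      <name>mapred.reduce.tasks</name>
      <value>20</value>
    </property>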

 Any suggestions would be greatly appreciated!

Thank you,
igor

Re: optimizing Hive/Hadoop for latency

Posted by Andrew Hitchcock <ad...@gmail.com>.
Hi,

Quick note on #3. In order to make mapred.reduce.tasksperslot work,
you need to completely remove all mentions of mapred.reduce.tasks from
your configuration (including removing it from the default config
file). Tasksperslot only takes effect as a last resort.
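
Concretely, that means something along these lines (a sketch; the value here
is only an example) and no mapred.reduce.tasks entry anywhere, whether in
mapred-site.xml, hive-site.xml, the default config, or a "set" statement in
the Hive script:

    <!-- EMR-specific setting: reduce tasks per reduce slot. It is only
         consulted when mapred.reduce.tasks is not set anywhere. -->
    <property>
      <name>mapred.reduce.tasksperslot</name>
      <value>1</value>
    </property>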

Andrew
