Posted to dev@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2014/07/10 17:46:13 UTC

Dealing with monstrous hive startup overhead

Everyone is running around saying "hive is slow" and "x is faster". I think
hive's biggest issue is that the entire mr2 process of acquiring containers
and then launching a job in them is serious overkill. I see it result in 40
seconds of startup time for what amounts to a 2 second job. In the old hadoop
0.20.2 days these queries were much faster. Honestly, I know everyone is of
the opinion that tez/spark is some magical answer, but how about we make a
yarn service that just keeps N nodes open and ready for action? That would
cut out the whole "ask the manager for nodes" step for each job.

Re: Dealing with monstrous hive startup overhead

Posted by Vikram Dixit <vi...@hortonworks.com>.
Hi Ed,

I agree with you that one of the big overheads in hive is the startup time
of a job while it acquires containers and launches AMs and tasks. I just
wanted to draw your attention to something that is in hive right now and
addresses some of this. When using hive server 2 in tez mode, we can
configure a set of AMs to be pre-launched, each holding onto a few
(configurable) containers. These AMs wait for a period of time for queries
to arrive and time out if they stay idle for the entire period. Containers
are held in the same way when using the hive cli; the only difference is
with respect to the AM: there is only one AM for the CLI session. Please let
me know if you would like more details on this feature.

Thanks
Vikram.

PS: I know this is available via tez but it does exactly what you outlined
(and some more) so just thought of adding this info.
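
For readers who want to try the pre-launched AM setup described above, here is
a minimal configuration sketch. It assumes a HiveServer2 deployment with Tez
already installed. The property names are ones that exist in Hive 0.13 and
Tez; the values (queue name, session count, container count, timeout) are
illustrative assumptions, not recommendations from this thread.

```xml
<!-- hive-site.xml fragment: a sketch only, all values are placeholders -->
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
<property>
  <!-- start the pool of Tez sessions (AMs) when HiveServer2 starts -->
  <name>hive.server2.tez.initialize.default.sessions</name>
  <value>true</value>
</property>
<property>
  <!-- YARN queue(s) in which the pre-launched AMs run; "default" is assumed -->
  <name>hive.server2.tez.default.queues</name>
  <value>default</value>
</property>
<property>
  <!-- number of pre-launched AMs per queue; 2 is an arbitrary example -->
  <name>hive.server2.tez.sessions.per.default.queue</name>
  <value>2</value>
</property>
<property>
  <!-- ask each session to hold some containers before queries arrive -->
  <name>hive.prewarm.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- how many containers to prewarm; 4 is an arbitrary example -->
  <name>hive.prewarm.numcontainers</name>
  <value>4</value>
</property>

<!-- tez-site.xml fragment: how long an idle session AM waits for the next
     query before shutting down; 300 seconds is an arbitrary example -->
<property>
  <name>tez.session.am.dag.submit.timeout.secs</name>
  <value>300</value>
</property>
```

Larger session and container counts trade idle cluster capacity for lower
query startup latency, so the right values depend on the workload.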

