Posted to dev@hive.apache.org by Lefty Leverenz <le...@gmail.com> on 2015/04/02 09:00:38 UTC

Re: Vectorized group-by on strings is super slow in hive 0.13

Thanks for the tip, Gopal.  I documented hive.limit.pushdown.memory.usage
<https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.limit.pushdown.memory.usage>
in the Configuration Properties wiki but had a couple of questions about the
description (see the comment on HIVE-3562
<https://issues.apache.org/jira/browse/HIVE-3562?focusedCommentId=14392243&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14392243>).

-- Lefty

On Mon, Mar 30, 2015 at 12:42 AM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

> Hi,
>
> >Been experimenting a little with vectorized execution in hive 0.13 and
> >found that group-by is super slow on string columns. This simple query is
> >13x slower when vectorization is enabled (c_customer_id is string). Don't
> >see this problem with int types.
>
> I think the performance issue is due to the row-count triggers for
> flushing the in-memory aggregations.
>
> This shouldn't happen to you in the hive-1.0 branch, but for 0.13 there is
> a fairly easy workaround to the performance issue.
>
> >select c_customer_id from customer group by c_customer_id limit 10;
>
> A very odd query, that one, since it is one of the few patterns that speed
> up with an extra ORDER BY.
>
> select c_customer_id from customer group by c_customer_id order by
> c_customer_id limit 10;
>
> tends to run faster than regular group-by + fetch limit as it shuffles
> less data (10 keys per map task).
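>
> To check that the limit really is being pushed into the shuffle for the
> ORDER BY form, the plan can be inspected (exact operator details vary by
> version, so treat this as a rough check):
>
> explain select c_customer_id from customer group by c_customer_id order by
> c_customer_id limit 10;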
>
> Try the same with
>
> set hive.vectorized.groupby.checkinterval=1024;
> set hive.vectorized.groupby.flush.percent=0.8;
> set hive.limit.pushdown.memory.usage=0.04;
>
> set hive.optimize.reducededuplication.min.reducer=1;
> # above only if you're on MRv2, in Tez the default (4) is the faster option
>
> That combination of operators should be triggering the fastest codepath.
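>
> For reference, the whole workaround in one go (the values are just the
> suggestions above, so tune them to your data; the comments are only
> annotation):
>
> -- check the in-memory aggregation hash more often than the default
> set hive.vectorized.groupby.checkinterval=1024;
> -- flush 80% of the hash entries when the memory threshold is hit
> set hive.vectorized.groupby.flush.percent=0.8;
> -- let the Top-N (limit pushdown) buffer use 4% of available memory
> set hive.limit.pushdown.memory.usage=0.04;
> -- MRv2 only, as noted above
> set hive.optimize.reducededuplication.min.reducer=1;
>
> select c_customer_id from customer group by c_customer_id order by
> c_customer_id limit 10;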
>
> @lefty: the limit pushdown seems to be missing in docs as the Top-N memory
> size.
>
> Cheers,
> Gopal
>