Posted to dev@hive.apache.org by Sivaramakrishnan Narayanan <ta...@gmail.com> on 2015/03/30 04:58:26 UTC

Vectorized group-by on strings is super slow in hive 0.13

Apologies if this has already been addressed / discussed - my searching of
jiras and mailing list did not find anything on this topic. Pointers
welcome.

Been experimenting a little with vectorized execution in hive 0.13 and
found that group-by is super slow on string columns. This simple query is
13x slower when vectorization is enabled (c_customer_id is string). Don't
see this problem with int types.

select c_customer_id from customer group by c_customer_id limit 10;
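
For anyone wanting to reproduce the comparison, here is a minimal sketch. It assumes the two timings come from toggling hive.vectorized.execution.enabled in the same session against the same customer table; the post does not spell out how the runs were set up, so treat that as an assumption. As far as I know, vectorization in 0.13 only applies to ORC-backed tables.

```sql
-- Baseline: row-mode execution
set hive.vectorized.execution.enabled=false;
select c_customer_id from customer group by c_customer_id limit 10;

-- Same query with vectorized execution turned on
set hive.vectorized.execution.enabled=true;
select c_customer_id from customer group by c_customer_id limit 10;
```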


An hprof profile of a mapper shows that hash-map lookups and inserts
(HashMap.getEntry and HashMap.put, about 58% of samples combined)
dominate execution time.

Thanks
Siva

CPU SAMPLES BEGIN (total = 95560) Sun Mar 29 21:07:21 2015
rank   self  accum   count trace method
   1 32.09% 32.09%   30664 301072 java.util.HashMap.getEntry
   2 25.75% 57.84%   24604 301041 java.util.HashMap.put
   3 18.62% 76.45%   17791 301633 java.io.FileOutputStream.writeBytes
   4  5.67% 82.12%    5416 300917 java.net.SocketInputStream.socketRead0
   5  4.78% 86.90%    4568 300674 java.io.FileInputStream.available
   6  1.51% 88.42%    1447 301610 org.apache.hadoop.util.LexicographicalComparerHolder$UnsafeComparer.compareTo

TRACE 301072:
	java.util.HashMap.getEntry(HashMap.java:467)
	java.util.HashMap.get(HashMap.java:417)
	org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeHashAggregate.prepareBatchAggregationBufferSets(VectorGroupByOperator.java:353)
	org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeHashAggregate.processBatch(VectorGroupByOperator.java:292)

Re: Vectorized group-by on strings is super slow in hive 0.13

Posted by Lefty Leverenz <le...@gmail.com>.
Thanks for the tip, Gopal.  I documented hive.limit.pushdown.memory.usage
<https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.limit.pushdown.memory.usage>
in the Configuration Properties wiki but had a couple of questions about the
description (see the comment on HIVE-3562
<https://issues.apache.org/jira/browse/HIVE-3562?focusedCommentId=14392243&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14392243>).

-- Lefty

On Mon, Mar 30, 2015 at 12:42 AM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

> Hi,
>
> >Been experimenting a little with vectorized execution in hive 0.13 and
> >found that group-by is super slow on string columns. This simple query is
> >13x slower when vectorization is enabled (c_customer_id is string). Don't
> >see this problem with int types.
>
> I think the performance issue is due to the row-count triggers for
> flushing the in-memory aggregations.
>
> This shouldn't happen to you in the hive-1.0 branch, but for 0.13 there is
> a fairly easy workaround to the performance issue.
>
> >select c_customer_id from customer group by c_customer_id limit 10;
>
> A very odd query that one, since it is one of the few patterns which
> speeds up with an extra ORDER BY.
>
> select c_customer_id from customer group by c_customer_id order by
> c_customer_id limit 10;
>
> tends to run faster than regular group-by + fetch limit as it shuffles
> less data (10 keys per map task).
>
> Try the same with
>
> set hive.vectorized.groupby.checkinterval=1024;
> set hive.vectorized.groupby.flush.percent=0.8;
> set hive.limit.pushdown.memory.usage=0.04;
>
> set hive.optimize.reducededuplication.min.reducer=1;
> -- above only if you're on MRv2; in Tez the default (4) is the faster option
>
> That combination of operators should be triggering the fastest codepath.
>
> @lefty: the limit pushdown setting (hive.limit.pushdown.memory.usage, the
> Top-N memory size) seems to be missing from the docs.
>
> Cheers,
> Gopal

Re: Vectorized group-by on strings is super slow in hive 0.13

Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi,

>Been experimenting a little with vectorized execution in hive 0.13 and
>found that group-by is super slow on string columns. This simple query is
>13x slower when vectorization is enabled (c_customer_id is string). Don't
>see this problem with int types.

I think the performance issue is due to the row-count triggers for
flushing the in-memory aggregations.

This shouldn't happen to you in the hive-1.0 branch, but for 0.13 there is
a fairly easy workaround to the performance issue.

>select c_customer_id from customer group by c_customer_id limit 10;

A very odd query that one, since it is one of the few patterns which
speeds up with an extra ORDER BY.

select c_customer_id from customer group by c_customer_id order by
c_customer_id limit 10;

tends to run faster than regular group-by + fetch limit as it shuffles
less data (10 keys per map task).
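
One way to check what the extra ORDER BY changes is to compare the two plans. This is only a sketch against the same customer table; the plan text differs between Hive versions, so no expected output is shown here.

```sql
-- Plan for the original form: group-by followed by a fetch limit
explain
select c_customer_id from customer group by c_customer_id limit 10;

-- Plan for the ORDER BY form, where the limit can be pushed toward the map side
explain
select c_customer_id from customer group by c_customer_id
order by c_customer_id limit 10;
```

If limit pushdown is active (hive.limit.pushdown.memory.usage set to a positive fraction), the second plan should be the one where each map task only needs to emit its top 10 keys.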

Try the same with

set hive.vectorized.groupby.checkinterval=1024;
set hive.vectorized.groupby.flush.percent=0.8;
set hive.limit.pushdown.memory.usage=0.04;

set hive.optimize.reducededuplication.min.reducer=1;
-- above only if you're on MRv2; in Tez the default (4) is the faster option

That combination of operators should be triggering the fastest codepath.
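
Put together as one session, the workaround might look like the sketch below. The comments are my paraphrase of the Configuration Properties descriptions, not authoritative wording, so verify them against the Hive version in use. Enabling hive.vectorized.execution.enabled here is an assumption carried over from the original report.

```sql
set hive.vectorized.execution.enabled=true;

-- Entries added to the group-by hash between recomputations of average entry size
set hive.vectorized.groupby.checkinterval=1024;
-- Fraction of hash entries flushed when the memory threshold is exceeded
set hive.vectorized.groupby.flush.percent=0.8;
-- Fraction of memory the reduce sink may use for Top-N selection
set hive.limit.pushdown.memory.usage=0.04;

-- MRv2 only; on Tez leave this at its default of 4
set hive.optimize.reducededuplication.min.reducer=1;

select c_customer_id from customer group by c_customer_id
order by c_customer_id limit 10;
```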

@lefty: the limit pushdown setting (hive.limit.pushdown.memory.usage, the
Top-N memory size) seems to be missing from the docs.

Cheers,
Gopal