Posted to dev@hive.apache.org by Sivaramakrishnan Narayanan <ta...@gmail.com> on 2015/03/30 04:58:26 UTC
Vectorized group-by on strings is super slow in hive 0.13
Apologies if this has already been addressed or discussed. My search of the
JIRAs and the mailing list did not turn up anything on this topic. Pointers
welcome.
I have been experimenting a little with vectorized execution in Hive 0.13 and
found that group-by is super slow on string columns. This simple query is
13x slower when vectorization is enabled (c_customer_id is a string). I don't
see this problem with int types.
select c_customer_id from customer group by c_customer_id limit 10;
An hprof profile of the mapper shows that hashing of keys dominates execution
time: HashMap.getEntry and HashMap.put together account for about 58% of the
CPU samples.
Thanks
Siva
CPU SAMPLES BEGIN (total = 95560) Sun Mar 29 21:07:21 2015
rank   self  accum   count trace method
   1 32.09% 32.09%   30664 301072 java.util.HashMap.getEntry
   2 25.75% 57.84%   24604 301041 java.util.HashMap.put
   3 18.62% 76.45%   17791 301633 java.io.FileOutputStream.writeBytes
   4  5.67% 82.12%    5416 300917 java.net.SocketInputStream.socketRead0
   5  4.78% 86.90%    4568 300674 java.io.FileInputStream.available
   6  1.51% 88.42%    1447 301610 org.apache.hadoop.util.LexicographicalComparerHolder$UnsafeComparer.compareTo
TRACE 301072:
java.util.HashMap.getEntry(HashMap.java:467)
java.util.HashMap.get(HashMap.java:417)
org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeHashAggregate.prepareBatchAggregationBufferSets(VectorGroupByOperator.java:353)
org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeHashAggregate.processBatch(VectorGroupByOperator.java:292)
Re: Vectorized group-by on strings is super slow in hive 0.13
Posted by Lefty Leverenz <le...@gmail.com>.
Thanks for the tip, Gopal. I documented hive.limit.pushdown.memory.usage
<https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.limit.pushdown.memory.usage>
in the Configuration Properties wiki but had a couple of questions about the
description (see the comment on HIVE-3562
<https://issues.apache.org/jira/browse/HIVE-3562?focusedCommentId=14392243&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14392243>).
-- Lefty
On Mon, Mar 30, 2015 at 12:42 AM, Gopal Vijayaraghavan <go...@apache.org>
wrote:
> Hi,
>
> >Been experimenting a little with vectorized execution in hive 0.13 and
> >found that group-by is super slow on string columns. This simple query is
> >13x slower when vectorization is enabled (c_customer_id is string). Don't
> >see this problem with int types.
>
> I think the performance issue is due to the row-count triggers for
> flushing the in-memory aggregations.
>
> This shouldn't happen to you in the hive-1.0 branch, but for 0.13 there is
> a fairly easy workaround to the performance issue.
>
> >select c_customer_id from customer group by c_customer_id limit 10;
>
> A very odd query that one, since it is one of the few patterns which
> speeds up with an extra ORDER BY.
>
> select c_customer_id from customer group by c_customer_id order by
> c_customer_id limit 10;
>
> tends to run faster than regular group-by + fetch limit as it shuffles
> less data (10 keys per map task).
>
> Try the same with
>
> set hive.vectorized.groupby.checkinterval=1024;
> set hive.vectorized.groupby.flush.percent=0.8;
> set hive.limit.pushdown.memory.usage=0.04;
>
> set hive.optimize.reducededuplication.min.reducer=1;
> -- the setting above applies only if you're on MRv2; in Tez the default (4) is the faster option
>
> That combination of operators should be triggering the fastest codepath.
>
> @lefty: the limit pushdown seems to be missing in docs as the Top-N memory
> size.
>
> Cheers,
> Gopal
Re: Vectorized group-by on strings is super slow in hive 0.13
Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi,
>Been experimenting a little with vectorized execution in hive 0.13 and
>found that group-by is super slow on string columns. This simple query is
>13x slower when vectorization is enabled (c_customer_id is string). Don't
>see this problem with int types.
I think the performance issue is due to the row-count triggers for
flushing the in-memory aggregations.
This shouldn't happen to you in the hive-1.0 branch, but for 0.13 there is
a fairly easy workaround to the performance issue.
>select c_customer_id from customer group by c_customer_id limit 10;
A very odd query that one, since it is one of the few patterns which
speeds up with an extra ORDER BY.
select c_customer_id from customer group by c_customer_id order by
c_customer_id limit 10;
tends to run faster than regular group-by + fetch limit as it shuffles
less data (10 keys per map task).
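The "10 keys per map task" effect can be illustrated with a small Java sketch. This is only a toy model of a per-task Top-N filter, assuming string keys sorted in ascending order; the class and method names are hypothetical and it is not Hive's actual limit pushdown code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model (hypothetical names): each map task keeps only the `limit`
// smallest distinct keys, so at most `limit` keys are shuffled per task.
class TopNPushdownSketch {
    static List<String> topNKeys(Iterable<String> keys, int limit) {
        TreeSet<String> smallest = new TreeSet<>();
        for (String key : keys) {
            smallest.add(key);               // duplicates collapse, as in GROUP BY
            if (smallest.size() > limit) {
                smallest.pollLast();         // evict the largest key seen so far
            }
        }
        return new ArrayList<>(smallest);    // ascending order, at most `limit` keys
    }

    public static void main(String[] args) {
        List<String> mapInput =
            Arrays.asList("c07", "c03", "c07", "c11", "c01", "c03", "c09");
        // Prints [c01, c03, c07]
        System.out.println(topNKeys(mapInput, 3));
    }
}
```

With an ORDER BY, every task can discard keys that cannot appear in the final top 10, whereas a plain group-by with a fetch limit has no ordering that would let a task drop keys early.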
Try the same with
set hive.vectorized.groupby.checkinterval=1024;
set hive.vectorized.groupby.flush.percent=0.8;
set hive.limit.pushdown.memory.usage=0.04;
set hive.optimize.reducededuplication.min.reducer=1;
-- the setting above applies only if you're on MRv2; in Tez the default (4) is the faster option
That combination of operators should be triggering the fastest codepath.
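To make the role of a row-count trigger and a flush percentage concrete, here is a minimal Java sketch of a hash aggregation that checks its size every few rows and flushes a fraction of its entries when it is over a cap. It is a toy model under stated assumptions: the class, its fields, and the entry-count cap are hypothetical, and it does not reproduce the logic of Hive's VectorGroupByOperator or the exact semantics of the settings above.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Toy model only: all names and the entry-count cap are assumptions,
// not Hive's implementation.
class FlushTriggerSketch {
    private final Map<String, Long> counts = new HashMap<>();
    private final int checkInterval;    // rows processed between size checks
    private final int maxEntries;       // hypothetical cap on buffered keys
    private final double flushPercent;  // fraction of entries evicted per flush
    private int rowsSinceCheck = 0;
    long flushedEntries = 0;

    FlushTriggerSketch(int checkInterval, int maxEntries, double flushPercent) {
        this.checkInterval = checkInterval;
        this.maxEntries = maxEntries;
        this.flushPercent = flushPercent;
    }

    void process(String key) {
        counts.merge(key, 1L, Long::sum);   // one hash lookup and update per row
        rowsSinceCheck++;
        if (rowsSinceCheck >= checkInterval) {
            rowsSinceCheck = 0;
            if (counts.size() > maxEntries) {
                flush();
            }
        }
    }

    private void flush() {
        int toFlush = (int) (counts.size() * flushPercent);
        Iterator<Map.Entry<String, Long>> it = counts.entrySet().iterator();
        while (toFlush > 0 && it.hasNext()) {
            it.next();
            it.remove();    // a real operator would forward the partial aggregate
            flushedEntries++;
            toFlush--;
        }
    }

    int bufferedEntries() {
        return counts.size();
    }

    public static void main(String[] args) {
        FlushTriggerSketch agg = new FlushTriggerSketch(5, 4, 0.5);
        for (int i = 0; i < 10; i++) {
            agg.process("customer-" + i);   // all keys distinct, like c_customer_id
        }
        // Prints flushed=6 buffered=4
        System.out.println("flushed=" + agg.flushedEntries
            + " buffered=" + agg.bufferedEntries());
    }
}
```

In this model, a shorter check interval and a larger flush fraction keep the buffered map small when nearly every key is distinct, which is the situation a group-by on a customer id column creates.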
@lefty: the limit pushdown setting (the Top-N memory size) seems to be
missing from the docs.
Cheers,
Gopal