You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Navis류승우 <na...@nexr.com> on 2013/11/01 07:56:13 UTC

Re: Using Cluster by to improve Group by Performance

>From perspective of RS, two query is just different in hash code of RS key.
The cost of calculating hash of col3 and col4 might be negligible, I think.

2013/11/1 KayVajj <va...@gmail.com>

> Any response or pointers to understand how Cluster By in sub queries can
> affect the performance/speed of outer queries is helpful.
>
> Thanks
> Kay
>
>
> On Mon, Oct 28, 2013 at 1:17 PM, KayVajj <va...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a question if I could use the cluster by clause in a sub query to
>> improve the performance of a group by query in hive
>>
>> Lets I have a Table A with columns (all strings) col1..col5 and the table
>> is not "Clustered"
>>
>> now I 'm trying to run the below query
>>
>> select
>>> col1,
>>> col2,
>>> col3,
>>> col4,
>>> concat_ws(',', collect_set(col5))
>>> from A
>>> group by
>>> col1,
>>> col2,
>>> col3,
>>> col4
>>
>>
>>
>> Would the below query optimize the above query and if not what is the
>> best practice to optimize this query. Assuming only col1 & col2 are the
>> uniquely identifying columns
>>
>>
>>
>>
>> select
>>> ct.col1,
>>> ct.col2,
>>> ct.col3,
>>> ct.col4,
>>> concat_ws(',', collect_set(ct.col5))
>>> from
>>> (
>>> select
>>> col1,
>>> col2,
>>> col3,
>>> col4,
>>> col5
>>> from A
>>> cluster by col1, col2
>>> ) ct
>>> group by
>>> ct.col1,
>>> ct.col2,
>>> ct.col3,
>>> ct.col4.
>>
>>
>> Thanks for your responses.
>>
>>
>