You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Rex X <dn...@gmail.com> on 2016/03/28 00:12:09 UTC
What's the advised way to do groupby 2 attributes from a table with
1000 columns?
Give a table with 1000 columns:
col1, col2, ..., col1000
The source table is about 1PB.
I only need to query 3 columns,
select col1, col2, sum(col3) as col3
from myTable
group by
col1, col2
Will it be advised to do a subquery first, and then send it to the
aggregation of group by, so that we have smaller files sending to groupby?
Not sure it Hive automatically takes care of this.
select col1, col2, sum(col3) as col3
from
(select col1, col2, col3
from myTable
) a
group by
col1, col2
Re: What's the advised way to do groupby 2 attributes from a table
with 1000 columns?
Posted by Gopal Vijayaraghavan <go...@apache.org>.
> I only need to query 3 columns,
...
> The source table is about 1PB.
Format of this table is extremely critical.
A columnar data format like ORC is recommended to avoid reading any other
columns when reading 3 out of 1000.
> Will it be advised to do a subquery first, and then send it to the
>aggregation of group by, so that we have smaller files sending to
>groupby? Not sure it Hive automatically takes care of this.
Hive does column projection after the first scan, so this should not be
necessary - if you do explain logical <query>, you will see
hive> explain logical select l_shipmode, l_shipdate, sum(l_quantity) from
lineitem group by l_shipmode, l_shipdate;
LOGICAL PLAN:
lineitem
TableScan (TS_0)
alias: lineitem
Select Operator (SEL_1)
expressions: l_shipdate (type: string), l_shipmode (type: string),
l_quantity (type: double)
outputColumnNames: l_shipdate, l_shipmode, l_quantity
Group By Operator (GBY_2)
aggregations: sum(l_quantity)
keys: l_shipdate (type: string), l_shipmode (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2
The SEL_1 showing the projection of the 3 columns out of all cols in
lineitem.
Cheers,
Gopal