You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Rex X <dn...@gmail.com> on 2016/03/28 00:12:09 UTC

What's the advised way to do groupby 2 attributes from a table with 1000 columns?

Give a table with 1000 columns:
    col1, col2, ..., col1000

The source table is about 1PB.

I only need to query 3 columns,

select col1, col2, sum(col3) as col3
from myTable
group by
col1, col2


Will it be advised to do a subquery first, and then send it to the
aggregation of group by, so that we have smaller files sending to groupby?
Not sure it Hive automatically takes care of this.

select col1, col2, sum(col3) as col3
from
    (select col1, col2, col3
     from myTable
    ) a
group by
col1, col2

Re: What's the advised way to do groupby 2 attributes from a table with 1000 columns?

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> I only need to query 3 columns,
...
> The source table is about 1PB.


Format of this table is extremely critical.

A columnar data format like ORC is recommended to avoid reading any other
columns when reading 3 out of 1000.

> Will it be advised to do a subquery first, and then send it to the
>aggregation of group by, so that we have smaller files sending to
>groupby? Not sure it Hive automatically takes care of this.

Hive does column projection after the first scan, so this should not be
necessary - if you do explain logical <query>, you will see

hive> explain logical select l_shipmode, l_shipdate, sum(l_quantity) from
lineitem group by l_shipmode, l_shipdate;



LOGICAL PLAN:
lineitem 
  TableScan (TS_0)
    alias: lineitem
    Select Operator (SEL_1)
      expressions: l_shipdate (type: string), l_shipmode (type: string),
l_quantity (type: double)
      outputColumnNames: l_shipdate, l_shipmode, l_quantity
      Group By Operator (GBY_2)
        aggregations: sum(l_quantity)
        keys: l_shipdate (type: string), l_shipmode (type: string)
        mode: hash
        outputColumnNames: _col0, _col1, _col2


The SEL_1 showing the projection of the 3 columns out of all cols in
lineitem.

Cheers,
Gopal