You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Ruslan Dautkhanov (JIRA)" <ji...@apache.org> on 2018/06/23 16:44:00 UTC

[jira] [Updated] (IMPALA-7204) Add support for GROUP BY ROLLUP

     [ https://issues.apache.org/jira/browse/IMPALA-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruslan Dautkhanov updated IMPALA-7204:
--------------------------------------
    Description: 
Now suppose that we'd like to analyze our sales data, to study the amount of sales that is occurring for different products, in different states and regions. Using the ROLLUP feature of SQL 2003, we could issue the query:
{code:sql}
select region, state, product, sum(sales) total_sales
from sales_history 
group by rollup (region, state, product)
{code}
Semantically, the above query is equivalent to

 
{code:sql}
select region, state, product, sum(sales) total_sales
from sales_history 
group by region, state, product
union
select region, state, null, sum(sales) total_sales
from sales_history 
group by region, state
union
select region, null, null, sum(sales) total_sales
from sales_history 
group by region
union
select null, null, null, sum(sales) total_sales
from sales_history
 
{code}
The query might produce results that looked something like:
{noformat}
REGION STATE PRODUCT TOTAL_SALES
------ ----- ------- -----------
null null null 6200
EAST MA BOATS 100
EAST MA CARS 1500
EAST MA null 1600
EAST NY BOATS 150
EAST NY CARS 1000
EAST NY null 1150
EAST null null 2750
WEST CA BOATS 750
WEST CA CARS 500
WEST CA null 1250
WEST AZ BOATS 2000
WEST AZ CARS 200
WEST AZ null 2200
WEST null null 3450

{noformat}
We have a lot of production queries that work around this missing Impala functionality by having three UNION ALLs. Physical execution plan shows Impala actually reads full fact table three times. So it could be a three times improvement (or more, depending on number of columns that are being rolled up).

I can't find another SQL on Hadoop engine that doesn't support this feature. 
 *Checked Spark, Hive, PIG, Flink and some other engines - they all do support this basic SQL feature*.

Would be great to have a matching feature in Impala too.

  was:
Now suppose that we'd like to analyze our sales data, to study the amount of sales that is occurring for different products, in different states and regions. Using the ROLLUP feature of SQL 2003, we could issue the query:

{code:sql}
select region, state, product, sum(sales) total_sales
from sales_history 
group by rollup (region, state, product)
{code}
Semantically, the above query is equivalent to

 
{code:sql}
select region, state, product, sum(sales) total_sales
from sales_history 
group by region, state, product
union
select region, state, null, sum(sales) total_sales
from sales_history 
group by region, state
union
select region, null, null, sum(sales) total_sales
from sales_history 
group by region
union
select null, null, null, sum(sales) total_sales
from sales_history
 
{code}

The query might produce results that looked something like:

{noformat}

REGION STATE PRODUCT TOTAL_SALES
------ ----- ------- -----------
null null null 6200
EAST MA BOATS 100
EAST MA CARS 1500
EAST MA null 1600
EAST NY BOATS 150
EAST NY CARS 1000
EAST NY null 1150
EAST null null 2750
WEST CA BOATS 750
WEST CA CARS 500
WEST CA null 1250
WEST AZ BOATS 2000
WEST AZ CARS 200
WEST AZ null 2200
WEST null null 3450

{noformat}

We have a lot of production queries that work around this missing Impala functionality by having three UNION ALLs. Physical execution plan shows Impala actually reads full fact table three times. So it could be a three times improvement (or more, depending on number of columns that are being rolled up).

I can't find another SQL on Hadoop engine that doesn't support this feature. 
Checked Spark, Hive, PIG, Flink and some other engines - they all do support this basic SQL features. Would be great to have a matching feature in Impala too. 



> Add support for GROUP BY ROLLUP
> -------------------------------
>
>                 Key: IMPALA-7204
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7204
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 3.0, Impala 2.12.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Critical
>              Labels: GROUP_BY, sql
>
> Now suppose that we'd like to analyze our sales data, to study the amount of sales that is occurring for different products, in different states and regions. Using the ROLLUP feature of SQL 2003, we could issue the query:
> {code:sql}
> select region, state, product, sum(sales) total_sales
> from sales_history 
> group by rollup (region, state, product)
> {code}
> Semantically, the above query is equivalent to
>  
> {code:sql}
> select region, state, product, sum(sales) total_sales
> from sales_history 
> group by region, state, product
> union
> select region, state, null, sum(sales) total_sales
> from sales_history 
> group by region, state
> union
> select region, null, null, sum(sales) total_sales
> from sales_history 
> group by region
> union
> select null, null, null, sum(sales) total_sales
> from sales_history
>  
> {code}
> The query might produce results that looked something like:
> {noformat}
> REGION STATE PRODUCT TOTAL_SALES
> ------ ----- ------- -----------
> null null null 6200
> EAST MA BOATS 100
> EAST MA CARS 1500
> EAST MA null 1600
> EAST NY BOATS 150
> EAST NY CARS 1000
> EAST NY null 1150
> EAST null null 2750
> WEST CA BOATS 750
> WEST CA CARS 500
> WEST CA null 1250
> WEST AZ BOATS 2000
> WEST AZ CARS 200
> WEST AZ null 2200
> WEST null null 3450
> {noformat}
> We have a lot of production queries that work around this missing Impala functionality by having three UNION ALLs. Physical execution plan shows Impala actually reads full fact table three times. So it could be a three times improvement (or more, depending on number of columns that are being rolled up).
> I can't find another SQL on Hadoop engine that doesn't support this feature. 
>  *Checked Spark, Hive, PIG, Flink and some other engines - they all do support this basic SQL feature*.
> Would be great to have a matching feature in Impala too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org