You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2020/10/26 14:07:00 UTC
[jira] [Created] (IMPALA-10284) Support executing union operands serially to reduce resource requirements

Quanlong Huang created IMPALA-10284:
---------------------------------------

             Summary: Support executing union operands serially to reduce resource requirements
                 Key: IMPALA-10284
                 URL: https://issues.apache.org/jira/browse/IMPALA-10284
             Project: IMPALA
          Issue Type: New Feature
            Reporter: Quanlong Huang


Large queries are slow if they trigger spill-to-disk. Sometimes we can split a large query into several smaller queries that each of them can fit into the memory to avoid spill-to-disk. For instance, consider the following query:
{code:sql}
select dt, os, city, count(distinct user_id)
from events
where dt >= 20200801 and dt <= 20200804
group by dt, os, city
order by dt, os, city
{code}
'dt' is the date in INT and it's the partition column. This query will require a large size of memory if the number of distinct user_id is large. If we can split the query based on 'dt' and execute them serially, we can reduce the required memory to avoid spill-to-disk:
{code:sql}
select dt, os, city, count(distinct user_id) from events where dt = 20200801 group by 1,2,3 order by 1,2,3;
select dt, os, city, count(distinct user_id) from events where dt = 20200802 group by 1,2,3 order by 1,2,3;
select dt, os, city, count(distinct user_id) from events where dt = 20200803 group by 1,2,3 order by 1,2,3;
select dt, os, city, count(distinct user_id) from events where dt = 20200804 group by 1,2,3 order by 1,2,3
{code}
The original query can be split into 4 queries since the results are first sorted by 'dt', and we sort the 4 queries based on their 'dt' values.
 If the original query is ordered by other columns, we need to union all results and sort them again. For instance:
{code:sql}
select dt, os, city, count(distinct user_id) uv
from events
where dt >= 20200801 and dt <= 20200804
group by dt, os, city
order by os, city, uv desc
{code}
It can't be split into 4 queries. If Impala can execute union operand one by one, we still have the hope to execute it using a smaller memory requirement:
{code:sql}
select dt, os, city, uv
from (
  select dt, os, city, count(distinct user_id) uv from events where dt = 20200801 group by 1,2,3
  union all
  select dt, os, city, count(distinct user_id) uv from events where dt = 20200802 group by 1,2,3
  union all
  select dt, os, city, count(distinct user_id) uv from events where dt = 20200803 group by 1,2,3
  union all
  select dt, os, city, count(distinct user_id) uv from events where dt = 20200804 group by 1,2,3
) t
order by os, city, uv desc
{code}
Currently, the 4 union operands start concurrently and occupy memory together. It'd be helpful if we can execute them one by one, and free the resources of an operand after its results are all fetched. We can control this by a query option or a query hint.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)