Posted to issues@drill.apache.org by "Rahul Challapalli (JIRA)" <ji...@apache.org> on 2017/06/23 18:38:00 UTC

[jira] [Created] (DRILL-5604) Possible performance degradation with hash aggregate when number of distinct keys increase

Rahul Challapalli created DRILL-5604:
----------------------------------------

             Summary: Possible performance degradation with hash aggregate when number of distinct keys increase
                 Key: DRILL-5604
                 URL: https://issues.apache.org/jira/browse/DRILL-5604
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Relational Operators
    Affects Versions: 1.11.0
            Reporter: Rahul Challapalli


git.commit.id.abbrev=90f43bf

I tracked the query runtime while gradually increasing the number of distinct keys and keeping the total number of records constant. Below is one such test on the TPC-DS SF1000 dataset:

{code}
0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_list_price) from store_sales;
+---------+
| EXPR$0  |
+---------+
| 19736   |
+---------+
1 row selected (163.345 seconds)
0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_net_profit) from store_sales;
+----------+
|  EXPR$0  |
+----------+
| 1525675  |
+----------+
1 row selected (2094.962 seconds)
{code}

In both of the above queries, the hash agg code processed 2879987999 records, so the time difference is presumably due to overheads such as hash table resizing. The second query took ~32 minutes longer than the first (2094.962 vs. 163.345 seconds), which raises doubts about whether there is an issue somewhere.
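
To put the gap between the two runs into numbers, here is a small calculation using only the figures reported above. It is a rough sketch: the times are wall-clock times for the whole query as printed by sqlline, not time spent in the hash agg operator alone, and the variable names are just for illustration.

```python
# Figures copied from the two sqlline runs in this report.
RECORDS = 2_879_987_999  # records processed by the hash agg code in both queries

runs = {
    "ss_list_price": {"distinct_keys": 19_736, "seconds": 163.345},
    "ss_net_profit": {"distinct_keys": 1_525_675, "seconds": 2094.962},
}

fast = runs["ss_list_price"]
slow = runs["ss_net_profit"]

# Absolute and relative difference in wall-clock time.
extra_seconds = slow["seconds"] - fast["seconds"]
slowdown = slow["seconds"] / fast["seconds"]

# How much the number of distinct keys grew between the two queries.
key_growth = slow["distinct_keys"] / fast["distinct_keys"]

for name, run in runs.items():
    # Whole-query wall-clock time divided by record count. This is an
    # upper bound on per-record hash agg cost, not a profile of it.
    ns_per_record = run["seconds"] / RECORDS * 1e9
    print(f"{name}: {ns_per_record:.1f} ns of wall-clock time per record")

print(f"extra time: {extra_seconds:.3f} s ({extra_seconds / 60:.1f} min)")
print(f"slowdown: {slowdown:.1f}x for {key_growth:.1f}x more distinct keys")
```

The record count is identical in both runs, so the ratio of per-record times equals the ratio of the total runtimes.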

The dataset is too large to attach to a JIRA, and so are the logs.


