Posted to issues@drill.apache.org by "Rahul Challapalli (JIRA)" <ji...@apache.org> on 2017/06/23 18:38:00 UTC
[jira] [Created] (DRILL-5604) Possible performance degradation with
hash aggregate when number of distinct keys increase
Rahul Challapalli created DRILL-5604:
----------------------------------------
Summary: Possible performance degradation with hash aggregate when number of distinct keys increase
Key: DRILL-5604
URL: https://issues.apache.org/jira/browse/DRILL-5604
Project: Apache Drill
Issue Type: Bug
Components: Execution - Relational Operators
Affects Versions: 1.11.0
Reporter: Rahul Challapalli
git.commit.id.abbrev=90f43bf
I tried to track the runtime as we gradually increase the number of distinct keys without increasing the total number of records. Below is one such test on top of the TPC-DS SF1000 dataset:
{code}
0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_list_price) from store_sales;
+---------+
| EXPR$0 |
+---------+
| 19736 |
+---------+
1 row selected (163.345 seconds)
0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_net_profit) from store_sales;
+----------+
| EXPR$0 |
+----------+
| 1525675 |
+----------+
1 row selected (2094.962 seconds)
{code}
In both of the above queries, the hash aggregate code processed 2879987999 records, so the time difference is due to overheads such as hash table resizing. The second query took about 32 minutes longer than the first (2094.962 vs. 163.345 seconds), which raises doubts about whether there is an issue somewhere.
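To put the two runs side by side, here is a small back-of-the-envelope sketch in Python. It uses only the figures quoted above. The variable and function names are illustrative, and the timings are the end-to-end query times reported by sqlline, so they include scan and other operator costs and are only a rough proxy for hash aggregate cost.

```python
# Figures copied from the two sqlline runs in this report.
RECORDS = 2_879_987_999  # records processed by hash agg in both queries

runs = {
    "ss_list_price": {"distinct_keys": 19_736, "seconds": 163.345},
    "ss_net_profit": {"distinct_keys": 1_525_675, "seconds": 2094.962},
}


def throughput(seconds):
    """Records per second of end-to-end query time (not hash agg alone)."""
    return RECORDS / seconds


for name, run in runs.items():
    rate = throughput(run["seconds"]) / 1e6
    print(f"{name}: {rate:.2f} M records/s")

fast, slow = runs["ss_list_price"], runs["ss_net_profit"]

# How much longer the second query ran, and how many more keys it had.
slowdown = slow["seconds"] / fast["seconds"]
key_growth = slow["distinct_keys"] / fast["distinct_keys"]
extra_seconds = slow["seconds"] - fast["seconds"]

print(f"slowdown: {slowdown:.1f}x for {key_growth:.1f}x more distinct keys")
print(f"extra time: {extra_seconds:.3f} s ({extra_seconds / 60:.1f} min)")
```

With these numbers the second query is roughly 12.8 times slower for roughly 77 times as many distinct keys, over the same number of input records.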
The dataset is too large to attach to a JIRA, and so are the logs.