Posted to issues@drill.apache.org by "Khurram Faraaz (JIRA)" <ji...@apache.org> on 2015/04/21 21:49:59 UTC

[jira] [Created] (DRILL-2840) Duplicate HashAgg operator seen in physical plan for aggregate & grouping query

Khurram Faraaz created DRILL-2840:
-------------------------------------

             Summary: Duplicate HashAgg operator seen in physical plan for aggregate & grouping query
                 Key: DRILL-2840
                 URL: https://issues.apache.org/jira/browse/DRILL-2840
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 0.9.0
            Reporter: Khurram Faraaz
            Assignee: Jinfeng Ni


The test was executed on a 4-node cluster on CentOS.

{code}

Case 1) The physical plan contains two back-to-back HashAgg(group=[{0, 1}]) operators immediately above the Scan; we need to know why the second, duplicate one is there.

0: jdbc:drill:> select max( distinct key1 ) maximum, count( distinct key1 ) count_key1, sum( distinct key1 ) sum_key1, min( distinct key1 ) minimum, avg( distinct key1 ) average, key2 from `twoKeyJsn.json` group by key2 order by key2;
+------------+------------+------------+------------+------------+------------+
|  maximum   | count_key1 |  sum_key1  |  minimum   |  average   |    key2    |
+------------+------------+------------+------------+------------+------------+
| 1.40095133379E9 | 156        | 1.1920469973999657E11 | 2.39793089027E7 | 7.641326906410036E8 | 0          |
| 1.42435032101E9 | 180        | 1.2884789516328592E11 | 8659240.29442 | 7.158216397960329E8 | 1          |
| 1.42931626355E9 | 1872281    | 1.3386731804571605E15 | 618.939623926 | 7.149958689198686E8 | a          |
| 1.42931347924E9 | 1870676    | 1.3371603128280032E15 | 108.851943741 | 7.14800592314224E8 | b          |
| 1.42931336919E9 | 1871847    | 1.337837189079748E15 | 3018.47312743 | 7.147150323075273E8 | c          |
| 1.42931380603E9 | 1870697    | 1.3362335178170852E15 | 3890.92180463 | 7.142971404867198E8 | d          |
| 1.42931281008E9 | 1871507    | 1.3367368067327902E15 | 1165.48741414 | 7.142569099302275E8 | e          |
| 1.42931480081E9 | 1870450    | 1.3358301916601862E15 | 354.577534881 | 7.14175835579773E8 | f          |
| 1.42931509068E9 | 1873604    | 1.3389171286500712E15 | 889.584888053 | 7.146211945801094E8 | g          |
| 1.42931553374E9 | 1872726    | 1.3393592500619982E15 | 2704.34813594 | 7.151923186104097E8 | h          |
| 1.42931450347E9 | 1872434    | 1.3381712881732795E15 | 122.281412463 | 7.146694025921766E8 | i          |
| 1.42931539751E9 | 1872250    | 1.3380216282921535E15 | 946.21365677 | 7.146597026530397E8 | j          |
| 1.42931334853E9 | 1873923    | 1.3390341356271005E15 | 1070.7862089 | 7.145619834043877E8 | k          |
| 1.42931539809E9 | 1870929    | 1.3371605654647945E15 | 55.1144569856 | 7.147040670516062E8 | l          |
| 1.42931543226E9 | 1874172    | 1.339322148620916E15 | 858.05505376 | 7.146207224421856E8 | m          |
| 1.42931595791E9 | 1874462    | 1.3391024723756562E15 | 237.230716926 | 7.143929684227561E8 | n          |
+------------+------------+------------+------------+------------+------------+
16 rows selected (103.566 seconds)
0: jdbc:drill:> explain plan for select max( distinct key1 ) maximum, count( distinct key1 ) count_key1, sum( distinct key1 ) sum_key1, min( distinct key1 ) minimum, avg( distinct key1 ) average, key2 from `twoKeyJsn.json` group by key2 order by key2;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(maximum=[$0], count_key1=[$1], sum_key1=[$2], minimum=[$3], average=[$4], key2=[$5])
00-02        SelectionVectorRemover
00-03          Sort(sort0=[$5], dir0=[ASC])
00-04            Project(maximum=[$1], count_key1=[$2], sum_key1=[CASE(=($2, 0), null, $3)], minimum=[$4], average=[CAST(/(CastHigh(CASE(=($2, 0), null, $3)), $2)):ANY NOT NULL], key2=[$0])
00-05              HashAgg(group=[{0}], maximum=[MAX($1)], count_key1=[COUNT($1)], agg#2=[$SUM0($1)], minimum=[MIN($1)])
00-06                HashAgg(group=[{0, 1}])
00-07                  HashAgg(group=[{0, 1}])
00-08                    Scan(groupscan=[EasyGroupScan [selectionRoot=/tmp/twoKeyJsn.json, numFiles=1, columns=[`key2`, `key1`], files=[maprfs:/tmp/twoKeyJsn.json]]])

Config options related to hashing and aggregation were set as follows:

0: jdbc:drill:> select * from sys.options where name like '%agg%';
+------------+------------+------------+------------+------------+------------+------------+
|    name    |    kind    |    type    |  num_val   | string_val |  bool_val  | float_val  |
+------------+------------+------------+------------+------------+------------+------------+
| planner.enable_multiphase_agg | BOOLEAN    | SYSTEM     | null       | null       | true       | null       |
| planner.enable_streamagg | BOOLEAN    | SYSTEM     | null       | null       | true       | null       |
| planner.enable_hashagg | BOOLEAN    | SYSTEM     | null       | null       | true       | null       |
| planner.memory.hash_agg_table_factor | DOUBLE     | SYSTEM     | null       | null       | null       | 1.1        |
+------------+------------+------------+------------+------------+------------+------------+
4 rows selected (0.203 seconds)
0: jdbc:drill:> select * from sys.options where name like '%hash%';
+------------+------------+------------+------------+------------+------------+------------+
|    name    |    kind    |    type    |  num_val   | string_val |  bool_val  | float_val  |
+------------+------------+------------+------------+------------+------------+------------+
| planner.enable_hash_single_key | BOOLEAN    | SYSTEM     | null       | null       | true       | null       |
| planner.join.hash_join_swap_margin_factor | DOUBLE     | SYSTEM     | null       | null       | null       | 10.0       |
| exec.max_hash_table_size | LONG       | SYSTEM     | 1073741824 | null       | null       | null       |
| planner.enable_hashjoin_swap | BOOLEAN    | SYSTEM     | null       | null       | true       | null       |
| exec.min_hash_table_size | LONG       | SYSTEM     | 65536      | null       | null       | null       |
| planner.enable_hashagg | BOOLEAN    | SYSTEM     | null       | null       | true       | null       |
| planner.memory.hash_agg_table_factor | DOUBLE     | SYSTEM     | null       | null       | null       | 1.1        |
| planner.memory.hash_join_table_factor | DOUBLE     | SYSTEM     | null       | null       | null       | 1.1        |
| planner.enable_hashjoin | BOOLEAN    | SYSTEM     | null       | null       | true       | null       |
+------------+------------+------------+------------+------------+------------+------------+
9 rows selected (0.144 seconds)


{code}
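For context, one HashAgg(group=[{0, 1}]) above the Scan is expected: the standard logical rewrite for DISTINCT aggregates first de-duplicates (key2, key1) pairs, then aggregates per key2. It is only the second, identical copy that looks redundant. The sketch below (an assumption, not a confirmed diagnosis) shows the equivalent rewrite, plus a session-scoped toggle of planner.enable_multiphase_agg (listed in the sys.options output above) to check whether the duplicate comes from multi-phase aggregate planning:

```sql
-- Equivalent rewrite of the DISTINCT-aggregate query: de-duplicate first,
-- then aggregate per key2. One HashAgg(group=[{0, 1}]) is expected for the
-- inner DISTINCT; a second identical one is not.
SELECT key2,
       MAX(key1)   AS maximum,
       COUNT(key1) AS count_key1,
       SUM(key1)   AS sum_key1,
       MIN(key1)   AS minimum,
       AVG(key1)   AS average
FROM (SELECT DISTINCT key2, key1 FROM `twoKeyJsn.json`)
GROUP BY key2
ORDER BY key2;

-- Diagnostic toggle (session scope only): if the duplicate HashAgg
-- disappears with multi-phase aggregation off, the redundancy is
-- introduced by that planning phase.
ALTER SESSION SET `planner.enable_multiphase_agg` = false;
EXPLAIN PLAN FOR
SELECT MAX(DISTINCT key1) maximum, COUNT(DISTINCT key1) count_key1,
       SUM(DISTINCT key1) sum_key1, MIN(DISTINCT key1) minimum,
       AVG(DISTINCT key1) average, key2
FROM `twoKeyJsn.json` GROUP BY key2 ORDER BY key2;
ALTER SESSION SET `planner.enable_multiphase_agg` = true;
```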

Here is another (similar) aggregate and grouping query that hangs indefinitely.

{code}

Case 2) This aggregate and grouping query hangs indefinitely. The input is a CSV file with two columns of data, and the query was run on the same 4-node CentOS cluster. The data file has close to 26 million records.

0: jdbc:drill:> select count(*) from (select max( distinct cast(columns[0] as double) ) maximum, count( distinct cast(columns[0] as double) ) count_key1, sum( distinct cast(columns[0] as double)) sum_key1, min( distinct cast(columns[0] as double)) minimum, avg( distinct cast(columns[0] as double)) average, columns[1] from `tblfrmJsnToCSV/0_0_0.csv` where columns[0] <> 'key1' group by columns[1]);

The physical plan for the query is:

00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 22.877799999999997, cumulative cost = {2176138.623779999 rows, 7886477.503779999 cpu, 0.0 io, 2.811224064E8 network, 6683978.048 memory}, id = 74019
00-01      StreamAgg(group=[{}], EXPR$0=[COUNT()]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 22.877799999999997, cumulative cost = {2176136.335999999 rows, 7886475.215999999 cpu, 0.0 io, 2.811224064E8 network, 6683978.048 memory}, id = 74018
00-02        Project($f0=[0]) : rowType = RecordType(INTEGER $f0): rowcount = 228.77799999999996, cumulative cost = {2175907.5579999993 rows, 7883729.879999999 cpu, 0.0 io, 2.811224064E8 network, 6683978.048 memory}, id = 74017
00-03          HashAgg(group=[{0}], maximum=[MAX($1)], count_key1=[COUNT($1)], agg#2=[$SUM0($1)], minimum=[MIN($1)]) : rowType = RecordType(ANY EXPR$5, DOUBLE maximum, BIGINT count_key1, DOUBLE $f3, DOUBLE minimum): rowcount = 228.77799999999996, cumulative cost = {2175678.7799999993 rows, 7883725.879999999 cpu, 0.0 io, 2.811224064E8 network, 6683978.048 memory}, id = 74016
00-04            HashAgg(group=[{0, 1}]) : rowType = RecordType(ANY EXPR$5, DOUBLE $f1): rowcount = 2287.7799999999997, cumulative cost = {2173390.9999999995 rows, 7755610.199999999 cpu, 0.0 io, 2.811224064E8 network, 6643713.12 memory}, id = 74015
00-05              Project(EXPR$5=[$0], $f1=[$1]) : rowType = RecordType(ANY EXPR$5, DOUBLE $f1): rowcount = 22877.8, cumulative cost = {2150513.1999999997 rows, 7389565.399999999 cpu, 0.0 io, 2.811224064E8 network, 6039739.2 memory}, id = 74014
00-06                HashToRandomExchange(dist0=[[$0]], dist1=[[$1]]) : rowType = RecordType(ANY EXPR$5, DOUBLE $f1, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 22877.8, cumulative cost = {2127635.4 rows, 7389557.399999999 cpu, 0.0 io, 2.811224064E8 network, 6039739.2 memory}, id = 74013
01-01                  UnorderedMuxExchange : rowType = RecordType(ANY EXPR$5, DOUBLE $f1, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 22877.8, cumulative cost = {2104757.6 rows, 7115023.8 cpu, 0.0 io, 0.0 network, 6039739.2 memory}, id = 74012
02-01                    Project(EXPR$5=[$0], $f1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[castINT(hash64($1, hash64($0)))]) : rowType = RecordType(ANY EXPR$5, DOUBLE $f1, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 22877.8, cumulative cost = {2081879.8 rows, 7092146.0 cpu, 0.0 io, 0.0 network, 6039739.2 memory}, id = 74011
02-02                      HashAgg(group=[{0, 1}]) : rowType = RecordType(ANY EXPR$5, DOUBLE $f1): rowcount = 22877.8, cumulative cost = {2059002.0 rows, 7092134.0 cpu, 0.0 io, 0.0 network, 6039739.2 memory}, id = 74010
02-03                        Project(EXPR$5=[$0], $f1=[CAST($1):DOUBLE]) : rowType = RecordType(ANY EXPR$5, DOUBLE $f1): rowcount = 228778.0, cumulative cost = {1830224.0 rows, 3431686.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 74009
02-04                          SelectionVectorRemover : rowType = RecordType(ANY ITEM, ANY ITEM1): rowcount = 228778.0, cumulative cost = {1601446.0 rows, 3431678.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 74008
02-05                            Filter(condition=[<>($1, 'key1')]) : rowType = RecordType(ANY ITEM, ANY ITEM1): rowcount = 228778.0, cumulative cost = {1372668.0 rows, 3202900.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 74007
02-06                              Project(ITEM=[ITEM($0, 1)], ITEM1=[ITEM($0, 0)]) : rowType = RecordType(ANY ITEM, ANY ITEM1): rowcount = 457556.0, cumulative cost = {915112.0 rows, 457564.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 74006
02-07                                Scan(groupscan=[EasyGroupScan [selectionRoot=/tmp/tblfrmJsnToCSV/0_0_0.csv, numFiles=1, columns=[`columns`[1], `columns`[0]], files=[maprfs:/tmp/tblfrmJsnToCSV/0_0_0.csv]]]) : rowType = RecordType(ANY columns): rowcount = 457556.0, cumulative cost = {457556.0 rows, 457556.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 74005

Sample data from the data file

[root@centos-01 logs]# hadoop fs -cat /tmp/tblfrmJsnToCSV/0_0_0.csv | more
key1,key2
1.2968152673E9,d
4.67365529012E7,c
9.39682065896E7,b
1.01580172933E9,d
4.98788888641E8,1
1.52391833107E8,1
7.31290386917E8,a
6.92726688161E8,d
1.12383522654E9,a
1.26807240856E8,1
9.54482542201E8,1
1.32100398388E9,0
1.17405537683E9,a
3.49879149097E7,0
6.50489380899E7,b
1.00841781109E9,a
1.19199684011E9,c
1.88765338328E8,b
8.24243579027E8,a
7.03797780195E8,b

{code}
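One way to narrow down the hang (a suggested diagnostic, not a confirmed fix) is to steer the planner away from hash aggregation entirely, so the distinct phase runs as a sort-based StreamAgg instead. planner.enable_hashagg appears in the sys.options output above; the sketch toggles it for the session only:

```sql
-- Hypothetical diagnostic: disable hash aggregation for this session and
-- re-run the hanging query. If the StreamAgg-based plan completes, the
-- hang is specific to the HashAgg operators in the plan above.
ALTER SESSION SET `planner.enable_hashagg` = false;

SELECT COUNT(*)
FROM (SELECT MAX(DISTINCT CAST(columns[0] AS DOUBLE))   maximum,
             COUNT(DISTINCT CAST(columns[0] AS DOUBLE)) count_key1,
             SUM(DISTINCT CAST(columns[0] AS DOUBLE))   sum_key1,
             MIN(DISTINCT CAST(columns[0] AS DOUBLE))   minimum,
             AVG(DISTINCT CAST(columns[0] AS DOUBLE))   average,
             columns[1]
      FROM `tblfrmJsnToCSV/0_0_0.csv`
      WHERE columns[0] <> 'key1'
      GROUP BY columns[1]);

-- Restore the default afterwards.
ALTER SESSION SET `planner.enable_hashagg` = true;
```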



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)