You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Khurram Faraaz (JIRA)" <ji...@apache.org> on 2015/06/10 08:17:00 UTC

[jira] [Updated] (DRILL-3269) Window function query takes too long to complete and return results

     [ https://issues.apache.org/jira/browse/DRILL-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Khurram Faraaz updated DRILL-3269:
----------------------------------
    Assignee: Deneche A. Hakim  (was: Chris Westin)

> Window function query takes too long to complete and return results
> -------------------------------------------------------------------
>
>                 Key: DRILL-3269
>                 URL: https://issues.apache.org/jira/browse/DRILL-3269
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.0.0
>         Environment: 1de6aed93efce8a524964371d96673b8ef192d89
>            Reporter: Khurram Faraaz
>            Assignee: Deneche A. Hakim
>            Priority: Minor
>
> Query that uses window functions takes too long to complete and return results. It returns close to a million records, for which it took 533.8 seconds ~8 minutes
> Input CSV file has two columns, one integer and another varchar type column. Please take a look.
> Size of the input CSV file
> root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
> -rwxr-xr-x   3 root root   27889455 2015-06-10 01:26 /tmp/manyDuplicates.csv
> {code}
> select count(*) over(partition by cast(columns[1] as varchar(25)) order by cast(columns[0] as bigint)) from `manyDuplicates.csv`;
> ...
> 1,000,007 rows selected (533.857 seconds)
> {code}
> There are five distinct values in columns[1] in the CSV file. = [FIVE PARTITIONS]
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from `manyDuplicates.csv`;
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | FFFFGGGGHHHHIIIIJJJJ  |
> | PPPPQQQQRRRRSSSSTTTT  |
> | AAAABBBBCCCCDDDDEEEE  |
> | UUUUVVVVWWWWXXXXZZZZ  |
> | KKKKLLLLMMMMNNNNOOOO  |
> +-----------------------+
> 5 rows selected (1.906 seconds)
> {code}
> Here is the count for each of those values in columns[1]
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';
> +---------+
> | EXPR$0  |
> +---------+
> | 200484  |
> +---------+
> 1 row selected (0.961 seconds)
> {code}
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';
> +---------+
> | EXPR$0  |
> +---------+
> | 199353  |
> +---------+
> 1 row selected (0.86 seconds)
> {code}
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';
> +---------+
> | EXPR$0  |
> +---------+
> | 200702  |
> +---------+
> 1 row selected (0.826 seconds)
> {code}
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';
> +---------+
> | EXPR$0  |
> +---------+
> | 199916  |
> +---------+
> 1 row selected (0.851 seconds)
> {code}
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';
> +---------+
> | EXPR$0  |
> +---------+
> | 199552  |
> +---------+
> 1 row selected (0.827 seconds)
> {code}
> Query plan for the long running query
> {code}
> | 00-00    Screen
> 00-01      UnionExchange
> 01-01        Project(EXPR$0=[$0])
> 01-02          Project($0=[$2])
> 01-03            Window(window#0=[window(partition {1} order by [0] range between UNBOUNDED PRECEDING and CURRENT ROW aggs [COUNT()])])
> 01-04              SelectionVectorRemover
> 01-05                Sort(sort0=[$1], sort1=[$0], dir0=[ASC], dir1=[ASC])
> 01-06                  Project($0=[$0], $1=[$1])
> 01-07                    HashToRandomExchange(dist0=[[$1]])
> 02-01                      UnorderedMuxExchange
> 03-01                        Project($0=[$0], $1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1))])
> 03-02                          Project($0=[CAST(ITEM($0, 0)):BIGINT], $1=[CAST(ITEM($0, 1)):VARCHAR(25) CHARACTER SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"])
> 03-03                            Scan(groupscan=[EasyGroupScan [selectionRoot=/tmp/manyDuplicates.csv, numFiles=1, columns=[`columns`[0], `columns`[1]], files=[maprfs:///tmp/manyDuplicates.csv]]])
> {code}
> python script to generate data in CSV format
> {code}
> import random
> f = open('/Users/kungfo/manyDuplicates.csv', 'a')
> for i in range(1,000000):
>     f.write(str(random.choice(xrange(1,1000000)))+','+str(random.choice(['AAAABBBBCCCCDDDDEEEE','FFFFGGGGHHHHIIIIJJJJ','KKKKLLLLMMMMNNNNOOOO','PPPPQQQQRRRRSSSSTTTT','UUUUVVVVWWWWXXXXZZZZ']))+'\n')
>     f.flush()
>     
>     
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)