Posted to issues@drill.apache.org by "Khurram Faraaz (JIRA)" <ji...@apache.org> on 2016/06/21 19:01:57 UTC
[jira] [Commented] (DRILL-4387) Improve execution side when it handles skipAll query
[ https://issues.apache.org/jira/browse/DRILL-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342463#comment-15342463 ]
Khurram Faraaz commented on DRILL-4387:
---------------------------------------
The queries below return wrong results. (The problem seems to have been there for quite some time.)
{noformat}
The directory structure is:
[root@centos-01 DRILL_4589]# ls
1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014
1991 1993 1995 1997 1999 2001 2003 2005 2007 2009 2011 2013 2015
[root@centos-01 DRILL_4589]# cd 1990
[root@centos-01 1990]# ls
Q1 Q2 Q3 Q4
and so on...
The two queries below return 0. I don't think the results are correct; please review.
0: jdbc:drill:schema=dfs.tmp> select count(dir0) from `DRILL_4589`;
+---------+
| EXPR$0 |
+---------+
| 0 |
+---------+
1 row selected (9.117 seconds)
0: jdbc:drill:schema=dfs.tmp> select count(dir1) from `DRILL_4589`;
+---------+
| EXPR$0 |
+---------+
| 0 |
+---------+
1 row selected (8.97 seconds)
0: jdbc:drill:schema=dfs.tmp> explain plan for select count(dir0) from `DRILL_4589`;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 Project(EXPR$0=[$0])
00-02 Project(EXPR$0=[$0])
00-03 Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@5275c59a[columns = null, isStarQuery = false, isSkipQuery = false]])
0: jdbc:drill:schema=dfs.tmp> explain plan for select count(dir1) from `DRILL_4589`;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 Project(EXPR$0=[$0])
00-02 Project(EXPR$0=[$0])
00-03 Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@337121ac[columns = null, isStarQuery = false, isSkipQuery = false]])
{noformat}
> Improve execution side when it handles skipAll query
> ----------------------------------------------------
>
> Key: DRILL-4387
> URL: https://issues.apache.org/jira/browse/DRILL-4387
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Jinfeng Ni
> Assignee: Jinfeng Ni
> Fix For: 1.6.0
>
>
> DRILL-4279 changes the planner side and the RecordReader on the execution side when they handle a skipAll query. However, it seems there are other places in the codebase that do not handle skipAll queries efficiently. In particular, in GroupScan or ScanBatchCreator, we replace a NULL or empty column list with the star column. This essentially forces the execution side (RecordReader) to fetch all the columns from the data source. Such behavior leads to a large performance overhead for the SCAN operator.
> To improve Drill's performance, we should change those places as well, as follow-up work after DRILL-4279.
> One simple example of this problem is:
> {code}
> SELECT DISTINCT substring(dir1, 5) from dfs.`/Path/To/ParquetTable`;
> {code}
> The query does not require any regular column from the Parquet file. However, ParquetRowGroupScan and ParquetScanBatchCreator will put the star column in the column list. If the table has dozens or hundreds of columns, this makes the SCAN operator much more expensive than necessary.
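
The cost difference described in the issue can be illustrated with a small standalone sketch. This is not Drill code: the class ColumnListSketch, its methods, and the convention that null means "no projection given" while an empty list means "skip all columns" are assumptions made purely for illustration.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Drill.
public class ColumnListSketch {
    static final String STAR = "*";

    // Behavior the issue describes: a null OR empty column list is
    // replaced with the star column, so the reader fetches everything.
    static List<String> starFallback(List<String> requested) {
        if (requested == null || requested.isEmpty()) {
            return Collections.singletonList(STAR);
        }
        return requested;
    }

    // Assumed alternative: only null falls back to star; an empty list
    // is kept as-is and treated as "skip all regular columns".
    static List<String> skipAllAware(List<String> requested) {
        if (requested == null) {
            return Collections.singletonList(STAR);
        }
        return requested;
    }

    // Number of regular columns a reader would have to fetch for a
    // table with tableWidth columns under the given projection.
    static int columnsRead(List<String> projected, int tableWidth) {
        return projected.contains(STAR) ? tableWidth : projected.size();
    }

    public static void main(String[] args) {
        // A query such as SELECT DISTINCT substring(dir1, 5) needs no
        // regular column, which this sketch models as an empty list.
        List<String> none = Collections.<String>emptyList();
        System.out.println(columnsRead(starFallback(none), 200)); // prints 200
        System.out.println(columnsRead(skipAllAware(none), 200)); // prints 0
    }
}
```

With a 200-column table, the star fallback makes the scan read every column, while the skip-all-aware variant reads none, which is the overhead the issue aims to remove.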