Posted to issues@drill.apache.org by "Arina Ielchiieva (JIRA)" <ji...@apache.org> on 2017/05/26 08:48:04 UTC
[jira] [Comment Edited] (DRILL-5542) Scan unnecessary adds implicit
columns to ScanRecordBatch for select * query
[ https://issues.apache.org/jira/browse/DRILL-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026010#comment-16026010 ]
Arina Ielchiieva edited comment on DRILL-5542 at 5/26/17 8:47 AM:
------------------------------------------------------------------
Implicit columns were added to ScanBatch for star queries to handle queries such as the following:
{noformat}
select * from t where fqn = 'abc';
select *, fqn from t;
{noformat}
In such cases Drill simplifies the list of columns to `*`, so we don't know whether implicit columns are needed. That's why they were added to ScanBatch and then removed in ProjectBatch if they were not needed.
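To make the flow concrete, here is a minimal sketch of the behavior described above. It is a toy model, not Drill's actual code: the class `StarQueryImplicitColumns` and the methods `scanSchema` and `projectSchema` are hypothetical names, and schemas are reduced to plain lists of column names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StarQueryImplicitColumns {
    // Implicit column names as listed in the DRILL-5542 description.
    static final List<String> IMPLICIT = List.of("fqn", "filepath", "filename", "suffix");

    // Hypothetical scan step: for a star query the scan cannot tell which
    // implicit columns are needed, so it appends all of them.
    static List<String> scanSchema(List<String> tableColumns) {
        List<String> schema = new ArrayList<>(tableColumns);
        schema.addAll(IMPLICIT);
        return schema;
    }

    // Hypothetical project step: keep every table column, and keep an
    // implicit column only if the query referenced it explicitly.
    static List<String> projectSchema(List<String> scanSchema, Set<String> referenced) {
        List<String> out = new ArrayList<>();
        for (String col : scanSchema) {
            if (!IMPLICIT.contains(col) || referenced.contains(col)) {
                out.add(col);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> scanned = scanSchema(List.of("a"));
        System.out.println("scan:                  " + scanned);
        System.out.println("select * from t:       " + projectSchema(scanned, Set.of()));
        System.out.println("select *, fqn from t:  " + projectSchema(scanned, Set.of("fqn")));
    }
}
```

In this model the scan always carries the four extra columns, and only the project step knows which of them survive, which is the overhead the issue is about.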
> Scan unnecessary adds implicit columns to ScanRecordBatch for select * query
> ----------------------------------------------------------------------------
>
> Key: DRILL-5542
> URL: https://issues.apache.org/jira/browse/DRILL-5542
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators
> Reporter: Jinfeng Ni
>
> It seems that Drill adds several implicit columns (`fqn`, `filepath`, `filename`, `suffix`) to ScanBatch even when no downstream operator requires them. Although those implicit columns are dropped later on, adding them increases both memory and CPU overhead.
> 1. JSON
> {code}
> {a: 100}
> {code}
> {code}
> select * from dfs.tmp.`1.json`;
> +------+
> | a |
> +------+
> | 100 |
> +------+
> {code}
> The schema of ScanRecordBatch:
> {code}
> [ schema:
> BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)], selectionVector=NONE],
> {code}
> 2. Parquet
> {code}
> select * from cp.`tpch/nation.parquet`;
> +--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
> | n_nationkey | n_name | n_regionkey | n_comment |
> +--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
> | 0 | ALGERIA | 0 | haggle. carefully final deposits detect slyly agai |
> ...
> {code}
> The schema of ScanRecordBatch:
> {code}
> schema:
> BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED), n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED), fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
> {code}
> 3. Text
> {code}
> cat 1.csv
> a, b, c
> select * from dfs.tmp.`1.csv`;
> +----------------+
> | columns |
> +----------------+
> | ["a","b","c"] |
> +----------------+
> {code}
> The schema of ScanRecordBatch:
> {code}
> schema:
> BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)], fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
> {code}
> If implicit columns are not part of the result of a `select *` query, then the Scan operator should not populate them.
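One way to picture the requested behavior is the sketch below. It assumes the explicitly referenced column names (from the select list and from filters) are still available when the scan is set up, which is not the case once the column list has been simplified to `*`. The class `ImplicitColumnPruning` and the method `neededImplicitColumns` are hypothetical and do not correspond to any Drill API.

```java
import java.util.ArrayList;
import java.util.List;

public class ImplicitColumnPruning {
    // Implicit column names as listed in the issue description.
    static final List<String> IMPLICIT = List.of("fqn", "filepath", "filename", "suffix");

    // Returns the implicit columns the scan would have to populate, given
    // every column name the query references explicitly. The "*" entry
    // matches no implicit column, so a plain star query yields an empty list.
    static List<String> neededImplicitColumns(List<String> referencedColumns) {
        List<String> needed = new ArrayList<>();
        for (String implicit : IMPLICIT) {
            if (referencedColumns.contains(implicit)) {
                needed.add(implicit);
            }
        }
        return needed;
    }

    public static void main(String[] args) {
        // select * from t
        System.out.println(neededImplicitColumns(List.of("*")));
        // select *, fqn from t   (or: select * from t where fqn = 'abc')
        System.out.println(neededImplicitColumns(List.of("*", "fqn")));
    }
}
```

Under this assumption a plain star query would scan no implicit columns at all, while the two queries from the comment would still get `fqn`.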
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)