Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/05/26 03:25:04 UTC

[jira] [Commented] (DRILL-5542) Scan unnecessary adds implicit columns to ScanRecordBatch for select * query

    [ https://issues.apache.org/jira/browse/DRILL-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025745#comment-16025745 ] 

Paul Rogers commented on DRILL-5542:
------------------------------------

Thanks for tracking this down!

I wonder, how does the downstream operator know to remove the implicit columns? There is nothing in the column name or (it seems) physical plan to identify those columns as implicit. In the example for CSV, say, how would the downstream know that "columns" is OK, but "fqn" is not? Is this hard-coded somewhere?

If hardcoded, how does it know to pass along the "fqn" when it is requested?

In any event, for the readers that DRILL-5211 touches, I will address the issue in the revised scan batch code. The remaining readers will need to be fixed by others.
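
To make the idea concrete, here is a minimal sketch of how a scan could decide which implicit columns to populate from the projection list alone. It is not Drill's actual code: the class and method names are hypothetical, the implicit column names are the four listed in the issue description, and the case-insensitive name match is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Drill: illustrates the selection logic only.
class ImplicitColumnSketch {
    // Implicit column names taken from the issue description.
    static final List<String> IMPLICIT_COLUMNS =
            Arrays.asList("fqn", "filepath", "filename", "suffix");

    // Returns the implicit columns that the query names explicitly.
    // A wildcard ("*") matches no implicit column name, so it selects none.
    static List<String> implicitColumnsToPopulate(List<String> projected) {
        List<String> needed = new ArrayList<>();
        for (String implicit : IMPLICIT_COLUMNS) {
            for (String col : projected) {
                // Assumption: column names are compared case-insensitively.
                if (implicit.equalsIgnoreCase(col)) {
                    needed.add(implicit);
                    break;
                }
            }
        }
        return needed;
    }

    public static void main(String[] args) {
        // select * ...
        System.out.println(implicitColumnsToPopulate(Arrays.asList("*")));              // prints []
        // select columns, fqn ...
        System.out.println(implicitColumnsToPopulate(Arrays.asList("columns", "fqn"))); // prints [fqn]
    }
}
```

With a rule like this, the scan itself would carry the knowledge of which names are implicit, and a downstream operator would never need to strip them.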

> Scan unnecessary adds implicit columns to ScanRecordBatch for select * query
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-5542
>                 URL: https://issues.apache.org/jira/browse/DRILL-5542
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>            Reporter: Jinfeng Ni
>
> It seems that Drill adds several implicit columns (`fqn`, `filepath`, `filename`, `suffix`) to ScanBatch even when no downstream operator requires them. Although those implicit columns are dropped later on, populating them increases both memory and CPU overhead.
> 1. JSON
> ```
> {a: 100}
> ```
> {code}
> select * from dfs.tmp.`1.json`;
> +------+
> |  a   |
> +------+
> | 100  |
> +------+
> {code}
> The schema from ScanRecordBatch is :
> {code}
> [ schema:
>     BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)], selectionVector=NONE], 
>  {code}
> 2. Parquet
> {code}
> select * from cp.`tpch/nation.parquet`;
> +--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
> | n_nationkey  |     n_name      | n_regionkey  |                                                      n_comment                                                      |
> +--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
> | 0            | ALGERIA         | 0            |  haggle. carefully final deposits detect slyly agai                                                                 |
> ...
> {code}
> The schema of ScanRecordBatch:
> {code}
>   schema:
>     BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED), n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED), fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE], 
> {code}
> 3. Text
> {code}
> cat 1.csv
> a, b, c
> select * from dfs.tmp.`1.csv`;
> +----------------+
> |    columns     |
> +----------------+
> | ["a","b","c"]  |
> +----------------+
> {code}
> Schema of ScanRecordBatch 
> {code}
>   schema:
>     BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)], fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE], 
> {code}
> If implicit columns are not part of the query result of a `select *` query, then the Scan operator should not populate those implicit columns.


