Posted to issues@drill.apache.org by "Paul Rogers (Jira)" <ji...@apache.org> on 2021/12/19 23:05:00 UTC

[jira] [Created] (DRILL-8084) Scan LIMIT pushdown fails across files

Paul Rogers created DRILL-8084:
----------------------------------

             Summary: Scan LIMIT pushdown fails across files
                 Key: DRILL-8084
                 URL: https://issues.apache.org/jira/browse/DRILL-8084
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.19.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


DRILL-7763 apparently added LIMIT pushdown to the file format plugins, which is a nice improvement. Unfortunately, the implementation only works for a scan of a single file: the limit is applied to each file independently. The correct implementation is to apply the limit to the _scan_, not to each _file_.

Further, {{LIMIT 0}} has a meaning: it asks to return a schema with no data. However, the implementation uses {{maxRecords == 0}} to mean no limit, and a bit of code explicitly changes {{LIMIT 0}} to {{LIMIT 1}} so that "we read at least one file".

Consider an example: two files, A and B, each of which has 10 records:
 * {{LIMIT 0}}: Obtain the schema from A, read no data from A. Do not open B. The current code changes {{LIMIT 0}} to {{LIMIT 1}}, thus returning data.
 * {{LIMIT 1}}: Read one record from A, none from B. (Don't even open B.) The current code will read 1 record from A and another from B.
 * {{LIMIT 15}}: Read all 10 records from A, and only 5 from B. The current code applies the limit of 15 to both files, thus reading 20 records.

The correct solution is to manage the {{LIMIT}} at the scan level. As each file completes, subtract the returned row count from the limit applied to the next file.
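As a rough illustration of that bookkeeping (two 10-row files and {{LIMIT 15}}, matching the example above), here is a minimal sketch. All names in it are hypothetical; this is not Drill's actual EVF or scan code:

{code:java}
// Illustrative sketch only; names here are hypothetical, not Drill's classes.
// The scan tracks the rows still allowed and hands each file only what remains.
public class ScanLevelLimitSketch {

  public static void main(String[] args) {
    int[] fileRowCounts = {10, 10};   // files A and B, 10 records each
    long limit = 15;                  // LIMIT 15 from the query

    long remaining = limit;
    for (int i = 0; i < fileRowCounts.length; i++) {
      if (remaining <= 0 && i > 0) {
        break;                        // later files are never opened
      }
      // A real scan opens the first file even for LIMIT 0, to obtain the schema.
      long read = Math.min(fileRowCounts[i], Math.max(remaining, 0));
      remaining -= read;              // subtract what this file returned
      System.out.println("File " + (char) ('A' + i) + ": read " + read + " rows");
    }
    // With LIMIT 15: A contributes 10 rows, B contributes 5, for a total of 15.
    // With LIMIT 0: A is opened for schema only, and B is never opened.
  }
}
{code}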

And, at the file level, there is no need for each reader to count its records and check the limit on every row read. The "result set loader" already checks batch limits; it is the natural place to check the overall limit as well.
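To illustrate that single enforcement point, here is a minimal sketch; {{RowSource}} and {{LimitedLoader}} are hypothetical stand-ins, not Drill's actual result set loader API:

{code:java}
// Hypothetical sketch: one enforcement point for the per-file limit, analogous
// in spirit to how a result set loader already caps the rows in a batch.
// None of these names come from Drill's code base.
import java.util.List;

interface RowSource {
  boolean next();       // advance to the next input row; false at end of file
  Object[] row();       // values of the current row
}

class LimitedLoader {
  private final long fileLimit;   // rows this file may still contribute to the scan
  private long rowsLoaded;

  LimitedLoader(long fileLimit) {
    this.fileLimit = fileLimit;
  }

  /** Loads rows until the limit or EOF; the reader itself never counts rows. */
  long load(RowSource source, List<Object[]> batch) {
    while (rowsLoaded < fileLimit && source.next()) {
      batch.add(source.row());
      rowsLoaded++;               // the limit is checked here, once, per row
    }
    return rowsLoaded;
  }
}
{code}

In this arrangement the reader only produces rows; the loader stops asking for them once the file's share of the scan-level limit is exhausted.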

For this reason, the V2 EVF scan framework has been extended to manage the scan-level part, and the "result set loader" has been extended to enforce the per-file limit. The result is that readers need do...absolutely nothing; {{LIMIT}} pushdown is automatic.

EVF V1 has also been extended, but is less thoroughly tested since the desired path is to upgrade all readers to use EVF V2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)