Posted to issues@drill.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/01/03 01:13:00 UTC

[jira] [Commented] (DRILL-8085) EVF V2 support in the "Easy" format plugin

    [ https://issues.apache.org/jira/browse/DRILL-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17467742#comment-17467742 ] 

ASF GitHub Bot commented on DRILL-8085:
---------------------------------------

paul-rogers opened a new pull request #2419:
URL: https://github.com/apache/drill/pull/2419


   # [DRILL-8085](https://issues.apache.org/jira/browse/DRILL-8085): EVF V2 support in the "Easy" format plugin
   
   ## Description
   
   Provides an easy integration between the "Easy" format plugin and the "EVF V2" structure, similar to what already exists for EVF V1. Revises the limit push-down approach. Fixes the JSON writer to handle `TIMESTAMP` consistently with the reader (so certain tests pass on a non-UTC build machine). Numerous other bits of cleanup.
   
   ## Design Overview
   
   Prior work added the "EVF" version 2 framework. Briefly, the original "extended vector framework" (EVF) started with the vector accessor layer to control vector sizes, then added the "result set loader" to provide an easy interface for readers to write to vectors using the accessors. This still left a large amount of plumbing to work out the "schema-on-read" set of columns to use. "EVF V1" fumbled its way to a solution, but the result was complex and limited. EVF V2 reworks the mechanism to cleanly support schema-on-read, a provided schema, an up-front schema defined by the data source, and the set of columns requested in the query. Figuring that out is amazingly complex (which should tell us something about the wisdom of Drill's current schema-on-read approach...)
   
   The EVF V1 work split the scan operation into a set of layers, which are preserved in V2:
   
   * The Volcano operator (AKA the poorly-named "record batch"): `OperatorRecordBatch`
   * The scan operator (`ScanOperatorExec`), which iterates over the readers within a scan, and issues "events" to a listener: `ScanOperatorEvents`.
   * The scan and reader event listeners for EVF V2. (The old listeners, for EVF V1, also still exist.)
   * The format-specific plugins and readers.
   * The `ResultSetLoader`, `ColumnWriter` and `ValueVector` abstractions.
   
   The first two layers are meant to point the way to restructuring Drill's overly complex record batches by dividing up responsibilities. So far, only the scan has been "modernized" this way.
   
   The gist of this change is to modify the `EasyFormatPlugin` to add a third way to build the readers. EVF V2 is simpler than V1, and far simpler than the "classic" do-it-all-yourself approach. A setting in the `EasyFormatConfig` class says which to use.
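
   To make that concrete, here is a minimal, self-contained sketch of the idea: the format config carries a scan-framework version, and the plugin dispatches to the matching scan builder. The class, enum, and method names below are illustrative stand-ins, not the actual Drill classes; see `EasyFormatConfig` in the PR for the real setting.

   ```java
   // Illustrative sketch only: a config-driven choice among the three scan paths.
   enum ScanVersion { CLASSIC, EVF_V1, EVF_V2 }

   class EasyConfigSketch {
     final ScanVersion scanVersion;
     EasyConfigSketch(ScanVersion scanVersion) { this.scanVersion = scanVersion; }
   }

   class EasyPluginSketch {
     String buildScan(EasyConfigSketch config) {
       switch (config.scanVersion) {
         case CLASSIC: return buildClassicScan(); // original do-it-all-yourself path
         case EVF_V1:  return buildEvf1Scan();    // existing EVF V1 path
         case EVF_V2:  return buildEvf2Scan();    // new, third path added by this PR
         default:      throw new IllegalStateException("Unknown scan version");
       }
     }
     private String buildClassicScan() { return "classic scan"; }
     private String buildEvf1Scan()    { return "EVF V1 scan"; }
     private String buildEvf2Scan()    { return "EVF V2 scan"; }
   }
   ```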
   
   ## Limit Pushdown
   
   [DRILL-7763](https://issues.apache.org/jira/browse/DRILL-7763) added `LIMIT` push-down to the file format readers. However, this work was limited in three ways:
   
   * The limit work was added to each reader by counting records in the read-each-record loop.
   * The limit was not applied across multiple readers in the same scan. If the limit is 50 and the scan has two readers, R1 with 40 rows and R2 with 100 rows, then we want to limit R2 to 50 - 40 = 10 rows (see the sketch after this list).
   * The code used `maxRecords=0` to mean "no limit." However, `LIMIT 0` is commonly used by BI tools to obtain just a schema for a query. (Much work was done back in 2015 to enable this feature throughout Drill.)
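
   The per-reader arithmetic in the second bullet is simple bookkeeping. Here is a minimal sketch of that arithmetic (illustrative only; with this PR, EVF V2 does this inside the scan framework rather than in each reader):

   ```java
   // Walk the readers, capping each at whatever remains of the scan-level limit.
   int scanLimit = 50;
   int rowsReturned = 0;
   int[] readerRowCounts = {40, 100};      // R1 has 40 rows, R2 has 100
   for (int available : readerRowCounts) {
     int remaining = scanLimit - rowsReturned;
     if (remaining <= 0) {
       break;                              // limit satisfied; skip remaining readers
     }
     int toRead = Math.min(available, remaining);
     rowsReturned += toRead;               // R1 contributes 40 rows, R2 only 10
   }
   ```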
   
   As it turns out, the `ResultSetLoader` already does most of this work to enforce its various batch size limits. The cross-reader work should be done in the scan framework. This PR adds this functionality so that the limit is passed from the plan into the EVF V2 mechanism, which automagically computes and enforces the per-reader limit. The `RowReader` will simply report "OK, stop reading, the batch is full" when the limit is reached. The scan framework detects that the batch was full because of the limit, and won't call your reader for another batch. `LIMIT 0` now works properly as well.
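
   From the reader's point of view, the loop looks like the hedged sketch below. The writer calls (`start()`, `save()`, `isFull()`) follow the EVF row-writer API; the `parser` object and the overall shape are illustrative assumptions, not code from this PR.

   ```java
   // Sketch of an EVF V2 reader batch loop. The framework, not the reader,
   // decides when a batch is full, whether due to batch size or to the
   // pushed-down limit.
   public boolean next() {
     while (!rowWriter.isFull()) {    // turns true when the batch (or limit) is full
       if (!parser.hasNext()) {
         return false;                // end of input: this is the last batch
       }
       rowWriter.start();             // begin a new row
       parser.writeRow(rowWriter);    // set column values via column writers
       rowWriter.save();              // commit the row
     }
     return true;                     // batch full; the framework decides what's next
   }
   ```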
   
   As a result, to get full limit support, all a reader has to do is use EVF V2; nothing else is needed. Readers that rolled their own limit support should rip out that code (that is, revert DRILL-7763). See the `HttpdLogFormatPlugin` example (below) for what to do.
   
   As a side effect, the JSON field for the limit was renamed from `maxRecords` to `limit` so that its meaning is clear.
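
   The rename is a small change to the Jackson-serialized plan node. An illustrative sketch only, not the actual sub-scan class:

   ```java
   import com.fasterxml.jackson.annotation.JsonProperty;

   // Illustrative only: how the limit field appears in the serialized plan.
   class SubScanSketch {
     @JsonProperty("limit")   // was "maxRecords" before this PR
     public int limit;
   }
   ```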
   
   ## Examples
   
   This PR includes conversion of the "compliant text" (AKA "CSV") reader to use EVF V2. However, that reader is quite complex and is not a good model to copy. So this PR also converts the `HttpdLogFormatPlugin`, which is a much better example for others to follow.
   
   ## Next Steps
   
   Once this work is merged, I encourage the team to migrate the other plugins that use EVF V1 to use V2. The code will be simpler and you get `LIMIT` support for free. The readers will also automagically get support for a provided schema.
   
   Once all plugins are converted, the next step will be to rip out the EVF V1 code, since it is so complex that even I (as the author) have to take lots of time to understand it. Good riddance. "Build one to throw away" as Brooks said...
   
   ## Also includes
   
   * [DRILL-8100](https://issues.apache.org/jira/browse/DRILL-8100): JSON record writer does not convert Drill local timestamp to UTC.
   * [DRILL-8086](https://issues.apache.org/jira/browse/DRILL-8086): Convert the CSV (AKA "compliant text") reader to EVF V2.
   * [DRILL-8084](https://issues.apache.org/jira/browse/DRILL-8084): Scan LIMIT pushdown fails across files.
   * Clean-up of several flaky tests.
   
   `TestHttpPlugin` chose port 8091, which collides with other software (in this case, Druid). Ad-hoc uses should pick higher port numbers, so the port was changed to 44332. (It should ideally be selected at random on server start; see the sketch below.)
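
   Letting the OS pick a free ephemeral port is one common approach. A sketch of that technique (not what the test does today), assuming the server port is configurable at startup:

   ```java
   import java.io.IOException;
   import java.net.ServerSocket;

   // Bind to port 0 to let the OS assign any free port, then release it.
   // (There is a small race before the caller rebinds, but it works well in tests.)
   static int freePort() throws IOException {
     try (ServerSocket socket = new ServerSocket(0)) {
       return socket.getLocalPort();
     }
   }
   ```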
   
   ## `TIMESTAMP`
   
   I struggled with [DRILL-8101](https://issues.apache.org/jira/browse/DRILL-8101): Resolve the TIMESTAMP madness. Drill treats the `TIMESTAMP` type as both a `DATETIME` (no time zone) and a Unix-like timestamp (UTC time zone). The tests are playing whack-a-mole as these conflicting interpretations collide: some tests work only in the UTC time zone (but my machine is in the Pacific zone).
   
   This PR fixes the JSON writer so it converts from Drill's local-time `TIMESTAMP` to UTC, to "undo" the UTC-to-local conversion done on read. However, this just caused other tests to fail, since Parquet assumes that `TIMESTAMP` is either UTC or an unspecified time zone (so Parquet can assume UTC if it wants). As a result, certain CTAS tests that use both Parquet and JSON fail unless they run in UTC. Since the problem is too big to fix in this PR, I disabled those tests. (The PR contains some clean-up from my attempt to find and understand the problem.)
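
   The write-side fix amounts to shifting the stored millis back to UTC. Drill uses Joda-Time; the helper below is a hedged sketch of that conversion, not the actual writer code.

   ```java
   import org.joda.time.DateTimeZone;

   // Drill's TIMESTAMP vector holds local wall-clock millis: the JSON reader
   // shifts UTC to local on read, so the writer must shift local back to UTC.
   static long toUtcMillis(long localMillis) {
     // 'false' resolves ambiguous or skipped local times rather than throwing
     return DateTimeZone.getDefault().convertLocalToUTC(localMillis, false);
   }
   ```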
   
   ## Documentation
   
   No user-visible changes. 
   
   ## Testing
   
   Added unit tests for new functionality. Revised existing tests where needed, or where they were flaky on my machine.
   


> EVF V2 support in the "Easy" format plugin
> ------------------------------------------
>
>                 Key: DRILL-8085
>                 URL: https://issues.apache.org/jira/browse/DRILL-8085
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.19.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> Add support for EVF V2 to the {{EasyFormatPlugin}} similar to how EVF V1 support already exists. Provide examples for others to follow.


