Posted to issues@drill.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/02/25 19:58:00 UTC

[jira] [Commented] (DRILL-7601) Shift column conversion to reader from scan framework

    [ https://issues.apache.org/jira/browse/DRILL-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044783#comment-17044783 ] 

ASF GitHub Bot commented on DRILL-7601:
---------------------------------------

paul-rogers commented on pull request #1993: DRILL-7601: Shift column conversion to reader from scan framework
URL: https://github.com/apache/drill/pull/1993
 
 
   ## Description
   
   Moves scan operator type conversion code into readers and out of the scan framework.
   
   At the time we implemented provided schemas with the text reader, the best path forward appeared to be to perform column type conversions within the scan framework including deep in the column writer structure.
   
   Experience with other readers has shown that the text reader is a special case: it always writes strings, which Drill-provided converters can parse into other types. Other readers, however, are not so simple: they often have their own source structures which must be mated to a column reader, and so conversion is generally best done in the reader, where it can be specific to the nuances of each reader.
   
   Since conversion is reader-specific, and uses types known only to that one reader, it cannot be generic in the scan framework.
   
   This PR is part of a [larger project](https://github.com/paul-rogers/drill/wiki/Toward-a-Workable-Dynamic-Schema-Model) to implement an overall design for combining projection, provided schemas and reader schemas into an overall Drill schema design.
   
   A side benefit is that the column writers become simpler without the scan-specific conversion code. This helps us to use the column writers in other operators in future work.
   
   ### Schema Handling in the Scan Operator
   
   The scan schema mechanism works as follows:
   
   * The execution plan provides the projection list. Prior PRs parsed that list into a set of column-like structures.
   * A scan consists of a set of readers. Each reader works with an *input source*.
   * Each input source defines an *input schema* in some source-specific format (such as a JDBC format, a list of CSV columns, etc.)
   * The reader converts the source schema to a Drill-format *reader schema*, converting from source to Drill types as needed (a minimal reader sketch appears after this list).
   * EVF uses the projection list to decide which reader columns to project (that is, for which to create a vector to store data) and which not to project (for which it creates a dummy column writer).
   * The scan framework notices which projected columns are neither implicit columns nor provided in the reader schema. The scan framework creates "null" columns as placeholders. (This is where things often go off the rails, since the scan has no idea what type to use for the null column.)
   * The scan combines the reader schema, implicit columns, and the null columns to produce the scan's *output schema*, which is then consumed by the downstream operator.
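   To make the reader-schema and projection steps concrete, here is a minimal sketch of a reader written against EVF. The class, column names, and data are invented for illustration, and the package paths and method calls follow the usual EVF reader patterns of this era; treat the details as assumptions rather than confirmed signatures.
   
   ```java
   import org.apache.drill.common.types.TypeProtos.MinorType;
   import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
   import org.apache.drill.exec.physical.impl.scan.framework.SchemaNegotiator;
   import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
   import org.apache.drill.exec.physical.resultSet.RowSetLoader;
   import org.apache.drill.exec.record.metadata.SchemaBuilder;
   import org.apache.drill.exec.record.metadata.TupleMetadata;
   
   // Hypothetical reader showing the reader-schema step: declare a Drill-typed
   // schema up front, then let EVF apply the projection list and hand back writers.
   public class ExampleReader implements ManagedReader<SchemaNegotiator> {
   
     private ResultSetLoader loader;
   
     @Override
     public boolean open(SchemaNegotiator negotiator) {
       TupleMetadata readerSchema = new SchemaBuilder()
           .add("custId", MinorType.INT)                // source column mapped to INT
           .addNullable("custName", MinorType.VARCHAR)  // nullable source column
           .buildSchema();
       negotiator.tableSchema(readerSchema, true);      // declare the reader schema
       loader = negotiator.build();                     // EVF applies the projection list
       return true;
     }
   
     @Override
     public boolean next() {
       RowSetLoader writer = loader.writer();
       writer.start();
       writer.scalar("custId").setInt(101);             // projected or dummy: same call
       writer.scalar("custName").setString("Fred");
       writer.save();
       return false;                                    // one illustrative row only
     }
   
     @Override
     public void close() { }
   }
   ```
   
   Whether a column is projected or not, the reader writes through the same writer call; unprojected columns simply receive dummy writers.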
   
   Drill also supports a provided schema:
   
   * The execution plan can optionally include a *provided schema*. When present, this schema gives the name (case) and type to be used for each column. That is, the provided schema states the type of vectors to be produced.
   * If the provided schema is *strict*, it acts as a projection filter: any reader column not in the provided schema is treated as unprojected.
   * Data produced by the reader is converted from the type the reader produces to the type of the provided schema column. (This is the gist of the change in this PR, see below.)
   * When computing null columns, the scan operator can avoid type conflicts by using the column type from the provided schema. (A minimal provided-schema sketch appears after this list.)
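   As a hedged illustration, a provided schema is simply a Drill schema (`TupleMetadata`) that fixes the name and type of each output column, with strict mode controlled by a schema property. The column names are invented, the property key below is a placeholder rather than the actual constant, and the `setProperty()` call assumes the schema's generic property map.
   
   ```java
   import org.apache.drill.common.types.TypeProtos.MinorType;
   import org.apache.drill.exec.record.metadata.SchemaBuilder;
   import org.apache.drill.exec.record.metadata.TupleMetadata;
   
   public class ProvidedSchemaExample {
   
     public static TupleMetadata providedSchema() {
       // The provided schema fixes the name (and case) and type of each output column,
       // i.e. the type of vector the scan will produce.
       TupleMetadata schema = new SchemaBuilder()
           .add("id", MinorType.BIGINT)
           .addNullable("amount", MinorType.FLOAT8)
           .buildSchema();
   
       // Placeholder property key: in strict mode, reader columns not listed here
       // are treated as unprojected.
       schema.setProperty("drill.strict", "true");
       return schema;
     }
   }
   ```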
   
   ### Scan Schema Column Adapters
   
   This PR shifts the conversion step. Before this PR, conversion happened deep inside EVF.
   With this PR, conversion happens in the reader as part of the source-schema to reader-schema
   conversion.
   
   The idea is that each reader will use some form of *column adapter* (what we've sometimes called a *column shim*) to convert from source-specific form to Drill form. The recently-revised Avro format plugin is a great example.
   
   A column adapter conceptually has three parts:
   
   * A "front end" that obtains (or accepts) data in some source-specific way.
   * A conversion step that converts data from the source format to a Drill-compatible type.
   * A "back end" that writes the data to a vector using a Drill column writer.
   
   As it turns out, the first two steps are unique to each reader; only the back end is common.
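   The sketch below illustrates the three parts with a made-up source type (timestamps as epoch seconds); only the `ScalarWriter` back end is an existing Drill accessor interface, and the adapter class itself is hypothetical.
   
   ```java
   import org.apache.drill.exec.vector.accessor.ScalarWriter;
   
   // Hypothetical column adapter for a source that reports timestamps as epoch seconds.
   public class TimestampSecondsAdapter {
   
     // Back end: the standard Drill column writer that fills the underlying vector.
     private final ScalarWriter backEnd;
   
     public TimestampSecondsAdapter(ScalarWriter backEnd) {
       this.backEnd = backEnd;
     }
   
     // Front end: accepts the value in the source-specific form.
     public void load(long epochSeconds) {
       // Conversion step: source format to a Drill-compatible representation.
       long epochMillis = epochSeconds * 1_000L;
       // Back end: write through the column writer.
       backEnd.setLong(epochMillis);
     }
   }
   ```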
   
   In some cases, such as CSV, the "front end" can be in the form of typical Java types (`int`, `String`, etc.). In this case, we want to write to each column via a single method call. To do this, this PR adds a `ValueWriter` interface which `ScalarWriter` extends. The reader can create its own adapters by extending the `ValueWriter` interface, so the reader can freely mix "plain" writers and column converters.
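   Here is a rough sketch of that mixing idea: reader code writes each field through a `ValueWriter`, which may be either a plain `ScalarWriter` or a converter wrapping one. The sketch assumes `ValueWriter` lives alongside `ScalarWriter` in the accessor package, and the `setString()` call is an assumption modeled on `ScalarWriter`; the loader class is hypothetical.
   
   ```java
   import org.apache.drill.exec.vector.accessor.ValueWriter;
   
   // Hypothetical field loader for a CSV-style reader. Each array slot holds either
   // a plain ScalarWriter (for VARCHAR columns) or a converter (for typed columns);
   // the reader does not need to care which.
   public class CsvFieldLoader {
   
     private final ValueWriter[] colWriters;
   
     public CsvFieldLoader(ValueWriter[] colWriters) {
       this.colWriters = colWriters;
     }
   
     // Hand the raw field text to the writer; a converter parses it to the target
     // type, a plain writer simply stores it as a string.
     public void writeField(int colIndex, String fieldValue) {
       colWriters[colIndex].setString(fieldValue);
     }
   }
   ```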
   
   An obvious special case is when a reader, such as the CSV reader, wants to use a standard set of conversions. Drill already provided such conversions. They now move from the accessor package into the scan operator package because they are specific to just some readers; no other operator needs such functionality. Standard type conversions extend a new `DirectConversion` class which implements `ValueWriter`.
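   For flavor, a standard string-to-int conversion might look roughly like the stand-alone class below. The real classes extend `DirectConversion` and plug in wherever a `ValueWriter` is expected; this version is illustrative only, and the empty-string-as-null behavior is just one possible policy.
   
   ```java
   import org.apache.drill.exec.vector.accessor.ScalarWriter;
   
   // Illustrative stand-alone version of a standard conversion: parse the incoming
   // string and delegate to the underlying column writer.
   public class StringToIntConversion {
   
     private final ScalarWriter baseWriter;
   
     public StringToIntConversion(ScalarWriter baseWriter) {
       this.baseWriter = baseWriter;
     }
   
     public void setString(String value) {
       if (value == null || value.isEmpty()) {
         baseWriter.setNull();        // treat empty input as SQL NULL
       } else {
         baseWriter.setInt(Integer.parseInt(value.trim()));
       }
     }
   }
   ```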
   
   ### Text Reader
   
   Modified the text reader to perform type conversions driven by the provided schema, using the standard conversions described above. Added tests to verify operation in both the "with headers" and "without headers" cases.
   
   Note that, with a provided schema, the "without headers" case will *not* produce the single `columns` column; it will instead produce the set of columns listed in the provided schema. The provided schema must list the first *n* columns with no holes. However, there can be fields at the end of the record left out of the provided schema; these fields are ignored.
   
   The `TextParsingSettings` class and the text parser have long supported an option to trim leading and trailing white space, but the option was hard-coded off. Added a provided-schema property to allow enabling this feature for all columns in a table, or for specific columns.
   
   Also exposed the "parse unescaped quotes" option the same way.
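   To tie the text-reader behavior together, here is a hedged example of a provided schema for a headerless CSV file that also sets the two new options as schema properties. The column names and sample record are invented, and the property keys are placeholders, since the actual key names are not listed here.
   
   ```java
   import org.apache.drill.common.types.TypeProtos.MinorType;
   import org.apache.drill.exec.record.metadata.SchemaBuilder;
   import org.apache.drill.exec.record.metadata.TupleMetadata;
   
   public class HeaderlessCsvSchemaSketch {
   
     // Sample record (no header line):  10,fred ,123.45,extra1,extra2
     // The schema covers the first three fields with no holes; the two trailing
     // fields are not listed and are therefore ignored.
     public static TupleMetadata csvSchema() {
       TupleMetadata schema = new SchemaBuilder()
           .add("id", MinorType.INT)
           .add("name", MinorType.VARCHAR)
           .addNullable("amount", MinorType.FLOAT8)
           .buildSchema();
   
       // Placeholder property keys; the real names are defined by the text reader.
       schema.setProperty("text.trim-whitespace", "true");                 // trim for all columns
       schema.metadata("id").setProperty("text.trim-whitespace", "false"); // per-column override
       schema.setProperty("text.parse-unescaped-quotes", "true");
       return schema;
     }
   }
   ```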
   
   ### Other Readers
   
   Updated the Log, Avro and HDF5 plugins to insert conversions where needed.
   
   Restructured the HDF5 column adapters a bit to simplify the code.
   
   ### Other Changes
   
   * Large amount of code cleanup including standardizing names.
   * Restructure the type conversion classes to work on top of, rather than inside, the column writers.
   * Remove all the "plumbing" which passed the old converter factory down through the row set, result set loader and column writer classes.
   * Move conversion-related properties from the conversion class to the metadata classes.
   * Adjust the `StandardConversions` class to work with a provided column schema as the conversion target.
   * Added the provided schema to the schema negotiator given to each reader. This is necessary because the reader now does conversion and so needs the provided schema, if available. (A sketch appears after this list.)
   * Removes the `ProjectionSet` class which attempted to combine the projection, provided schema and type conversion into a single concept.
   * Added a replacement `ProjectionFilter` which handles only projection, based on the projection list and, optionally, the provided schema.
   * Modified the Avro `ColumnConvertersUtil` to create a new `ColumnConverterFactory` which integrates standard type conversions based on the provided schema into the Avro-to-Drill conversions.
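   As a sketch of the schema-negotiator change: a reader can now ask its negotiator for the provided schema, if any, and look up the target type for a given column. The `providedSchema()` accessor and the helper below are assumptions based on the description above, not confirmed signatures.
   
   ```java
   import org.apache.drill.exec.physical.impl.scan.framework.SchemaNegotiator;
   import org.apache.drill.exec.record.metadata.ColumnMetadata;
   import org.apache.drill.exec.record.metadata.TupleMetadata;
   
   public class ProvidedSchemaLookupSketch {
   
     // Assumed accessor: the negotiator now exposes the provided schema, if any.
     // Returns the provided column metadata for colName, or null when no provided
     // schema was given or the column is not listed.
     public static ColumnMetadata providedTypeFor(SchemaNegotiator negotiator, String colName) {
       TupleMetadata provided = negotiator.providedSchema();
       return provided == null ? null : provided.metadata(colName);
     }
   }
   ```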
   
   ### Future Steps
   
   This is a first step. A future PR will improve schema handling in the scan framework to more clearly implement the schema "pipeline" outlined above and in the referenced design. As a result, some of the schema handling code in this PR is a bit ad hoc, to keep this PR from growing even larger.
   
   ## Documentation
   
   No user-visible changes except the addition of the text reader's provided-schema properties.
   The documentation should be updated to clearly explain these properties (along with the two new ones added here).
   
   ## Testing
   
   Removed obsolete tests. Added several new tests. Reran the entire unit test suite. Fixed issues in several readers (see above) resulting from the changes in this PR.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Shift column conversion to reader from scan framework
> -----------------------------------------------------
>
>                 Key: DRILL-7601
>                 URL: https://issues.apache.org/jira/browse/DRILL-7601
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.17.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>             Fix For: 1.18.0
>
>
> At the time we implemented provided schemas with the text reader, the best path forward appeared to be to perform column type conversions within the scan framework including deep in the column writer structure.
> Experience with other readers has shown that the text reader is a special case: it always writes strings, which Drill-provided converters can parse into other types. Other readers, however, are not so simple: they often have their own source structures which must be mated to a column reader, and so conversion is generally best done in the reader, where it can be specific to the nuances of each reader.
> This ticket asks to restructure the conversion code to fit the reader-does-conversion pattern.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)