Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/05/01 02:33:00 UTC
[jira] [Commented] (DRILL-6373) Refactor the Result Set Loader to prepare for Union, List support

    [ https://issues.apache.org/jira/browse/DRILL-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459402#comment-16459402 ] 

ASF GitHub Bot commented on DRILL-6373:
---------------------------------------

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/1244

    DRILL-6373: Refactor Result Set Loader for Union, List support

    This PR builds on the previous refactoring of the column accessors to prepare for Union, (non-repeated) List and Repeated List support. The PR includes four closely related changes divided across four commits:
    
    ### Correct the Type of the Data Vector in a Nullable Vector
    
    The nullable vectors contain a "bits" vector and a "data" vector. The data vector has historically been created using the same `MaterializedField` as the nullable vector, meaning that the data vector is labeled as "nullable" even though it has no bits vector.
    
    This PR creates a clone `MaterializedField` with the same name as the outer nullable vector, but with a Required type.
    
    This change ensures that the overflow logic works correctly as it uses the vector metadata (in the `MaterializedField`) to know what kind of vector to create for the "lookahead" vector.
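
As a rough illustration of the metadata change described above, here is a minimal sketch. It does not use Drill's real `MaterializedField` API: the `Field` class, the `Mode` enum and the `dataVectorField` helper are hypothetical stand-ins. It only shows the idea of cloning the outer nullable field's metadata with the same name but a Required mode for the inner data vector.

```java
/** Hypothetical, simplified model of column metadata; not Drill's MaterializedField API. */
final class NullableFieldSketch {

    /** Column cardinality, loosely modeled on the idea of a data mode. */
    enum Mode { REQUIRED, OPTIONAL, REPEATED }

    /** Immutable column descriptor: name, type name and mode. */
    static final class Field {
        final String name;
        final String type;
        final Mode mode;

        Field(String name, String type, Mode mode) {
            this.name = name;
            this.type = type;
            this.mode = mode;
        }

        /** Returns a copy with the same name and type but a different mode. */
        Field withMode(Mode newMode) {
            return new Field(name, type, newMode);
        }

        @Override
        public String toString() {
            return name + ":" + type + ":" + mode;
        }
    }

    /** Metadata for the inner "data" vector: same name as the outer field, Required mode. */
    static Field dataVectorField(Field nullableField) {
        if (nullableField.mode != Mode.OPTIONAL) {
            throw new IllegalArgumentException("Expected a nullable (OPTIONAL) field");
        }
        return nullableField.withMode(Mode.REQUIRED);
    }

    public static void main(String[] args) {
        Field outer = new Field("a", "INT", Mode.OPTIONAL);
        Field inner = dataVectorField(outer);
        System.out.println(outer + " -> " + inner);
    }
}
```

With metadata like this, code that inspects the inner field (for example, to decide what kind of "lookahead" vector to create on overflow) sees a Required column rather than a nullable one.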
    
    ### Result Set Loader Refactor
    
    The second commit pretty much just rearranges the deck chairs so that we can slot in the new types in the next PR. The need for the changes can be seen in the full code set (the union and list support was pulled out for this PR.)
    
    A union is a container, like a map, so the tuple state was refactored to create a common parent container state.
    
    Lists and unions are very complex to build, so the code to build the internal workings of each vector was pulled out into a separate builder class.
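
The "common parent container state" idea can be pictured with a small sketch. The class names below (`ContainerState`, `TupleState`, `UnionState`) and their methods are illustrative assumptions and are not taken from the PR's code. The sketch assumes that a map identifies its members by name, while a union holds at most one member per type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a shared parent for container-like states; not the PR's classes. */
final class ContainerSketch {

    /** Shared parent: any state object that owns child members. */
    abstract static class ContainerState {
        private final Map<String, String> members = new LinkedHashMap<>();

        /** Key under which a member is stored; differs per container kind. */
        abstract String keyFor(String name, String type);

        /** Adds a member, rejecting a second member with the same key. */
        final void add(String name, String type) {
            String key = keyFor(name, type);
            if (members.containsKey(key)) {
                throw new IllegalArgumentException("Duplicate member: " + key);
            }
            members.put(key, type);
        }

        final boolean has(String key) {
            return members.containsKey(key);
        }

        final int size() {
            return members.size();
        }
    }

    /** Map (tuple): members are identified by column name. */
    static final class TupleState extends ContainerState {
        @Override
        String keyFor(String name, String type) {
            return name;
        }
    }

    /** Union: members are identified by type, at most one per type (an assumption of this sketch). */
    static final class UnionState extends ContainerState {
        @Override
        String keyFor(String name, String type) {
            return type;
        }
    }

    public static void main(String[] args) {
        ContainerState tuple = new TupleState();
        tuple.add("a", "INT");
        tuple.add("b", "INT");
        ContainerState union = new UnionState();
        union.add("x", "INT");
        union.add("y", "VARCHAR");
        System.out.println(tuple.size() + " tuple members, " + union.size() + " union members");
    }
}
```

The shared bookkeeping lives in the parent, so each container kind only has to say how its members are keyed.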
    
    ### Projection Handling and the Vector Cache
    
    Previous versions of the result set loader handled projection and a cache for vectors reused across readers in the same Scan operator. Once we introduce nested maps, projection within maps, unions and lists, projection gets much more complex, as does vector caching.
    
    This PR adds logic to support projection and vector caching to any arbitrary level of maps. It turns out that handling projection of an entire map, and projection of fields within maps, is far more complex than you'd think, requiring quite a bit of internal state to keep everything straight. The result is that we can now handle a map `m` with three fields `{a, b, c}` and project just one of them, `m.a`, say.
    
    Further, Drill allows projection of non-existent columns. So, we might ask for field `m.d` which does not exist in the above map. The projection mechanism handles this case as well, creating the right kind of null column.
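
To make the `m.a` and `m.d` cases above concrete, here is a minimal sketch of a projection tree. It is not the PR's implementation: `ProjectionSketch`, `isProjected` and `missingChildren` are hypothetical names, and the sketch ignores details a real implementation would need (such as name matching rules, arrays and wildcards). It parses dotted paths into a tree, answers whether a given column is projected, and lists projected members that the reader's map does not contain, which are the candidates for null columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of nested projection; not the result set loader's actual classes. */
final class ProjectionSketch {

    /** One node per name that appears in a projected path. */
    private static final class Node {
        final Map<String, Node> children = new LinkedHashMap<>();
        boolean projectAll; // true when a path ends here, e.g. "m" or "m.a"
    }

    private final Node root = new Node();

    ProjectionSketch(List<String> paths) {
        for (String path : paths) {
            Node node = root;
            for (String part : path.split("\\.")) {
                node = node.children.computeIfAbsent(part, k -> new Node());
            }
            node.projectAll = true;
        }
    }

    /** True if the column at the dotted path is needed by the projection list. */
    boolean isProjected(String path) {
        Node node = root;
        for (String part : path.split("\\.")) {
            if (node.projectAll) {
                return true; // an ancestor was projected as a whole
            }
            node = node.children.get(part);
            if (node == null) {
                return false;
            }
        }
        return true;
    }

    /** Projected members of a map that the reader did not provide (null-column candidates). */
    List<String> missingChildren(String mapPath, Set<String> existing) {
        Node node = root;
        for (String part : mapPath.split("\\.")) {
            node = node.children.get(part);
            if (node == null) {
                return Collections.emptyList();
            }
        }
        List<String> missing = new ArrayList<>();
        for (String name : node.children.keySet()) {
            if (!existing.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        ProjectionSketch projection = new ProjectionSketch(Arrays.asList("m.a", "m.d"));
        Set<String> readerFields = new HashSet<>(Arrays.asList("a", "b", "c"));
        System.out.println("m.a projected: " + projection.isProjected("m.a"));
        System.out.println("m.b projected: " + projection.isProjected("m.b"));
        System.out.println("missing in m: " + projection.missingChildren("m", readerFields));
    }
}
```

In this sketch, projecting `m.a` and `m.d` against a map with fields `{a, b, c}` keeps `a`, skips `b` and `c`, and reports `d` as a column to be filled with nulls.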
    
    ### Unit Tests
    
    New tests are added to exercise the projection and cache mechanisms. Existing tests were updated for the changes made in the refactoring.
    
    ### Reference Design
    
    All of this work is done in support of the overall "batch sizing" project explained [here](https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-6373

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1244.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1244
    
----
commit 7df4280f3011862b84b43240c8a07e0bf019745d
Author: Paul Rogers <pr...@...>
Date:   2018-05-01T02:10:53Z

    DRILL-6373: Fix nullable vector data vector type
    
    Fixes the type of the data vector within a nullable vector. The data vector is Required (has no bits vector.) Accurate metadata is required for proper overflow handling in the result set loader.

commit 9496ef681f19f03ccd735f2c9b18f6d914eae3e2
Author: Paul Rogers <pr...@...>
Date:   2018-05-01T02:15:33Z

    DRILL-6373: Refactor result set loader

commit 74675436ae1efdf66deafeaa27b281d169e274ad
Author: Paul Rogers <pr...@...>
Date:   2018-05-01T02:16:48Z

    DRILL-6373: Revised projection & vector cache

commit 04598c0dbdbbff5ecbc2f89d02b14f66982f86bd
Author: Paul Rogers <pr...@...>
Date:   2018-05-01T02:17:22Z

    DRILL-6373: Revised & added unit tests

----


> Refactor the Result Set Loader to prepare for Union, List support
> -----------------------------------------------------------------
>
>                 Key: DRILL-6373
>                 URL: https://issues.apache.org/jira/browse/DRILL-6373
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>             Fix For: 1.14.0
>
>
> As the next step in merging the "batch sizing" enhancements, refactor the {{ResultSetLoader}} and related classes to prepare for Union and List support. This fix follows the refactoring of the column accessors for the same purpose. Actual Union and List support is to follow in a separate PR.


