Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/07/27 05:49:02 UTC
[jira] [Commented] (DRILL-5688) Add repeated map support to column accessors

    [ https://issues.apache.org/jira/browse/DRILL-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102746#comment-16102746 ] 

ASF GitHub Bot commented on DRILL-5688:
---------------------------------------

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/887

    DRILL-5688: Add repeated map support to column accessors

    Restructures the existing "column accessor" code to adopt a JSON-like structure that works for all of Drill's data types. This PR focuses on the "repeated map" vector. (Support for repeated lists is still to come, but they fit into the revised JSON-like structure.)
    
    This PR has four commits that highlight different parts of the changes:
    
    * The core accessors themselves
    * Changes to vector classes along with a new "tuple metadata" class
    * Revisions to the "row set" test framework which uses, and tests, the accessors
    * "Collateral damage" changes that pick up changes to the row set classes and add a number of small test framework improvements.
    
    ### Accessors
    
    The accessor structure is explained in `package_info.java` files in the accessor packages. Basically, the structure is:
    
    * The accessor types are: tuple, array and scalar
    * A tuple is a set of (name, type) pairs
    * Maps and rows are both tuples
    * Arrays are a series of one of the three types
    
    The accessors add an "object" layer that represents any of the three types. So, a tuple is really a list of (name, object accessor) pairs, where the object accessor provides access to a scalar, an array or a tuple as appropriate for each column.
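
    The tuple/array/scalar model above can be sketched roughly as follows. This is only an illustration of the idea, not the PR's code: every class and method name here (`SketchObject`, `SketchScalar`, `SketchArray`, `SketchTuple`, `AccessorSketch`) is hypothetical, and plain Java objects stand in for value vectors.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; these are not Drill's accessor classes.
enum SketchType { SCALAR, ARRAY, TUPLE }

// The "object" layer: one uniform wrapper over the three accessor types.
interface SketchObject {
  SketchType type();
  default Object scalar() { throw new UnsupportedOperationException("not a scalar"); }
  default List<SketchObject> array() { throw new UnsupportedOperationException("not an array"); }
  default Map<String, SketchObject> tuple() { throw new UnsupportedOperationException("not a tuple"); }
}

final class SketchScalar implements SketchObject {
  private final Object value;
  SketchScalar(Object value) { this.value = value; }
  @Override public SketchType type() { return SketchType.SCALAR; }
  @Override public Object scalar() { return value; }
}

final class SketchArray implements SketchObject {
  private final List<SketchObject> elements;
  SketchArray(List<SketchObject> elements) { this.elements = elements; }
  @Override public SketchType type() { return SketchType.ARRAY; }
  @Override public List<SketchObject> array() { return elements; }
}

// A tuple is an ordered set of (name, object) pairs; rows and maps both fit.
final class SketchTuple implements SketchObject {
  private final Map<String, SketchObject> columns = new LinkedHashMap<>();
  SketchTuple column(String name, SketchObject value) {
    columns.put(name, value);
    return this;
  }
  @Override public SketchType type() { return SketchType.TUPLE; }
  @Override public Map<String, SketchObject> tuple() { return columns; }
}

final class AccessorSketch {
  // Renders any object as JSON-like text by recursing on its type.
  static String render(SketchObject obj) {
    List<String> parts = new ArrayList<>();
    switch (obj.type()) {
      case SCALAR:
        return String.valueOf(obj.scalar());
      case ARRAY:
        for (SketchObject element : obj.array()) {
          parts.add(render(element));
        }
        return "[" + String.join(", ", parts) + "]";
      default:
        for (Map.Entry<String, SketchObject> col : obj.tuple().entrySet()) {
          parts.add(col.getKey() + ": " + render(col.getValue()));
        }
        return "{" + String.join(", ", parts) + "}";
    }
  }

  // A row holding a scalar column and a repeated map (an array of tuples).
  static SketchObject sampleRow() {
    List<SketchObject> points = new ArrayList<>();
    points.add(new SketchTuple().column("x", new SketchScalar(1)).column("y", new SketchScalar(2)));
    points.add(new SketchTuple().column("x", new SketchScalar(3)).column("y", new SketchScalar(4)));
    return new SketchTuple()
        .column("id", new SketchScalar(10))
        .column("points", new SketchArray(points));
  }

  public static void main(String[] args) {
    // prints {id: 10, points: [{x: 1, y: 2}, {x: 3, y: 4}]}
    System.out.println(render(sampleRow()));
  }
}
```

    In this sketch a repeated map is simply an array whose elements are tuples, which is why one recursive walk can handle every nesting depth.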
    
    The structure appears complex (since it must model JSON). But, an app using this code would use just the leaf scalar readers and writers. These classes currently access data via the value vector `Mutator` and `Accessor` classes. But, the goal is to eventually access the Netty `PlatformDependent` methods directly so that there is a single layer between the application and the call into direct memory. (Today there are multiple layers.)
    
    There is quite a bit of code change here to provide the new structure. But, the core functionality of reading and writing to vectors has not changed much. And, this code has extensive unit tests, which should avoid the need to "mentally execute" each line of code.
    
    ### Supporting Classes
    
    A new `TupleMetadata` class is a superset of the existing `BatchSchema`, but also provides "classic" tuple-like access by position or name. Eventually, this will also hold additional information such as actual size and so on (information now laboriously rediscovered by the "record batch sizer.") Since the accessors use a "tuple" abstraction to model both rows and maps, the tuple metadata provides this same view. The top-most tuple models the row. Columns within the row can be maps, which have their own tuple schema, and so on.
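
    As a rough illustration of tuple-like access by position or name with nested map schemas, consider the sketch below. It is not the PR's `TupleMetadata` class: `TupleSchemaSketch`, `ColumnSketch`, their methods, and the use of plain strings for types are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical column description; not a Drill class.
final class ColumnSketch {
  final String name;
  final String type;                  // e.g. "INT", "VARCHAR", "MAP"
  final TupleSchemaSketch mapSchema;  // non-null only for map columns

  ColumnSketch(String name, String type, TupleSchemaSketch mapSchema) {
    this.name = name;
    this.type = type;
    this.mapSchema = mapSchema;
  }
}

// Hypothetical tuple schema: the same class models the row and each map.
final class TupleSchemaSketch {
  private final List<ColumnSketch> columns = new ArrayList<>();
  private final Map<String, Integer> indexByName = new HashMap<>();

  TupleSchemaSketch add(String name, String type) {
    return add(new ColumnSketch(name, type, null));
  }

  TupleSchemaSketch addMap(String name, TupleSchemaSketch schema) {
    return add(new ColumnSketch(name, "MAP", schema));
  }

  private TupleSchemaSketch add(ColumnSketch col) {
    // Record the position before appending so both lookups stay in sync.
    if (indexByName.putIfAbsent(col.name, columns.size()) != null) {
      throw new IllegalArgumentException("duplicate column: " + col.name);
    }
    columns.add(col);
    return this;
  }

  int size() { return columns.size(); }

  // Access by position.
  ColumnSketch column(int index) { return columns.get(index); }

  // Access by name; returns null when the column is absent.
  ColumnSketch column(String name) {
    Integer index = indexByName.get(name);
    return index == null ? null : columns.get(index);
  }

  // Position of a named column, or -1 when absent.
  int index(String name) {
    Integer index = indexByName.get(name);
    return index == null ? -1 : index;
  }

  // The top-most tuple models the row; "address" is a map with its own schema.
  static TupleSchemaSketch sampleRow() {
    TupleSchemaSketch address = new TupleSchemaSketch()
        .add("city", "VARCHAR")
        .add("zip", "INT");
    return new TupleSchemaSketch()
        .add("id", "INT")
        .addMap("address", address);
  }
}
```

    With this shape, `sampleRow().column("address").mapSchema.column(1)` reaches the nested `zip` column, mixing name and position lookups at different levels.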
    
    `TupleNameSpace` moves locations (so it can be used in the vector package) but otherwise remains unchanged.
    
    `DrillBuf` provides an experimental `putInt()` method that does bounds checking and sets a value, to minimize calls. This will probably move into the writer in a later PR.
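
    The idea of a single call that both checks bounds and stores a value might look like the sketch below. It uses `java.nio.ByteBuffer` as a stand-in and is not `DrillBuf`; the class name `CheckedBufferSketch` and the choice to return `false` on overflow (rather than throw) are assumptions, since the description above does not state the method's actual contract.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for a direct-memory buffer; this is not DrillBuf.
final class CheckedBufferSketch {
  private final ByteBuffer buf;

  CheckedBufferSketch(int capacity) {
    buf = ByteBuffer.allocateDirect(capacity).order(ByteOrder.LITTLE_ENDIAN);
  }

  // Checks bounds and writes in one call. Returning false on overflow is an
  // assumption made for this sketch, not the documented behavior of the PR.
  boolean putInt(int offset, int value) {
    if (offset < 0 || offset > buf.capacity() - Integer.BYTES) {
      return false;
    }
    buf.putInt(offset, value);
    return true;
  }

  int getInt(int offset) {
    return buf.getInt(offset);
  }

  public static void main(String[] args) {
    CheckedBufferSketch b = new CheckedBufferSketch(8);
    System.out.println(b.putInt(4, 42)); // last slot that fits a 4-byte int
    System.out.println(b.putInt(6, 7));  // would run past the end
  }
}
```

    A caller such as a column writer could use the boolean result to decide when a vector is full, without a separate capacity query before each write.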
    
    This PR fixes DRILL-5690, a bug in repeated vectors that did not pass along Decimal scale and precision. See `RepeatedValueVectors.java`.
    
    `MaterializedField` adds an `isEquivalent()` method that compares two fields while ignoring internal (`$offset$`, `$bits$`, etc.) vectors.
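
    One way such an equivalence check could work is sketched below. This is not the `MaterializedField` implementation: `FieldSketch` is a hypothetical class, types are plain strings, and treating any child whose name is wrapped in `$` as internal, as well as comparing visible children in order, are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical field description; not Drill's MaterializedField.
final class FieldSketch {
  final String name;
  final String type;
  final List<FieldSketch> children;

  FieldSketch(String name, String type, FieldSketch... children) {
    this.name = name;
    this.type = type;
    this.children = Arrays.asList(children);
  }

  // Assumption: internal vectors are those named like $bits$.
  static boolean isInternal(String name) {
    return name.length() > 1 && name.startsWith("$") && name.endsWith("$");
  }

  private List<FieldSketch> visibleChildren() {
    List<FieldSketch> visible = new ArrayList<>();
    for (FieldSketch child : children) {
      if (!isInternal(child.name)) {
        visible.add(child);
      }
    }
    return visible;
  }

  // Same name, same type, and pairwise-equivalent visible children.
  boolean isEquivalent(FieldSketch other) {
    if (!name.equals(other.name) || !type.equals(other.type)) {
      return false;
    }
    List<FieldSketch> mine = visibleChildren();
    List<FieldSketch> theirs = other.visibleChildren();
    if (mine.size() != theirs.size()) {
      return false;
    }
    for (int i = 0; i < mine.size(); i++) {
      if (!mine.get(i).isEquivalent(theirs.get(i))) {
        return false;
      }
    }
    return true;
  }
}
```

    Under these assumptions, a field that has already materialized a `$bits$` child compares as equivalent to a freshly declared field of the same name and type.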
    
    ### Row Set Classes and Tests
    
    The `RowSet` family of classes changed in response to the accessor changes.
    
    * The reader and writer are moved to separate files.
    * Row sets now use a "parsed" form of "storage" classes to hold vectors (more below).
    * Static factory methods were added to hide constructor complexity.
    * The `RowSetBuilder` and `RowSetComparison` test tools added support for repeated maps.
    * Code to handle generic object writing moved from the `RowSetBuilder` into the accessors.
    * The old `RowSetSchema` evolved to become the `TupleMetadata` mentioned above.
    * Tests were greatly enhanced to test all modes of all supported scalar types, as well as the new JSON-like structure.
    
    In the previous version, the row set classes had complex logic to figure out what kind of accessor to create for each vector. This became overly complex. In this version, the row set "parses" a vector container to create "storage" objects that represent tuples and columns. A column can, itself, be a tuple. (Note: there is no need to model lists, since lists are just vectors at this level of abstraction and so need no special handling.)
    
    With this change, accessor creation is a simple matter of walking a tree to assemble the JSON-like structure.
    
    This structure is also used to create a batch's vectors from a schema.
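
    The tree walk described above might look roughly like this. The sketch only produces a text description of the accessor structure it would build; `StorageNode` and `AccessorTreeSketch` are hypothetical names, and two boolean flags stand in for Drill's actual type and mode information.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a "storage" node parsed from a vector container.
final class StorageNode {
  final String name;
  final boolean isMap;       // true for map and repeated-map columns
  final boolean isRepeated;  // true for repeated (array) columns
  final List<StorageNode> children;

  StorageNode(String name, boolean isMap, boolean isRepeated, StorageNode... children) {
    this.name = name;
    this.isMap = isMap;
    this.isRepeated = isRepeated;
    this.children = Arrays.asList(children);
  }
}

final class AccessorTreeSketch {
  // A tuple (row or map) is described by describing each of its columns.
  static String describeTuple(List<StorageNode> columns) {
    List<String> parts = new ArrayList<>();
    for (StorageNode col : columns) {
      parts.add(col.name + ": " + describeColumn(col));
    }
    return "tuple(" + String.join(", ", parts) + ")";
  }

  // A map column recurses into its members; a repeated column wraps
  // whatever its element is in an array.
  static String describeColumn(StorageNode col) {
    String element = col.isMap ? describeTuple(col.children) : "scalar";
    return col.isRepeated ? "array(" + element + ")" : element;
  }

  static List<StorageNode> sampleRow() {
    return Arrays.asList(
        new StorageNode("id", false, false),
        new StorageNode("tags", false, true),
        new StorageNode("points", true, true,
            new StorageNode("x", false, false),
            new StorageNode("y", false, false)));
  }

  public static void main(String[] args) {
    // prints tuple(id: scalar, tags: array(scalar), points: array(tuple(x: scalar, y: scalar)))
    System.out.println(describeTuple(sampleRow()));
  }
}
```

    Because the per-column decision is local (map or not, repeated or not), the walk needs no special cases for deeper nesting.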
    
    ### Other Changes
    
    The last commit contains various other changes, mostly reflecting the changes above.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5688

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/887.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #887
    
----
commit 022c5e9ed08c6393166e66ac5e862168bc6c5e77
Author: Paul Rogers <pr...@maprtech.com>
Date:   2017-07-27T05:03:50Z

    DRILL-5688: Add repeated map support to column accessors
    
    Includes the core JSON-like reader and writer interfaces and
    implementations.

commit 170101b177c113ebbdf1d0f890b1d80487c0ea2f
Author: Paul Rogers <pr...@maprtech.com>
Date:   2017-07-27T05:05:36Z

    Supporting vector and related classes
    
    Includes changes to value vectors, DrillBuf and other low-level classes.

commit f1ce8ffa6caa3120316ba538a5dc3e918c61da58
Author: Paul Rogers <pr...@maprtech.com>
Date:   2017-07-27T05:08:07Z

    Row set test classes
    
    Modifications to the row set abstraction (used for testing) for the
    changed accessors. Row sets also act as tests for the accessor classes,
    including a number of tests that test the classes used for testing.
    (Yes, somewhat recursive…)

commit 0310772c1920948c487c4789bb5d0f3fc5e3d012
Author: Paul Rogers <pr...@maprtech.com>
Date:   2017-07-27T05:09:07Z

    Test code affected by the row set changes
    
    Changes to unit tests, and the unit test framework, required by the
    changes to the accessor and row set classes.

----


> Add repeated map support to column accessors
> --------------------------------------------
>
>                 Key: DRILL-5688
>                 URL: https://issues.apache.org/jira/browse/DRILL-5688
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.12.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.12.0
>
>
> DRILL-5211 describes how Drill runs into OOM issues due to Drill's two allocators: Netty and Unsafe. That JIRA also describes the solution: limit vectors to 16 MB in length (with the eventual goal of limiting overall batch size.) DRILL-5517 added "size-aware" support to the column accessors created to parallel Drill's existing readers and writers. (The parallel implementation ensures that we don't break existing code that uses the existing mechanism; same as we did for the external sort.)
> This ticket describes work to extend the column accessors to handle repeated maps and lists. Key themes:
> * Define a common metadata schema for use in this layer and the "result set loader" of DRILL-5657. This schema layer builds on top of the existing schema to add the kind of metadata needed here and by the "sizer" created for the external sort.
> * Define a JSON-like reader and writer structure that supports the full Drill data model semantics. (The earlier version focused on the scalar types and arrays of scalars to prove the concept of limiting vector sizes.)
> * Revise test code to use the revised column writer structure.
> Implementation details will appear in the PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)