You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by paul-rogers <gi...@git.apache.org> on 2017/07/31 15:59:15 UTC

[GitHub] drill issue #887: DRILL-5688: Add repeated map support to column accessors

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/887
  
    This set of commits provide performance tuning of the column writer implementation. Discussion during the code review identified a number of tuning opportunities that lead to this work. The commits divide the files by layer:
    
    * Value vectors
    * Column accessors
    * Row set classes used for testing
    * Secondary files affected by the above changes
    
    The result is a very nice performance improvement relative to the original vector mutator methods. The writers now provide:
    
    * 42% faster writes for required values
    * 73% faster writes for nullable values
    * 408% (4x) faster writes for repeated values
    
    Not a bad outcome from a casual code review conversation!
    
    ### Merge Vector Overflow into Column Writers
    
    DRILL-5688 and DRILL-5657 originally separated the tasks of detecting and handling vector overflow. (Vector overflow occurs when we are about to write past the maximum buffer size.) The result was that each “set” operation needed to traverse two layers: the “column loader” which handles overflow, and the “column writer” which writes the value to a vector.
    
    Discussion identified an opportunity to combine these into a single layer. The revised column accessors in this PR achieve that revision: each accessor detects overflow, call a handler to handle the overflow, then continues to write operation into the (presumably) new vector.
    
    The “result set loader” (DRILL-5657) abstraction contains the logic to handle overflows. The original “row set” abstractions for tests don’t handle overflow; instead they throw a index out-of-bounds exception. (Tests seldom write enough data to trigger an overflow unless deliberately testing the overflow feature.)
    
    ### Bypass the Vector Mutator
    
    The big performance improvement was to avoid the overhead of the original vector mutator. In the original code, each write travels through many layers of code:
    
    * Vector mutator
    * Drillbuf
    * UDLE
    * Netty platform dependent
    * Java Unsafe
    
    The revised column writers bypass the first three layers; and instead use the Netty platform dependent methods to write to vector memory. In this, they follow the text reader implementation which does something similar.
    
    ### Specialized Offset Vector Writer
    
    For arrays (repeated) and variable-width values, the original code is quite inefficient as it must do a direct memory read, and two writes, for each value. The revised writer code for variable-width fields does two writes. For arrays (repeated) values, the writer code does just one write per value and one write per row. (This is why the writer code is so much faster: the test writes five values per row: 15 direct memory I/Os in the original code, just 6 in the new writer code.)
    
    A new `OffsetVectorWriter` class manages the details of writing to the offset vector.
    
    ### Highly-Optimized Writer Code
    
    The unfortunate side-effect of the optimization is that the writers are no longer simple. The previous version (which should be reviewed first) delegate work to the vector mutator. The new version provides an integrated set of code that:
    
    * Does one single bounds check per value to
    * Check if the vector size must increase, and if so
    * Check if the growth would cause overflow
    * Also check if the “fill empties” logic is needed to fill in missing values
    
    On the other hand, it became clear that the writers could be simplified. The previous version, because it worked with the mutators in each vector class, needed a separate writer for required, optional and repeated vectors. The new code, because it works directly with Netty’s platform dependent layer, can use a single writer for all three data modes.
    
    Now, one might note that the single implementation has some very slight overhead:
    
    * The nullable vector need not fill empty “data” values (null values can be garbage, empty filling is really only needed for the “bits” vector)
    * Repeated writers need not fill empty values (there may be empties in the offset vector, but not the data vector.)
    
    However, the cost is a single comparison, so it did not seem worth the effort to generate three complex classes to avoid a single comparison. (The performance test results seem to back up this assumption.)
    
    Because of the above, the `ColumnAccessors.java` template has become a bit more complex (as if it was not already plenty complex.) Best is to look at the generated `ColumnAccessors` class, and its nested classes, to see the result of the template.
    
    ### Non-zeroing Vector Reallocation
    
    The existing `reAlloc` methods double vector size and zero-fill the upper half of the new vector. This causes the `setSafe()` methods to use loops to call `reAlloc()` multiple times in case the vector may need to quadruple or more.
    
    The revised column writer code avoids these two performance penalties by
    
    * Computing the required new vector size so it can be allocated in one go and
    * Skipping the zeroing step.
    
    Existing code continues to use the original implementation.
    
    ### Collateral Code Cleanup
    
    Because vector overflow is now integrated into the writers, there is no longer a need to throw the prior `VectorOverflowException`, which is now removed. The code changes in prior commits to handle this exception are removed in this commit.
    
    DRILL-5517 added “size aware” methods to value vectors to support the earlier version of the writers which used the vector mutators. Since that logic is now integrated into the writers themselves, this commit backs out those changes to tidy up the value vector code.
    
    Previous versions of the row set writers flattened maps for easier creation of test data in the `RowSetBuilder`. Earlier commits changed this behavior to require an array of values to populate a map. A number of tests needed updating as a result.
    
    Retrofitted a couple of tests to use the `SubOperatorTest` base class.
    
    Split the `MapWriter` into two distinct base classes so that the code can set the value count which is implemented in two different classes without a common parent class method definition.
    
    Earlier versions had a way to set every column type from an int value for testing. Moved this code around to make use of the generic object-based set methods now available.
    
    Added a `PerformanceTool` used to compare writer performance to the original vector & mutator implementations.
    
    Added a `TestFillEmpties` test which writes every fifth value to every data type and every mode to ensure that the “fill empties” logic works for all. This also has the handy side-effect of verifying null handling and basic scalar writer behavior.
    
    ### Readers
    
    A logical question is, “if the column writers benefited from optimization, what about the column readers?” Writers are the focus of the “memory fragmentation” project which will replace the low-level vector writers in various operators.
    
    For now, readers are only used in test code - which is not performance critical. We can apply the same technique to readers when they find themselves on the critical performance path.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---