Posted to dev@drill.apache.org by paul-rogers <gi...@git.apache.org> on 2017/12/24 01:44:12 UTC

[GitHub] drill issue #1060: DRILL-5846: Improve parquet performance for Flat Data Typ...

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1060
  
    This PR is a tough one. We generally like to avoid design discussions in PRs; but I'm going to violate that rule.
    
    This PR is based on the premise that each vector has all the information needed to pull values from a `VLBulkEntry`. The bulk entry reads a given number of values at a time; presumably this is faster than having the value source set values one by one.
    
    A separate project is confronting a limitation of Drill: readers can create very large batches. Why? Readers choose to read x records. Records can be as small as a few bytes or as large as many MB. (Large records are common in document data sources.) Suppose we decide to read 4K records (a common size) and the records are 1 MB each. The batch needs 4 GB of memory. But, the "managed" sort and hash agg operators often get only about 40 MB. Since a 4 GB batch > 40 MB of working memory, queries die.
    
    If we knew that each record was y bytes, we could compute the record count as x = target batch size / record size. But, for most readers (including Parquet), Drill cannot know the record size until after the record is read. Hence, the entire premise of choosing a fixed read size up front is unworkable. In short, unlike a classic relational DB, Drill cannot predict record sizes; it can only react to the size of records as they are read.
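
    To make the arithmetic concrete, here is a minimal sketch (illustrative only, not Drill code; the constants and names are assumptions) of the row-count calculation a reader could do if the row width were known in advance:

    ```java
    // Illustrative only (not Drill code): the row count a reader could pick
    // if the average row width were known ahead of time.
    public class RowCountGuess {
      static int rowsForBatch(long targetBatchBytes, long estimatedRowBytes) {
        // Cap at the conventional 64K-record limit per batch.
        return (int) Math.min(64 * 1024, targetBatchBytes / estimatedRowBytes);
      }

      public static void main(String[] args) {
        // A 16 MB target with an assumed 1 KB row width yields 16K rows; if the
        // rows turn out to be 1 MB each, the same guess produces a 16 GB batch.
        System.out.println(rowsForBatch(16L << 20, 1024));
      }
    }
    ```

    The whole calculation hinges on `estimatedRowBytes`, which is exactly the number Drill does not have until after the data is read.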
    
    Suppose the design were modified so that vector V reads either up to x values or until it reaches some maximum vector size. The problem then becomes that, if the record has two columns, V and W, they cannot coordinate. (Specifically, the inner loop in vector V should not have visibility into the state of vector W.) V may read, say, 16K values of 1K each and become full. (For reasons discussed elsewhere, 16 MB is a convenient maximum vector size.) But if column W is 4x wider, at 4K bytes, it must read the same 16K records already read by V, and will require 64 MB, exceeding W's own size target by 4x.
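
    Spelled out as a tiny worked example (the numbers come from the paragraph above; nothing here is from the PR):

    ```java
    // Illustrative arithmetic for the V/W coordination problem described above.
    public class VectorCoordination {
      public static void main(String[] args) {
        long rows = 16 * 1024;        // rows V happened to read before filling up
        long vWidth = 1024;           // V's values: 1 KB each
        long wWidth = 4 * 1024;       // W's values: 4 KB each
        System.out.println("V bytes: " + rows * vWidth);  // 16 MB: at the per-vector limit
        System.out.println("W bytes: " + rows * wWidth);  // 64 MB: 4x over that same limit
      }
    }
    ```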
    
    Suppose the row has 1000 columns. (Unusual, but not unheard of.) Then, we have 1000 bulk readers all needing to coordinate. Maybe each one stays within its own size limit, growing to only 8 MB. But, the aggregate total of 1000 columns at 8 MB each is 8 GB, which exceeds the overall batch size target. (The actual target will be set by the planner, but low 10s of MB is a likely limit.)
    
    To work around this, the bulk load mechanism must know the sizes of all columns it will read in a batch and compute a row count that keeps the aggregate size below both the per-vector and overall batch size targets. As discussed, since Drill cannot predict the future, and Parquet provides no per-row size metadata, this approach cannot work in the general case (though, with good guesses, it can work in some percentage of "typical" queries -- the approach Drill currently uses).
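
    A minimal sketch of what that computation would have to look like, assuming hypothetical helper names and assuming the per-column widths were knowable up front -- which is precisely the input Parquet cannot provide per row:

    ```java
    import java.util.List;

    // Hypothetical sketch of a size-aware row budget: it needs per-column
    // width estimates, which Parquet does not supply ahead of reading.
    public class BatchRowBudget {
      static int rowBudget(List<Long> columnWidths, long perVectorLimit, long batchLimit) {
        long rowWidth = 0;
        long rows = 64 * 1024;                                      // conventional record-count cap
        for (long w : columnWidths) {
          rowWidth += w;
          rows = Math.min(rows, perVectorLimit / Math.max(1, w));   // per-vector cap
        }
        return (int) Math.min(rows, batchLimit / Math.max(1, rowWidth));  // whole-batch cap
      }
    }
    ```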
    
    Maybe the solution can read in small bursts, maybe 100 rows, to sense whether any vector has grown too large before reading another 100 values into each vector. Still, with any burst count larger than 1, it is easy to get back into the situation of an oversize batch if row sizes are not uniform.
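
    As a sketch (hypothetical interfaces, not code from this PR), the burst approach and its weakness look roughly like this:

    ```java
    import java.util.List;

    // Illustrative burst-reading loop (not the PR's design): read 100 rows per
    // column, then check sizes. A single burst of unusually wide rows has
    // already overshot the limits by the time the check runs.
    public class BurstReader {
      /** Hypothetical per-column bulk reader; these are not Drill APIs. */
      interface ColumnBulkReader {
        int readUpTo(int rows);     // returns how many values were actually read
        long bytesBuffered();       // bytes currently held in the backing vector
      }

      static void readBatch(List<ColumnBulkReader> columns,
                            long perVectorLimit, long batchLimit) {
        final int burst = 100;
        while (true) {
          int read = 0;
          long batchBytes = 0;
          for (ColumnBulkReader c : columns) {
            read = c.readUpTo(burst);            // every column reads the same burst of rows
            batchBytes += c.bytesBuffered();
            if (c.bytesBuffered() > perVectorLimit) { return; }  // too late: already oversized
          }
          if (batchBytes > batchLimit || read < burst) { return; }
        }
      }
    }
    ```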
    
    Since Parquet makes no guarantees that row sizes are uniform, this is a stochastic game. What are the chances that the next 100 (say) rows behave like the last 100? Any sufficiently large deviation leads to OOM errors and query failures. What percentage of the time are we willing to let queries fail with OOM errors in order to go fast the rest of the time? A 10% failure rate for a 100% speed improvement? Is there any correct failure rate other than 0%? This is a hard question. In fact, this is the very problem that the "batch size" project was asked to solve.
    
    We are left with two choices: either 1) exempt Parquet from batch size limits, or 2) devise a new solution that allows size limits to be enforced, if only stochastically. Since we are entertaining this PR, clearly we've decided to *not* address size issues in Parquet. But, since Parquet is our primary input format, that means we've decided not to address size issues for ~90% of queries. And yet, since we do, in fact, have a parallel project to limit batch sizes, perhaps we have not, in fact, exempted Parquet. This is quite the dilemma!
    
    For background, the "batch size" project, to be used for other readers, solves the problem by monitoring total batch size row by row. When a row causes the batch (or any individual vector) to exceed its size limit, "rollover" occurs to create a new batch, and the full batch is sent downstream. Much of the complexity comes from the fact that overflow can occur in the middle of a row, perhaps inside a deeply nested structure. The mechanism makes overflow transparent to the reader; the reader simply asks if it can write a row, writes it, then loops to check again.
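
    For illustration only, the reader-side contract reduces to something like the following; the interface names are placeholders, not that project's actual writer API:

    ```java
    // Hedged sketch of the row-by-row pattern described above.
    public class RowByRowReader {
      interface RowBatchWriter {
        boolean canWriteMore();       // false once the batch (or any vector) is full
        void writeRow(Object[] row);  // overflow to a new batch is handled internally
      }

      interface RowSource {
        Object[] nextRow();           // null at end of input
      }

      static void readRows(RowSource source, RowBatchWriter writer) {
        // The reader never tracks sizes itself: it asks, writes, and loops.
        while (writer.canWriteMore()) {
          Object[] row = source.nextRow();
          if (row == null) { break; }
          writer.writeRow(row);
        }
      }
    }
    ```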
    
    It is understood that Parquet is columnar, and the row-by-row approach is perhaps inefficient. So, we need a columnar solution to the batch size problem. Perhaps this project can propose one.
    
    The above description points out another concern. The PR says that this mechanism is for flat rows. Yet we have still another project to support nested tables (the so-called "complex types"). Is this PR consistent with that project, or will we need a new complex-type reader that uses, say, the "batch size" mechanism, while the "flat" reader uses this mechanism? Perhaps this project can propose a solution to that issue as well.
    
    I would appreciate direction from the PMC and commercial management about their intent and how these three now-conflicting projects are to be reconciled. Once that information is available, we can continue to determine whether the code offered here realizes the goals set by project leadership.


---