Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2022/01/04 00:17:25 UTC

[GitHub] [drill] paul-rogers edited a comment on pull request #2412: DRILL-8088: Improve expression evaluation performance

paul-rogers edited a comment on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004438817


   All: so I've kicked the hornet's nest with the mention of value vectors and Arrow. I'm going to put on my flame-proof suit and debunk some myths.
   
   The columnar format is great for storage, for all the usual reasons. This is why Parquet uses it, Druid uses it for segment files, and various DBs use it for storage. The question we want to ask is, do those benefits apply to the format within the Drill execution engine? I'm here to suggest that columnar has no advantage, and many disadvantages, when used as the *internal* format of an execution engine. "Thems is fighting words", so let's bring it on.
   
   I've had the pleasure of working with several query engines: Drill (columnar) and Impala (row-based) are two well-known examples. This has given me a unique opportunity to see if all the marketing claims for columnar (which still appear in the videos on Drill's website) actually hold up in practice. Spoiler: they don't.
   
   This is a PR about optimization. A good rule in optimization is to start with the biggest issues, then work toward the details. So, rather than tinker with the details of vector execution, let's look at the fundamental issues.
   
   **Myth: Vectorized execution**: The biggest myth is around vectorized execution. Of course, a minor myth is that Drill uses such execution (it doesn't.) The bigger myth is that, if we invested enough, it could.
   
   Vectorized execution is great when we have a simple operation we apply to a large amount of data. Think the dot-product operation for neural networks, or data compression, or image transforms, or graphics. In all cases, we apply a simple operation (rescale, say) to a large amount of homogeneous data (the pixels in an image.)
   
   So, the question is, does typical, real-world SQL fit this pattern? I've now seen enough crufty, complex, messy real-world queries to suggest that, no, SQL is not a good candidate for vectorization. `SELECT` and `WHERE` clauses embed business logic, and that logic is based on messy human rules, not crisp, clean mathematics. The resulting SQL tends to have conditionals (`WHEN` or `IF()`, etc.), lots of function calls (all those cool UDFs which @cgivre has written), and so on. Plus, as noted above, SQL deals with NULL values, which must short-circuit entire execution paths.
   
   Hence, even if we could vectorize simple operations, we'd find that, in most queries, we could not actually use that code.
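   
   To make that concrete, here's a minimal Java sketch (illustrative only, not Drill's generated code; the NULL bitmap layout and `someUdf` are made up) contrasting the clean loop a JIT *can* auto-vectorize with the branchy, NULL-aware loop that real queries actually need:
   
   ```java
   // Minimal sketch, not Drill code.
   class ExprSketch {
   
       // Clean loop: homogeneous math, no branches. A JIT can emit SIMD here.
       static void addAll(double[] a, double[] b, double[] out) {
           for (int i = 0; i < out.length; i++) {
               out[i] = a[i] + b[i];
           }
       }
   
       // Realistic loop: a NULL-bitmap test and a CASE-style branch per
       // value, plus an opaque UDF call. None of this maps onto SIMD.
       static void evalExpr(double[] a, double[] b, long[] nulls,
                            double[] out, long[] outNulls) {
           for (int i = 0; i < out.length; i++) {
               if ((nulls[i >> 6] & (1L << (i & 63))) != 0) {
                   outNulls[i >> 6] |= 1L << (i & 63);   // NULL short-circuits
               } else if (a[i] > 100.0) {                // WHEN ... THEN logic
                   out[i] = someUdf(a[i], b[i]);         // hypothetical UDF
               } else {
                   out[i] = a[i] + b[i];
               }
           }
       }
   
       static double someUdf(double x, double y) {       // stand-in UDF body
           return x * 0.9 + y;
       }
   }
   ```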
   
   **Myth: Vectors are CPU Cache Friendly**: The second big myth is that vectors are somehow more "friendly" to the CPU L1 cache than a row format. The idea is that one can load a vector into the L1 cache, then zip through many values in one go. This myth is related to the above one.
   
   First, SQL expressions are not based on columns; they are based on rows. Each calculation tends to involve multiple columns: `net_receipts = sales + taxes - returns`. Here, each calculation touches four vectors, so we need all four to be in the CPU cache to benefit.
   
   Second, SQL is row-based: the above calculation is just one of perhaps many that occur on each row. In the ideal case, the calculations form independent groups: `SELECT a + b AS x, c - d + e AS y, f / g AS z, ...`. In that case, we could load vectors `a`, `b`, and `x` into the L1 cache, do the calcs, then load `c`, `d`, `e`, and `y` into the cache, and so on. Of course, Drill doesn't work this way (it does all the calculations for a single row before moving to the next), but it could, and it would have to in order to benefit from vectorization.
   
   A more typical case is that the same column is used in multiple expressions: `SELECT a + b AS x, a / c AS y, (a - d) * e AS z, ...` In this case, we must load the `a` vector into the L1 cache multiple times. (Or, more properly, its values would continually be bumped out of the cache, then reloaded.)
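   
   A sketch of what this means for the cache (plain Java arrays standing in for vectors; not Drill code):
   
   ```java
   // Sketch: Drill-style row-at-a-time evaluation over columnar data.
   // The working set is all of a..e plus x, y, z at once, not a single
   // column, and `a`'s cache lines are touched three times per row.
   class RowEvalSketch {
       static void evalRows(double[] a, double[] b, double[] c, double[] d,
                            double[] e, double[] x, double[] y, double[] z) {
           for (int row = 0; row < a.length; row++) {
               x[row] = a[row] + b[row];
               y[row] = a[row] / c[row];               // `a` again
               z[row] = (a[row] - d[row]) * e[row];    // `a` a third time
           }
       }
   }
   ```
   
   With eight 4MB vectors live at once, the working set of this single loop is 32MB.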
   
   **Myth: Bigger Vectors are Better**: Drill went through a phase when everyone bought into the "L1 cache" myth. To get better performance, everyone wanted ever-larger vectors. In the code, you'll see that we started with 1K-row batches, then grew to 4K, then other code created 64K-row batches. It got so bad that we'd allocate vectors larger than 16MB, which caused memory fragmentation and OOM errors. (This is the original reason for what evolved into "EVF": to control vector sizes to prevent memory fragmentation - very basic DB stuff.)
   
   Remember that CPU caches are tiny: an L1 data cache is typically only tens of KB, and even the L2 cache is only a few hundred KB. A 4MB vector is already an order of magnitude larger than L2. Combine that with real-world expressions and we end up with a "working set" of tens of MB: far larger than any cache. The result is lots of cache misses. (This stuff is really hard to measure; it would be great for someone to do the experiments to show this happening in practice.)
   
   **Myth: Vectors are Efficient**: A related, minor myth is that writing to vectors is more efficient than writing to rows. This is not true in general, and in Drill it is especially false. Originally, vectors were allocated at some small initial size (256K? Something like that.) Vectors grow as we add data. (That's one of the things that makes working with them difficult.) Fill the 256K vector? We double it to 512K and copy across the data. We do it again at 1MB, 2MB, 4MB, and so on. In the end, to grow to 4MB, we copy about 4MB of data: 4MB of reads and 4MB of writes, in addition to the 4MB of writes needed to create the data in the first place.
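   
   The arithmetic is easy to check. A minimal sketch, assuming the 256K starting size above:
   
   ```java
   public class GrowthCost {
       public static void main(String[] args) {
           long size = 256 * 1024;  // assumed initial allocation
           long copied = 0;
           while (size < 4L * 1024 * 1024) {
               copied += size;      // each doubling copies the old buffer
               size *= 2;
           }
           // Prints 3932160: ~3.75MB read and rewritten just to grow,
           // on top of the 4MB of writes for the data itself.
           System.out.println(copied);
       }
   }
   ```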
   
   Later, a bunch of ad-hoc "batch sizing" code was added to try to guess a good initial size for vectors. That's not too hard for fixed-width vectors (`INT`, say), but rather tricky for variable-width vectors (`VARCHAR`).
   
   Remember that each vector is sized and grows independently. So, to create a batch, we have to track the size of every vector, grow those that need growing, but not over-allocate all the vectors because the space is not fungible: vector `a` can't "borrow" unused space from vector `b`.
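   
   Roughly, the accounting looks like this (hypothetical types, not Drill's actual classes):
   
   ```java
   import java.util.List;
   
   // Hypothetical sketch of per-batch accounting, not Drill's real code.
   interface VectorStub {
       int capacity();
       int bytesUsed();
       int bytesForNextValue();  // trivial for INT, a guess for VARCHAR
       void growAndCopy();       // doubles THIS vector's buffer only
   }
   
   class BatchWriter {
       // Each vector is checked and grown independently: unused space
       // in one vector cannot be lent to another.
       static void prepareRow(List<VectorStub> vectors) {
           for (VectorStub v : vectors) {
               if (v.bytesUsed() + v.bytesForNextValue() > v.capacity()) {
                   v.growAndCopy();
               }
           }
       }
   }
   ```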
   
   The point is that Drill needs a large amount of complex add-on code just to work around the fact that every vector will grow, and copy data, if not sized correctly; and, in general, we don't know ahead of time what the size should be. The result is inefficiency.
   
   **Single Data Format for In-Memory and Over-the-Wire**: One of Drill's claims to fame is that value vectors (and, later, Arrow vectors) use the same layout in memory as over the wire, leading to efficient exchanges via RPC. This myth is true, as far as it goes. But if you look at the code, the truth is much more complex. On the sender side, vectors are independent buffers (as explained above). On the receiver side, the whole message comes in as one big buffer, and special code slices up that buffer to recover the vectors. A vast amount of complex code is needed to handle the accounting. (Thanks to Chris, a one-time Drill contributor who wrote all that code well enough that it "just works". It would be hell to fix otherwise.)
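   
   Conceptually, the receiver side does something like the following `ByteBuffer` sketch (illustrative names; the real code works with Netty buffers and far more bookkeeping):
   
   ```java
   import java.nio.ByteBuffer;
   
   class BatchSlicer {
       // The whole batch arrives as one buffer; per-vector views are
       // carved out of it using lengths from the batch's metadata.
       static ByteBuffer[] sliceVectors(ByteBuffer batch, int[] lengths) {
           ByteBuffer[] vectors = new ByteBuffer[lengths.length];
           int offset = 0;
           for (int i = 0; i < lengths.length; i++) {
               batch.position(offset);
               batch.limit(offset + lengths[i]);
               vectors[i] = batch.slice(); // zero-copy view into the batch
               offset += lengths[i];
               batch.clear();              // reset bounds for the next slice
           }
           return vectors;
       }
   }
   ```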
   
   **Complex Client Code**: All DB clients work a row at a time. Whether it is JDBC, SQLAlchemy, your home-grown code, or whatever, a client consumes data a row at a time. All DB APIs (except Drill's) are row-based. The client asks for the next row (or, more typically, the next *n* rows), then reads them one by one. Super simple, and it has worked for decades.
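   
   For contrast, here is the entire programming model a row-based API asks of a client: a bog-standard JDBC loop (the table and columns are made up):
   
   ```java
   import java.sql.*;
   
   class RowClient {
       static void dumpTable(String url) throws SQLException {
           try (Connection conn = DriverManager.getConnection(url);
                Statement stmt = conn.createStatement();
                ResultSet rs = stmt.executeQuery("SELECT id, name FROM t")) {
               while (rs.next()) {          // driver fetches n rows at a time
                   long id = rs.getLong("id");
                   String name = rs.getString("name");
                   System.out.println(id + ": " + name);
               }
           }
       }
   }
   ```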
   
   With Drill, a client has to include all of Drill's complex vector logic so it can read a batch of 1K (or 4K or more) rows split across vectors. The client then has to walk the vectors to assemble the row that the client really wants. The result is that Java clients (such as JDBC) pull in a vast amount of Drill code. The C++ ODBC client was a nightmare: pretty much only one person ever could make it work (thanks, Parth!).
   
   The complexity of this simple client operation has led clients to use the REST API instead. But Drill is session-based (sessions are needed to set Drill's zillions of session options), while REST is not. Drill works for "big data", but REST delivers the results in one big blob. The result is that the vector-based client is so hard to use that folks want to use the REST client, which doesn't work in the general case either. We've shot ourselves in the foot. Yet every other DB on the planet has a simple row-based client API.

