Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/12/21 07:17:17 UTC

[GitHub] [drill] Leon-WTF opened a new pull request #2412: DRILL-8088: Improve expression evaluation performance

Leon-WTF opened a new pull request #2412:
URL: https://github.com/apache/drill/pull/2412


   # [DRILL-8088](https://issues.apache.org/jira/browse/DRILL-8088): Improve expression evaluation performance
   
   ## Description
   Remove an unnecessary map copy each time a new scope is entered.
   
   ## Documentation
   No need
   
   ## Testing
   Covered by many existing unit tests that query with "case when", e.g. ifExpression in TestLimit0VsRegularQueriesMetadata.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org






[GitHub] [drill] jnturton commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
jnturton commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1003948739


   > i'm sure you have already discussed this, but i would like to know why we are not migrating to arrow and cannot find any information about this decision. As far as i know, arrow was inspired by drill and on the arrow homepage they have still the picture with drill on it but drill does not use arrow. https://arrow.apache.org/overview/
   > Is there any official statement from the project for the arrow support/migration?
   
   @Z0ltrix there isn't an official statement that I know of.  It's too big a question for a comment thread answer and a good topic for a community meetup with some senior devs present.  I believe that to some extent Drill's vector engine has developed in its own direction since Arrow arrived and the best route forward for Drill is now not entirely obvious, and needs some thought.  Significant pros for Arrow are that it is maintained externally and I believe its performance is very good.





[GitHub] [drill] paul-rogers commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004406533


   @luocooong, here are answers to your questions:
   
   **Code gen**: Drill already supports "plain Java" code gen and use of the standard compiler without byte code fixup. It is what is used when you set the magic flag in each operator, then ask to save code for debugging. In the tests I did way back when, the "plain Java" path performed at least as well as the Janino/byte-code-fixup path.
   
   If you are not familiar with the "save code for debugging" mechanism, you should be if you want to look at optimization. I'd be happy to describe it (or hunt down whether it is already described in the Wiki.)
   
   **Provided schema**: There are three cases to consider.
   
   1. Explicit SELECT: `SELECT a, b, c FROM ...`. In this case, if we have a schema, then all operators will use exactly the same code and we can generate once.
   2. "Lenient" wildcard: `SELECT * FROM ...`, where the file (such as JSON or CSV) may have more columns than described by the "provided schema". In this case, each reader is free to add the extra columns. Since each file may be different, each reader will produce a different schema, and downstream operators must deal with schema-on-read; the code cannot be shared.
   3. "Strict" wildcard: readers include only those columns defined in the schema. For this option, we can also generate code once. 
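   To make cases 1 and 3 concrete, here is a minimal sketch (the `CodeCache` class and its method names are invented for illustration, not Drill's actual API) of compiling generated code once per schema signature and sharing it across fragments:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: when the schema is fixed up front (explicit SELECT or a
// "strict" wildcard), every fragment sees the same schema, so generated code
// can be compiled once and shared, keyed by a schema signature.
public class CodeCache {
    private final Map<String, Object> compiled = new ConcurrentHashMap<>();

    // schemaSignature might be "a:INT,b:VARCHAR,c:FLOAT8" for SELECT a, b, c
    public Object codeFor(String schemaSignature) {
        return compiled.computeIfAbsent(schemaSignature, sig -> compile(sig));
    }

    private Object compile(String sig) {
        // stand-in for the real code-gen + compile step
        return new Object();
    }

    public int size() {
        return compiled.size();
    }
}
```

   Under a lenient wildcard (case 2), each file can contribute a different signature, so the cache degenerates to one entry per distinct schema and the sharing is lost.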
   
   **Refactors**: there is probably a random assortment of tickets filed as various people looked into this area. However, this is more than a "change this, improve that" kind of thing; it probably needs someone to spend time to fully understand what we have today and to do some research to see if there are ways to improve the execution model. Hence, this discussion.
   
   **Vectorization**: that is a complex discussion. I'll tackle that in another note. 





[GitHub] [drill] jnturton commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
jnturton commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004186672


   @luocooong @paul-rogers I've always thought of Drill's code gen as being an effort to present a good target for the JVM's auto-vectorisation.  Not that this is likely to get the same results as the new SIMD intrinsics in the JVM, or a nice way to code.  Is this on the mark?  A reference:
   
   https://cr.openjdk.java.net/~vlivanov/talks/2019_CodeOne_MTE_Vectors.pdf





[GitHub] [drill] jnturton commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
jnturton commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1003928988


   > Drill was designed to allow vector operations (hence Value Vectors), but the code was never written. In part because there are no CPU vector instructions that work with SQL nullable data. Arrow is supposed to have figured out solutions (Gandiva, is it?) which, perhaps we could consider (but probably only for non-nullable data.)
   
   Hi @paul-rogers, I think that what Arrow does for computations over nullable data is store an external null mask and compute results for every record, including the null ones where the value vector contains either rubbish or some sentinel value.  In a second pass, a null mask is computed for the result.  This results in wasted arithmetic operations for null values, but in practice that's better than a branch for every value.  Quite possibly even for pretty sparse vectors.
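   As a rough illustration of that two-pass, branch-free style (a toy sketch with invented names, not Arrow's actual kernels, which operate on packed validity bitmaps rather than boolean arrays):

```java
// Toy sketch of Arrow-style null handling: compute results for every slot,
// including null ones (whose values may be rubbish), then combine the
// validity masks in a second, branch-free pass.
public class NullMaskAdd {
    public static int[] add(int[] a, int[] b) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];   // no null check: wasted work on null slots
        }
        return out;
    }

    public static boolean[] validity(boolean[] aValid, boolean[] bValid) {
        boolean[] out = new boolean[aValid.length];
        for (int i = 0; i < aValid.length; i++) {
            out[i] = aValid[i] && bValid[i];  // null if either input is null
        }
        return out;
    }
}
```

   The point of the shape: both loops are straight-line code with no data-dependent branches, which is what makes them friendly to SIMD, at the cost of arithmetic wasted on null slots.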





[GitHub] [drill] paul-rogers commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1003831613


   Hi @luocooong, looks like you're looking at the expression and operator code. I wonder, is there anything you're trying to improve? Execution performance, maybe?
   
   As you know, Drill is very complicated. Drill uses code generation for expression evaluation. The code generation goes through a path that made sense for Java 5 (when Drill was written), but is now a bit awkward. We do have a way to use the native Java tools, which already ran faster several years ago; that path is probably even faster now.
   
   Operator setup (another of your PRs) is impacted by code gen cost. Drill generates code for each fragment. If your query has 20 fragments, we generate code 20 times. The reason we must do that is that, in theory, every fragment can see a different schema, so the generated code could differ. By comparison, Spark generates code once, then pushes that code to all its executors.
   
   The generated code itself can be rather awkward for large queries: the code tries to inline everything which is great for small functions, but causes optimization problems as code blocks get larger.
   
   The mechanism to generate code, especially in the PROJECT operator, is vastly overly complex and could use a good re-think. It is so complex that it is hard to optimize because of the many assumptions and other issues embedded in the code.
   
   The generated code is meant to be small. But, over time, some operators added lots of "standard" code to the code generation path. That code means more work for the compiler and "byte code optimizer" while adding no per-query value. We've taken several passes at refactoring to pull that code off the code gen path, but there is more to do.
   
   Drill was designed to allow vector operations (hence Value Vectors), but the code was never written. In part because there are no CPU vector instructions that work with SQL nullable data. Arrow is supposed to have figured out solutions (Gandiva, is it?) which, perhaps we could consider (but probably only for non-nullable data.)
   
   Anyway, there are many areas we can improve. I can give you more details if I know what you're trying to accomplish.





[GitHub] [drill] cgivre merged pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
cgivre merged pull request #2412:
URL: https://github.com/apache/drill/pull/2412


   





[GitHub] [drill] luocooong commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004154855


   @paul-rogers Hello, thanks for the information. I just have a few questions:
   - **Code gen path**: Are you talking about the `Code Generation Workflow`? If we are going to use the native Java tools, is there anything we can do for this?
   - **Use provided schema**: If we provide the schema at query time, does it mean that we can also generate code once, like Spark?
   - **Refactors and changes**: Are there any old tickets that were used to rewrite and improve the CG path?
   - **Vectorization**: As I understand it, Drill only implements vector storage, but has not implemented vectorized execution based on these value vectors. Is that correct?





[GitHub] [drill] paul-rogers commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004444610


   One last note. Let's assume we wanted to adopt the row-based format (or, the myths being strong, we want to adopt Arrow.) How would we go about it?
   
   The "brute force" approach is to rewrite all the operators. Most deal with low-level vector code, so we'd rewrite that with low-level row (or Arrow) code. Since we can't really test until all operators are converted, we'd have to do the entire conversion in one huge effort. Then, we get to debug. I hope this approach is setting off alarm bells: it is high cost and high risk. This is why Drill never seriously entertained the change.
   
   But, there is another solution. The scan readers all used to work directly with vectors. (Parquet still does.) Because of the memory reasons explained above, we converted most of them to use EVF. As a result, we could swap vectors for row pages (or Arrow) by changing the low-level code. Readers would be blissfully ignorant of such changes because the higher-level abstractions would be unchanged.
   
   So, a more sane way to approach a change of in-memory representations is to first convert the other operators to use an EVF-like approach. (EVF for writing new batches, a "Result Set Loader" for reading existing batches.) Such a change can be done gradually, operator-by-operator, and is fully compatible with other, non-converted operators. No big bang.
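   A minimal sketch of that idea (the interfaces here are invented for illustration; EVF's real API differs): operators write through an abstraction, so the physical batch format behind it can be swapped without touching operator code:

```java
import java.util.ArrayList;
import java.util.List;

// Invented interfaces illustrating the EVF idea: operators depend only on a
// writer abstraction; the batch's physical layout (columnar vectors, rows,
// Arrow) lives entirely behind the implementation.
interface RowWriter {
    void setInt(int col, int value);
    void save();  // commit the current row
}

// One possible backing: a simple row-based store. A columnar or Arrow-backed
// implementation could replace this class with no change to the callers.
class RowListWriter implements RowWriter {
    final List<int[]> rows = new ArrayList<>();
    private final int width;
    private int[] current;

    RowListWriter(int width) {
        this.width = width;
        this.current = new int[width];
    }

    public void setInt(int col, int value) {
        current[col] = value;
    }

    public void save() {
        rows.add(current);
        current = new int[width];
    }
}
```

   An operator that writes via `RowWriter` never learns which backing format it got, which is what makes the format swappable after the operators are converted.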
   
   Once everything is upgraded to EVF, then we can swap out the in-memory format. Maybe try Arrow. Try a row-based format. Run tests. Pick the winner.
   
   This is *not* a trivial exercise, but it is doable over time, if we see value and can muster the resources.





[GitHub] [drill] Z0ltrix commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
Z0ltrix commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1003945110


   > > Drill was designed to allow vector operations (hence Value Vectors), but the code was never written. In part because there are no CPU vector instructions that work with SQL nullable data. Arrow is supposed to have figured out solutions (Gandiva, is it?) which, perhaps we could consider (but probably only for non-nullable data.)
   > 
   > Hi @paul-rogers, I think that what Arrow does for computations over nullable data is store an external null mask and compute results for every record, including the null ones where the value vector contains either rubbish or some sentinel value. In a second pass, a null mask is computed for the result. This results in wasted arithmetic operations for null values, but in practice that's better than a branch for every value. Quite possibly even for pretty sparse vectors.
   
   I'm sure you have already discussed this, but I would like to know why we are not migrating to Arrow, and I cannot find any information about this decision. As far as I know, Arrow was inspired by Drill, and on the Arrow homepage they still have the picture with Drill on it, but Drill does not use Arrow. https://arrow.apache.org/overview/
   Is there any official statement from the project on Arrow support/migration?





[GitHub] [drill] paul-rogers edited a comment on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
paul-rogers edited a comment on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004438817


   All: so I've kicked the hornet's nest with the mention of value vectors and Arrow. I'm going to put on my flame-proof suit and debunk some myths.
   
   The columnar format is great for storage, for all the usual reasons. This is why Parquet uses it, Druid uses it for segment files, and various DBs use it for storage. The question we want to ask is, do those benefits apply to the format within the Drill execution engine? I'm here to suggest that columnar has no advantage, and many disadvantages, when used as the *internal* format of an execution engine. "Thems is fighting words", so let's bring it on.
   
   I've had the pleasure of working with several query engines: Drill (columnar) and Impala (row-based) are two well-known examples. This has given me a unique opportunity to see if all the marketing claims for columnar (which still appear in the videos on Drill's website) actually hold up in practice. Spoiler: they don't.
   
   This is a PR about optimization. A good rule in optimization is to start with the biggest issues, then work toward the details. So, rather than tinker with the details of vector execution, let's look at the fundamental issues.
   
   **Myth: Vectorized execution**: The biggest myth is around vectorized execution. Of course, a minor myth is that Drill uses such execution (it doesn't.) The bigger myth is that, if we invested enough, it could.
   
   Vectorized execution is great when we have a simple operation we apply to a large amount of data. Think the dot-product operation for neural networks, or data compression, or image transforms, or graphics. In all cases, we apply a simple operation (rescale, say) to a large amount of homogeneous data (the pixels in an image.)
   
   So, the question is, does typical, real-world SQL fit this pattern? I've now seen enough crufty, complex, messy real-world queries to suggest that, no, SQL is not a good candidate for vectorization. `SELECT` and `WHERE` clauses embed business logic, and that logic is based on messy human rules, not crisp, clean mathematics. The resulting SQL tends to have conditionals (`WHEN` or `IF()`, etc.), lots of function calls (all those cool UDFs which @cgivre has written), and so on. Plus, as noted above, SQL deals with NULL values, which must short-circuit entire execution paths.
   
   Hence, even if we could vectorize simple operations, we'd find that, in most queries, we could not actually use that code.
   
   **Myth: Vectors are CPU Cache Friendly**: The second big myth is that vectors are somehow more "friendly" to the CPU L1 cache than a row format. The idea is that one can load a vector into the L1 cache, then zip through many values in one go. This myth is related to the above one.
   
   First, SQL expressions are not based on columns, they are based on rows. Each calculation tends to involve multiple columns: `net_receipts = sales + taxes - returns`. Here each calculation touches four vectors, so we need all four to be in the CPU cache to benefit.
   
   Second, SQL is row based: the above calculation is just one of perhaps many that occur on each row. In the ideal case, the calculations form independent groups: `SELECT a + b AS x, c - d + e AS y, f / g AS z, ...`. In this case, we could load vectors `a`, `b`, `x` into the L1 cache, do the calcs, then load `c`, `d`, `e` and `y` into the cache, and so on. Of course, Drill doesn't work this way (it does all the calculations for a single row before moving to the next), but it could, and it would have to in order to benefit from vectorization.
   
   A more typical case is that the same column is used in multiple expressions: `SELECT a + b AS x, a / c AS y, (a - d) * e AS z, ...` In this case, we must load the `a` vector into the L1 cache multiple times. (Or, more properly, its values would continually be bumped out of the cache, then reloaded.)
   
   **Myth: Bigger Vectors are Better**: Drill went through a phase when everyone bought into the "L1 cache" myth. To get better performance everyone wanted ever larger vectors. In the code, you'll see that we started with 1K-row batches, then it grew to 4K, then other code would create 64K row batches. It got so bad we'd allocate vectors larger than 16MB, which caused memory fragmentation and OOM errors. (This is the original reason for what evolved to be "EVF": to control vector sizes to prevent memory fragmentation - very basic DB stuff.)
   
   Remember, the CPU L1 cache is only about 256K in size. A 4MB vector is already 16x the L1 cache size. Combine that with real-world expressions and we end up with a "working set" of 10s of MB in size: 20x or more the L1 cache size. The result is lots of cache misses. (This stuff is really hard to measure, would be great for someone to do the experiments to show this happening in practice.)
   
   **Myth: Vectors are Efficient**: A related, minor myth is that writing to vectors is more efficient than writing to rows. This is not true in general. In Drill, it is especially false. Originally, vectors were allocated at some small initial size (256K? Something like that.) Vectors grow as we add data. (That's one of the things that make working with them difficult.) Fill the 256K vector? We double it to 512K and copy across the data. We do it again at 1MB, 2MB, 4MB, ... In the end, growing to 4MB copies about 4MB of data. That is 4MB of reads and 4MB of writes, in addition to the 4MB of writes needed to create the data in the first place.
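   The copy cost of that doubling scheme is easy to tally; a back-of-envelope sketch (the 256K initial size is the figure guessed at above):

```java
// Back-of-envelope: bytes copied while a vector grows by doubling from an
// initial size up to a target capacity. Each doubling copies the old buffer.
public class GrowthCost {
    public static long bytesCopied(long initial, long target) {
        long size = initial;
        long copied = 0;
        while (size < target) {
            copied += size;  // the realloc copies the current contents
            size *= 2;
        }
        return copied;
    }
}
```

   Growing 256K to 4MB copies 256K + 512K + 1M + 2M, i.e. roughly 4MB, and a correctly pre-sized vector copies nothing, which is the motivation for the "batch sizing" code described next.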
   
   Later, a bunch of ad-hoc "batch sizing" code was added to try to guess a good initial size for vectors. Not too hard for fixed-width vectors (`INT`, say), but rather tricky for variable-width vectors (`VARCHAR`).
   
   Remember that each vector is sized and grows independently. So, to create a batch, we have to track the size of every vector, grow those that need growing, but not over-allocate all the vectors because the space is not fungible: vector `a` can't "borrow" unused space from vector `b`.
   
   The point is, Drill needs a large amount of complex, add-on code just to work around the fact that every vector will grow, and copy data, if not sized correctly, and, in general, we don't know ahead of time what the size should be. The result is inefficiency.
   
   **Single Data Format for In-Memory and Over-the-Wire**: One of Drill's claims to fame is that value vectors (and, later, Arrow vectors) use the same layout in memory as over the wire, leading to efficient exchanges via RPC. This myth is true, as far as it goes. But, if you look at the code, the truth is much more complex. On the sender side, vectors are independent buffers (as explained above.) On the receiver side, the whole message comes in as one big buffer. Special code slices up that buffer to recover vectors. A vast amount of complex code is needed to handle the accounting. (Thanks to Chris, a one-time Drill contributor who wrote all that code well enough that it "just works". It would be hell to fix otherwise.)
   
   **Complex Client Code**: All DB clients work a row at a time. Whether it is JDBC, SQLAlchemy, your home-grown code, or whatever, a client consumes data a row at a time. All DB APIs (except Drill's) are row-based. The client asks for the next row (or, more typically, the next *n* rows), then reads them one-by-one. Super simple, and it has worked for decades.
   
   With Drill, a client has to include all of Drill's complex vector logic so it can read a batch of a 1K (or 4K or more) rows split across vectors. The client then has to walk the vectors to assemble the row that the client really wants. The result is that Java clients (such as JDBC) pull in a vast amount of Drill code. The C++ ODBC client was a nightmare: pretty much only one person ever could make it work (thanks, Parth!).
   
   The complexity of this simple client operation has led clients to use the REST API instead. But, Drill is session-based (needed to set Drill's zillions of session options), while REST is not. Drill works for "big data", but REST delivers the results in one big blob. The result is that the vector-based client is so hard to use that folks want to use the REST client, which doesn't work in the general case either. We've shot ourselves in the foot. Yet, every other DB on the planet has a simple row-based client API.





[GitHub] [drill] paul-rogers commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004442320


   The last topic is so complex that no myth has grown up around it, and the issue is not at all well understood. Vectors (and batches) are hell for distributed system performance. This gets pretty techie, so hang on.
   
   **Vectors are Hell for Exchanges**:  This comes from a real-world case in which a large cluster worked no faster than a single thread of execution. We've discussed how Drill wants to create large batches (per the myth) to benefit from vectorization (which we don't have) and to optimize L1 cache usage (which, as we've seen, we don't actually do.) Let's assume "small" batches of 1K rows.
   
   Drill also wants the single format for in-memory and over-the-wire usage. This means we want to transfer 1K record batches so that the receiver gets batches of the optimal in-memory size.
   
   Now, what happens in a distributed system? Assume you have 100 fragments running. (Maybe 10 machines with 10 cores each.) Let's think about one fragment, call it "f0.0". Let's assume f0.0 runs a scan and a network sender. The scan builds up its 1K batches, because those are "efficient" (according to the myths we've discussed.)
   
   What does f0.0's network sender do? Let's assume the target is a hash join. So, the sender hashes the keys into 100 buckets. Now, the sender follows Drill's rule: send 1K record batches. Since there are 100 targets, the sender has to create 100 buffered batches, fill them each to 1K records, then send them. To visualize:
   
   `f0.0 (reader --> sender) - - > f1.x (receiver --> hash-join --> ...) ...`
   
   There are 100 f0 fragments: f0.0, ... f0.99, we're looking just at one of them: f0.0. The f0 "slice" sends to the "f1" slice that consists of 100 additional fragments: f1.0, ... f1.99.
   
   So, what happens in our sender? Assuming even hash distribution, we have to fill all our 100 outgoing batches before we can send them. This means we have to read 100 * 1K = 100K input records before we send the first outgoing batch. The result is a huge memory usage (those 100 batches), plus all the vector resizes and copies we discussed (as we grow those batches.)
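   The arithmetic behind that claim, as a quick sketch (invented helper names; even hash distribution assumed):

```java
// Quick sketch of sender-side buffering under even hash distribution:
// with N targets and a B-row batch size, roughly N * B input rows must be
// read before the first outgoing batch fills, and every sender in the
// slice holds N partial batches in memory at once.
public class SenderBuffering {
    public static long rowsBeforeFirstSend(int targets, int batchRows) {
        return (long) targets * batchRows;
    }

    public static long bufferedBatchesInSlice(int senders, int targets) {
        return (long) senders * targets;
    }
}
```

   Plugging in the numbers from the scenario: 100 targets at 1K rows per batch means 100K rows read before the first send, and 100 senders times 100 targets means 10K buffered batches across the slice.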
   
   If that were not bad enough, the same thing occurs in all of our other 99 f0 fragments: we've got 100 * 100 = 10K buffered batches waiting to send. Yikes!
   
   Now, what happens in f1? It is sitting around waiting for data. No f0 will send until it fills its first outgoing batch for that receiver. If we assume an even distribution of data, then the outgoing batches fill at about the same rate. None can be sent until one of them reaches the target size, at which point most of them are near-full. Once the first hits the 1K mark, off it goes to f1, which can finally start processing. This is bad because Drill claims to be highly distributed, but what we just described is a serial way of working.
   
   But, it gets worse! Now, assume we're deeper in the DAG, at a sort:
   
   `f4: (receiver --> sort --> sender) - - > f5: (receiver --> merge --> ...)`
   
   The sort sorts its slice of records, and sends it to the merge fragment which merges all the partial sorts. Classic distributed systems stuff. Again, the f4 (sort) sender waits to fill its outgoing batches, then it sends. The merge can't start until it sees batches from all 100 inputs. So, it proceeds at the rate of the slowest sort.
   
   Now what happens? The merge uses up one of the 100 input batches, and needs another before it can proceed. But, here things get really nasty.
   
   On the f4 side, f4.0, say, sent the first batch to fill up. It then sent the others as they filled. Meanwhile, the first batch started refilling and will eventually need to be sent again. Since a merge can't read a new batch until it has used up the previous one, it blocks the f4 sender. As a result, f4 can't send to *any* other merge.
   
   The downstream fragment throttles the upstream, and vice versa. Not quite deadlock, but the entire system becomes serialized: the sort can't ship batches until the slowest merge can receive them. The merge can't make progress until the slowest sort provides the next batch. Every fragment depends on every other. Disaster!
   
   Again, we spent hours trying to figure this out on a customer cluster. We could see the effect, but we could not get in to work out the details. Would be great for someone to do the experiments.
   
   **Summary**: The above has debunked the major myths around columnar storage within a query engine. Note that **none** of the above changes if we use Arrow. We'd do a huge amount of work to switch, and be stuck with the same fundamental problems.
   
   Hence, we have to think deeply about this issue, not just buy the snake oil that "vectors are good for an execution engine." Good old solid engineering and experimentation will tell us what's what.
   






[GitHub] [drill] luocooong commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004701505


   @paul-rogers Thanks for your knowledge base. On vectorization and Arrow, I may need to keep pouring cold water on everyone: I think we need to figure out what it is we actually need here, not just see a technology described as powerful and then repeat that it is really powerful.
   
   **SQL, NoSQL and NewSQL**
   In my work, I have been through all three of these database architectures. Initially we had performance bottlenecks, so we chose NoSQL databases, e.g. Mongo, ES and Redis. After entering the production environment, most developers were not good at using the APIs (clients) designed for these databases. The key point is that developers do not have the patience to learn the advanced syntax of each NoSQL database. So we came back to SQL-style databases: NewSQL. But we found that NewSQL also has many bottlenecks (just when you think no more optimization is needed); in particular, once you come to deploy a private cluster, you get complex operation-and-maintenance conditions and user complaints.
   
   Here, the lesson is that choosing the right one is better than choosing blindly.
   
   **OLTP, OLAP and HTAP**
   The row format is good for OLTP, because it needs to insert / update / delete and fetch entire rows as quickly as possible. The columnar format is good for OLAP, because it needs to analyze large amounts of data: to keep latency low, it must cut down the data read from disk and reduce network IO bottlenecks. What does HTAP look like? Let's look at the internal implementation of TiDB:
   
   <img src="https://download.pingcap.com/images/docs/tiflash/tiflash-architecture.png" width="50%" height="50%">
   
   ```
   TiFlash provides the columnar storage, it conducts real-time replication of data in the TiKV nodes 
   at a low cost that does not block writes in TiKV. Meanwhile, it provides the same read consistency 
   as in TiKV and ensures that the latest data is read.
   ```
   
   As is well known, database vendors prefer the row format for OLTP and the columnar format for OLAP. However, they are well aware that there is no perfect data format. So, I agree with Paul that the row format is better for Drill.
   
   **CPU Cache miss**
   - N-ary Storage Model
   <img src="https://user-images.githubusercontent.com/50079619/148023833-dcd92d2c-38ae-41fa-a80f-97496f50e647.png" width="50%" height="50%">
   
   - Cache miss
   <img src="https://user-images.githubusercontent.com/50079619/148022521-f623b105-9057-4286-8ee4-3a2231b8bc68.png" width="50%" height="50%">
   
   The figure shows how the CPU caches behave: a lot of data the query never needs is pulled into the cache, crowding out data that would otherwise have been reusable.
   
   - Decomposition Storage Mode
   <img src="https://user-images.githubusercontent.com/50079619/148023968-6f0db55b-63d1-448a-9165-2001a59aea01.png" width="50%" height="50%">
   
   Reducing cache misses is one of the advantages of the columnar format, but note that Paul has explained that the columnar format also suffers cache misses once calculations follow SQL's row-oriented syntax. So we cannot expect maximum performance from Arrow either.
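   As a toy illustration of the figures above (hypothetical numbers, not measurements), we can compare the bytes dragged through the cache by a single-column scan under the two storage models:

   ```java
   // Toy sketch: why a single-column scan wastes cache bandwidth under the
   // N-ary (row) storage model. To read one 4-byte column out of a 40-byte
   // row, every cache line pulled in is mostly other columns' data; a
   // columnar layout touches only the column being scanned.
   public class ScanCost {

       // Bytes the scan actually needs: one field per row (what a columnar
       // layout would touch).
       public static long usefulBytes(long rows, int fieldWidth) {
           return rows * fieldWidth;
       }

       // Bytes crossing the cache in a row-model scan: every row's full width.
       public static long rowModelBytes(long rows, int rowWidth) {
           return rows * rowWidth;
       }

       public static void main(String[] args) {
           long rows = 1_000_000;
           // A 10x difference for this schema: 4MB of useful data vs 40MB
           // dragged through the cache.
           System.out.println("columnar scan touches: " + usefulBytes(rows, 4));
           System.out.println("row-model scan touches: " + rowModelBytes(rows, 40));
       }
   }
   ```

   This is the storage-side win; as discussed in this thread, it does not automatically carry over to expression evaluation inside the engine.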
   
   **Write amplification**
   Drill avoids becoming a full database system, but it still has several key points on its read-write path:
   - Drill supports spilling vector values to disk.
   - Drill is an engine, not a database, so it cannot unify the initial source formats (those are defined by each data source vendor).
   - The read-write path is complex:
     - (1) Read from a variety of data sources.
     - (2) Write to value vectors, or spill to disk.
     - (3) Combine into row sets.
     - (4) Deliver to the client side.
   
   As noted above, it is impossible to avoid reading and writing data multiple times within a query, and that write amplification increases latency. So there is a lot to optimize, not only the columnar format.





[GitHub] [drill] paul-rogers edited a comment on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
paul-rogers edited a comment on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004438817


   All: so I've kicked the hornet's nest with the mention of value vectors and Arrow. I'm going to put on my flame-proof suit and debunk some myths.
   
   The columnar format is great for storage, for all the usual reasons. This is why Parquet uses it, Druid uses it for segment files, and various DBs use it for storage. The question we want to ask is, do those benefits apply to the format within the Drill execution engine? I'm here to suggest that columnar has no advantage, and many disadvantages, when used as the *internal* format of an execution engine. "Thems is fighting words", so let's bring it on.
   
   I've had the pleasure of working with several query engines: Drill (columnar) and Impala (row-based) are two well-known examples. This has given me a unique opportunity to see if all the marketing claims for columnar (which still appear in the videos on Drill's website) actually hold up in practice. Spoiler: they don't.
   
   This is a PR about optimization. A good rule in optimization is to start with the biggest issues, then work toward the details. So, rather than tinker with the details of vector execution, let's look at the fundamental issues. I hope this will help us avoid confusing Drill's (and Arrow's) marketing with reality.
   
   **Myth: Vectorized execution**: The biggest myth is around vectorized execution. Of course, a minor myth is that Drill uses such execution (it doesn't.) The bigger myth is that, if we invested enough, it could.
   
   Vectorized execution is great when we have a simple operation we apply to a large amount of data. Think the dot-product operation for neural networks, or data compression, or image transforms, or graphics. In all cases, we apply a simple operation (rescale, say) to a large amount of homogeneous data (the pixels in an image.)
   
   So, the question is, does typical, real-world SQL fit this pattern? I've now seen enough crufty, complex, messy real-world queries to suggest that, no, SQL is not a good candidate for vectorization. `SELECT` and `WHERE` clauses embed business logic, and that logic is based on messy human rules, not crisp, clean mathematics. The resulting SQL tends to have conditionals (`WHEN` or `IF()`, etc.), lots of function calls (all those cool UDFs which @cgivre has written), and so on. Plus, as noted above, SQL deals with NULL values, which must short-circuit entire execution paths.
   
   Hence, even if we could vectorize simple operations, we'd find that, in most queries, we could not actually use that code.
   
   **Myth: Vectors are CPU Cache Friendly**: The second big myth is that vectors are somehow more "friendly" to the CPU L1 cache than a row format. The idea is that one can load a vector into the L1 cache, then zip through many values in one go. This myth is related to the above one.
   
   First, SQL expressions are not based on columns, they are based on rows. Each calculation tends to involve multiple columns: `net_receipts = sales + taxes - returns`. Here each calculation touches four vectors, so we need all four to be in the CPU cache to benefit.
   
   Second, SQL is row based: the above calculation is just one of perhaps many that occur on each row. In the ideal case, the calculations form independent groups: `SELECT a + b AS x, c - d + e AS y, f / g AS z, ...`. In this case, we could load vectors `a`, `b`, `x` into the L1 cache, do the calcs, then load `c`, `d`, `e` and `y` into the cache, and so on. Of course, Drill doesn't work this way (it does all the calculations for a single row before moving to the next), but it could, and it would have to in order to benefit from vectorization.
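   To make the contrast concrete, here is a small sketch (hypothetical code, not Drill's generated code) of the two evaluation orders. Both compute the same results; only the memory access pattern differs:

   ```java
   // Sketch contrasting row-at-a-time evaluation (what Drill does today)
   // with the column-at-a-time loops that true vectorization would need.
   // All names are hypothetical.
   public class EvalStyles {

       // Row-at-a-time: evaluate every expression for a row before moving on.
       // Each iteration touches values from many vectors at once.
       public static void rowAtATime(int[] a, int[] b, int[] c, int[] d,
                                     int[] x, int[] y) {
           for (int i = 0; i < a.length; i++) {
               x[i] = a[i] + b[i];   // expression 1
               y[i] = c[i] - d[i];   // expression 2
           }
       }

       // Column-at-a-time: finish one expression over the whole batch, then
       // the next. Each tight loop touches only the vectors it needs, which
       // is the pattern that keeps the working set small and lets the JIT
       // emit SIMD instructions.
       public static void columnAtATime(int[] a, int[] b, int[] c, int[] d,
                                        int[] x, int[] y) {
           for (int i = 0; i < a.length; i++) {
               x[i] = a[i] + b[i];
           }
           for (int i = 0; i < c.length; i++) {
               y[i] = c[i] - d[i];
           }
       }
   }
   ```

   The rewrite from the first form to the second is only possible when the expressions are independent, which, as argued above, real-world SQL rarely guarantees.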
   
   A more typical case is that the same column is used in multiple expressions: `SELECT a + b AS x, a / c AS y, (a - d) * e AS z, ...` In this case, we must load the `a` vector into the L1 cache multiple times. (Or, more properly, its values would continually be bumped out of the cache, then reloaded.)
   
   **Myth: Bigger Vectors are Better**: Drill went through a phase when everyone bought into the "L1 cache" myth. To get better performance, everyone wanted ever larger vectors. In the code, you'll see that we started with 1K-row batches, then it grew to 4K, then other code would create 64K-row batches. It got so bad we'd allocate vectors larger than 16MB, which caused memory fragmentation and OOM errors. (This is the original reason for what evolved to be "EVF": to control vector sizes to prevent memory fragmentation - very basic DB stuff.)
   
   Remember, CPU caches are small: the L1 data cache is typically a few tens of KB, and even the L2 is only a few hundred KB to a few MB. A 4MB vector already dwarfs the L1 cache. Combine that with real-world expressions and we end up with a "working set" of 10s of MB in size: far beyond any CPU cache. The result is lots of cache misses. (This stuff is really hard to measure; it would be great for someone to do the experiments to show this happening in practice.)
   
   **Myth: Vectors are Efficient**: A related, minor myth is that writing to vectors is more efficient than writing to rows. This is not true in general. In Drill, it is especially false. Originally, vectors were allocated at some small initial size (256K? Something like that.) Vectors grow as we add data. (That's one of the things that make working with them difficult.) Fill the 256K vector? We double it to 512K and copy across the data. We do it again at 1MB, 2MB, 4MB, ... In the end, to grow to 4MB, we copy about 4MB of data. That is 4MB of reads and 4MB of writes, in addition to the 4MB of writes needed to create the data in the first place.
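   The copy cost of doubling growth is easy to sketch (a hypothetical helper, not Drill's allocator):

   ```java
   // Back-of-the-envelope sketch of the copy cost of doubling growth:
   // growing a buffer from initialSize to finalSize copies the entire old
   // contents at every doubling step.
   public class GrowthCost {

       // Total bytes copied while doubling from initialSize up to finalSize.
       public static long totalBytesCopied(long initialSize, long finalSize) {
           long copied = 0;
           for (long size = initialSize; size < finalSize; size *= 2) {
               copied += size;  // each doubling copies the full old buffer
           }
           return copied;
       }

       public static void main(String[] args) {
           long copied = totalBytesCopied(256 * 1024, 4 * 1024 * 1024);
           // 256K + 512K + 1M + 2M = 3.75MB copied just to reach a 4MB
           // vector, on top of the 4MB of writes that create the data.
           System.out.println(copied + " bytes copied");
       }
   }
   ```

   The copies disappear only if the initial allocation is guessed correctly, which is exactly what the ad-hoc "batch sizing" code described below tries to do.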
   
   Later, a bunch of ad-hoc "batch sizing" code was added to try to guess a good initial size for vectors. Not too hard for fixed-width vectors (`INT`, say), but rather tricky for variable-width vectors (`VARCHAR`).
   
   Remember that each vector is sized and grows independently. So, to create a batch, we have to track the size of every vector, grow those that need growing, but not over-allocate all the vectors because the space is not fungible: vector `a` can't "borrow" unused space from vector `b`.
   
   The point is, Drill needs a large amount of complex, add-on code just to work around the fact that every vector will grow, and copy data, if not sized correctly, and, in general, we don't know ahead of time what the size should be. The result is inefficiency.
   
   **Myth: Vectors Allow Efficient Parquet Integration**: The idea here is that Parquet is columnar, vectors are columnar, so we just read Parquet directly into vectors. The reality is that Parquet is a highly-encoded format that requires a large amount of complex code to decode. Drill does the decoding value-by-value, there is no "direct copy", nor could there be. Parquet works with Drill's value vectors, Arrow vectors, or Impala rows equally well: in each case, the Parquet data has to be decoded value-by-value, then written to the target format.
   
   **Single Data Format for In-Memory and Over-the-Wire**: One of Drill's claims to fame is that value vectors (and, later, Arrow vectors) use the same layout in memory as over the wire, leading to efficient exchanges via RPC. This myth is true, as far as it goes. But if you look at the code, the truth is much more complex. On the sender side, vectors are independent buffers (as explained above.) On the receiver side, the whole message comes in as one big buffer. Special code slices up that buffer to recover vectors. A vast amount of complex code is needed to handle the accounting. (Thanks to Chris, a one-time Drill contributor who wrote all that code well enough that it "just works". It would be hell to fix otherwise.)
   
   **Complex Client Code**: All DB clients work a row at a time. Whether it is JDBC, SQLAlchemy, your home-grown code, or whatever, a client consumes data a row at a time. All DB APIs (except Drill's) are row-based. The client asks for the next row (or, more typically, the next *n* rows), then reads them one by one. Super simple, and it has worked for decades.
   
   With Drill, a client has to include all of Drill's complex vector logic so it can read a batch of 1K (or 4K or more) rows split across vectors. The client then has to walk the vectors to assemble the row that the client really wants. The result is that Java clients (such as JDBC) pull in a vast amount of Drill code. The C++ ODBC client was a nightmare: pretty much only one person ever could make it work (thanks, Parth!).
   
   The complexity of this simple client operation has led clients to use the REST API instead. But Drill is session-based (needed to set Drill's zillions of session options), while REST is not. Drill works for "big data", but REST delivers the results in one big blob. The result is that the vector-based client is so hard to use that folks want to use the REST client, which doesn't work in the general case either. We've shot ourselves in the foot. Yet every other DB on the planet has a simple row-based client API.





[GitHub] [drill] paul-rogers commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004444084


   OK, so the above raises the issues we have to consider when thinking about vectors (Drill's or Arrow's). What is the alternative?
   
   Here, I think Impala got it right. Impala uses Parquet (columnar) for *storage*, but rows for *internal* processing. Impala is like an Italian sports car of old: spends lots of time in the shop, but when it works, it is very fast.
   
   One of the reasons that Impala is fast is because of the row format.  First, let's describe what "row-based" means. It means that columns appear together, as in a C `struct`, with rows packed one after another as in an array of `structs`. This means that the data for a given row is contiguous. There is only one buffer to size. Classic DB stuff that seems unusual only because we're all used to Drill's vector format.
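   As a sketch of what "array of structs" means in practice (a hypothetical layout, not Impala code), a fixed-width schema packs each row contiguously in a single buffer:

   ```java
   import java.nio.ByteBuffer;

   // Sketch of a row-based layout: each row is packed like a C struct, and
   // rows follow one another in one buffer, so a whole row is contiguous in
   // memory. Schema and names are hypothetical.
   public class RowLayout {
       // Fixed-width schema: INT sales, INT taxes, INT returns -> 12-byte rows.
       public static final int ROW_WIDTH = 12;

       public static void writeRow(ByteBuffer page, int row,
                                   int sales, int taxes, int returns) {
           int base = row * ROW_WIDTH;
           page.putInt(base, sales);
           page.putInt(base + 4, taxes);
           page.putInt(base + 8, returns);
       }

       // net_receipts = sales + taxes - returns, computed from one
       // contiguous 12-byte slice: the whole row fits in a single cache line.
       public static int netReceipts(ByteBuffer page, int row) {
           int base = row * ROW_WIDTH;
           return page.getInt(base) + page.getInt(base + 4) - page.getInt(base + 8);
       }
   }
   ```

   Note how the `net_receipts` expression from the earlier example reads one contiguous slice instead of touching four separate vectors.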
   
   Let's look at the same issues above, but from a row-based perspective.
   
   **Expression Execution**: With a row-based model, the CPU can easily load a single row into the L1 cache. All our crufty, real-world expression logic works on that single row. So, no matter how messy the expressions, from the CPU's perspective all the data is in that single row, which fits nicely into the cache.
   
   Rows can be small (a few dozen bytes) or large (maybe tens of KB for long `VARCHAR`s). In the common case, a row fits easily in the cache. The row is loaded, we do the work, and once we move on to the next row we'll never revisit the previous one, so we don't care if the CPU flushes it from the cache.
   
   **Memory Allocation**: Rows reside in buffers (AKA "pages"), typically of a fixed size. A reader "pours" data into a row. When the page is full, only that last record is copied to the next page, not all the existing data. So we eliminate the 1X copy + 1X load problem in Drill. Since there is only one buffer to allocate, memory allocation is simpler. Since pages are of fixed size, memory management is simpler as well.
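   A minimal sketch of this page scheme (hypothetical API, not Impala's or Drill's code):

   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Sketch of page-based row allocation: rows are poured into fixed-size
   // pages; when a row does not fit, only that one row goes to a fresh
   // page. Existing data is never copied, unlike a doubling vector.
   public class PageWriter {
       private final int pageSize;
       private final List<byte[]> pages = new ArrayList<>();
       private byte[] current;
       private int used;

       public PageWriter(int pageSize) {
           this.pageSize = pageSize;
           this.current = new byte[pageSize];
           pages.add(current);
       }

       // Assumes a single row always fits within one page.
       public void writeRow(byte[] row) {
           if (used + row.length > pageSize) {  // page full: start a new one
               current = new byte[pageSize];    // only this row moves over
               pages.add(current);
               used = 0;
           }
           System.arraycopy(row, 0, current, used, row.length);
           used += row.length;
       }

       public int pageCount() { return pages.size(); }
   }
   ```

   There is only one growth decision (allocate the next fixed-size page), so the memory accounting stays trivial compared with sizing every vector independently.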
   
   **Exchanges**: Network exchanges are row-based. Rows are self-contained. A network sender can send single rows, if that is efficient, or batches of rows. In our 100-senders-100-receiver example, we could send rows as soon as they are available. The receiver starts working as soon as the first row is available. There is no near-deadlock from excessive buffering.
   
   Yes, we would want to buffer rows (into pages) for efficiency. But, in extreme cases, we can send small numbers of rows to keep the DAG flowing.
   
   **Clients**: As noted above, row-based clients are the classic solution and are simple to write. We could easily support proper clients in Python, Go, Rust and anything else if we used a row-based format.
   
   **Conclusion**: We tend to focus on the "value vector vs. Arrow" discussion. I'm here to say that that is the wrong question: it buys into myths which have hurt Drill for years. The *correct* question is: what is the most efficient format for the use cases where Drill wants to excel? The above suggests that, rather than Arrow, a better solution is to adopt a row-based internal format.

