Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2022/01/04 11:47:48 UTC

[GitHub] [drill] jnturton opened a new issue #2421: ValueVectors replacement

jnturton opened a new issue #2421:
URL: https://github.com/apache/drill/issues/2421


   This feature request has been transcribed from messages posted in #2412 and to the mailing list in the first week of January 2022. The topic is what might replace ValueVectors, the memory structures Drill currently uses for data.
   
   


[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004748894


   Charles Givre wrote:
   
   Thanks Ted for the perspective!  I had always wished to be a "fly on the wall" in those conversations.  
   -- C
   


[GitHub] [drill] jnturton edited a comment on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton edited a comment on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1008235930


   @paul-rogers I can imagine vectorisation helping hash joins because of all the hashing to be crunched. I haven't got a good paper to cite, but a search from my phone did turn up at least one story along these lines.
   
   https://www.cockroachlabs.com/blog/vectorized-hash-joiner/
   
   Regarding pivoting between row and column orientation depending on the operator, I would expect all the ensuing memory copying to outweigh any gain, but again I bring no references or data.
   
   EDIT: There are quite a few open access papers returned by "sql join SIMD". I've got some bedtime reading to do.


[GitHub] [drill] Leon-WTF commented on issue #2421: [DISCUSSION] ValueVectors Replacement

Posted by GitBox <gi...@apache.org>.
Leon-WTF commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1013839564


   @paul-rogers Thanks Paul for sharing your knowledge, but I don't quite understand the example of the serialized sort and merge. Why are the sort and merge in the same major fragment (f4)? And are they also on the same machine? Is there more than one merge fragment, not just one? Could you please share more details about it? Thanks.


[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004750707


   Paul Rogers wrote:
   
   All: so I've kicked the hornet's nest with the mention of value vectors and Arrow. I'm going to put on my flame-proof suit and debunk some myths.
   
   The columnar format is great for storage, for all the usual reasons. This is why Parquet uses it, Druid uses it for segment files, and various DBs use it for storage. The question we want to ask is, do those benefits apply to the format within the Drill execution engine? I'm here to suggest that columnar has no advantage, and many disadvantages, when used as the *internal* format of an execution engine. "Thems is fighting words", so let's bring it on.
   
   I've had the pleasure of working with several query engines: Drill (columnar) and Impala (row-based) are two well-known examples. This has given me a unique opportunity to see if all the marketing claims for columnar (which still appear in the videos on Drill's website) actually hold up in practice. Spoiler: they don't.
   
   This is a PR about optimization. A good rule in optimization is to start with the biggest issues, then work toward the details. So, rather than tinker with the details of vector execution, let's look at the fundamental issues. I hope this will help us avoid confusing Drill's (and Arrow's) marketing with reality.
   
   **Myth: Vectorized execution**: The biggest myth is around vectorized execution. Of course, a minor myth is that Drill uses such execution (it doesn't.) The bigger myth is that, if we invested enough, it could.
   
   Vectorized execution is great when we have a simple operation we apply to a large amount of data. Think the dot-product operation for neural networks, or data compression, or image transforms, or graphics. In all cases, we apply a simple operation (rescale, say) to a large amount of homogeneous data (the pixels in an image.)
   
   So, the question is, does typical, real-world SQL fit this pattern? I've now seen enough crufty, complex, messy real-world queries to suggest that, no, SQL is not a good candidate for vectorization. `SELECT` and `WHERE` clauses embed business logic, and that logic is based on messy human rules, not crisp, clean mathematics. The resulting SQL tends to have conditionals (`WHEN` or `IF()`, etc.), lots of function calls (all those cool UDFs which @cgivre has written), and so on. Plus, as noted above, SQL deals with NULL values, which must short-circuit entire execution paths.
   
   Hence, even if we could vectorize simple operations, we'd find that, in most queries, we could not actually use that code.
   
   **Myth: Vectors are CPU Cache Friendly**: The second big myth is that vectors are somehow more "friendly" to the CPU L1 cache than a row format. The idea is that one can load a vector into the L1 cache, then zip through many values in one go. This myth is related to the above one.
   
   First, SQL expressions are not based on columns, they are based on rows. Each calculation tends to involve multiple columns: `net_receipts = sales + taxes - returns`. Here each calculation touches four vectors, so we need all four to be in the CPU cache to benefit.
   
   Second, SQL is row based: the above calculation is just one of perhaps many that occur on each row. In the ideal case, the calculations form independent groups: `SELECT a + b AS x, c - d + e AS y, f / g AS z, ...`. In this case, we could load vectors `a`, `b`, `x` into the L1 cache, do the calcs, then load `c`, `d`, `e` and `y` into the cache, and so on. Of course, Drill doesn't work this way (it does all the calculations for a single row before moving to the next), but it could, and it would have to in order to benefit from vectorization.
   
   A more typical case is that the same column is used in multiple expressions: `SELECT a + b AS x, a / c AS y, (a - d) * e AS z, ...` In this case, we must load the `a` vector into the L1 cache multiple times. (Or, more properly, its values would continually be bumped out of the cache, then reloaded.)
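   
   To make the contrast concrete, here is a toy Java sketch of the two evaluation orders (plain arrays standing in for vectors; illustrative only, not Drill code):
   
   ```java
   // Toy illustration only: primitive arrays stand in for value vectors.
   public class EvalOrder {
     // Columnar (vector-at-a-time): two passes over `a`. If the vectors are much
     // larger than the L1 cache, the second pass re-fetches `a` from L2/RAM.
     static void columnar(double[] a, double[] b, double[] c, double[] x, double[] y) {
       for (int i = 0; i < a.length; i++) { x[i] = a[i] + b[i]; }
       for (int i = 0; i < a.length; i++) { y[i] = a[i] / c[i]; }
     }
   
     // Row-at-a-time: each a[i] feeds both expressions while it is still hot.
     static void rowAtATime(double[] a, double[] b, double[] c, double[] x, double[] y) {
       for (int i = 0; i < a.length; i++) {
         x[i] = a[i] + b[i];
         y[i] = a[i] / c[i];
       }
     }
   }
   ```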
   
   **Myth: Bigger Vectors are Better**: Drill went through a phase when everyone bought into the "L1 cache" myth. To get better performance everyone wanted ever larger vectors. In the code, you'll see that we started with 1K-row batches, then it grew to 4K, then other code would create 64K row batches. It got so bad we'd allocate vectors larger than 16MB, which caused memory fragmentation and OOM errors. (This is the original reason for what evolved to be "EVF": to control vector sizes to prevent memory fragmentation - very basic DB stuff.)
   
   Remember, the CPU L1 cache is only about 256K in size. A 4MB vector is already 16x the L1 cache size. Combine that with real-world expressions and we end up with a "working set" of 10s of MB in size: 20x or more the L1 cache size. The result is lots of cache misses. (This stuff is really hard to measure, would be great for someone to do the experiments to show this happening in practice.)
   
   **Myth: Vectors are Efficient**: A related, minor myth is that writing to vectors is more efficient than writing to rows. This is not true in general. In Drill, it is especially false. Originally, vectors were allocated at some small initial size (256K? Something like that.) Vectors grow as we add data. (That's one of the things that make working with them difficult.) Fill the 256K vector? We double it to 512K and copy across the data. We do it again at 1MB, 2MB, 4MB, ... In the end, to grow to 4MB, we copy about 4MB of data. That is 4MB of reads and 4MB of writes, in addition to the 4MB of writes needed to create the data.
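   
   As a back-of-the-envelope check on that claim (the sizes are the illustrative ones from the paragraph above):
   
   ```java
   // Illustrative arithmetic only: grow a vector from 256K to 4MB by doubling.
   public class GrowthCost {
     public static void main(String[] args) {
       int capacity = 256 * 1024;
       long bytesCopied = 0;
       while (capacity < 4 * 1024 * 1024) {
         bytesCopied += capacity;   // each doubling copies the entire current contents
         capacity *= 2;             // 256K -> 512K -> 1MB -> 2MB -> 4MB
       }
       System.out.println(bytesCopied);  // 3932160: roughly 4MB read and rewritten
     }
   }
   ```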
   
   Later, a bunch of ad-hoc "batch sizing" code was added to try to guess a good initial size for vectors. Not too hard for fixed-width vectors (`INT`, say), but rather tricky for variable-width vectors (`VARCHAR`).
   
   Remember that each vector is sized and grows independently. So, to create a batch, we have to track the size of every vector, grow those that need growing, but not over-allocate all the vectors because the space is not fungible: vector `a` can't "borrow" unused space from vector `b`.
   
   The point is, Drill needs a large amount of complex, add-on code just to work around the fact that every vector will grow, and copy data, if not sized correctly, and, in general, we don't know ahead of time what the size should be. The result is inefficiency.
   
   **Myth: Vectors Allow Efficient Parquet Integration**: The idea here is that Parquet is columnar, vectors are columnar, so we just read Parquet directly into vectors. The reality is that Parquet is a highly-encoded format that requires a large amount of complex code to decode. Drill does the decoding value-by-value, there is no "direct copy", nor could there be. Parquet works with Drill's value vectors, Arrow vectors, or Impala rows equally well: in each case, the Parquet data has to be decoded value-by-value, then written to the target format.
   
   **Single Data Format for In-Memory and Over-the-Wire**: One of Drill's claims to fame is that value vectors (and, later, Arrow vectors) use the same layout in memory as over the wire, leading to efficient exchanges via RPC. This myth is true, as far as it goes. But, if you look at the code, the truth is much more complex. On the sender side, vectors are independent buffers (as explained above.) On the receiver side, the whole message comes in as one big buffer. Special code slices up that buffer to recover vectors. A vast amount of complex code is needed to handle the accounting. (Thanks to Chris, a one-time Drill contributor who wrote all that code well enough that it "just works". It would be hell to fix otherwise.)
   
   **Complex Client Code**: All DB clients work a row at a time. Whether it is JDBC, SQLAlchemy, your home-grown code, or whatever, a client consumes data a row at a time. All DB APIs (except Drill's) are row-based. The client asks for the next row (or, more typically, the next *n* rows), then reads them one-by-one. Super simple, and it has worked for decades.
   
   With Drill, a client has to include all of Drill's complex vector logic so it can read a batch of a 1K (or 4K or more) rows split across vectors. The client then has to walk the vectors to assemble the row that the client really wants. The result is that Java clients (such as JDBC) pull in a vast amount of Drill code. The C++ ODBC client was a nightmare: pretty much only one person ever could make it work (thanks, Parth!).
   
   The complexity of this simple client operation has led clients to use the REST API instead. But Drill is session-based (needed to set Drill's zillions of session options), while REST is not. Drill works for "big data", but REST delivers the results in one big blob. The result is that the vector-based client is so hard to use that folks want to use the REST client, which doesn't work in the general case either. We've shot ourselves in the foot. Yet, every other DB on the planet has a simple row-based client API.


[GitHub] [drill] paul-rogers edited a comment on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
paul-rogers edited a comment on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1007611673


   @jnturton,  one could do something like what you described. However, to have all of Drill work with Arrow would be a huge amount of work. Optimizations made for one format would be sub-optimal for the other. (Example: exchanges.) Furthermore, your use case would benefit from vectors only in the project and grouping operators.
   
   So, I wonder if we might think about the problem operator-by-operator. If you have a compute-heavy phase, might that first transform data to vectors, apply the compute, then send data along in row format? Every fragment does a network exchange: data is read/written anyway. So, perhaps there is something that can be done to transform formats at fragment boundaries (he says, waving hands wildly...)
   
   You'll also get speed only for queries without joins. If you have joins, then the joins are likely to take the vast majority of the runtime, leaving your projection and grouping in the noise. I'm not sure how vectorization can help joins; certainly in Drill today, vectors make the join code atrociously complex.
   
   This is why DBs (and compiler optimizers) are hard: the answers change based on use case...


[GitHub] [drill] paul-rogers commented on issue #2421: [DISCUSSION] ValueVectors Replacement

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1014927069


   @Leon-WTF, you ask a good question. I'm afraid the answer is rather complex: hard to describe in a few words. Read if you're interested.
   
   Let's take a simple query. We have 10 scan fragments feeding 10 sort fragments that feed a single merge/"screen" fragment. Drill uses a random exchange to shuffle data from the 10 scan fragments to the 10 sorts, so that every sort gets data from every reader, balancing the load. There is an ordered exchange from the sorts to the merge. In "Impala notation":
   
   ```text
   Screen
   |
   Merge
   |
   Ordered Receiver
   |   -   -   -   -   -   -   -   -
   Ordered Sender
   |
   Sort
   |
   Unordered Receiver
   |  -   -   -   -   -   -   -   -
   Random Sender
   |
   Scan 
   ```
   
   The dotted lines are an ad-hoc addition to represent fragment boundaries. Remember there are 10 each of the lower two fragments, 1 of the top.
   
   First, let's consider the in-memory case. The scans read data and forward batches to random sorts. Each sort sorts its incoming batches as they arrive and buffers them. When the sort sees EOF from all its inputs, it merges the buffered batches and sends the data downstream to the merge, one batch at a time.
   
   The merge needs a batch from each sort to proceed, so the merge can't start until the last sort finishes. (This is why a Sort is called a "blocking operator" or "buffering operator": it won't produce its first result until it has consumed all its input batches.)
   
   Everything is in memory, so each Sort can consume batches about as fast as the scans can produce them. The sorts all work in parallel, as we'd hope. The merge kicks in last, but that's to be expected: one can't merge without having first sorted all the data.
   
   OK, now for the spilling case, where the problems seem to crop up. (Full disclosure: I rewrote the current version of the Sort spilling code.) Now, data is too large to fit into memory for the sort. Let's focus on one sort, call it Sort 1. Sort 1 reads input batches until it fills its buffer. Then, Sort 1 pauses to write its sorted batches to disk. Simple enough. But, here is where things get tricky.
   
   All the scans must fill their output batches. Because we're doing random exchanges to the sorts, all outgoing batches are about the same size. Let's now pick one scan to focus on, Scan 2. Scan 2 fills up the outgoing batch for Sort 1, but Sort 1 is busy spilling, so Sort 1 can't read that batch. As a result, Scan 2 is blocked: it needs to add one more row for Sort 1, but it can't because the outgoing batch is full, and the downstream Sort won't accept it. So, Scan 2 grinds to a halt.
   
   The same is true for all other scans: as soon as they want to send to Sort 1 (which is spilling), they block. Soon, all scans are blocked. This means that all the other Sorts stop working: they have no incoming batches, so they are starved.
   
   Eventually, Sort 1 completes its spill and starts reading again. This means Scan 2 can send and start working again. The same is true of the other Scans. Now, the next Sort, Sort 2, needs to spill. (Remember, the data is randomly distributed, so all Sorts see about the same number of rows.) So, the whole show occurs again. The scans can't send to Sort 2, so they stall. The other sorts become starved for inputs, and so they stop.
   
   Basically, the entire system becomes serialized on the sort spills: effectively, across the cluster only one spill will be active at any time, and that spill blocks senders which blocks the other sorts. We have a 10-way distributed, single-threaded query.
   
   Now, I've omitted some details. One is that, in Drill, every receiver is obligated to buffer three incoming batches before it blocks. So, the scenario above is not exactly right: there is buffering in the Unordered Receiver below each Sort. But, the net effect is the same once that three-batch buffer is filled.
   
   Even here there is an issue: remember, each scan has to buffer 1024 rows for every sort. We have 10 scans and 10 sorts, so we're buffering 100 batches or 100K rows. And, the sorts have buffers of 3 batches each, so that's another 30 batches total, or 30K rows. All that consumes memory which is not available for the sort, hence the sort has to spill. That is, the memory design for Drill takes a narrow, per-operator view, and does not optimize memory use across the whole query.
   
   All of the above is conjecture based on watching large queries grind to a crawl. Drill provides overall metrics, but not metrics broken down by time slices, so we have no good way to visualize behavior. Using ASCII graphics, here's what we'd expect to see:
   
   ```text
   Sort 1: rrrsss____________r_______r_______rsss_________
   Sort 2: rr___rsss__________r_______r_______rsss________
   ....
   Scan 1: www__w____________www_____www_____www___
   Scan 2: www__w____________www_____www_____www___
   ```
   
   Where "r" means "read and sort a batch", "s" means "spill", "w" means "write a batch downstream, and "_" means "blocked or starved." Everything is parallel until the first spill, then everything serializes. There are 10 scans, so the extra wait time is for the other 8 scans to do their thing. All very complex!
   
   And, this problem grows with the square of the number of nodes. With 100 fragments, we buffer 10K batches of 1K rows each, or 10M rows sitting in send buffers. This should be screaming at us, "you're doing it wrong!"
   
   Another detail I omitted is that Drill has something called a "Mux Exchange". 
   ```text
   ...
   |
   Unordered Receiver
   |    -   -   -   -   -   -   -   -
   | Mux Sender
   |
   | Mux Receiver
   |    -   -   -   -   -   -   -   -
   Unordered Sender
   ...
   ```
   
   All the scans on a single node share a set of outgoing buffers. Now, nobody understands this beast, and it is implemented as a separate layer of exchanges, which seems to make things worse. I remember back in the day, someone tried to characterize its behavior. But, since we didn't understand it, we just kept trying "Mux On/Mux Off" without really knowing what was happening. Having a part of the system that no one understands is never a great way to proceed.
   
   Still, the Mux thing is a good idea even if the implementation is flawed. Ideally, it would reduce the number of outgoing batches from n^2 to just n. But, it adds so much complexity, with vectors, that the benefit could never be clearly seen.
   
   How would rows work better than vectors? With vectors, we build up, say, 1024 rows in each batch before sending. We can't send until the batch is full, but if any batch is full, we can't add another row until we send the batch we've got. This is too crude and leads to the above scenario. The Mux exchange can consume rows so that the scans don't have to build up full batches to forward to the Mux.
   
   Better would be to have a buffer of up to 1024 rows, but we send whatever we have after some time. Now, each scan can continue to send to the other sorts even while holding rows for Sort 1 while it spills. In this case, some of the other sorts will also fill their buffers and start to spill concurrently with Sort 1.
   
   Another idea is to do what Apex and other systems do: spill to disk if the receiver is blocked. (Or, equivalently, spill to disk on the receiver side if the consuming operator does not consume rows.)
   
   Yet another solution is to run the Sort in multiple threads so that it spills and consumes concurrently. (The current design is single-threaded and is pretty aggressive about how much is spilled: we inherited that logic from the "first-generation" sort spill code.)
   
   Now, I've focused on the Sort because I know that code best. But, the same is true for other blocking operators: joins, aggregations, etc.
   
   When Drill is used embedded, or at small scale, you will never see this issue. When Drill runs at large scale, so that spilling kicks in, then this issue becomes the dominant problem that prevents high performance.
   
   Sorry that this is all rather complex. Since it is complex, it goes unseen and unfixed. This is probably even a well-known problem in distributed computing circles: if anyone knows of a good reference, please post it.
   
   For Drill 2.0, we have to decide if we want to stay in the large-scale distributed system business, or just diddle around as an embedded desktop app to read PDF files and Excel spreadsheets. At scale, we have to fix the above issues. As a desktop app, we can rip out all the complex distributed cruft that no one understands and just be a decent several node solution.
   
   But, there is no path that says we compete with Presto/Trino/Impala, etc. at scale while focusing on desktop concerns. Distributed systems are *hard*, and we have to know what we're doing to offer an effective solution.


[GitHub] [drill] jnturton edited a comment on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton edited a comment on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1006287333


   Okay, @paul-rogers I've had a few swigs of the kool aid by now and I think I'm ready to forget about in-memory column orientation and SIMD in return for the benefits of row orientation.  For workflows that do involve bulk arithmetic I can imagine good interop taking care of that stage:
   
   1. Do some efficient parsing, filtering, sorting, aggregating in Drill
   2. Smoothly switch over to Pandas/Numpy (perhaps an Arrow exporter?) or Julia or ...
   3. Do bulk arithmetic using SIMD or even a GPU
   4. Store results or smoothly switch back to Drill
   
   I've used this workflow myself where the data interchange format was Parquet and the transport medium was the DFS (so perhaps a bit more "clunky" than "smooth", with lots of serialisation and IO incurred).
   
   Going further, if the decoupling of Drill from its in-memory format mentioned higher up is a real possibility then can we even imagine something like this, entirely in Drill?
   
   ```sql
   alter session set exec.memory_format = 'drill'; -- the default, row-oriented format
   
   create table as select ... -- do some efficient parsing, filtering, sorting, aggregating in Drill
   create table as select ... -- do some efficient parsing, filtering, sorting, aggregating in Drill
   
   alter session set exec.memory_format = 'arrow'; -- switch to Arrow format
   
   create table as select ... -- do some bulk arithmetic using SIMD
   create table as select ... -- do some bulk arithmetic using SIMD
   ```
   
   To my mind Drill 2.0 would not try to ship support for the latter, Arrow format, merely make design decisions which leave that door open for a motivated developer...


[GitHub] [drill] jnturton closed issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton closed issue #2421:
URL: https://github.com/apache/drill/issues/2421


   


[GitHub] [drill] Leon-WTF edited a comment on issue #2421: [DISCUSSION] ValueVectors Replacement

Posted by GitBox <gi...@apache.org>.
Leon-WTF edited a comment on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1017659713


   @paul-rogers Thanks Paul for your detailed explanation. One more question: why is Scan 2's sending to the other sorts also blocked when its sending to Sort 1 is blocked? Do you think it would be better to make each communication pair execute separately? I see that the PartitionSender can do the partition and flush tasks in a thread pool.


[GitHub] [drill] paul-rogers commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1005326255


   @luocooong, your picture about page layout and cache performance is helpful: it shows how some DBs lay out pages. (I did something similar way back when I wrote a DB.) However, for an in-memory format, we'd do the layout differently. There would be no header (that's a separate object). We'd ensure that all data within each row is self-contained: no pointers from the footer back into rows.
   
   Impala uses a format which, I believe, it borrowed from earlier DBs. Here is a variation, listing the various fields, in order:
   
   * Row size.
   * Fixed-width columns (INT, LONG, etc.) These have a fixed offset from the row start.
   * Offset and length of each variable-width field. These also have a fixed offset.
   * Variable portion with the variable-width data. (VARCHAR, etc.)
   
   Note that the order in which the variable fields are written doesn't matter: the offset/length pairs do the right thing. Just as Drill has a `BatchSchema` to say what vectors make up a batch, each block of rows would have a "row schema" to map from the logical schema (i.e. names and types) to offsets. Each fixed-width fetch is thus an addition (base + offset), plus a read/write. This is about the same as for vectors. (Vector address + row * size, then read.) The row data can, of course, reside in direct memory as for vectors.
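   
   For concreteness, here is a rough sketch of those two kinds of fetch against a page buffer (the layout and offsets are illustrative, not Impala's actual format):
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.charset.StandardCharsets;
   
   // Sketch of the layout above: [rowSize][fixed-width cols][(offset,len) pairs][var data].
   // The offsets would come from the per-block "row schema"; nothing here is real Drill code.
   public class RowAccess {
     // Fixed-width fetch: base + fixed offset, then one read.
     static long readBigInt(ByteBuffer page, int rowStart, int colOffset) {
       return page.getLong(rowStart + colOffset);
     }
   
     // Variable-width fetch: read the (offset, length) pair, then the bytes.
     static String readVarChar(ByteBuffer page, int rowStart, int pairOffset) {
       int dataOffset = page.getInt(rowStart + pairOffset);
       int dataLength = page.getInt(rowStart + pairOffset + 4);
       byte[] bytes = new byte[dataLength];
       ByteBuffer view = page.duplicate();      // don't disturb the shared buffer's position
       view.position(rowStart + dataOffset);
       view.get(bytes);
       return new String(bytes, StandardCharsets.UTF_8);
     }
   }
   ```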
   
   The result is that the entire row fits into the cache as a unit with no overhead cruft. In practice, each operator has an incoming and outgoing row (filter, project, probe phase of a join, hash sender, ...). Still, the rows should be small enough that both incoming and outgoing rows fit in the cache.
   
   Of course, the JVM will pull in byte code blocks, local variables, etc. So, the code has to also be designed carefully to avoid cache thrashing. That's what Drill's aggressive inlining and byte code fixups are supposed to do. Then, the number of threads has to be managed so we don't get swapped out just after we get our cache nicely set up.
   
   This stuff is HARD. Testing is essential. A somewhat-dated, but still helpful, source is [Martin Thompson's Mechanical Sympathy Blog](https://mechanical-sympathy.blogspot.com/).


[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004749499


   Paul Rogers wrote:
   
   Hi All,
   
   Thanks Charles for dredging up that old discussion, your memory is better
   than mine! And, thanks Ted for that summary of MapR history. As one of the
   "replacement crew" brought in after the original folks left, your
   description is consistent with my memory of events. Moreover, as we looked
   at what was needed to run Drill in production, an Arrow port was far down
   on the list: it would not have solved actual customer problems.
   
   Before we get too excited about Arrow, I think we should have a discussion
   about what we want in an internal storage format. I added a long (sorry)
   set of comments in that PR that Charles mentioned that tries to debunk the
   myths that have grown up around using a columnar format as the internal
   representation for a query engine. (Columnar is great for storage.) The
   note presents the many issues we've encountered over the years that have
   caused us to layer ever more code on top of vectors to solve various
   problems. It also highlights a distributed-systems problem which vectors
   make far worse.
   
   Arrow is meant to be portable, as Ted discussed, but it is still columnar,
   and this is the source of endless problems in an execution engine. So, we
   want to ask, what is the optimal format for what Drill actually does? I'm
   now of the opinion that Drill might actually benefit more from a
   row-based format, similar to what Impala uses. The notes even paint a path
   forward.
   
   Ted's description of the goal for Dremio suggests that Arrow might be the
   right answer for that market. Drill, however, tends to be used to query
   myriad data sources at scale and as a "query integrator" across systems.
   This use case has different needs, which may be better served with a
   row-based format.
   
   The upshot is that "value vectors vs. Arrow" is the wrong place to start
   the discussion. The right place is "what does our many years of experience
   with Drill suggest is the most efficient format for how Drill is actually
   used?"
   
   Note that Drill could have an Arrow-based API independent of the internal
   format. The quote from Charles explains how we could do that.
   
   Thanks,
   
   - Paul
   


[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004751568


   Paul Rogers wrote:
   
   One last note. Let's assume we wanted to adopt the row-based format (or, the myths being strong, we want to adopt Arrow.) How would we go about it?
   
   The "brute force" approach is to rewrite all the operators. Must deal with low-level vector code, so we'd rewrite that with low-level row (or Arrow) code. Since we can't really test until all operators are converted, we'd have to do the entire conversion in one huge effort. Then, we get to debug. I hope this approach is setting off alarm bells: it is high cost and high risk. This is why Drill never seriously entertained the change.
   
   But, there is another solution. The scan readers all used to work directly with vectors. (Parquet still does.) Because of the memory reasons explained above, we converted most of them to use EVF. As a result, we could swap vectors for row pages (or Arrow) by changing the low-level code. Readers would be blissfully ignorant of such changes because the higher-level abstractions would be unchanged.
   
   So, a more sane way to approach a change of in-memory representations is to first convert the other operators to use an EVF-like approach. (EVF for writing new batches, a "Result Set Loader" for reading existing batches.) Such a change can be done gradually, operator-by-operator, and is fully compatible with other, non-converted operators. No big bang.
   
   Once everything is upgraded to EVF, then we can swap out the in-memory format. Maybe try Arrow. Try a row-based format. Run tests. Pick the winner.
   
   This is *not* a trivial exercise, but it is doable over time, if we see value and can muster the resources.


[GitHub] [drill] paul-rogers commented on issue #2421: [DISCUSSION] ValueVectors Replacement

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1019575466


   One more factor to consider: memory management. Row-based solutions are far, far easier to memory-manage than vectors. Row solutions reduce internal fragmentation from 1/4 of the batch size to 1/2 of the (much smaller) row size.
   
   We talked early on about the difficulty of sizing vectors and batches. If a record has 20 columns, then each is a vector. If the columns are fixed-width (`INT`, `BIGINT`, etc.) then we can predict the size of a batch with, say, 1K rows. What we can't predict is the size of variable-width columns (`VARCHAR`, say), so sometimes we have a vector of size 1K (because all the `VARCHAR` values are one byte) and sometimes we have 1GB (because all the values are 1MB.) Drill now has lots of ad-hoc code to try to guess the size, which is much better than the early days, but still not great.
   
   We often want to go the other way: we want to control memory use, so we allocate, say, 1MB per batch. When we do this, if we know the sort is given 1GB of buffer space, then we know we can store 1024 batches. (Picking simple numbers here just for discussion: actual numbers will vary.) So, what do we do to limit overall batch size to 1MB?
   
   Before EVF, we couldn't and batch sizes could become large and cause intermittent OOM errors. With EVF, there is complex and elaborate code to detect when the next doubling of any vector could push us past the target size. EVF reports "your batch is full", and the batch is sent downstream. Today, this works only in the scan; all other operators retain their ad-hoc "batch sizer" based solutions. (Full disclosure: I originally wrote the batch sizer as a quick & dirty solution for the sort, hence its silly name. Little did I realize it would spread to other operators and still be with us years later: I thought EVF would have won the battle by now. Sigh...)
   
   When sizing vectors this way, we end up with a large amount of *internal fragmentation*: unused space within vectors. On average, the last half of a vector will be half used, meaning that the vector is, on average, 3/4 full. (The reason is subtle: we doubled the vector when it was full. After the doubling, the vector is half full. Variable-width vector sizes are uncorrelated, so, on average, we'll have half-filled the second half when the batch as a whole fills.) So, with our 1MB batch, about 1/4 or 256K goes unused. That's pretty inefficient!
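   
   As a sanity check on that 3/4 figure, a quick Monte Carlo sketch (assuming pure doubling growth and a uniformly random final value count; not Drill code):
   
   ```java
   import java.util.Random;
   
   // Simulates the fill fraction of a doubling vector at the moment a batch closes.
   public class FillCheck {
     public static void main(String[] args) {
       Random rnd = new Random(42);
       double fillSum = 0;
       int trials = 1_000_000;
       for (int t = 0; t < trials; t++) {
         int values = 1 + rnd.nextInt(1 << 20);    // random number of values in the batch
         int capacity = 1;
         while (capacity < values) capacity *= 2;  // the vector doubled until the data fit
         fillSum += (double) values / capacity;    // fill fraction when the batch closes
       }
       System.out.printf("average fill = %.3f%n", fillSum / trials);  // prints ~0.75
     }
   }
   ```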
   
   Now, consider a row-based solution. Again we allocate 1MB for our buffer. We fill it with rows, end-to-end. When the buffer fills, we will, on average, have written half a row. We shift that row to a new buffer and send the current one on its way. Now, our fragmentation is 1/2 the row width. If the row width is << the buffer size, then we're making much better use of our scarce memory resource (especially in buffering operators such as sort, join, and aggregation.)
   
   Lest you think this is somehow a Drill bug, recall how Parquet is built. The entire data set is buffered in memory before it is written to the file. Why? Parquet has no idea how large a column or row group will be until the columnar data blocks are actually created. To pack data optimally in the file, it is sub-optimally all buffered in memory prior to writes. In a way, this Parquet buffering (and the resulting memory pressure) is very similar to the exchange buffering we discussed in prior notes.
   
   There is another huge win for row-based buffers: they are all of the same size (or, if we get very fancy, of a few sizes.) Rather than needing a malloc-style variable-block size allocator, we can use a much simpler DB-style buffer pool. New buffers come out of the pool, retired buffers go back in. We know exactly how many buffers we have, and can easily allocate quantities to queries and fragments. All the guesswork evaporates from sort, join and aggregate planning: they instead know exactly how many buffers they have to work with, and can plan spilling accordingly. None of this is new; you'll find it written up in any book on DB internals.
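   
   For illustration, such a pool really can be this simple (a sketch with invented names, not Drill's actual allocator):
   
   ```java
   import java.nio.ByteBuffer;
   import java.util.concurrent.ArrayBlockingQueue;
   
   // DB-style pool of fixed-size buffers: take() blocks when the pool is empty,
   // which gives natural backpressure instead of an OOM.
   public class BufferPool {
     private final ArrayBlockingQueue<ByteBuffer> free;
   
     public BufferPool(int bufferCount, int bufferSize) {
       free = new ArrayBlockingQueue<>(bufferCount);
       for (int i = 0; i < bufferCount; i++) {
         free.add(ByteBuffer.allocateDirect(bufferSize));  // every buffer is the same size
       }
     }
   
     public ByteBuffer take() throws InterruptedException {
       return free.take();            // new buffers come out of the pool
     }
   
     public void release(ByteBuffer buf) {
       buf.clear();
       free.add(buf);                 // retired buffers go straight back in
     }
   
     public int available() {
       return free.size();            // operators can plan spilling from this count
     }
   }
   ```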
   
   Thus, in just about every dimension, columnar storage is suboptimal as an internal query engine data format. The Impala guys knew this and benefited from their row-based structure. Drill is still suffering from its early misunderstanding that the benefits of columnar storage somehow translate into benefits as an internal data format.


[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004751317


   Paul Rogers wrote:
   
   OK, so the above raises the issues we have to consider when thinking about vectors (Drill's or Arrow's.) What is the alternative?
   
   Here, I think Impala got it right. Impala uses Parquet (columnar) for *storage*, but rows for *internal* processing. Impala is like an Italian sports car of old: spends lots of time in the shop, but when it works, it is very fast.
   
   One of the reasons Impala is fast is its row format. First, let's describe what "row-based" means. It means that columns appear together, as in a C `struct`, with rows packed one after another as in an array of `structs`. This means that the data for a given row is contiguous. There is only one buffer to size. Classic DB stuff that seems unusual only because we're all used to Drill's vector format.
   
   Let's look at the same issues above, but from a row-based perspective.
   
   **Expression Execution**: With a row-based model, the CPU can easily load a single row into the L1 cache. All our crufty-real-world expression logic works on that single row. So, no matter how messy the expressions, from the CPU's perspective, all the data is in that single row, which fits nicely into the cache.
   
   Rows can be small (a few dozen bytes) or large (maybe 10s of KB for long VARCHARs). In either case, they are far smaller than the L1 cache. The row is loaded. Once we move on to the next row, we'll never visit the previous one, so we don't care if the CPU flushes it from the cache.
   
   **Memory Allocation**: Rows reside in buffers (AKA "pages"), typically of a fixed size. A reader "pours" data into a row. When the page is full, that last record is copied to the next page. Only that one row is copied, not all the existing data. So, we eliminate the 1X copy + 1X load problem in Drill. Since there is only one page to allocate, memory is simpler. Since pages are of fixed size, memory management is simpler as well.
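   
   A sketch of that "pouring" behaviour (names invented; simplified so each row arrives as ready-made bytes, making the "copy the last record" step a single put into the fresh page):
   
   ```java
   import java.nio.ByteBuffer;
   import java.util.function.Consumer;
   
   // Fills fixed-size pages with rows; completed rows are never moved again.
   public class PageWriter {
     static final int PAGE_SIZE = 1 << 20;   // 1MB pages, an illustrative size
     private ByteBuffer page = ByteBuffer.allocateDirect(PAGE_SIZE);
     private final Consumer<ByteBuffer> downstream;
   
     public PageWriter(Consumer<ByteBuffer> downstream) { this.downstream = downstream; }
   
     public void writeRow(byte[] row) {
       if (page.remaining() < row.length) {  // page full: ship it, start a new one
         page.flip();
         downstream.accept(page);
         page = ByteBuffer.allocateDirect(PAGE_SIZE);
         // only this one in-flight row lands in the new page; nothing else is copied
       }
       page.put(row);
     }
   }
   ```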
   
   **Exchanges**: Network exchanges are row-based. Rows are self-contained. A network sender can send single rows, if that is efficient, or batches of rows. In our 100-senders-100-receiver example, we could send rows as soon as they are available. The receiver starts working as soon as the first row is available. There is no near-deadlock from excessive buffering.
   
   Yes, we would want to buffer rows (into pages) for efficiency. But, in extreme cases, we can send small numbers of rows to keep the DAG flowing.
   
   **Clients**: As noted above, row-based clients are the classic solution and are simple to write. We could easily support proper clients in Python, Go, Rust and anything else if we used a row-based format.
   
   **Conclusion**: We tend to focus on the "value vector vs. Arrow" discussion. I'm here to say that that is the wrong question: it buys into myths which have hurt Drill for years. The *correct* question is: what is the most efficient format for the use cases where Drill wants to excel? The above suggests that, rather than Arrow, a better solution is to adopt a row-based internal format.


[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004748679


   Ted Dunning wrote:
   
   As a little bit of perspective from somebody who *was* at MapR at the time,
   here are my recollections.
   
   Arrow is pretty much the value vectors from Drill with some lessons learned
   and all dependencies removed so that Arrow can be consumed separately from
   Drill.
   
   The spinout of the Dremio team didn't happen because of the lack of
   integration with Arrow ... it was more the other way around ... because a
   significant chunk of the Drill team left to form Dremio, the driving force
   that could have pushed for integration just wasn't around any more because
   they were off doing Dremio and were no longer doing much work on Drill.
   The motive for the spinout had mostly to do with the fact that Tomer and
   Jacques recognized the opportunity to build a largely in-memory analytical
   engine based on zero serialization techniques and also recognized that this
   could never be a priority for MapR because it was outside the center of
   mass there. Once the Dremio team was out, though, they had a huge need for
   interoperability with systems like Spark and Cassandra, and they needed to
   not impose all of Drill as a dependency if they wanted these other systems
   to take on Arrow.
   
   This history doesn't really impact the merits or methods of integrating
   present-day Drill with Arrow, but it is nice to get the story the right way
   around.
   


[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004749229


   Christian Pfarr wrote:
   
   Hi Charles, Ted, and the others here,
   
   it is very interesting to hear about the evolution of Drill, Dremio and Arrow in that context, and thank you Charles for restarting that discussion.
   
   I think, and James mentioned this in the PR as well, that Drill could benefit from the continuous progress the Arrow project has made since its separation from Drill. The Arrow community seems to be large, so I assume the improvements and new features will keep coming, but I don't have enough experience with Drill internals to have an idea of how much refactoring this would lead to.
   
   In addition to that, I'm not aware of the current roadmap of Arrow and whether it would fit into Drill's roadmap. Maybe Arrow will go in a different direction than Drill, and what should we do then, if Drill is bound to Arrow?
   
   On the other hand, Arrow could help Drill to wider adoption with clients like pyarrow, Arrow Flight, various other programming languages, etc. And (I'm not sure about this) maybe there is a performance benefit if Drill uses Arrow to read data from HDFS (for example), uses Arrow to work with it during execution, and hands the vectors directly to my Python (for example) program via Arrow Flight, so that I can play around with Pandas, etc.
   
   Just some thoughts I have, since I have used Dremio with pyarrow and Drill with ODBC connections.
   
   Regards
   Christian


[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004749364


   Ted Dunning wrote:
   
   Christian,
   
   Your thoughts are very helpful. I find Arrow very nice (I use it in Agstack
   with Julia and Python).
   
   I don't think anybody is saying that Drill wouldn't be well served by a
   switch to Arrow or even just interfaces to Arrow. But it is a lot of work
   to make it all happen.


[GitHub] [drill] paul-rogers commented on issue #2421: [DISCUSSION] ValueVectors Replacement

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1019033520


   @Leon-WTF, good question, as always. We don't actually *want* any scan to be blocked: this is the problem we need to fix. Every scan fragment has a set of outgoing buffers. The problem is, in normal Drill operation, we don't send them until they are full. But, to absorb delays, we want to keep them empty most of the time, and allow them to absorb rows when the receiver is blocked.
   
   Our example had 10 sort fragments, so there are 10 outgoing buffers per scan fragment (ignoring the Mux exchange.) Let's look at this from the scan's perspective. It can fill each outgoing buffer to, say, 1024 rows.
   
   The scan chugs along, producing rows. The random exchange assigns each to one of the outgoing buffers and copies it across. (Note the copy: we get no benefit from vectors during this operation.)
   
   Our data is evenly distributed, so we start with 0 rows per outgoing buffer. After a while, most buffers have, say, 3 rows. Some have more, some less, since things are random. We chug along some more. Now we're up to an average of 1000 rows. Pretty soon, one of the buffers fills to 1024.
   
   At this point, the scan fragment can't add another row to that buffer, so off the buffer goes across the wire to its target sort fragment. All good. This continues, the next buffer gets to 1024 and is sent off. Again, because of even distribution, pretty soon all are full and sent. Now the scan can chug along and refill them.
   
   Now, let's say that Sort 1 can only buffer 9 incoming batches, so it now wants to spill when, say, Scan 8 sends it the 10th batch.
   
   Now, back to Scan 1: it has filled its buffers again, and it happens that the one for Sort 1 (the one which is spilling) is filled first. So, it has to be sent off before the Scan can produce another row. Why? There is no other place to put that row, and we can't get to the row behind that until we dispose of the current row. But, the destination is blocked, doing a spill (ignoring the three-buffer rule), so the full buffer can't be sent, and so we can't add to it, since it is full.
   
   So, here's the heart of the problem. Scan 1 has just produced a row (or a batch with a row) that must go to Sort 1. The outgoing buffer is full, but it can't be sent. Scan 1 can't go any further until it can get rid of that current row. So...it...waits...
   
   In effect, our buffer to absorb delays can sometimes be as small as 0 or 1 rows (though it will average to 512 rows.)
   
   Even if we have separate channels (which we do), we can't use them. We could send off all the other partly-full buffers and it still wouldn't help us with that one we most care about: the one that has to take the current row but can't (because it is full and we're blocked sending that buffer). Sending the other buffers _might_ get the other sorts to spill sooner, so we get concurrent spilling. (Something for us to consider as a partial fix.)
   
   This is what I meant by a 10-way distributed query that begins to act as a single thread. Each scan can only send as fast as the slowest sort. Any spilling sort blocks all scans, which blocks all sorts. Definitely A Bad Thing.
   
   Now, if we have a row-based exchange, things might work better. Each scan might now have a rule:
   
   * Keep a buffer of up to 1024 rows (just picking a number).
   * Send the buffer if it is full, or if it is non-empty and has not been sent in the last 100 ms (pick any time interval); see the sketch after this list.
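   
   A minimal sketch of that rule (all names invented; this is not Drill's actual sender API):
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Send on "full", or on "timeout while non-empty". A timer thread is assumed
   // to call maybeFlush() periodically so stale rows don't sit forever.
   public class OutgoingBuffer<R> {
     static final int MAX_ROWS = 1024;       // just picking a number
     static final long MAX_DELAY_MS = 100;   // pick any time interval
   
     public interface Sender<T> { void send(List<T> rows); }
   
     private final List<R> rows = new ArrayList<>();
     private final Sender<R> sender;
     private long lastSendMs = System.currentTimeMillis();
   
     public OutgoingBuffer(Sender<R> sender) { this.sender = sender; }
   
     public synchronized void add(R row) {
       rows.add(row);
       maybeFlush();
     }
   
     public synchronized void maybeFlush() {
       boolean full = rows.size() >= MAX_ROWS;
       boolean stale = !rows.isEmpty()
           && System.currentTimeMillis() - lastSendMs >= MAX_DELAY_MS;
       if (full || stale) {
         sender.send(new ArrayList<>(rows));  // ship whatever we have
         rows.clear();
         lastSendMs = System.currentTimeMillis();
       }
     }
   }
   ```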
   
   With that rule (properly tuned), the scans will typically have mostly empty send buffers. When Sort 1 starts spilling, Scan 1 can buffer up to 1024 rows for that sort, while also sending partial buffers to the other sorts, to get them closer to spilling. Once Sort 1 unblocks, its rows are sent, and maybe some other sort's outgoing buffer starts to fill. We get closer to doing spills in parallel rather than sequentially.
   
   In this way, we normally have empty buffers, and have lots of margin to absorb blocking. Contrast this to the current scheme in which we have no leeway when a buffer fills and the receiver blocks.
   
   I don't have hard numbers, but I suspect (based on my experience with Impala) that the row-based scheme would not fall prey to the single-threading that the vector-based approach does. (And, of course, we could implement the above idea with vectors, but it would normally lead to small batches downstream, which might cause unexpected other problems in the current design.)
   
   To verify this, we need more detailed instrumentation (a time series of events, not just a summary), and a beefy cluster, with a large data set, to run tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1007611673


   Hi James,
   
   One could do something like what you described. However, to have all of
   Drill work with Arrow would be a huge amount of work. Optimizations made
   for one format would be sub-optimal for the other. (Example: exchanges.)
   Furthermore, your use case would benefit from vectors only in the project
   and grouping operators.
   
   So, I wonder if we might think about the problem operator-by-operator. If
   you have a compute-heavy phase, might that first transform data to vectors,
   apply the compute, then send data along in row format? Every fragment does
   a network exchange: data is read/written anyway. So, perhaps there is
   something that can be done to transform formats at fragment boundaries (he
   says, waving hands wildly...)
   
   You'll also get speed only for queries without joins. If you have joins,
   then the joins are likely to take the vast majority of the runtime, leaving
   your projection and grouping in the noise. I'm not sure how vectorization
   can help joins; certainly in Drill today, vectors make the join code
   atrociously complex.
   
   This is why DBs (and compiler optimizers) are hard: the answers change
   based on use case...
   
   Thanks,
   
   - Paul
   
   On Wed, Jan 5, 2022 at 9:03 PM James Turton ***@***.***>
   wrote:
   
   > Okay, @paul-rogers <https://github.com/paul-rogers> I've had a few swigs
   > of the kool aid by now and I think I'm ready to forget about in-memory
   > column orientation and SIMD in return for the benefits of row orientation.
   > For workflows that do involve bulk arithmetic I can imagine good interop
   > taking care of that stage:
   >
   >    1. Do some efficient parsing, filtering, sorting, aggregating in Drill
   >    2. Smoothly switch over to Pandas/Numpy (perhaps an Arrow exporter?)
   >    or Julia or ...
   >    3. Do bulk arithmetic using SIMD
   >    4. Store results or smoothly switch back to Drill
   >
   > I've used this workflow myself where the data interchange format was
   > Parquet and the transport medium was the DFS (so perhaps a bit more
   > "clunky" than "smooth", with lots of serialisation and IO incurred).
   >
   > Going further, if the decoupling of Drill from its in-memory format
   > mentioned above is a real possibility then can we even imagine something
   > like this, entirely in Drill?
   >
   > alter session set exec.memory_format = 'drill'; -- the default, row-oriented format
   >
   > create table as select ... -- do some efficient parsing, filtering, sorting, aggregating in Drill
   > create table as select ... -- do some efficient parsing, filtering, sorting, aggregating in Drill
   >
   > alter session set exec.memory_format = 'arrow'; -- switch to Arrow format
   >
   > create table as select ... do some bulk arithmetic using SIMD
   > create table as select ... do some bulk arithmetic using SIMD
   >
   > To my mind Drill 2.0 would not try to ship support for the latter, Arrow
   > format, merely make design decisions which leave that door open for a
   > motivated developer...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton edited a comment on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton edited a comment on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1006287333


   Okay, @paul-rogers I've had a few swigs of the kool aid by now and I think I'm ready to forget about in-memory column orientation and SIMD in return for the benefits of row orientation.  For workflows that do involve bulk arithmetic I can imagine good interop taking care of that stage:
   
   1. Do some efficient parsing, filtering, sorting, aggregating in Drill
   2. Smoothly switch over to Pandas/Numpy (perhaps an Arrow exporter?) or Julia or ...
   3. Do bulk arithmetic using SIMD or even a GPU
   4. Store results or smoothly switch back to Drill
   
   I've used this workflow myself where the data interchange format was Parquet and the transport medium was the DFS (so perhaps a bit more "clunky" than "smooth", with lots of serialisation and IO incurred).
   
   Going further, if the decoupling of Drill from its in-memory format mentioned above is a real possibility then can we even imagine something like this, entirely in Drill?
   
   ```
   alter session set exec.memory_format = 'drill'; -- the default, row-oriented format
   
   create table as select ... -- do some efficient parsing, filtering, sorting, aggregating in Drill
   create table as select ... -- do some efficient parsing, filtering, sorting, aggregating in Drill
   
   alter session set exec.memory_format = 'arrow'; -- switch to Arrow format
   
   create table as select ... do some bulk arithmetic using SIMD
   create table as select ... do some bulk arithmetic using SIMD
   ```
   
   To my mind Drill 2.0 would not try to ship support for the latter, Arrow format, merely make design decisions which leave that door open for a motivated developer...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004751035


   Paul Rogers wrote:
   
   The last topic is so complex that no myth has grown up around it, and the issue is not at all well understood. Vectors (and batches) are hell for distributed system performance. This gets pretty techie, so hang on.
   
   **Vectors are Hell for Exchanges**:  This comes from a real-world case in which a large cluster worked no faster than a single thread of execution. We've discussed how Drill wants to create large batches (per the myth) to benefit from vectorization (which we don't have) and to optimize L1 cache usage (which, as we've seen, we don't actually do.) Let's assume "small" batches of 1K rows.
   
   Drill also wants a single format for in-memory and over-the-wire usage. This means we want to transfer 1K record batches so that the receiver gets batches of the optimal in-memory size.
   
   Now, what happens in a distributed system? Assume you have 100 fragments running. (Maybe 10 machines with 10 cores each.) Let's think about one fragment, call it "f0.0". Let's assume f0.0 runs a scan and a network sender. The scan builds up its 1K batches, because those are "efficient" (according to the myths we've discussed.)
   
   What does f0.0's network sender do? Let's assume the target is a hash join. So, the sender hashes the keys into 100 buckets. Now, the sender follows Drill's rule: send 1K record batches. Since there are 100 targets, the sender has to create 100 buffered batches, fill them each to 1K records, then send them. To visualize:
   
   `f0.0 (reader --> sender) - - > f1.x (receiver --> hash-join --> ...) ...`
   
   There are 100 f0 fragments: f0.0, ... f0.99, we're looking just at one of them: f0.0. The f0 "slice" sends to the "f1" slice that consists of 100 additional fragments: f1.0, ... f1.99.
   
   So, what happens in our sender? Assuming even hash distribution, we have to fill all our 100 outgoing batches before we can send them. This means we have to read 100 * 1K = 100K input records before we send the first outgoing batch. The result is huge memory usage (those 100 batches), plus all the vector resizes and copies we discussed (as we grow those batches.)
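   
   To make the arithmetic concrete, here is a sketch of such a sender (rows reduced to int[], and keyOf/sendTo invented for the example; not Drill's real code):
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   class HashSenderSketch {
     static final int RECEIVERS = 100;   // downstream fragments
     static final int BATCH_SIZE = 1024; // "optimal" batch size
   
     void run(Iterable<int[]> scan) {
       // One buffered outgoing batch per receiver: 100 of them.
       List<List<int[]>> outgoing = new ArrayList<>();
       for (int i = 0; i < RECEIVERS; i++) {
         outgoing.add(new ArrayList<>());
       }
       for (int[] row : scan) {
         int target = Math.floorMod(keyOf(row), RECEIVERS);
         List<int[]> batch = outgoing.get(target);
         batch.add(row);                    // a per-row copy: no vector win here
         if (batch.size() == BATCH_SIZE) {  // send only when completely full
           sendTo(target, batch);
           outgoing.set(target, new ArrayList<>());
         }
       }
       // With an even hash distribution, the first sendTo() fires only after
       // roughly RECEIVERS * BATCH_SIZE = 102,400 input rows have been read.
     }
   
     int keyOf(int[] row) { return row[0]; }
     void sendTo(int target, List<int[]> batch) { /* network send elided */ }
   }
   ```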
   
   If that were not bad enough, this occurs in all our other 99 f0 fragments: we've got 100 * 100 = 10K buffered batches waiting to send. Yikes!
   
   Now, what happens in f1? It is sitting around waiting for data. No f0 will send until it fills its first outgoing batch for that receiver. If we assume an even distribution of data, then the outgoing batches fill at about the same rate. None can be sent until one of them reaches the target size, at which point most of them are near-full. Once the first hits the 1K mark, off it goes to f1, which can finally start processing. This is bad because Drill claims to be highly distributed, but what we just described is a serial way of working.
   
   But, it gets worse! Now, assume we're deeper in the DAG, at a sort:
   
   `f4: (receiver --> sort --> sender) - - > f5: (receiver --> merge --> ...)`
   
   The sort sorts its slice of records, and sends it to the merge fragment which merges all the partial sorts. Classic distributed systems stuff. Again, the f4 (sort) sender waits to fill its outgoing batches, then it sends. The merge can't start until it sees batches from all 100 inputs. So, it proceeds at the rate of the slowest sort.
   
   Now what happens? The merge uses up one of the 100 input batches, and needs another before it can proceed. But, here things get really nasty.
   
   On the f4 side, f4.0, say, sent the first batch to fill. It then sent the others as they filled. Meanwhile, the first batch started refilling and eventually will need to be sent again. Since a merge can't read a new batch until it has used up the previous one, it blocks the f4 sender. As a result, f4 can't send to *any* other merge.
   
   The downstream fragment throttles the upstream, and vice versa. Not quite deadlock, but the entire system becomes serialized: the sort can't ship batches until the slowest merge can receive them. The merge can't make progress until the slowest sort provides the next batch. Every fragment depends on every other. Disaster!
   
   Again, we spent hours trying to figure this out on a customer cluster. We could see the effect, but we could not get in to work out the details. Would be great for someone to do the experiments.
   
   **Summary**: The above has debunked the major myths around columnar storage within a query engine. Note that **none** of the above changes if we use Arrow. We'd do a huge amount of work to switch, and be stuck with the same fundamental problems.
   
   Hence, we have to think deeply about this issue, not just buy the snake oil that "vectors are good for an execution engine." Good old solid engineering and experimentation will tell us what's what.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1005676173


   A few informational notes from me.  First, I wondered how it could be that benchmark-busting Impala manages to load SIMD registers from a row-oriented memory layout, then found the following [in here](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf).
   
   > We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions 
   
   I wonder whether they switched?  I suppose I need only go over to apache/impala to find out.  I can't speak for anyone else but some of my important Drill workloads actually have been bulk floating point arithmetic (statistics, probabilities) where you do get a major boost from SIMD.  That said, I probably did even more work doing filtering, sorting and aggregating the same data in Drill before it was ready for arithmetic.
   
   Second, if you did want to do SIMD properly in Java (i.e. not praying for auto-vectorisation), then are you looking at JEP 338, and do [you have some vector types from the JDK dictated to you](https://openjdk.java.net/jeps/338)?  Would these vectors work with our direct memory and Netty-based code?
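   
   For reference, a tiny sketch of what JEP 338 code looks like (the incubating jdk.incubator.vector module in JDK 16+, enabled with --add-modules jdk.incubator.vector; the element-wise float addition is just a stand-in workload):
   
   ```java
   import jdk.incubator.vector.FloatVector;
   import jdk.incubator.vector.VectorSpecies;
   
   class SimdAdd {
     // The JVM picks the widest lane width the hardware supports.
     static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
   
     static void add(float[] a, float[] b, float[] out) {
       int i = 0;
       int upper = SPECIES.loopBound(a.length);
       for (; i < upper; i += SPECIES.length()) {
         FloatVector va = FloatVector.fromArray(SPECIES, a, i);
         FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
         va.add(vb).intoArray(out, i); // one SIMD add per iteration
       }
       for (; i < a.length; i++) {     // scalar tail for the remainder
         out[i] = a[i] + b[i];
       }
     }
   }
   ```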
   
   Third, would our direct memory code remain unchanged if we replace ValueVectors, or would we want to start accessing direct memory through [JEP 393](https://openjdk.java.net/jeps/393)?  Or... are [the high tech garbage collectors in recent JVMs](https://blogs.oracle.com/javamagazine/post/understanding-the-jdks-new-superfast-garbage-collectors) so good that we could laugh off the burden of direct memory and its management entirely?
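   
   And, for comparison, roughly what JEP 393-style access looked like around JDK 16 (the incubating jdk.incubator.foreign API changed shape between JDK releases, so treat the exact calls below as approximate):
   
   ```java
   import jdk.incubator.foreign.MemoryAccess;
   import jdk.incubator.foreign.MemorySegment;
   
   class ForeignMemorySketch {
     static int roundTrip() {
       // 1 KiB of off-heap memory, freed deterministically by close():
       // no GC pressure and no sun.misc.Unsafe.
       try (MemorySegment segment = MemorySegment.allocateNative(1024)) {
         MemoryAccess.setIntAtOffset(segment, 0L, 42);
         return MemoryAccess.getIntAtOffset(segment, 0L);
       }
     }
   }
   ```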
   
   Re. SIMD, I know there are positions like "Forget SIMD" that have been stated, but I wanted to note this stuff down for us anyway.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1005316421


   @jnturton, thanks for converting the discussion to a ticket. Thanks especially for dealing with the formatting translation!
    
   @luocooong, thanks for the very impressive analysis, and the wonderful images!
   
   As noted earlier, columnar is a huge win for storage. It is less so for tools that work a row at a time, such as Drill or most data pipelines. It is easy to see why. A Parquet file may have 100 or more columns. Your query uses 10 of them, so that's all we read. A huge win. But, inside Drill, we'll mess with all 10 of your columns. You are probably doing filtering, calculations, aggregation, grouping, etc. There is a _reason_ you selected those columns: you want to work with them. And, Drill will do so a row at a time.
   
   Hence, the use pattern _within Drill_ seems to favor a row-based layout as we get few of the advantages of a columnar format, but we get all of the cost and complexity.
   
   Most SQL operators work a row at a time. So, even if the SQL is messy, and we have to touch the same column multiple times, the row itself can fit into the CPU cache and those redundant accesses are trivially fast. It would be very hard to do a similar optimization for columns.
   
   Your OLAP analysis is correct, but mostly applies to old-school Mondrian-style cubes on top of relational tables where the aggregation is clearly defined. Most of the queries I've seen for Drill (or Impala or other tools) are pretty ugly, with lots of business logic. If the user writes the SQL, they won't limit themselves to a small set of standard aggregations, they'll do whatever they need to do to get the job done. That, in fact, is why SQL endures decade after decade: as awful as it is, you can get your task done if you try hard enough.
   
   As a result, although we could vectorize simple cube-rollup calculations, in practice, we don't get many of those. So, we need to optimize for what people actually do, not the ideal cases.
   
   Let's also remember this: if all you need are simple roll-up calculations, and you run those queries again and again (so they are worth optimizing), you have a better choice: have your data pipeline write pre-aggregated "cubes". Want to see sales-by-product-by-day over the last year on billions of transactions? Just build the roll-up cube. It's what Druid does and it works well. With a Spark pipeline, you can create a detail table and a roll-up table, both in Parquet, and queried by Drill. So, if we look at the broader context, we see that the one place we'd benefit from vectorized operations might not even happen within Drill; it would be done in Spark, in the data pipeline.
   
   That said, we should also remember operators that might benefit from a columnar layout: computing a hash key or a sort key. In this case, we could have a hybrid approach: a "row group" with just the required fields (the keys), and a separate "row group" with the "payload" of other fields. Some DBs use a similar approach (for storage.) With a bit of pipelining, the preceding operator could produce its output rows in the format optimal for the next operator.
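   
   As a strawman, such a hybrid batch might look something like the sketch below (names invented, just to make the idea concrete): the keys sit in one contiguous array so a hash or sort loop touches nothing else, while the payload rides along row-wise.
   
   ```java
   class HybridBatch {
     final long[] keys;       // key "row group": contiguous, cache-friendly
     final byte[][] payload;  // one opaque row-wise payload per record
   
     HybridBatch(long[] keys, byte[][] payload) {
       this.keys = keys;
       this.payload = payload;
     }
   
     // Tight loop over the key column only; the payload is never touched.
     int[] hashKeys() {
       int[] hashes = new int[keys.length];
       for (int i = 0; i < keys.length; i++) {
         hashes[i] = Long.hashCode(keys[i]);
       }
       return hashes;
     }
   }
   ```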
   
   It would be great if we could mock up some simple test cases and try out the ideas. Doing so is not simple at present, but would be possible once we did the intermediate raw-vector-to-EVF conversion. Our experience tells us what we might expect to find, but we need to do the experiments to tell us what's actually true.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1008510147


   @jnturton, I think you're starting down a slippery slope, one that will end up with you convinced to simply move to Arrow (or enhance Drill's value vectors.) You are assuming that you can get a big win from vectorization in selected compute or hash operations -- at no cost elsewhere. The gist of my argument was that, if that were true, Drill should already be in good shape: we'd only need to add some SIMD code and we'd rock. You're also assuming that there are SIMD hash functions: I'm not sure those exist; a search came up with somewhat random results. (SHA exists, however.)
   
   The hard truth is that vectors are good in very limited cases: those you outlined. As said above, Drill does far more than that. For all those other things, vectors are hugely complex and expensive.
   
   So, if we ignore the parts where vectors are inefficient, and ignore the operations that don't benefit from vectors, and ignore the complexity of vectors in client code and their horrible effect on exchanges... and instead focus on some ideal cases where vectors might be faster... then, yes, we end up back where we are today. This is the classic marketing position: vectors are great, if we ignore the bad stuff. The problem is, in making all those assumptions, we ignore the actual reality of what we've learned over many years.
   
   Until I see some numbers, I'm not convinced that there is ever a case, other than in ideal lab conditions, where the gain from vectors in some operations makes up for the cost everywhere else. Still, if we wanted to find out, we could tinker a bit:
   
   * Limit vector batches to a decent number of records, say 1K. Avoid the 4K, 8K or 64K behemoths. That will be friendlier on memory usage and deliver results to downstream fragments faster. And, since we don't actually have SIMD support, our SIMD features remain unaffected.
   * In exchanges, limit outgoing batches to a small row count. Add a timeout: if the batch is not filled in, say, x seconds, ship what we have to avoid near-deadlock at scale.
   * Since the Gandiva code from Arrow has to, at its core, work on a block of direct memory, use it for a few select operations. That is, if we see that a projection consists only of your ideal cases, generate Gandiva code instead of Java code. If the operations are real-world messy, then stick with the Java code we have.
   * Even if we moved toward a row-based approach, keep the ability for a row to be comprised of multiple "column groups", each as individual columns or as groups of columns. For example, when reading from a file, one "column group" contains the file data, another has the "implicit" fields. If we do this, we can split out single columns if we want to run the hash (or other) function efficiently.
   
   There is no free lunch. If we focus only on the ideal cases, then any gain we get in those cases is more than given back by the overall complexity and slowness of other cases.
   
   At some point, someone's got to actually do some benchmarks to gather actual facts...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004748290


   Charles Givre wrote:
   
   Hello all, 
   There was a discussion in a recently closed PR [1] between z0ltrix, James Turton and a few others about integrating Drill with Apache Arrow, and wondering why it was never done.  I'd like to share my perspective as someone who has been around Drill for some time, but also as someone who never worked for MapR or Dremio.  This just represents my understanding of events as an outsider, and I could be wrong about some or all of this.   Please forgive (or correct) any inaccuracies. 
   
   When I first learned of Arrow and the idea of integrating Arrow with Drill, the thing that interested me the most was the ability to move data between platforms without having to serialize/deserialize the data.  From my understanding, MapR did some research and didn't find a significant performance advantage and hence didn't really pursue the integration.  The other side of it was that it would require a significant amount of work to refactor major parts of Drill. 
   
   I don't know the internal politics, but this was one of the major points of diversion between Dremio and Drill.
   
   With that said, there was a renewed discussion on the list [2] where Paul Rogers proposed what he described as a "Crude but Effective" approach to an Arrow integration.  
   
   This is in the email link but here was a part of Paul's email:
   
   > Charles, just brainstorming a bit, I think the easiest way to start is to create a simple, stand-alone server that speaks Arrow to the client, and uses the native Drill client to speak to Drill. The native Drill client exposes Drill value vectors. One trick would be to convert Drill vectors to the Arrow format. I think that data vectors are the same format. Possibly offset vectors. I think Arrow went its own way with null-value (Drill's is-set) vectors. So, some conversion might be a no-op, others might need to rewrite a vector. Good thing, this is purely at the vector level, so would be easy to write. The next issue is the one that Parth has long pointed out: Drill and Arrow each have their own memory allocators. How could we share a data vector between the two? The simplest initial solution is just to copy the data from Drill to Arrow. Slow, but transparent to the client.
   >
   > A crude first-approximation of the development steps: 
   > 1. Create the client shell server. 
   > 2. Implement the Arrow client protocol. Need some way to accept a query and return batches of results. 
   > 3. Forward the query to Drill using the native Drill client. 
   > 4. As a first pass, copy vectors from Drill to Arrow and return them to the client. 
   > 5. Then, solve that memory allocator problem to pass data without copying.
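   
   As an illustration of step 4, a per-value copy from a Drill vector into an Arrow vector could look roughly like the sketch below. Both APIs are written from memory here (Drill's generated IntVector accessor and Arrow's Java IntVector), so treat the exact method names as assumptions rather than a tested implementation:
   
   ```java
   import org.apache.arrow.memory.RootAllocator;
   
   class DrillToArrowSketch {
     static org.apache.arrow.vector.IntVector copy(
         org.apache.drill.exec.vector.IntVector drillVec, RootAllocator allocator) {
       int n = drillVec.getAccessor().getValueCount();
       org.apache.arrow.vector.IntVector arrowVec =
           new org.apache.arrow.vector.IntVector("col", allocator);
       arrowVec.allocateNew(n);
       for (int i = 0; i < n; i++) {
         // Per-value copy: slow, but transparent to the client (step 4).
         arrowVec.set(i, drillVec.getAccessor().get(i));
       }
       arrowVec.setValueCount(n);
       return arrowVec;
     }
   }
   ```
   
   Step 5 would then replace this copy loop with shared buffers once the allocator question is solved.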
   
   One point that Paul made was that these pieces are fairly discrete and could be implemented without refactoring major components of Drill.  Of course, this could be something for Drill 2.0.  At a minimum, could we take the conversation off of the PR and put it in the email list? 
   
   Let's discuss... All ideas are welcome!
   
   Best,
   -- C
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton edited a comment on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton edited a comment on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1005676173


   A few informational notes from me.  First, I wondered how it could be that benchmark-busting Impala manages to load SIMD registers from a row-oriented memory layout, then found the following [in here](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf).
   
   > We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions 
   
   I wonder whether they switched?  I suppose I need only go over to apache/impala to find out.  EDIT: [I don't think they did](https://git-wip-us.apache.org/repos/asf?p=impala.git;a=blob;f=be/src/runtime/row-batch.cc;h=aed2e116781718e715e5ce01db996412eef0aa9f;hb=HEAD).  I can't speak for anyone else but some of my important Drill workloads actually have been bulk floating point arithmetic (statistics, probabilities) where you do get a major boost from SIMD.  That said, I probably did even more work doing filtering, sorting and aggregating the same data in Drill before it was ready for arithmetic.
   
   Second, if you did want to do SIMD properly in Java (i.e. not praying for auto-vectorisation), then are you looking at JEP 338, and do [you have some vector types from the JDK dictated to you](https://openjdk.java.net/jeps/338)?  Would these vectors work with our direct memory and Netty-based code?
   
   Third, would our direct memory code remain unchanged if we replace ValueVectors, or would we want to start accessing direct memory through [JEP 393](https://openjdk.java.net/jeps/393)?  Or... are [the high tech garbage collectors in recent JVMs](https://blogs.oracle.com/javamagazine/post/understanding-the-jdks-new-superfast-garbage-collectors) so good that we could laugh off the burden of direct memory and its management entirely?
   
   Re. SIMD, I know there are positions like "Forget SIMD" that have been stated, but I wanted to note this stuff down for us anyway.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton edited a comment on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton edited a comment on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1008596721


   @paul-rogers my personal bias, FWIW, is that it would take a major overall speedup on _real world_ Drill workloads to budge me from the cleanest and simplest memory format (which sounds like rows).  Something like a genuine 5x before it started getting hard for me to wave it away, and again I don't mean in a synthetic benchmark.  I'm just circulating material for the sake of enhancing the discussion.
   
   > You're also assuming that there are SIMD hash functions: I'm not sure those exist: a search came up with somewhat random results. (SHA exists, however.)
   
   Venturing a little off topic now but as a point of general interest, there is reason to believe at least a few such algorithms must exist.  The "moon boys" that kill the planet mining crypto ponzi tokens tend to use GPUs for that work.  Security researchers made news a few years back by producing collisions in MD5 using a Playstation 3 cluster.  From the hardware employed I conclude that the hash functions being brute forced in these stories must admit data-parallel algorithms and would also get some speed up from SIMD.
   
   EDIT: after a moment's thought, data parallelism in brute forcing must arise naturally from the need to compute a single hash function over many independent inputs.  I think it would arise in the same way in a hash join.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1008596721


   @paul-rogers my personal bias, FWIW, is that it would take a major overall speedup on _real world_ Drill workloads to budge me from the cleanest and simplest memory format (which sounds like rows).  Something like a genuine 5x before it started getting hard for me to wave it away, and again I don't mean in a synthetic benchmark.  I'm just circulating material for the sake of enhancing the discussion.
   
   > You're also assuming that there are SIMD hash functions: I'm not sure those exist: a search came up with somewhat random results. (SHA exists, however.)
   
   Venturing a little off topic now but as a point of general interest, there is reason to believe at least a few such algorithms must exist.  The "moon boys" that kill the planet mining crypto ponzi tokens tend to use GPUs for that work.  Security researchers made news a few years back by producing collisions in MD5 using a Playstation 3 cluster.  From the hardware employed I conclude that the hash functions being brute forced in these stories must admit data-parallel algorithms and would also get some speed up from SIMD.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Posted by GitBox <gi...@apache.org>.
jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004751937


   Cong Luo wrote:
   
   @paul-rogers Thanks for your knowledge base. For vectorization and Arrow, I may need to keep pouring cold water on everyone. I think we need to figure out what we actually need here, not just see that something is powerful and then say it's really powerful.
   
   **SQL, NoSQL and NewSQL**
   In my work, I went through these three database architectures. Initially, we had performance bottlenecks, so we chose NoSQL databases, e.g. Mongo, ES and Redis. After entering the production environment, most developers were not good at using the APIs (clients) designed for these databases. The key is that developers do not have the patience to learn the advanced syntax of each database (NoSQL). So we came back to SQL-style databases: NewSQL. But we found that NewSQL also has many bottlenecks (even where you think no optimization is needed); especially once you come to deploy a private cluster, you get complex operation and maintenance conditions and user complaints.
   
   Here, the lesson is that choosing the right one is better than choosing it blindly.
   
   **OLTP, OLAP and HTAP**
   The row format is good for OLTP, because it needs to insert / update / delete and batch fetch entire rows as quickly as possible. The columnar format is good for OLAP, because it needs to analyze large amounts of data. To keep latency low, it needs to limit the data size read from disk and reduce network IO bottlenecks. What does HTAP look like? Let's look at the internal implementation of TiDB:
   
   <img src="https://download.pingcap.com/images/docs/tiflash/tiflash-architecture.png" width="50%" height="50%">
   
   ```
   TiFlash provides the columnar storage, it conducts real-time replication of data in the TiKV nodes 
   at a low cost that does not block writes in TiKV. Meanwhile, it provides the same read consistency 
   as in TiKV and ensures that the latest data is read.
   ```
   
   As is well known, database vendors prefer the row format for OLTP and the columnar format for OLAP. However, they are well aware that there is no perfect data format. So I agree with Paul that the row format is better for Drill.
   
   **CPU Cache miss**
   - N-ary Storage Model
   <img src="https://user-images.githubusercontent.com/50079619/148023833-dcd92d2c-38ae-41fa-a80f-97496f50e647.png" width="50%" height="50%">
   
   - Cache miss
   <img src="https://user-images.githubusercontent.com/50079619/148022521-f623b105-9057-4286-8ee4-3a2231b8bc68.png" width="50%" height="50%">
   
   The figure shows how the CPU caches are filled, and we can see that a lot of irrelevant data is brought into the cache, crowding out data that would otherwise have been reusable.
   
   - Decomposition Storage Model
   <img src="https://user-images.githubusercontent.com/50079619/148023968-6f0db55b-63d1-448a-9165-2001a59aea01.png" width="50%" height="50%">
   
   Reducing cache misses is one of the advantages of the columnar format, but note that Paul has explained that there will also be cache misses (from calculations driven by the SQL syntax) in the columnar format. So we cannot expect maximum performance from Arrow.
   
   **Write amplification**
   Drill should avoid becoming a database system, but it does have several key points on the read-write path.
   - Drill supports spilling vector values to disk.
   - Drill is an engine, not a database, so it cannot unify the initial source format (which is determined by the database vendor).
   - Complex read-write path:
     - (1) Read from a variety of data sources.
     - (2) Write to value vectors or spill to disk.
     - (3) Combine into row sets.
     - (4) Client side.
   
   As noted above, it is impossible to avoid reading and writing multiple times in a query, and the write amplification will increase latency. So there's a lot to optimize, not only the columnar format.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org