You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2022/01/05 01:59:23 UTC

[GitHub] [drill] paul-rogers commented on issue #2421: ValueVectors replacement

paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1005316421

@jnturton, thanks for converting the discussion to a ticket. Thanks especially for dealing with the formatting translation!

@luocooong, thanks for the very impressive analysis, and the wonderful images!

As noted earlier, columnar is huge win for storage. It is less so for tools that work a row-at-a-time such as Drill or most data pipelines. It is easy to see why. A Parquet file may have 100 or more columns. Your query uses 10 of them, so that's all we read. A huge win. But, inside Drill, we'll mess with all 10 of your columns. You are probably doing filtering, calculations, aggregation, grouping, etc. There is a _reason_ you selected those columns: you want to work with them. And, Drill will do so a row at a time.

Hence, the use pattern _within Drill_ seems to favor a row-based layout as we get few of the advantages of a columnar format, but we get all of the cost and complexity.

Most SQL operators work a row at a time. So, even if the SQL is messy, and we have touch the same column multiple times, the row itself can fit into the CPU cache and those redundant accesses are trivially fast. It would be very hard to do a similar optimization for columns.

Your OLAP analysis is correct, but mostly applies to old-school Mondrian-style cubes on top of relational tables where the aggregation is clearly defined. Most of the queries I've seen for Drill (or Impala or other tools) are pretty ugly, with lots of business logic. If the user writes the SQL, they won't limit themselves to a small set of standard aggregations, they'll do whatever they need to do to get the job done. That, in fact, is why SQL endures decade after decade: as awful as it is, you can get your task done if you try hard enough.

As a result, although we could vectorize simple cube-rollup calculations, in practice, we don't get many of those. So, we need to optimize for what people actually do, not the ideal cases.

Let's also remember this: if all you need are simple roll-up calculations, and you run those queries again and again (so they are worth optimizing), you have a better choice: have your data pipeline write pre-aggregated "cubes". Want to see sales-by-product-by-day over the last year on billions of transactions? Just build the roll-up cube. Its what Druid does and it works well. With a Spark pipeline, you can create a detail table and a roll-up table, both in Parquet, and queried by Drill. So, if we look at the broader context, we see that, the one place we'd benefit from vectorized operations, might not even happen within Drill; it would be done in Spark in the data pipeline.

That said, we should also remember operators that might benefit from columnar: computing a hash key or sort. In this case, we could have a hybrid approach: a "row group" with just the required fields (the keys), and a separate "row group" with the "payload" of other fields. Some DBs use a similar approach (for storage.) With a bit of pipelining, the preceding operator could produce its output rows in the format optimal for the next operator.

It would be great if we could mock up some simple test cases and try out the ideas. Doing so is not simple at present, but would be possible once we did the intermediate raw-vector-to-EVF conversion. Our experience tells us what we might expect to find, but we need to do the experiments to tell us what's actually true.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org