Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2022/01/03 01:59:26 UTC

[GitHub] [drill] paul-rogers commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1003831613

Hi @luocooong, looks like you're looking at the expression and operator code. I wonder, is there anything you're trying to improve? Execution performance, maybe?

As you know, Drill is very complicated. Drill uses code generation for expression evaluation. The code generation goes through a path that made sense for Java 5 (when Drill was written), but is now a bit awkward. We do have a way to use the native Java tools, which was faster when tried several years ago; that path is probably even faster now.

Operator setup (another of your PRs) is impacted by code gen cost. Drill generates code for each fragment. If your query has 20 fragments, we generate code 20 times. The reason we must do that is that, in theory, every fragment can see a different schema, so the generated code could differ. By comparison, Spark generates code once, then pushes that code to all its executors.
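To make the "generate once" idea concrete, here is a minimal sketch that keys generated code by schema, so that fragments which see the same schema share one generation step. It is only an illustration under stated assumptions, not Drill's actual code path: `SchemaKeyedCodeCache`, `evaluatorFor`, and the list-of-strings schema are hypothetical names, and a lambda stands in for the generated class.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: none of these names exist in Drill.
public class SchemaKeyedCodeCache {
    // Counts how often the (expensive) generation step actually runs.
    private final AtomicInteger generationCount = new AtomicInteger();

    // Schema (here just a list of "name:TYPE" strings) -> generated evaluator.
    private final Map<List<String>, Function<long[], Long>> cache =
            new ConcurrentHashMap<>();

    // Stand-in for code generation: builds an evaluator that sums every
    // column of a row. A real system would emit and compile a class here.
    private Function<long[], Long> generate(List<String> schema) {
        generationCount.incrementAndGet();
        final int width = schema.size();
        return row -> {
            long sum = 0;
            for (int i = 0; i < width; i++) {
                sum += row[i];
            }
            return sum;
        };
    }

    // Fragments call this; generation happens only on the first request
    // for a given schema.
    public Function<long[], Long> evaluatorFor(List<String> schema) {
        return cache.computeIfAbsent(schema, this::generate);
    }

    public int generationCount() {
        return generationCount.get();
    }

    public static void main(String[] args) {
        SchemaKeyedCodeCache codeCache = new SchemaKeyedCodeCache();
        List<String> schema = List.of("a:BIGINT", "b:BIGINT");
        // Simulate 20 fragments that all see the same schema.
        for (int fragment = 0; fragment < 20; fragment++) {
            codeCache.evaluatorFor(schema);
        }
        System.out.println("generations: " + codeCache.generationCount());
    }
}
```

A fragment that does see a different schema simply misses the cache and triggers its own generation, so the "every fragment can differ" case still works; the saving only appears when schemas repeat.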

The generated code itself can be rather awkward for large queries: the code tries to inline everything, which is great for small functions but causes optimization problems as code blocks get larger.

The mechanism to generate code, especially in the PROJECT operator, is far too complex and could use a good re-think. It is so complex that it is hard to optimize because of the many assumptions and other issues embedded in the code.

The generated code is meant to be small. But, over time, some operators added lots of "standard" code to the code generation path. That standard code is extra work for the compiler and the "byte code optimizer", yet it adds no per-query value. We've taken several passes at refactoring to pull that code out of the code gen path, but there is more to do.

Drill was designed to allow vector operations (hence Value Vectors), but the code was never written, in part because there are no CPU vector instructions that work with SQL nullable data. Arrow is supposed to have figured out solutions (Gandiva, is it?), which perhaps we could consider (but probably only for non-nullable data).
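To illustrate why nullable data gets in the way, here is a rough sketch of adding two nullable INT columns. The layout is an assumption made for the example (a values array plus a parallel "is set" byte array, 1 = value present, 0 = NULL), and `NullableAdd` and its methods are hypothetical names, not Drill or Arrow code. The first loop checks for NULL on every row; the second computes the sum unconditionally and combines the "is set" flags separately, which leaves two simple loops over primitive arrays that a JIT may find easier to vectorize.

```java
// Hypothetical sketch: the layout and names are assumptions for illustration.
public class NullableAdd {

    // Row-at-a-time style: a NULL check (branch) on every row.
    static void addBranching(int[] a, byte[] aSet, int[] b, byte[] bSet,
                             int[] out, byte[] outSet) {
        for (int i = 0; i < out.length; i++) {
            if (aSet[i] == 1 && bSet[i] == 1) {
                out[i] = a[i] + b[i];
                outSet[i] = 1;
            } else {
                outSet[i] = 0; // result is NULL; out[i] is left untouched
            }
        }
    }

    // Branch-free style: always compute the value, and derive the NULL
    // flag with a bitwise AND. Values in NULL slots are garbage and must
    // be ignored by readers, which is why outSet is authoritative.
    static void addBranchFree(int[] a, byte[] aSet, int[] b, byte[] bSet,
                              int[] out, byte[] outSet) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
        for (int i = 0; i < out.length; i++) {
            outSet[i] = (byte) (aSet[i] & bSet[i]);
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        byte[] aSet = {1, 0, 1};
        int[] b = {10, 20, 30};
        byte[] bSet = {1, 1, 1};
        int[] out = new int[3];
        byte[] outSet = new byte[3];
        addBranchFree(a, aSet, b, bSet, out, outSet);
        for (int i = 0; i < out.length; i++) {
            System.out.println(outSet[i] == 1 ? String.valueOf(out[i]) : "NULL");
        }
    }
}
```

Both variants agree on every non-NULL row; they differ only in what sits in the value slot of a NULL row, which readers should never look at.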

Anyway, there are many areas we can improve. I can give you more details if I know what you're trying to accomplish.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org