Posted to dev@drill.apache.org by Paul Rogers <pr...@mapr.com> on 2017/11/13 23:24:41 UTC

"Batch size control" and unit testing

Hi All,

Here is the next installment in the “batch size control” project update.

Drill has a great many operators. As we move forward, we must update them to use the new batch size control framework. Unit testing becomes a major concern. This note explains how we address that issue in this project.

The “classic” way to test Drill is to build the product, fire up the Drill server, and use Sqlline to fire off queries. The problem, of course, is that the edit-compile-debug cycle is glacially slow (five minutes), and testing is manual (copy/paste the query into Sqlline, visually inspect the results).

Another alternative is to run the very same query, but as a JUnit test. Drill has many such tests. The “BaseTestQuery” framework and “TestBuilder” help. The newish “Cluster Framework” makes it very easy to start an embedded Drillbit with the desired options and settings, run a query, and examine the results. The edit-compile-debug cycle is much faster, on the order of 10-20 seconds.

This is good, but we still run the entire Drill operator stack and throw queries at it. We use a file for input and capture query results as output. But we want much finer-grained testing. That is, we want true unit testing: isolate a component, feed it some input, and verify its output.

A fact of Drill is that operators are tightly coupled with the fragment context, which is coupled with the Drillbit context, which needs the entire server. What to do? One solution is to use mocks, and, indeed, Drill has three solutions based on JMockit, Mockito, and Jinfeng’s handy new “Mini-Plan” framework.

Mocks are handy, but it is cleaner and simpler to have code that can be tested in isolation without mocks. The next step is the “sub-operator” test framework: the “RowSet” utilities and the “context” refactoring break the tight coupling with the rest of Drill, allowing us to separate out an operator (after some simple changes to the code) and test it in isolation. We can now easily pump in a wide variety of inputs (such as Drill’s 30+ data types in their three cardinalities) without having to set up a lot of overhead for each.
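The sub-operator pattern can be summed up in a tiny, self-contained sketch. The class and method names below are made up for illustration (this is not Drill’s actual RowSet API): build the input rows directly in the test, run one small component, and compare its output against expected rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in for sub-operator testing: build an input
// "row set" in the test itself, run one component in isolation,
// and verify its output. No server, no files, no mocks.
public class SubOperatorSketch {

    // The component under test: drop null values from a single-column batch.
    static List<String> filterNulls(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String v : input) {
            if (v != null) {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Build the input rows directly in the test.
        List<String> input = Arrays.asList("a", null, "b");

        // Exercise the component and compare against the expected rows.
        List<String> expected = Arrays.asList("a", "b");
        List<String> actual = filterNulls(input);
        if (!actual.equals(expected)) {
            throw new AssertionError("Expected " + expected + ", got " + actual);
        }
        System.out.println(actual);  // [a, b]
    }
}
```

Because everything needed for the test lives in the test itself, trying all the data-type and cardinality combinations is just a matter of writing more input lists.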

Still, many operators are internally complex, and poking at them from the outside is limiting. We want to test not just, say, the sort operator as a whole, but also exercise the bit of code that does the in-memory sort, or the one that writes batches to disk. To do this, we must “disaggregate” each operator into a series of separately testable components, each with a clear API.
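As a hedged sketch of what “disaggregate” means in practice (the names below are illustrative, not Drill’s actual sort classes): the in-memory sort step is pulled out behind its own small interface, so it can be exercised directly, with no spill logic, memory manager, or iterator protocol in the way.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of disaggregating an operator: one separately testable
// piece of the sort, behind a clear API. Illustrative names only.
public class SortPieces {

    // The in-memory sort step, isolated behind its own interface.
    interface InMemorySorter {
        void sortBatch(List<Integer> batch);
    }

    // A trivial implementation; the real one could be swapped in and
    // tested against the same interface.
    static class SimpleSorter implements InMemorySorter {
        @Override
        public void sortBatch(List<Integer> batch) {
            Collections.sort(batch);
        }
    }

    public static void main(String[] args) {
        List<Integer> batch = new ArrayList<>(List.of(3, 1, 2));
        new SimpleSorter().sortBatch(batch);
        System.out.println(batch);  // [1, 2, 3]
    }
}
```

A spill-to-disk component would get the same treatment: its own interface, its own unit tests, independent of the sort operator that eventually hosts it.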

This kind of refactoring can be done only for new operators, or when we need to make major changes to an existing operator. As part of the “batch size control” project, we have created a new version of the scan operator using this model.

Refactoring the scan operator revealed an opportunity to refactor the core operator code itself. Each operator has three responsibilities:

* Implement the Drill iterator protocol.
* Hold a record batch.
* Implement the details of the operator algorithm.

The next “batch size” PR will provide a new version of the base operator class that splits responsibility into classes for the first two items, and an interface for the third. This allows us to unit test the two classes once and for all. Per-operator, the focus is just the operator implementation.
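A minimal sketch of this split, with illustrative names (not the actual classes in the PR): the driver class implements the iterator protocol, a holder class owns the outgoing batch, and the per-operator algorithm sits behind an interface. The driver and holder are testable once; each operator then only needs tests for its own algorithm.

```java
import java.util.List;

// Sketch of splitting an operator's three responsibilities.
// All names are illustrative, not Drill's operator framework.
public class OperatorSplit {

    enum IterOutcome { OK, NONE }   // simplified Drill-style outcomes

    // Responsibility 3: the per-operator algorithm, behind an interface.
    interface BatchProcessor {
        // Produce the next batch of rows, or null when exhausted.
        List<String> nextBatch();
    }

    // Responsibility 2: hold the current record batch.
    static class BatchHolder {
        private List<String> batch;
        void set(List<String> b) { batch = b; }
        List<String> get() { return batch; }
    }

    // Responsibility 1: drive the iterator protocol. Written and unit
    // tested once, reused by every operator.
    static class OperatorDriver {
        private final BatchProcessor processor;
        private final BatchHolder holder = new BatchHolder();

        OperatorDriver(BatchProcessor processor) { this.processor = processor; }

        IterOutcome next() {
            List<String> batch = processor.nextBatch();
            if (batch == null) { return IterOutcome.NONE; }
            holder.set(batch);
            return IterOutcome.OK;
        }

        BatchHolder holder() { return holder; }
    }

    public static void main(String[] args) {
        // A one-batch "operator": trivially testable with no server or mocks.
        BatchProcessor oneBatch = new BatchProcessor() {
            private boolean done = false;
            public List<String> nextBatch() {
                if (done) { return null; }
                done = true;
                return List.of("row1", "row2");
            }
        };
        OperatorDriver driver = new OperatorDriver(oneBatch);
        System.out.println(driver.next());            // OK
        System.out.println(driver.holder().get());    // [row1, row2]
        System.out.println(driver.next());            // NONE
    }
}
```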

The core operator algorithm implementation is designed to be loosely coupled to the rest of Drill, allowing complete unit testing without mocks. The scan operator revision, which we’ll describe in the next note, makes use of this structure.

Thanks,

- Paul