You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Paul Rogers <pr...@maprtech.com> on 2017/01/02 01:11:56 UTC
Sort performance improvement

Hi All,

Happy New Year!

Recent experiments show that we can more than double in-memory sort performance by generating tighter code in the “inner loops” of the sort and merge implementations. The improved sort performance causes a 50% reduction in overall runtime (doubling of performance) for the sample query: 10 million rows of randomly-generated integers. YMMV.

The details are in [1]. In short, generated code makes expensive use of layer upon layer of function calls to fetch each value. The speed-up comes from eliminating 13 function calls per compare operation. In the sort we do something like O(n log n) compares: so savings add up quickly. For this test, the reduction is something like ~1.5 billion function calls.

Existing:

compare()
. SelectionVector2.getIndex( ) // x 2
. . DrillBuf.getChar( )
. . . DrillBuf.getShort()
. . . . PlatformDependent.getShort()
. doEval( ) // Generated, this is for required int
. . IntVector.getAccessor() // x 2
. . . Accessor.get()
. . . . DrillBuf.getInt()
. . . . . PlatformDependent.getInt()

Revised:

    public int compare(int leftIndex, int rightIndex) {
      int sv1 = PlatformDependent.getShort(addrSv + (leftIndex << 1)) & 0xFFFF;
      int sv2 = PlatformDependent.getShort(addrSv + (rightIndex << 1)) & 0xFFFF;
      int left = PlatformDependent.getInt(addr0 + (sv1<<2));
      int right = PlatformDependent.getInt(addr4 + (sv2<<2));

A similar change was made in the “MSorter”: the code that merges sorted batches in memory.

Interestingly, no additional performance gain was seen in replacing the PlatformDependent calls with direct calls to Unsafe. Presumably that path is simple enough to be optimized away by the JVM. 

This experiment hand-coded the replacement “SingleBatchSorter” and “MSorter". A next step would be to revise the code generator to generate the required code. And, of course, nullable, variable-length and array vectors will be more complex than the simple required int vector used in the experiment.

This technique can be used anywhere we have compute-intensive inner loops in generated code.

- Paul

[1] https://github.com/paul-rogers/drill/wiki/Optimization-of-External-Sort-Performance