Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/03/18 22:00:43 UTC

[jira] [Updated] (DRILL-5366) Use generic copier for wide rows in external sort

     [ https://issues.apache.org/jira/browse/DRILL-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-5366:
-------------------------------
    Component/s:     (was: Tools, Build & Test)

> Use generic copier for wide rows in external sort
> -------------------------------------------------
>
>                 Key: DRILL-5366
>                 URL: https://issues.apache.org/jira/browse/DRILL-5366
>             Project: Apache Drill
>          Issue Type: Sub-task
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.11.0
>
>
> The external sort makes use of a "priority copier" to copy rows in two cases:
> * When merging data during spilling
> * When merging data for an in-memory sort
> As with all such Drill operators, the copier works by generating two local variables per column, then two blocks of code per column (one for setup, one for the actual copy).
> This works fine for rows with few columns. But in rows with many columns (such as in queries against JSON documents), the amount of generated code becomes very large, which adds overhead to generate, compile, and store it.
> DRILL-5125 found the same issue in the Selection Vector remover. Applying the fix from that ticket to the external sort saves about 24% of the run time in the test described below.
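
The ticket does not show what a generic copier looks like, so the following is only a minimal sketch of the general idea, assuming a copier that loops over columns at run time instead of emitting code per column. The `Column` interface, `IntColumn`, and `copyRow` are hypothetical names for illustration and are not Drill's vector or copier API.

```java
final class GenericCopierSketch {
    /** Hypothetical stand-in for a column of values; not Drill's vector API. */
    interface Column {
        void copyFrom(Column source, int srcRow, int destRow);
    }

    /** A fixed-width integer column backed by an array. */
    static final class IntColumn implements Column {
        final int[] values;

        IntColumn(int capacity) {
            values = new int[capacity];
        }

        @Override
        public void copyFrom(Column source, int srcRow, int destRow) {
            values[destRow] = ((IntColumn) source).values[srcRow];
        }
    }

    /**
     * Copies one row with a single loop over the columns. The same code
     * serves 10 columns or 1000, so code size does not grow with row width.
     */
    static void copyRow(Column[] src, Column[] dest, int srcRow, int destRow) {
        for (int col = 0; col < src.length; col++) {
            dest[col].copyFrom(src[col], srcRow, destRow);
        }
    }

    public static void main(String[] args) {
        int columnCount = 1000;
        Column[] src = new Column[columnCount];
        Column[] dest = new Column[columnCount];
        for (int col = 0; col < columnCount; col++) {
            IntColumn in = new IntColumn(2);
            in.values[1] = col;          // row 1 holds the column index
            src[col] = in;
            dest[col] = new IntColumn(1);
        }
        copyRow(src, dest, 1, 0);
        System.out.println(((IntColumn) dest[999]).values[0]); // prints 999
    }
}
```

A generated copier would instead contain one pair of variables and one copy statement per column, which is where the code bulk for wide rows comes from.
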
> Consider a unit test that runs only the copier. Create a single record batch with 1000 columns and 64K rows. Use the copier to produce a set of smaller output batches. Such a test factors out all the overhead of running a query.
> * Run time with the generated copier: 17 seconds
> * Run time with the generic copier: 13 seconds
> * Savings: 4 seconds, or about 24%
> (13 seconds is still a very long time to process 64K rows. There may be optimizations to be had in the priority queue implementation as well, but that is a separate issue.)
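
The shape of the test described above could look roughly like the sketch below. It is plain Java rather than Drill's test framework, and it times a simple row-by-row array copy, not the priority copier itself. The `int[][]` batch layout, `copyInBatches`, and the sizes are assumptions for illustration; the defaults are scaled down from 1000 columns by 64K rows to keep memory use modest.

```java
final class CopierTimingSketch {
    /**
     * Copies a column-major input batch, row by row, into output batches of
     * at most batchRows rows each. Returns the number of output batches.
     */
    static int copyInBatches(int[][] input, int rowCount, int batchRows) {
        int batches = 0;
        for (int start = 0; start < rowCount; start += batchRows) {
            int rows = Math.min(batchRows, rowCount - start);
            int[][] output = new int[input.length][rows];
            for (int row = 0; row < rows; row++) {
                for (int col = 0; col < input.length; col++) {
                    output[col][row] = input[col][start + row];
                }
            }
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        int columns = 100;       // the ticket's test uses 1000
        int rows = 8 * 1024;     // the ticket's test uses 64K
        int[][] input = new int[columns][rows];

        long startNanos = System.nanoTime();
        int batches = copyInBatches(input, rows, 1024);
        long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;

        System.out.println(batches + " batches in " + elapsedMs + " ms");
    }
}
```

Timing only the copy call, with the input built beforehand, is what keeps query overhead out of the measurement.
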
> To be conservative, provide a config option to enable the feature, perhaps by setting a threshold on the number of columns that must be present to use the generic version. That way, if folks feel that the generated version is faster for narrow rows, the generated version can still be used, and each user can decide the point at which the costs of bulky code outweigh the performance costs.
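
The threshold idea could be sketched as below. The ticket names no option and no API for this, so `CopierKind`, `choose`, and the threshold value are all hypothetical; the only behavior taken from the description is that rows with at least the threshold number of columns use the generic version.

```java
final class CopierSelectionSketch {
    /** Hypothetical labels for the two copier implementations. */
    enum CopierKind { GENERATED, GENERIC }

    /**
     * Picks the generic copier once the row is at least `threshold` columns
     * wide; narrower rows keep the generated copier.
     */
    static CopierKind choose(int columnCount, int threshold) {
        return columnCount >= threshold ? CopierKind.GENERIC : CopierKind.GENERATED;
    }

    public static void main(String[] args) {
        int threshold = 100; // assumed value; would come from a config option
        System.out.println(choose(5, threshold));    // prints GENERATED
        System.out.println(choose(1000, threshold)); // prints GENERIC
    }
}
```

Setting the threshold very high would effectively disable the generic copier, and setting it to zero would always use it.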



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)