Posted to issues@drill.apache.org by "Pritesh Maker (JIRA)" <ji...@apache.org> on 2018/05/22 18:23:00 UTC

[jira] [Updated] (DRILL-6211) Optimizations for SelectionVectorRemover

     [ https://issues.apache.org/jira/browse/DRILL-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pritesh Maker updated DRILL-6211:
---------------------------------
    Fix Version/s:     (was: 1.14.0)

> Optimizations for SelectionVectorRemover 
> -----------------------------------------
>
>                 Key: DRILL-6211
>                 URL: https://issues.apache.org/jira/browse/DRILL-6211
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen
>            Reporter: Kunal Khatua
>            Assignee: Karthikeyan Manivannan
>            Priority: Major
>         Attachments: 255d264c-f55e-b343-0bef-49d3e672d93f.sys.drill, 255d2664-2418-19e0-00ea-2076a06572a2.sys.drill, 255d2682-8481-bed0-fc22-197a75371c04.sys.drill, 255d26ae-2c0b-6cd6-ae71-4ad04c992daf.sys.drill, 255d2880-48a2-d86b-5410-29ce0cd249ed.sys.drill
>
>
> Currently, when a SelectionVectorRemover receives a record batch from an upstream operator (such as a Filter), it immediately starts copying records into a new outgoing batch.
>  It would be worthwhile to enrich the RecordBatch with some additional summary statistics about the attached SelectionVector, such as:
>  # number of records that need to be removed/copied
>  # total number of records in the record-batch
> The benefit would be that, in the extreme cases where *all* the records in a batch need to be either dropped or copied, the SelectionVectorRemover can simply discard the record batch, or simply forward it unchanged to the next downstream operator.
> While the extreme case of dropping the entire batch already performs reasonably (there is no copy overhead), for cases where the record batch should simply pass through, the copy overhead remains; it accounts for more than 35% of the query time once the streaming-aggregate cost within the tests is discounted.
> Here are the statistics of having such an optimization
> ||Selectivity||Query Time (s)||%Time used by SVR||SVR Time (s)||Profile||
> |0%|6.996|0.13%|0.0090948|[^255d264c-f55e-b343-0bef-49d3e672d93f.sys.drill]|
> |10%|7.836|7.97%|0.6245292|[^255d2682-8481-bed0-fc22-197a75371c04.sys.drill]|
> |50%|11.225|25.59%|2.8724775|[^255d2664-2418-19e0-00ea-2076a06572a2.sys.drill]|
> |90%|14.966|33.91%|5.0749706|[^255d26ae-2c0b-6cd6-ae71-4ad04c992daf.sys.drill]|
> |100%|19.003|35.73%|6.7897719|[^255d2880-48a2-d86b-5410-29ce0cd249ed.sys.drill]|
> To summarize, the SVR should avoid creating new batches as much as possible.
> A more generic (non-trivial) optimization should also take into account that multiple emitted batches could be coalesced, but we do not currently have test metrics for that.
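The early-exit decision described above can be sketched as follows. This is a minimal, self-contained illustration, not Drill's actual RemovingRecordBatch code; the class name, enum, and `decide` method are hypothetical, standing in for a check the SVR could perform once the batch carries the proposed summary statistics (selected-record count and total-record count):

```java
// Hypothetical sketch of the proposed SVR early-exit check.
// Assumes the incoming batch exposes two summary statistics:
// the number of records the SelectionVector keeps, and the
// total number of records in the batch.
public class SvrEarlyExit {

  public enum Action {
    DROP,          // 0% selectivity: no records survive, discard the batch
    PASS_THROUGH,  // 100% selectivity: forward the batch without copying
    COPY           // otherwise: fall back to the normal copy path
  }

  public static Action decide(int selectedCount, int totalCount) {
    if (selectedCount == 0) {
      return Action.DROP;
    }
    if (selectedCount == totalCount) {
      return Action.PASS_THROUGH;
    }
    return Action.COPY;
  }

  public static void main(String[] args) {
    // The two extreme cases avoid the copy entirely; only the
    // middle case pays the per-record copy cost.
    System.out.println(decide(0, 4096));     // DROP
    System.out.println(decide(4096, 4096));  // PASS_THROUGH
    System.out.println(decide(1000, 4096));  // COPY
  }
}
```

In this sketch, only the `COPY` branch would allocate a new outgoing batch, which matches the summary above: the SVR should avoid creating new batches whenever the selectivity is exactly 0% or 100%.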



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)