Posted to issues@drill.apache.org by "Arina Ielchiieva (JIRA)" <ji...@apache.org> on 2018/05/24 10:10:00 UTC

[jira] [Updated] (DRILL-6442) Adjust Hbase disk cost & row count estimation when filter push down is applied

     [ https://issues.apache.org/jira/browse/DRILL-6442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arina Ielchiieva updated DRILL-6442:
------------------------------------
    Description: 
Disk cost for an HBase scan is calculated based on the scan size in bytes.

{noformat}
float diskCost = scanSizeInBytes * ((columns == null || columns.isEmpty()) ? 1 : columns.size() / statsCalculator.getColsPerRow());
{noformat}
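
For illustration, here is a hypothetical standalone version of what that expression computes; the concrete numbers and the explicit float cast are assumptions made for this sketch, not values taken from Drill.

{noformat}
// Toy numbers, for illustration only.
long scanSizeInBytes = 1_073_741_824L; // ~1 GB, as estimated by TableStatsCalculator sampling
int projectedColumns = 3;              // columns.size()
int colsPerRow = 10;                   // statsCalculator.getColsPerRow()

// Same shape as the expression above: scale the estimated scan size by the
// fraction of columns the query actually reads (the float cast is added here
// only to avoid integer truncation in this sketch).
float diskCost = scanSizeInBytes * ((float) projectedColumns / colsPerRow);
// diskCost comes out to roughly 322 MB of disk work attributed to this scan
{noformat}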

Scan size in bytes is estimated using {{TableStatsCalculator}} with the help of sampling.
When we estimate the size for the first time (before applying filter push down), sampling uses random rows. When estimating after filter push down, sampling uses only rows that qualify the filter condition. As a result, the average row size can be higher after filter push down than before. Unfortunately, since disk cost depends on these calculations, a plan with filter push down can end up with a higher cost than a plan without it.
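
To make the inversion concrete, here is a toy calculation with made-up sample results (the average row sizes and row counts below are illustrative assumptions, not numbers from the issue):

{noformat}
// Before push down: sampling over random rows.
long avgRowSizeBefore = 100;                              // bytes per row, from a random sample
long rowCountBefore = 1_000_000;                          // default row count
long scanSizeBefore = avgRowSizeBefore * rowCountBefore;  // 100 MB

// After push down: sampling only rows that match the filter, which in this
// example happen to be wider on average.
long avgRowSizeAfter = 400;                               // bytes per row, from qualifying rows
long rowCountAfter = 500_000;                             // fewer rows survive the filter
long scanSizeAfter = avgRowSizeAfter * rowCountAfter;     // 200 MB

// diskCost scales with scan size, so the filtered plan is costed higher
// (200 MB vs. 100 MB) even though it reads fewer rows.
{noformat}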

Possible enhancements:
1. Currently the default row count is 1 million, but if sampling returns fewer rows than expected, the query will return no more rows than that number. We can use this number instead of the default row count to get better cost estimations.
2. When filter push down was applied, the row count was reduced by half in order to ensure the plan with filter push down has a lower cost. The same should be done for disk cost as well (see the sketch after this list).
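
A minimal sketch of the two adjustments, written as a hypothetical helper rather than actual Drill code (the names {{HbaseScanCostSketch}}, {{DEFAULT_ROW_COUNT}}, {{estimateRowCount}}, {{adjustDiskCost}} and their parameters are assumptions of this sketch):

{noformat}
// Hypothetical helper, not Drill code: sketches the two proposed adjustments.
final class HbaseScanCostSketch {
  private static final long DEFAULT_ROW_COUNT = 1_000_000L;

  static long estimateRowCount(long sampledRowCount, long sampleLimit, boolean filterPushedDown) {
    // 1. If sampling returned fewer rows than it asked for, the table has no
    //    more rows than that, so prefer the sampled count over the default.
    long rowCount = sampledRowCount < sampleLimit ? sampledRowCount : DEFAULT_ROW_COUNT;
    // 2. Halve the estimate when a filter was pushed down so the filtered plan
    //    is never costed higher than the unfiltered one.
    return filterPushedDown ? Math.max(1, rowCount / 2) : rowCount;
  }

  static float adjustDiskCost(float diskCost, boolean filterPushedDown) {
    // Mirror the row-count reduction on the disk cost as well.
    return filterPushedDown ? diskCost / 2 : diskCost;
  }
}
{noformat}

Halving both the row count and the disk cost keeps the filtered plan cheaper than the unfiltered one even when the sampled average row size moves against it.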


  was:
Disk cost for an HBase scan is calculated based on the scan size in bytes.

{noformat}
float diskCost = scanSizeInBytes * ((columns == null || columns.isEmpty()) ? 1 : columns.size() / statsCalculator.getColsPerRow());
{noformat}

Scan size in bytes is estimated using {{TableStatsCalculator}} with the help of sampling.
When we estimate the size for the first time (before applying filter push down), sampling uses random rows. When estimating after filter push down, sampling uses only rows that qualify the filter condition. As a result, the average row size can be higher after filter push down than before. Unfortunately, since disk cost depends on these calculations, a plan with filter push down can end up with a higher cost than a plan without it.




> Adjust Hbase disk cost & row count estimation when filter push down is applied
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-6442
>                 URL: https://issues.apache.org/jira/browse/DRILL-6442
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.13.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>            Priority: Major
>             Fix For: 1.14.0
>
>
> Disk cost for an HBase scan is calculated based on the scan size in bytes.
> {noformat}
> float diskCost = scanSizeInBytes * ((columns == null || columns.isEmpty()) ? 1 : columns.size() / statsCalculator.getColsPerRow());
> {noformat}
> Scan size in bytes is estimated using {{TableStatsCalculator}} with the help of sampling.
> When we estimate the size for the first time (before applying filter push down), sampling uses random rows. When estimating after filter push down, sampling uses only rows that qualify the filter condition. As a result, the average row size can be higher after filter push down than before. Unfortunately, since disk cost depends on these calculations, a plan with filter push down can end up with a higher cost than a plan without it.
> Possible enhancements:
> 1. Currently the default row count is 1 million, but if sampling returns fewer rows than expected, the query will return no more rows than that number. We can use this number instead of the default row count to get better cost estimations.
> 2. When filter push down was applied, the row count was reduced by half in order to ensure the plan with filter push down has a lower cost. The same should be done for disk cost as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)