Posted to issues-all@impala.apache.org by "Csaba Ringhofer (Jira)" <ji...@apache.org> on 2022/05/02 13:04:00 UTC

[jira] [Updated] (IMPALA-11278) Cardinality of small HBase regions is overestimated since HBASE-26340

     [ https://issues.apache.org/jira/browse/IMPALA-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Csaba Ringhofer updated IMPALA-11278:
-------------------------------------
    Description: 
Impala uses the size of an HBase region to estimate the number of rows, and the API we use (https://hbase.apache.org/2.4/apidocs/org/apache/hadoop/hbase/RegionLoad.html#getStorefileSizeMB() ) returns the size at MB precision. Since HBASE-26340 it returns 1 instead of 0 for very small but non-empty tables, which leads to massively overestimating their size. (We handle 0 as a special case, so we did not estimate the row count as 0: https://github.com/apache/impala/blob/78609dca32d8ce996247c9552ba676a853c74686/fe/src/main/java/org/apache/impala/catalog/FeHBaseTable.java#L585 )

In newer versions of HBase, getStorefileSizeMB() is deprecated and there are functions that return the size at byte granularity. Using them could solve the massive overestimation, but it may make our planner tests more sensitive to small size changes in HBase regions.

HBASE-26340 was backported with https://github.com/apache/impala/commit/ca48b940ec6281d492ad525418f234308a82eedf

  was:
Impala uses the size of an HBase region to estimate the number of rows, and the API we use (https://hbase.apache.org/2.4/apidocs/org/apache/hadoop/hbase/RegionLoad.html#getStorefileSizeMB() ) returns the size at MB precision. Since HBASE-26340 it returns 1 instead of 0 for very small but non-empty tables, which leads to massively overestimating their size. (We handle 0 as a special case, so we did not estimate the row count as 0: https://github.com/apache/impala/blob/78609dca32d8ce996247c9552ba676a853c74686/fe/src/main/java/org/apache/impala/catalog/FeHBaseTable.java#L585 )

In newer versions of HBase, getStorefileSizeMB() is deprecated and there are functions that return the size at byte granularity. Using them could solve the massive overestimation, but it may make our planner tests more sensitive to small size changes in HBase regions.


> Cardinality of small HBase regions is overestimated since HBASE-26340
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-11278
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11278
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog, Frontend
>    Affects Versions: Impala 4.1.0
>            Reporter: Csaba Ringhofer
>            Priority: Major
>
> Impala uses the size of an HBase region to estimate the number of rows, and the API we use (https://hbase.apache.org/2.4/apidocs/org/apache/hadoop/hbase/RegionLoad.html#getStorefileSizeMB() ) returns the size at MB precision. Since HBASE-26340 it returns 1 instead of 0 for very small but non-empty tables, which leads to massively overestimating their size. (We handle 0 as a special case, so we did not estimate the row count as 0: https://github.com/apache/impala/blob/78609dca32d8ce996247c9552ba676a853c74686/fe/src/main/java/org/apache/impala/catalog/FeHBaseTable.java#L585 )
> In newer versions of HBase, getStorefileSizeMB() is deprecated and there are functions that return the size at byte granularity. Using them could solve the massive overestimation, but it may make our planner tests more sensitive to small size changes in HBase regions.
> HBASE-26340 was backported with https://github.com/apache/impala/commit/ca48b940ec6281d492ad525418f234308a82eedf
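
To illustrate the scale of the overestimation described above, here is a minimal sketch of the arithmetic. It is not Impala's actual code: the class RegionRowEstimate, the helpers reportedSizeMb() and estimateRows(), the 10 KiB region and the 100-byte average row size are all made up for the example. The rounding rule only mimics the behaviour the issue attributes to HBASE-26340 (a non-empty size below 1 MB is reported as 1), and the behaviour for larger sizes is an assumption.

```java
public class RegionRowEstimate {
    static final long BYTES_PER_MB = 1024L * 1024L;

    // Hypothetical stand-in for an MB-precision size getter.
    // Assumption: an empty region reports 0, any non-empty region below
    // 1 MB reports 1, and larger sizes are truncated to whole MB.
    static long reportedSizeMb(long sizeBytes) {
        if (sizeBytes == 0) return 0;
        return Math.max(1L, sizeBytes / BYTES_PER_MB);
    }

    // Hypothetical estimator: row count = region size / average row size,
    // with a floor of 1 row for non-empty input.
    static long estimateRows(long sizeBytes, long avgRowBytes) {
        if (sizeBytes == 0) return 0;
        return Math.max(1L, sizeBytes / avgRowBytes);
    }

    public static void main(String[] args) {
        long actualBytes = 10L * 1024L; // a 10 KiB region (made-up value)
        long avgRowBytes = 100L;        // made-up average row size

        // Estimate based on the MB-precision value, converted back to bytes.
        long fromMb = estimateRows(reportedSizeMb(actualBytes) * BYTES_PER_MB, avgRowBytes);
        // Estimate based on the byte-granularity value.
        long fromBytes = estimateRows(actualBytes, avgRowBytes);

        System.out.println("rows estimated from MB value:   " + fromMb);
        System.out.println("rows estimated from byte value: " + fromBytes);
    }
}
```

With these made-up numbers the MB-based estimate is about 100 times the byte-based one. The byte-based estimate also changes with every small change in region size, which is the planner test sensitivity mentioned in the description.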



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
