Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2019/09/07 18:57:00 UTC

[jira] [Commented] (IMPALA-8923) Don't need synchronized in HBaseTable.getEstimatedRowStats

    [ https://issues.apache.org/jira/browse/IMPALA-8923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924963#comment-16924963 ] 

ASF subversion and git services commented on IMPALA-8923:
---------------------------------------------------------

Commit ece7c4d77ce387fba26e841c8bddd5dc08bcacc0 in impala's branch refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ece7c4d ]

IMPALA-8923: remove synchronized in HBaseTable.getEstimatedRowStats

HBaseTable.getEstimatedRowStats() estimates the number of rows and the
row size by sampling the HBase table in the target key range. It
requires HBase RPCs, so it can be slow.

Currently, HBaseTable.getEstimatedRowStats() is marked as synchronized.
The purpose was to protect the HTable (old HBase API) object in legacy
code (before commit cf9d2485dd4e6544f6f1f407e2ad0b43eba31874). However,
since commit cf9d248, we create an org.apache.hadoop.hbase.client.Table
object for each task (see the comments and usages of getHBaseTable() in
FeHBaseTable.Util). So the "synchronized" keyword is no longer needed
in HBaseTable.getEstimatedRowStats().

Keeping this method "synchronized" is also harmful. In a high-QPS
workload, queries on the same table queue up to enter this method and
spend a lot of time waiting (if this method is comparatively slow).
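
The contention described above can be reproduced outside Impala with a
small stand-alone sketch. It is not the Impala code: SharedTable,
statsSynchronized(), statsUnsynchronized() and slowCall() are
hypothetical stand-ins, and the slow HBase RPC is simulated with
Thread.sleep(). It only illustrates that a synchronized method forces
concurrent callers on the same object to run one after another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class SyncContentionSketch {
  // Hypothetical stand-in for a slow, RPC-backed stats estimation.
  static void slowCall(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  // Hypothetical table object shared by all concurrent "queries".
  static final class SharedTable {
    // Callers on the same instance are serialized by the object's monitor.
    synchronized void statsSynchronized(long millis) { slowCall(millis); }

    // Callers may overlap; safe only if no shared mutable state is touched.
    void statsUnsynchronized(long millis) { slowCall(millis); }
  }

  // Starts all workers at the same moment and returns the wall time in ms.
  static long runConcurrently(int threads, Runnable task) {
    CountDownLatch start = new CountDownLatch(1);
    List<Thread> workers = new ArrayList<>();
    for (int i = 0; i < threads; i++) {
      Thread worker = new Thread(() -> {
        try {
          start.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        task.run();
      });
      worker.start();
      workers.add(worker);
    }
    long begin = System.nanoTime();
    start.countDown();
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    return (System.nanoTime() - begin) / 1_000_000;
  }

  public static void main(String[] args) {
    SharedTable table = new SharedTable();
    long locked = runConcurrently(8, () -> table.statsSynchronized(100));
    long unlocked = runConcurrently(8, () -> table.statsUnsynchronized(100));
    System.out.println("synchronized:   " + locked + " ms");
    System.out.println("unsynchronized: " + unlocked + " ms");
  }
}
```

With the lock, the wall time grows roughly with the number of callers
times the per-call latency; without it, the calls overlap. Dropping the
lock is only safe when each caller works on its own resources, which is
the situation the commit message describes for the per-task Table
object.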

Added tracing logs to help detect slow HBase RPCs.

Tests:
 - Manually added a latency (e.g. 100ms) in
 FeHBaseTable.Util.getEstimatedRowStats() and ran concurrent queries on
 the same HBase table. In my experiment, removing "synchronized" gave a
 40% improvement in 95th-percentile query time.
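
For readers who want to repeat a measurement like the one above, the
following is a minimal sketch of how a 95th-percentile latency and a
relative improvement could be computed from per-query timings. It
assumes the nearest-rank percentile definition and assumes that the
improvement is the relative reduction in latency; the message does not
say which definitions were used. The class, the method names and the
sample values are made up for illustration and are not the measurements
from the experiment.

```java
import java.util.Arrays;

final class PercentileSketch {
  // Nearest-rank percentile: the smallest sample such that at least
  // p percent of all samples are less than or equal to it.
  static long percentile(long[] samplesMs, double p) {
    if (samplesMs.length == 0) {
      throw new IllegalArgumentException("no samples");
    }
    if (p <= 0 || p > 100) {
      throw new IllegalArgumentException("p must be in (0, 100]");
    }
    long[] sorted = samplesMs.clone();
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(p * sorted.length / 100.0);
    return sorted[rank - 1];
  }

  // Relative reduction in latency, in percent of the "before" value.
  static double improvementPercent(long beforeMs, long afterMs) {
    return 100.0 * (beforeMs - afterMs) / beforeMs;
  }

  public static void main(String[] args) {
    // Made-up query times in milliseconds, for illustration only.
    long[] before = {110, 230, 340, 450, 560, 120, 250, 330, 470, 580};
    long[] after = {105, 115, 120, 130, 140, 110, 125, 135, 145, 150};
    long p95Before = percentile(before, 95);
    long p95After = percentile(after, 95);
    System.out.println("p95 before: " + p95Before + " ms");
    System.out.println("p95 after:  " + p95After + " ms");
    System.out.println("improvement: "
        + improvementPercent(p95Before, p95After) + " %");
  }
}
```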

Change-Id: Ifa23c16ee662c4f22851c700aea2ea5be847b64d
Reviewed-on: http://gerrit.cloudera.org:8080/14188
Reviewed-by: Quanlong Huang <hu...@gmail.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Don't need synchronized in HBaseTable.getEstimatedRowStats
> ----------------------------------------------------------
>
>                 Key: IMPALA-8923
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8923
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.7.1, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0, Impala 3.0, Impala 2.12.0, Impala 3.1.0, Impala 3.2.0, Impala 3.3.0
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>
> HBaseTable.getEstimatedRowStats() estimates the number of rows and the row size by sampling the HBase table in the target key range. It requires HBase RPCs, so it can be slow.
> Currently, HBaseTable.getEstimatedRowStats() is marked as synchronized. The purpose was to protect the HTable (old HBase API) object in legacy code (before commit [cf9d248|https://github.com/apache/impala/commit/cf9d2485dd4e6544f6f1f407e2ad0b43eba31874]). However, since commit [cf9d248|https://github.com/apache/impala/commit/cf9d2485dd4e6544f6f1f407e2ad0b43eba31874], we create an org.apache.hadoop.hbase.client.Table object for each task (see the comments and usages of FeHBaseTable.Util.getHBaseTable()). So the "synchronized" marker is no longer needed in HBaseTable.getEstimatedRowStats().
> Keeping the "synchronized" marker is also harmful. In a high-QPS workload, queries on the same table queue up to enter this method and spend a lot of time waiting (if this method is comparatively slow).
> This can be revealed by manually adding a latency (e.g. 100ms) in FeHBaseTable.Util.getEstimatedRowStats() and running concurrent queries on the same HBase table. In my experiment, removing "synchronized" gave a 40% improvement in 95th-percentile query time.


