Posted to issues@hbase.apache.org by "HBase QA (JIRA)" <ji...@apache.org> on 2019/05/01 02:38:00 UTC

[jira] [Commented] (HBASE-22301) Consider rolling the WAL if the HDFS write pipeline is slow

    [ https://issues.apache.org/jira/browse/HBASE-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830834#comment-16830834 ] 

HBase QA commented on HBASE-22301:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 26s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 26s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m  2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 18s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 34s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 23s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 46s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 58s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 16s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 19s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m  8s{color} | {color:red} hbase-server: The patch generated 1 new + 24 unchanged - 0 fixed = 25 total (was 24) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 22s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  8m 12s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 56s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 28s{color} | {color:green} hbase-hadoop-compat in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 34s{color} | {color:green} hbase-hadoop2-compat in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}140m 31s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  1m 15s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}186m 10s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/PreCommit-HBASE-Build/223/artifact/patchprocess/Dockerfile |
| JIRA Issue | HBASE-22301 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12967533/HBASE-22301.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 50369a6b5d7f 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / 2c7fdb39ce |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.11 |
| checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/223/artifact/patchprocess/diff-checkstyle-hbase-server.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/223/testReport/ |
| Max. process+thread count | 5380 (vs. ulimit of 10000) |
| modules | C: hbase-hadoop-compat hbase-hadoop2-compat hbase-server U: . |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/223/console |
| Powered by | Apache Yetus 0.9.0 http://yetus.apache.org |


This message was automatically generated.



> Consider rolling the WAL if the HDFS write pipeline is slow
> -----------------------------------------------------------
>
>                 Key: HBASE-22301
>                 URL: https://issues.apache.org/jira/browse/HBASE-22301
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Minor
>             Fix For: 3.0.0, 1.5.0, 2.3.0
>
>         Attachments: HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, HBASE-22301-branch-2.patch, HBASE-22301.patch
>
>
> Consider the case where a subset of the HDFS fleet is unhealthy but suffering a gray failure, not an outright outage. HDFS operations, notably syncs, are abnormally slow on pipelines that include this subset of hosts. If the regionserver's WAL is backed by an impacted pipeline, all WAL handlers can be consumed waiting for acks from the datanodes in the pipeline (recall that some of them are sick). Imagine a write-heavy application distributing load uniformly over the cluster at a fairly high rate. With the WAL subsystem slowed by HDFS-level issues, all handlers can be blocked waiting to append to the WAL. Once all handlers are blocked, the application will experience backpressure, and all (HBase) clients eventually have too many outstanding writes and block.
> Because the application distributes writes near-uniformly over the keyspace, the probability that any given service endpoint will dispatch a request to an impacted regionserver, even if only a single regionserver is impacted, approaches 1.0. So the probability that all service endpoints will be affected approaches 1.0 as well.
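> (For intuition: if writes are uniform over N regionservers and a client issues k independent writes, the chance it avoids a single impacted regionserver is (1 - 1/N)^k. With N = 100 and k = 1000 that is 0.99^1000 ≈ 0.00004, so each endpoint is all but certain to touch the slow server. The numbers here are illustrative only.)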
> In order to break the logjam, we need to remove the slow datanodes. Although there are HDFS-level monitoring mechanisms and procedures for this, we should also attempt to take mitigating action at the HBase layer as soon as we find ourselves in trouble. It would be enough to remove the affected datanodes from the writer pipelines. A very simple strategy that can be effective is described below.
> This analysis is against branch-1 code. I think branch-2's async WAL can mitigate the problem but may still be susceptible; branch-2's sync WAL is susceptible.
> We already roll the WAL writer if the pipeline suffers the failure of a datanode and the replication factor on the pipeline is too low. We should also consider how much time it took the write pipeline to complete a sync the last time we measured it, or the max over the interval from now back to the last time we checked. If the sync time exceeds a configured threshold, roll the log writer then too. Fortunately we don't need to know which datanode is making the WAL write pipeline slow, only that syncs on the pipeline are exceeding a threshold; that is enough information to know when to roll. Once we roll, we will get three new randomly selected datanodes (with the default replication factor). On most clusters the probability that the new pipeline includes the slow datanode will be low. (And if for some reason it does end up with a problematic datanode again, we roll again.)
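> Roughly, the check could look like the sketch below. This is a minimal illustration only; the class and method names (SlowSyncRollCheck, recordSync, shouldRoll) are hypothetical, not the actual FSHLog internals.
> {code:java}
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicLong;
>
> // Hypothetical sketch: track the worst sync time seen since the last
> // check and request a roll when it exceeds a configured threshold.
> public class SlowSyncRollCheck {
>   private final long rollThresholdNs;
>   // Max pipeline sync time observed since the last check.
>   private final AtomicLong maxSyncTimeNs = new AtomicLong();
>
>   public SlowSyncRollCheck(long rollThresholdMs) {
>     this.rollThresholdNs = TimeUnit.MILLISECONDS.toNanos(rollThresholdMs);
>   }
>
>   // Called from the WAL sync path with each measured sync time.
>   public void recordSync(long syncTimeNs) {
>     maxSyncTimeNs.accumulateAndGet(syncTimeNs, Math::max);
>   }
>
>   // Called periodically; true means the WAL writer should be rolled.
>   public boolean shouldRoll() {
>     // Read and reset so each interval is judged independently.
>     return maxSyncTimeNs.getAndSet(0) > rollThresholdNs;
>   }
> }
> {code}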
> This is not a silver bullet, but it can be a reasonably effective mitigation.
> Provide a metric for tracking when a log roll is requested (and for what reason).
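> A sketch of what such a metric could look like follows; the reason names and counter wiring are illustrative, not HBase's actual metrics API.
> {code:java}
> import java.util.EnumMap;
> import java.util.Map;
> import java.util.concurrent.atomic.LongAdder;
>
> // Hypothetical sketch: count log roll requests tagged by trigger.
> public class LogRollRequestMetrics {
>   public enum Reason { SIZE, LOW_REPLICATION, SLOW_SYNC, ERROR }
>
>   private final Map<Reason, LongAdder> requests = new EnumMap<>(Reason.class);
>
>   public LogRollRequestMetrics() {
>     for (Reason r : Reason.values()) {
>       requests.put(r, new LongAdder());
>     }
>   }
>
>   // Increment when a roll is requested, tagged with why.
>   public void increment(Reason reason) {
>     requests.get(reason).increment();
>   }
>
>   public long get(Reason reason) {
>     return requests.get(reason).sum();
>   }
> }
> {code}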
> Emit a log line at log roll time that includes datanode pipeline details for further debugging and analysis, similar to the existing slow FSHLog sync log line.
> If we roll too many times within a short interval, it probably means there is a widespread problem with the fleet, in which case our mitigation is not helping and may be exacerbating those problems or operator difficulties. Ensure log roll requests triggered by this new feature happen infrequently enough not to cause difficulties under either normal or abnormal conditions. A very simple strategy that could work well under both is to define a fairly lengthy interval, defaulting to 5 minutes, and ensure we do not roll more than once during that interval for this reason; a sketch of such a guard follows.
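> The guard could be as simple as the following sketch (again, class and method names are hypothetical, not actual HBase internals):
> {code:java}
> import java.util.concurrent.TimeUnit;
>
> // Hypothetical sketch: allow at most one slow-sync-triggered roll per
> // interval so rolls cannot cascade when the whole fleet is unhealthy.
> public class SlowSyncRollLimiter {
>   private final long minIntervalNs;
>   private long lastRollNs;
>
>   public SlowSyncRollLimiter(long minIntervalMs) {
>     this.minIntervalNs = TimeUnit.MILLISECONDS.toNanos(minIntervalMs);
>     // Start far enough in the past that the first request is allowed.
>     this.lastRollNs = System.nanoTime() - minIntervalNs;
>   }
>
>   // Returns true at most once per interval.
>   public synchronized boolean tryAcquireRoll() {
>     long now = System.nanoTime();
>     if (now - lastRollNs < minIntervalNs) {
>       return false; // rolled recently for this reason; skip
>     }
>     lastRollNs = now;
>     return true;
>   }
> }
> {code}
> Combined with the threshold check sketched earlier, a roll would only be requested when both shouldRoll() and tryAcquireRoll() return true.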



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)