Posted to dev@hbase.apache.org by "Duo Zhang (Jira)" <ji...@apache.org> on 2019/10/26 15:03:00 UTC

[jira] [Resolved] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"

     [ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang resolved HBASE-23181.
-------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed

Pushed to branch-2.1+.

Thanks [~stack] and [~binlijin] for reviewing.

Will open follow-on issues to address the remaining problems.

> Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-23181
>                 URL: https://issues.apache.org/jira/browse/HBASE-23181
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 2.2.1
>            Reporter: Michael Stack
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0, 2.1.8, 2.2.3
>
>
> On a heavily loaded cluster, the WAL count keeps rising and we can get into a state where we are not rolling the logs off fast enough. In particular, there is an interesting state at the extreme where we pick a region to flush because of 'Too many WALs', but the region is actually not online. As the WAL count rises, we keep picking a region-to-flush that is no longer on the server. This condition prevents us from clearing WALs; eventually the WAL count climbs into the hundreds and the RS goes zombie with a full Call queue that starts throwing CallQueueTooLargeExceptions (bad if this server is the one carrying hbase:meta): i.e. clients fail to access the RegionServer.
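> The shape of the bug, as a minimal self-contained Java sketch (hypothetical names and simplified logic, not the actual AbstractFSWAL/LogRoller code): the WAL side keeps a map of oldest-unflushed sequence ids per region and, when the WAL count exceeds the max, picks the region holding back the oldest WAL and asks for a flush. If that region has been closed but its accounting entry was never cleaned up, the flush request is refused and the very same region is picked on every roll, so no WAL can ever be archived:
> {code}
> import java.util.*;
>
> public class TooManyWalsSketch {
>   // oldest unflushed sequence id per encoded region name (WAL-side view)
>   static Map<String, Long> oldestUnflushedSeqIds = new HashMap<>();
>   // regions actually online on this server (RS-side view)
>   static Set<String> onlineRegions = new HashSet<>();
>
>   static void rollAndMaybeFlush(int walCount, int maxWals) {
>     if (walCount <= maxWals) return;
>     // Pick the region holding back the oldest WAL...
>     String victim = Collections.min(oldestUnflushedSeqIds.entrySet(),
>         Map.Entry.comparingByValue()).getKey();
>     System.out.println("Too many WALs; count=" + walCount + ", max=" + maxWals
>         + "; forcing flush of 1 region(s): " + victim);
>     if (!onlineRegions.contains(victim)) {
>       // The bug: the flush request is dropped here, but the stale
>       // accounting entry is never removed, so the next roll picks the
>       // same region again and the WAL count keeps growing.
>       System.out.println("Failed to schedule flush of " + victim
>           + ", because it is not online on us");
>     }
>   }
>
>   public static void main(String[] args) {
>     String region = "8ee433ad59526778c53cc85ed3762d0b";
>     onlineRegions.add(region);
>     oldestUnflushedSeqIds.put(region, 100L);
>     onlineRegions.remove(region); // region closes; accounting entry remains
>     for (int count = 33; count <= 35; count++) {
>       rollAndMaybeFlush(count, 32); // same victim picked every time
>     }
>   }
> }
> {code}
> In this sketch the only escape is restarting the process, which clears the in-memory maps; that lines up with the restart observation below.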
> One symptom is a fast spike in WAL count for the RS. A restart of the RS will break the bind.
> Here is how it looks in the log:
> {code}
> # Here is region closing....
> 2019-10-16 23:10:55,897 INFO org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed 8ee433ad59526778c53cc85ed3762d0b
> ....
> # Then soon after ...
> 2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us
> 2019-10-16 23:11:45,006 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; count=45, max=32; forcing flush of 1 regions(s): 8ee433ad59526778c53cc85ed3762d0b
> ...
> # Later...
> 2019-10-16 23:20:25,427 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; count=542, max=32; forcing flush of 1 regions(s): 8ee433ad59526778c53cc85ed3762d0b
> 2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us
> {code}
> I've seen these runaway WALs on 2.2.1. I've also regularly seen runaway WALs in a 1.2.x version that had the HBASE-16721 fix in it, but can't yet say whether it was for the same reason as above.


