Posted to user@hbase.apache.org by Harry Waye <ha...@arachnys.com> on 2016/05/27 16:37:14 UTC

HBase consistency issues (holes) and long startup

We had a regionserver fall out of our cluster. I assume the process hit a
limit, as the region server's .out log file contained only "Killed",
which I've seen before when hitting open file descriptor limits.  After
this, hbck reported inconsistencies in tables:

ERROR: There is a hole in the region chain between
dce998f6f8c63d3515a3207330697ce4-ravi teja and e4.  You need to create a
new .regioninfo and region dir in hdfs to plug the hole.

`hdfs fsck` reports a healthy dfs.

I attempted to run `hbase hbck -repairHoles` which didn't resolve the
inconsistencies.

I then restarted the HBase cluster and it now appears from looking at the
master log files that there are many tasks waiting to complete, and the web
interface results in a timeout:

master.SplitLogManager: total tasks = 299 unassigned = 285 tasks={ ... }

From looking at the logs on the regionservers I see messages such as:
"regionserver.SplitLogWorker: Current region server ... has 2 tasks in
progress and can't take more".

How can I speed up working through these tasks?  I suspect our nodes can
handle many more than 2 tasks at a time. I'll likely have follow-up
questions once these have been worked through, but I think that's it for now.

Any other information you need?

Re: HBase consistency issues (holes) and long startup

Posted by Stack <st...@duboce.net>.
On Fri, May 27, 2016 at 9:37 AM, Harry Waye <ha...@arachnys.com> wrote:

> We had a regionserver fall out of our cluster. I assume the process hit a
> limit, as the region server's .out log file contained only "Killed",
> which I've seen before when hitting open file descriptor limits.  After
> this, hbck reported inconsistencies in tables:
>
>
Or the kernel is killing the process because the machine is out of memory
(no swap, and all memory occupied by running processes).


> ERROR: There is a hole in the region chain between
> dce998f6f8c63d3515a3207330697ce4-ravi teja and e4.  You need to create a
> new .regioninfo and region dir in hdfs to plug the hole.
>
> `hdfs fsck` reports a healthy dfs.
>
> I attempted to run `hbase hbck -repairHoles` which didn't resolve the
> inconsistencies.
>
> I then restarted the HBase cluster and it now appears from looking at the
> master log files that there are many tasks waiting to complete, and the web
> interface results in a timeout:
>
> master.SplitLogManager: total tasks = 299 unassigned = 285 tasks={ ... }
>
>
It seems we are trying to split WAL files before the cluster comes back
online. Are we stuck on one WAL?



> From looking at the logs on the regionservers I see messages such as:
> "regionserver.SplitLogWorker: Current region server ... has 2 tasks in
> progress and can't take more".
>
>
There is a configuration that sets how many split tasks each regionserver
takes at a time:
"hbase.regionserver.wal.max.splitters"
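
As a rough sketch only, raising it would look something like the following
in hbase-site.xml on each regionserver. The value 4 is an arbitrary
illustration, not a recommendation; the "2 tasks in progress" message in
your log suggests the limit currently in effect is 2. I am assuming the
regionservers need a restart to pick up the change.

```xml
<!-- hbase-site.xml on each regionserver -->
<property>
  <name>hbase.regionserver.wal.max.splitters</name>
  <!-- 4 is an illustrative value only; size it to what the node can handle -->
  <value>4</value>
</property>
```

More concurrent splitters means more memory and I/O per regionserver, so if
the original kill was down to memory pressure, raise this with care.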




> How can I speed up working through these tasks?  I suspect our nodes can
> handle many more than 2 tasks at a time. I'll likely have follow-up
> questions once these have been worked through, but I think that's it for
> now.
>
>
Did your cluster recover? Or is there a bad WAL in the way, one damaged
somehow by the kill? (Perhaps processes other than the RSs are also getting
killed on your possibly oversubscribed cluster.)

Yours,
St.


> Any other information you need?
>