Posted to user@hbase.apache.org by Murthy boddu <sn...@gmail.com> on 2017/12/08 21:50:21 UTC

HBase Outage - Drop table operation stuck in "DELETE_TABLE_PRE_OPERATION"

Hi,



We recently ran into a production issue, here is the summary of events that
we went through, in timeline order:



   1. One of the region servers went down (it became inaccessible)
   2. Region transitions were initiated, and some regions of multiple
   tables got stuck in transition. Most of them were in status
   “OPEN_FAILED”, “OPENING”, “PENDING”, or “CLOSE_FAILED”.
   3. Client requests to those tables are still being diverted to the lost
   server, causing failures/timeouts. (What can we do about this?)
   4. After waiting for many hours, we ran hbck -repair per table, which
   resolved issues with some of them.
   5. One table's data can go stale within hours, so we planned to recreate
   it to avoid any corruption. Disabling the table went through fine, but
   dropping it got stuck in state “DELETE_TABLE_PRE_OPERATION”; it is
   waiting for regions in transition to finish. The regions it complains
   about are in “OPENING” status.

Here is the exception:



2017-12-08 18:59:17,975 WARN  [ProcedureExecutor-10]
procedure.DeleteTableProcedure: Retriable error trying to delete
table=Queue-SCKAD state=DELETE_TABLE_PRE_OPERATION

org.apache.hadoop.hbase.exceptions.TimeoutIOException: Timed out while
waiting on regions
Queue-SCKAD,B19,1502479054304.15a44cf47634d7d2264eaf00d61f6036. in
transition

    at org.apache.hadoop.hbase.master.procedure.ProcedureSyncWait.waitFor(ProcedureSyncWait.java:123)

    at org.apache.hadoop.hbase.master.procedure.ProcedureSyncWait.waitFor(ProcedureSyncWait.java:103)



   1. This operation has been running for more than 24 hours and doesn't
   time out (isn't there a 2-hour timeout for client operations at the
   HBase level?). Re-enabling the table also queues up with no progress.
   2. Because the table is in disabled state, running hbck isn't helping,
   as it reports regions = 0.
   3. We added a new node to the cluster to replace the old one, but the
   HBase balancer doesn't kick in at all. So, basically, region movement
   is totally stuck.
   4. No data is missing on HDFS; it is 100% consistent. An hbck detail
   report on the whole cluster also returns OK.
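For reference, the consistency checks in point 4 were along these lines
(illustrative commands, not our exact invocations; /hbase is the assumed
HBase root directory on HDFS):

```shell
# Illustrative only -- paths assumed, adjust for your deployment.
hdfs fsck /hbase -files   # HDFS block/replica health under the HBase root
hbase hbck -details       # full hbck inconsistency report for the cluster
```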



I can provide additional logs on request, but can you suggest how we can
resolve this problem with the cluster? Would restarting the HBase master
process help? We can't afford another outage on the cluster, which makes
the situation tricky.



My questions:



   1. Why does the drop operation need to wait for regions in transition
   to finish? Is there a way to abort the ongoing region movement, or even
   the drop operation itself?
   2. Why are rebalancing and the rest of the operations stuck?
   3. Can you please suggest what action can be taken to resolve this?



Thank you for your time and help.



Regards

Re: HBase Outage - Drop table operation stuck in "DELETE_TABLE_PRE_OPERATION"

Posted by Ted Yu <yu...@gmail.com>.
From the line number of ProcedureSyncWait.java, it seems you are using
1.2.x release.

Can you check master log prior to 2017-12-08 18:59 ?
Pastebin relevant master log snippet (after necessary redaction).
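Something along these lines could pull the relevant snippet (the log path
and the time window are assumptions to adjust for your deployment):

```shell
# Hypothetical helper: collect master-log lines that mention the stuck
# region's encoded name or the AssignmentManager in the hours before the
# 18:59 timeout, for pastebinning after redaction.
extract_rit_snippet() {
  local log="$1"   # e.g. /var/log/hbase/hbase-hbase-master-<host>.log
  grep -E '15a44cf47634d7d2264eaf00d61f6036|AssignmentManager' "$log" \
    | grep '2017-12-08 1[5-8]:'
}
```

Run it against the active master's log and redact hostnames before sharing.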

Once we see the master log, we can tell what might be causing the
DeleteTableProcedure to be stuck.

bq. Why rebalancing or other rest of operations are stuck?

If there are regions in transition, the balancer won't run.

"hbck -repair" combines many fixes. Normally the admin is supposed to
analyze the particular inconsistencies before issuing the proper fix.
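As a sketch of that narrower approach (these hbck flags exist in 1.x hbck,
but verify against your version with `hbase hbck -h`; the znode path
assumes the default zookeeper.znode.parent):

```shell
# Inspect first, then apply only the fix matching the reported problem.
hbase hbck -details           # report inconsistencies; fixes nothing
hbase hbck -fixAssignments    # repair unassigned / doubly-assigned regions only
hbase hbck -fixMeta           # reconcile hbase:meta with region dirs on HDFS
# List regions ZooKeeper still tracks as in transition:
hbase zkcli ls /hbase/region-in-transition
```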

Cheers

On Fri, Dec 8, 2017 at 1:50 PM, Murthy boddu <sn...@gmail.com> wrote:

> [original message quoted in full; trimmed]