Posted to user@hbase.apache.org by Marc Hoppins <ma...@eset.sk> on 2021/03/01 10:16:46 UTC

RE: HBASE WALs

If you know of anything that will help I would appreciate it.

If you need any log output let me know.

Thanks


-----Original Message-----
From: Wellington Chevreuil <we...@gmail.com> 
Sent: Thursday, February 25, 2021 4:08 PM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

>
> Do WAL files contain information for multiple regions per WAL or is 
> one WAL associated with one region?
>
Edits for multiple regions are present in a single WAL file. That's why, after an RS crash, WAL processing includes a WAL split phase.
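
To illustrate the point about WAL splitting, here is a minimal sketch in Python. It is a toy model, not HBase code: the region names and edit strings are made up, and the real split phase works on WAL files in HDFS rather than on in-memory tuples. It only shows why a shared log has to be regrouped per region before each region can replay its own edits.

```python
from collections import defaultdict


def split_wal(entries):
    """Group (region, edit) pairs from one shared log into per-region lists.

    The order of edits within each region is preserved, which is what a
    region needs in order to replay them.
    """
    per_region = defaultdict(list)
    for region, edit in entries:
        per_region[region].append(edit)
    return dict(per_region)


# Hypothetical log contents: edits for two regions interleaved in one file.
wal = [
    ("region-a", "put row1"),
    ("region-b", "put row9"),
    ("region-a", "delete row2"),
]
recovered = split_wal(wal)
```

After the split, each region sees only its own edits, in the order they were written.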

> I am trying to find a way to clear a RIT for a disabled table. A similar
> problem (but on a test cluster) involved me clearing znode info, 
> deleting HDFS data for the table and deleting WALs/MasterProcWAL 
> files, finally restarting HBASE service.
>
Which hbase version are you on?

Em qui., 25 de fev. de 2021 às 11:51, Marc Hoppins <ma...@eset.sk>
escreveu:

> Hi all,
>
> Do WAL files contain information for multiple regions per WAL or is 
> one WAL associated with one region?
>
> I am trying to find a way to clear a RIT for a disabled table. A 
> similar problem (but on a test cluster) involved me clearing znode 
> info, deleting HDFS data for the table and deleting WALs/MasterProcWAL 
> files, finally restarting HBASE service.
>
> Table cannot be enabled.
>
> Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system seems 
> mostly unhappy with one region in particular, and is reporting on that.
>
> There are many tables that are very active so I don't think it is 
> possible to stop the entire service without a lot of forewarning to users.
>
> Thanks in advance.
>

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
By the way, I did have the DB guys shut down their operations, and I waited more than an hour for compactions to finish. The compactions were at 7 when I disabled all tables.

If HBASE is quiescent I would have expected no (or very few) masterProcWALs.

This is all a bit of a b*gger, to be frank. None of the meaningful tools work on this version of HBASE 2, and anything for HBASE 1 is worthless too.  We are stymied by being a bit too small to justify the extortionate prices Cloudera charges, and so everything is stuck at particular versions.

-----Original Message-----
From: Marc Hoppins <ma...@eset.sk> 
Sent: Tuesday, March 23, 2021 12:13 PM
To: user@hbase.apache.org
Subject: RE: HBASE WALs

EXTERNAL

I am still not certain what will happen.  masterProcWALs contain info for all (running) tables, yes?

If all tables are disabled and I remove the master WALs, how will that affect the other tables? When I disabled all tables, hundreds of master WALs were created. This means there are a bunch of pending operations, yes?  Is it going to make some other things inconsistent?

I did try to set the table state manually to see if the faulty table would fire up, and I restarted hbase...the state was the same: a locked table state due to the pending disable and the stuck region.

We may have the go-ahead to remove this table - I assume we cannot clone it while it is in a state of (DISABLED) flux but, once again, messing with master WALs has me on edge.


-----Original Message-----
From: Wellington Chevreuil <we...@gmail.com>
Sent: Tuesday, March 16, 2021 4:50 PM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

>
> To be clear, if the other tables are stopped, I assume all pending and 
> current operations will finish. How long will it take to write all 
> data - if indeed the data does get permanently written - so that we 
> can safely remove WALs?
>
If by "tables stopped" you mean your tables are disabled, then yeah, all related data would already have been flushed into hfiles and wouldn't be on your wals. But please be aware that what you really need here to get rid of the rogue proc is to remove master proc wals, not normal wals.
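
The distinction between normal WALs and master proc WALs can be sketched with a toy model in Python. This is an assumption-laden illustration, not HBase code: the dictionary stands in for the persisted procedure store, the state names are illustrative, and PID 73600 is made up (73587 and 73827 are the IDs mentioned in this thread). It shows why a restart alone brings the rogue procedure back, while a store cleaned out during downtime does not.

```python
def replay(proc_store):
    """Return the procedure IDs a restarted master would resume.

    Anything persisted that has not reached a terminal state is picked
    up again; finished procedures are not.
    """
    terminal = {"SUCCESS", "ROLLEDBACK"}
    return [pid for pid, state in proc_store.items() if state not in terminal]


# Hypothetical persisted state, loosely modelled on the thread.
store = {73587: "WAITING", 73827: "WAITING_TIMEOUT", 73600: "SUCCESS"}

resumed_before = replay(store)  # plain restart: unfinished procedures resume
resumed_after = replay({})      # store emptied while the master was stopped
```

Note that nothing in this model touches table data, which matches the point above: user data lives in hfiles and normal WALs, not in the procedure store.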

Em ter., 16 de mar. de 2021 às 07:12, Marc Hoppins <ma...@eset.sk>
escreveu:

> Overall, I am mystified as to how this could happen.  If Hadoop has a 
> replication factor (I believe we use the default) of 3 and we have two 
> datacenters with masters and workers in both, how can a network outage 
> affect Hadoop operation? Surely it should have used available 
> resources to continue operations...or have I misinterpreted entirely?
>
> -----Original Message-----
> From: Stack <st...@duboce.net>
> Sent: Tuesday, March 16, 2021 7:16 AM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <ma...@eset.sk> wrote:
>
> > Hi, all,
> >
> > For our stuck region, this exists in meta.  Could we alter the state 
> > to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> >
> You could but, IIRC, in that version of HBase you may need to restart
> the Master after the change (changing hbase:meta does not update the
> Master's in-memory state). On restart, the Master will read hbase:meta
> to discover Region state.
>
> S
>
>
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> >   column=info:regioninfo, timestamp=1613580024017,
> >   value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec,
> >   NAME => 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> >   STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> >   column=info:seqnumDuringOpen, timestamp=1611787189839,
> >   value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> >   column=info:server, timestamp=1611787189839,
> >   value=dr1-hbase18.jumbo.hq.eset.com:16020
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> >   column=info:serverstartcode, timestamp=1611787189839,
> >   value=1611785264032
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> >   column=info:sn, timestamp=1613580024017,
> >   value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> >   column=info:state, timestamp=1613580024017, value=OPENING
> >
> > -----Original Message-----
> > From: Wellington Chevreuil <we...@gmail.com>
> > Sent: Wednesday, March 10, 2021 10:56 AM
> > To: Hbase-User <us...@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > >
> > > Sorry if I seem stupid but this is still all new to me.
> > >
> > Forgot to mention, there's no stupid questions here. Don't be shy 
> > and keep'em coming.
> >
> > Em qua., 10 de mar. de 2021 às 09:48, Wellington Chevreuil < 
> > wellington.chevreuil@gmail.com> escreveu:
> >
> > > However, how would that help anyway?  If we cannot fix this at 
> > > this time
> > >> then any upgrade would have inconsistencies also, yes?
> > >>
> > > The upgrade on its own wouldn't fix existing inconsistencies, but 
> > > you would now have support for additional tooling
> > > (hbase-operators-tool) to help you with this.
> > >
> > > As all the 'SUCCESS' procedures have a parent ID 73587, does this 
> > > mean
> > >> that they were successfully and fully moved from hbase25 to each 
> > >> server mentioned in that procedure?  Or does it just mean that 
> > >> the region was successfully unassigned from hbase25 but the data 
> > >> still resides on hbase25?  I see locality 0.
> > >>
> > > IIRC, those were all UnassignProcedures, so it means the 
> > > unassignment of the related region has completed and the region 
> > > for that particular procedure went offline.
> > >
> > > If we change the table state in meta to 'ENABLED', could this 
> > > kickstart
> > >> all these things or will it just lead to further problems?
> > >
> > > The master works with its own in-memory cache of meta, so manually 
> > > updating meta will just make the master's cache inconsistent with it.
> > > You would need to restart the masters to get the cache reloaded from 
> > > meta. The main problem is that you still have the rogue 
> > > procedures, which you can't get rid of without stopping the 
> > > cluster. One alternative to a full cluster outage would be to 
> > > identify all RSes running the rogue procs (you can find that from 
> > > active master logs), then stop only those and the master, clean 
> > > masterprocwals, then start them again.
> > >
> > >
> > >> I suppose it means I am asking, the 73587 DisableTableProcedure, 
> > >> does it mean that the table is waiting to be disabled?  HBASE 
> > >> master declares that table is NOT enabled.
> > >>
> > > The table state may have been already updated to disabled, most of 
> > > its regions may already be offline, but the 73587 
> > > DisableTableProcedure cannot be considered "done" until all its 
> > > sub procedures are indeed completed.
> > >
> > >
> > > Em ter., 9 de mar. de 2021 às 13:40, Marc Hoppins 
> > > <ma...@eset.sk>
> > > escreveu:
> > >
> > >> Thanks for that.
> > >>
> > >> Alas, we are (currently) constrained by using Cloudera (CDH)
> > >> 6.3.1 and do not have a viable business use to pay the 
> > >> extortionate amount of money required to upgrade, which would 
> > >> give these clusters access to newer versions.
> > >>
> > >> However, how would that help anyway?  If we cannot fix this at 
> > >> this time then any upgrade would have inconsistencies also, yes?
> > >>
> > >> As all the 'SUCCESS' procedures have a parent ID 73587, does this 
> > >> mean that they were successfully and fully moved from hbase25 to 
> > >> each server mentioned in that procedure?  Or does it just mean 
> > >> that the region was successfully unassigned from hbase25 but the 
> > >> data still resides on hbase25?  I see locality 0.
> > >>
> > >> If we change the table state in meta to 'ENABLED', could this 
> > >> kickstart all these things or will it just lead to further problems?
> > >> I suppose it means I am asking, the 73587 DisableTableProcedure, 
> > >> does it mean that the table is waiting to be disabled?  HBASE 
> > >> master declares that table is NOT enabled.
> > >>
> > >> Sorry if I seem stupid but this is still all new to me.
> > >>
> > >> I appreciate the help.
> > >>
> > >> -----Original Message-----
> > >> From: Wellington Chevreuil <we...@gmail.com>
> > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > >> To: Hbase-User <us...@hbase.apache.org>
> > >> Subject: Re: HBASE WALs
> > >>
> > >> EXTERNAL
> > >>
> > >> >
> > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > >> > procedure.
> > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > >> > to be the problem.
> > >> >
> > >> Per your list procedures output attached, it seems the procs 
> > >> states are all inconsistent. There's a WAIT_TIMEOUT subproc of
> > >> 73587 with PID 73827, which is the UnassignProcedure for this 
> > >> region. Problem is that there are already 5 APs for the same 
> > >> region, which may be causing some deadlocks. If this cluster was 
> > >> on a hbck2 supported version, you could get rid of this state 
> > >> using bypass command on all these proc ids, then manually get the 
> > >> table/regions states consistent again using 
> > >> setRegionState/setTableState/assigns/unassigns methods.
> > >>
> > >> Without tooling, the only option I can think of is to stop 
> > >> cluster, clean out masterprocwals, restart cluster, then use 
> > >> hbase shell to enable/disable/assign regions. You may also need 
> > >> to manually update table/region states in meta table. Of course, 
> > >> you can automate these manual steps into your own tooling, but 
> > >> it may be a better strategy in the long term to upgrade to a more 
> > >> stable version that also benefits from more tooling supported by 
> > >> the community.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Em seg., 8 de mar. de 2021 às 07:50, Marc Hoppins 
> > >> <ma...@eset.sk>
> > >> escreveu:
> > >>
> > >> > Hi, Wellington,
> > >> >
> > >> > I was on 'vacation' (no road trip or overseas anything) for a week.
> > >> >
> > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > >> > procedure.
> > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > >> > to be the problem.
> > >> >
> > >> > I am still mystified about the HBCK2-tools. I have attached a 
> > >> > previous thread that you commented on at the time.
> > >> >
> > >> > I did build the tools for our HBASE 2.1.0...or rather, I built 
> > >> > them on Ubuntu 20.04 with OpenJDK 8 (1.8.0_212), then 
> > >> > successfully ran them on Ubuntu 16.04 with a slightly different 
> > >> > Java (Oracle Java 8, 1.8.0_181). I used them to help fix a 
> > >> > similar problem with an offline table and RITs. Both HBASE 
> > >> > versions are the same.
> > >> >
> > >> > I attach a 'sheet' with the current procs/locks.
> > >> >
> > >> > -----Original Message-----
> > >> > From: Marc Hoppins <ma...@eset.sk>
> > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > >> > To: user@hbase.apache.org
> > >> > Cc: Martin Oravec <ma...@eset.sk>
> > >> > Subject: RE: HBASE WALs
> > >> >
> > >> > EXTERNAL
> > >> >
> > >> > Thanks, Wellington,
> > >> >
> > >> > I have already built hbck1-tools for 2.1.0 using the method 
> > >> > described in other topics. All the HBASE and JDK installs here 
> > >> > are the same version, so if it worked for fixing HBASE on one 
> > >> > cluster then it should work for the other installs.
> > >> >
> > >> > Fiddling with masterprocWALs will require a complete shutdown of 
> > >> > hbase operations to prevent incoming reads/writes on other 
> > >> > tables and I am not sure how disruptive that will be other than 
> > >> > "probably a lot".
> > >> >
> > >> > -----Original Message-----
> > >> > From: Wellington Chevreuil <we...@gmail.com>
> > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > >> > To: Hbase-User <us...@hbase.apache.org>
> > >> > Subject: Re: HBASE WALs
> > >> >
> > >> > EXTERNAL
> > >> >
> > >> > Sorry, missed your previous email. I was hoping you were not on 
> > >> > a non-stable version, so that you would benefit from hbck2 tool 
> > >> > support. Unfortunately, 2.1.0 is among the early releases that 
> > >> > don't work with this tool (it requires at least 2.0.3, 2.1.1 or 
> > >> > 2.2.0).
> > >> >
> > >> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system 
> > >> > > seems mostly unhappy with one region in particular, and is 
> > >> > > reporting on that.
> > >> > >
> > >> > Are the other regions for the table properly closed, and this 
> > >> > is the only one stuck? If you do a list_procedures, are you 
> > >> > able to identify an 'unassign' procedure still running for this 
> > >> > table? Or if you grep master logs for this region, do you see 
> > >> > any messages suggesting there's still ongoing attempts to bring 
> > >> > the region offline? If there's apparently no procedure/no 
> > >> > ongoing attempts to offline the region, you might try to 
> > >> > manually update its state in meta table, then flip masters 
> > >> > (assuming you have master HA), so that the new active loads an up to date state from meta table.
> > >> >
> > >> > Otherwise, if there's still a rogue procedure trying to offline 
> > >> > the region, unfortunately, due to the lack of hbck support, you 
> > >> > would most likely need a more disruptive intervention similar 
> > >> > to what you had described in your first email, but instead of 
> > >> > normal wal folder, master proc wals is what you really would 
> > >> > need to clean out here, as that is where procedures state is 
> > >> > persisted, and you wouldn't want the rogue procedure to be resumed.
> > >> >
> > >> > Em seg., 1 de mar. de 2021 às 10:22, Marc Hoppins 
> > >> > <ma...@eset.sk>
> > >> > escreveu:
> > >> >
> > >> > > If you know of anything that will help I would appreciate it.
> > >> > >
> > >> > > If you need any log output let me know.
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > >
> > >> > > -----Original Message-----
> > >> > > From: Wellington Chevreuil <we...@gmail.com>
> > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > >> > > To: Hbase-User <us...@hbase.apache.org>
> > >> > > Subject: Re: HBASE WALs
> > >> > >
> > >> > > EXTERNAL
> > >> > >
> > >> > > >
> > >> > > > Do WAL files contain information for multiple regions per 
> > >> > > > WAL or is one WAL associated with one region?
> > >> > > >
> > >> > > Edits for multiple regions are present in a single WAL file.
> > >> > > That's why, after an RS crash, WAL processing includes a WAL 
> > >> > > split phase.
> > >> > >
> > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > >> > > > A similar
> > >> > > > problem (but on a test cluster) involved me clearing znode 
> > >> > > > info, deleting HDFS data for the table and deleting 
> > >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > >> > > >
> > >> > > Which hbase version are you on?
> > >> > >
> > >> > > Em qui., 25 de fev. de 2021 às 11:51, Marc Hoppins 
> > >> > > <ma...@eset.sk>
> > >> > > escreveu:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > Do WAL files contain information for multiple regions per 
> > >> > > > WAL or is one WAL associated with one region?
> > >> > > >
> > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > >> > > > A similar problem (but on a test cluster) involved me 
> > >> > > > clearing znode info, deleting HDFS data for the table and 
> > >> > > > deleting WALs/MasterProcWAL files, finally restarting HBASE 
> > >> > > > service.
> > >> > > >
> > >> > > > Table cannot be enabled.
> > >> > > >
> > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> > >> > > > system seems mostly unhappy with one region in particular, 
> > >> > > > and is reporting on that.
> > >> > > >
> > >> > > > There are many tables that are very active so I don't think 
> > >> > > > it is possible to stop the entire service without a lot of 
> > >> > > > forewarning to users.
> > >> > > >
> > >> > > > Thanks in advance.
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
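
The point made in the quoted exchange above, that DisableTableProcedure 73587 cannot be considered done until all of its sub-procedures complete, can be sketched as a small Python model. This is only an illustration under stated assumptions: the parent/child map and state strings are simplified stand-ins, PID 73826 is made up, and real procedure bookkeeping in the master is considerably more involved.

```python
def is_done(pid, states, children):
    """A leaf procedure is done when it reports SUCCESS.

    A parent is done only when every one of its sub-procedures is done.
    """
    kids = children.get(pid, [])
    if not kids:
        return states[pid] == "SUCCESS"
    return all(is_done(kid, states, children) for kid in kids)


# 73587 is the disable-table procedure and 73827 the stuck unassign from
# the thread; 73826 is a hypothetical sibling that already finished.
children = {73587: [73826, 73827]}
states = {73826: "SUCCESS", 73827: "WAITING_TIMEOUT"}

stuck = is_done(73587, states, children)    # one child still pending

states[73827] = "SUCCESS"
finished = is_done(73587, states, children)  # all children finished
```

One stuck sub-procedure is enough to keep the parent open, which is consistent with the table already reporting as not enabled while the disable itself never completes.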

Re: HBASE WALs

Posted by Josh Elser <el...@apache.org>.
Would recommend you reach out to Cloudera Support if you're already 
using CDH. They will be able to help you in a more hands-on way with 
steps to find the busted procWAL(s) and recover.

On 4/7/21 2:11 AM, Marc Hoppins wrote:
> Unfortunately, we are currently stuck using CDH 6.3.2 with Hbase 2.1.0.  The company cannot really justify the cost of upgrading this particular offering at the incredibly expensive price per node, as we do not have any money-making on the data being stored to justify such spending for the size of the cluster.
> 
> -----Original Message-----
> From: Stack <st...@duboce.net>
> Sent: Wednesday, April 7, 2021 12:55 AM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
> 
> EXTERNAL
> 
> On Tue, Mar 30, 2021 at 2:52 AM Marc Hoppins <ma...@eset.sk> wrote:
> 
>> Dear HBASE gang,
>>
>> ...and, as I previously mentioned, we now have a grand bunch of OLD
>> WALs milling about.
>>
>>
> WALs in the masterProcWALs dir?
> 
>> My thinking is that if nothing is going on with writing, then anything in
>> any masterProcWALs must be related to the bad table and we can just
>> wipe them and restart HBASE.
>>
>> Questions I have:
>>
>> Am I correct in my theory? (I am far from being a Java guy so am not
>> sure how to follow the process there)
>>
>>
> If the old masterProcWALs are not clearing out, must be corruption in the older WALs that is preventing them 'completing' so they can be released (meantime new procs are added ahead of the old ones...so more WALs show up).
> 
> 
>> If another (quicker) choice was made and we stop DB operations,
>> disable all tables then delete masterProcWALs, WITHOUT waiting for
>> compactions to finish, would we have a real problem with where HBASE
>> thinks data is or where it should be going due to anything that was
>> pending in masterWALs for
>> (possibly) all tables?
>>
>>
> Compactions are interruptible. Compactions have nothing to do w/ the masterProcStore (or with where data is located).
> 
> 
> 
>> Is there any sane way to deal with the information in masterWALs?  Or
>> is that only a Java API thing?
>>
>>
> Old WALs are corrupt. Could try to get hbase to a quiescent state, stop it, and try removing an old WAL... restart, see if it is all OK. Hard part is that procedures sometimes span WALs, so removal may just move the corruption forward.
> 
> Upgrade is your best course.... to 2.3. The procedure store will be migrated. There'll likely be some mess to be cleaned up but at least there is tooling to do so in later hbases.
> 
> S
> 
> 
> 
>> Thanks for all the help/info thus far.
>>
>> -----Original Message-----
>> From: Marc Hoppins <ma...@eset.sk>
>> Sent: Friday, March 26, 2021 10:49 AM
>> To: user@hbase.apache.org
>> Subject: RE: HBASE WALs
>>
>> EXTERNAL
>>
>> I wonder if anyone can explain the following:
>>
>> Before I tried my attempt to fix, HBASE master was retrying to deal
>> with that stuck region. The attempt counter was increasing - I think
>> at last count we were up to 3000 or something.  After my attempt, and
>> I restarted HBASE, it has not tried to fix the stuck region and
>> attempts are currently at zero.  All procs and locks still exist.
>>
>> -----Original Message-----
>> From: Wellington Chevreuil <we...@gmail.com>
>> Sent: Tuesday, March 23, 2021 6:16 PM
>> To: Hbase-User <us...@hbase.apache.org>
>> Subject: Re: HBASE WALs
>>
>> EXTERNAL
>>
>>>
>>> I am still not certain what will happen.  masterProcWALs contain
>>> info for all (running) tables, yes?
>>>
>> masterProcWALs only contain info for running procedures, not user
>> table data. User table data go on "normal" WALs, not "masterProcWALs".
>>
>>> If all tables are disabled and I remove the master wals, how will that
>>> affect the other tables? When I disabled all tables, hundreds of
>>> master WALs are now created. This means there is a bunch of pending
>>> operations, yes?  Is it going to make some other things inconsistent?
>>
>> Table disabling involves the unassignment of all of these tables' regions.
>> Each of these "unassign" operations comprises a set of sequential phases.
>> These internal operations are called "procedures". Information about
>> the progress of each operation as it moves through its different
>> phases is stored in these masterProcWALs files. That's why triggering
>> the "disable" command will create some data under masterProcWALs. If
>> all the disable commands finished successfully, and all your procedures
>> are finished (apart from that rogue one, which has existed for a while
>> already), you would be good to clean out masterProcWALs.
>>
>>> I did try to set the table state manually to see if the faulty table
>>> would fire up and I restarted hbase...state was the same a locked
>>> table state due to pending disable and stuck region.
>>>
>> That's because of the rogue procedure. When you restarted master, it
>> went through masterProcWALs and resumed the rogue procedure from the
>> unfinished state it was in when you restarted hbase. If you had removed
>> masterProcWALs prior to restart, the rogue procedure would now be gone.
>>
>>> We may have the go-ahead to remove this table - I assume we cannot
>>> clone it while it is in a state of (DISABLED) flux but, once again,
>>> messing with master WALs has me on edge.
>>
>>  From what I understand, you already have the tables disabled, and no
>> unfinished procs apart from the rogue one, so just clean out
>> masterProcWALs and restart master.
>>
>> Em ter., 23 de mar. de 2021 às 11:13, Marc Hoppins
>> <ma...@eset.sk>
>> escreveu:
>>
>>> I am still not certain what will happen.  masterProcWALs contain
>>> info for all (running) tables, yes?
>>>
>>> If all tables are disabled and I remove the master wals, how will
>>> that affect the other tables? When I disabled all tables, hundreds
>>> of master WALs are now created. This means there is a bunch of
>>> pending operations, yes?  Is it going to make some other things inconsistent?
>>>
>>> I did try to set the table state manually to see if the faulty table
>>> would fire up and I restarted hbase...state was the same a locked
>>> table state due to pending disable and stuck region.
>>>
>>> We may have the go-ahead to remove this table - I assume we cannot
>>> clone it while it is in a state of (DISABLED) flux but, once again,
>>> messing with master WALs has me on edge.
>>>
>>>
>>> -----Original Message-----
>>> From: Wellington Chevreuil <we...@gmail.com>
>>> Sent: Tuesday, March 16, 2021 4:50 PM
>>> To: Hbase-User <us...@hbase.apache.org>
>>> Subject: Re: HBASE WALs
>>>
>>> EXTERNAL
>>>
>>>>
>>>> To be clear, if the other tables are stopped, I assume all pending
>>>> and current operations will finish. How long will it take to write
>>>> all data - if indeed the data does get permanently written - so
>>>> that we can safely remove WALs?
>>>>
>>> If by "tables stopped" you mean your tables are disabled, then yeah,
>>> all related data would already have been flushed into hfiles and
>>> wouldn't be on your wals. But please be aware that what you really
>>> need here to get rid of the rogue proc is to remove master proc
>>> wals, not normal wals.
>>>
>>> Em ter., 16 de mar. de 2021 às 07:12, Marc Hoppins
>>> <ma...@eset.sk>
>>> escreveu:
>>>
>>>> Overall, I am mystified as to how this could happen.  If Hadoop
>>>> has a replication factor (I believe we use the default) of 3 and
>>>> we have two datacenters with masters and workers in both, how can
>>>> a network outage affect Hadoop operation? Surely it should have
>>>> used available resources to continue operations...or have I misinterpreted entirely?
>>>>
>>>> -----Original Message-----
>>>> From: Stack <st...@duboce.net>
>>>> Sent: Tuesday, March 16, 2021 7:16 AM
>>>> To: Hbase-User <us...@hbase.apache.org>
>>>> Subject: Re: HBASE WALs
>>>>
>>>> EXTERNAL
>>>>
>>>> On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins
>>>> <ma...@eset.sk>
>>> wrote:
>>>>
>>>>> Hi, all,
>>>>>
>>>>> For our stuck region, this exists in meta.  Could we alter the
>>>>> state to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
>>>>>
>>>> You could but, IIRC, in that version of HBase you may need to
>>>> restart the Master after the change (changing hbase:meta does not
>>>> update the Master's in-memory state). On restart, the Master will
>>>> read hbase:meta to discover Region state.
>>>>
>>>> S
>>>>
>>>>
>>>>> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>>>>>   column=info:regioninfo, timestamp=1613580024017,
>>>>>   value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec,
>>>>>   NAME => 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
>>>>>   STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
>>>>> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>>>>>   column=info:seqnumDuringOpen, timestamp=1611787189839,
>>>>>   value=\x00\x00\x00\x00\x00\x00\x04\x8F
>>>>> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>>>>>   column=info:server, timestamp=1611787189839,
>>>>>   value=dr1-hbase18.jumbo.hq.eset.com:16020
>>>>> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>>>>>   column=info:serverstartcode, timestamp=1611787189839,
>>>>>   value=1611785264032
>>>>> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>>>>>   column=info:sn, timestamp=1613580024017,
>>>>>   value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
>>>>> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>>>>>   column=info:state, timestamp=1613580024017, value=OPENING
>>>>>
>>>>> -----Original Message-----
>>>>> From: Wellington Chevreuil <we...@gmail.com>
>>>>> Sent: Wednesday, March 10, 2021 10:56 AM
>>>>> To: Hbase-User <us...@hbase.apache.org>
>>>>> Subject: Re: HBASE WALs
>>>>>
>>>>> EXTERNAL
>>>>>
>>>>>>
>>>>>> Sorry if I seem stupid but this is still all new to me.
>>>>>>
>>>>> Forgot to mention, there's no stupid questions here. Don't be
>>>>> shy and keep'em coming.
>>>>>
>>>>> Em qua., 10 de mar. de 2021 às 09:48, Wellington Chevreuil <
>>>>> wellington.chevreuil@gmail.com> escreveu:
>>>>>
>>>>>> However, how would that help anyway?  If we cannot fix this at
>>>>>> this time
>>>>>>> then any upgrade would have inconsistencies also, yes?
>>>>>>>
>>>>>> The upgrade on its own wouldn't fix existing inconsistencies,
>>>>>> but you would now have support for additional tooling
>>>>>> (hbase-operators-tool) to help you with this.
>>>>>>
>>>>>> As all the 'SUCCESS' procedures have a parent ID 73587, does
>>>>>> this mean
>>>>>>> that they were successfully and fully moved from hbase25 to
>>>>>>> each server mentioned in that procedure?  Or does it just
>>>>>>> mean that the region was successfully unassigned from hbase25
>>>>>>> but the data still resides on hbase25?  I see locality 0.
>>>>>>>
>>>>>> IIRC, those were all UnassignProcedures, so it means the
>>>>>> unassignment of the related region has completed and the
>>>>>> region for that particular procedure went offline.
>>>>>>
>>>>>> If we change the table state in meta to 'ENABLED', could this
>>>>>> kickstart
>>>>>>> all these things or will it just lead to further problems?
>>>>>>
>>>>>> The master works with its own in-memory cache of meta, so manually
>>>>>> updating meta will just make the master's cache inconsistent with it.
>>>>>> You would need to restart the masters to get the cache reloaded
>>>>>> from meta. The main problem is that you still have the rogue
>>>>>> procedures, which you can't get rid of without stopping the
>>>>>> cluster. One alternative to a full cluster outage would be to
>>>>>> identify all RSes running the rogue procs (you can find that
>>>>>> from active master logs), then stop only those and the master,
>>>>>> clean masterprocwals, then start them again.
>>>>>>
>>>>>>
>>>>>>> I suppose it means I am asking, the 73587
>>>>>>> DisableTableProcedure, does it mean that the table is waiting
>>>>>>> to be disabled?  HBASE master declares that table is NOT enabled.
>>>>>>>
>>>>>> The table state may have been already updated to disabled,
>>>>>> most of its regions may already be offline, but the 73587
>>>>>> DisableTableProcedure cannot be considered "done" until all
>>>>>> its sub procedures are indeed completed.
>>>>>>
>>>>>>
>>>>>> Em ter., 9 de mar. de 2021 às 13:40, Marc Hoppins
>>>>>> <ma...@eset.sk>
>>>>>> escreveu:
>>>>>>
>>>>>>> Thanks for that.
>>>>>>>
>>>>>>> Alas, we are (currently) constrained by using Cloudera (CDH)
>>>>>>> 6.3.1 and do not have a viable business use to pay the
>>>>>>> extortionate amount of money required to upgrade, which
>>>>>>> would give these clusters access to newer versions.
>>>>>>>
>>>>>>> However, how would that help anyway?  If we cannot fix this
>>>>>>> at this time then any upgrade would have inconsistencies also, yes?
>>>>>>>
>>>>>>> As all the 'SUCCESS' procedures have a parent ID 73587, does
>>>>>>> this mean that they were successfully and fully moved from
>>>>>>> hbase25 to each server mentioned in that procedure?  Or does
>>>>>>> it just mean that the region was successfully unassigned from
>>>>>>> hbase25 but the data still resides on hbase25?  I see locality 0.
>>>>>>>
>>>>>>> If we change the table state in meta to 'ENABLED', could this
>>>>>>> kickstart all these things or will it just lead to further
>>>>>>> problems?
>>>>>>> I suppose it means I am asking, the 73587
>>>>>>> DisableTableProcedure, does it mean that the table is waiting
>>>>>>> to be disabled?  HBASE master declares that table is NOT enabled.
>>>>>>>
>>>>>>> Sorry if I seem stupid but this is still all new to me.
>>>>>>>
>>>>>>> I appreciate the help.
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Wellington Chevreuil <we...@gmail.com>
>>>>>>> Sent: Tuesday, March 9, 2021 1:20 PM
>>>>>>> To: Hbase-User <us...@hbase.apache.org>
>>>>>>> Subject: Re: HBASE WALs
>>>>>>>
>>>>>>> EXTERNAL
>>>>>>>
>>>>>>>>
>>>>>>>> All fails are waiting on the same PID (73587), a DISABLE
>>>>>>>> TABLE procedure.
>>>>>>>> The offending region (f25fe93e24b34cb2f7fffddee1d89eec)
>>>>>>>> seems to be the problem.
>>>>>>>>
>>>>>>> Per your list_procedures output attached, it seems the procs
>>>>>>> states are all inconsistent. There's a WAIT_TIMEOUT subproc
>>>>>>> of 73587 with PID 73827, which is the UnassignProcedure for
>>>>>>> this region. Problem is that there are already 5 AssignProcedures
>>>>>>> (APs) for the same region, which may be causing some deadlocks.
>>>>>>> If this cluster was on a hbck2 supported version, you could get
>>>>>>> rid of this state using the bypass command on all these proc ids,
>>>>>>> then manually get the table/regions states consistent again using
>>>>>>> setRegionState/setTableState/assigns/unassigns methods.
>>>>>>>
>>>>>>> Without tooling, the only option I can think of is to stop the
>>>>>>> cluster, clean out masterprocwals, restart the cluster, then use
>>>>>>> hbase shell to enable/disable/assign regions. You may also
>>>>>>> need to manually update table/region states in meta table. Of
>>>>>>> course, you can automate these manual steps into your own
>>>>>>> tooling, but it may be a better strategy in the long term to
>>>>>>> upgrade to a more stable version that also benefits from more
>>>>>>> tooling supported by the community.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins
>>>>>>> <ma...@eset.sk>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, Wellington,
>>>>>>>>
>>>>>>>> I was on 'vacation' (no road trip or overseas anything) for
>>>>>>>> a week.
>>>>>>>>
>>>>>>>> All fails are waiting on the same PID (73587), a DISABLE
>>>>>>>> TABLE procedure.
>>>>>>>> The offending region (f25fe93e24b34cb2f7fffddee1d89eec)
>>>>>>>> seems to be the problem.
>>>>>>>>
>>>>>>>> I am still mystified about the HBCK2-tools. I have attached
>>>>>>>> a previous thread that you commented on at the time.
>>>>>>>>
>>>>>>>> I did build the tools for our HBASE 2.1.0...or rather, I
>>>>>>>> built them on Ubuntu 20.04 with openJDK8 (1.8.0_212), then
>>>>>>>> successfully ran them on Ubuntu 16.04 with a slightly
>>>>>>>> different java (Oracle Java 8, 1.8.0_181).
>>>>>>>> I used them to help fix a similar problem with an offline
>>>>>>>> table and RITs.
>>>>>>>> Both HBASE versions are the same.
>>>>>>>>
>>>>>>>> I attach a 'sheet' with the current procs/locks.
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Marc Hoppins <ma...@eset.sk>
>>>>>>>> Sent: Wednesday, March 3, 2021 9:51 AM
>>>>>>>> To: user@hbase.apache.org
>>>>>>>> Cc: Martin Oravec <ma...@eset.sk>
>>>>>>>> Subject: RE: HBASE WALs
>>>>>>>>
>>>>>>>> EXTERNAL
>>>>>>>>
>>>>>>>> Thanks, Wellington,
>>>>>>>>
>>>>>>>> I have already built a hbck1-tools for 2.1.0 using the method
>>>>>>>> described in other topics. All the HBASE and JDK here are
>>>>>>>> the same version so if it worked fixing one cluster HBASE
>>>>>>>> then it should work for other installs.
>>>>>>>>
>>>>>>>> Fiddling with masterprocWALs will require complete shutdown
>>>>>>>> of hbase operations to prevent incoming reads/writes on
>>>>>>>> other tables and I am not sure how disruptive that will be
>>>>>>>> other than "probably a lot".
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Wellington Chevreuil <we...@gmail.com>
>>>>>>>> Sent: Tuesday, March 2, 2021 10:57 AM
>>>>>>>> To: Hbase-User <us...@hbase.apache.org>
>>>>>>>> Subject: Re: HBASE WALs
>>>>>>>>
>>>>>>>> EXTERNAL
>>>>>>>>
>>>>>>>> Sorry, missed your previous email. I was hoping you were
>>>>>>>> on a stable version, so that you would benefit from
>>>>>>>> hbck2 tool support.
>>>>>>>> Unfortunately, 2.1.0 is among the early releases that don't
>>>>>>>> work with this tool (it requires at least 2.0.3, 2.1.1 or
>>>>>>>> 2.2.0).
>>>>>>>>
>>>>>>>>> Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the
>>>>>>>>> system seems mostly unhappy with one region in particular,
>>>>>>>>> and is reporting on that.
>>>>>>>>>
>>>>>>>> Are the other regions for the table properly closed, and
>>>>>>>> this is the only one stuck? If you do a list_procedures,
>>>>>>>> are you able to identify an 'unassign' procedure still
>>>>>>>> running for this table? Or if you grep master logs for this
>>>>>>>> region, do you see any messages suggesting there's still
>>>>>>>> ongoing attempts to bring the region offline? If there's
>>>>>>>> apparently no procedure/no ongoing attempts to offline the
>>>>>>>> region, you might try to manually update its state in meta
>>>>>>>> table, then flip masters (assuming you have master HA), so
>>>>>>>> that the new active loads an up-to-date state from meta table.
>>>>>>>>
>>>>>>>> Otherwise, if there's still a rogue procedure trying to
>>>>>>>> offline the region, unfortunately, due to the lack of hbck
>>>>>>>> support, you would most likely need a more disruptive
>>>>>>>> intervention similar to what you had described in your
>>>>>>>> first email, but instead of normal wal folder, master proc
>>>>>>>> wals is what you really would need to clean out here, as
>>>>>>>> that is where procedures state is persisted, and you
>>>>>>>> wouldn't want the rogue procedure to be resumed.
>>>>>>>>
>>>>>>>> On Mon, Mar 1, 2021 at 10:22, Marc Hoppins
>>>>>>>> <ma...@eset.sk>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> If you know of anything that will help I would appreciate it.
>>>>>>>>>
>>>>>>>>> If you need any log output let me know.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Wellington Chevreuil
>>>>>>>>> <we...@gmail.com>
>>>>>>>>> Sent: Thursday, February 25, 2021 4:08 PM
>>>>>>>>> To: Hbase-User <us...@hbase.apache.org>
>>>>>>>>> Subject: Re: HBASE WALs
>>>>>>>>>
>>>>>>>>> EXTERNAL
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Do WAL files contain information for multiple regions
>>>>>>>>>> per WAL or is one WAL associated with one region?
>>>>>>>>>>
>>>>>>>>> Multiple regions' edits would be present in a single wal file.
>>>>>>>>> That's why upon a RS crash and wal processing, there's a
>>>>>>>>> wal split phase.
>>>>>>>>>
>>>>>>>>>> I am trying to find a way to clear a RIT for a disabled table.
>>>>>>>>>> A similar problem (but on a test cluster) involved me clearing
>>>>>>>>>> znode info, deleting HDFS data for the table and
>>>>>>>>>> deleting WALs/MasterProcWAL files, finally restarting HBASE service.
>>>>>>>>>>
>>>>>>>>> Which hbase version are you on?
>>>>>>>>>
>>>>>>>>> On Thu, Feb 25, 2021 at 11:51, Marc Hoppins
>>>>>>>>> <ma...@eset.sk>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> Do WAL files contain information for multiple regions
>>>>>>>>>> per WAL or is one WAL associated with one region?
>>>>>>>>>>
>>>>>>>>>> I am trying to find a way to clear a RIT for a disabled
>>>>>>>>>> table. A similar problem (but on a test cluster) involved me
>>>>>>>>>> clearing znode info, deleting HDFS data for the table
>>>>>>>>>> and deleting WALs/MasterProcWAL files, finally
>>>>>>>>>> restarting HBASE service.
>>>>>>>>>>
>>>>>>>>>> Table cannot be enabled.
>>>>>>>>>>
>>>>>>>>>> Multiple locks exist for DISABLE/ENABLE/UNASSIGN but
>>>>>>>>>> the system seems mostly unhappy with one region in
>>>>>>>>>> particular, and is reporting on that.
>>>>>>>>>>
>>>>>>>>>> There are many tables that are very active so I don't
>>>>>>>>>> think it is possible to stop the entire service without
>>>>>>>>>> a lot of forewarning to users.
>>>>>>>>>>
>>>>>>>>>> Thanks in advance.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
Unfortunately, we are currently stuck using CDH 6.3.2 with HBase 2.1.0.  The company cannot really justify the cost of upgrading this particular offering at the incredibly expensive price per node, as the data being stored does not generate any revenue that would warrant such spending for a cluster of this size.

-----Original Message-----
From: Stack <st...@duboce.net> 
Sent: Wednesday, April 7, 2021 12:55 AM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

On Tue, Mar 30, 2021 at 2:52 AM Marc Hoppins <ma...@eset.sk> wrote:

> Dear HBASE gang,
>
> ...and, as I previously mentioned, we now have a grand bunch of OLD 
> WALs milling about.
>
>
WALs in the masterProcWALs dir?

> My thinking is that if nothing is going on with writing, then anything in
> any masterProcWALs must be related to the bad table and we can just 
> wipe them and restart HBASE.
>
> Questions I have:
>
> Am I correct in my theory? (I am far from being a Java guy so am not 
> sure how to follow the process there)
>
>
If the old masterProcWALs are not clearing out, must be corruption in the older WALs that is preventing them 'completing' so they can be released (meantime new procs are added ahead of the old ones...so more WALs show up).
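
To make the point above concrete, here is a toy model of why one unfinished procedure can pin a growing pile of procedure WALs. It is only a sketch: the class and function names are made up for illustration, it is not HBase's WALProcedureStore, and it assumes files are released strictly oldest-first (an assumption about the behaviour described here, not something verified against 2.1.0). PIDs 73587 and 73827 are the ones from this thread; the other ids and the file names are invented.

```python
# Toy model only: hypothetical names, not HBase code.
from dataclasses import dataclass, field


@dataclass
class ProcWal:
    """One procedure WAL file and the procedure ids whose state it still holds."""
    name: str
    latest_state_for: set = field(default_factory=set)


def releasable_prefix(wals, unfinished):
    """Walk the WALs oldest-first and stop at the first one that still
    holds state for an unfinished procedure (assumed release rule)."""
    released = []
    for wal in wals:
        if wal.latest_state_for & unfinished:
            break
        released.append(wal.name)
    return released


wals = [
    ProcWal("wal-001", {73587, 73827}),  # old file: stuck disable + unassign
    ProcWal("wal-002", {80001}),         # newer procedure, already finished
    ProcWal("wal-003", {80002, 80003}),  # newer procedures, already finished
]

# While the two old procedures never complete, nothing can be released.
print(releasable_prefix(wals, {73587, 73827}))
# Once they are gone, every file becomes releasable.
print(releasable_prefix(wals, set()))
```

Under that assumption, new procedures keep adding files behind the stuck one, which matches the "more WALs show up" symptom.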


> If another (quicker) choice was made and we stop DB operations, 
> disable all tables then delete masterProcWALs, WITHOUT waiting for 
> compactions to finish, would we have a real problem with where HBASE 
> thinks data is or where it should be going due to anything that was 
> pending in masterWALs for
> (possibly) all tables?
>
>
Compactions are interruptible. Compactions have nothing to do w/ the masterProcStore (or with where data is located).



> Is there any sane way to deal with the information in masterWALs?  Or 
> is that only a Java API thing?
>
>
Old WALs are corrupt. Could try and get hbase to a quiescent state, stop it, and try removing an old WAL... restart, see if it is all OK. Hard part is that procedures sometimes span WALs, so removal may just move the corruption forward.
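
The stop / remove / restart sequence could be scripted roughly as below. This is a dry-run sketch that only builds and prints the commands; it does not run them. The directory /hbase/MasterProcWALs (assuming hbase.rootdir is /hbase), the backup directory and the file name are all assumptions for illustration. It also moves the file aside instead of deleting it, so the step can be undone; that is a substitution on my part, not what was suggested above.

```python
import shlex


def plan_move_aside(wal_names,
                    proc_wal_dir="/hbase/MasterProcWALs",      # assumed location
                    backup_dir="/tmp/MasterProcWALs.bak"):     # hypothetical
    """Return the hdfs commands that would move the given procedure WALs
    into a backup directory. HBase must already be stopped."""
    cmds = [["hdfs", "dfs", "-mkdir", "-p", backup_dir]]
    for name in wal_names:
        cmds.append(["hdfs", "dfs", "-mv",
                     f"{proc_wal_dir}/{name}", f"{backup_dir}/{name}"])
    return cmds


# Placeholder file name; list the real directory to find the oldest file.
for cmd in plan_move_aside(["oldest-proc-wal.log"]):
    print(shlex.join(cmd))
```

Moving the backup copy back into place before the next start would restore the previous state if the restart goes badly.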

Upgrade is your best course.... to 2.3. The procedure store will be migrated. There'll likely be some mess to be cleaned up but at least there is tooling to do so in later hbases.

S



> Thanks for all the help/info thus far.
>
> -----Original Message-----
> From: Marc Hoppins <ma...@eset.sk>
> Sent: Friday, March 26, 2021 10:49 AM
> To: user@hbase.apache.org
> Subject: RE: HBASE WALs
>
> EXTERNAL
>
> I wonder if anyone can explain the following:
>
> Before I tried my attempt to fix, HBASE master was retrying to deal 
> with that stuck region. The attempt counter was increasing - I think 
> at last count we were up to 3000 or something.  After my attempt, and 
> I restarted HBASE, it has not tried to fix the stuck region and 
> attempts are currently at zero.  All procs and locks still exist.
>
> -----Original Message-----
> From: Wellington Chevreuil <we...@gmail.com>
> Sent: Tuesday, March 23, 2021 6:16 PM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > I am still not certain what will happen.  masterProcWALs contain 
> > info for all (running) tables, yes?
> >
> masterProcWALs only contain info for running procedures, not user 
> table data. User table data go on "normal" WALs, not "masterProcWALs".
>
> > If all tables are disabled and I remove the master wals, how will
> > that affect the other tables? When I disabled all tables, hundreds of
> > master WALs are now created. This means there is a bunch of pending
> > operations, yes?  Is it going to make some other things inconsistent?
>
> Table disabling involves the unassignment of all these tables' regions.
> Each of these "unassign" operations comprises a set of sequential phases.
> These internal operations are called "procedures". Information about
> the progress of each operation through its different phases is
> stored in these masterProcWALs files. That's why triggering the
> "disable" command will create some data under masterProcWALs. If all
> the disable commands finished successfully, and all your procedures
> are finished (apart from that rogue one existing for a while already),
> you would be good to clean out masterProcWALs.
>
> > I did try to set the table state manually to see if the faulty table
> > would fire up and I restarted hbase...state was the same a locked table
> > state due to pending disable and stuck region.
> >
> That's because of the rogue procedure. When you restarted master, it
> went through masterProcWALs and resumed the rogue procedure from the
> unfinished state it was in when you restarted hbase. If you had removed
> masterProcWALs prior to restart, the rogue procedure would now be gone.
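
The resume-on-restart behaviour described above can be pictured with a small toy model. Everything in it is hypothetical: the phase names, the ProcStore class and the functions are invented for illustration and are not HBase classes; only PID 73827 comes from this thread. The store stands in for masterProcWALs.

```python
# Toy model only: hypothetical names, not HBase code.
PHASES = ["QUEUE", "DISPATCH", "FINISH"]  # made-up phase names


class ProcStore:
    """Stands in for masterProcWALs: proc id -> index of the next phase to run."""
    def __init__(self):
        self.state = {}


def run(store, pid, fail_at=None):
    """Advance pid through PHASES, persisting progress after each phase.
    Stops (leaving the persisted state behind) if a phase cannot complete."""
    store.state.setdefault(pid, 0)
    for i in range(store.state[pid], len(PHASES)):
        if PHASES[i] == fail_at:
            return f"pid={pid} stuck at {PHASES[i]}"
        store.state[pid] = i + 1
    del store.state[pid]  # finished procedures leave the store
    return f"pid={pid} done"


def restart(store, fail_at=None):
    """A restarted master resumes every procedure still present in the store."""
    return [run(store, pid, fail_at) for pid in sorted(store.state)]


store = ProcStore()
print(run(store, 73827, fail_at="DISPATCH"))   # gets stuck part-way
print(restart(store, fail_at="DISPATCH"))      # restart resumes it, stuck again
store.state.clear()                            # "clean out masterProcWALs"
print(restart(store))                          # nothing left to resume
```

In this model, editing state elsewhere and restarting changes nothing, because the stuck entry is reloaded from the store each time; only emptying the store removes it.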
>
> > We may have the go-ahead to remove this table - I assume we cannot
> > clone it while it is in a state of (DISABLED) flux but, once again,
> > messing with master WALs has me on edge.
>
> From what I understand, you already have the tables disabled, and no 
> unfinished procs apart from the rogue one, so just clean out 
> masterProcWALs and restart master.
>
> On Tue, Mar 23, 2021 at 11:13, Marc Hoppins
> <ma...@eset.sk>
> wrote:
>
> > I am still not certain what will happen.  masterProcWALs contain 
> > info for all (running) tables, yes?
> >
> > If all tables are disabled and I remove the master wals, how will 
> > that affect the other tables? When I disabled all tables, hundreds 
> > of master WALs are now created. This means there is a bunch of 
> > pending operations, yes?  Is it going to make some other things inconsistent?
> >
> > I did try to set the table state manually to see if the faulty table 
> > would fire up and I restarted hbase...state was the same a locked 
> > table state due to pending disable and stuck region.
> >
> > We may have the go-ahead to remove this table - I assume we cannot 
> > clone it while it is in a state of (DISABLED) flux but, once again, 
> > messing with master WALs has me on edge.
> >
> >
> > -----Original Message-----
> > From: Wellington Chevreuil <we...@gmail.com>
> > Sent: Tuesday, March 16, 2021 4:50 PM
> > To: Hbase-User <us...@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > >
> > > To be clear, if the other tables are stopped, I assume all pending 
> > > and current operations will finish. How long will it take to write 
> > > all data - if indeed the data does get permanently written - so 
> > > that we can safely remove WALs?
> > >
> > If by "tables stopped" you mean your tables are disabled, then yeah,
> > all related data would already have been flushed into hfiles and
> > wouldn't be on your wals. But please be aware that what you really
> > need here to get rid of the rogue proc is to remove master proc
> > wals, not normal wals.
> >
> > On Tue, Mar 16, 2021 at 07:12, Marc Hoppins
> > <ma...@eset.sk>
> > wrote:
> >
> > > Overall, I am mystified as to how this could happen.  If Hadoop 
> > > has a replication factor (I believe we use the default) of 3 and 
> > > we have two datacenters with masters and workers in both, how can 
> > > a network outage affect Hadoop operation? Surely it should have 
> > > used available resources to continue operations...or have I misinterpreted entirely?
> > >
> > > -----Original Message-----
> > > From: Stack <st...@duboce.net>
> > > Sent: Tuesday, March 16, 2021 7:16 AM
> > > To: Hbase-User <us...@hbase.apache.org>
> > > Subject: Re: HBASE WALs
> > >
> > > EXTERNAL
> > >
> > > On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins
> > > <ma...@eset.sk> wrote:
> > >
> > > > Hi, all,
> > > >
> > > > For our stuck region, this exists in meta.  Could we alter the 
> > > > state to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> > > >
> > > You could but IIRC, in that version of HBase, you may need to
> > > restart the Master after the change (changing hbase:meta does not
> > > update the Master's in-memory state). On restart, Master will read
> > > hbase:meta to discover Region state.
> > >
> > > S
> > >
> > >
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:regioninfo, timestamp=1613580024017,
> > > >   value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec,
> > > >   NAME => 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> > > >   STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:seqnumDuringOpen, timestamp=1611787189839,
> > > >   value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:server, timestamp=1611787189839,
> > > >   value=dr1-hbase18.jumbo.hq.eset.com:16020
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:serverstartcode, timestamp=1611787189839,
> > > >   value=1611785264032
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:sn, timestamp=1613580024017,
> > > >   value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:state, timestamp=1613580024017, value=OPENING
> > > >
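
For anyone wanting to pick the interesting cells out of scan output like the above, a small parser sketch follows. The regular expression and the function name are my own and only assume the "column=..., timestamp=..., value=..." layout seen in this paste; the sample text is copied from the cells above.

```python
import re

# Matches one "column=..., timestamp=..., value=..." cell; the value may
# sit on the following line, as it does in the pasted output.
CELL = re.compile(
    r"column=(?P<col>[\w:]+),\s*timestamp=(?P<ts>\d+),\s*value=\s*(?P<val>\S+)"
)

SAMPLE = """\
column=info:server, timestamp=1611787189839, value=
dr1-hbase18.jumbo.hq.eset.com:16020
column=info:sn, timestamp=1613580024017, value=
ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
column=info:state, timestamp=1613580024017, value=OPENING
"""


def parse_cells(text):
    """Return {column: (timestamp, value)} for every cell found in text."""
    return {m["col"]: (int(m["ts"]), m["val"]) for m in CELL.finditer(text)}


cells = parse_cells(SAMPLE)
for col in ("info:state", "info:sn", "info:server"):
    ts, val = cells[col]
    print(col, ts, val)
```

Laid out this way it is easier to see that info:state and info:sn share the newer timestamp and name ba-hbase25, while info:server carries the older timestamp and names dr1-hbase18.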
> > > > -----Original Message-----
> > > > From: Wellington Chevreuil <we...@gmail.com>
> > > > Sent: Wednesday, March 10, 2021 10:56 AM
> > > > To: Hbase-User <us...@hbase.apache.org>
> > > > Subject: Re: HBASE WALs
> > > >
> > > > EXTERNAL
> > > >
> > > > >
> > > > > Sorry if I seem stupid but this is still all new to me.
> > > > >
> > > > Forgot to mention, there's no stupid questions here. Don't be 
> > > > shy and keep'em coming.
> > > >
> > > > On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil <
> > > > wellington.chevreuil@gmail.com> wrote:
> > > >
> > > > >> However, how would that help anyway?  If we cannot fix this at
> > > > >> this time then any upgrade would have inconsistencies also, yes?
> > > > >>
> > > > > The upgrade on its own wouldn't fix existing inconsistencies,
> > > > > but you would now have support for additional tooling
> > > > > (hbase-operator-tools) to help you with this.
> > > > >
> > > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does
> > > > >> this mean
> > > > >> that they were successfully and fully moved from hbase25 to 
> > > > >> each server mentioned in that procedure?  Or does it just 
> > > > >> mean that the region was successfully unassigned from hbase25 
> > > > >> but the data still resides on hbase25?  I see locality 0.
> > > > >>
> > > > > IIRC, those were all UnassignProcedures, so it means the 
> > > > > unassignment of the related region has completed and the 
> > > > > region for that particular procedure went offline.
> > > > >
> > > > >> If we change the table state in meta to 'ENABLED', could this
> > > > >> kickstart all these things or will it just lead to further
> > > > >> problems?
> > > > >>
> > > > > The Master works with its own in-memory cache of meta, so manually
> > > > > updating meta will just make the Master's cache inconsistent with it.
> > > > > You would need to restart the masters to get the cache reloaded
> > > > > from meta. The main problem is that you still have the rogue
> > > > > procedures, which you can't get rid of without stopping the
> > > > > cluster. One alternative to a full cluster outage would be to
> > > > > identify all RSes running the rogue procs (you can find that
> > > > > from active master logs), then stop only those and master,
> > > > > clean masterprocwals, then start them again.
> > > > >
> > > > >
> > > > >> I suppose it means I am asking, the 73587 
> > > > >> DisableTableProcedure, does it mean that the table is waiting 
> > > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > > >>
> > > > > The table state may have been already updated to disabled,
> > > > > most of its regions may already be offline, but the 73587
> > > > > DisableTableProcedure cannot be considered "done" until all
> > > > > its sub procedures are indeed completed.
> > > > >
> > > > >
> > > > > On Tue, Mar 9, 2021 at 13:40, Marc Hoppins
> > > > > <ma...@eset.sk>
> > > > > wrote:
> > > > >
> > > > >> Thanks for that.
> > > > >>
> > > > >> Alas, we are (currently) constrained by using Cloudera (CDH)
> > > > >> 6.3.1 and do not have a viable business case for paying the
> > > > >> extortionate amount of money required to upgrade, which
> > > > >> would give these clusters access to newer versions.
> > > > >>
> > > > >> However, how would that help anyway?  If we cannot fix this 
> > > > >> at this time then any upgrade would have inconsistencies also, yes?
> > > > >>
> > > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does 
> > > > >> this mean that they were successfully and fully moved from
> > > > >> hbase25 to each server mentioned in that procedure?  Or does 
> > > > >> it just mean that the region was successfully unassigned from
> > > > >> hbase25 but the data still resides on hbase25?  I see locality 0.
> > > > >>
> > > > >> If we change the table state in meta to 'ENABLED', could this
> > > > >> kickstart all these things or will it just lead to further
> > > > >> problems? I suppose what I am asking is: the 73587
> > > > >> DisableTableProcedure, does it mean that the table is waiting
> > > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > > >>
> > > > >> Sorry if I seem stupid but this is still all new to me.
> > > > >>
> > > > >> I appreciate the help.
> > > > >>
> > > > >> -----Original Message-----
> > > > >> From: Wellington Chevreuil <we...@gmail.com>
> > > > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > > > >> To: Hbase-User <us...@hbase.apache.org>
> > > > >> Subject: Re: HBASE WALs
> > > > >>
> > > > >> EXTERNAL
> > > > >>
> > > > >> >
> > > > >> > All fails are waiting on the same PID (73587), a DISABLE
> > > > >> > TABLE procedure.
> > > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec)
> > > > >> > seems to be the problem.
> > > > >> >
> > > > >> Per your list_procedures output attached, it seems the procs
> > > > >> states are all inconsistent. There's a WAIT_TIMEOUT subproc
> > > > >> of 73587 with PID 73827, which is the UnassignProcedure for
> > > > >> this region. Problem is that there are already 5 AssignProcedures
> > > > >> (APs) for the same region, which may be causing some deadlocks.
> > > > >> If this cluster was on a hbck2 supported version, you could get
> > > > >> rid of this state using the bypass command on all these proc ids,
> > > > >> then manually get the table/regions states consistent again using
> > > > >> setRegionState/setTableState/assigns/unassigns methods.
> > > > >>
> > > > >> Without tooling, the only option I can think of is to stop the
> > > > >> cluster, clean out masterprocwals, restart the cluster, then use
> > > > >> hbase shell to enable/disable/assign regions. You may also
> > > > >> need to manually update table/region states in meta table. Of
> > > > >> course, you can automate these manual steps into your own
> > > > >> tooling, but it may be a better strategy in the long term to
> > > > >> upgrade to a more stable version that also benefits from more
> > > > >> tooling supported by the community.
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins
> > > > >> <ma...@eset.sk>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi, Wellington,
> > > > >> >
> > > > >> > I was on 'vacation' (no road trip or overseas anything) for
> > > > >> > a week.
> > > > >> >
> > > > >> > All fails are waiting on the same PID (73587), a DISABLE
> > > > >> > TABLE procedure.
> > > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) 
> > > > >> > seems to be the problem.
> > > > >> >
> > > > >> > I am still mystified about the HBCK2-tools. I have attached 
> > > > >> > a previous thread that you commented on at the time.
> > > > >> >
> > > > >> > I did build the tools for our HBASE 2.1.0...or rather, I
> > > > >> > built them on Ubuntu 20.04 with openJDK8 (1.8.0_212), then
> > > > >> > successfully ran them on Ubuntu 16.04 with a slightly
> > > > >> > different java (Oracle Java 8, 1.8.0_181).
> > > > >> > I used them to help fix a similar problem with an offline
> > > > >> > table and RITs.
> > > > >> > Both HBASE versions are the same.
> > > > >> >
> > > > >> > I attach a 'sheet' with the current procs/locks.
> > > > >> >
> > > > >> > -----Original Message-----
> > > > >> > From: Marc Hoppins <ma...@eset.sk>
> > > > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > > > >> > To: user@hbase.apache.org
> > > > >> > Cc: Martin Oravec <ma...@eset.sk>
> > > > >> > Subject: RE: HBASE WALs
> > > > >> >
> > > > >> > EXTERNAL
> > > > >> >
> > > > >> > Thanks, Wellington,
> > > > >> >
> > > > >> > I have already built a hbck1-tools for 2.1.0 using the method
> > > > >> > described in other topics. All the HBASE and JDK here are
> > > > >> > the same version so if it worked fixing one cluster HBASE
> > > > >> > then it should work for other installs.
> > > > >> >
> > > > >> > Fiddling with masterprocWALs will require complete shutdown
> > > > >> > of hbase operations to prevent incoming reads/writes on
> > > > >> > other tables and I am not sure how disruptive that will be
> > > > >> > other than "probably a lot".
> > > > >> >
> > > > >> > -----Original Message-----
> > > > >> > From: Wellington Chevreuil <we...@gmail.com>
> > > > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > > > >> > To: Hbase-User <us...@hbase.apache.org>
> > > > >> > Subject: Re: HBASE WALs
> > > > >> >
> > > > >> > EXTERNAL
> > > > >> >
> > > > >> > Sorry, missed your previous email. I was hoping you were
> > > > >> > on a stable version, so that you would benefit from
> > > > >> > hbck2 tool support.
> > > > >> > Unfortunately, 2.1.0 is among the early releases that don't
> > > > >> > work with this tool (it requires at least 2.0.3, 2.1.1 or
> > > > >> > 2.2.0).
> > > > >> >
> > > > >> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the
> > > > >> > > system seems mostly unhappy with one region in particular,
> > > > >> > > and is reporting on that.
> > > > >> > >
> > > > >> > Are the other regions for the table properly closed, and 
> > > > >> > this is the only one stuck? If you do a list_procedures, 
> > > > >> > are you able to identify an 'unassign' procedure still 
> > > > >> > running for this table? Or if you grep master logs for this 
> > > > >> > region, do you see any messages suggesting there's still 
> > > > >> > ongoing attempts to bring the region offline? If there's 
> > > > >> > apparently no procedure/no ongoing attempts to offline the 
> > > > >> > region, you might try to manually update its state in meta 
> > > > >> > table, then flip masters (assuming you have master HA), so
> > > > >> > that the new active loads an up-to-date state from meta table.
> > > > >> >
> > > > >> > Otherwise, if there's still a rogue procedure trying to 
> > > > >> > offline the region, unfortunately, due to the lack of hbck 
> > > > >> > support, you would most likely need a more disruptive 
> > > > >> > intervention similar to what you had described in your 
> > > > >> > first email, but instead of normal wal folder, master proc 
> > > > >> > wals is what you really would need to clean out here, as 
> > > > >> > that is where procedures state is persisted, and you
> > > > >> > wouldn't want the rogue procedure to be resumed.
> > > > >> >
> > > > >> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins
> > > > >> > <ma...@eset.sk>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > If you know of anything that will help I would appreciate it.
> > > > >> > >
> > > > >> > > If you need any log output let me know.
> > > > >> > >
> > > > >> > > Thanks
> > > > >> > >
> > > > >> > >
> > > > >> > > -----Original Message-----
> > > > >> > > From: Wellington Chevreuil 
> > > > >> > > <we...@gmail.com>
> > > > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > > > >> > > To: Hbase-User <us...@hbase.apache.org>
> > > > >> > > Subject: Re: HBASE WALs
> > > > >> > >
> > > > >> > > EXTERNAL
> > > > >> > >
> > > > >> > > >
> > > > >> > > > Do WAL files contain information for multiple regions 
> > > > >> > > > per WAL or is one WAL associated with one region?
> > > > >> > > >
> > > > >> > > Multiple regions' edits would be present in a single wal file.
> > > > >> > > That's why upon a RS crash and wal processing, there's a
> > > > >> > > wal split phase.
> > > > >> > >
> > > > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > > > >> > > > A similar problem (but on a test cluster) involved me clearing
> > > > >> > > > znode info, deleting HDFS data for the table and
> > > > >> > > > deleting WALs/MasterProcWAL files, finally restarting HBASE service.
> > > > >> > > >
> > > > >> > > Which hbase version are you on?
> > > > >> > >
> > > > >> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins
> > > > >> > > <ma...@eset.sk>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Hi all,
> > > > >> > > >
> > > > >> > > > Do WAL files contain information for multiple regions 
> > > > >> > > > per WAL or is one WAL associated with one region?
> > > > >> > > >
> > > > >> > > > I am trying to find a way to clear a RIT for a disabled
> > > > >> > > > table. A similar problem (but on a test cluster) involved me
> > > > >> > > > clearing znode info, deleting HDFS data for the table
> > > > >> > > > and deleting WALs/MasterProcWAL files, finally
> > > > >> > > > restarting HBASE service.
> > > > >> > > >
> > > > >> > > > Table cannot be enabled.
> > > > >> > > >
> > > > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but 
> > > > >> > > > the system seems mostly unhappy with one region in 
> > > > >> > > > particular, and is reporting on that.
> > > > >> > > >
> > > > >> > > > There are many tables that are very active so I don't 
> > > > >> > > > think it is possible to stop the entire service without 
> > > > >> > > > a lot of forewarning to users.
> > > > >> > > >
> > > > >> > > > Thanks in advance.
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
Master WALs

-----Original Message-----
From: Stack <st...@duboce.net> 
Sent: Wednesday, April 7, 2021 12:55 AM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

On Tue, Mar 30, 2021 at 2:52 AM Marc Hoppins <ma...@eset.sk> wrote:

> Dear HBASE gang,
>
> ...and, as I previously mentioned, we now have a grand bunch of OLD 
> WALs milling about.
>
>
WALs in the masterProcWALs dir?

> My thinking is that if nothing is going on with writing, then anything in
> any masterProcWALs must be related to the bad table and we can just 
> wipe them and restart HBASE.
>
> Questions I have:
>
> Am I correct in my theory? (I am far from being a Java guy so am not 
> sure how to follow the process there)
>
>
If the old masterProcWALs are not clearing out, must be corruption in the older WALs that is preventing them 'completing' so they can be released (meantime new procs are added ahead of the old ones...so more WALs show up).


> If another (quicker) choice was made and we stop DB operations, 
> disable all tables then delete masterProcWALs, WITHOUT waiting for 
> compactions to finish, would we have a real problem with where HBASE 
> thinks data is or where it should be going due to anything that was 
> pending in masterWALs for
> (possibly) all tables?
>
>
Compactions are interruptible. Compactions have nothing to do w/ the masterProcStore (or with where data is located).



> Is there any sane way to deal with the information in masterWALs?  Or 
> is that only a Java API thing?
>
>
Old WALs are corrupt. Could try and get hbase to a quiescent state, stop it, and try removing an old WAL... restart, see if it is all OK. Hard part is that procedures sometimes span WALs, so removal may just move the corruption forward.

Upgrade is your best course.... to 2.3. The procedure store will be migrated. There'll likely be some mess to be cleaned up but at least there is tooling to do so in later hbases.

S



> Thanks for all the help/info thus far.
>
> -----Original Message-----
> From: Marc Hoppins <ma...@eset.sk>
> Sent: Friday, March 26, 2021 10:49 AM
> To: user@hbase.apache.org
> Subject: RE: HBASE WALs
>
> EXTERNAL
>
> I wonder if anyone can explain the following:
>
> Before I tried my attempt to fix, HBASE master was retrying to deal 
> with that stuck region. The attempt counter was increasing - I think 
> at last count we were up to 3000 or something.  After my attempt, and 
> I restarted HBASE, it has not tried to fix the stuck region and 
> attempts are currently at zero.  All procs and locks still exist.
>
> -----Original Message-----
> From: Wellington Chevreuil <we...@gmail.com>
> Sent: Tuesday, March 23, 2021 6:16 PM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > I am still not certain what will happen.  masterProcWALs contain 
> > info for all (running) tables, yes?
> >
> masterProcWALs only contain info for running procedures, not user 
> table data. User table data go on "normal" WALs, not "masterProcWALs".
>
> > If all tables are disabled and I remove the master wals, how will that
> > affect the other tables? When I disabled all tables, hundreds of 
> > master WALs are now created. This means there is a bunch of pending 
> > operations, yes?  Is it going to make some other things inconsistent?
>
> Table disabling involves the unassignment of all of these tables' regions.
> Each of these "unassign" operations comprises a set of sequential phases.
> These internal operations are called "procedures". Information about
> the progress of each operation through its different phases is stored
> in these masterProcWALs files. That's why triggering the "disable"
> command will create some data under masterProcWALs. If all the disable
> commands finished successfully, and all your procedures are finished
> (apart from that rogue one that has existed for a while already), you
> would be good to clean out masterProcWALs.
>
> I did try to set the table state manually to see if the faulty table 
> would
> > fire up and I restarted hbase...state was the same a locked table 
> > state due to pending disable and stuck region.
> >
> That's because of the rogue procedure. When you restarted master, it
> went through masterProcWALs and resumed the rogue procedure from the
> unfinished state it was in when you restarted hbase. If you had removed
> masterProcWALs prior to restart, the rogue procedure would now be gone.
>
> We may have the go-ahead to remove this table - I assume we cannot 
> clone it
> > while it is in a state of (DISABLED) flux but, once again, messing 
> > with master WALs has me on edge.
>
> From what I understand, you already have the tables disabled, and no 
> unfinished procs apart from the rogue one, so just clean out 
> masterProcWALs and restart master.
>
> Em ter., 23 de mar. de 2021 às 11:13, Marc Hoppins 
> <ma...@eset.sk>
> escreveu:
>
> > I am still not certain what will happen.  masterProcWALs contain 
> > info for all (running) tables, yes?
> >
> > If all tables are disabled and I remove the master wals, how will 
> > that affect the other tables? When I disabled all tables, hundreds 
> > of master WALs are now created. This means there is a bunch of 
> > pending operations, yes?  Is it going to make some other things inconsistent?
> >
> > I did try to set the table state manually to see if the faulty table 
> > would fire up and I restarted hbase...state was the same a locked 
> > table state due to pending disable and stuck region.
> >
> > We may have the go-ahead to remove this table - I assume we cannot 
> > clone it while it is in a state of (DISABLED) flux but, once again, 
> > messing with master WALs has me on edge.
> >
> >
> > -----Original Message-----
> > From: Wellington Chevreuil <we...@gmail.com>
> > Sent: Tuesday, March 16, 2021 4:50 PM
> > To: Hbase-User <us...@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > >
> > > To be clear, if the other tables are stopped, I assume all pending 
> > > and current operations will finish. How long will it take to write 
> > > all data - if indeed the data does get permanently written - so 
> > > that we can safely remove WALs?
> > >
> > If by "tables stopped" you mean your tables are disabled, then yeah, 
> > all related data would already have been flushed into hfiles and 
> > wouldn't be on your wals. But please be aware that what you really 
> > need here to get rid of the rogue proc is to remove master proc 
> > wals,
> not normal wals.
> >
> > Em ter., 16 de mar. de 2021 às 07:12, Marc Hoppins 
> > <ma...@eset.sk>
> > escreveu:
> >
> > > Overall, I am mystified as to how this could happen.  If Hadoop 
> > > has a replication factor (I believe we use the default) of 3 and 
> > > we have two datacenters with masters and workers in both, how can 
> > > a network outage affect Hadoop operation? Surely it should have 
> > > used available resources to continue operations...or have I misinterpreted entirely?
> > >
> > > -----Original Message-----
> > > From: Stack <st...@duboce.net>
> > > Sent: Tuesday, March 16, 2021 7:16 AM
> > > To: Hbase-User <us...@hbase.apache.org>
> > > Subject: Re: HBASE WALs
> > >
> > > EXTERNAL
> > >
> > > On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins 
> > > <ma...@eset.sk>
> > wrote:
> > >
> > > > Hi, all,
> > > >
> > > > For our stuck region, this exists in meta.  Could we alter the 
> > > > state to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> > > >
> > > > You could but IIRC, in that version of HBase, you may need to 
> > > > restart the
> > > Master after the change (changing hbase:meta does not update the 
> > > Master's in-memory state). On restart, Master will read hbase:meta 
> > > to discover Region state.
> > >
> > > S
> > >
> > >
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:regioninfo, timestamp=1613580024017,
> > > > >   value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
> > > > >   'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> > > > >   STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:seqnumDuringOpen, timestamp=1611787189839,
> > > > >   value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:server, timestamp=1611787189839,
> > > > >   value=dr1-hbase18.jumbo.hq.eset.com:16020
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:serverstartcode, timestamp=1611787189839,
> > > > >   value=1611785264032
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:sn, timestamp=1613580024017,
> > > > >   value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:state, timestamp=1613580024017, value=OPENING
> > > >
> > > > -----Original Message-----
> > > > From: Wellington Chevreuil <we...@gmail.com>
> > > > Sent: Wednesday, March 10, 2021 10:56 AM
> > > > To: Hbase-User <us...@hbase.apache.org>
> > > > Subject: Re: HBASE WALs
> > > >
> > > > EXTERNAL
> > > >
> > > > >
> > > > > Sorry if I seem stupid but this is still all new to me.
> > > > >
> > > > Forgot to mention, there's no stupid questions here. Don't be 
> > > > shy and keep'em coming.
> > > >
> > > > Em qua., 10 de mar. de 2021 às 09:48, Wellington Chevreuil < 
> > > > wellington.chevreuil@gmail.com> escreveu:
> > > >
> > > > > However, how would that help anyway?  If we cannot fix this at 
> > > > > this time
> > > > >> then any upgrade would have inconsistencies also, yes?
> > > > >>
> > > > > > The upgrade on its own wouldn't fix existing inconsistencies,
> > > > > but you would now have support for additional tooling
> > > > > (hbase-operators-tool) to help you with this.
> > > > >
> > > > > As all the 'SUCCESS' procedures have a parent ID 73587, does 
> > > > > this mean
> > > > >> that they were successfully and fully moved from hbase25 to 
> > > > >> each server mentioned in that procedure?  Or does it just 
> > > > >> mean that the region was successfully unassigned from hbase25 
> > > > >> but the data still resides on hbase25?  I see locality 0.
> > > > >>
> > > > > IIRC, those were all UnassignProcedures, so it means the 
> > > > > unassignment of the related region has completed and the 
> > > > > region for that particular procedure went offline.
> > > > >
> > > > > If we change the table state in meta to 'ENABLED', could this 
> > > > > kickstart
> > > > >> all these things or will it just lead to further problems?
> > > > >
> > > > > > The master works from its own in-memory cache of meta, so manually
> > > > > > updating meta will just make the master's cache inconsistent with it.
> > > > > > You would need to restart the masters to get the cache reloaded
> > > > > > from meta. The main problem is that you still have the rogue
> > > > > > procedures, which you can't get rid of without stopping the
> > > > > > cluster. One alternative to a full cluster outage would be to
> > > > > > identify all RSes running the rogue procs (you can find that
> > > > > > from the active master logs), then stop only those and the master,
> > > > > > clean masterprocwals, then start them again.
> > > > >
> > > > >
> > > > >> I suppose it means I am asking, the 73587 
> > > > >> DisableTableProcedure, does it mean that the table is waiting 
> > > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > > >>
> > > > > The table state may have been already updated to disabled, 
> > > > > most of its regions may already be offline, but the 73587 
> > > > > DisableTableProcedure cannot be considered "done" until all 
> > > > > its sub procedures are indeed
> > > > completed.
> > > > >
> > > > >
> > > > > Em ter., 9 de mar. de 2021 às 13:40, Marc Hoppins 
> > > > > <ma...@eset.sk>
> > > > > escreveu:
> > > > >
> > > > >> Thanks for that.
> > > > >>
> > > > > >> Alas, we are (currently) constrained by using Cloudera (CDH)
> > > > > >> 6.3.1 and do not have a viable business case for paying the
> > > > > >> extortionate amount of money required to upgrade, which
> > > > > >> would give these clusters access to newer versions.
> > > > >>
> > > > >> However, how would that help anyway?  If we cannot fix this 
> > > > >> at this time then any upgrade would have inconsistencies also, yes?
> > > > >>
> > > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does 
> > > > >> this mean that they were successfully and fully moved from
> > > > >> hbase25 to each server mentioned in that procedure?  Or does 
> > > > >> it just mean that the region was successfully unassigned from
> > > > >> hbase25 but the data still resides on hbase25?  I see locality 0.
> > > > >>
> > > > >> If we change the table state in meta to 'ENABLED', could this 
> > > > >> kickstart all these things or will it just lead to further
> problems?
> > > > >> I suppose it means I am asking, the 73587 
> > > > >> DisableTableProcedure, does it mean that the table is waiting 
> > > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > > >>
> > > > >> Sorry if I seem stupid but this is still all new to me.
> > > > >>
> > > > >> I appreciate the help.
> > > > >>
> > > > >> -----Original Message-----
> > > > >> From: Wellington Chevreuil <we...@gmail.com>
> > > > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > > > >> To: Hbase-User <us...@hbase.apache.org>
> > > > >> Subject: Re: HBASE WALs
> > > > >>
> > > > >> EXTERNAL
> > > > >>
> > > > >> >
> > > > >> > All fails are waiting on the same PID (73587), a DISABLE 
> > > > >> > TABLE
> > > > >> procedure.
> > > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) 
> > > > >> > seems to be the problem.
> > > > >> >
> > > > >> Per your list procedures output attached, it seems the procs 
> > > > >> states are all inconsistent. There's a WAIT_TIMEOUT subproc 
> > > > >> of
> > > > >> 73587 with PID 73827, which is the UnassignProcedure for this 
> > > > >> region. Problem is that there are already 5 APs for the same 
> > > > >> region, which may be causing some deadlocks. If this cluster 
> > > > >> was on a hbck2 supported version, you could get rid of this 
> > > > >> state using bypass command on all these proc ids, then 
> > > > >> manually get the table/regions states consistent again using 
> > > > >> setRegionState/setTableState/assigns/unassigns
> > > methods.
> > > > >>
> > > > >> Without tooling, the only option I can think of is to stop 
> > > > >> cluster, clean out masterprocwals, restart cluster, then use 
> > > > >> hbase shell to enable/disable/assign regions. You may also 
> > > > >> need to manually update table/region states in meta table. Of 
> > > > >> course, you can automate these manual steps into your own 
> > > > >> tooling, but may be a better strategy in the long term to 
> > > > >> upgrade to a more stable version that also benefits from more 
> > > > >> tooling supported by
> > the community.
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> Em seg., 8 de mar. de 2021 às 07:50, Marc Hoppins 
> > > > >> <ma...@eset.sk>
> > > > >> escreveu:
> > > > >>
> > > > >> > Hi, Wellington,
> > > > >> >
> > > > >> > I was on 'vacation' (no road trip or overseas anything) for 
> > > > >> > a
> > week.
> > > > >> >
> > > > >> > All fails are waiting on the same PID (73587), a DISABLE 
> > > > >> > TABLE
> > > > >> procedure.
> > > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) 
> > > > >> > seems to be the problem.
> > > > >> >
> > > > >> > I am still mystified about the HBCK2-tools. I have attached 
> > > > >> > a previous thread that you commented on at the time.
> > > > >> >
> > > > >> > I did build a tools for our HBASE 2.1.0...or rather, I 
> > > > >> > built it on Ubuntu
> > > > >> > 20.04 with openJDK8 (1.8.0_212), then successfully ran it 
> > > > >> > on Ubuntu
> > > > >> > 16.04 with a slightly different java (Oracle Java 8, 1.8.0_181).
> > > > >> > I used it to help fix a similar problem with an offline 
> > > > >> > table and
> > > RITs.
> > > > >> > Both HBASE versions are the same.
> > > > >> >
> > > > >> > I attach a 'sheet' with the current procs/locks.
> > > > >> >
> > > > >> > -----Original Message-----
> > > > >> > From: Marc Hoppins <ma...@eset.sk>
> > > > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > > > >> > To: user@hbase.apache.org
> > > > >> > Cc: Martin Oravec <ma...@eset.sk>
> > > > >> > Subject: RE: HBASE WALs
> > > > >> >
> > > > >> > EXTERNAL
> > > > >> >
> > > > >> > Thanks, Wellington,
> > > > >> >
> > > > > >> > I have already built an hbck1-tools for 2.1.0 using the method
> > > > > >> > described in other topics. All the HBASE and JDK here are
> > > > > >> > the same version so if it worked fixing one cluster's HBASE
> > > > > >> > then it should work for other installs.
> > > > >> >
> > > > > >> > Fiddling with masterprocWALs will require a complete shutdown
> > > > > >> > of hbase operations to prevent incoming reads/writes on
> > > > > >> > other tables and I am not sure how disruptive that will be
> > > > > >> > other than "probably a lot".
> > > > >> >
> > > > >> > -----Original Message-----
> > > > >> > From: Wellington Chevreuil <we...@gmail.com>
> > > > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > > > >> > To: Hbase-User <us...@hbase.apache.org>
> > > > >> > Subject: Re: HBASE WALs
> > > > >> >
> > > > >> > EXTERNAL
> > > > >> >
> > > > >> > Sorry, missed your previous email. I was hoping you were 
> > > > >> > not on a non-stable version, so that you would benefit from 
> > > > >> > hbck2 tool
> > > support.
> > > > >> > Unfortunately, 2.1.0 is among the early releases that don't 
> > > > >> > work with this tool (it requires at least 2.0.3, 2.1.1 or
> 2.2.0).
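The version rule quoted above (hbck2 needs at least 2.0.3, 2.1.1 or 2.2.0) can be written down as a small check. This is only a sketch in Python, not part of hbck2: the function name `hbck2_supported` is made up here, and it assumes plain `x.y.z` version strings (vendor suffixes are not handled).

```python
# Minimum patch release per minor line, taken from the message above
# (2.0.3, 2.1.1, 2.2.0). The function name is hypothetical.
MIN_PATCH = {(2, 0): 3, (2, 1): 1}

def hbck2_supported(version):
    """Return True if a plain 'x.y.z' HBase version meets the quoted minimums."""
    major, minor, patch = (int(p) for p in version.split(".")[:3])
    if (major, minor) in MIN_PATCH:
        return patch >= MIN_PATCH[(major, minor)]
    # Anything from the 2.2 line onward is treated as supported.
    return (major, minor) >= (2, 2)

for v in ("2.0.3", "2.1.0", "2.1.1", "2.2.0"):
    print(v, hbck2_supported(v))
```

Under this rule, the 2.1.0 release discussed in this thread falls just below the 2.1.1 minimum.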
> > > > >> >
> > > > >> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> > > > >> > system seems
> > > > >> > > mostly unhappy with one region in particular, and is 
> > > > >> > > reporting on
> > > > >> that.
> > > > >> > >
> > > > >> > Are the other regions for the table properly closed, and 
> > > > >> > this is the only one stuck? If you do a list_procedures, 
> > > > >> > are you able to identify an 'unassign' procedure still 
> > > > >> > running for this table? Or if you grep master logs for this 
> > > > >> > region, do you see any messages suggesting there's still 
> > > > >> > ongoing attempts to bring the region offline? If there's 
> > > > >> > apparently no procedure/no ongoing attempts to offline the 
> > > > >> > region, you might try to manually update its state in meta 
> > > > >> > table, then flip masters (assuming you have master HA), so 
> > > > >> > that the new active loads an up
> > to date state from meta table.
> > > > >> >
> > > > >> > Otherwise, if there's still a rogue procedure trying to 
> > > > >> > offline the region, unfortunately, due to the lack of hbck 
> > > > >> > support, you would most likely need a more disruptive 
> > > > >> > intervention similar to what you had described in your 
> > > > >> > first email, but instead of normal wal folder, master proc 
> > > > >> > wals is what you really would need to clean out here, as 
> > > > >> > that is where procedures state is persisted, and you 
> > > > >> > wouldn't want the rogue procedure to be
> > resumed.
> > > > >> >
> > > > >> > Em seg., 1 de mar. de 2021 às 10:22, Marc Hoppins 
> > > > >> > <ma...@eset.sk>
> > > > >> > escreveu:
> > > > >> >
> > > > >> > > If you know of anything that will help I would appreciate it.
> > > > >> > >
> > > > >> > > If you need any log output let me know.
> > > > >> > >
> > > > >> > > Thanks
> > > > >> > >
> > > > >> > >
> > > > >> > > -----Original Message-----
> > > > >> > > From: Wellington Chevreuil 
> > > > >> > > <we...@gmail.com>
> > > > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > > > >> > > To: Hbase-User <us...@hbase.apache.org>
> > > > >> > > Subject: Re: HBASE WALs
> > > > >> > >
> > > > >> > > EXTERNAL
> > > > >> > >
> > > > >> > > >
> > > > >> > > > Do WAL files contain information for multiple regions 
> > > > >> > > > per WAL or is one WAL associated with one region?
> > > > >> > > >
> > > > >> > > Multiple regions edits would be present in a single wal file.
> > > > >> > > That's why upon a RS crash and wal processing, there's a 
> > > > >> > > wal split
> > > > phase.
> > > > >> > >
> > > > >> > > I am trying to find a way to clear a RIT for a disabled table.
> > > > >> > > A similar
> > > > >> > > > problem (but on a test cluster) involved me clearing 
> > > > >> > > > znode info, deleting HDFS data for the table and 
> > > > >> > > > deleting WALs/MasterProcWAL files, finally restarting HBASE service.
> > > > >> > > >
> > > > >> > > Which hbase version are you on?
> > > > >> > >
> > > > >> > > Em qui., 25 de fev. de 2021 às 11:51, Marc Hoppins 
> > > > >> > > <ma...@eset.sk>
> > > > >> > > escreveu:
> > > > >> > >
> > > > >> > > > Hi all,
> > > > >> > > >
> > > > >> > > > Do WAL files contain information for multiple regions 
> > > > >> > > > per WAL or is one WAL associated with one region?
> > > > >> > > >
> > > > >> > > > I am trying to find a way to clear a RIT for a disabled
> table.
> > > > >> > > > A similar problem (but on a test cluster) involved me 
> > > > >> > > > clearing znode info, deleting HDFS data for the table 
> > > > >> > > > and deleting WALs/MasterProcWAL files, finally 
> > > > >> > > > restarting HBASE
> > > service.
> > > > >> > > >
> > > > >> > > > Table cannot be enabled.
> > > > >> > > >
> > > > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but 
> > > > >> > > > the system seems mostly unhappy with one region in 
> > > > >> > > > particular, and is reporting
> > > > >> > on that.
> > > > >> > > >
> > > > >> > > > There are many tables that are very active so I don't 
> > > > >> > > > think it is possible to stop the entire service without 
> > > > >> > > > a lot of forewarning to
> > > > >> > > users.
> > > > >> > > >
> > > > >> > > > Thanks in advance.
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: HBASE WALs

Posted by Stack <st...@duboce.net>.
On Tue, Mar 30, 2021 at 2:52 AM Marc Hoppins <ma...@eset.sk> wrote:

> Dear HBASE gang,
>
> ...and, as I previously mentioned, we now have a grand bunch of OLD WALs
> milling about.
>
>
WALs in the masterProcWALs dir?

> My thinking is that if nothing is going on with writing, then anything in
> any masterProcWALs must be related to the bad table and we can just wipe
> them and restart HBASE.
>
> Questions I have:
>
> Am I correct in my theory? (I am far from being a Java guy so am not sure
> how to follow the process there)
>
>
If the old masterProcWALs are not clearing out, there must be corruption in
the older WALs that is preventing them from 'completing' so they can be
released (meanwhile, new procs are added ahead of the old ones, so more WALs
show up).


> If another (quicker) choice was made and we stop DB operations, disable
> all tables then delete masterProcWALs, WITHOUT waiting for compactions to
> finish, would we have a real problem with where HBASE thinks data is or
> where it should be going due to anything that was pending in masterWALs for
> (possibly) all tables?
>
>
Compactions are interruptible. Compactions have nothing to do w/ the
masterProcStore (or with where data is located).



> Is there any sane way to deal with the information in masterWALs?  Or is
> that only a Java API thing?
>
>
The old WALs are corrupt. You could try to get hbase to a quiescent state, stop it,
and try removing an old WAL, then restart and see if all is ok. Hard part is that
procedures sometimes span WALs so removal may just move forward the
corruption.
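
A toy model of the point about procedures spanning WALs, in Python. This is not HBase's procedure store code: the file names, the helper names, the pids other than 73587 and 73827, and the rule used (a WAL can go once none of the procedures recorded in it is still unfinished) are simplifying assumptions for illustration only.

```python
# Toy model only: not the real procedure store logic.
# Each WAL file maps to the set of procedure ids that have state recorded in it.
wals = {
    "wal-001.log": {73587, 73827},   # hypothetical file names
    "wal-002.log": {73827, 80001},   # 80001/80002 are made-up pids
    "wal-003.log": {80001, 80002},
}

def releasable(wals, unfinished):
    """Return WAL names that hold no state for any unfinished procedure."""
    return sorted(name for name, pids in wals.items() if not (pids & unfinished))

def affected_by_removal(wals, removed):
    """Procedures that lose part of their state if `removed` is deleted
    but still have state left in another WAL."""
    lost = wals[removed]
    remaining = set().union(*(p for n, p in wals.items() if n != removed))
    return sorted(lost & remaining)

stuck = {73587, 73827}
print(releasable(wals, stuck))                   # ['wal-003.log']
print(affected_by_removal(wals, "wal-001.log"))  # [73827]
```

In this model, removing only the oldest file leaves procedure 73827 with partial state in the next file, which is the "move the corruption forward" risk described above.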

Upgrade is your best course.... to 2.3. The procedure store will be
migrated. There'll likely be some mess to be cleaned up but at least there
is tooling to do so in later hbases.

S



> Thanks for all the help/info thus far.
>
> -----Original Message-----
> From: Marc Hoppins <ma...@eset.sk>
> Sent: Friday, March 26, 2021 10:49 AM
> To: user@hbase.apache.org
> Subject: RE: HBASE WALs
>
> EXTERNAL
>
> I wonder if anyone can explain the following:
>
> Before I tried my attempt to fix, HBASE master was retrying to deal with
> that stuck region. The attempt counter was increasing - I think at last
> count we were up to 3000 or something.  After my attempt, and I restarted
> HBASE, it has not tried to fix the stuck region and attempts are currently
> at zero.  All procs and locks still exist.
>
> -----Original Message-----
> From: Wellington Chevreuil <we...@gmail.com>
> Sent: Tuesday, March 23, 2021 6:16 PM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > I am still not certain what will happen.  masterProcWALs contain info
> > for all (running) tables, yes?
> >
> masterProcWALs only contain info for running procedures, not user table
> data. User table data go on "normal" WALs, not "masterProcWALs".
>
> > If all tables are disabled and I remove the master wals, how will that
> > affect the other tables? When I disabled all tables, hundreds of
> > master WALs are now created. This means there is a bunch of pending
> > operations, yes?  Is it going to make some other things inconsistent?
>
> Table disabling involves the unassignment of all of these tables' regions.
> Each of these "unassign" operations comprises a set of sequential phases.
> These internal operations are called "procedures". Information about the
> progress of each operation through its different phases is stored in
> these masterProcWALs files. That's why triggering the "disable"
> command will create some data under masterProcWALs. If all the disable
> commands finished successfully, and all your procedures are finished (apart
> from that rogue one that has existed for a while already), you would be
> good to clean out masterProcWALs.
>
> I did try to set the table state manually to see if the faulty table would
> > fire up and I restarted hbase...state was the same a locked table
> > state due to pending disable and stuck region.
> >
> That's because of the rogue procedure. When you restarted master, it went
> through masterProcWALs and resumed the rogue procedure from the unfinished
> state it was in when you restarted hbase. If you had removed masterProcWALs
> prior to restart, the rogue procedure would now be gone.
>
> We may have the go-ahead to remove this table - I assume we cannot clone it
> > while it is in a state of (DISABLED) flux but, once again, messing
> > with master WALs has me on edge.
>
> From what I understand, you already have the tables disabled, and no
> unfinished procs apart from the rogue one, so just clean out masterProcWALs
> and restart master.
>
> Em ter., 23 de mar. de 2021 às 11:13, Marc Hoppins <ma...@eset.sk>
> escreveu:
>
> > I am still not certain what will happen.  masterProcWALs contain info
> > for all (running) tables, yes?
> >
> > If all tables are disabled and I remove the master wals, how will that
> > affect the other tables? When I disabled all tables, hundreds of
> > master WALs are now created. This means there is a bunch of pending
> > operations, yes?  Is it going to make some other things inconsistent?
> >
> > I did try to set the table state manually to see if the faulty table
> > would fire up and I restarted hbase...state was the same a locked
> > table state due to pending disable and stuck region.
> >
> > We may have the go-ahead to remove this table - I assume we cannot
> > clone it while it is in a state of (DISABLED) flux but, once again,
> > messing with master WALs has me on edge.
> >
> >
> > -----Original Message-----
> > From: Wellington Chevreuil <we...@gmail.com>
> > Sent: Tuesday, March 16, 2021 4:50 PM
> > To: Hbase-User <us...@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > >
> > > To be clear, if the other tables are stopped, I assume all pending
> > > and current operations will finish. How long will it take to write
> > > all data - if indeed the data does get permanently written - so that
> > > we can safely remove WALs?
> > >
> > If by "tables stopped" you mean your tables are disabled, then yeah,
> > all related data would already have been flushed into hfiles and
> > wouldn't be on your wals. But please be aware that what you really
> > need here to get rid of the rogue proc is to remove master proc wals,
> not normal wals.
> >
> > Em ter., 16 de mar. de 2021 às 07:12, Marc Hoppins
> > <ma...@eset.sk>
> > escreveu:
> >
> > > Overall, I am mystified as to how this could happen.  If Hadoop has
> > > a replication factor (I believe we use the default) of 3 and we have
> > > two datacenters with masters and workers in both, how can a network
> > > outage affect Hadoop operation? Surely it should have used available
> > > resources to continue operations...or have I misinterpreted entirely?
> > >
> > > -----Original Message-----
> > > From: Stack <st...@duboce.net>
> > > Sent: Tuesday, March 16, 2021 7:16 AM
> > > To: Hbase-User <us...@hbase.apache.org>
> > > Subject: Re: HBASE WALs
> > >
> > > EXTERNAL
> > >
> > > On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <ma...@eset.sk>
> > wrote:
> > >
> > > > Hi, all,
> > > >
> > > > For our stuck region, this exists in meta.  Could we alter the
> > > > state to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> > > >
> > > > You could but IIRC, in that version of HBase, you may need to
> > > > restart the
> > > Master after the change (changing hbase:meta does not update the
> > > Master's in-memory state). On restart, Master will read hbase:meta
> > > to discover Region state.
> > >
> > > S
> > >
> > >
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:regioninfo, timestamp=1613580024017,
> > > > >   value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
> > > > >   'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> > > > >   STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:seqnumDuringOpen, timestamp=1611787189839,
> > > > >   value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:server, timestamp=1611787189839,
> > > > >   value=dr1-hbase18.jumbo.hq.eset.com:16020
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:serverstartcode, timestamp=1611787189839,
> > > > >   value=1611785264032
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:sn, timestamp=1613580024017,
> > > > >   value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > > >   column=info:state, timestamp=1613580024017, value=OPENING
> > > >
> > > > -----Original Message-----
> > > > From: Wellington Chevreuil <we...@gmail.com>
> > > > Sent: Wednesday, March 10, 2021 10:56 AM
> > > > To: Hbase-User <us...@hbase.apache.org>
> > > > Subject: Re: HBASE WALs
> > > >
> > > > EXTERNAL
> > > >
> > > > >
> > > > > Sorry if I seem stupid but this is still all new to me.
> > > > >
> > > > Forgot to mention, there's no stupid questions here. Don't be shy
> > > > and keep'em coming.
> > > >
> > > > Em qua., 10 de mar. de 2021 às 09:48, Wellington Chevreuil <
> > > > wellington.chevreuil@gmail.com> escreveu:
> > > >
> > > > > However, how would that help anyway?  If we cannot fix this at
> > > > > this time
> > > > >> then any upgrade would have inconsistencies also, yes?
> > > > >>
> > > > > > The upgrade on its own wouldn't fix existing inconsistencies,
> > > > > but you would now have support for additional tooling
> > > > > (hbase-operators-tool) to help you with this.
> > > > >
> > > > > As all the 'SUCCESS' procedures have a parent ID 73587, does
> > > > > this mean
> > > > >> that they were successfully and fully moved from hbase25 to
> > > > >> each server mentioned in that procedure?  Or does it just mean
> > > > >> that the region was successfully unassigned from hbase25 but
> > > > >> the data still resides on hbase25?  I see locality 0.
> > > > >>
> > > > > IIRC, those were all UnassignProcedures, so it means the
> > > > > unassignment of the related region has completed and the region
> > > > > for that particular procedure went offline.
> > > > >
> > > > > If we change the table state in meta to 'ENABLED', could this
> > > > > kickstart
> > > > >> all these things or will it just lead to further problems?
> > > > >
> > > > > > The master works from its own in-memory cache of meta, so manually
> > > > > > updating meta will just make the master's cache inconsistent with it.
> > > > > > You would need to restart the masters to get the cache reloaded from
> > > > > > meta. The main problem is that you still have the rogue
> > > > > > procedures, which you can't get rid of without stopping the
> > > > > > cluster. One alternative to a full cluster outage would be to
> > > > > > identify all RSes running the rogue procs (you can find that
> > > > > > from the active master logs), then stop only those and the master,
> > > > > > clean masterprocwals, then start them again.
> > > > >
> > > > >
> > > > >> I suppose it means I am asking, the 73587
> > > > >> DisableTableProcedure, does it mean that the table is waiting
> > > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > > >>
> > > > > The table state may have been already updated to disabled, most
> > > > > of its regions may already be offline, but the 73587
> > > > > DisableTableProcedure cannot be considered "done" until all its
> > > > > sub procedures are indeed
> > > > completed.
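
The parent/sub-procedure relationship described above can be pictured with a toy Python model. It is not HBase code: the `Proc` class, its fields, and pid 73588 are invented for illustration; pids 73587 and 73827 and the WAIT_TIMEOUT state are the ones mentioned in this thread.

```python
from dataclasses import dataclass, field

@dataclass
class Proc:
    pid: int
    name: str
    state: str = "RUNNABLE"                 # this procedure's own state
    children: list = field(default_factory=list)

    def done(self):
        """Done only when this procedure and every sub-procedure succeeded."""
        return self.state == "SUCCESS" and all(c.done() for c in self.children)

    def blockers(self):
        """Pids of descendants that have not reached SUCCESS."""
        out = []
        for c in self.children:
            if c.state != "SUCCESS":
                out.append(c.pid)
            out.extend(c.blockers())
        return out

disable = Proc(73587, "DisableTableProcedure", "WAITING", [
    Proc(73588, "UnassignProcedure", "SUCCESS"),        # hypothetical pid
    Proc(73827, "UnassignProcedure", "WAIT_TIMEOUT"),   # the stuck region
])

print(disable.done())      # False
print(disable.blockers())  # [73827]
```

One unfinished sub-procedure is enough to keep the parent open, even when every other region has already gone offline.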
> > > > >
> > > > >
> > > > > On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <ma...@eset.sk> wrote:
> > > > >
> > > > >> Thanks for that.
> > > > >>
> > > > >> Alas, we are (currently) constrained by using Cloudera (CDH)
> > > > >> 6.3.1 and do not have a viable business use to pay the
> > > > >> extortionate amount of money required to upgrade.  Which would
> > > > >> give these cluster access to newer versions.
> > > > >>
> > > > >> However, how would that help anyway?  If we cannot fix this at
> > > > >> this time then any upgrade would have inconsistencies also, yes?
> > > > >>
> > > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does
> > > > >> this mean that they were successfully and fully moved from
> > > > >> hbase25 to each server mentioned in that procedure?  Or does it
> > > > >> just mean that the region was successfully unassigned from
> > > > >> hbase25 but the data still resides on hbase25?  I see locality 0.
> > > > >>
> > > > >> If we change the table state in meta to 'ENABLED', could this
> > > > >> kickstart all these things or will it just lead to further
> problems?
> > > > >> I suppose it means I am asking, the 73587
> > > > >> DisableTableProcedure, does it mean that the table is waiting
> > > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > > >>
> > > > >> Sorry if I seem stupid but this is still all new to me.
> > > > >>
> > > > >> I appreciate the help.
> > > > >>
> > > > >> -----Original Message-----
> > > > >> From: Wellington Chevreuil <we...@gmail.com>
> > > > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > > > >> To: Hbase-User <us...@hbase.apache.org>
> > > > >> Subject: Re: HBASE WALs
> > > > >>
> > > > >> EXTERNAL
> > > > >>
> > > > >> >
> > > > >> > All fails are waiting on the same PID (73587), a DISABLE
> > > > >> > TABLE
> > > > >> procedure.
> > > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems
> > > > >> > to be the problem.
> > > > >> >
> > > > >> Per your list procedures output attached, it seems the procs
> > > > >> states are all inconsistent. There's a WAIT_TIMEOUT subproc of
> > > > >> 73587 with PID 73827, which is the UnassignProcedure for this
> > > > >> region. Problem is that there are already 5 APs for the same
> > > > >> region, which may be causing some deadlocks. If this cluster
> > > > >> was on a hbck2 supported version, you could get rid of this
> > > > >> state using bypass command on all these proc ids, then manually
> > > > >> get the table/regions states consistent again using
> > > > >> setRegionState/setTableState/assigns/unassigns
> > > methods.
> > > > >>
> > > > >> Without tooling, the only option I can think of is to stop
> > > > >> cluster, clean out masterprocwals, restart cluster, then use
> > > > >> hbase shell to enable/disable/assign regions. You may also need
> > > > >> to manually update table/region states in meta table. Of
> > > > >> course, you can automate these manual steps into your own
> > > > >> tooling, but may be a better strategy in the long term to
> > > > >> upgrade to a more stable version that also benefits from more
> > > > >> tooling supported by
> > the community.
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <ma...@eset.sk> wrote:
> > > > >>
> > > > >> > Hi, Wellington,
> > > > >> >
> > > > >> > I was on 'vacation' (no road trip or overseas anything) for a
> > week.
> > > > >> >
> > > > >> > All fails are waiting on the same PID (73587), a DISABLE
> > > > >> > TABLE
> > > > >> procedure.
> > > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems
> > > > >> > to be the problem.
> > > > >> >
> > > > >> > I am still mystified about the HBCK2-tools. I have attached a
> > > > >> > previous thread that you commented on at the time.
> > > > >> >
> > > > >> > I did build a tools for our HBASE 2.1.0...or rather, I built
> > > > >> > it on Ubuntu
> > > > >> > 20.04 with openJDK8 (1.8.0_212), then successfully ran it on
> > > > >> > Ubuntu
> > > > >> > 16.04 with a slightly different java (Oracle Java 8, 1.8.0_181).
> > > > >> > I used it to help fix a similar problem with an offline table
> > > > >> > and
> > > RITs.
> > > > >> > Both HBASE versions are the same.
> > > > >> >
> > > > >> > I attach a 'sheet' with the current procs/locks.
> > > > >> >
> > > > >> > -----Original Message-----
> > > > >> > From: Marc Hoppins <ma...@eset.sk>
> > > > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > > > >> > To: user@hbase.apache.org
> > > > >> > Cc: Martin Oravec <ma...@eset.sk>
> > > > >> > Subject: RE: HBASE WALs
> > > > >> >
> > > > >> > EXTERNAL
> > > > >> >
> > > > >> > Thanks, Wellington,
> > > > >> >
> > > > >> > I have already build a hbck1-tools for 2.1.0 using method
> > > > >> > described in other topics. All the HBASE and JDK here is the
> > > > >> > same version so if it worked fixing one cluster HBASE then it
> > > > >> > should work for other
> > > > installs.
> > > > >> >
> > > > >> > Fiddling with masterprocWALs will require complete shutdown
> > > > >> > of hbase operations to prevent incoming reads/writes on other
> > > > >> > tables and I am not sure how disruptive that will be other
> > > > >> > than "probably a lot".
> > > > >> >
> > > > >> > -----Original Message-----
> > > > >> > From: Wellington Chevreuil <we...@gmail.com>
> > > > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > > > >> > To: Hbase-User <us...@hbase.apache.org>
> > > > >> > Subject: Re: HBASE WALs
> > > > >> >
> > > > >> > EXTERNAL
> > > > >> >
> > > > >> > Sorry, missed your previous email. I was hoping you were not
> > > > >> > on a non-stable version, so that you would benefit from hbck2
> > > > >> > tool
> > > support.
> > > > >> > Unfortunately, 2.1.0 is among the early releases that don't
> > > > >> > work with this tool (it requires at least 2.0.3, 2.1.1 or
> 2.2.0).
> > > > >> >
> > > > >> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the
> > > > >> > system seems
> > > > >> > > mostly unhappy with one region in particular, and is
> > > > >> > > reporting on
> > > > >> that.
> > > > >> > >
> > > > >> > Are the other regions for the table properly closed, and this
> > > > >> > is the only one stuck? If you do a list_procedures, are you
> > > > >> > able to identify an 'unassign' procedure still running for
> > > > >> > this table? Or if you grep master logs for this region, do
> > > > >> > you see any messages suggesting there's still ongoing
> > > > >> > attempts to bring the region offline? If there's apparently
> > > > >> > no procedure/no ongoing attempts to offline the region, you
> > > > >> > might try to manually update its state in meta table, then
> > > > >> > flip masters (assuming you have master HA), so that the new
> > > > >> > active loads an up
> > to date state from meta table.
> > > > >> >
> > > > >> > Otherwise, if there's still a rogue procedure trying to
> > > > >> > offline the region, unfortunately, due to the lack of hbck
> > > > >> > support, you would most likely need a more disruptive
> > > > >> > intervention similar to what you had described in your first
> > > > >> > email, but instead of normal wal folder, master proc wals is
> > > > >> > what you really would need to clean out here, as that is
> > > > >> > where procedures state is persisted, and you wouldn't want
> > > > >> > the rogue procedure to be
> > resumed.
> > > > >> >
> > > > >> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <ma...@eset.sk> wrote:
> > > > >> >
> > > > >> > > If you know of anything that will help I would appreciate it.
> > > > >> > >
> > > > >> > > If you need any log output let me know.
> > > > >> > >
> > > > >> > > Thanks
> > > > >> > >
> > > > >> > >
> > > > >> > > -----Original Message-----
> > > > >> > > From: Wellington Chevreuil <we...@gmail.com>
> > > > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > > > >> > > To: Hbase-User <us...@hbase.apache.org>
> > > > >> > > Subject: Re: HBASE WALs
> > > > >> > >
> > > > >> > > EXTERNAL
> > > > >> > >
> > > > >> > > >
> > > > >> > > > Do WAL files contain information for multiple regions per
> > > > >> > > > WAL or is one WAL associated with one region?
> > > > >> > > >
> > > > >> > > Multiple regions edits would be present in a single wal file.
> > > > >> > > That's why upon a RS crash and wal processing, there's a
> > > > >> > > wal split
> > > > phase.
> > > > >> > >
> > > > >> > > I am trying to find a way to clear a RIT for a disabled table.
> > > > >> > > A similar
> > > > >> > > > problem (but on a test cluster) involved me clearing
> > > > >> > > > znode info, deleting HDFS data for the table and deleting
> > > > >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > > > >> > > >
> > > > >> > > Which hbase version are you on?
> > > > >> > >
> > > > >> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <ma...@eset.sk> wrote:
> > > > >> > >
> > > > >> > > > Hi all,
> > > > >> > > >
> > > > >> > > > Do WAL files contain information for multiple regions per
> > > > >> > > > WAL or is one WAL associated with one region?
> > > > >> > > >
> > > > >> > > > I am trying to find a way to clear a RIT for a disabled
> table.
> > > > >> > > > A similar problem (but on a test cluster) involved me
> > > > >> > > > clearing znode info, deleting HDFS data for the table and
> > > > >> > > > deleting WALs/MasterProcWAL files, finally restarting
> > > > >> > > > HBASE
> > > service.
> > > > >> > > >
> > > > >> > > > Table cannot be enabled.
> > > > >> > > >
> > > > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the
> > > > >> > > > system seems mostly unhappy with one region in
> > > > >> > > > particular, and is reporting
> > > > >> > on that.
> > > > >> > > >
> > > > >> > > > There are many tables that are very active so I don't
> > > > >> > > > think it is possible to stop the entire service without a
> > > > >> > > > lot of forewarning to
> > > > >> > > users.
> > > > >> > > >
> > > > >> > > > Thanks in advance.
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
Dear HBASE gang,

Here is the current situation:

A previous attempt to fix the stuck region by manually changing the stuck table from its indeterminate state to OPEN did not work.  We got the DB guys to shut down their operations so we could let the tables stabilise.  There were about 7 compactions remaining at the time, but we disabled all the tables, fiddled with our stuck table and restarted HBASE.  I did not remove the masterProcWALs as I was hoping that this

https://community.cloudera.com/t5/Support-Questions/Hbase-table-is-stuck-in-quot-Disabling-quot-state-Neither/m-p/235112

would have resolved it but it didn't.

...and, as I previously mentioned, we now have a grand bunch of OLD WALs milling about.

We tried again today, but this time I asked the guys to shut down their operations earlier to allow more time for things to settle in HBASE.  Six hours later the tables are still compacting.  At one point only 4 region servers still had compactions running, then another phase of compactions started and all boxes were busy again.

From my observations I deduced that HBASE must be trying to achieve 100% (or close to) locality before it will be happy and quieten down.

My thinking is that if nothing is being written, then anything in the masterProcWALs must be related to the bad table and we can just wipe them and restart HBASE.

Questions I have:

Am I correct in my theory? (I am far from being a Java guy so am not sure how to follow the process there)

If another (quicker) choice was made and we stop DB operations, disable all tables, then delete the masterProcWALs WITHOUT waiting for compactions to finish, would we have a real problem with where HBASE thinks data is, or where it should be going, due to anything that was pending in the masterProcWALs for (possibly) all tables?

Is there any sane way to deal with the information in the masterProcWALs?  Or is that only a Java API thing?

Thanks for all the help/info thus far.

-----Original Message-----
From: Marc Hoppins <ma...@eset.sk> 
Sent: Friday, March 26, 2021 10:49 AM
To: user@hbase.apache.org
Subject: RE: HBASE WALs

EXTERNAL

I wonder if anyone can explain the following:

Before my attempted fix, the HBASE master kept retrying to deal with that stuck region. The attempt counter was increasing - I think at last count we were up to 3000 or something.  After my attempt and the HBASE restart, it has not tried to fix the stuck region and attempts are currently at zero.  All procs and locks still exist.

-----Original Message-----
From: Wellington Chevreuil <we...@gmail.com>
Sent: Tuesday, March 23, 2021 6:16 PM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

>
> I am still not certain what will happen.  masterProcWALs contain info 
> for all (running) tables, yes?
>
masterProcWALs only contain info for running procedures, not user table data. User table data goes to the "normal" WALs, not to "masterProcWALs".

> If all tables are disabled and I remove the master wals, how will that
> affect the other tables? When I disabled all tables, hundreds of 
> master WALs are now created. This means there is a bunch of pending 
> operations, yes?  Is it going to make some other things inconsistent?

Disabling a table involves unassigning all of that table's regions. Each of these "unassign" operations comprises a set of sequential phases. These internal operations are called "procedures". Information about each operation's progress through its phases is stored in the masterProcWALs files. That's why triggering the "disable" command will create some data under masterProcWALs. If all the disable commands finished successfully, and all your procedures are finished (apart from that rogue one, which has existed for a while already), you would be good to clean out masterProcWALs.
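One way to picture the check described above before wiping anything is the following minimal Python sketch. It assumes the list_procedures output has been transcribed by hand into simple records; the field names (pid, parent, state), the set of states treated as finished, the state given to 73587, and the sample pids other than 73587 and 73827 are all illustrative assumptions, not HBase's actual output format.

```python
from typing import Dict, List, Set

# Assumption: only these states count as "finished"; adjust this set to
# whatever your own list_procedures output actually shows.
FINISHED_STATES = {"SUCCESS"}


def procedure_tree(procs: List[dict], root_pid: int) -> Set[int]:
    """Return root_pid plus the pids of all its direct and indirect children."""
    children: Dict[int, List[int]] = {}
    for proc in procs:
        parent = proc.get("parent")
        if parent is not None:
            children.setdefault(parent, []).append(proc["pid"])
    seen: Set[int] = set()
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        stack.extend(children.get(pid, []))
    return seen


def unfinished_outside_tree(procs: List[dict], root_pid: int) -> List[int]:
    """List pids of unfinished procedures that do NOT belong to the rogue tree."""
    tree = procedure_tree(procs, root_pid)
    return sorted(
        proc["pid"]
        for proc in procs
        if proc["state"] not in FINISHED_STATES and proc["pid"] not in tree
    )


# Hypothetical sample: 73587 and 73827 come from this thread, the rest is made up.
SAMPLE = [
    {"pid": 73587, "parent": None, "state": "WAITING"},        # DisableTableProcedure (state assumed)
    {"pid": 73827, "parent": 73587, "state": "WAIT_TIMEOUT"},  # stuck UnassignProcedure
    {"pid": 73590, "parent": 73587, "state": "SUCCESS"},       # made-up finished child
    {"pid": 80001, "parent": None, "state": "RUNNABLE"},       # made-up unrelated procedure
]

if __name__ == "__main__":
    # Anything printed here is pending work that is unrelated to the rogue procedure.
    print(unfinished_outside_tree(SAMPLE, 73587))
```

An empty result would correspond to the situation where only the rogue procedure and its sub-procedures are unfinished; a non-empty result suggests other operations are still in flight and would be lost along with the rogue one.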

> I did try to set the table state manually to see if the faulty table would
> fire up and I restarted hbase...the state was the same: a locked table
> state due to the pending disable and stuck region.
>
That's because of the rogue procedure. When you restarted the master, it went through masterProcWALs and resumed the rogue procedure from the unfinished state it was in when you restarted hbase. If you had removed masterProcWALs prior to the restart, the rogue procedure would now be gone.
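The restart behaviour described above can be illustrated with a toy model. This is only a sketch in Python and not the HBase implementation: the ToyProcStore class, the JSON file standing in for masterProcWALs, and the procedure states are all hypothetical, chosen just to show why a persisted unfinished procedure comes back after a restart and why it does not once the store is wiped first.

```python
import json
import os
import tempfile


class ToyProcStore:
    """Toy stand-in for a persisted procedure store (NOT HBase's real format)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def persist(self, states: dict) -> None:
        # json.dump turns the integer pids into string keys.
        with open(self.path, "w") as handle:
            json.dump(states, handle)

    def load(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as handle:
            return json.load(handle)

    def wipe(self) -> None:
        if os.path.exists(self.path):
            os.remove(self.path)


def resumed_on_restart(store: ToyProcStore) -> list:
    """Pids a restarted master would resume: every persisted, unfinished procedure."""
    return sorted(
        int(pid) for pid, state in store.load().items() if state != "SUCCESS"
    )


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        store = ToyProcStore(os.path.join(tmp, "procs.json"))
        # 73587 and 73827 are the pids from this thread; states are assumed.
        store.persist({73587: "WAITING", 73827: "WAIT_TIMEOUT", 73590: "SUCCESS"})
        before = resumed_on_restart(store)  # restart without cleaning the store
        store.wipe()                        # clean the store, then restart
        after = resumed_on_restart(store)
    return before, after


if __name__ == "__main__":
    print(demo())
```

The sketch says nothing about the table and region states left in meta, which, as discussed elsewhere in this thread, may still need to be corrected separately after the procedures are gone.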

> We may have the go-ahead to remove this table - I assume we cannot clone it
> while it is in a state of (DISABLED) flux but, once again, messing 
> with master WALs has me on edge.

From what I understand, you already have the tables disabled, and no unfinished procs apart from the rogue one, so just clean out masterProcWALs and restart master.

On Tue, Mar 23, 2021 at 11:13, Marc Hoppins <ma...@eset.sk> wrote:

> I am still not certain what will happen.  masterProcWALs contain info 
> for all (running) tables, yes?
>
> If all tables are disabled and I remove the master wals, how will that 
> affect the other tables? When I disabled all tables, hundreds of 
> master WALs are now created. This means there is a bunch of pending 
> operations, yes?  Is it going to make some other things inconsistent?
>
> I did try to set the table state manually to see if the faulty table
> would fire up and I restarted hbase...the state was the same: a locked
> table state due to the pending disable and stuck region.
>
> We may have the go-ahead to remove this table - I assume we cannot 
> clone it while it is in a state of (DISABLED) flux but, once again, 
> messing with master WALs has me on edge.
>
>
> -----Original Message-----
> From: Wellington Chevreuil <we...@gmail.com>
> Sent: Tuesday, March 16, 2021 4:50 PM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > To be clear, if the other tables are stopped, I assume all pending 
> > and current operations will finish. How long will it take to write 
> > all data - if indeed the data does get permanently written - so that 
> > we can safely remove WALs?
> >
> If by "tables stopped" you mean your tables are disabled, then yeah, 
> all related data would already have been flushed into hfiles and 
> wouldn't be on your wals. But please be aware that what you really 
> need here to get rid of the rogue proc is to remove master proc wals, not normal wals.
>
> On Tue, Mar 16, 2021 at 07:12, Marc Hoppins <ma...@eset.sk> wrote:
>
> > Overall, I am mystified as to how this could happen.  If Hadoop has 
> > a replication factor (I believe we use the default) of 3 and we have 
> > two datacenters with masters and workers in both, how can a network 
> > outage affect Hadoop operation? Surely it should have used available 
> > resources to continue operations...or have I misinterpreted entirely?
> >
> > -----Original Message-----
> > From: Stack <st...@duboce.net>
> > Sent: Tuesday, March 16, 2021 7:16 AM
> > To: Hbase-User <us...@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <ma...@eset.sk> wrote:
> >
> > > Hi, all,
> > >
> > > For our stuck region, this exists in meta.  Could we alter the 
> > > state to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> > >
> > You could but IIRC, in that version of HBase, you may need to restart the
> > Master after the change (changing hbase:meta does not update the
> > Master's in-memory state). On restart, Master will read hbase:meta
> > to discover Region state.
> >
> > S
> >
> >
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:regioninfo, timestamp=1613580024017,
> > >   value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec,
> > >   NAME => 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> > >   STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:seqnumDuringOpen, timestamp=1611787189839,
> > >   value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:server, timestamp=1611787189839,
> > >   value=dr1-hbase18.jumbo.hq.eset.com:16020
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:serverstartcode, timestamp=1611787189839,
> > >   value=1611785264032
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:sn, timestamp=1613580024017,
> > >   value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:state, timestamp=1613580024017, value=OPENING
> > >
> > > -----Original Message-----
> > > From: Wellington Chevreuil <we...@gmail.com>
> > > Sent: Wednesday, March 10, 2021 10:56 AM
> > > To: Hbase-User <us...@hbase.apache.org>
> > > Subject: Re: HBASE WALs
> > >
> > > EXTERNAL
> > >
> > > >
> > > > Sorry if I seem stupid but this is still all new to me.
> > > >
> > > Forgot to mention, there's no stupid questions here. Don't be shy 
> > > and keep'em coming.
> > >
> > > On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil <wellington.chevreuil@gmail.com> wrote:
> > >
> > > > However, how would that help anyway?  If we cannot fix this at 
> > > > this time
> > > >> then any upgrade would have inconsistencies also, yes?
> > > >>
> > > > The upgrade on its own wouldn't fix existing inconsistencies,
> > > > but you would now have support for additional tooling
> > > > (hbase-operators-tool) to help you with this.
> > > >
> > > > As all the 'SUCCESS' procedures have a parent ID 73587, does 
> > > > this mean
> > > >> that they were successfully and fully moved from hbase25 to 
> > > >> each server mentioned in that procedure?  Or does it just mean 
> > > >> that the region was successfully unassigned from hbase25 but 
> > > >> the data still resides on hbase25?  I see locality 0.
> > > >>
> > > > IIRC, those were all UnassignProcedures, so it means the 
> > > > unassignment of the related region has completed and the region 
> > > > for that particular procedure went offline.
> > > >
> > > > If we change the table state in meta to 'ENABLED', could this 
> > > > kickstart
> > > >> all these things or will it just lead to further problems?
> > > >
> > > > Masters work with their own in-memory cache of meta, so manually
> > > > updating meta will just make the masters' cache inconsistent with it.
> > > > You would need to restart the masters to get the cache reloaded from
> > > > meta. The main problem is that you still have the rogue
> > > > procedures, which you can't get rid of without stopping the
> > > > cluster. One alternative to a full cluster outage would be to
> > > > identify all RSes running the rogue procs (you can find that
> > > > from the active master logs), then stop only those and the master,
> > > > clean masterprocwals, then start them again.
> > > >
> > > >
> > > >> I suppose it means I am asking, the 73587 
> > > >> DisableTableProcedure, does it mean that the table is waiting 
> > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > >>
> > > > The table state may have been already updated to disabled, most 
> > > > of its regions may already be offline, but the 73587 
> > > > DisableTableProcedure cannot be considered "done" until all its 
> > > > sub procedures are indeed
> > > completed.
> > > >
> > > >
> > > > On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <ma...@eset.sk> wrote:
> > > >
> > > >> Thanks for that.
> > > >>
> > > >> Alas, we are (currently) constrained by using Cloudera (CDH)
> > > >> 6.3.1 and do not have a viable business use to pay the 
> > > >> extortionate amount of money required to upgrade.  Which would 
> > > >> give these cluster access to newer versions.
> > > >>
> > > >> However, how would that help anyway?  If we cannot fix this at 
> > > >> this time then any upgrade would have inconsistencies also, yes?
> > > >>
> > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does 
> > > >> this mean that they were successfully and fully moved from
> > > >> hbase25 to each server mentioned in that procedure?  Or does it 
> > > >> just mean that the region was successfully unassigned from
> > > >> hbase25 but the data still resides on hbase25?  I see locality 0.
> > > >>
> > > >> If we change the table state in meta to 'ENABLED', could this 
> > > >> kickstart all these things or will it just lead to further problems?
> > > >> I suppose it means I am asking, the 73587 
> > > >> DisableTableProcedure, does it mean that the table is waiting 
> > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > >>
> > > >> Sorry if I seem stupid but this is still all new to me.
> > > >>
> > > >> I appreciate the help.
> > > >>
> > > >> -----Original Message-----
> > > >> From: Wellington Chevreuil <we...@gmail.com>
> > > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > > >> To: Hbase-User <us...@hbase.apache.org>
> > > >> Subject: Re: HBASE WALs
> > > >>
> > > >> EXTERNAL
> > > >>
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE 
> > > >> > TABLE
> > > >> procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > > >> > to be the problem.
> > > >> >
> > > >> Per your list procedures output attached, it seems the procs 
> > > >> states are all inconsistent. There's a WAIT_TIMEOUT subproc of
> > > >> 73587 with PID 73827, which is the UnassignProcedure for this 
> > > >> region. Problem is that there are already 5 APs for the same 
> > > >> region, which may be causing some deadlocks. If this cluster 
> > > >> was on a hbck2 supported version, you could get rid of this 
> > > >> state using bypass command on all these proc ids, then manually 
> > > >> get the table/regions states consistent again using 
> > > >> setRegionState/setTableState/assigns/unassigns
> > methods.
> > > >>
> > > >> Without tooling, the only option I can think of is to stop 
> > > >> cluster, clean out masterprocwals, restart cluster, then use 
> > > >> hbase shell to enable/disable/assign regions. You may also need 
> > > >> to manually update table/region states in meta table. Of 
> > > >> course, you can automate these manual steps into your own 
> > > >> tooling, but may be a better strategy in the long term to 
> > > >> upgrade to a more stable version that also benefits from more 
> > > >> tooling supported by
> the community.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <ma...@eset.sk> wrote:
> > > >>
> > > >> > Hi, Wellington,
> > > >> >
> > > >> > I was on 'vacation' (no road trip or overseas anything) for a
> week.
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE 
> > > >> > TABLE
> > > >> procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > > >> > to be the problem.
> > > >> >
> > > >> > I am still mystified about the HBCK2-tools. I have attached a 
> > > >> > previous thread that you commented on at the time.
> > > >> >
> > > >> > I did build a tools for our HBASE 2.1.0...or rather, I built 
> > > >> > it on Ubuntu
> > > >> > 20.04 with openJDK8 (1.8.0_212), then successfully ran it on 
> > > >> > Ubuntu
> > > >> > 16.04 with a slightly different java (Oracle Java 8, 1.8.0_181).
> > > >> > I used it to help fix a similar problem with an offline table 
> > > >> > and
> > RITs.
> > > >> > Both HBASE versions are the same.
> > > >> >
> > > >> > I attach a 'sheet' with the current procs/locks.
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Marc Hoppins <ma...@eset.sk>
> > > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > > >> > To: user@hbase.apache.org
> > > >> > Cc: Martin Oravec <ma...@eset.sk>
> > > >> > Subject: RE: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Thanks, Wellington,
> > > >> >
> > > >> > I have already build a hbck1-tools for 2.1.0 using method 
> > > >> > described in other topics. All the HBASE and JDK here is the 
> > > >> > same version so if it worked fixing one cluster HBASE then it 
> > > >> > should work for other
> > > installs.
> > > >> >
> > > >> > Fiddling with masterprocWALs will require complete shutdown
> > > >> > of hbase operations to prevent incoming reads/writes on other
> > > >> > tables and I am not sure how disruptive that will be other
> > > >> > than "probably a lot".
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Wellington Chevreuil <we...@gmail.com>
> > > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > > >> > To: Hbase-User <us...@hbase.apache.org>
> > > >> > Subject: Re: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Sorry, missed your previous email. I was hoping you were not 
> > > >> > on a non-stable version, so that you would benefit from hbck2 
> > > >> > tool
> > support.
> > > >> > Unfortunately, 2.1.0 is among the early releases that don't 
> > > >> > work with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
> > > >> >
> > > >> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> > > >> > system seems
> > > >> > > mostly unhappy with one region in particular, and is 
> > > >> > > reporting on
> > > >> that.
> > > >> > >
> > > >> > Are the other regions for the table properly closed, and this 
> > > >> > is the only one stuck? If you do a list_procedures, are you 
> > > >> > able to identify an 'unassign' procedure still running for 
> > > >> > this table? Or if you grep master logs for this region, do 
> > > >> > you see any messages suggesting there's still ongoing 
> > > >> > attempts to bring the region offline? If there's apparently 
> > > >> > no procedure/no ongoing attempts to offline the region, you 
> > > >> > might try to manually update its state in meta table, then 
> > > >> > flip masters (assuming you have master HA), so that the new 
> > > >> > active loads an up
> to date state from meta table.
> > > >> >
> > > >> > Otherwise, if there's still a rogue procedure trying to 
> > > >> > offline the region, unfortunately, due to the lack of hbck 
> > > >> > support, you would most likely need a more disruptive 
> > > >> > intervention similar to what you had described in your first 
> > > >> > email, but instead of normal wal folder, master proc wals is 
> > > >> > what you really would need to clean out here, as that is 
> > > >> > where procedures state is persisted, and you wouldn't want 
> > > >> > the rogue procedure to be
> resumed.
> > > >> >
> > > >> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <ma...@eset.sk> wrote:
> > > >> >
> > > >> > > If you know of anything that will help I would appreciate it.
> > > >> > >
> > > >> > > If you need any log output let me know.
> > > >> > >
> > > >> > > Thanks
> > > >> > >
> > > >> > >
> > > >> > > -----Original Message-----
> > > >> > > From: Wellington Chevreuil <we...@gmail.com>
> > > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > > >> > > To: Hbase-User <us...@hbase.apache.org>
> > > >> > > Subject: Re: HBASE WALs
> > > >> > >
> > > >> > > EXTERNAL
> > > >> > >
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per 
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > Multiple regions edits would be present in a single wal file.
> > > >> > > That's why upon a RS crash and wal processing, there's a 
> > > >> > > wal split
> > > phase.
> > > >> > >
> > > >> > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > A similar
> > > >> > > > problem (but on a test cluster) involved me clearing 
> > > >> > > > znode info, deleting HDFS data for the table and deleting 
> > > >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > > >> > > >
> > > >> > > Which hbase version are you on?
> > > >> > >
> > > >> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <ma...@eset.sk> wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per 
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > > A similar problem (but on a test cluster) involved me 
> > > >> > > > clearing znode info, deleting HDFS data for the table and 
> > > >> > > > deleting WALs/MasterProcWAL files, finally restarting 
> > > >> > > > HBASE
> > service.
> > > >> > > >
> > > >> > > > Table cannot be enabled.
> > > >> > > >
> > > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> > > >> > > > system seems mostly unhappy with one region in 
> > > >> > > > particular, and is reporting
> > > >> > on that.
> > > >> > > >
> > > >> > > > There are many tables that are very active so I don't 
> > > >> > > > think it is possible to stop the entire service without a 
> > > >> > > > lot of forewarning to
> > > >> > > users.
> > > >> > > >
> > > >> > > > Thanks in advance.
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
I wonder if anyone can explain the following:

Before my attempted fix, the HBASE master kept retrying to deal with that stuck region. The attempt counter was increasing - I think at last count we were up to 3000 or something.  After my attempt and the HBASE restart, it has not tried to fix the stuck region and attempts are currently at zero.  All procs and locks still exist.

-----Original Message-----
From: Wellington Chevreuil <we...@gmail.com> 
Sent: Tuesday, March 23, 2021 6:16 PM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

>
> I am still not certain what will happen.  masterProcWALs contain info 
> for all (running) tables, yes?
>
masterProcWALs only contain info for running procedures, not user table data. User table data goes to the "normal" WALs, not to masterProcWALs.

> If all tables are disabled and I remove the master wals, how will that
> affect the other tables? When I disabled all tables, hundreds of 
> master WALs are now created. This means there is a bunch of pending 
> operations, yes?  Is it going to make some other things inconsistent?

Disabling a table involves unassigning all of that table's regions. Each of these "unassign" operations comprises a set of sequential phases. These internal operations are called "procedures". Information about each operation's progress through its phases is stored in the masterProcWALs files. That's why triggering the "disable" command creates some data under masterProcWALs. If all the disable commands finished successfully, and all your procedures are finished (apart from that rogue one, which has existed for a while already), you would be good to clean out masterProcWALs.

> I did try to set the table state manually to see if the faulty table would
> fire up and I restarted hbase...state was the same a locked table 
> state due to pending disable and stuck region.
>
That's because of the rogue procedure. When you restarted the master, it went through masterProcWALs and resumed the rogue procedure from the unfinished state it was in when you restarted hbase. If you had removed masterProcWALs prior to the restart, the rogue procedure would now be gone.
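
A toy illustration of the point above, in Python. This is not HBase code: the ToyProcedureStore class, its one-file-per-procedure layout and the restart_master function are invented for this sketch, and the state strings are placeholders. Only the two procedure IDs (73587 and 73827) come from this thread. It shows why a restart alone picks the unfinished procedures up again, while cleaning out the persisted log before the restart does not.

```python
import json
import os
import tempfile


class ToyProcedureStore:
    """Hypothetical stand-in for the master procedure log (not the HBase API)."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def save(self, proc_id, name, state):
        # Persist one unfinished procedure as a small JSON file.
        path = os.path.join(self.directory, f"proc-{proc_id}.json")
        with open(path, "w") as handle:
            json.dump({"id": proc_id, "name": name, "state": state}, handle)

    def load_unfinished(self):
        # In this toy model, anything still on disk counts as unfinished.
        procs = []
        for file_name in sorted(os.listdir(self.directory)):
            with open(os.path.join(self.directory, file_name)) as handle:
                procs.append(json.load(handle))
        return procs

    def clean_out(self):
        # Stand-in for removing the log files while the master is stopped.
        for file_name in os.listdir(self.directory):
            os.remove(os.path.join(self.directory, file_name))


def restart_master(store):
    """Return the IDs a restarted toy master would pick up and resume."""
    return [proc["id"] for proc in store.load_unfinished()]


def demo():
    with tempfile.TemporaryDirectory() as root:
        store = ToyProcedureStore(os.path.join(root, "MasterProcWALs"))
        # State strings below are illustrative placeholders.
        store.save(73587, "DisableTableProcedure", "WAITING")
        store.save(73827, "UnassignProcedure", "WAITING_TIMEOUT")
        resumed_after_plain_restart = restart_master(store)
        store.clean_out()
        resumed_after_clean_restart = restart_master(store)
    return resumed_after_plain_restart, resumed_after_clean_restart


print(demo())
```

The first list holds both procedure IDs (a plain restart resumes them); the second is empty (nothing is left to resume once the log has been cleaned out).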

> We may have the go-ahead to remove this table - I assume we cannot clone it
> while it is in a state of (DISABLED) flux but, once again, messing 
> with master WALs has me on edge.

From what I understand, you already have the tables disabled, and no unfinished procs apart from the rogue one, so just clean out masterProcWALs and restart master.
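
A minimal dry-run sketch of the "clean out masterProcWALs" step, in Python. It only prints the commands and runs nothing. Assumptions: the directory path is the one shown in the Procedure WAL listing elsewhere in this thread; moving the directory aside with `hdfs dfs -mv` instead of deleting it is a choice made for this sketch, not something prescribed in the thread; the backup suffix is made up. The commands would only make sense with the master already stopped, and how the master is stopped and started (for example through a cluster manager) is deliberately left out.

```python
import shlex

# Path as it appears in the Procedure WAL listing in this thread.
PROC_WAL_DIR = "hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs"


def move_aside_commands(proc_wal_dir, suffix):
    """Build (but do not run) HDFS commands that move the directory aside."""
    backup_dir = f"{proc_wal_dir}.{suffix}"  # hypothetical backup name
    return [
        # Look at what is there first.
        ["hdfs", "dfs", "-ls", proc_wal_dir],
        # Rename the whole directory so it can be restored if needed.
        ["hdfs", "dfs", "-mv", proc_wal_dir, backup_dir],
    ]


for command in move_aside_commands(PROC_WAL_DIR, "bak-20210324"):
    print(shlex.join(command))
```

Keeping the old directory under a different name leaves a way back if the restart does not go as expected.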

Em ter., 23 de mar. de 2021 às 11:13, Marc Hoppins <ma...@eset.sk>
escreveu:

> I am still not certain what will happen.  masterProcWALs contain info 
> for all (running) tables, yes?
>
> If all tables are disabled and I remove the master wals, how will that 
> affect the other tables? When I disabled all tables, hundreds of 
> master WALs are now created. This means there is a bunch of pending 
> operations, yes?  Is it going to make some other things inconsistent?
>
> I did try to set the table state manually to see if the faulty table 
> would fire up and I restarted hbase...state was the same a locked 
> table state due to pending disable and stuck region.
>
> We may have the go-ahead to remove this table - I assume we cannot 
> clone it while it is in a state of (DISABLED) flux but, once again, 
> messing with master WALs has me on edge.
>
>
> -----Original Message-----
> From: Wellington Chevreuil <we...@gmail.com>
> Sent: Tuesday, March 16, 2021 4:50 PM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > To be clear, if the other tables are stopped, I assume all pending 
> > and current operations will finish. How long will it take to write 
> > all data - if indeed the data does get permanently written - so that 
> > we can safely remove WALs?
> >
> If by "tables stopped" you mean your tables are disabled, then yeah, 
> all related data would already have been flushed into hfiles and 
> wouldn't be on your wals. But please be aware that what you really 
> need here to get rid of the rogue proc is to remove master proc wals, not normal wals.
>
> Em ter., 16 de mar. de 2021 às 07:12, Marc Hoppins 
> <ma...@eset.sk>
> escreveu:
>
> > Overall, I am mystified as to how this could happen.  If Hadoop has 
> > a replication factor (I believe we use the default) of 3 and we have 
> > two datacenters with masters and workers in both, how can a network 
> > outage affect Hadoop operation? Surely it should have used available 
> > resources to continue operations...or have I misinterpreted entirely?
> >
> > -----Original Message-----
> > From: Stack <st...@duboce.net>
> > Sent: Tuesday, March 16, 2021 7:16 AM
> > To: Hbase-User <us...@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <ma...@eset.sk> wrote:
> >
> > > Hi, all,
> > >
> > > For our stuck region, this exists in meta.  Could we alter the 
> > > state to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> > >
> > You could, but IIRC, in that version of HBase, you may need to
> > restart the Master after the change (changing hbase:meta does not
> > update the Master's in-memory state). On restart, the Master will
> > read hbase:meta to discover Region state.
> >
> > S
> >
> >
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:regioninfo, timestamp=1613580024017, value={ENCODED =>
> > > f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
> > > 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> > > STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:seqnumDuringOpen, timestamp=1611787189839,
> > > value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:server, timestamp=1611787189839,
> > > value=dr1-hbase18.jumbo.hq.eset.com:16020
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:serverstartcode, timestamp=1611787189839,
> > > value=1611785264032
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:sn, timestamp=1613580024017,
> > > value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:state, timestamp=1613580024017, value=OPENING
> > >
> > > -----Original Message-----
> > > From: Wellington Chevreuil <we...@gmail.com>
> > > Sent: Wednesday, March 10, 2021 10:56 AM
> > > To: Hbase-User <us...@hbase.apache.org>
> > > Subject: Re: HBASE WALs
> > >
> > > EXTERNAL
> > >
> > > >
> > > > Sorry if I seem stupid but this is still all new to me.
> > > >
> > > Forgot to mention, there's no stupid questions here. Don't be shy 
> > > and keep'em coming.
> > >
> > > Em qua., 10 de mar. de 2021 às 09:48, Wellington Chevreuil < 
> > > wellington.chevreuil@gmail.com> escreveu:
> > >
> > > > However, how would that help anyway?  If we cannot fix this at 
> > > > this time
> > > >> then any upgrade would have inconsistencies also, yes?
> > > >>
> > > > The upgrade on its own wouldn't fix existing inconsistencies,
> > > > but you would now have support for additional tooling
> > > > (hbase-operators-tool) to help you with this.
> > > >
> > > > As all the 'SUCCESS' procedures have a parent ID 73587, does 
> > > > this mean
> > > >> that they were successfully and fully moved from hbase25 to 
> > > >> each server mentioned in that procedure?  Or does it just mean 
> > > >> that the region was successfully unassigned from hbase25 but 
> > > >> the data still resides on hbase25?  I see locality 0.
> > > >>
> > > > IIRC, those were all UnassignProcedures, so it means the 
> > > > unassignment of the related region has completed and the region 
> > > > for that particular procedure went offline.
> > > >
> > > > If we change the table state in meta to 'ENABLED', could this 
> > > > kickstart
> > > >> all these things or will it just lead to further problems?
> > > >
> > > > The master works with its own in-memory cache of meta, so manually
> > > > updating meta will just make the master's cache inconsistent with it.
> > > > You would need to restart the masters to get the cache reloaded from
> > > > meta. The main problem is that you still have the rogue
> > > > procedures, which you can't get rid of without stopping the
> > > > cluster. One alternative to a full cluster outage would be to
> > > > identify all RSes running the rogue procs (you can find that
> > > > from the active master logs), then stop only those and the master,
> > > > clean masterprocwals, then start them again.
> > > >
> > > >
> > > >> I suppose it means I am asking, the 73587 
> > > >> DisableTableProcedure, does it mean that the table is waiting 
> > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > >>
> > > > The table state may have been already updated to disabled, most 
> > > > of its regions may already be offline, but the 73587 
> > > > DisableTableProcedure cannot be considered "done" until all its 
> > > > sub procedures are indeed completed.
> > > >
> > > >
> > > > Em ter., 9 de mar. de 2021 às 13:40, Marc Hoppins 
> > > > <ma...@eset.sk>
> > > > escreveu:
> > > >
> > > >> Thanks for that.
> > > >>
> > > >> Alas, we are (currently) constrained by using Cloudera (CDH)
> > > >> 6.3.1 and do not have a viable business use to pay the 
> > > >> extortionate amount of money required to upgrade.  Which would 
> > > >> give these cluster access to newer versions.
> > > >>
> > > >> However, how would that help anyway?  If we cannot fix this at 
> > > >> this time then any upgrade would have inconsistencies also, yes?
> > > >>
> > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does 
> > > >> this mean that they were successfully and fully moved from 
> > > >> hbase25 to each server mentioned in that procedure?  Or does it 
> > > >> just mean that the region was successfully unassigned from 
> > > >> hbase25 but the data still resides on hbase25?  I see locality 0.
> > > >>
> > > >> If we change the table state in meta to 'ENABLED', could this 
> > > >> kickstart all these things or will it just lead to further problems?
> > > >> I suppose it means I am asking, the 73587 
> > > >> DisableTableProcedure, does it mean that the table is waiting 
> > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > >>
> > > >> Sorry if I seem stupid but this is still all new to me.
> > > >>
> > > >> I appreciate the help.
> > > >>
> > > >> -----Original Message-----
> > > >> From: Wellington Chevreuil <we...@gmail.com>
> > > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > > >> To: Hbase-User <us...@hbase.apache.org>
> > > >> Subject: Re: HBASE WALs
> > > >>
> > > >> EXTERNAL
> > > >>
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE 
> > > >> > TABLE
> > > >> procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > > >> > to be the problem.
> > > >> >
> > > >> Per your list procedures output attached, it seems the procs 
> > > >> states are all inconsistent. There's a WAIT_TIMEOUT subproc of
> > > >> 73587 with PID 73827, which is the UnassignProcedure for this 
> > > >> region. Problem is that there are already 5 APs for the same 
> > > >> region, which may be causing some deadlocks. If this cluster 
> > > >> was on a hbck2 supported version, you could get rid of this 
> > > >> state using bypass command on all these proc ids, then manually 
> > > >> get the table/regions states consistent again using
> > > >> setRegionState/setTableState/assigns/unassigns methods.
> > > >>
> > > >> Without tooling, the only option I can think of is to stop 
> > > >> cluster, clean out masterprocwals, restart cluster, then use 
> > > >> hbase shell to enable/disable/assign regions. You may also need 
> > > >> to manually update table/region states in meta table. Of 
> > > >> course, you can automate these manual steps into your own 
> > > >> tooling, but it may be a better strategy in the long term to
> > > >> upgrade to a more stable version that also benefits from more
> > > >> tooling supported by the community.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Em seg., 8 de mar. de 2021 às 07:50, Marc Hoppins 
> > > >> <ma...@eset.sk>
> > > >> escreveu:
> > > >>
> > > >> > Hi, Wellington,
> > > >> >
> > > >> > I was on 'vacation' (no road trip or overseas anything) for a
> week.
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE 
> > > >> > TABLE
> > > >> procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > > >> > to be the problem.
> > > >> >
> > > >> > I am still mystified about the HBCK2-tools. I have attached a 
> > > >> > previous thread that you commented on at the time.
> > > >> >
> > > >> > I did build a tools for our HBASE 2.1.0...or rather, I built 
> > > >> > it on Ubuntu
> > > >> > 20.04 with openJDK8 (1.8.0_212), then successfully ran it on 
> > > >> > Ubuntu
> > > >> > 16.04 with a slightly different java (Oracle Java 8, 1.8.0_181).
> > > >> > I used it to help fix a similar problem with an offline table 
> > > >> > and
> > RITs.
> > > >> > Both HBASE versions are the same.
> > > >> >
> > > >> > I attach a 'sheet' with the current procs/locks.
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Marc Hoppins <ma...@eset.sk>
> > > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > > >> > To: user@hbase.apache.org
> > > >> > Cc: Martin Oravec <ma...@eset.sk>
> > > >> > Subject: RE: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Thanks, Wellington,
> > > >> >
> > > >> > I have already built a hbck1-tools for 2.1.0 using the method
> > > >> > described in other topics. All the HBASE and JDK here is the
> > > >> > same version so if it worked fixing one cluster HBASE then it
> > > >> > should work for other installs.
> > > >> >
> > > >> > Fiddling with masterprocWALs will require complete shutdown
> > > >> > of hbase operations to prevent incoming reads/writes on other
> > > >> > tables and I am not sure how disruptive that will be other
> > > >> > than "probably a lot".
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Wellington Chevreuil <we...@gmail.com>
> > > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > > >> > To: Hbase-User <us...@hbase.apache.org>
> > > >> > Subject: Re: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Sorry, missed your previous email. I was hoping you were not 
> > > >> > on a non-stable version, so that you would benefit from hbck2 
> > > >> > tool
> > support.
> > > >> > Unfortunately, 2.1.0 is among the early releases that don't 
> > > >> > work with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
> > > >> >
> > > >> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> > > >> > system seems
> > > >> > > mostly unhappy with one region in particular, and is 
> > > >> > > reporting on
> > > >> that.
> > > >> > >
> > > >> > Are the other regions for the table properly closed, and this 
> > > >> > is the only one stuck? If you do a list_procedures, are you 
> > > >> > able to identify an 'unassign' procedure still running for 
> > > >> > this table? Or if you grep master logs for this region, do 
> > > >> > you see any messages suggesting there's still ongoing 
> > > >> > attempts to bring the region offline? If there's apparently 
> > > >> > no procedure/no ongoing attempts to offline the region, you 
> > > >> > might try to manually update its state in meta table, then 
> > > >> > flip masters (assuming you have master HA), so that the new
> > > >> > active loads an up-to-date state from meta table.
> > > >> >
> > > >> > Otherwise, if there's still a rogue procedure trying to 
> > > >> > offline the region, unfortunately, due to the lack of hbck 
> > > >> > support, you would most likely need a more disruptive 
> > > >> > intervention similar to what you had described in your first 
> > > >> > email, but instead of normal wal folder, master proc wals is 
> > > >> > what you really would need to clean out here, as that is 
> > > >> > where procedures state is persisted, and you wouldn't want 
> > > >> > the rogue procedure to be
> resumed.
> > > >> >
> > > >> > Em seg., 1 de mar. de 2021 às 10:22, Marc Hoppins 
> > > >> > <ma...@eset.sk>
> > > >> > escreveu:
> > > >> >
> > > >> > > If you know of anything that will help I would appreciate it.
> > > >> > >
> > > >> > > If you need any log output let me know.
> > > >> > >
> > > >> > > Thanks
> > > >> > >
> > > >> > >
> > > >> > > -----Original Message-----
> > > >> > > From: Wellington Chevreuil <we...@gmail.com>
> > > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > > >> > > To: Hbase-User <us...@hbase.apache.org>
> > > >> > > Subject: Re: HBASE WALs
> > > >> > >
> > > >> > > EXTERNAL
> > > >> > >
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per 
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > Multiple regions edits would be present in a single wal file.
> > > >> > > That's why upon a RS crash and wal processing, there's a 
> > > >> > > wal split
> > > phase.
> > > >> > >
> > > >> > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > A similar
> > > >> > > > problem (but on a test cluster) involved me clearing 
> > > >> > > > znode info, deleting HDFS data for the table and deleting 
> > > >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > > >> > > >
> > > >> > > Which hbase version are you on?
> > > >> > >
> > > >> > > Em qui., 25 de fev. de 2021 às 11:51, Marc Hoppins 
> > > >> > > <ma...@eset.sk>
> > > >> > > escreveu:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per 
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > > A similar problem (but on a test cluster) involved me 
> > > >> > > > clearing znode info, deleting HDFS data for the table and 
> > > >> > > > deleting WALs/MasterProcWAL files, finally restarting 
> > > >> > > > HBASE
> > service.
> > > >> > > >
> > > >> > > > Table cannot be enabled.
> > > >> > > >
> > > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> > > >> > > > system seems mostly unhappy with one region in 
> > > >> > > > particular, and is reporting
> > > >> > on that.
> > > >> > > >
> > > >> > > > There are many tables that are very active so I don't 
> > > >> > > > think it is possible to stop the entire service without a 
> > > >> > > > lot of forewarning to
> > > >> > > users.
> > > >> > > >
> > > >> > > > Thanks in advance.
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
😊 You are a helpful elf. It takes a while for everything to slow down and stop when waiting to shut down HBase after read/write operations have been stopped. The DB folk were champing at the bit to get started importing again.

So, to be clear (I must sound like an idiot)...

Disabling a table: does that perform any region operations before shutting down (compacting/merging) or do these get written to master WALs to be continued when the table is enabled?

If no operations are being carried out and all tables are disabled, all the remaining masterProcWALs will be for the procedures which lurk in the proc/lock list for this table (hds2_md5). In the sheet I sent, these include:

ENABLE table
DISABLE table
RUNNABLE assigns
SUCCESS unassigns
WAITING_TIMEOUT (our stuck region)


Since restarting HBase (without actually clearing anything), there is now a long list of procedure WALs, with dates going back to Feb 17, when the error first occurred; the most recent entries are below. I guess I am going to have to fix this once and for all before we end up with a system full of old WALs.

Procedure WAL state

LogID 	Size 	Timestamp 	Path
11288 	29.4 KB 	Wed Mar 24 09:23:28 CET 2021 	hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011288.log
11287 	166.2 KB 	Wed Mar 24 08:23:28 CET 2021 	hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011287.log
11286 	183.6 KB 	Wed Mar 24 07:23:28 CET 2021 	hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011286.log
11285 	101.9 KB 	Wed Mar 24 06:23:28 CET 2021 	hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011285.log
11284 	88.1 KB 	Wed Mar 24 05:23:28 CET 2021 	hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011284.log
11283 	101.7 KB 	Wed Mar 24 04:23:28 CET 2021 	hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011283.log
11282 	87.9 KB 	Wed Mar 24 03:23:28 CET 2021 	hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011282.log
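
For keeping an eye on how much is piling up, here is a small Python sketch that parses rows in the format of the listing above and totals them. The function names are made up for this sketch, the rows are only the seven shown here (not the full list going back to Feb 17), and treating "KB" as 1024 bytes is an assumption.

```python
# Rows copied from the listing above: LogID, Size, Timestamp, Path (tab-separated).
LISTING = """\
11288\t29.4 KB\tWed Mar 24 09:23:28 CET 2021\thdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011288.log
11287\t166.2 KB\tWed Mar 24 08:23:28 CET 2021\thdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011287.log
11286\t183.6 KB\tWed Mar 24 07:23:28 CET 2021\thdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011286.log
11285\t101.9 KB\tWed Mar 24 06:23:28 CET 2021\thdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011285.log
11284\t88.1 KB\tWed Mar 24 05:23:28 CET 2021\thdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011284.log
11283\t101.7 KB\tWed Mar 24 04:23:28 CET 2021\thdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011283.log
11282\t87.9 KB\tWed Mar 24 03:23:28 CET 2021\thdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011282.log
"""

# Assumption: binary units, as is common for HDFS size displays.
UNIT_BYTES = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_listing(text):
    """Turn the tab-separated listing into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        log_id, size, timestamp, path = [field.strip() for field in line.split("\t")]
        amount, unit = size.split()
        rows.append({
            "log_id": int(log_id),
            "bytes": float(amount) * UNIT_BYTES[unit],
            "timestamp": timestamp,
            "path": path,
        })
    return rows


def summarize(rows):
    """Count the files and total their size in KB."""
    total_kb = sum(row["bytes"] for row in rows) / 1024
    return {
        "files": len(rows),
        "total_kb": round(total_kb, 1),
        "oldest_log_id": min(row["log_id"] for row in rows),
        "newest_log_id": max(row["log_id"] for row in rows),
    }


print(summarize(parse_listing(LISTING)))
```

For the seven rows shown this comes to 758.8 KB, with one new log per hour.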

It is a sticky situation that we are not in a position to upgrade Cloudera (and thus Hadoop services/software) to a newer version.

-----Original Message-----
From: Wellington Chevreuil <we...@gmail.com> 
Sent: Tuesday, March 23, 2021 6:16 PM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

>
> I am still not certain what will happen.  masterProcWALs contain info 
> for all (running) tables, yes?
>
masterProcWALs only contain info for running procedures, not user table data. User table data goes to the "normal" WALs, not to masterProcWALs.

> If all tables are disabled and I remove the master wals, how will that
> affect the other tables? When I disabled all tables, hundreds of 
> master WALs are now created. This means there is a bunch of pending 
> operations, yes?  Is it going to make some other things inconsistent?

Disabling a table involves unassigning all of that table's regions. Each of these "unassign" operations comprises a set of sequential phases. These internal operations are called "procedures". Information about each operation's progress through its phases is stored in the masterProcWALs files. That's why triggering the "disable" command creates some data under masterProcWALs. If all the disable commands finished successfully, and all your procedures are finished (apart from that rogue one, which has existed for a while already), you would be good to clean out masterProcWALs.

> I did try to set the table state manually to see if the faulty table would
> fire up and I restarted hbase...state was the same a locked table 
> state due to pending disable and stuck region.
>
That's because of the rogue procedure. When you restarted the master, it went through masterProcWALs and resumed the rogue procedure from the unfinished state it was in when you restarted hbase. If you had removed masterProcWALs prior to the restart, the rogue procedure would now be gone.

> We may have the go-ahead to remove this table - I assume we cannot clone it
> while it is in a state of (DISABLED) flux but, once again, messing 
> with master WALs has me on edge.

From what I understand, you already have the tables disabled, and no unfinished procs apart from the rogue one, so just clean out masterProcWALs and restart master.

Em ter., 23 de mar. de 2021 às 11:13, Marc Hoppins <ma...@eset.sk>
escreveu:

> I am still not certain what will happen.  masterProcWALs contain info 
> for all (running) tables, yes?
>
> If all tables are disabled and I remove the master wals, how will that 
> affect the other tables? When I disabled all tables, hundreds of 
> master WALs are now created. This means there is a bunch of pending 
> operations, yes?  Is it going to make some other things inconsistent?
>
> I did try to set the table state manually to see if the faulty table 
> would fire up and I restarted hbase...state was the same a locked 
> table state due to pending disable and stuck region.
>
> We may have the go-ahead to remove this table - I assume we cannot 
> clone it while it is in a state of (DISABLED) flux but, once again, 
> messing with master WALs has me on edge.
>
>
> -----Original Message-----
> From: Wellington Chevreuil <we...@gmail.com>
> Sent: Tuesday, March 16, 2021 4:50 PM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > To be clear, if the other tables are stopped, I assume all pending 
> > and current operations will finish. How long will it take to write 
> > all data - if indeed the data does get permanently written - so that 
> > we can safely remove WALs?
> >
> If by "tables stopped" you mean your tables are disabled, then yeah, 
> all related data would already have been flushed into hfiles and 
> wouldn't be on your wals. But please be aware that what you really 
> need here to get rid of the rogue proc is to remove master proc wals, not normal wals.
>
> Em ter., 16 de mar. de 2021 às 07:12, Marc Hoppins 
> <ma...@eset.sk>
> escreveu:
>
> > Overall, I am mystified as to how this could happen.  If Hadoop has 
> > a replication factor (I believe we use the default) of 3 and we have 
> > two datacenters with masters and workers in both, how can a network 
> > outage affect Hadoop operation? Surely it should have used available 
> > resources to continue operations...or have I misinterpreted entirely?
> >
> > -----Original Message-----
> > From: Stack <st...@duboce.net>
> > Sent: Tuesday, March 16, 2021 7:16 AM
> > To: Hbase-User <us...@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <ma...@eset.sk>
> wrote:
> >
> > > Hi, all,
> > >
> > > For our stuck region, this exists in meta.  Could we alter the 
> > > state to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> > >
> > > You could but IIRC, in that version of HBase, you may need to 
> > > restart the
> > Master after the change (changing hbase:meta does not update the 
> > Master's in-memory state). On restart, Master will read hbase:meta 
> > to discover Region state.
> >
> > S
> >
> >
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:regioninfo, timestamp=1613580024017, value={ENCODED => 
> > > f25fe93e24b34cb2f7fffddee1d89eec, NAME => 
> > > 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.'
> > > , STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'} 
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:seqnumDuringOpen, timestamp=1611787189839, 
> > > value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > >  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:server, timestamp=1611787189839, value=
> > > dr1-hbase18.jumbo.hq.eset.com:16020
> > >  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:serverstartcode, timestamp=1611787189839,
> > > value=1611785264032
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:sn, timestamp=1613580024017, value=
> > > ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > >  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:state, timestamp=1613580024017, value=OPENING
> > >
> > > -----Original Message-----
> > > From: Wellington Chevreuil <we...@gmail.com>
> > > Sent: Wednesday, March 10, 2021 10:56 AM
> > > To: Hbase-User <us...@hbase.apache.org>
> > > Subject: Re: HBASE WALs
> > >
> > > EXTERNAL
> > >
> > > >
> > > > Sorry if I seem stupid but this is still all new to me.
> > > >
> > > Forgot to mention, there's no stupid questions here. Don't be shy 
> > > and keep'em coming.
> > >
> > > Em qua., 10 de mar. de 2021 às 09:48, Wellington Chevreuil < 
> > > wellington.chevreuil@gmail.com> escreveu:
> > >
> > > > However, how would that help anyway?  If we cannot fix this at 
> > > > this time
> > > >> then any upgrade would have inconsistencies also, yes?
> > > >>
> > > > The upgrade on it's own wouldn't fix existing inconsistencies, 
> > > > but you would now have support for additional tooling
> > > > (hbase-operators-tool) to help you with this.
> > > >
> > > > As all the 'SUCCESS' procedures have a parent ID 73587, does 
> > > > this mean
> > > >> that they were successfully and fully moved from hbase25 to 
> > > >> each server mentioned in that procedure?  Or does it just mean 
> > > >> that the region was successfully unassigned from hbase25 but 
> > > >> the data still resides on hbase25?  I see locality 0.
> > > >>
> > > > IIRC, those were all UnassignProcedures, so it means the 
> > > > unassignment of the related region has completed and the region 
> > > > for that particular procedure went offline.
> > > >
> > > > If we change the table state in meta to 'ENABLED', could this 
> > > > kickstart
> > > >> all these things or will it just lead to further problems?
> > > >
> > > > Masters work with its own memory cache of meta, so manually 
> > > > updating it will just make masters cache inconsistent with meta.
> > > > You would need to restart masters to get its cache reloaded from 
> > > > master. The main problem is that you still have the rogue 
> > > > procedures, which you can't get rid of without stopping the 
> > > > cluster. One alternative to a full cluster outage would be to 
> > > > identify all RSes running the rogue procs (you can find that 
> > > > from active master logs), then stop only those and master, clean
> masterprocwals, then start it again.
> > > >
> > > >
> > > >> I suppose it means I am asking, the 73587 
> > > >> DisableTableProcedure, does it mean that the table is waiting 
> > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > >>
> > > > The table state may have been already updated to disabled, most 
> > > > of its regions may already be offline, but the 73587 
> > > > DisableTableProcedure cannot be considered "done" until all its 
> > > > sub procedures are indeed
> > > completed.
> > > >
> > > >
> > > > Em ter., 9 de mar. de 2021 às 13:40, Marc Hoppins 
> > > > <ma...@eset.sk>
> > > > escreveu:
> > > >
> > > >> Thanks for that.
> > > >>
> > > >> Alas, we are (currently) constrained by using Cloudera (CDH)
> > > >> 6.3.1 and do not have a viable business use to pay the 
> > > >> extortionate amount of money required to upgrade.  Which would 
> > > >> give these cluster access to newer versions.
> > > >>
> > > >> However, how would that help anyway?  If we cannot fix this at 
> > > >> this time then any upgrade would have inconsistencies also, yes?
> > > >>
> > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does 
> > > >> this mean that they were successfully and fully moved from 
> > > >> hbase25 to each server mentioned in that procedure?  Or does it 
> > > >> just mean that the region was successfully unassigned from 
> > > >> hbase25 but the data still resides on hbase25?  I see locality 0.
> > > >>
> > > >> If we change the table state in meta to 'ENABLED', could this 
> > > >> kickstart all these things or will it just lead to further problems?
> > > >> I suppose it means I am asking, the 73587 
> > > >> DisableTableProcedure, does it mean that the table is waiting 
> > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > >>
> > > >> Sorry if I seem stupid but this is still all new to me.
> > > >>
> > > >> I appreciate the help.
> > > >>
> > > >> -----Original Message-----
> > > >> From: Wellington Chevreuil <we...@gmail.com>
> > > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > > >> To: Hbase-User <us...@hbase.apache.org>
> > > >> Subject: Re: HBASE WALs
> > > >>
> > > >> EXTERNAL
> > > >>
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE 
> > > >> > TABLE
> > > >> procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > > >> > to be the problem.
> > > >> >
> > > > >> Per your attached list procedures output, it seems the procs'
> > > > >> states are all inconsistent. There's a WAIT_TIMEOUT subproc of
> > > > >> 73587 with PID 73827, which is the UnassignProcedure for this
> > > > >> region. The problem is that there are already 5 APs for the same
> > > > >> region, which may be causing some deadlocks. If this cluster was
> > > > >> on an hbck2-supported version, you could get rid of this state
> > > > >> using the bypass command on all these proc ids, then manually get
> > > > >> the table/region states consistent again using the
> > > > >> setRegionState/setTableState/assigns/unassigns methods.
> > > > >>
> > > > >> Without tooling, the only option I can think of is to stop the
> > > > >> cluster, clean out masterprocwals, restart the cluster, then use
> > > > >> hbase shell to enable/disable/assign regions. You may also need
> > > > >> to manually update table/region states in the meta table. Of
> > > > >> course, you can automate these manual steps into your own tooling,
> > > > >> but it may be a better strategy in the long term to upgrade to a
> > > > >> more stable version that also benefits from more tooling supported
> > > > >> by the community.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > > >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <ma...@eset.sk> wrote:
> > > >>
> > > >> > Hi, Wellington,
> > > >> >
> > > >> > I was on 'vacation' (no road trip or overseas anything) for a
> week.
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE 
> > > >> > TABLE
> > > >> procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > > >> > to be the problem.
> > > >> >
> > > >> > I am still mystified about the HBCK2-tools. I have attached a 
> > > >> > previous thread that you commented on at the time.
> > > >> >
> > > >> > I did build a tools for our HBASE 2.1.0...or rather, I built 
> > > >> > it on Ubuntu
> > > >> > 20.04 with openJDK8 (1.8.0_212), then successfully ran it on 
> > > >> > Ubuntu
> > > >> > 16.04 with a slightly different java (Oracle Java 8, 1.8.0_181).
> > > >> > I used it to help fix a similar problem with an offline table 
> > > >> > and
> > RITs.
> > > >> > Both HBASE versions are the same.
> > > >> >
> > > >> > I attach a 'sheet' with the current procs/locks.
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Marc Hoppins <ma...@eset.sk>
> > > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > > >> > To: user@hbase.apache.org
> > > >> > Cc: Martin Oravec <ma...@eset.sk>
> > > >> > Subject: RE: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Thanks, Wellington,
> > > >> >
> > > > >> > I have already built a hbck1-tools for 2.1.0 using the method
> > > >> > described in other topics. All the HBASE and JDK here is the 
> > > >> > same version so if it worked fixing one cluster HBASE then it 
> > > >> > should work for other
> > > installs.
> > > >> >
> > > >> > Fiddling with masterprocWALs will require complete shutdown 
> > > > >> > of hbase operations to prevent incoming reads/writes on other
> > > >> > tables and I am not sure how disruptive that will be other 
> > > >> > than "probably a
> > > lot".
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Wellington Chevreuil <we...@gmail.com>
> > > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > > >> > To: Hbase-User <us...@hbase.apache.org>
> > > >> > Subject: Re: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Sorry, missed your previous email. I was hoping you were not 
> > > >> > on a non-stable version, so that you would benefit from hbck2 
> > > >> > tool
> > support.
> > > >> > Unfortunately, 2.1.0 is among the early releases that don't 
> > > >> > work with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
> > > >> >
> > > >> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> > > >> > system seems
> > > >> > > mostly unhappy with one region in particular, and is 
> > > >> > > reporting on
> > > >> that.
> > > >> > >
> > > >> > Are the other regions for the table properly closed, and this 
> > > >> > is the only one stuck? If you do a list_procedures, are you 
> > > >> > able to identify an 'unassign' procedure still running for 
> > > >> > this table? Or if you grep master logs for this region, do 
> > > >> > you see any messages suggesting there's still ongoing 
> > > >> > attempts to bring the region offline? If there's apparently 
> > > >> > no procedure/no ongoing attempts to offline the region, you 
> > > >> > might try to manually update its state in meta table, then 
> > > >> > flip masters (assuming you have master HA), so that the new 
> > > >> > active loads an up
> to date state from meta table.
> > > >> >
> > > >> > Otherwise, if there's still a rogue procedure trying to 
> > > >> > offline the region, unfortunately, due to the lack of hbck 
> > > >> > support, you would most likely need a more disruptive 
> > > >> > intervention similar to what you had described in your first 
> > > >> > email, but instead of normal wal folder, master proc wals is 
> > > >> > what you really would need to clean out here, as that is 
> > > >> > where procedures state is persisted, and you wouldn't want 
> > > >> > the rogue procedure to be
> resumed.
> > > >> >
> > > > >> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <ma...@eset.sk> wrote:
> > > >> >
> > > >> > > If you know of anything that will help I would appreciate it.
> > > >> > >
> > > >> > > If you need any log output let me know.
> > > >> > >
> > > >> > > Thanks
> > > >> > >
> > > >> > >
> > > >> > > -----Original Message-----
> > > >> > > From: Wellington Chevreuil <we...@gmail.com>
> > > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > > >> > > To: Hbase-User <us...@hbase.apache.org>
> > > >> > > Subject: Re: HBASE WALs
> > > >> > >
> > > >> > > EXTERNAL
> > > >> > >
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per 
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > Multiple regions edits would be present in a single wal file.
> > > >> > > That's why upon a RS crash and wal processing, there's a 
> > > >> > > wal split
> > > phase.
> > > >> > >
> > > >> > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > A similar
> > > >> > > > problem (but on a test cluster) involved me clearing 
> > > >> > > > znode info, deleting HDFS data for the table and deleting 
> > > >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > > >> > > >
> > > >> > > Which hbase version are you on?
> > > >> > >
> > > > >> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <ma...@eset.sk> wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per 
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > > A similar problem (but on a test cluster) involved me 
> > > >> > > > clearing znode info, deleting HDFS data for the table and 
> > > >> > > > deleting WALs/MasterProcWAL files, finally restarting 
> > > >> > > > HBASE
> > service.
> > > >> > > >
> > > >> > > > Table cannot be enabled.
> > > >> > > >
> > > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> > > >> > > > system seems mostly unhappy with one region in 
> > > >> > > > particular, and is reporting
> > > >> > on that.
> > > >> > > >
> > > >> > > > There are many tables that are very active so I don't 
> > > >> > > > think it is possible to stop the entire service without a 
> > > >> > > > lot of forewarning to
> > > >> > > users.
> > > >> > > >
> > > >> > > > Thanks in advance.
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: HBASE WALs

Posted by Wellington Chevreuil <we...@gmail.com>.
>
> I am still not certain what will happen.  masterProcWALs contain info for
> all (running) tables, yes?
>
masterProcWALs only contain info for running procedures, not user table
data. User table data goes to the "normal" WALs, not to "masterProcWALs".
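
To make the distinction concrete, here is a small Python sketch that classifies a path under the HBase root directory by the store it belongs to. It assumes the default HBase 2.x layout (WALs/ and oldWALs/ for region server WALs, MasterProcWALs/ for procedure state). The root directory /hbase, the function name and the file names in the usage lines are illustrative only.

```python
from pathlib import PurePosixPath

# Directory names under hbase.rootdir (assumed default HBase 2.x layout).
STORES = {
    "WALs": "region server WAL (user table edits)",
    "oldWALs": "archived region server WAL",
    "MasterProcWALs": "master procedure store (procedure state only)",
}

def classify(path, rootdir="/hbase"):
    """Return what kind of data a path under rootdir holds, or None if unrelated."""
    try:
        rel = PurePosixPath(path).relative_to(rootdir)
    except ValueError:
        return None  # path is not under rootdir at all
    if not rel.parts:
        return None  # path is rootdir itself
    return STORES.get(rel.parts[0])

# Illustrative file names, not taken from the cluster in this thread.
print(classify("/hbase/MasterProcWALs/example-proc.log"))
print(classify("/hbase/WALs/rs1,16020,1/example.wal"))
```

Only the first kind of path is relevant to getting rid of a stuck procedure.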

> If all tables are disabled and I remove the master wals, how will that
> affect the other tables? When I disabled all tables, hundreds of master
> WALs are now created. This means there is a bunch of pending operations,
> yes?  Is it going to make some other things inconsistent?

Disabling a table involves unassigning all of that table's regions. Each
of these "unassign" operations comprises a set of sequential phases. These
internal operations are called "procedures". Information about the progress
of each procedure as it moves through its phases is stored in the
masterProcWALs files. That's why triggering the "disable" command creates
some data under masterProcWALs. If all the disable commands finished
successfully, and all your procedures are finished (apart from that rogue
one that has existed for a while already), you would be good to clean out
masterProcWALs.
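
The "everything is finished apart from the rogue one" condition can be sketched as a small check. This is a minimal illustration in Python, not HBase tooling: the record layout (pid, parent, state) is hypothetical and would have to be filled in by hand from list_procedures output, and treating SUCCESS and ROLLEDBACK as the finished states is an assumption of the sketch. Pids 73587 and 73827 come from the thread; the others, and the state shown for 73587, are made up.

```python
# States taken to mean a procedure will not run again (assumption for this sketch).
FINISHED = {"SUCCESS", "ROLLEDBACK"}

def unfinished_outside(procs, rogue_pid):
    """Return pids of unfinished procedures that are neither the rogue
    procedure nor one of its descendants."""
    parent = {p["pid"]: p.get("parent") for p in procs}

    def in_rogue_tree(pid):
        seen = set()
        while pid is not None and pid not in seen:
            if pid == rogue_pid:
                return True
            seen.add(pid)
            pid = parent.get(pid)
        return False

    return [p["pid"] for p in procs
            if p["state"] not in FINISHED and not in_rogue_tree(p["pid"])]

# Hand-written records: 73587 and 73827 are from the thread, the rest are invented.
procs = [
    {"pid": 73587, "parent": None, "state": "WAITING"},
    {"pid": 73827, "parent": 73587, "state": "WAITING_TIMEOUT"},
    {"pid": 73600, "parent": 73587, "state": "SUCCESS"},
    {"pid": 80001, "parent": None, "state": "RUNNABLE"},
]
print(unfinished_outside(procs, rogue_pid=73587))  # [80001]
```

An empty result would correspond to the situation described above, where only the rogue procedure tree is still pending.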

> I did try to set the table state manually to see if the faulty table would
> fire up and I restarted hbase...state was the same: a locked table state due
> to pending disable and stuck region.
>
That's because of the rogue procedure. When you restarted master, it read
through masterProcWALs and resumed the rogue procedure from the unfinished
state it was in when you restarted hbase. If you had removed masterProcWALs
prior to the restart, the rogue procedure would now be gone.
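
The resume-on-restart behaviour can be illustrated with a toy model. This is a sketch in Python under loud assumptions: ProcStore and Master are invented stand-ins, the phase names are hypothetical, and nothing here reflects the real on-disk format or API of the procedure store. It only shows why a plain restart brings the stuck procedure back while a restart after wiping the store does not.

```python
class ProcStore:
    """Toy stand-in for the master procedure store: pid -> last persisted step."""
    def __init__(self):
        self.log = {}

    def persist(self, pid, step):
        self.log[pid] = step

    def wipe(self):
        self.log.clear()


class Master:
    """Toy master: its in-memory view is rebuilt from the store at start-up."""
    STEPS = ["DISPATCH", "WAIT_FOR_CLOSE", "FINISH"]  # hypothetical phase names

    def __init__(self, store):
        self.store = store
        self.running = dict(store.log)  # resume whatever was persisted

    def submit(self, pid):
        self.running[pid] = self.STEPS[0]
        self.store.persist(pid, self.STEPS[0])

    def advance(self, pid):
        nxt = self.STEPS[self.STEPS.index(self.running[pid]) + 1]
        self.running[pid] = nxt
        self.store.persist(pid, nxt)


store = ProcStore()
m1 = Master(store)
m1.submit(73827)
m1.advance(73827)      # procedure is now stuck in WAIT_FOR_CLOSE

m2 = Master(store)     # plain restart: the stuck procedure comes back
print(m2.running)      # {73827: 'WAIT_FOR_CLOSE'}

store.wipe()           # stands in for cleaning out masterProcWALs
m3 = Master(store)     # restart after the wipe: nothing to resume
print(m3.running)      # {}
```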

> We may have the go-ahead to remove this table - I assume we cannot clone it
> while it is in a state of (DISABLED) flux but, once again, messing with
> master WALs has me on edge.

From what I understand, you already have the tables disabled, and no
unfinished procs apart from the rogue one, so just clean out masterProcWALs
and restart master.
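
As a rough checklist, the steps could be written down as a dry-run plan before touching anything. The Python sketch below only builds strings and executes nothing. Several things in it are assumptions rather than advice from this thread: the root directory /hbase, the function name, and above all the choice to move MasterProcWALs aside with hdfs dfs -mv instead of deleting it, which is a more cautious substitute and relies on the master recreating the directory at start-up.

```python
from datetime import datetime, timezone

def cleanup_plan(rootdir="/hbase", now=None):
    """Return the ordered steps as strings; nothing is executed."""
    now = now or datetime.now(timezone.utc)
    src = f"{rootdir}/MasterProcWALs"
    # Timestamped backup name so an earlier backup is never overwritten.
    dst = f"{rootdir}/MasterProcWALs.bak-{now:%Y%m%dT%H%M%SZ}"
    return [
        "# 1. confirm all tables are disabled and only the rogue procedure is unfinished",
        "# 2. stop the active and backup masters (method depends on your deployment)",
        f"hdfs dfs -mv {src} {dst}",
        "# 4. start the masters again and re-check list_procedures in hbase shell",
    ]

for step in cleanup_plan():
    print(step)
```

Lines starting with # are manual steps; the third entry is the only command, and it should be reviewed against the actual hbase.rootdir before use.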

On Tue, Mar 23, 2021 at 11:13, Marc Hoppins <ma...@eset.sk> wrote:

> I am still not certain what will happen.  masterProcWALs contain info for
> all (running) tables, yes?
>
> If all tables are disabled and I remove the master wals, how will that
> affect the other tables? When I disabled all tables, hundreds of master
> WALs are now created. This means there is a bunch of pending operations,
> yes?  Is it going to make some other things inconsistent?
>
> I did try to set the table state manually to see if the faulty table would
> fire up and I restarted hbase...state was the same: a locked table state due
> to pending disable and stuck region.
>
> We may have the go-ahead to remove this table - I assume we cannot clone
> it while it is in a state of (DISABLED) flux but, once again, messing with
> master WALs has me on edge.
>
>
> -----Original Message-----
> From: Wellington Chevreuil <we...@gmail.com>
> Sent: Tuesday, March 16, 2021 4:50 PM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > To be clear, if the other tables are stopped, I assume all pending and
> > current operations will finish. How long will it take to write all
> > data - if indeed the data does get permanently written - so that we
> > can safely remove WALs?
> >
> If by "tables stopped" you mean your tables are disabled, then yeah, all
> related data would already have been flushed into hfiles and wouldn't be on
> your wals. But please be aware that what you really need here to get rid of
> the rogue proc is to remove master proc wals, not normal wals.
>
> On Tue, Mar 16, 2021 at 07:12, Marc Hoppins <ma...@eset.sk> wrote:
>
> > Overall, I am mystified as to how this could happen.  If Hadoop has a
> > replication factor (I believe we use the default) of 3 and we have two
> > datacenters with masters and workers in both, how can a network outage
> > affect Hadoop operation? Surely it should have used available
> > resources to continue operations...or have I misinterpreted entirely?
> >
> > -----Original Message-----
> > From: Stack <st...@duboce.net>
> > Sent: Tuesday, March 16, 2021 7:16 AM
> > To: Hbase-User <us...@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <ma...@eset.sk>
> wrote:
> >
> > > Hi, all,
> > >
> > > For our stuck region, this exists in meta.  Could we alter the state
> > > to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> > >
> > > You could, but IIRC, in that version of HBase, you may need to restart
> > > the Master after the change (changing hbase:meta does not update the
> > > Master's in-memory state). On restart, the Master will read hbase:meta
> > > to discover Region state.
> >
> > S
> >
> >
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:regioninfo, timestamp=1613580024017,
> > > >   value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
> > > >   'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> > > >   STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:seqnumDuringOpen, timestamp=1611787189839,
> > > >   value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:server, timestamp=1611787189839,
> > > >   value=dr1-hbase18.jumbo.hq.eset.com:16020
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:serverstartcode, timestamp=1611787189839,
> > > >   value=1611785264032
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:sn, timestamp=1613580024017,
> > > >   value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > >   column=info:state, timestamp=1613580024017, value=OPENING
> > >
> > > -----Original Message-----
> > > From: Wellington Chevreuil <we...@gmail.com>
> > > Sent: Wednesday, March 10, 2021 10:56 AM
> > > To: Hbase-User <us...@hbase.apache.org>
> > > Subject: Re: HBASE WALs
> > >
> > > EXTERNAL
> > >
> > > >
> > > > Sorry if I seem stupid but this is still all new to me.
> > > >
> > > Forgot to mention, there's no stupid questions here. Don't be shy
> > > and keep'em coming.
> > >
> > > > On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil <
> > > > wellington.chevreuil@gmail.com> wrote:
> > >
> > > > However, how would that help anyway?  If we cannot fix this at
> > > > this time
> > > >> then any upgrade would have inconsistencies also, yes?
> > > >>
> > > > > The upgrade on its own wouldn't fix existing inconsistencies, but
> > > > you would now have support for additional tooling
> > > > (hbase-operators-tool) to help you with this.
> > > >
> > > > As all the 'SUCCESS' procedures have a parent ID 73587, does this
> > > > mean
> > > >> that they were successfully and fully moved from hbase25 to each
> > > >> server mentioned in that procedure?  Or does it just mean that
> > > >> the region was successfully unassigned from hbase25 but the data
> > > >> still resides on hbase25?  I see locality 0.
> > > >>
> > > > IIRC, those were all UnassignProcedures, so it means the
> > > > unassignment of the related region has completed and the region
> > > > for that particular procedure went offline.
> > > >
> > > > If we change the table state in meta to 'ENABLED', could this
> > > > kickstart
> > > >> all these things or will it just lead to further problems?
> > > >
> > > > > The Master works from its own in-memory cache of meta, so manually
> > > > > updating meta will just make the Master's cache inconsistent with
> > > > > it. You would need to restart the Masters to get the cache reloaded
> > > > > from meta. The main problem is that you still have the rogue
> > > > > procedures, which you can't get rid of without stopping the
> > > > > cluster. One alternative to a full cluster outage would be to
> > > > > identify all RSes running the rogue procs (you can find that from
> > > > > active master logs), then stop only those and the master, clean
> > > > > masterprocwals, then start them again.
> > > >
> > > >
> > > >> I suppose it means I am asking, the 73587 DisableTableProcedure,
> > > >> does it mean that the table is waiting to be disabled?  HBASE
> > > >> master declares that table is NOT enabled.
> > > >>
> > > > The table state may have been already updated to disabled, most of
> > > > its regions may already be offline, but the 73587
> > > > DisableTableProcedure cannot be considered "done" until all its
> > > > sub procedures are indeed
> > > completed.
> > > >
> > > >
> > > > > On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <ma...@eset.sk> wrote:
> > > >
> > > >> Thanks for that.
> > > >>
> > > >> Alas, we are (currently) constrained by using Cloudera (CDH)
> > > >> 6.3.1 and do not have a viable business use to pay the
> > > >> extortionate amount of money required to upgrade.  Which would
> > > >> give these cluster access to newer versions.
> > > >>
> > > >> However, how would that help anyway?  If we cannot fix this at
> > > >> this time then any upgrade would have inconsistencies also, yes?
> > > >>
> > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does this
> > > >> mean that they were successfully and fully moved from hbase25 to
> > > >> each server mentioned in that procedure?  Or does it just mean
> > > >> that the region was successfully unassigned from hbase25 but the
> > > >> data still resides on hbase25?  I see locality 0.
> > > >>
> > > >> If we change the table state in meta to 'ENABLED', could this
> > > >> kickstart all these things or will it just lead to further problems?
> > > >> I suppose it means I am asking, the 73587 DisableTableProcedure,
> > > >> does it mean that the table is waiting to be disabled?  HBASE
> > > >> master declares that table is NOT enabled.
> > > >>
> > > >> Sorry if I seem stupid but this is still all new to me.
> > > >>
> > > >> I appreciate the help.
> > > >>
> > > >> -----Original Message-----
> > > >> From: Wellington Chevreuil <we...@gmail.com>
> > > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > > >> To: Hbase-User <us...@hbase.apache.org>
> > > >> Subject: Re: HBASE WALs
> > > >>
> > > >> EXTERNAL
> > > >>
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > > >> procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems
> > > >> > to be the problem.
> > > >> >
> > > > >> Per your attached list procedures output, it seems the procs'
> > > > >> states are all inconsistent. There's a WAIT_TIMEOUT subproc of
> > > > >> 73587 with PID 73827, which is the UnassignProcedure for this
> > > > >> region. The problem is that there are already 5 APs for the same
> > > > >> region, which may be causing some deadlocks. If this cluster was
> > > > >> on an hbck2-supported version, you could get rid of this state
> > > > >> using the bypass command on all these proc ids, then manually get
> > > > >> the table/region states consistent again using the
> > > > >> setRegionState/setTableState/assigns/unassigns methods.
> > > > >>
> > > > >> Without tooling, the only option I can think of is to stop the
> > > > >> cluster, clean out masterprocwals, restart the cluster, then use
> > > > >> hbase shell to enable/disable/assign regions. You may also need
> > > > >> to manually update table/region states in the meta table. Of
> > > > >> course, you can automate these manual steps into your own tooling,
> > > > >> but it may be a better strategy in the long term to upgrade to a
> > > > >> more stable version that also benefits from more tooling supported
> > > > >> by the community.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > > >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <ma...@eset.sk> wrote:
> > > >>
> > > >> > Hi, Wellington,
> > > >> >
> > > >> > I was on 'vacation' (no road trip or overseas anything) for a
> week.
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > > >> procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems
> > > >> > to be the problem.
> > > >> >
> > > >> > I am still mystified about the HBCK2-tools. I have attached a
> > > >> > previous thread that you commented on at the time.
> > > >> >
> > > >> > I did build a tools for our HBASE 2.1.0...or rather, I built it
> > > >> > on Ubuntu
> > > >> > 20.04 with openJDK8 (1.8.0_212), then successfully ran it on
> > > >> > Ubuntu
> > > >> > 16.04 with a slightly different java (Oracle Java 8, 1.8.0_181).
> > > >> > I used it to help fix a similar problem with an offline table
> > > >> > and
> > RITs.
> > > >> > Both HBASE versions are the same.
> > > >> >
> > > >> > I attach a 'sheet' with the current procs/locks.
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Marc Hoppins <ma...@eset.sk>
> > > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > > >> > To: user@hbase.apache.org
> > > >> > Cc: Martin Oravec <ma...@eset.sk>
> > > >> > Subject: RE: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Thanks, Wellington,
> > > >> >
> > > > >> > I have already built a hbck1-tools for 2.1.0 using the method
> > > >> > described in other topics. All the HBASE and JDK here is the
> > > >> > same version so if it worked fixing one cluster HBASE then it
> > > >> > should work for other
> > > installs.
> > > >> >
> > > >> > Fiddling with masterprocWALs will require complete shutdown of
> > > > >> > hbase operations to prevent incoming reads/writes on other
> > > >> > tables and I am not sure how disruptive that will be other than
> > > >> > "probably a
> > > lot".
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Wellington Chevreuil <we...@gmail.com>
> > > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > > >> > To: Hbase-User <us...@hbase.apache.org>
> > > >> > Subject: Re: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Sorry, missed your previous email. I was hoping you were not on
> > > >> > a non-stable version, so that you would benefit from hbck2 tool
> > support.
> > > >> > Unfortunately, 2.1.0 is among the early releases that don't
> > > >> > work with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
> > > >> >
> > > >> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
> > > >> > seems
> > > >> > > mostly unhappy with one region in particular, and is
> > > >> > > reporting on
> > > >> that.
> > > >> > >
> > > >> > Are the other regions for the table properly closed, and this
> > > >> > is the only one stuck? If you do a list_procedures, are you
> > > >> > able to identify an 'unassign' procedure still running for this
> > > >> > table? Or if you grep master logs for this region, do you see
> > > >> > any messages suggesting there's still ongoing attempts to bring
> > > >> > the region offline? If there's apparently no procedure/no
> > > >> > ongoing attempts to offline the region, you might try to
> > > >> > manually update its state in meta table, then flip masters
> > > >> > (assuming you have master HA), so that the new active loads an up
> to date state from meta table.
> > > >> >
> > > >> > Otherwise, if there's still a rogue procedure trying to offline
> > > >> > the region, unfortunately, due to the lack of hbck support, you
> > > >> > would most likely need a more disruptive intervention similar
> > > >> > to what you had described in your first email, but instead of
> > > >> > normal wal folder, master proc wals is what you really would
> > > >> > need to clean out here, as that is where procedures state is
> > > >> > persisted, and you wouldn't want the rogue procedure to be
> resumed.
> > > >> >
> > > > >> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <ma...@eset.sk> wrote:
> > > >> >
> > > >> > > If you know of anything that will help I would appreciate it.
> > > >> > >
> > > >> > > If you need any log output let me know.
> > > >> > >
> > > >> > > Thanks
> > > >> > >
> > > >> > >
> > > >> > > -----Original Message-----
> > > >> > > From: Wellington Chevreuil <we...@gmail.com>
> > > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > > >> > > To: Hbase-User <us...@hbase.apache.org>
> > > >> > > Subject: Re: HBASE WALs
> > > >> > >
> > > >> > > EXTERNAL
> > > >> > >
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > Multiple regions edits would be present in a single wal file.
> > > >> > > That's why upon a RS crash and wal processing, there's a wal
> > > >> > > split
> > > phase.
> > > >> > >
> > > >> > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > A similar
> > > >> > > > problem (but on a test cluster) involved me clearing znode
> > > >> > > > info, deleting HDFS data for the table and deleting
> > > >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > > >> > > >
> > > >> > > Which hbase version are you on?
> > > >> > >
> > > > >> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <ma...@eset.sk> wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > > A similar problem (but on a test cluster) involved me
> > > >> > > > clearing znode info, deleting HDFS data for the table and
> > > >> > > > deleting WALs/MasterProcWAL files, finally restarting HBASE
> > service.
> > > >> > > >
> > > >> > > > Table cannot be enabled.
> > > >> > > >
> > > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the
> > > >> > > > system seems mostly unhappy with one region in particular,
> > > >> > > > and is reporting
> > > >> > on that.
> > > >> > > >
> > > >> > > > There are many tables that are very active so I don't think
> > > >> > > > it is possible to stop the entire service without a lot of
> > > >> > > > forewarning to
> > > >> > > users.
> > > >> > > >
> > > >> > > > Thanks in advance.
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
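
The hbase:meta row for the stuck region quoted earlier in this thread can be read more easily once its fields are decoded. The Python sketch below splits the region name into its parts, decodes info:seqnumDuringOpen as an 8-byte big-endian long, and converts a cell timestamp from epoch milliseconds. The helper names are made up for illustration, and the name parser assumes neither the table name nor the start key contains a comma.

```python
from datetime import datetime, timezone

def parse_region_name(name):
    """Split '<table>,<start key>,<region id>.<encoded name>.' into its parts."""
    table, start_key, rest = name.split(",", 2)
    region_id, encoded = rest.rstrip(".").split(".", 1)
    return {"table": table, "start_key": start_key,
            "region_id": int(region_id), "encoded": encoded}

def millis_to_utc(ms):
    """Convert epoch milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

name = "hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec."
info = parse_region_name(name)
print(info["table"], info["start_key"], info["encoded"])

# info:seqnumDuringOpen is stored as an 8-byte big-endian long.
seqnum = int.from_bytes(b"\x00\x00\x00\x00\x00\x00\x04\x8F", "big")
print(seqnum)  # 1167

# Timestamp of the info:state cell (value OPENING), in epoch milliseconds.
print(millis_to_utc(1613580024017).date())  # 2021-02-17
```

Comparing the decoded timestamps shows that info:sn and info:state were written later than info:server, which is consistent with the region being stuck part-way through a transition.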

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
I am still not certain what will happen.  masterProcWALs contain info for all (running) tables, yes?

If all tables are disabled and I remove the master WALs, how will that affect the other tables? When I disabled all tables, hundreds of master WALs were created. This means there is a bunch of pending operations, yes?  Is it going to make some other things inconsistent?

I did try to set the table state manually to see if the faulty table would fire up and I restarted hbase...state was the same: a locked table state due to pending disable and stuck region.

We may have the go-ahead to remove this table - I assume we cannot clone it while it is in a state of (DISABLED) flux but, once again, messing with master WALs has me on edge.


-----Original Message-----
From: Wellington Chevreuil <we...@gmail.com> 
Sent: Tuesday, March 16, 2021 4:50 PM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

>
> To be clear, if the other tables are stopped, I assume all pending and 
> current operations will finish. How long will it take to write all 
> data - if indeed the data does get permanently written - so that we 
> can safely remove WALs?
>
If by "tables stopped" you mean your tables are disabled, then yeah, all related data would already have been flushed into hfiles and wouldn't be on your wals. But please be aware that what you really need here to get rid of the rogue proc is to remove master proc wals, not normal wals.

On Tue, Mar 16, 2021 at 07:12, Marc Hoppins <ma...@eset.sk> wrote:

> Overall, I am mystified as to how this could happen.  If Hadoop has a 
> replication factor (I believe we use the default) of 3 and we have two 
> datacenters with masters and workers in both, how can a network outage 
> affect Hadoop operation? Surely it should have used available 
> resources to continue operations...or have I misinterpreted entirely?
>
> -----Original Message-----
> From: Stack <st...@duboce.net>
> Sent: Tuesday, March 16, 2021 7:16 AM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <ma...@eset.sk> wrote:
>
> > Hi, all,
> >
> > For our stuck region, this exists in meta.  Could we alter the state 
> > to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> >
> > You could, but IIRC, in that version of HBase, you may need to restart
> > the Master after the change (changing hbase:meta does not update the
> > Master's in-memory state). On restart, the Master will read hbase:meta
> > to discover Region state.
>
> S
>
>
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:regioninfo, timestamp=1613580024017,
> > >   value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
> > >   'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> > >   STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:seqnumDuringOpen, timestamp=1611787189839,
> > >   value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:server, timestamp=1611787189839,
> > >   value=dr1-hbase18.jumbo.hq.eset.com:16020
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:serverstartcode, timestamp=1611787189839,
> > >   value=1611785264032
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:sn, timestamp=1613580024017,
> > >   value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:state, timestamp=1613580024017, value=OPENING
> >
> > -----Original Message-----
> > From: Wellington Chevreuil <we...@gmail.com>
> > Sent: Wednesday, March 10, 2021 10:56 AM
> > To: Hbase-User <us...@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > >
> > > Sorry if I seem stupid but this is still all new to me.
> > >
> > Forgot to mention, there's no stupid questions here. Don't be shy 
> > and keep'em coming.
> >
> > > On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil <
> > > wellington.chevreuil@gmail.com> wrote:
> >
> > > However, how would that help anyway?  If we cannot fix this at 
> > > this time
> > >> then any upgrade would have inconsistencies also, yes?
> > >>
> > > > The upgrade on its own wouldn't fix existing inconsistencies, but
> > > you would now have support for additional tooling
> > > (hbase-operators-tool) to help you with this.
> > >
> > > As all the 'SUCCESS' procedures have a parent ID 73587, does this 
> > > mean
> > >> that they were successfully and fully moved from hbase25 to each 
> > >> server mentioned in that procedure?  Or does it just mean that 
> > >> the region was successfully unassigned from hbase25 but the data 
> > >> still resides on hbase25?  I see locality 0.
> > >>
> > > IIRC, those were all UnassignProcedures, so it means the 
> > > unassignment of the related region has completed and the region 
> > > for that particular procedure went offline.
> > >
> > > If we change the table state in meta to 'ENABLED', could this 
> > > kickstart
> > >> all these things or will it just lead to further problems?
> > >
> > > > The Master works from its own in-memory cache of meta, so manually
> > > > updating meta will just make the Master's cache inconsistent with
> > > > it. You would need to restart the Masters to get the cache reloaded
> > > > from meta. The main problem is that you still have the rogue
> > > > procedures, which you can't get rid of without stopping the
> > > > cluster. One alternative to a full cluster outage would be to
> > > > identify all RSes running the rogue procs (you can find that from
> > > > active master logs), then stop only those and the master, clean
> > > > masterprocwals, then start them again.
> > >
> > >
> > >> I suppose it means I am asking, the 73587 DisableTableProcedure, 
> > >> does it mean that the table is waiting to be disabled?  HBASE 
> > >> master declares that table is NOT enabled.
> > >>
> > > The table state may have been already updated to disabled, most of 
> > > its regions may already be offline, but the 73587 
> > > DisableTableProcedure cannot be considered "done" until all its 
> > > sub procedures are indeed
> > completed.
> > >
> > >
> > > > On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <ma...@eset.sk> wrote:
> > >
> > >> Thanks for that.
> > >>
> > > >> Alas, we are (currently) constrained by using Cloudera (CDH) 
> > > >> 6.3.1 and do not have a viable business case to pay the 
> > > >> extortionate amount of money required to upgrade, which would 
> > > >> give these clusters access to newer versions.
> > >>
> > >> However, how would that help anyway?  If we cannot fix this at 
> > >> this time then any upgrade would have inconsistencies also, yes?
> > >>
> > >> As all the 'SUCCESS' procedures have a parent ID 73587, does this 
> > >> mean that they were successfully and fully moved from hbase25 to 
> > >> each server mentioned in that procedure?  Or does it just mean 
> > >> that the region was successfully unassigned from hbase25 but the 
> > >> data still resides on hbase25?  I see locality 0.
> > >>
> > >> If we change the table state in meta to 'ENABLED', could this 
> > >> kickstart all these things or will it just lead to further problems?
> > >> I suppose it means I am asking, the 73587 DisableTableProcedure, 
> > >> does it mean that the table is waiting to be disabled?  HBASE 
> > >> master declares that table is NOT enabled.
> > >>
> > >> Sorry if I seem stupid but this is still all new to me.
> > >>
> > >> I appreciate the help.
> > >>
> > >> -----Original Message-----
> > >> From: Wellington Chevreuil <we...@gmail.com>
> > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > >> To: Hbase-User <us...@hbase.apache.org>
> > >> Subject: Re: HBASE WALs
> > >>
> > >> EXTERNAL
> > >>
> > >> >
> > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > >> procedure.
> > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > >> > to be the problem.
> > >> >
> > > >> Per your attached list_procedures output, the procedure states 
> > > >> all look inconsistent. There's a WAIT_TIMEOUT subproc of 
> > > >> 73587 with PID 73827, which is the UnassignProcedure for this 
> > > >> region. The problem is that there are already 5 APs for the same 
> > > >> region, which may be causing some deadlocks. If this cluster were 
> > > >> on an hbck2-supported version, you could get rid of this state 
> > > >> by using the bypass command on all these proc ids, then manually 
> > > >> get the table/region states consistent again using the 
> > > >> setRegionState/setTableState/assigns/unassigns commands.
> > >>
> > > >> Without tooling, the only option I can think of is to stop the 
> > > >> cluster, clean out masterprocwals, restart the cluster, then use 
> > > >> hbase shell to enable/disable/assign regions. You may also need 
> > > >> to manually update table/region states in the meta table. Of 
> > > >> course, you can automate these manual steps into your own 
> > > >> tooling, but it may be a better long-term strategy to upgrade to 
> > > >> a more stable version that also benefits from more 
> > > >> community-supported tooling.
> > >>
> > >>
> > >>
> > >>
> > >>
> > > >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <ma...@eset.sk> wrote:
> > >>
> > >> > Hi, Wellington,
> > >> >
> > >> > I was on 'vacation' (no road trip or overseas anything) for a week.
> > >> >
> > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > >> procedure.
> > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > >> > to be the problem.
> > >> >
> > >> > I am still mystified about the HBCK2-tools. I have attached a 
> > >> > previous thread that you commented on at the time.
> > >> >
> > > >> > I did build the tools for our HBASE 2.1.0...or rather, I built 
> > > >> > them on Ubuntu 20.04 with OpenJDK 8 (1.8.0_212), then 
> > > >> > successfully ran them on Ubuntu 16.04 with a slightly 
> > > >> > different Java (Oracle Java 8, 1.8.0_181). I used them to help 
> > > >> > fix a similar problem with an offline table and RITs.
> > >> > Both HBASE versions are the same.
> > >> >
> > >> > I attach a 'sheet' with the current procs/locks.
> > >> >
> > >> > -----Original Message-----
> > >> > From: Marc Hoppins <ma...@eset.sk>
> > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > >> > To: user@hbase.apache.org
> > >> > Cc: Martin Oravec <ma...@eset.sk>
> > >> > Subject: RE: HBASE WALs
> > >> >
> > >> > EXTERNAL
> > >> >
> > >> > Thanks, Wellington,
> > >> >
> > > >> > I have already built hbck1-tools for 2.1.0 using the method 
> > > >> > described in other topics. All the HBASE and JDK versions here 
> > > >> > are the same, so if it worked fixing HBASE on one cluster then 
> > > >> > it should work for other installs.
> > > >> >
> > > >> > Fiddling with masterprocWALs will require a complete shutdown 
> > > >> > of hbase operations to prevent incoming reads/writes on other 
> > > >> > tables, and I am not sure how disruptive that will be other 
> > > >> > than "probably a lot".
> > >> >
> > >> > -----Original Message-----
> > >> > From: Wellington Chevreuil <we...@gmail.com>
> > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > >> > To: Hbase-User <us...@hbase.apache.org>
> > >> > Subject: Re: HBASE WALs
> > >> >
> > >> > EXTERNAL
> > >> >
> > > >> > Sorry, missed your previous email. I was hoping you were not on 
> > > >> > a non-stable version, so that you would benefit from hbck2 tool 
> > > >> > support. Unfortunately, 2.1.0 is among the early releases that 
> > > >> > don't work with this tool (it requires at least 2.0.3, 2.1.1 or 
> > > >> > 2.2.0).
> > >> >
> > >> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system 
> > >> > seems
> > >> > > mostly unhappy with one region in particular, and is 
> > >> > > reporting on
> > >> that.
> > >> > >
> > >> > Are the other regions for the table properly closed, and this 
> > >> > is the only one stuck? If you do a list_procedures, are you 
> > >> > able to identify an 'unassign' procedure still running for this 
> > >> > table? Or if you grep master logs for this region, do you see 
> > >> > any messages suggesting there's still ongoing attempts to bring 
> > >> > the region offline? If there's apparently no procedure/no 
> > >> > ongoing attempts to offline the region, you might try to 
> > >> > manually update its state in meta table, then flip masters 
> > >> > (assuming you have master HA), so that the new active loads an up to date state from meta table.
> > >> >
> > >> > Otherwise, if there's still a rogue procedure trying to offline 
> > >> > the region, unfortunately, due to the lack of hbck support, you 
> > >> > would most likely need a more disruptive intervention similar 
> > >> > to what you had described in your first email, but instead of 
> > >> > normal wal folder, master proc wals is what you really would 
> > >> > need to clean out here, as that is where procedures state is 
> > >> > persisted, and you wouldn't want the rogue procedure to be resumed.
> > >> >
> > > >> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <ma...@eset.sk> wrote:
> > >> >
> > >> > > If you know of anything that will help I would appreciate it.
> > >> > >
> > >> > > If you need any log output let me know.
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > >
> > >> > > -----Original Message-----
> > >> > > From: Wellington Chevreuil <we...@gmail.com>
> > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > >> > > To: Hbase-User <us...@hbase.apache.org>
> > >> > > Subject: Re: HBASE WALs
> > >> > >
> > >> > > EXTERNAL
> > >> > >
> > >> > > >
> > >> > > > Do WAL files contain information for multiple regions per 
> > >> > > > WAL or is one WAL associated with one region?
> > >> > > >
> > > >> > > Edits for multiple regions would be present in a single WAL 
> > > >> > > file. That's why, upon an RS crash and WAL processing, 
> > > >> > > there's a WAL split phase.
> > >> > >
> > >> > > I am trying to find a way to clear a RIT for a disabled table.
> > >> > > A similar
> > >> > > > problem (but on a test cluster) involved me clearing znode 
> > >> > > > info, deleting HDFS data for the table and deleting 
> > >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > >> > > >
> > >> > > Which hbase version are you on?
> > >> > >
> > > >> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <ma...@eset.sk> wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > Do WAL files contain information for multiple regions per 
> > >> > > > WAL or is one WAL associated with one region?
> > >> > > >
> > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > >> > > > A similar problem (but on a test cluster) involved me 
> > >> > > > clearing znode info, deleting HDFS data for the table and 
> > >> > > > deleting WALs/MasterProcWAL files, finally restarting HBASE
> service.
> > >> > > >
> > >> > > > Table cannot be enabled.
> > >> > > >
> > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> > >> > > > system seems mostly unhappy with one region in particular, 
> > >> > > > and is reporting
> > >> > on that.
> > >> > > >
> > >> > > > There are many tables that are very active so I don't think 
> > >> > > > it is possible to stop the entire service without a lot of 
> > >> > > > forewarning to
> > >> > > users.
> > >> > > >
> > >> > > > Thanks in advance.
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: HBASE WALs

Posted by Wellington Chevreuil <we...@gmail.com>.
>
> To be clear, if the other tables are stopped, I assume all pending and
> current operations will finish. How long will it take to write all data -
> if indeed the data does get permanently written - so that we can safely
> remove WALs?
>
If by "tables stopped" you mean your tables are disabled, then yes, all
related data would already have been flushed into HFiles and wouldn't be in
your WALs. But please be aware that what you really need here to get rid of
the rogue proc is to remove the master proc WALs, not the normal WALs.
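
To make the distinction concrete, here is a minimal sketch in Python that only builds the paths and the command line; it does not touch HDFS. It assumes hbase.rootdir resolves to /hbase (please check hbase-site.xml on your cluster), and the backup location and helper names are hypothetical. It moves the master proc WAL directory aside rather than deleting it, and this should only be done with the Master stopped.

```python
import posixpath

# Assumption: hbase.rootdir resolves to /hbase on HDFS; verify in hbase-site.xml.
HBASE_ROOT = "/hbase"


def hbase_dir(name, root=HBASE_ROOT):
    """Return the HDFS path of a directory directly under the HBase root dir."""
    return posixpath.join(root, name)


def sideline_master_proc_wals_cmd(backup_dir, root=HBASE_ROOT):
    """Build (but do not run) an `hdfs dfs -mv` command that moves the
    master procedure WAL directory aside instead of deleting it."""
    return ["hdfs", "dfs", "-mv", hbase_dir("MasterProcWALs", root), backup_dir]


if __name__ == "__main__":
    # Region server WALs hold table edits; these are NOT what needs cleaning here.
    print("region server WALs:", hbase_dir("WALs"))
    # Master proc WALs persist procedure state, including the rogue procedure.
    print("master proc WALs  :", hbase_dir("MasterProcWALs"))
    # Hypothetical backup location, for illustration only.
    print(" ".join(sideline_master_proc_wals_cmd("/tmp/MasterProcWALs.bak")))
```

Sidelining instead of deleting keeps the old procedure state around in case it needs to be inspected later.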

On Tue, Mar 16, 2021 at 07:12, Marc Hoppins <ma...@eset.sk> wrote:

> Overall, I am mystified as to how this could happen.  If Hadoop has a
> replication factor (I believe we use the default) of 3 and we have two
> datacenters with masters and workers in both, how can a network outage
> affect Hadoop operation? Surely it should have used available resources to
> continue operations...or have I misinterpreted entirely?
>
> -----Original Message-----
> From: Stack <st...@duboce.net>
> Sent: Tuesday, March 16, 2021 7:16 AM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <ma...@eset.sk> wrote:
>
> > Hi, all,
> >
> > For our stuck region, this exists in meta.  Could we alter the state
> > to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> >
> You could but IIRC, in that version of HBase, you may need to restart
> the Master after the change (changing hbase:meta does not update the
> Master's in-memory state). On restart, Master will read hbase:meta to
> discover Region state.
>
> S
>
>
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:regioninfo, timestamp=1613580024017, value={ENCODED =>
> > f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
> > 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> > STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:seqnumDuringOpen, timestamp=1611787189839,
> > value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:server, timestamp=1611787189839,
> > value=dr1-hbase18.jumbo.hq.eset.com:16020
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:serverstartcode, timestamp=1611787189839,
> > value=1611785264032
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:sn, timestamp=1613580024017,
> > value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:state, timestamp=1613580024017, value=OPENING
> >
> > -----Original Message-----
> > From: Wellington Chevreuil <we...@gmail.com>
> > Sent: Wednesday, March 10, 2021 10:56 AM
> > To: Hbase-User <us...@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > >
> > > Sorry if I seem stupid but this is still all new to me.
> > >
> > Forgot to mention, there's no stupid questions here. Don't be shy and
> > keep'em coming.
> >
> > On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil <
> > wellington.chevreuil@gmail.com> wrote:
> >
> > > However, how would that help anyway?  If we cannot fix this at this
> > > time
> > >> then any upgrade would have inconsistencies also, yes?
> > >>
> > > The upgrade on its own wouldn't fix existing inconsistencies, but
> > > you would now have support for additional tooling
> > > (hbase-operator-tools) to help you with this.
> > >
> > > As all the 'SUCCESS' procedures have a parent ID 73587, does this
> > > mean
> > >> that they were successfully and fully moved from hbase25 to each
> > >> server mentioned in that procedure?  Or does it just mean that the
> > >> region was successfully unassigned from hbase25 but the data still
> > >> resides on hbase25?  I see locality 0.
> > >>
> > > IIRC, those were all UnassignProcedures, so it means the
> > > unassignment of the related region has completed and the region for
> > > that particular procedure went offline.
> > >
> > > If we change the table state in meta to 'ENABLED', could this
> > > kickstart
> > >> all these things or will it just lead to further problems?
> > >
> > > The Master works with its own in-memory cache of meta, so manually
> > > updating meta will just make the Master's cache inconsistent with it.
> > > You would need to restart the Masters to get the cache reloaded from
> > > meta. The main problem is that you still have the rogue procedures,
> > > which you can't get rid of without stopping the cluster. One
> > > alternative to a full cluster outage would be to identify all RSes
> > > running the rogue procs (you can find that from the active master
> > > logs), then stop only those and the master, clean masterprocwals,
> > > then start them again.
> > >
> > >
> > >> I suppose it means I am asking, the 73587 DisableTableProcedure,
> > >> does it mean that the table is waiting to be disabled?  HBASE
> > >> master declares that table is NOT enabled.
> > >>
> > > The table state may have been already updated to disabled, most of
> > > its regions may already be offline, but the 73587
> > > DisableTableProcedure cannot be considered "done" until all its sub
> > > procedures are indeed completed.
> > >
> > >
> > > On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <ma...@eset.sk> wrote:
> > >
> > >> Thanks for that.
> > >>
> > >> Alas, we are (currently) constrained by using Cloudera (CDH) 6.3.1
> > >> and do not have a viable business case to pay the extortionate
> > >> amount of money required to upgrade, which would give these
> > >> clusters access to newer versions.
> > >>
> > >> However, how would that help anyway?  If we cannot fix this at this
> > >> time then any upgrade would have inconsistencies also, yes?
> > >>
> > >> As all the 'SUCCESS' procedures have a parent ID 73587, does this
> > >> mean that they were successfully and fully moved from hbase25 to
> > >> each server mentioned in that procedure?  Or does it just mean that
> > >> the region was successfully unassigned from hbase25 but the data
> > >> still resides on hbase25?  I see locality 0.
> > >>
> > >> If we change the table state in meta to 'ENABLED', could this
> > >> kickstart all these things or will it just lead to further problems?
> > >> I suppose it means I am asking, the 73587 DisableTableProcedure,
> > >> does it mean that the table is waiting to be disabled?  HBASE
> > >> master declares that table is NOT enabled.
> > >>
> > >> Sorry if I seem stupid but this is still all new to me.
> > >>
> > >> I appreciate the help.
> > >>
> > >> -----Original Message-----
> > >> From: Wellington Chevreuil <we...@gmail.com>
> > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > >> To: Hbase-User <us...@hbase.apache.org>
> > >> Subject: Re: HBASE WALs
> > >>
> > >> EXTERNAL
> > >>
> > >> >
> > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > >> procedure.
> > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to
> > >> > be the problem.
> > >> >
> > >> Per your attached list_procedures output, the procedure states all
> > >> look inconsistent. There's a WAIT_TIMEOUT subproc of 73587 with
> > >> PID 73827, which is the UnassignProcedure for this region. The
> > >> problem is that there are already 5 APs for the same region, which
> > >> may be causing some deadlocks. If this cluster were on an
> > >> hbck2-supported version, you could get rid of this state by using
> > >> the bypass command on all these proc ids, then manually get the
> > >> table/region states consistent again using the
> > >> setRegionState/setTableState/assigns/unassigns commands.
> > >>
> > >> Without tooling, the only option I can think of is to stop the
> > >> cluster, clean out masterprocwals, restart the cluster, then use
> > >> hbase shell to enable/disable/assign regions. You may also need to
> > >> manually update table/region states in the meta table. Of course,
> > >> you can automate these manual steps into your own tooling, but it
> > >> may be a better long-term strategy to upgrade to a more stable
> > >> version that also benefits from more community-supported tooling.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <ma...@eset.sk> wrote:
> > >>
> > >> > Hi, Wellington,
> > >> >
> > >> > I was on 'vacation' (no road trip or overseas anything) for a week.
> > >> >
> > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > >> procedure.
> > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to
> > >> > be the problem.
> > >> >
> > >> > I am still mystified about the HBCK2-tools. I have attached a
> > >> > previous thread that you commented on at the time.
> > >> >
> > >> > I did build the tools for our HBASE 2.1.0...or rather, I built
> > >> > them on Ubuntu 20.04 with OpenJDK 8 (1.8.0_212), then successfully
> > >> > ran them on Ubuntu 16.04 with a slightly different Java (Oracle
> > >> > Java 8, 1.8.0_181). I used them to help fix a similar problem
> > >> > with an offline table and RITs.
> > >> > Both HBASE versions are the same.
> > >> >
> > >> > I attach a 'sheet' with the current procs/locks.
> > >> >
> > >> > -----Original Message-----
> > >> > From: Marc Hoppins <ma...@eset.sk>
> > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > >> > To: user@hbase.apache.org
> > >> > Cc: Martin Oravec <ma...@eset.sk>
> > >> > Subject: RE: HBASE WALs
> > >> >
> > >> > EXTERNAL
> > >> >
> > >> > Thanks, Wellington,
> > >> >
> > >> > I have already built hbck1-tools for 2.1.0 using the method
> > >> > described in other topics. All the HBASE and JDK versions here
> > >> > are the same, so if it worked fixing HBASE on one cluster then
> > >> > it should work for other installs.
> > >> >
> > >> > Fiddling with masterprocWALs will require a complete shutdown of
> > >> > hbase operations to prevent incoming reads/writes on other
> > >> > tables, and I am not sure how disruptive that will be other than
> > >> > "probably a lot".
> > >> >
> > >> > -----Original Message-----
> > >> > From: Wellington Chevreuil <we...@gmail.com>
> > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > >> > To: Hbase-User <us...@hbase.apache.org>
> > >> > Subject: Re: HBASE WALs
> > >> >
> > >> > EXTERNAL
> > >> >
> > >> > Sorry, missed your previous email. I was hoping you were not on a
> > >> > non-stable version, so that you would benefit from hbck2 tool
> > >> > support. Unfortunately, 2.1.0 is among the early releases that
> > >> > don't work with this tool (it requires at least 2.0.3, 2.1.1 or
> > >> > 2.2.0).
> > >> >
> > >> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
> > >> > seems
> > >> > > mostly unhappy with one region in particular, and is reporting
> > >> > > on
> > >> that.
> > >> > >
> > >> > Are the other regions for the table properly closed, and this is
> > >> > the only one stuck? If you do a list_procedures, are you able to
> > >> > identify an 'unassign' procedure still running for this table? Or
> > >> > if you grep master logs for this region, do you see any messages
> > >> > suggesting there's still ongoing attempts to bring the region
> > >> > offline? If there's apparently no procedure/no ongoing attempts
> > >> > to offline the region, you might try to manually update its state
> > >> > in meta table, then flip masters (assuming you have master HA),
> > >> > so that the new active loads an up to date state from meta table.
> > >> >
> > >> > Otherwise, if there's still a rogue procedure trying to offline
> > >> > the region, unfortunately, due to the lack of hbck support, you
> > >> > would most likely need a more disruptive intervention similar to
> > >> > what you had described in your first email, but instead of normal
> > >> > wal folder, master proc wals is what you really would need to
> > >> > clean out here, as that is where procedures state is persisted,
> > >> > and you wouldn't want the rogue procedure to be resumed.
> > >> >
> > >> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <ma...@eset.sk> wrote:
> > >> >
> > >> > > If you know of anything that will help I would appreciate it.
> > >> > >
> > >> > > If you need any log output let me know.
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > >
> > >> > > -----Original Message-----
> > >> > > From: Wellington Chevreuil <we...@gmail.com>
> > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > >> > > To: Hbase-User <us...@hbase.apache.org>
> > >> > > Subject: Re: HBASE WALs
> > >> > >
> > >> > > EXTERNAL
> > >> > >
> > >> > > >
> > >> > > > Do WAL files contain information for multiple regions per WAL
> > >> > > > or is one WAL associated with one region?
> > >> > > >
> > >> > > Edits for multiple regions would be present in a single WAL
> > >> > > file. That's why, upon an RS crash and WAL processing, there's
> > >> > > a WAL split phase.
> > >> > >
> > >> > > I am trying to find a way to clear a RIT for a disabled table.
> > >> > > A similar
> > >> > > > problem (but on a test cluster) involved me clearing znode
> > >> > > > info, deleting HDFS data for the table and deleting
> > >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > >> > > >
> > >> > > Which hbase version are you on?
> > >> > >
> > >> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <ma...@eset.sk> wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > Do WAL files contain information for multiple regions per WAL
> > >> > > > or is one WAL associated with one region?
> > >> > > >
> > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > >> > > > A similar problem (but on a test cluster) involved me
> > >> > > > clearing znode info, deleting HDFS data for the table and
> > >> > > > deleting WALs/MasterProcWAL files, finally restarting HBASE
> service.
> > >> > > >
> > >> > > > Table cannot be enabled.
> > >> > > >
> > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the
> > >> > > > system seems mostly unhappy with one region in particular,
> > >> > > > and is reporting
> > >> > on that.
> > >> > > >
> > >> > > > There are many tables that are very active so I don't think
> > >> > > > it is possible to stop the entire service without a lot of
> > >> > > > forewarning to
> > >> > > users.
> > >> > > >
> > >> > > > Thanks in advance.
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
Overall, I am mystified as to how this could happen.  If Hadoop has a replication factor (I believe we use the default) of 3 and we have two datacenters with masters and workers in both, how can a network outage affect Hadoop operation? Surely it should have used available resources to continue operations...or have I misinterpreted entirely?

-----Original Message-----
From: Stack <st...@duboce.net> 
Sent: Tuesday, March 16, 2021 7:16 AM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <ma...@eset.sk> wrote:

> Hi, all,
>
> For our stuck region, this exists in meta.  Could we alter the state 
> to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
>
You could but IIRC, in that version of HBase, you may need to restart the Master after the change (changing hbase:meta does not update the Master's in-memory state). On restart, Master will read hbase:meta to discover Region state.
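
As an illustration of the point above, here is a toy model in Python. It is not HBase code, and the class and method names are made up for the example; it only assumes, as described, that the Master copies region state from meta at startup and then works from its own in-memory copy.

```python
class Meta:
    """Stand-in for the hbase:meta table: region name -> state string."""

    def __init__(self, states):
        self.states = dict(states)


class Master:
    """Toy Master that copies region states from meta only when it (re)starts."""

    def __init__(self, meta):
        self.meta = meta
        self.region_states = {}
        self.restart()

    def restart(self):
        # On (re)start, rebuild the in-memory view from meta.
        self.region_states = dict(self.meta.states)

    def state_of(self, region):
        # Served from memory; meta is not re-read here.
        return self.region_states[region]


REGION = "f25fe93e24b34cb2f7fffddee1d89eec"
meta = Meta({REGION: "OPENING"})
master = Master(meta)

meta.states[REGION] = "CLOSED"    # manual edit of the state in meta
before = master.state_of(REGION)  # the Master still holds its old view
master.restart()
after = master.state_of(REGION)   # the view is refreshed from meta on restart
print(before, after)
```

In this model the manual change only becomes visible to the Master after the restart, which is why the restart is needed after editing meta by hand.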

S


> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:regioninfo, timestamp=1613580024017, value={ENCODED =>
> f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
> 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:seqnumDuringOpen, timestamp=1611787189839,
> value=\x00\x00\x00\x00\x00\x00\x04\x8F
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:server, timestamp=1611787189839,
> value=dr1-hbase18.jumbo.hq.eset.com:16020
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:serverstartcode, timestamp=1611787189839,
> value=1611785264032
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:sn, timestamp=1613580024017,
> value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:state, timestamp=1613580024017, value=OPENING
>
> -----Original Message-----
> From: Wellington Chevreuil <we...@gmail.com>
> Sent: Wednesday, March 10, 2021 10:56 AM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > Sorry if I seem stupid but this is still all new to me.
> >
> Forgot to mention, there's no stupid questions here. Don't be shy and 
> keep'em coming.
>
> On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil <
> wellington.chevreuil@gmail.com> wrote:
>
> > However, how would that help anyway?  If we cannot fix this at this 
> > time
> >> then any upgrade would have inconsistencies also, yes?
> >>
> > The upgrade on its own wouldn't fix existing inconsistencies, but 
> > you would now have support for additional tooling 
> > (hbase-operator-tools) to help you with this.
> >
> > As all the 'SUCCESS' procedures have a parent ID 73587, does this 
> > mean
> >> that they were successfully and fully moved from hbase25 to each 
> >> server mentioned in that procedure?  Or does it just mean that the 
> >> region was successfully unassigned from hbase25 but the data still 
> >> resides on hbase25?  I see locality 0.
> >>
> > IIRC, those were all UnassignProcedures, so it means the 
> > unassignment of the related region has completed and the region for 
> > that particular procedure went offline.
> >
> > If we change the table state in meta to 'ENABLED', could this 
> > kickstart
> >> all these things or will it just lead to further problems?
> >
> > The Master works with its own in-memory cache of meta, so manually 
> > updating meta will just make the Master's cache inconsistent with it. 
> > You would need to restart the Masters to get the cache reloaded from 
> > meta. The main problem is that you still have the rogue procedures, 
> > which you can't get rid of without stopping the cluster. One 
> > alternative to a full cluster outage would be to identify all RSes 
> > running the rogue procs (you can find that from the active master 
> > logs), then stop only those and the master, clean masterprocwals, 
> > then start them again.
> >
> >
> >> I suppose it means I am asking, the 73587 DisableTableProcedure, 
> >> does it mean that the table is waiting to be disabled?  HBASE 
> >> master declares that table is NOT enabled.
> >>
> > The table state may have been already updated to disabled, most of 
> > its regions may already be offline, but the 73587 
> > DisableTableProcedure cannot be considered "done" until all its sub 
> > procedures are indeed completed.
> >
> >
> > On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <ma...@eset.sk> wrote:
> >
> >> Thanks for that.
> >>
> >> Alas, we are (currently) constrained by using Cloudera (CDH) 6.3.1 
> >> and do not have a viable business case to pay the extortionate 
> >> amount of money required to upgrade, which would give these 
> >> clusters access to newer versions.
> >>
> >> However, how would that help anyway?  If we cannot fix this at this 
> >> time then any upgrade would have inconsistencies also, yes?
> >>
> >> As all the 'SUCCESS' procedures have a parent ID 73587, does this 
> >> mean that they were successfully and fully moved from hbase25 to 
> >> each server mentioned in that procedure?  Or does it just mean that 
> >> the region was successfully unassigned from hbase25 but the data 
> >> still resides on hbase25?  I see locality 0.
> >>
> >> If we change the table state in meta to 'ENABLED', could this 
> >> kickstart all these things or will it just lead to further problems?
> >> I suppose it means I am asking, the 73587 DisableTableProcedure, 
> >> does it mean that the table is waiting to be disabled?  HBASE 
> >> master declares that table is NOT enabled.
> >>
> >> Sorry if I seem stupid but this is still all new to me.
> >>
> >> I appreciate the help.
> >>
> >> -----Original Message-----
> >> From: Wellington Chevreuil <we...@gmail.com>
> >> Sent: Tuesday, March 9, 2021 1:20 PM
> >> To: Hbase-User <us...@hbase.apache.org>
> >> Subject: Re: HBASE WALs
> >>
> >> EXTERNAL
> >>
> >> >
> >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> >> procedure.
> >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to 
> >> > be the problem.
> >> >
> >> Per your attached list_procedures output, the procedure states all 
> >> look inconsistent. There's a WAIT_TIMEOUT subproc of 73587 with 
> >> PID 73827, which is the UnassignProcedure for this region. The 
> >> problem is that there are already 5 APs for the same region, which 
> >> may be causing some deadlocks. If this cluster were on an 
> >> hbck2-supported version, you could get rid of this state by using 
> >> the bypass command on all these proc ids, then manually get the 
> >> table/region states consistent again using the 
> >> setRegionState/setTableState/assigns/unassigns commands.
> >>
> >> Without tooling, the only option I can think of is to stop the 
> >> cluster, clean out masterprocwals, restart the cluster, then use 
> >> hbase shell to enable/disable/assign regions. You may also need to 
> >> manually update table/region states in the meta table. Of course, 
> >> you can automate these manual steps into your own tooling, but it 
> >> may be a better long-term strategy to upgrade to a more stable 
> >> version that also benefits from more community-supported tooling.
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <ma...@eset.sk> wrote:
> >>
> >> > Hi, Wellington,
> >> >
> >> > I was on 'vacation' (no road trip or overseas anything) for a week.
> >> >
> >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> >> procedure.
> >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to 
> >> > be the problem.
> >> >
> >> > I am still mystified about the HBCK2-tools. I have attached a 
> >> > previous thread that you commented on at the time.
> >> >
> >> > I did build the tools for our HBASE 2.1.0...or rather, I built 
> >> > them on Ubuntu 20.04 with OpenJDK 8 (1.8.0_212), then successfully 
> >> > ran them on Ubuntu 16.04 with a slightly different Java (Oracle 
> >> > Java 8, 1.8.0_181). I used them to help fix a similar problem 
> >> > with an offline table and RITs.
> >> > Both HBASE versions are the same.
> >> >
> >> > I attach a 'sheet' with the current procs/locks.
> >> >
> >> > -----Original Message-----
> >> > From: Marc Hoppins <ma...@eset.sk>
> >> > Sent: Wednesday, March 3, 2021 9:51 AM
> >> > To: user@hbase.apache.org
> >> > Cc: Martin Oravec <ma...@eset.sk>
> >> > Subject: RE: HBASE WALs
> >> >
> >> > EXTERNAL
> >> >
> >> > Thanks, Wellington,
> >> >
> >> > I have already build a hbck1-tools for 2.1.0 using method 
> >> > described in other topics. All the HBASE and JDK here is the same 
> >> > version so if it worked fixing one cluster HBASE then it should 
> >> > work for other
> installs.
> >> >
> >> > Fiddling with masterprocWALs will require complete shutdown of 
> >> > hbase operations to prevent incoming reds/writes on other tables 
> >> > and I am not sure how disruptive that will be other than 
> >> > "probably a
> lot".
> >> >
> >> > -----Original Message-----
> >> > From: Wellington Chevreuil <we...@gmail.com>
> >> > Sent: Tuesday, March 2, 2021 10:57 AM
> >> > To: Hbase-User <us...@hbase.apache.org>
> >> > Subject: Re: HBASE WALs
> >> >
> >> > EXTERNAL
> >> >
> >> > Sorry, missed your previous email. I was hoping you were not on a 
> >> > non-stable version, so that you would benefit from hbck2 tool support.
> >> > Unfortunately, 2.1.0 is among the early releases that don't work 
> >> > with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
> >> >
> >> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system 
> >> > seems
> >> > > mostly unhappy with one region in particular, and is reporting 
> >> > > on
> >> that.
> >> > >
> >> > Are the other regions for the table properly closed, and this is 
> >> > the only one stuck? If you do a list_procedures, are you able to 
> >> > identify an 'unassign' procedure still running for this table? Or 
> >> > if you grep master logs for this region, do you see any messages 
> >> > suggesting there's still ongoing attempts to bring the region 
> >> > offline? If there's apparently no procedure/no ongoing attempts 
> >> > to offline the region, you might try to manually update its state 
> >> > in meta table, then flip masters (assuming you have master HA), 
> >> > so that the new active loads an up to date state from meta table.
> >> >
> >> > Otherwise, if there's still a rogue procedure trying to offline 
> >> > the region, unfortunately, due to the lack of hbck support, you 
> >> > would most likely need a more disruptive intervention similar to 
> >> > what you had described in your first email, but instead of normal 
> >> > wal folder, master proc wals is what you really would need to 
> >> > clean out here, as that is where procedures state is persisted, 
> >> > and you wouldn't want the rogue procedure to be resumed.
> >> >
> >> > Em seg., 1 de mar. de 2021 às 10:22, Marc Hoppins 
> >> > <ma...@eset.sk>
> >> > escreveu:
> >> >
> >> > > If you know of anything that will help I would appreciate it.
> >> > >
> >> > > If you need any log output let me know.
> >> > >
> >> > > Thanks
> >> > >
> >> > >
> >> > > -----Original Message-----
> >> > > From: Wellington Chevreuil <we...@gmail.com>
> >> > > Sent: Thursday, February 25, 2021 4:08 PM
> >> > > To: Hbase-User <us...@hbase.apache.org>
> >> > > Subject: Re: HBASE WALs
> >> > >
> >> > > EXTERNAL
> >> > >
> >> > > >
> >> > > > Do WAL files contain information for multiple regions per WAL 
> >> > > > or is one WAL associated with one region?
> >> > > >
> >> > > Multiple regions edits would be present in a single wal file.
> >> > > That's why upon a RS crash and wal processing, there's a wal 
> >> > > split
> phase.
> >> > >
> >> > > I am trying to find a way to clear a RIT for a disabled table. 
> >> > > A similar
> >> > > > problem (but on a test cluster) involved me clearing znode 
> >> > > > info, deleting HDFS data for the table and deleting 
> >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> >> > > >
> >> > > Which hbase version are you on?
> >> > >
> >> > > Em qui., 25 de fev. de 2021 às 11:51, Marc Hoppins 
> >> > > <ma...@eset.sk>
> >> > > escreveu:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > Do WAL files contain information for multiple regions per WAL 
> >> > > > or is one WAL associated with one region?
> >> > > >
> >> > > > I am trying to find a way to clear a RIT for a disabled table.
> >> > > > A similar problem (but on a test cluster) involved me 
> >> > > > clearing znode info, deleting HDFS data for the table and 
> >> > > > deleting WALs/MasterProcWAL files, finally restarting HBASE service.
> >> > > >
> >> > > > Table cannot be enabled.
> >> > > >
> >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> >> > > > system seems mostly unhappy with one region in particular, 
> >> > > > and is reporting
> >> > on that.
> >> > > >
> >> > > > There are many tables that are very active so I don't think 
> >> > > > it is possible to stop the entire service without a lot of 
> >> > > > forewarning to
> >> > > users.
> >> > > >
> >> > > > Thanks in advance.
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: HBASE WALs

Posted by Stack <st...@duboce.net>.
On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <ma...@eset.sk> wrote:

> Hi, all,
>
> For our stuck region, this exists in meta.  Could we alter the state to
> CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
>
You could, but IIRC, in that version of HBase, you may need to restart the
Master after the change (changing hbase:meta does not update the Master's
in-memory state). On restart, the Master will read hbase:meta to discover
Region state.

S
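
A rough way to picture the point above is the toy model below. It is a
conceptual sketch only: ToyMaster and the plain dict standing in for
hbase:meta are hypothetical names for illustration, not HBase APIs. It
shows why an edit to meta is not visible to a running Master until the
Master restarts and reloads its state.

```python
# Conceptual toy only: ToyMaster and the dict used as "meta" are
# hypothetical stand-ins, not HBase classes or APIs.

class ToyMaster:
    def __init__(self, meta):
        self._meta = meta          # shared "hbase:meta" contents
        self.region_states = {}    # the master's private in-memory copy
        self.restart()

    def restart(self):
        # On (re)start the master reads region state from meta.
        self.region_states = dict(self._meta)


REGION = "f25fe93e24b34cb2f7fffddee1d89eec"
meta = {REGION: "OPENING"}
master = ToyMaster(meta)

meta[REGION] = "CLOSED"                      # manual edit of meta
before_restart = master.region_states[REGION]  # master still sees old state
master.restart()
after_restart = master.region_states[REGION]   # reloaded from meta

print(before_restart, after_restart)
```

In this model the first value printed is the stale state and the second is
the state written to meta, which mirrors the restart requirement described
above.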


> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:regioninfo, timestamp=1613580024017, value={ENCODED =>
> f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
> 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
>  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:seqnumDuringOpen, timestamp=1611787189839,
> value=\x00\x00\x00\x00\x00\x00\x04\x8F
>  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:server, timestamp=1611787189839, value=
> dr1-hbase18.jumbo.hq.eset.com:16020
>  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:serverstartcode, timestamp=1611787189839, value=1611785264032
>  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:sn, timestamp=1613580024017, value=
> ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
>  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> column=info:state, timestamp=1613580024017, value=OPENING
>

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
Hi, all,

For our stuck region, this exists in meta.  Could we alter the state to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?

hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec. column=info:regioninfo, timestamp=1613580024017, value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec, NAME => 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.', STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
 hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec. column=info:seqnumDuringOpen, timestamp=1611787189839, value=\x00\x00\x00\x00\x00\x00\x04\x8F
 hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec. column=info:server, timestamp=1611787189839, value=dr1-hbase18.jumbo.hq.eset.com:16020
 hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec. column=info:serverstartcode, timestamp=1611787189839, value=1611785264032
 hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec. column=info:sn, timestamp=1613580024017, value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
 hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec. column=info:state, timestamp=1613580024017, value=OPENING
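
As an aside for anyone reading the scan output above: the
info:seqnumDuringOpen value is printed as escaped bytes. The sketch below
assumes the value is an 8-byte big-endian long and that the shell prints
non-printable bytes as \xNN escapes; decode_shell_long is a hypothetical
helper name, not part of any HBase tooling.

```python
def decode_shell_long(escaped: str) -> int:
    """Turn a shell-escaped 8-byte value into an integer.

    Assumes non-printable bytes appear as \\xNN escapes and printable
    bytes appear as themselves, and that the value is big-endian.
    """
    raw = escaped.encode("ascii").decode("unicode_escape").encode("latin-1")
    if len(raw) != 8:
        raise ValueError(f"expected 8 bytes, got {len(raw)}")
    return int.from_bytes(raw, byteorder="big", signed=True)


# Value copied from the info:seqnumDuringOpen cell above.
seqnum = decode_shell_long(r"\x00\x00\x00\x00\x00\x00\x04\x8F")
print(seqnum)  # 0x048F = 4 * 256 + 143 = 1167
```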

-----Original Message-----
From: Wellington Chevreuil <we...@gmail.com> 
Sent: Wednesday, March 10, 2021 10:56 AM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

>
> Sorry if I seem stupid but this is still all new to me.
>
Forgot to mention, there are no stupid questions here. Don't be shy and keep 'em coming.

On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil <wellington.chevreuil@gmail.com> wrote:

>> However, how would that help anyway?  If we cannot fix this at this
>> time then any upgrade would have inconsistencies also, yes?
>>
> The upgrade on its own wouldn't fix existing inconsistencies, but you 
> would now have support for additional tooling (hbase-operators-tool)  
> to help you with this.
>
>> As all the 'SUCCESS' procedures have a parent ID 73587, does this mean
>> that they were successfully and fully moved from hbase25 to each 
>> server mentioned in that procedure?  Or does it just mean that the 
>> region was successfully unassigned from hbase25 but the data still 
>> resides on hbase25?  I see locality 0.
>>
> IIRC, those were all UnassignProcedures, so it means the unassignment 
> of the related region has completed and the region for that particular 
> procedure went offline.
>
>> If we change the table state in meta to 'ENABLED', could this kickstart
>> all these things or will it just lead to further problems?
>
> The Master works from its own in-memory cache of meta, so manually updating
> meta will just make the Master's cache inconsistent with it. You would need
> to restart the Masters to get that cache reloaded from meta. The main
> problem is that you still have the rogue procedures, which you can't 
> get rid of without stopping the cluster. One alternative to a full 
> cluster outage would be to identify all RSes running the rogue procs 
> (you can find that from active master logs), then stop only those and 
> master, clean masterprocwals, then start it again.
>
>
>> I suppose it means I am asking, the 73587 DisableTableProcedure, does 
>> it mean that the table is waiting to be disabled?  HBASE master 
>> declares that table is NOT enabled.
>>
> The table state may have been already updated to disabled, most of its 
> regions may already be offline, but the 73587 DisableTableProcedure 
> cannot be considered "done" until all its sub procedures are indeed completed.
>
>
> On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <ma...@eset.sk> wrote:
>
>> Thanks for that.
>>
>> Alas, we are (currently) constrained by using Cloudera (CDH) 6.3.1
>> and do not have a viable business use to pay the extortionate amount
>> of money required to upgrade, which would give these clusters access
>> to newer versions.
>>
>> However, how would that help anyway?  If we cannot fix this at this 
>> time then any upgrade would have inconsistencies also, yes?
>>
>> As all the 'SUCCESS' procedures have a parent ID 73587, does this 
>> mean that they were successfully and fully moved from hbase25 to each 
>> server mentioned in that procedure?  Or does it just mean that the 
>> region was successfully unassigned from hbase25 but the data still 
>> resides on hbase25?  I see locality 0.
>>
>> If we change the table state in meta to 'ENABLED', could this 
>> kickstart all these things or will it just lead to further problems?  
>> I suppose it means I am asking, the 73587 DisableTableProcedure, does 
>> it mean that the table is waiting to be disabled?  HBASE master 
>> declares that table is NOT enabled.
>>
>> Sorry if I seem stupid but this is still all new to me.
>>
>> I appreciate the help.
>>
>> -----Original Message-----
>> From: Wellington Chevreuil <we...@gmail.com>
>> Sent: Tuesday, March 9, 2021 1:20 PM
>> To: Hbase-User <us...@hbase.apache.org>
>> Subject: Re: HBASE WALs
>>
>> EXTERNAL
>>
>> >
>> > All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.
>> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be 
>> > the problem.
>> >
>> Per your list procedures output attached, it seems the procs states 
>> are all inconsistent. There's a WAIT_TIMEOUT subproc of 73587 with 
>> PID 73827, which is the UnassignProcedure for this region. Problem is 
>> that there are already 5 APs for the same region, which may be 
>> causing some deadlocks. If this cluster was on a hbck2 supported 
>> version, you could get rid of this state using bypass command on all 
>> these proc ids, then manually get the table/regions states consistent 
>> again using setRegionState/setTableState/assigns/unassigns methods.
>>
>> Without tooling, the only option I can think of is to stop cluster, 
>> clean out masterprocwals, restart cluster, then use hbase shell to 
>> enable/disable/assign regions. You may also need to manually update 
>> table/region states in meta table. Of course, you can automate these 
>> manual steps into your own tooling, but may be a better strategy in 
>> the long term to upgrade to a more stable version that also benefits 
>> from more tooling supported by the community.
>>
>>
>>
>>
>>
>> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <ma...@eset.sk> wrote:
>>
>> > Hi, Wellington,
>> >
>> > I was on 'vacation' (no road trip or overseas anything) for a week.
>> >
>> > All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.
>> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be 
>> > the problem.
>> >
>> > I am still mystified about the HBCK2-tools. I have attached a 
>> > previous thread that you commented on at the time.
>> >
>> > I did build the tool for our HBASE 2.1.0...or rather, I built it on 
>> > Ubuntu
>> > 20.04 with openJDK8 (1.8.0_212), then successfully ran it on Ubuntu
>> > 16.04 with a slightly different java (Oracle Java 8, 1.8.0_181).  I 
>> > used it to help fix a similar problem with an offline table and RITs.
>> > Both HBASE versions are the same.
>> >
>> > I attach a 'sheet' with the current procs/locks.
>> >
>> > -----Original Message-----
>> > From: Marc Hoppins <ma...@eset.sk>
>> > Sent: Wednesday, March 3, 2021 9:51 AM
>> > To: user@hbase.apache.org
>> > Cc: Martin Oravec <ma...@eset.sk>
>> > Subject: RE: HBASE WALs
>> >
>> > EXTERNAL
>> >
>> > Thanks, Wellington,
>> >
>> > I have already built a hbck1-tools for 2.1.0 using the method described
>> > in other topics. All the HBASE and JDK here are the same version so
>> > if it worked fixing one cluster HBASE then it should work for other installs.
>> >
>> > Fiddling with masterprocWALs will require complete shutdown of 
>> > hbase operations to prevent incoming reads/writes on other tables 
>> > and I am not sure how disruptive that will be other than "probably a lot".
>> >
>> > -----Original Message-----
>> > From: Wellington Chevreuil <we...@gmail.com>
>> > Sent: Tuesday, March 2, 2021 10:57 AM
>> > To: Hbase-User <us...@hbase.apache.org>
>> > Subject: Re: HBASE WALs
>> >
>> > EXTERNAL
>> >
>> > Sorry, missed your previous email. I was hoping you were not on a 
>> > non-stable version, so that you would benefit from hbck2 tool support.
>> > Unfortunately, 2.1.0 is among the early releases that don't work 
>> > with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
>> >
>> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system seems
>> > > mostly unhappy with one region in particular, and is reporting on that.
>> > >
>> > Are the other regions for the table properly closed, and this is 
>> > the only one stuck? If you do a list_procedures, are you able to 
>> > identify an 'unassign' procedure still running for this table? Or 
>> > if you grep master logs for this region, do you see any messages 
>> > suggesting there's still ongoing attempts to bring the region 
>> > offline? If there's apparently no procedure/no ongoing attempts to 
>> > offline the region, you might try to manually update its state in 
>> > meta table, then flip masters (assuming you have master HA), so 
>> > that the new active loads an up to date state from meta table.
>> >
>> > Otherwise, if there's still a rogue procedure trying to offline the 
>> > region, unfortunately, due to the lack of hbck support, you would 
>> > most likely need a more disruptive intervention similar to what you 
>> > had described in your first email, but instead of normal wal 
>> > folder, master proc wals is what you really would need to clean out 
>> > here, as that is where procedures state is persisted, and you 
>> > wouldn't want the rogue procedure to be resumed.
>> >
>> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <ma...@eset.sk> wrote:
>> >
>> > > If you know of anything that will help I would appreciate it.
>> > >
>> > > If you need any log output let me know.
>> > >
>> > > Thanks
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Wellington Chevreuil <we...@gmail.com>
>> > > Sent: Thursday, February 25, 2021 4:08 PM
>> > > To: Hbase-User <us...@hbase.apache.org>
>> > > Subject: Re: HBASE WALs
>> > >
>> > > EXTERNAL
>> > >
>> > > >
>> > > > Do WAL files contain information for multiple regions per WAL 
>> > > > or is one WAL associated with one region?
>> > > >
>> > > Edits for multiple regions would be present in a single WAL file.
>> > > That's why, upon a RS crash and WAL processing, there's a WAL split phase.
>> > >
>> > > > I am trying to find a way to clear a RIT for a disabled table. A
>> > > > similar problem (but on a test cluster) involved me clearing znode
>> > > > info, deleting HDFS data for the table and deleting
>> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
>> > > >
>> > > Which hbase version are you on?
>> > >
>> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <ma...@eset.sk> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Do WAL files contain information for multiple regions per WAL 
>> > > > or is one WAL associated with one region?
>> > > >
>> > > > I am trying to find a way to clear a RIT for a disabled table. 
>> > > > A similar problem (but on a test cluster) involved me clearing 
>> > > > znode info, deleting HDFS data for the table and deleting 
>> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
>> > > >
>> > > > Table cannot be enabled.
>> > > >
>> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
>> > > > seems mostly unhappy with one region in particular, and is
>> > > > reporting on that.
>> > > >
>> > > > There are many tables that are very active so I don't think it
>> > > > is possible to stop the entire service without a lot of
>> > > > forewarning to users.
>> > > >
>> > > > Thanks in advance.
>> > > >
>> > >
>> >
>>
>
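
The HBCK2 requirement quoted in the thread above (at least 2.0.3, 2.1.1 or
2.2.0) can be written down as a small check. This is only a sketch of that
stated rule: supports_hbck2 is a hypothetical helper, the thresholds are
taken from the message above, and vendor builds with backported patches may
behave differently.

```python
import re

# Minimum patch level per release line, as quoted in the thread.
MIN_PATCH = {(2, 0): 3, (2, 1): 1}


def supports_hbck2(version: str) -> bool:
    """Return True if the version meets the thresholds quoted above."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    if (major, minor) in MIN_PATCH:
        return patch >= MIN_PATCH[(major, minor)]
    # Anything on the 2.2 line or later qualifies; older lines do not.
    return (major, minor) >= (2, 2)


for candidate in ("2.0.3", "2.1.0", "2.1.1", "2.2.0"):
    print(candidate, supports_hbck2(candidate))
```

Under this rule 2.1.0, the version discussed in the thread, falls just
short of the 2.1.1 minimum.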

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
To be clear, if the other tables are stopped, I assume all pending and current operations will finish. How long will it take to write all data - if indeed the data does get permanently written - so that we can safely remove WALs?

-----Original Message-----
From: Wellington Chevreuil <we...@gmail.com> 
Sent: Wednesday, March 10, 2021 10:56 AM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

>
> Sorry if I seem stupid but this is still all new to me.
>
Forgot to mention, there are no stupid questions here. Don't be shy and keep 'em coming.

On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil <wellington.chevreuil@gmail.com> wrote:

>> However, how would that help anyway?  If we cannot fix this at this
>> time then any upgrade would have inconsistencies also, yes?
>>
> The upgrade on its own wouldn't fix existing inconsistencies, but you 
> would now have support for additional tooling (hbase-operators-tool)  
> to help you with this.
>
>> As all the 'SUCCESS' procedures have a parent ID 73587, does this mean
>> that they were successfully and fully moved from hbase25 to each 
>> server mentioned in that procedure?  Or does it just mean that the 
>> region was successfully unassigned from hbase25 but the data still 
>> resides on hbase25?  I see locality 0.
>>
> IIRC, those were all UnassignProcedures, so it means the unassignment 
> of the related region has completed and the region for that particular 
> procedure went offline.
>
>> If we change the table state in meta to 'ENABLED', could this kickstart
>> all these things or will it just lead to further problems?
>
> The Master works from its own in-memory cache of meta, so manually updating
> meta will just make the Master's cache inconsistent with it. You would need
> to restart the Masters to get that cache reloaded from meta. The main
> problem is that you still have the rogue procedures, which you can't 
> get rid of without stopping the cluster. One alternative to a full 
> cluster outage would be to identify all RSes running the rogue procs 
> (you can find that from active master logs), then stop only those and 
> master, clean masterprocwals, then start it again.
>
>
>> I suppose it means I am asking, the 73587 DisableTableProcedure, does 
>> it mean that the table is waiting to be disabled?  HBASE master 
>> declares that table is NOT enabled.
>>
> The table state may have been already updated to disabled, most of its 
> regions may already be offline, but the 73587 DisableTableProcedure 
> cannot be considered "done" until all its sub procedures are indeed completed.
>
>
> On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <ma...@eset.sk> wrote:
>
>> Thanks for that.
>>
>> Alas, we are (currently) constrained by using Cloudera (CDH) 6.3.1 
>> and do not have a viable business use to pay the extortionate amount 
>> of money required to upgrade, which would give these clusters access 
>> to newer versions.
>>
>> However, how would that help anyway?  If we cannot fix this at this 
>> time then any upgrade would have inconsistencies also, yes?
>>
>> As all the 'SUCCESS' procedures have a parent ID 73587, does this 
>> mean that they were successfully and fully moved from hbase25 to each 
>> server mentioned in that procedure?  Or does it just mean that the 
>> region was successfully unassigned from hbase25 but the data still 
>> resides on hbase25?  I see locality 0.
>>
>> If we change the table state in meta to 'ENABLED', could this 
>> kickstart all these things or will it just lead to further problems?  
>> I suppose it means I am asking, the 73587 DisableTableProcedure, does 
>> it mean that the table is waiting to be disabled?  HBASE master 
>> declares that table is NOT enabled.
>>
>> Sorry if I seem stupid but this is still all new to me.
>>
>> I appreciate the help.
>>
>> -----Original Message-----
>> From: Wellington Chevreuil <we...@gmail.com>
>> Sent: Tuesday, March 9, 2021 1:20 PM
>> To: Hbase-User <us...@hbase.apache.org>
>> Subject: Re: HBASE WALs
>>
>> EXTERNAL
>>
>> >
>> > All fails are waiting on the same PID (73587), a DISABLE TABLE
>> procedure.
>> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be 
>> > the problem.
>> >
>> Per your list procedures output attached, it seems the procs states 
>> are all inconsistent. There's a WAIT_TIMEOUT subproc of 73587 with 
>> PID 73827, which is the UnassignProcedure for this region. Problem is 
>> that there are already 5 APs for the same region, which may be 
>> causing some deadlocks. If this cluster was on a hbck2 supported 
>> version, you could get rid of this state using bypass command on all 
>> these proc ids, then manually get the table/regions states consistent 
>> again using setRegionState/setTableState/assigns/unassigns methods.
>>
>> Without tooling, the only option I can think of is to stop cluster, 
>> clean out masterprocwals, restart cluster, then use hbase shell to 
>> enable/disable/assign regions. You may also need to manually update 
>> table/region states in meta table. Of course, you can automate these 
>> manual steps into your own tooling, but may be a better strategy in 
>> the long term to upgrade to a more stable version that also benefits 
>> from more tooling supported by the community.
>>
>>
>>
>>
>>
>> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <ma...@eset.sk> wrote:
>>
>> > Hi, Wellington,
>> >
>> > I was on 'vacation' (no road trip or overseas anything) for a week.
>> >
>> > All fails are waiting on the same PID (73587), a DISABLE TABLE
>> procedure.
>> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be 
>> > the problem.
>> >
>> > I am still mystified about the HBCK2-tools. I have attached a 
>> > previous thread that you commented on at the time.
>> >
>> > I did build a tools for our HBASE 2.1.0...or rather, I built it on 
>> > Ubuntu
>> > 20.04 with openJDK8 (1.8.0_212), then successfully ran it on Ubuntu
>> > 16.04 with a slightly different java (Oracle Java 8, 1.8.0_181).  I 
>> > used it to help fix a similar problem with an offline table and RITs.
>> > Both HBASE versions are the same.
>> >
>> > I attach a 'sheet' with the current procs/locks.
>> >
>> > -----Original Message-----
>> > From: Marc Hoppins <ma...@eset.sk>
>> > Sent: Wednesday, March 3, 2021 9:51 AM
>> > To: user@hbase.apache.org
>> > Cc: Martin Oravec <ma...@eset.sk>
>> > Subject: RE: HBASE WALs
>> >
>> > EXTERNAL
>> >
>> > Thanks, Wellington,
>> >
>> > I have already built a hbck1-tools for 2.1.0 using the method described 
>> > in other topics. All the HBASE and JDK here is the same version so 
>> > if it worked fixing one cluster HBASE then it should work for other installs.
>> >
>> > Fiddling with masterprocWALs will require complete shutdown of 
>> > hbase operations to prevent incoming reads/writes on other tables 
>> > and I am not sure how disruptive that will be other than "probably a lot".
>> >
>> > -----Original Message-----
>> > From: Wellington Chevreuil <we...@gmail.com>
>> > Sent: Tuesday, March 2, 2021 10:57 AM
>> > To: Hbase-User <us...@hbase.apache.org>
>> > Subject: Re: HBASE WALs
>> >
>> > EXTERNAL
>> >
>> > Sorry, missed your previous email. I was hoping you were not on a 
>> > non-stable version, so that you would benefit from hbck2 tool support.
>> > Unfortunately, 2.1.0 is among the early releases that don't work 
>> > with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
>> >
>> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system 
>> > seems
>> > > mostly unhappy with one region in particular, and is reporting on
>> that.
>> > >
>> > Are the other regions for the table properly closed, and this is 
>> > the only one stuck? If you do a list_procedures, are you able to 
>> > identify an 'unassign' procedure still running for this table? Or 
>> > if you grep master logs for this region, do you see any messages 
>> > suggesting there's still ongoing attempts to bring the region 
>> > offline? If there's apparently no procedure/no ongoing attempts to 
>> > offline the region, you might try to manually update its state in 
>> > meta table, then flip masters (assuming you have master HA), so 
>> > that the new active loads an up to date state from meta table.
>> >
>> > Otherwise, if there's still a rogue procedure trying to offline the 
>> > region, unfortunately, due to the lack of hbck support, you would 
>> > most likely need a more disruptive intervention similar to what you 
>> > had described in your first email, but instead of normal wal 
>> > folder, master proc wals is what you really would need to clean out 
>> > here, as that is where procedures state is persisted, and you 
>> > wouldn't want the rogue procedure to be resumed.
>> >
>> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <ma...@eset.sk> wrote:
>> >
>> > > If you know of anything that will help I would appreciate it.
>> > >
>> > > If you need any log output let me know.
>> > >
>> > > Thanks
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Wellington Chevreuil <we...@gmail.com>
>> > > Sent: Thursday, February 25, 2021 4:08 PM
>> > > To: Hbase-User <us...@hbase.apache.org>
>> > > Subject: Re: HBASE WALs
>> > >
>> > > EXTERNAL
>> > >
>> > > >
>> > > > Do WAL files contain information for multiple regions per WAL 
>> > > > or is one WAL associated with one region?
>> > > >
>> > > Multiple regions edits would be present in a single wal file. 
>> > > That's why upon a RS crash and wal processing, there's a wal split phase.
>> > >
>> > > I am trying to find a way to clear a RIT for a disabled table. A 
>> > > similar
>> > > > problem (but on a test cluster) involved me clearing znode 
>> > > > info, deleting HDFS data for the table and deleting 
>> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
>> > > >
>> > > Which hbase version are you on?
>> > >
>> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <ma...@eset.sk> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Do WAL files contain information for multiple regions per WAL 
>> > > > or is one WAL associated with one region?
>> > > >
>> > > > I am trying to find a way to clear a RIT for a disabled table. 
>> > > > A similar problem (but on a test cluster) involved me clearing 
>> > > > znode info, deleting HDFS data for the table and deleting 
>> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
>> > > >
>> > > > Table cannot be enabled.
>> > > >
>> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system 
>> > > > seems mostly unhappy with one region in particular, and is 
>> > > > reporting
>> > on that.
>> > > >
>> > > > There are many tables that are very active so I don't think it 
>> > > > is possible to stop the entire service without a lot of 
>> > > > forewarning to
>> > > users.
>> > > >
>> > > > Thanks in advance.
>> > > >
>> > >
>> >
>>
>

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
Currently, the HBase UI reports that there is only ONE region on hbase25, which is probably our stuck region. Does this make it any easier to fix?


Re: HBASE WALs

Posted by Wellington Chevreuil <we...@gmail.com>.
>
> Sorry if I seem stupid but this is still all new to me.
>
Forgot to mention, there are no stupid questions here. Don't be shy and
keep 'em coming.

On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil
<wellington.chevreuil@gmail.com> wrote:

> However, how would that help anyway?  If we cannot fix this at this time
>> then any upgrade would have inconsistencies also, yes?
>>
> The upgrade on its own wouldn't fix existing inconsistencies, but you
> would now have support for additional tooling (hbase-operators-tool)  to
> help you with this.
>
> As all the 'SUCCESS' procedures have a parent ID 73587, does this mean
>> that they were successfully and fully moved from hbase25 to each server
>> mentioned in that procedure?  Or does it just mean that the region was
>> successfully unassigned from hbase25 but the data still resides on
>> hbase25?  I see locality 0.
>>
> IIRC, those were all UnassignProcedures, so it means the unassignment of
> the related region has completed and the region for that particular
> procedure went offline.
>
> If we change the table state in meta to 'ENABLED', could this kickstart
>> all these things or will it just lead to further problems?
>
> Each master works from its own in-memory cache of meta, so manually updating
> meta will just make the master's cache inconsistent with meta. You would need to
> restart the masters to get the cache reloaded from meta. The main problem is
> that you still have the rogue procedures, which you can't get rid of
> without stopping the cluster. One alternative to a full cluster outage
> would be to identify all RSes running the rogue procs (you can find that
> from active master logs), then stop only those and master, clean
> masterprocwals, then start it again.
>
>
>> I suppose it means I am asking, the 73587 DisableTableProcedure, does it
>> mean that the table is waiting to be disabled?  HBASE master declares that
>> table is NOT enabled.
>>
> The table state may have been already updated to disabled, most of its
> regions may already be offline, but the 73587 DisableTableProcedure cannot
> be considered "done" until all its sub procedures are indeed completed.
>
>
> On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <ma...@eset.sk> wrote:
>
>> Thanks for that.
>>
>> Alas, we are (currently) constrained by using Cloudera (CDH) 6.3.1 and do
>> not have a viable business use to pay the extortionate amount of money
>> required to upgrade, which would give these clusters access to newer
>> versions.
>>
>> However, how would that help anyway?  If we cannot fix this at this time
>> then any upgrade would have inconsistencies also, yes?
>>
>> As all the 'SUCCESS' procedures have a parent ID 73587, does this mean
>> that they were successfully and fully moved from hbase25 to each server
>> mentioned in that procedure?  Or does it just mean that the region was
>> successfully unassigned from hbase25 but the data still resides on
>> hbase25?  I see locality 0.
>>
>> If we change the table state in meta to 'ENABLED', could this kickstart
>> all these things or will it just lead to further problems?  I suppose it
>> means I am asking, the 73587 DisableTableProcedure, does it mean that the
>> table is waiting to be disabled?  HBASE master declares that table is NOT
>> enabled.
>>
>> Sorry if I seem stupid but this is still all new to me.
>>
>> I appreciate the help.
>>
>> -----Original Message-----
>> From: Wellington Chevreuil <we...@gmail.com>
>> Sent: Tuesday, March 9, 2021 1:20 PM
>> To: Hbase-User <us...@hbase.apache.org>
>> Subject: Re: HBASE WALs
>>
>> EXTERNAL
>>
>> >
>> > All fails are waiting on the same PID (73587), a DISABLE TABLE
>> procedure.
>> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be
>> > the problem.
>> >
>> Per your list procedures output attached, it seems the procs states are
>> all inconsistent. There's a WAIT_TIMEOUT subproc of 73587 with PID 73827,
>> which is the UnassignProcedure for this region. Problem is that there are
>> already 5 APs for the same region, which may be causing some deadlocks. If
>> this cluster was on a hbck2 supported version, you could get rid of this
>> state using bypass command on all these proc ids, then manually get the
>> table/regions states consistent again using
>> setRegionState/setTableState/assigns/unassigns methods.
>>
>> Without tooling, the only option I can think of is to stop cluster, clean
>> out masterprocwals, restart cluster, then use hbase shell to
>> enable/disable/assign regions. You may also need to manually update
>> table/region states in meta table. Of course, you can automate these manual
>> steps into your own tooling, but may be a better strategy in the long term
>> to upgrade to a more stable version that also benefits from more tooling
>> supported by the community.
>>
>>
>>
>>
>>
>> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <ma...@eset.sk> wrote:
>>
>> > Hi, Wellington,
>> >
>> > I was on 'vacation' (no road trip or overseas anything) for a week.
>> >
>> > All fails are waiting on the same PID (73587), a DISABLE TABLE
>> procedure.
>> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be
>> > the problem.
>> >
>> > I am still mystified about the HBCK2-tools. I have attached a previous
>> > thread that you commented on at the time.
>> >
>> > I did build a tools for our HBASE 2.1.0...or rather, I built it on
>> > Ubuntu
>> > 20.04 with openJDK8 (1.8.0_212), then successfully ran it on Ubuntu
>> > 16.04 with a slightly different java (Oracle Java 8, 1.8.0_181).  I
>> > used it to help fix a similar problem with an offline table and RITs.
>> > Both HBASE versions are the same.
>> >
>> > I attach a 'sheet' with the current procs/locks.
>> >
>> > -----Original Message-----
>> > From: Marc Hoppins <ma...@eset.sk>
>> > Sent: Wednesday, March 3, 2021 9:51 AM
>> > To: user@hbase.apache.org
>> > Cc: Martin Oravec <ma...@eset.sk>
>> > Subject: RE: HBASE WALs
>> >
>> > EXTERNAL
>> >
>> > Thanks, Wellington,
>> >
>> > I have already built a hbck1-tools for 2.1.0 using the method described in
>> > other topics. All the HBASE and JDK here is the same version so if it
>> > worked fixing one cluster HBASE then it should work for other installs.
>> >
>> > Fiddling with masterprocWALs will require complete shutdown of hbase
>> > operations to prevent incoming reads/writes on other tables and I am
>> > not sure how disruptive that will be other than "probably a lot".
>> >
>> > -----Original Message-----
>> > From: Wellington Chevreuil <we...@gmail.com>
>> > Sent: Tuesday, March 2, 2021 10:57 AM
>> > To: Hbase-User <us...@hbase.apache.org>
>> > Subject: Re: HBASE WALs
>> >
>> > EXTERNAL
>> >
>> > Sorry, missed your previous email. I was hoping you were not on a
>> > non-stable version, so that you would benefit from hbck2 tool support.
>> > Unfortunately, 2.1.0 is among the early releases that don't work with
>> > this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
>> >
>> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system seems
>> > > mostly unhappy with one region in particular, and is reporting on
>> that.
>> > >
>> > Are the other regions for the table properly closed, and this is the
>> > only one stuck? If you do a list_procedures, are you able to identify
>> > an 'unassign' procedure still running for this table? Or if you grep
>> > master logs for this region, do you see any messages suggesting
>> > there's still ongoing attempts to bring the region offline? If there's
>> > apparently no procedure/no ongoing attempts to offline the region, you
>> > might try to manually update its state in meta table, then flip
>> > masters (assuming you have master HA), so that the new active loads an
>> > up to date state from meta table.
>> >
>> > Otherwise, if there's still a rogue procedure trying to offline the
>> > region, unfortunately, due to the lack of hbck support, you would most
>> > likely need a more disruptive intervention similar to what you had
>> > described in your first email, but instead of normal wal folder,
>> > master proc wals is what you really would need to clean out here, as
>> > that is where procedures state is persisted, and you wouldn't want the
>> > rogue procedure to be resumed.
>> >
>> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <ma...@eset.sk> wrote:
>> >
>> > > If you know of anything that will help I would appreciate it.
>> > >
>> > > If you need any log output let me know.
>> > >
>> > > Thanks
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Wellington Chevreuil <we...@gmail.com>
>> > > Sent: Thursday, February 25, 2021 4:08 PM
>> > > To: Hbase-User <us...@hbase.apache.org>
>> > > Subject: Re: HBASE WALs
>> > >
>> > > EXTERNAL
>> > >
>> > > >
>> > > > Do WAL files contain information for multiple regions per WAL or
>> > > > is one WAL associated with one region?
>> > > >
>> > > Multiple regions edits would be present in a single wal file. That's
>> > > why upon a RS crash and wal processing, there's a wal split phase.
>> > >
>> > > I am trying to find a way to clear a RIT for a disabled table. A
>> > > similar
>> > > > problem (but on a test cluster) involved me clearing znode info,
>> > > > deleting HDFS data for the table and deleting WALs/MasterProcWAL
>> > > > files, finally restarting HBASE service.
>> > > >
>> > > Which hbase version are you on?
>> > >
>> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <ma...@eset.sk> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Do WAL files contain information for multiple regions per WAL or
>> > > > is one WAL associated with one region?
>> > > >
>> > > > I am trying to find a way to clear a RIT for a disabled table. A
>> > > > similar problem (but on a test cluster) involved me clearing znode
>> > > > info, deleting HDFS data for the table and deleting
>> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
>> > > >
>> > > > Table cannot be enabled.
>> > > >
>> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
>> > > > seems mostly unhappy with one region in particular, and is
>> > > > reporting
>> > on that.
>> > > >
>> > > > There are many tables that are very active so I don't think it is
>> > > > possible to stop the entire service without a lot of forewarning
>> > > > to
>> > > users.
>> > > >
>> > > > Thanks in advance.
>> > > >
>> > >
>> >
>>
>

Re: HBASE WALs

Posted by Wellington Chevreuil <we...@gmail.com>.
>
> However, how would that help anyway?  If we cannot fix this at this time
> then any upgrade would have inconsistencies also, yes?
>
The upgrade on its own wouldn't fix existing inconsistencies, but you
would now have support for additional tooling (hbase-operators-tool)  to
help you with this.

> As all the 'SUCCESS' procedures have a parent ID 73587, does this mean that
> they were successfully and fully moved from hbase25 to each server
> mentioned in that procedure?  Or does it just mean that the region was
> successfully unassigned from hbase25 but the data still resides on
> hbase25?  I see locality 0.
>
IIRC, those were all UnassignProcedures, so it means the unassignment of
the related region has completed and the region for that particular
procedure went offline.

> If we change the table state in meta to 'ENABLED', could this kickstart all
> these things or will it just lead to further problems?

Each master works from its own in-memory cache of meta, so manually updating
meta will just make the master's cache inconsistent with meta. You would need to
restart the masters to get the cache reloaded from meta. The main problem is
that you still have the rogue procedures, which you can't get rid of
without stopping the cluster. One alternative to a full cluster outage
would be to identify all RSes running the rogue procs (you can find them
in the active master logs), then stop only those and the master, clean out
masterprocwals, then start them again.
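
To illustrate the point about the master's cache, here is a minimal sketch in
Python. It is a toy model only: the Meta and Master classes below are
hypothetical and are not HBase APIs. It just shows why an edit made directly
in meta stays invisible to a running master until that master is restarted
(or another master takes over) and reloads its state.

```python
# Toy model only: these classes are hypothetical and are not HBase APIs.

class Meta:
    """Stands in for the meta table (the persisted state)."""
    def __init__(self):
        self.table_state = {}


class Master:
    """Stands in for a master that caches meta once, when it starts."""
    def __init__(self, meta):
        self.meta = meta
        self.cache = dict(meta.table_state)  # snapshot taken at startup

    def table_state(self, table):
        # Answers come from the in-memory cache, not from meta.
        return self.cache.get(table)


meta = Meta()
meta.table_state["t1"] = "DISABLING"
master = Master(meta)

# Manual edit of meta behind the running master's back.
meta.table_state["t1"] = "ENABLED"
stale = master.table_state("t1")   # the running master still sees the old value

# A restarted (or newly active) master loads a fresh snapshot from meta.
master = Master(meta)
fresh = master.table_state("t1")

print(stale, fresh)
```

Note that this models only the cache. It says nothing about the rogue
procedures, which would still be resumed from masterprocwals unless those
are cleaned out.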


> I suppose it means I am asking, the 73587 DisableTableProcedure, does it
> mean that the table is waiting to be disabled?  HBASE master declares that
> table is NOT enabled.
>
The table state may already have been updated to disabled, and most of its
regions may already be offline, but the 73587 DisableTableProcedure cannot
be considered "done" until all of its sub-procedures have completed.
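
A minimal sketch of that parent/sub-procedure rule, in Python. This is a toy
model and not how HBase implements it: the dictionary layout, the helper
names, and the sub-procedure ids 73825 and 73826 are made up for
illustration. Only 73587, 73827 and the WAIT_TIMEOUT state come from this
thread, and treating SUCCESS as the only finished state is a simplification.

```python
# Toy model only: data layout and helper names are assumptions, not HBase internals.
FINISHED = {"SUCCESS"}  # simplification: only SUCCESS counts as finished here


def parent_is_done(children):
    """A parent can finish only when every sub-procedure has finished."""
    return all(state in FINISHED for state in children.values())


def blocking_children(children):
    """Return the ids of sub-procedures that still hold the parent open."""
    return sorted(pid for pid, state in children.items() if state not in FINISHED)


# Hypothetical snapshot of the sub-procedures of 73587, keyed by procedure id.
# 73825 and 73826 are invented ids; 73827 is the stuck unassign from the thread.
subprocs_of_73587 = {
    73825: "SUCCESS",
    73826: "SUCCESS",
    73827: "WAIT_TIMEOUT",
}

print(parent_is_done(subprocs_of_73587))     # False
print(blocking_children(subprocs_of_73587))  # [73827]
```

In other words, one stuck sub-procedure is enough to keep the whole
DisableTableProcedure open, even if every other region is already offline.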


On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <ma...@eset.sk> wrote:

> Thanks for that.
>
> Alas, we are (currently) constrained by using Cloudera (CDH) 6.3.1 and do
> not have a viable business use to pay the extortionate amount of money
> required to upgrade.  Which would give these cluster access to newer
> versions.
>
> However, how would that help anyway?  If we cannot fix this at this time
> then any upgrade would have inconsistencies also, yes?
>
> As all the 'SUCCESS' procedures have a parent ID 73587, does this mean
> that they were successfully and fully moved from hbase25 to each server
> mentioned in that procedure?  Or does it just mean that the region was
> successfully unassigned from hbase25 but the data still resides on
> hbase25?  I see locality 0.
>
> If we change the table state in meta to 'ENABLED', could this kickstart
> all these things or will it just lead to further problems?  I suppose it
> means I am asking, the 73587 DisableTableProcedure, does it mean that the
> table is waiting to be disabled?  HBASE master declares that table is NOT
> enabled.
>
> Sorry if I seem stupid but this is still all new to me.
>
> I appreciate the help.

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
Thanks for that.

Alas, we are (currently) constrained by using Cloudera (CDH) 6.3.1 and do not have a viable business case for paying the extortionate amount of money required to upgrade, which would give these clusters access to newer versions.

However, how would that help anyway?  If we cannot fix this at this time then any upgrade would have inconsistencies also, yes?

As all the 'SUCCESS' procedures have a parent ID 73587, does this mean that they were successfully and fully moved from hbase25 to each server mentioned in that procedure?  Or does it just mean that the region was successfully unassigned from hbase25 but the data still resides on hbase25?  I see locality 0.

If we change the table state in meta to 'ENABLED', could this kickstart all these things or will it just lead to further problems?  I suppose what I am asking is: does the 73587 DisableTableProcedure mean that the table is still waiting to be disabled?  The HBASE master declares that the table is NOT enabled.

Sorry if I seem stupid but this is still all new to me.

I appreciate the help.

-----Original Message-----
From: Wellington Chevreuil <we...@gmail.com> 
Sent: Tuesday, March 9, 2021 1:20 PM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

>
> All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.
> The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be 
> the problem.
>
Per your list procedures output attached, it seems the procs states are all inconsistent. There's a WAIT_TIMEOUT subproc of 73587 with PID 73827, which is the UnassignProcedure for this region. Problem is that there are already 5 APs for the same region, which may be causing some deadlocks. If this cluster was on a hbck2 supported version, you could get rid of this state using bypass command on all these proc ids, then manually get the table/regions states consistent again using setRegionState/setTableState/assigns/unassigns methods.

Without tooling, the only option I can think of is to stop cluster, clean out masterprocwals, restart cluster, then use hbase shell to enable/disable/assign regions. You may also need to manually update table/region states in meta table. Of course, you can automate these manual steps into your own tooling, but may be a better strategy in the long term to upgrade to a more stable version that also benefits from more tooling supported by the community.






Re: HBASE WALs

Posted by Wellington Chevreuil <we...@gmail.com>.
>
> All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.
> The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be the
> problem.
>
Per the list_procedures output you attached, the procedure states all look
inconsistent. There's a WAIT_TIMEOUT subprocedure of 73587 with PID 73827,
which is the UnassignProcedure for this region. The problem is that there
are already 5 APs (AssignProcedures) for the same region, which may be
causing some deadlocks. If this cluster were on an hbck2-supported version,
you could get rid of this state by using the bypass command on all these
proc ids, then manually get the table/region states consistent again using
the setRegionState/setTableState/assigns/unassigns commands.
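
A minimal Python sketch of how one might pick out the proc ids to bypass is shown below. It assumes the list_procedures output has already been turned into records by hand. The record layout, the 8000x PIDs and the RUNNABLE state are hypothetical; only PIDs 73587 and 73827, the region name and WAIT_TIMEOUT come from this thread.

```python
from collections import defaultdict

REGION = "f25fe93e24b34cb2f7fffddee1d89eec"

# Hypothetical records standing in for rows of list_procedures output.
procedures = [
    {"pid": 73827, "type": "UnassignProcedure", "region": REGION, "state": "WAIT_TIMEOUT"},
    {"pid": 80001, "type": "AssignProcedure", "region": REGION, "state": "RUNNABLE"},
    {"pid": 80002, "type": "AssignProcedure", "region": REGION, "state": "RUNNABLE"},
    {"pid": 80003, "type": "UnassignProcedure", "region": "another-region", "state": "SUCCESS"},
]


def pending_by_region(procs):
    """Group the PIDs of unfinished procedures by the region they target."""
    grouped = defaultdict(list)
    for proc in procs:
        if proc["state"] != "SUCCESS":
            grouped[proc["region"]].append(proc["pid"])
    return dict(grouped)


def conflicting(procs):
    """Regions with more than one unfinished procedure competing for them."""
    return {region: sorted(pids)
            for region, pids in pending_by_region(procs).items()
            if len(pids) > 1}


print(conflicting(procedures))
```

The PIDs listed for a conflicting region would be the candidates to pass to bypass on a version where hbck2 is available.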

Without tooling, the only option I can think of is to stop the cluster, clean
out the MasterProcWALs, restart the cluster, then use the hbase shell to
enable/disable/assign regions. You may also need to manually update
table/region states in the meta table. Of course, you can automate these
manual steps into your own tooling, but it may be a better strategy in the
long term to upgrade to a more stable version that also benefits from more
tooling supported by the community.





On Mon, 8 Mar 2021 at 07:50, Marc Hoppins <ma...@eset.sk> wrote:

> Hi, Wellington,
>
> I was on 'vacation' (no road trip or overseas anything) for a week.
>
> All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.
> The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be the
> problem.
>
> I am still mystified about the HBCK2-tools. I have attached a previous
> thread that you commented on at the time.
>
> I did build a tools for our HBASE 2.1.0...or rather, I built it on Ubuntu
> 20.04 with openJDK8 (1.8.0_212), then successfully ran it on Ubuntu 16.04
> with a slightly different java (Oracle Java 8, 1.8.0_181).  I used it to
> help fix a similar problem with an offline table and RITs.  Both HBASE
> versions are the same.
>
> I attach a 'sheet' with the current procs/locks.

RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
Hi, Wellington,

I was on 'vacation' (no road trip or overseas anything) for a week.

All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.  The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be the problem.

I am still mystified about the HBCK2-tools. I have attached a previous thread that you commented on at the time.

I did build the tools for our HBASE 2.1.0...or rather, I built them on Ubuntu 20.04 with OpenJDK 8 (1.8.0_212), then successfully ran them on Ubuntu 16.04 with a slightly different Java (Oracle Java 8, 1.8.0_181).  I used them to help fix a similar problem with an offline table and RITs.  Both HBASE versions are the same.

I attach a 'sheet' with the current procs/locks.

-----Original Message-----
From: Marc Hoppins <ma...@eset.sk> 
Sent: Wednesday, March 3, 2021 9:51 AM
To: user@hbase.apache.org
Cc: Martin Oravec <ma...@eset.sk>
Subject: RE: HBASE WALs

EXTERNAL

Thanks, Wellington,

I have already build a hbck1-tools for 2.1.0 using method described in other topics. All the HBASE and JDK here is the same version so if it worked fixing one cluster HBASE then it should work for other installs.

Fiddling with masterprocWALs will require complete shutdown of hbase operations to prevent incoming reds/writes on other tables and I am not sure how disruptive that will be other than "probably a lot".


RE: HBASE WALs

Posted by Marc Hoppins <ma...@eset.sk>.
Thanks, Wellington,

I have already built an hbck1-tools for 2.1.0 using the method described in other topics. All the HBASE and JDK installs here are the same version, so if it worked to fix HBASE on one cluster then it should work for the other installs.

Fiddling with the MasterProcWALs will require a complete shutdown of hbase operations to prevent incoming reads/writes on other tables, and I am not sure how disruptive that will be other than "probably a lot".

-----Original Message-----
From: Wellington Chevreuil <we...@gmail.com> 
Sent: Tuesday, March 2, 2021 10:57 AM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: HBASE WALs

EXTERNAL

Sorry, missed your previous email. I was hoping you were not on a non-stable version, so that you would benefit from hbck2 tool support.
Unfortunately, 2.1.0 is among the early releases that don't work with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).

Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system seems
> mostly unhappy with one region in particular, and is reporting on that.
>
Are the other regions for the table properly closed, and this is the only one stuck? If you do a list_procedures, are you able to identify an 'unassign' procedure still running for this table? Or if you grep master logs for this region, do you see any messages suggesting there's still ongoing attempts to bring the region offline? If there's apparently no procedure/no ongoing attempts to offline the region, you might try to manually update its state in meta table, then flip masters (assuming you have master HA), so that the new active loads an up to date state from meta table.

Otherwise, if there's still a rogue procedure trying to offline the region, unfortunately, due to the lack of hbck support, you would most likely need a more disruptive intervention similar to what you had described in your first email, but instead of normal wal folder, master proc wals is what you really would need to clean out here, as that is where procedures state is persisted, and you wouldn't want the rogue procedure to be resumed.


Re: HBASE WALs

Posted by Wellington Chevreuil <we...@gmail.com>.
Sorry, missed your previous email. I was hoping you were not on a
non-stable version, so that you would benefit from hbck2 tool support.
Unfortunately, 2.1.0 is among the early releases that don't work with this
tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
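
The version rule above can be written down as a small Python check. This is a sketch under stated assumptions: the function name is made up, majors other than 2 are handled by guesswork, and vendor builds (a CDH suffix is ignored here) may carry backports that this does not account for.

```python
import re

# Minimum patch release per 2.x minor line, as quoted above:
# hbck2 needs at least 2.0.3, 2.1.1 or 2.2.0.
MIN_PATCH_BY_MINOR = {0: 3, 1: 1}


def supports_hbck2(version: str) -> bool:
    """Return True if the leading x.y.z of version meets the quoted minimums."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    if major != 2:
        # Assumption: anything newer than 2.x is fine, anything older is not.
        return major > 2
    if minor in MIN_PATCH_BY_MINOR:
        return patch >= MIN_PATCH_BY_MINOR[minor]
    return True  # 2.2.0 and later


print(supports_hbck2("2.1.0"))
print(supports_hbck2("2.1.1"))
```

For the cluster in this thread the check comes out False, since 2.1.0 is one patch release short of 2.1.1.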

> Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system seems
> mostly unhappy with one region in particular, and is reporting on that.
>
Are the other regions for the table properly closed, and this is the only
one stuck? If you do a list_procedures, are you able to identify an
'unassign' procedure still running for this table? Or if you grep master
logs for this region, do you see any messages suggesting there's still
ongoing attempts to bring the region offline? If there's apparently no
procedure/no ongoing attempts to offline the region, you might try to
manually update its state in meta table, then flip masters (assuming you
have master HA), so that the new active loads an up to date state from meta
table.

Otherwise, if there's still a rogue procedure trying to offline the region,
then unfortunately, due to the lack of hbck support, you would most likely
need a more disruptive intervention similar to what you described in your
first email. Instead of the normal WAL folder, though, the MasterProcWALs are
what you would really need to clean out here, as that is where procedure
state is persisted, and you wouldn't want the rogue procedure to be resumed.

On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <ma...@eset.sk>
wrote:

> If you know of anything that will help I would appreciate it.
>
> If you need any log output let me know.
>
> Thanks
>
>
> -----Original Message-----
> From: Wellington Chevreuil <we...@gmail.com>
> Sent: Thursday, February 25, 2021 4:08 PM
> To: Hbase-User <us...@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > Do WAL files contain information for multiple regions per WAL or is
> > one WAL associated with one region?
> >
> Multiple regions edits would be present in a single wal file. That's why
> upon a RS crash and wal processing, there's a wal split phase.
>
> > I am trying to find a way to clear a RIT for a disabled table. A similar
> > problem (but on a test cluster) involved me clearing znode info,
> > deleting HDFS data for the table and deleting WALs/MasterProcWAL
> > files, finally restarting HBASE service.
> >
> Which hbase version are you on?
>
> On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <ma...@eset.sk>
> wrote:
>
> > Hi all,
> >
> > Do WAL files contain information for multiple regions per WAL or is
> > one WAL associated with one region?
> >
> > I am trying to find a way to clear a RIT for a disabled table. A
> > similar problem (but on a test cluster) involved me clearing znode
> > info, deleting HDFS data for the table and deleting WALs/MasterProcWAL
> > files, finally restarting HBASE service.
> >
> > Table cannot be enabled.
> >
> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system seems
> > mostly unhappy with one region in particular, and is reporting on that.
> >
> > There are many tables that are very active so I don't think it is
> > possible to stop the entire service without a lot of forewarning to
> > users.
> >
> > Thanks in advance.
> >
>