Posted to user@hbase.apache.org by Claude M <cl...@gmail.com> on 2022/01/04 15:24:00 UTC

Re: HBase master unable to recover with error "Cannot seek after EOF"

I don't want to rebuild HBase.  According to the attached HBase/Hadoop
compatibility chart, the latest version of HBase that has been verified w/
Hadoop is 2.3.x.
The fix was put into branch 2.3 on 11/21 but there is not going to be a
2.3.8 release since it is mentioned that branch 2.3 is EOL.  Is there not
another way around this?

On Fri, Dec 24, 2021 at 12:53 AM 张铎(Duo Zhang) <pa...@gmail.com>
wrote:

> Ah, thanks Yulin Niu for the pointer. HBASE-26053 should be the problem.
>
> Yulin Niu <yu...@gmail.com> wrote on Sun, Dec 19, 2021 at 10:41:
> >
> > https://issues.apache.org/jira/browse/HBASE-25053
> > This looks like the bug described in this issue. Claude M, you can try
> > cherry-picking this patch.
> >
> > Viraj Jasani <vj...@apache.org> wrote on Sun, Dec 19, 2021 at 02:17:
> >
> > > > Your fix is a bit dangerous since you may lose some ongoing procedures,
> > > > but if you did not experience any inconsistency on your cluster, for
> > > > example, some regions are not online, then it is OK.
> > >
> > > Duo, out of curiosity, even if some regions are offline and/or some servers
> > > go offline, wouldn't master failover re-trigger SCPs and TRSPs to bring all
> > > regions ONLINE?
> > > I have played around with removal of MasterProcWAL on hbase1 only (WAL proc
> > > store) and have seen new SCPs getting triggered, i.e. AM does bring all
> > > regions ONLINE eventually.
> > >
> > >
> > > On Thu, Dec 16, 2021 at 9:57 PM 张铎(Duo Zhang) <pa...@gmail.com>
> > > wrote:
> > >
> > > > I guess this should be a bug. For the master local region we do not handle
> > > > broken WAL files which do not even have a valid header.
> > > >
> > > > Will take a look at the code tomorrow to confirm whether this is the case.
> > > >
> > > > Your fix is a bit dangerous since you may lose some ongoing procedures,
> > > > but if you did not experience any inconsistency on your cluster, for
> > > > example, some regions are not online, then it is OK.
> > > >
> > > > Thanks for reporting.
> > > >
> > > > Claude M <cl...@gmail.com> wrote on Thu, Dec 16, 2021 at 03:37:
> > > >
> > > > > Hello,
> > > > >
> > > > > I have the following installed:
> > > > >
> > > > >    - Hadoop 3.2.2
> > > > >    - HBase 2.3.5
> > > > >
> > > > >
> > > > > When all the datanodes in Hadoop are stopped but the HBase cluster is
> > > > > still running, the HBase master crashes w/ the attached exception and is
> > > > > not recoverable.
> > > > >
> > > > > If I delete the contents under the following directories in hdfs, the
> > > > > master will then recover:
> > > > >
> > > > >    - /hbase/MasterData/WALs/
> > > > >    - /hbase/MasterData/data/master/store/*/recovered.wals/
> > > > >
> > > > > Is this an appropriate way to resolve the issue?  If not, what should be
> > > > > done?
> > > > >
> > > > >
> > > > > Thanks
> > > > >
> > > >
> > >
>
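
For readers who want to try the workaround from the original report, here is
a minimal sketch that moves the master WAL files aside instead of deleting
them, so a copy survives. It is not an endorsed procedure: as noted in the
thread, removing these files may lose ongoing procedures whether they are
deleted or moved. The two source paths are the ones from the report; the
backup directory /hbase-wal-backup, the timestamp layout and the function
name are assumptions made for illustration. The script only prints the
hdfs commands for review and does not run them.

```python
import shlex
from datetime import datetime, timezone

# Source paths taken from the original report in this thread.
MASTER_WAL_GLOBS = [
    "/hbase/MasterData/WALs/*",
    "/hbase/MasterData/data/master/store/*/recovered.wals/*",
]


def build_move_aside_commands(backup_root, stamp=None):
    """Return hdfs CLI commands (argv lists) that move master WAL files aside.

    backup_root is a hypothetical HDFS directory chosen by the operator.
    """
    stamp = stamp or datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    commands = []
    for index, pattern in enumerate(MASTER_WAL_GLOBS):
        # One target directory per source pattern so file names cannot collide.
        target = f"{backup_root.rstrip('/')}/{stamp}/{index}"
        commands.append(["hdfs", "dfs", "-mkdir", "-p", target])
        commands.append(["hdfs", "dfs", "-mv", pattern, target])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # shlex.quote keeps the glob away from the local shell so that the
    # hdfs client expands it against HDFS paths.
    for argv in build_move_aside_commands("/hbase-wal-backup"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Review the printed commands and run them by hand while the master is
stopped. The -mv step reports an error if a pattern matches no files.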

Re: HBase master unable to recover with error "Cannot seek after EOF"

Posted by Andrew Purtell <ap...@apache.org>.
We run Hadoop (HDFS, YARN) 2.10 at my employer, with HBase 2.4 and 1.6/1.7,
so I can give feedback about functionality and compatibility with respect to
this combination.

Unfortunately I do not have experience running Hadoop 3 in production and
am not directly familiar with anyone who does, although I do believe some
in the community are, and would welcome their experience (and versions) in
this thread if they would like to write in. We can change that statement if
someone can attest to 'sufficiently tested' with some 3.x version. This
change may be due (or not).

I do have some non-prod experience with Hadoop 3.1 and HBase 2.4 with root
filesystem on S3 (with HBOSS) and WAL on HDFS. I did not run into any WAL
related issues like the problems you mention on that thread but would need
more effort to qualify such a configuration for production so it might turn
up under more intensive scenarios. Have not tried Hadoop 3.2 or later.

I'm sorry I could not be more helpful.




-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk

Re: HBase master unable to recover with error "Cannot seek after EOF"

Posted by Claude M <cl...@gmail.com>.
Thanks for your reply.  What about the following statement that is in the
documentation, is this still true?

Hadoop 3.x is still in early access releases and has not yet been
sufficiently tested by the HBase community for production use cases.

When I tested HBase 2.3.5 w/ hadoop 3.2.2, I was encountering a problem w/
Hadoop described here:
https://www.mail-archive.com/user@hadoop.apache.org/msg24265.html.  When I
changed it to use Hadoop 2.10.0, I did not have the problem.



Re: HBase master unable to recover with error "Cannot seek after EOF"

Posted by Andrew Purtell <an...@gmail.com>.
The functional compatibility is the same with 2.3 and 2.4 with respect to Hadoop 2.10. The omission in the compatibility chart is a documentation bug. There is an existing JIRA for that omission that will be reprioritized. 


Re: HBase master unable to recover with error "Cannot seek after EOF"

Posted by Claude M <cl...@gmail.com>.
Has HBase 2.4 been tested to be fully functional w/ Hadoop 2.10.0?  I don't
see it in the compatibility chart.


Re: HBase master unable to recover with error "Cannot seek after EOF"

Posted by "张铎(Duo Zhang)" <pa...@gmail.com>.
You can try to upgrade to 2.4.x, it should be rolling upgradable.
