Posted to user@accumulo.apache.org by "Adam J. Shook" <ad...@gmail.com> on 2018/06/11 21:26:15 UTC

Corrupt WAL

Hey all,

The root tablet on one of our dev systems isn't loading due to an illegal
state exception -- COMPACTION_FINISH preceding COMPACTION_START.  What'd be
the best way to mitigate this issue?  This was likely caused due to both of
our NameNodes failing.

Thank you,
--Adam

Re: Corrupt WAL

Posted by "Adam J. Shook" <ad...@gmail.com>.
The code referenced in the PR works to detect and move a WAL, replacing it
with an empty one, but isn't fully wrapped up/merged.  Some priorities were
shifted and this got pushed back, though I do plan on addressing the
comments in the code review Soon™.

I'd suggest upgrading to 1.9.2 once you resolve the issue.  We've been
running it for a while and have not had any WAL-related errors.

--Adam

On Tue, Aug 21, 2018 at 6:58 PM Ed Coleman <de...@etcoleman.com> wrote:

> There has been work done in https://github.com/apache/accumulo/pull/574.
> I'm not certain of the state of the code, but the description may provide
> you with things that you could look at manually.

RE: Corrupt WAL

Posted by Ed Coleman <de...@etcoleman.com>.
There has been work done in https://github.com/apache/accumulo/pull/574. I'm not certain of the state of the code, but the description may provide you with things that you could look at manually.


-----Original Message-----
From: tech.shan@gmail.com [mailto:tech.shan@gmail.com] 
Sent: Tuesday, August 21, 2018 5:45 PM
To: user@accumulo.apache.org
Subject: Re: Corrupt WAL

Was there any success with this workaround strategy?  I am also experiencing this issue.



Re: Corrupt WAL

Posted by te...@gmail.com.
Was there any success with this workaround strategy?  I am also experiencing this issue.

On 2018/06/13 16:30:22, "Adam J. Shook" <ad...@gmail.com> wrote: 
> Sorry, I had the error backwards.  There is an OPEN for the WAL and then
> immediately a COMPACTION_FINISH entry.  This would cause the error.

Re: Corrupt WAL

Posted by "Adam J. Shook" <ad...@gmail.com>.
Sorry, I had the error backwards.  There is an OPEN for the WAL and then
immediately a COMPACTION_FINISH entry.  This would cause the error.
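
As a rough illustration of why that ordering fails, here is a toy model of the check. It is not Accumulo's actual recovery code: the function name and the plain-string event names are made up for this sketch, and it ignores that real WAL entries are tracked per tablet.

```python
from typing import Iterable, Optional


def first_invalid_event(events: Iterable[str]) -> Optional[int]:
    """Return the index of the first COMPACTION_FINISH that has no
    unmatched COMPACTION_START before it, or None if the order is fine."""
    compaction_open = False
    for index, event in enumerate(events):
        if event == "COMPACTION_START":
            compaction_open = True
        elif event == "COMPACTION_FINISH":
            if not compaction_open:
                # A finish with nothing started: the kind of ordering
                # described above (OPEN, then COMPACTION_FINISH).
                return index
            compaction_open = False
    return None


print(first_invalid_event(["OPEN", "COMPACTION_FINISH"]))  # 1
print(first_invalid_event(["OPEN", "COMPACTION_START", "COMPACTION_FINISH"]))  # None
```

In this model the log that starts with OPEN and then COMPACTION_FINISH is rejected at index 1, which matches the "without preceding COMPACTION_START" message.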

On Wed, Jun 13, 2018 at 11:34 AM, Adam J. Shook <ad...@gmail.com>
wrote:

> Looking at the log I see that the last two entries are COMPACTION_START of
> one RFile immediately followed by a COMPACTION_START of a separate RFile
> which (I believe) would lead to the error.  Would this necessarily be an
> issue if the compactions are for separate RFiles?
>
> This is a dev cluster and I don't necessarily care about it, but is there
> a (good) means to do WAL log surgery?  I imagine I can just chop off bytes
> until the log is parseable and missing the info about the compactions.

Re: Corrupt WAL

Posted by "Adam J. Shook" <ad...@gmail.com>.
Looking at the log I see that the last two entries are COMPACTION_START of
one RFile immediately followed by a COMPACTION_START of a separate RFile
which (I believe) would lead to the error.  Would this necessarily be an
issue if the compactions are for separate RFiles?

This is a dev cluster and I don't necessarily care about it, but is there a
(good) means to do WAL log surgery?  I imagine I can just chop off bytes
until the log is parseable and missing the info about the compactions.

On Tue, Jun 12, 2018 at 2:32 PM, Keith Turner <ke...@deenlo.com> wrote:

> On Tue, Jun 12, 2018 at 12:10 PM, Adam J. Shook <ad...@gmail.com>
> wrote:
> > Yes, that is the error.  I'll inspect the logs and report back.
>
> Ok.  The LogReader command has a mechanism to filter which tablet is
> displayed.  If the walog has a lot of data in it, you may need to use
> this.
>
> Also, be aware that only 5 mutations are shown for each "many mutations"
> object in the walog.  The -m option changes this.  You may want to see
> more when deciding whether the info in the log is important.

Re: Corrupt WAL

Posted by Keith Turner <ke...@deenlo.com>.
On Tue, Jun 12, 2018 at 12:10 PM, Adam J. Shook <ad...@gmail.com> wrote:
> Yes, that is the error.  I'll inspect the logs and report back.

Ok.  The LogReader command has a mechanism to filter which tablet is
displayed.  If the walog has a lot of data in it, you may need to use
this.

Also, be aware that only 5 mutations are shown for each "many mutations"
object in the walog.  The -m option changes this.  You may want to see
more when deciding whether the info in the log is important.
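
A minimal sketch of putting that command line together, assuming the WAL path is passed as a positional argument and that -m takes an integer count. The helper name and the example path are hypothetical, and the tablet filter is left out because its flag is not named in this thread.

```python
import shlex
from typing import List, Optional

LOG_READER_CLASS = "org.apache.accumulo.tserver.logger.LogReader"


def log_reader_command(wal_paths: List[str],
                       max_mutations: Optional[int] = None) -> List[str]:
    """Build the argument list for running LogReader over some WAL files."""
    if not wal_paths:
        raise ValueError("at least one WAL path is required")
    command = ["accumulo", LOG_READER_CLASS]
    if max_mutations is not None:
        # Assumption: -m raises the number of mutations shown (default 5).
        command += ["-m", str(max_mutations)]
    return command + list(wal_paths)


# Hypothetical WAL path, for illustration only.
cmd = log_reader_command(["/accumulo/wal/EXAMPLE"], max_mutations=100)
print(shlex.join(cmd))
```

The list form can be handed to subprocess.run(cmd, check=True) once the path points at a real WAL.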



Re: Corrupt WAL

Posted by "Adam J. Shook" <ad...@gmail.com>.
Yes, that is the error.  I'll inspect the logs and report back.

On Tue, Jun 12, 2018 at 10:14 AM, Keith Turner <ke...@deenlo.com> wrote:

> Is the message you are seeing "COMPACTION_FINISH (without preceding
> COMPACTION_START)"?  That message indicates that the WALs are
> incomplete, probably as a result of the NameNode problems.  You could do
> the following:
>
> 1) Run the following command to see what's in the log.  You need to see
> what is there for the root tablet.
>
>    accumulo org.apache.accumulo.tserver.logger.LogReader
>
> 2) Replace the log file with an empty file after checking whether there
> is anything important in it.
>
> I think the list of WALs for the root tablet is stored in ZK at
> /accumulo/<id>/walogs

Re: Corrupt WAL

Posted by Keith Turner <ke...@deenlo.com>.
Is the message you are seeing "COMPACTION_FINISH (without preceding
COMPACTION_START)"?  That message indicates that the WALs are
incomplete, probably as a result of the NameNode problems.  You could do
the following:

1) Run the following command to see what's in the log.  You need to see
what is there for the root tablet.

   accumulo org.apache.accumulo.tserver.logger.LogReader

2) Replace the log file with an empty file after checking whether there
is anything important in it.

I think the list of WALs for the root tablet is stored in ZK at
/accumulo/<id>/walogs
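
Step 2 could be sketched roughly as below, assuming the WAL lives in HDFS and that a zero-length file is what counts as "empty" here. The function name and both paths are hypothetical. The sketch only prints the two hdfs commands so they can be reviewed before anything is moved.

```python
from typing import List


def replacement_plan(wal_path: str, backup_path: str) -> List[List[str]]:
    """Return the commands that set the bad WAL aside and leave a
    zero-length file in its place. Nothing is executed here."""
    if wal_path == backup_path:
        raise ValueError("backup path must differ from the WAL path")
    return [
        # Keep the original around in case its contents matter later.
        ["hdfs", "dfs", "-mv", wal_path, backup_path],
        # Create a zero-length file under the original name.
        ["hdfs", "dfs", "-touchz", wal_path],
    ]


# Hypothetical paths, for illustration only.
for step in replacement_plan("/accumulo/wal/EXAMPLE", "/tmp/EXAMPLE.corrupt"):
    print(" ".join(step))
```

Moving rather than deleting keeps the corrupt file available for a later look with LogReader.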

On Mon, Jun 11, 2018 at 5:26 PM, Adam J. Shook <ad...@gmail.com> wrote:
> Hey all,
>
> The root tablet on one of our dev systems isn't loading due to an illegal
> state exception -- COMPACTION_FINISH preceding COMPACTION_START.  What'd be
> the best way to mitigate this issue?  This was likely caused due to both of
> our NameNodes failing.
>
> Thank you,
> --Adam

Re: Corrupt WAL

Posted by "Adam J. Shook" <ad...@gmail.com>.
The WAL is from 1.9.1.

On Mon, Jun 11, 2018 at 6:33 PM, Christopher <ct...@apache.org> wrote:

> That's what I was thinking it was related to. Do you know if the
> particular WAL file was created from a previous version, from before you
> upgraded?

Re: Corrupt WAL

Posted by Christopher <ct...@apache.org>.
That's what I was thinking it was related to. Do you know if the particular
WAL file was created from a previous version, from before you upgraded?

On Mon, Jun 11, 2018 at 6:00 PM Adam J. Shook <ad...@gmail.com> wrote:

> Sorry, it would have been good to include that :)  It's the newest 1.9.1.  I
> think it relates to https://github.com/apache/accumulo/pull/458, just not
> sure what the best thing to do here is.

Re: Corrupt WAL

Posted by "Adam J. Shook" <ad...@gmail.com>.
Sorry, it would have been good to include that :)  It's the newest 1.9.1.  I
think it relates to https://github.com/apache/accumulo/pull/458, just not
sure what the best thing to do here is.

On Mon, Jun 11, 2018 at 5:46 PM, Christopher <ct...@apache.org> wrote:

> What version are you using?

Re: Corrupt WAL

Posted by Christopher <ct...@apache.org>.
What version are you using?

On Mon, Jun 11, 2018 at 5:27 PM Adam J. Shook <ad...@gmail.com> wrote:

> Hey all,
>
> The root tablet on one of our dev systems isn't loading due to an illegal
> state exception -- COMPACTION_FINISH preceding COMPACTION_START.  What'd be
> the best way to mitigate this issue?  This was likely caused due to both of
> our NameNodes failing.
>
> Thank you,
> --Adam
>