Posted to dev@activemq.apache.org by yw yw <wy...@gmail.com> on 2019/07/01 09:57:13 UTC

Invalid record type during startup after power outage

Hi, All


Yesterday our cluster experienced a sudden loss of power. When we started
the broker after power was restored, an exception occurred:


The exception showed that the userRecordType loaded was illegal. The
operations team deleted the data journals and the broker then started
successfully.


Unfortunately we didn't back up the problematic journal files. We checked
the dmesg output and found no disk errors, and SMART tests also showed the
disk was not broken. We then dug into the code (JournalImpl::readJournalFile)
and tried to find the cause. We have two doubts about the code.


First doubt:

The comment says "I - We scan for any valid record on the file. If a hole
happened on the middle of the file we keep looking until all the
possibilities are gone".

Since the journal file is append-only and the fileId is strictly
increasing, we could just skip the whole file as soon as a record's fileId
does not equal the file's id; IMO the remaining records in the file are in
the same state, so there is no need to read them. If we do keep looking at
all the possibilities, is there a (very low) chance that we assemble a
record whose fileId, recordType, and check size all qualify but which does
not actually exist? (A rough estimate of that chance is sketched below.)
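
To put a rough number on "very low", here is a back-of-envelope sketch.
The assumptions are ours and are not taken from JournalImpl: a 1-byte
record type with roughly 10 valid values, a 4-byte fileId that must match
the current file, and two independent 4-byte size fields that must agree.

public class FalsePositiveEstimate {
    public static void main(String[] args) {
        // Assumed checks (hypothetical, not the real JournalImpl ones):
        double typeMatch = 10.0 / 256.0;        // random byte is a valid type
        double fileIdMatch = Math.pow(2, -32);  // random int equals the fileId
        double sizesAgree = Math.pow(2, -32);   // trailing size echoes the size field

        double perOffset = typeMatch * fileIdMatch * sizesAgree;
        // A 10 MiB journal file scanned one byte at a time:
        double perFile = perOffset * 10.0 * 1024 * 1024;

        System.out.printf("per offset: %.3e, per 10MiB file: %.3e%n",
                perOffset, perFile);
        // ~2e-21 per offset, ~2e-14 per file on random bytes -- though a
        // recycled file is not random: it holds records that were once valid.
    }
}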

Second doubt:

In the case of a power outage where only part of a record is written to
disk, e.g. the recordType and fileId are written successfully but the rest
is not, might we read the old record's body even though the fileId is the
latest one? (The sketch below tries to make this scenario concrete.)
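
To make the torn-write scenario concrete, here is a minimal simulation.
The field layout and names are assumptions loosely following the rough
format described later in this thread, not the real Artemis on-disk format.

import java.nio.ByteBuffer;

// Simulates a torn write: only the first bytes of a new record header
// reach disk before power loss, on top of a stale record left over from
// the file's previous use. Layout and names are hypothetical.
public class TornWriteDemo {
    static final byte ADD_RECORD = 11;  // hypothetical record-type value

    public static void main(String[] args) {
        ByteBuffer file = ByteBuffer.allocate(64);

        // A stale but self-consistent record from when this file was id 3:
        file.put(ADD_RECORD);     // type
        file.putInt(3);           // old fileId
        file.put((byte) 0);       // compact count
        file.putLong(42L);        // old recordId
        file.putInt(4);           // recordSize (4-byte body)
        file.put((byte) 31);      // old userRecordType
        file.putInt(0xCAFEBABE);  // old body
        file.putInt(27);          // trailing total size: 1+4+1+8+4+1+4+4

        // Power fails after only type + fileId (5 bytes) of a new record
        // are written over the old one; the file has been recycled as id 7:
        file.put(0, ADD_RECORD);
        file.putInt(1, 7);

        // Everything past byte 5 is old data, yet it is internally
        // consistent, so a scanner checking type, fileId, and the size
        // fields would accept a chimera: current fileId, stale body.
        System.out.printf("type=%d fileId=%d -> looks current, body is stale%n",
                file.get(0), file.getInt(1));
    }
}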

Can anyone shed some light on this, please? Thanks.

Re: Invalid record type during startup after power outage

Posted by yw yw <wy...@gmail.com>.
Thanks for the elaborate explanation :) The stack trace showed that
readJournalFile had finished (meaning the verification on total-record-size
succeeded) but the userRecordType was wrong. I looked at ARTEMIS-2053; it
seems to fix a wrong ordering of journal files, which could lose data but
would not affect file integrity. Do I understand correctly?

I'm just curious why we continue scanning from the next byte (rather than
skipping the whole file) when the fields you referred to don't qualify.
What was your reasoning when you wrote it? And do you think there is a
small chance that a wrong record is assembled, with the recordSize and
total-record-size matching each other, if we keep looking?

Clebert Suconic <cl...@gmail.com> wrote on Tue, Jul 16, 2019 at 10:45 PM:

> [quoted text trimmed; Clebert's messages appear in full below]

Re: Invalid record type during startup after power outage

Posted by Clebert Suconic <cl...@gmail.com>.
A little correction: that description was for a simple AddData.


Look at the extensions of JournalInternalRecord for a better
description of the data format.

On Tue, Jul 16, 2019 at 10:43 AM Clebert Suconic
<cl...@gmail.com> wrote:
> [quoted text trimmed; this message appears in full below]



-- 
Clebert Suconic

Re: Invalid record type during startup after power outage

Posted by Clebert Suconic <cl...@gmail.com>.
It seems you are using 2.4.0. It does not seem related, but this fix would
be important to have on your system:

commit 6b1abd1aadc2d097e3baefeb312c8e68092876ba
Author: Clebert Suconic <cl...@apache.org>
Date:   Sun Aug 26 15:55:56 2018 -0400
    ARTEMIS-2053 avoiding data loss after compacting



However, let me explain how the record scanning works.

The format of the data is, roughly:
JOURNAL-RECORD-TYPE (byte)
FILE-ID (int)
compact-count (byte)
recordID (long)
recordSize, from persisters (int)
userRecordType
total-record-size
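
As an illustration only, here is a minimal sketch of reading that layout
from a buffer. The names are made up, the width of userRecordType and the
placement of the user data are assumptions, and the real parsing lives in
JournalImpl and the JournalInternalRecord subclasses:

import java.nio.ByteBuffer;

public class RecordSketch {
    static void readRecord(ByteBuffer buf) {
        byte journalRecordType = buf.get();  // JOURNAL-RECORD-TYPE (byte)
        int fileId = buf.getInt();           // FILE-ID (int)
        byte compactCount = buf.get();       // compact-count (byte)
        long recordId = buf.getLong();       // recordID (long)
        int recordSize = buf.getInt();       // recordSize, from persisters (int)
        byte userRecordType = buf.get();     // userRecordType (width assumed)
        byte[] body = new byte[recordSize];  // user data (placement assumed)
        buf.get(body);
        int totalRecordSize = buf.getInt();  // total-record-size, trailing check
        System.out.printf("type=%d fileId=%d cc=%d id=%d userType=%d size=%d total=%d%n",
                journalRecordType, fileId, compactCount, recordId,
                userRecordType, recordSize, totalRecordSize);
    }
}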



When we recycle a file, we simply change the fileID on the header; when we
load the file, the scan is done by matching the record type, and at the
end of the record the total-record-size has to match the record's size.

I did this to avoid filling the file with zeros, which was a costly
operation at the time (I wrote this when disks were still mechanical), and
it is still a costly operation today.

So, to wrongly trick the scan you would need a byte matching a record
type, followed by an int matching the fileID, with the recordSize and
total-record-size matching each other.
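
A simplified sketch of that byte-by-byte resync, using the same
hypothetical layout as the sketch above, shows the checks that all have to
line up before a position is accepted as a record (this is not the real
JournalImpl logic):

import java.nio.ByteBuffer;

public class ScanSketch {
    static final byte ADD_RECORD = 11;                      // hypothetical type value
    static final int OVERHEAD = 1 + 4 + 1 + 8 + 4 + 1 + 4;  // all non-body bytes

    static int findNextRecord(ByteBuffer buf, int expectedFileId) {
        for (int pos = buf.position(); pos + OVERHEAD <= buf.limit(); pos++) {
            if (buf.get(pos) != ADD_RECORD) continue;             // 1. valid type?
            if (buf.getInt(pos + 1) != expectedFileId) continue;  // 2. fileId matches?
            int recordSize = buf.getInt(pos + 14);  // after type+fileId+cc+recordId
            if (recordSize < 0) continue;
            int trailerPos = pos + OVERHEAD - 4 + recordSize;     // where total size sits
            if (trailerPos + 4 > buf.limit()) continue;
            if (buf.getInt(trailerPos) != OVERHEAD + recordSize)  // 3. sizes agree?
                continue;
            return pos;  // every field lines up: treat this position as a record
        }
        return -1;       // no plausible record before the end of the file
    }
}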


Perhaps the loading is skipping the verification on total-record-size and
that sneaked an invalid record in?



Or perhaps the fact that you are missing the commit I mentioned caused the issue?



On Mon, Jul 1, 2019 at 5:57 AM yw yw <wy...@gmail.com> wrote:
> [quoted text trimmed; the original message appears at the top of this thread]



-- 
Clebert Suconic

Re: Invalid record type during startup after power outage

Posted by yw yw <wy...@gmail.com>.
Sorry, I forgot the stack-trace:

2019-07-01 01:32:37,574 ERROR [org.apache.activemq.artemis.core.server] AMQ224000: Failure in initialisation: java.lang.IllegalStateException: Invalid record type -93
at org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.loadMessageJournal(AbstractJournalStorageManager.java:1137) [artemis-server-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:2575) [artemis-server-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:2336) [artemis-server-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.server.impl.LiveOnlyActivation.run(LiveOnlyActivation.java:71) [artemis-server-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.internalStart(ActiveMQServerImpl.java:543) [artemis-server-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.start(ActiveMQServerImpl.java:482) [artemis-server-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.integration.FileBroker.start(FileBroker.java:111) [artemis-cli-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.cli.commands.Run.execute(Run.java:82) [artemis-cli-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.cli.Artemis.internalExecute(Artemis.java:149) [artemis-cli-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.cli.Artemis.execute(Artemis.java:96) [artemis-cli-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.cli.Artemis.execute(Artemis.java:123) [artemis-cli-2.4.0.jar:2.4.0]


Unluckily, the operations team deleted the data journals and restarted the
broker, so we can do nothing but read the code to find the root cause :(



Justin Bertram <jb...@apache.org> wrote on Mon, Jul 1, 2019 at 10:37 PM:

> [quoted text trimmed; Justin's message appears in full below]

Re: Invalid record type during startup after power outage

Posted by Justin Bertram <jb...@apache.org>.
Do you have the full stack-trace of the exception? Also, do you still have
the damaged journal? It would be good to have it to test with.


Justin

On Mon, Jul 1, 2019 at 4:57 AM yw yw <wy...@gmail.com> wrote:

> [quoted text trimmed; the original message appears at the top of this thread]