Posted to users@activemq.apache.org by "Nordstrom, Karl" <kx...@psu.edu> on 2022/06/07 16:10:14 UTC

Corrupt journal record

Hello,

I check activemq.log for messages like these to determine if activemq has corrupt data files:

[kxn2@amq-a02 scheduler]$ sudo grep "Failed to start job scheduler store" /opt/local/activemq/data/activemq.log | head -1
2022-06-03 16:00:46,670 | ERROR | Failed to start job scheduler store: JobSchedulerStore: /opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler | org.apache.activemq.broker.BrokerService | main

Then I move scheduleDB files after stopping activemq.service on both brokers.

cd /opt/local/activemq/data/kahadb/scheduler

sudo mv scheduleDB.data scheduleDB.data.`date +%Y%m%d`; sudo mv scheduleDB.redo scheduleDB.redo.`date +%Y%m%d`
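
Put together, the recovery procedure looks roughly like this (service name and paths are from our setup; a sketch, not a polished script):

# Run on both brokers: stop the service before touching the store.
sudo systemctl stop activemq.service

cd /opt/local/activemq/data/kahadb/scheduler
stamp=$(date +%Y%m%d)
sudo mv scheduleDB.data "scheduleDB.data.$stamp"
sudo mv scheduleDB.redo "scheduleDB.redo.$stamp"

# Start again and let the broker rebuild the scheduler index from the db-*.log journal.
sudo systemctl start activemq.service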

After starting ActiveMQ, 7,500,000 entries were recovered. That takes about 2.5 hours and usually works, but this time there was a corrupt journal file.

[kxn2@amq-a02 data]$ grep Corrupt activemq.log*

2022-06-02 07:55:40,066 | WARN  | Corrupt journal records found in '/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler/db-1179.log' between offsets: 11558626..11559784 | org.apache.activemq.store.kahadb.disk.journal.Journal | main

We eventually stopped activemq, moved the scheduler/ directory aside, and started activemq.

Part of the problem is that it took 2.5 hours to recover 7.5M entries and then fail.

After we restarted, we have this.
[kxn2@amq-a02 scheduler]$ ls -l db-*.log | wc -l
1
[kxn2@amq-a02 scheduler]$ wc -l db-1.log
50,067 db-1.log

Before we had lots of log files.
[kxn2@amq-a02 scheduler.bkup]$ ls -l db-*.log | wc -l
125
[kxn2@amq-a02 scheduler.bkup]$ wc -l db-*.log
...
8,697,209 total
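
(Side note: the db-*.log files are fixed-size binary journal segments -- 32 MB each by default, if I remember right -- so line counts are only a rough proxy; sizes give a clearer picture.)

ls -lh db-*.log | head -3
du -sh .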

How can I get activemq to clean up these logs?

P.S.

I did eventually test removing the journal entry at byte range 11558626..11559784.
I could get ActiveMQ to start after that.
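
For anyone else who hits this, here is that cut as a scripted sketch using head -c and tail -c (I actually did it with head -c, grep and vi). Whether the reported end offset is inclusive is a guess, so work on a copy first:

cd /opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler
sudo cp -p db-1179.log db-1179.log.orig
# keep everything before the corrupt range (treating the reported offsets as 0-based) ...
sudo sh -c 'head -c 11558626 db-1179.log.orig > db-1179.log'
# ... then append everything after it (tail -c +N counts from byte 1)
sudo sh -c 'tail -c +11559785 db-1179.log.orig >> db-1179.log'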


---

Karl Nordström

Systems Administrator

Penn State IT | Application Platforms

Re: ActiveMQ 5.16.4 Data Corruption

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Karl,

IMHO, the problem is related to the storage. The only time I saw such
corruption, it was due to a script that zipped all the log files in kahadb
(the guys thought they were log files, not journal files).

Double-check your storage, and maybe add a lease locker config; I think the
problem is more around that.
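
A quick way to spot that particular failure mode is to look for anything a compression script might have left behind in the data directories (path taken from your log output):

find /opt/local/apache-activemq-5.16.4/data -name '*.gz' -o -name '*.zip' -o -name '*.bz2'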

Regards
JB

On Thu, Jun 16, 2022 at 7:58 PM Nordstrom, Karl <kx...@psu.edu> wrote:
>
> Matt, et al.,
>
> Kahadb is on shared NFS scaled-out NAS storage. Sometimes, ActiveMQ loses its NFS mounts when the Storage Team upgrades the OS on the storage nodes. They upgrade one node at a time, and the NFS mount must migrate to another storage node. Supposedly, it can take up to 30 seconds to migrate. The IP address of the new storage node is different from that of the original storage node. We avoid data corruption by stopping activemq.service on the broker that is in slave mode during the storage upgrade.
>
> Unfortunately, I did not check for I/O errors earlier. I don't have /var/log/messages before 2022/06/10. If this happens again, I will certainly follow your advice.
>
>
> ---
>
> Karl Nordström
>
> Systems Administrator
>
> Penn State IT | Application Platforms
>
> ________________________________
> From: Matt Pavlovich <ma...@gmail.com>
> Sent: Wednesday, June 15, 2022 6:24 PM
> To: users@activemq.apache.org <us...@activemq.apache.org>
> Subject: Re: ActiveMQ 5.16.4 Data Corruption
>
> Karl-
>
> Is this on a local disk, RAID, SAN or NAS? First step is to confirm there was no disk corruption -- check your syslog and dmesg output for any I/O error messages.
>
> Thanks,
> Matt
>
> > On Jun 14, 2022, at 2:48 PM, Nordstrom, Karl <kx...@psu.edu> wrote:
> >
> > Hello,
> >
> > We have activemq-5.16.4 and java-1.8.0-openjdk.x86_64 1:1.8.0.332.b09-1.el7_9 running on rhel7.
> >
> > The following was done on our acceptance cluster.
> >
> > I check activemq.log for messages to determine if activemq has corrupt data files:
> >
> > [kxn2@amq-a02 scheduler]$ sudo grep "Failed to start job scheduler store" /opt/local/activemq/data/activemq.log | head -1
> > 2022-06-03 16:00:46,670 | ERROR | Failed to start job scheduler store: JobSchedulerStore: /opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler | org.apache.activemq.broker.BrokerService | main
> >
> > Then I move scheduleDB files after stopping activemq.service on both brokers.
> >
> > cd /opt/local/activemq/data/kahadb/scheduler
> >
> > sudo mv scheduleDB.data scheduleDB.data.`date +%Y%m%d`; sudo mv scheduleDB.redo scheduleDB.redo.`date +%Y%m%d`
> >
> > After starting ActiveMQ, 7,500,000 entries were recovered, but it failed with ERROR | Failed to start job scheduler store.
> >
> > There was a corrupt journal file.
> >
> > [kxn2@amq-a02 data]$ grep Corrupt activemq.log*
> >
> > 2022-06-02 07:55:40,066 | WARN  | Corrupt journal records found in '/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler/db-1179.log' between offsets: 11558626..11559784 | org.apache.activemq.store.kahadb.disk.journal.Journal | main
> >
> > We tried starting activemq without the db-1179.log file and with an empty db-1179.log file. ActiveMQ complained about both.
> >
> > We eventually stopped activemq, renamed the scheduler/ directory, and started activemq.
> >
> > After we restarted, we have one db-*.log file with 50K messages.
> >
> > [kxn2@amq-a02 scheduler]$ wc -l db-1.log
> > 50,067 db-1.log
> >
> > Before we had 125 log files and 8,697,209 messages!
> >
> > [kxn2@amq-a02 scheduler.bkup]$ wc -l db-*.log
> > ...
> > 8,697,209 total
> >
> > So, we have millions of messages that we probably do not need. It took 2.5 hours to recover 7.5M entries before it failed, likely due to the corrupt record.
> >
> > How can I get activemq to clean up these logs, so this recovery doesn't take so long?
> >
> > How can I correct the data corruption?
> >
> > For a test, I did remove the range of the file between offsets: 11558626..11559784. I used the "head -c" command, grep and vi to do that. ActiveMQ did start.
> >
> > I am hoping that this doesn't happen in production, because it won't be acceptable to lose messages to get activemq to start up.
> >
> > ---
> >
> > Karl Nordström
> >
> > Systems Administrator
> >
> > Penn State IT | Application Platforms
>

Re: ActiveMQ 5.16.4 Data Corruption

Posted by "Nordstrom, Karl" <kx...@psu.edu>.
Matt, et al.,

Kahadb is on shared NFS scaled-out NAS storage. Sometimes, ActiveMQ loses its NFS mounts when the Storage Team upgrades the OS on the storage nodes. They upgrade one node at a time, and the NFS mount must migrate to another storage node. Supposedly, it can take up to 30 seconds to migrate. The IP address of the new storage node is different from that of the original storage node. We avoid data corruption by stopping activemq.service on the broker that is in slave mode during the storage upgrade.
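
The manual step during those windows is just stopping and restarting the service on whichever broker is currently the slave, e.g.:

# On the slave broker, before the storage node upgrade starts:
sudo systemctl stop activemq.service
# ... after the NFS mount has migrated and the upgrade is finished:
sudo systemctl start activemq.service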

Unfortunately, I did not check for I/O errors earlier. I don't have /var/log/messages before 2022/06/10. If this happens again, I will certainly follow your advice.
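
For next time, roughly the checks you suggested, noted here for our runbook (RHEL 7 log locations assumed):

# Kernel ring buffer and syslog, filtered for disk/NFS trouble around the corruption window.
sudo dmesg -T | grep -iE 'i/o error|nfs'
sudo grep -iE 'i/o error|nfs' /var/log/messages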


---

Karl Nordström

Systems Administrator

Penn State IT | Application Platforms

________________________________
From: Matt Pavlovich <ma...@gmail.com>
Sent: Wednesday, June 15, 2022 6:24 PM
To: users@activemq.apache.org <us...@activemq.apache.org>
Subject: Re: ActiveMQ 5.16.4 Data Corruption

Karl-

Is this on a local disk, RAID, SAN or NAS? First step is to confirm there was no disk corruption -- check your syslog and dmesg output for any I/O error messages.

Thanks,
Matt

> On Jun 14, 2022, at 2:48 PM, Nordstrom, Karl <kx...@psu.edu> wrote:
>
> Hello,
>
> We have activemq-5.16.4 and java-1.8.0-openjdk.x86_64 1:1.8.0.332.b09-1.el7_9 running on rhel7.
>
> The following was done on our acceptance cluster.
>
> I check activemq.log for messages to determine if activemq has corrupt data files:
>
> [kxn2@amq-a02 scheduler]$ sudo grep "Failed to start job scheduler store" /opt/local/activemq/data/activemq.log | head -1
> 2022-06-03 16:00:46,670 | ERROR | Failed to start job scheduler store: JobSchedulerStore: /opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler | org.apache.activemq.broker.BrokerService | main
>
> Then I move scheduleDB files after stopping activemq.service on both brokers.
>
> cd /opt/local/activemq/data/kahadb/scheduler
>
> sudo mv scheduleDB.data scheduleDB.data.`date +%Y%m%d`; sudo mv scheduleDB.redo scheduleDB.redo.`date +%Y%m%d`
>
> After starting ActiveMQ, 7,500,000 entries were recovered, but it failed with ERROR | Failed to start job scheduler store.
>
> There was a corrupt journal file.
>
> [kxn2@amq-a02 data]$ grep Corrupt activemq.log*
>
> 2022-06-02 07:55:40,066 | WARN  | Corrupt journal records found in '/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler/db-1179.log' between offsets: 11558626..11559784 | org.apache.activemq.store.kahadb.disk.journal.Journal | main
>
> We tried starting activemq without the db-1179.log file and with an empty db-1179.log file. ActiveMQ complained about both.
>
> We eventually stopped activemq, renamed the scheduler/ directory, and started activemq.
>
> After we restarted, we have one db-*.log file with 50K messages.
>
> [kxn2@amq-a02 scheduler]$ wc -l db-1.log
> 50,067 db-1.log
>
> Before we had 125 log files and 8,697,209 messages!
>
> [kxn2@amq-a02 scheduler.bkup]$ wc -l db-*.log
> ...
> 8,697,209 total
>
> So, we have millions of messages that we probably do not need. It took 2.5 hours to recover 7.5M entries before it failed, likely due to the corrupt record.
>
> How can I get activemq to clean up these logs, so this recovery doesn't take so long?
>
> How can I correct the data corruption?
>
> For a test, I did remove the range of the file between offsets: 11558626..11559784. I used the "head -c" command, grep and vi to do that. ActiveMQ did start.
>
> I am hoping that this doesn't happen in production, because it won't be acceptable to lose messages to get activemq to start up.
>
> ---
>
> Karl Nordström
>
> Systems Administrator
>
> Penn State IT | Application Platforms


Re: ActiveMQ 5.16.4 Data Corruption

Posted by Matt Pavlovich <ma...@gmail.com>.
Karl-

Is this on a local disk, RAID, SAN or NAS? First step is to confirm there was no disk corruption -- check your syslog and dmesg output for any I/O error messages.

Thanks,
Matt

> On Jun 14, 2022, at 2:48 PM, Nordstrom, Karl <kx...@psu.edu> wrote:
> 
> Hello,
> 
> We have activemq-5.16.4 and java-1.8.0-openjdk.x86_64 1:1.8.0.332.b09-1.el7_9 running on rhel7.
> 
> The following was done on our acceptance cluster.
> 
> I check activemq.log for messages to determine if activemq has corrupt data files:
> 
> [kxn2@amq-a02 scheduler]$ sudo grep "Failed to start job scheduler store" /opt/local/activemq/data/activemq.log | head -1
> 2022-06-03 16:00:46,670 | ERROR | Failed to start job scheduler store: JobSchedulerStore: /opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler | org.apache.activemq.broker.BrokerService | main
> 
> Then I move scheduleDB files after stopping activemq.service on both brokers.
> 
> cd /opt/local/activemq/data/kahadb/scheduler
> 
> sudo mv scheduleDB.data scheduleDB.data.`date +%Y%m%d`; sudo mv scheduleDB.redo scheduleDB.redo.`date +%Y%m%d`
> 
> After starting ActiveMQ, 7,500,000 entries were recovered, but it failed with ERROR | Failed to start job scheduler store.
> 
> There was a corrupt journal file.
> 
> [kxn2@amq-a02 data]$ grep Corrupt activemq.log*
> 
> 2022-06-02 07:55:40,066 | WARN  | Corrupt journal records found in '/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler/db-1179.log' between offsets: 11558626..11559784 | org.apache.activemq.store.kahadb.disk.journal.Journal | main
> 
> We tried starting activemq without the db-1179.log file and with an empty db-1179.log file. ActiveMQ complained about both.
> 
> We eventually stopped activemq, renamed the scheduler/ directory, and started activemq.
> 
> After we restarted, we have one db-*.log file with 50K messages.
> 
> [kxn2@amq-a02 scheduler]$ wc -l db-1.log
> 50,067 db-1.log
> 
> Before we had 125 log files and 8,697,209 messages!
> 
> [kxn2@amq-a02 scheduler.bkup]$ wc -l db-*.log
> ...
> 8,697,209 total
> 
> So, we have millions of messages that we probably do not need. It took 2.5 hours to recover 7.5M entries before it failed, likely due to the corrupt record.
> 
> How can I get activemq to clean up these logs, so this recovery doesn't take so long?
> 
> How can I correct the data corruption?
> 
> For a test, I did remove the range of the file between offsets: 11558626..11559784. I used the "head -c" command, grep and vi to do that. ActiveMQ did start.
> 
> I am hoping that this doesn't happen in production, because it won't be acceptable to lose messages to get activemq to start up.
> 
> ---
> 
> Karl Nordström
> 
> Systems Administrator
> 
> Penn State IT | Application Platforms


ActiveMQ 5.16.4 Data Corruption

Posted by "Nordstrom, Karl" <kx...@psu.edu>.
Hello,

We have activemq-5.16.4 and java-1.8.0-openjdk.x86_64 1:1.8.0.332.b09-1.el7_9 running on rhel7.

The following was done on our acceptance cluster.

I check activemq.log for messages to determine if activemq has corrupt data files:

[kxn2@amq-a02 scheduler]$ sudo grep "Failed to start job scheduler store" /opt/local/activemq/data/activemq.log | head -1
2022-06-03 16:00:46,670 | ERROR | Failed to start job scheduler store: JobSchedulerStore: /opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler | org.apache.activemq.broker.BrokerService | main

Then I move scheduleDB files after stopping activemq.service on both brokers.

cd /opt/local/activemq/data/kahadb/scheduler

sudo mv scheduleDB.data scheduleDB.data.`date +%Y%m%d`; sudo mv scheduleDB.redo scheduleDB.redo.`date +%Y%m%d`

After starting ActiveMQ, 7,500,000 entries were recovered, but it failed with ERROR | Failed to start job scheduler store.

There was a corrupt journal file.

[kxn2@amq-a02 data]$ grep Corrupt activemq.log*

2022-06-02 07:55:40,066 | WARN  | Corrupt journal records found in '/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler/db-1179.log' between offsets: 11558626..11559784 | org.apache.activemq.store.kahadb.disk.journal.Journal | main

We tried starting activemq without the db-1179.log file and with an empty db-1179.log file. ActiveMQ complained about both.

We eventually stopped activemq, renamed the scheduler/ directory, and started activemq.

After we restarted, we have one db-*.log file with 50K messages.

[kxn2@amq-a02 scheduler]$ wc -l db-1.log
50,067 db-1.log

Before we had 125 log files and 8,697,209 messages!

[kxn2@amq-a02 scheduler.bkup]$ wc -l db-*.log
...
8,697,209 total

So, we have millions of messages that we probably do not need. It took 2.5 hours to recover 7.5M entries before it failed, likely due to the corrupt record.

How can I get activemq to clean up these logs, so this recovery doesn't take so long?

How can I correct the data corruption?

For a test, I did remove the range of the file between offsets: 11558626..11559784. I used the "head -c" command, grep and vi to do that. ActiveMQ did start.

I am hoping that this doesn't happen in production, because it won't be acceptable to lose messages to get activemq to start up.

---

Karl Nordström

Systems Administrator

Penn State IT | Application Platforms