You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@activemq.apache.org by Christian Schneider <ch...@die-schneider.net> on 2016/02/15 13:32:26 UTC

Failover very slow with kahadb while restart of master is fast

At a customer we are experiencing a very strange behaviour of activemq.

We have two brokers on separate machines A and B. Storage is on a third 
machine and imported using nfsv4.
Kahadb contains about 18 GB of journal files.

Scenario 1:
- Stop B
- Stop A (master now)
- Start A

The start of A takes about 50 seconds.

Scnario 2:
- Stop A
- Stop B (master now)
- Start A

The start of A takes about 5 minutes.

In the slow case the log at debug level shows a big gap between adding 
the last queue (12:04) and running the ActiveMQ Journal Checkpoint 
Worker again after the queue creation (12:09).
I have no idea what activemq is doing in the mean time but it seems to 
take several minutes.

Can anyone explain this behaviour or explain what activemq does in the 
mean time? I would also be happy about pointer to the code.
What I do not understand is why it makes a difference if the master is 
started again or if the slave is becoming master. As they both share the 
same kahadb over nfs I would expect the same startup
behaviour.

Christian

-----

Slow start
2016-02-15 12:03:01,943 | INFO  | Refreshing 
org.apache.activemq.xbean.XBeanBrokerFactory$1@17b14695: startup date 
[Mon Feb 15 12:03:01 CET 2016]; root of context hierarchy | 
org.apache.activemq.xbean.XBeanBrokerFactory$1 | main
...
2016-02-15 12:03:03,018 | INFO  | 
PListStore:[/path/activemq/data/A/tmp_storage] started | 
org.apache.activemq.store.kahadb.plist.PListStoreImpl | main
2016-02-15 12:03:03,033 | INFO  | ignoring zero length, partially 
initialised journal data file: db-1.log number = 1 , length = 0 | 
org.apache.activemq.store.kahadb.disk.journal.Journal | main
2016-02-15 12:03:03,177 | INFO  | JobSchedulerStore:path/scheduler 
started | 
org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl | main
2016-02-15 12:03:03,180 | INFO  | JobScheduler using directory: 
path/scheduler | org.apache.activemq.broker.BrokerService | main
2016-02-15 12:03:03,335 | INFO  | Using Persistence Adapter: 
KahaDBPersistenceAdapter[/path] | 
org.apache.activemq.broker.BrokerService | main
2016-02-15 12:03:03,350 | INFO  | JMX consoles can connect to 
service:jmx:rmi:///jndi/rmi://localhost:1617/jmxrmi | 
org.apache.activemq.broker.jmx.ManagementContext | JMX connector
2016-02-15 12:03:05,198 | INFO  | Corrupt journal records found in 
'path/db-1122.log' between offsets: 17893102..17893732 | 
org.apache.activemq.store.kahadb.disk.journal.Journal | main
2016-02-15 12:03:47,925 | INFO  | KahaDB is version 5 | 
org.apache.activemq.store.kahadb.MessageDatabase | main
....
2016-02-15 12:03:49,233 | INFO  | Recovering from the journal ... | 
org.apache.activemq.store.kahadb.MessageDatabase | main
2016-02-15 12:03:49,237 | INFO  | Recovery replayed 20 operations from 
the journal in 0.094 seconds. | 
org.apache.activemq.store.kahadb.MessageDatabase | main
2016-02-15 12:04:26,863 | DEBUG | Checkpoint started. | 
org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal 
Checkpoint Worker
2016-02-15 12:04:26,932 | DEBUG | Checkpoint done. | 
org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal 
Checkpoint Worker
2016-02-15 12:04:26,940 | INFO  | Apache ActiveMQ 5.10.2 (A, 
ID:A-59167-1455534183070-1:1) is starting | 
org.apache.activemq.broker.BrokerService | main
...
2016-02-15 12:04:30,289 | DEBUG | A adding destination: queue://q1.DLQ | 
org.apache.activemq.broker.region.AbstractRegion | main
2016-02-15 12:04:57,026 | DEBUG | queue://q1.DLQ expiring messages .. | 
org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
2016-02-15 12:04:57,029 | DEBUG | q1.DLQ toPageIn: 5, Inflight: 0, 
pagedInMessages.size 0, pagedInPendingDispatch.size 0, enqueueCount: 0, 
dequeueCount: 0, memUsage:0 | org.apache.activemq.broker.region.Queue | 
ActiveMQ Broker[A] Scheduler
2016-02-15 12:09:20,440 | DEBUG | Checkpoint started. | 
org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal 
Checkpoint Worker
2016-02-15 12:09:20,476 | DEBUG | Checkpoint done. | 
org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal 
Checkpoint Worker
2016-02-15 12:09:20,488 | DEBUG | queue://q1.DLQ expiring messages done. 
| org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler


-- 
Christian Schneider
http://www.liquid-reality.de

Open Source Architect
http://www.talend.com

Re: Failover very slow with kahadb while restart of master is fast

Posted by Tim Bain <tb...@alumni.duke.edu>.

Sure, YOU know it was a clean shutdown, but how does your slave broker know
which type it was?

Methods to signal that fact could be added, but none currently exists in
the code.
On Apr 29, 2016 7:14 AM, "Christian Schneider" <ch...@die-schneider.net>
wrote:

> I do not assume that the restart will be clean when the master dies.
>
> In our case the brokers are shutdown orderly during a regular maintenance
> where some of the customer software is updated on the system.
> So in the case were we experienced the slow start it was a clean shutdown.
>
> Christian
>
> On 29.04.2016 15:03, Tim Bain wrote:
>
>> Also, your question makes the assumption that files your broker stumbles
>> upon when it becomes master were written by a previous master that
>> shutdown
>> cleanly, which is a rather optimistic assumption to make.
>> On Apr 29, 2016 7:01 AM, "Tim Bain" <tb...@alumni.duke.edu> wrote:
>>
>>
> --
> Christian Schneider
> http://www.liquid-reality.de
>
> Open Source Architect
> http://www.talend.com
>
>

Re: Failover very slow with kahadb while restart of master is fast

Posted by Christian Schneider <ch...@die-schneider.net>.

I do not assume that the restart will be clean when the master dies.

In our case the brokers are shutdown orderly during a regular 
maintenance where some of the customer software is updated on the system.
So in the case were we experienced the slow start it was a clean shutdown.

Christian

On 29.04.2016 15:03, Tim Bain wrote:
> Also, your question makes the assumption that files your broker stumbles
> upon when it becomes master were written by a previous master that shutdown
> cleanly, which is a rather optimistic assumption to make.
> On Apr 29, 2016 7:01 AM, "Tim Bain" <tb...@alumni.duke.edu> wrote:
>

-- 
Christian Schneider
http://www.liquid-reality.de

Open Source Architect
http://www.talend.com

Re: Failover very slow with kahadb while restart of master is fast

Posted by Tim Bain <tb...@alumni.duke.edu>.

Also, your question makes the assumption that files your broker stumbles
upon when it becomes master were written by a previous master that shutdown
cleanly, which is a rather optimistic assumption to make.
On Apr 29, 2016 7:01 AM, "Tim Bain" <tb...@alumni.duke.edu> wrote:

> By reading the code (
> http://www.grepcode.com/file/repo1.maven.org/maven2/org.apache.activemq/activemq-kahadb-store/5.11.1/org/apache/activemq/store/kahadb/disk/journal/Journal.java#Journal.recoveryCheck%28org.apache.activemq.store.kahadb.disk.journal.DataFile%29),
> I found that we're reading the data files to confirm that they're not
> corrupted.  That's the beauty of an open-source project: all of the source
> code is available, so you can read it yourself and see what's being done.
>
> You should confirm that your OS will properly invalidate its cache entry
> for a given file when that file changes; having KahaDB read stale cache
> entries would be a very bad thing.
>
> Tim
> We have found the main cause for this issue. Current Master and new Master
> seem to do the exact same things on startup.
> They read all the journals.
>
> It seems that the OS is caching the files in memory in some cases.
> The master seems to have all the jorunal files in the cache. So when it
> restarts it is fast.
> The new master seems to not have the journals in cache so they have to be
> transfered over the network (~10MB/s) in our case. So this
> causes the difference in startup time.
>
> One big question remains:
> Why are the whole journals read? Shouldn't a master that is shut down in a
> controlled way leave a clean state, so on startup only the meta data index
> is read?
>
> Christian
>
> On 15.02.2016 13:32, Christian Schneider wrote:
>
>> At a customer we are experiencing a very strange behaviour of activemq.
>>
>> We have two brokers on separate machines A and B. Storage is on a third
>> machine and imported using nfsv4.
>> Kahadb contains about 18 GB of journal files.
>>
>> Scenario 1:
>> - Stop B
>> - Stop A (master now)
>> - Start A
>>
>> The start of A takes about 50 seconds.
>>
>> Scnario 2:
>> - Stop A
>> - Stop B (master now)
>> - Start A
>>
>> The start of A takes about 5 minutes.
>>
>> In the slow case the log at debug level shows a big gap between adding
>> the last queue (12:04) and running the ActiveMQ Journal Checkpoint Worker
>> again after the queue creation (12:09).
>> I have no idea what activemq is doing in the mean time but it seems to
>> take several minutes.
>>
>> Can anyone explain this behaviour or explain what activemq does in the
>> mean time? I would also be happy about pointer to the code.
>> What I do not understand is why it makes a difference if the master is
>> started again or if the slave is becoming master. As they both share the
>> same kahadb over nfs I would expect the same startup
>> behaviour.
>>
>> Christian
>>
>> -----
>>
>> Slow start
>> 2016-02-15 12:03:01,943 | INFO  | Refreshing
>> org.apache.activemq.xbean.XBeanBrokerFactory$1@17b14695: startup date
>> [Mon Feb 15 12:03:01 CET 2016]; root of context hierarchy |
>> org.apache.activemq.xbean.XBeanBrokerFactory$1 | main
>> ...
>> 2016-02-15 12:03:03,018 | INFO  |
>> PListStore:[/path/activemq/data/A/tmp_storage] started |
>> org.apache.activemq.store.kahadb.plist.PListStoreImpl | main
>> 2016-02-15 12:03:03,033 | INFO  | ignoring zero length, partially
>> initialised journal data file: db-1.log number = 1 , length = 0 |
>> org.apache.activemq.store.kahadb.disk.journal.Journal | main
>> 2016-02-15 12:03:03,177 | INFO  | JobSchedulerStore:path/scheduler
>> started | org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl
>> | main
>> 2016-02-15 12:03:03,180 | INFO  | JobScheduler using directory:
>> path/scheduler | org.apache.activemq.broker.BrokerService | main
>> 2016-02-15 12:03:03,335 | INFO  | Using Persistence Adapter:
>> KahaDBPersistenceAdapter[/path] | org.apache.activemq.broker.BrokerService
>> | main
>> 2016-02-15 12:03:03,350 | INFO  | JMX consoles can connect to
>> service:jmx:rmi:///jndi/rmi://localhost:1617/jmxrmi |
>> org.apache.activemq.broker.jmx.ManagementContext | JMX connector
>> 2016-02-15 12:03:05,198 | INFO  | Corrupt journal records found in
>> 'path/db-1122.log' between offsets: 17893102..17893732 |
>> org.apache.activemq.store.kahadb.disk.journal.Journal | main
>> 2016-02-15 12:03:47,925 | INFO  | KahaDB is version 5 |
>> org.apache.activemq.store.kahadb.MessageDatabase | main
>> ....
>> 2016-02-15 12:03:49,233 | INFO  | Recovering from the journal ... |
>> org.apache.activemq.store.kahadb.MessageDatabase | main
>> 2016-02-15 12:03:49,237 | INFO  | Recovery replayed 20 operations from
>> the journal in 0.094 seconds. |
>> org.apache.activemq.store.kahadb.MessageDatabase | main
>> 2016-02-15 12:04:26,863 | DEBUG | Checkpoint started. |
>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>> Checkpoint Worker
>> 2016-02-15 12:04:26,932 | DEBUG | Checkpoint done. |
>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>> Checkpoint Worker
>> 2016-02-15 12:04:26,940 | INFO  | Apache ActiveMQ 5.10.2 (A,
>> ID:A-59167-1455534183070-1:1) is starting |
>> org.apache.activemq.broker.BrokerService | main
>> ...
>> 2016-02-15 12:04:30,289 | DEBUG | A adding destination: queue://q1.DLQ |
>> org.apache.activemq.broker.region.AbstractRegion | main
>> 2016-02-15 12:04:57,026 | DEBUG | queue://q1.DLQ expiring messages .. |
>> org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
>> 2016-02-15 12:04:57,029 | DEBUG | q1.DLQ toPageIn: 5, Inflight: 0,
>> pagedInMessages.size 0, pagedInPendingDispatch.size 0, enqueueCount: 0,
>> dequeueCount: 0, memUsage:0 | org.apache.activemq.broker.region.Queue |
>> ActiveMQ Broker[A] Scheduler
>> 2016-02-15 12:09:20,440 | DEBUG | Checkpoint started. |
>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>> Checkpoint Worker
>> 2016-02-15 12:09:20,476 | DEBUG | Checkpoint done. |
>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>> Checkpoint Worker
>> 2016-02-15 12:09:20,488 | DEBUG | queue://q1.DLQ expiring messages done.
>> | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
>>
>>
>>
>
> --
> Christian Schneider
> http://www.liquid-reality.de
>
> Open Source Architect
> http://www.talend.com
>
>

Re: Failover very slow with kahadb while restart of master is fast

Posted by Tim Bain <tb...@alumni.duke.edu>.

By reading the code (
http://www.grepcode.com/file/repo1.maven.org/maven2/org.apache.activemq/activemq-kahadb-store/5.11.1/org/apache/activemq/store/kahadb/disk/journal/Journal.java#Journal.recoveryCheck%28org.apache.activemq.store.kahadb.disk.journal.DataFile%29),
I found that we're reading the data files to confirm that they're not
corrupted.  That's the beauty of an open-source project: all of the source
code is available, so you can read it yourself and see what's being done.

You should confirm that your OS will properly invalidate its cache entry
for a given file when that file changes; having KahaDB read stale cache
entries would be a very bad thing.

Tim
We have found the main cause for this issue. Current Master and new Master
seem to do the exact same things on startup.
They read all the journals.

It seems that the OS is caching the files in memory in some cases.
The master seems to have all the jorunal files in the cache. So when it
restarts it is fast.
The new master seems to not have the journals in cache so they have to be
transfered over the network (~10MB/s) in our case. So this
causes the difference in startup time.

One big question remains:
Why are the whole journals read? Shouldn't a master that is shut down in a
controlled way leave a clean state, so on startup only the meta data index
is read?

Christian

On 15.02.2016 13:32, Christian Schneider wrote:

> At a customer we are experiencing a very strange behaviour of activemq.
>
> We have two brokers on separate machines A and B. Storage is on a third
> machine and imported using nfsv4.
> Kahadb contains about 18 GB of journal files.
>
> Scenario 1:
> - Stop B
> - Stop A (master now)
> - Start A
>
> The start of A takes about 50 seconds.
>
> Scnario 2:
> - Stop A
> - Stop B (master now)
> - Start A
>
> The start of A takes about 5 minutes.
>
> In the slow case the log at debug level shows a big gap between adding the
> last queue (12:04) and running the ActiveMQ Journal Checkpoint Worker again
> after the queue creation (12:09).
> I have no idea what activemq is doing in the mean time but it seems to
> take several minutes.
>
> Can anyone explain this behaviour or explain what activemq does in the
> mean time? I would also be happy about pointer to the code.
> What I do not understand is why it makes a difference if the master is
> started again or if the slave is becoming master. As they both share the
> same kahadb over nfs I would expect the same startup
> behaviour.
>
> Christian
>
> -----
>
> Slow start
> 2016-02-15 12:03:01,943 | INFO  | Refreshing
> org.apache.activemq.xbean.XBeanBrokerFactory$1@17b14695: startup date
> [Mon Feb 15 12:03:01 CET 2016]; root of context hierarchy |
> org.apache.activemq.xbean.XBeanBrokerFactory$1 | main
> ...
> 2016-02-15 12:03:03,018 | INFO  |
> PListStore:[/path/activemq/data/A/tmp_storage] started |
> org.apache.activemq.store.kahadb.plist.PListStoreImpl | main
> 2016-02-15 12:03:03,033 | INFO  | ignoring zero length, partially
> initialised journal data file: db-1.log number = 1 , length = 0 |
> org.apache.activemq.store.kahadb.disk.journal.Journal | main
> 2016-02-15 12:03:03,177 | INFO  | JobSchedulerStore:path/scheduler started
> | org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl | main
> 2016-02-15 12:03:03,180 | INFO  | JobScheduler using directory:
> path/scheduler | org.apache.activemq.broker.BrokerService | main
> 2016-02-15 12:03:03,335 | INFO  | Using Persistence Adapter:
> KahaDBPersistenceAdapter[/path] | org.apache.activemq.broker.BrokerService
> | main
> 2016-02-15 12:03:03,350 | INFO  | JMX consoles can connect to
> service:jmx:rmi:///jndi/rmi://localhost:1617/jmxrmi |
> org.apache.activemq.broker.jmx.ManagementContext | JMX connector
> 2016-02-15 12:03:05,198 | INFO  | Corrupt journal records found in
> 'path/db-1122.log' between offsets: 17893102..17893732 |
> org.apache.activemq.store.kahadb.disk.journal.Journal | main
> 2016-02-15 12:03:47,925 | INFO  | KahaDB is version 5 |
> org.apache.activemq.store.kahadb.MessageDatabase | main
> ....
> 2016-02-15 12:03:49,233 | INFO  | Recovering from the journal ... |
> org.apache.activemq.store.kahadb.MessageDatabase | main
> 2016-02-15 12:03:49,237 | INFO  | Recovery replayed 20 operations from the
> journal in 0.094 seconds. |
> org.apache.activemq.store.kahadb.MessageDatabase | main
> 2016-02-15 12:04:26,863 | DEBUG | Checkpoint started. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:04:26,932 | DEBUG | Checkpoint done. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:04:26,940 | INFO  | Apache ActiveMQ 5.10.2 (A,
> ID:A-59167-1455534183070-1:1) is starting |
> org.apache.activemq.broker.BrokerService | main
> ...
> 2016-02-15 12:04:30,289 | DEBUG | A adding destination: queue://q1.DLQ |
> org.apache.activemq.broker.region.AbstractRegion | main
> 2016-02-15 12:04:57,026 | DEBUG | queue://q1.DLQ expiring messages .. |
> org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
> 2016-02-15 12:04:57,029 | DEBUG | q1.DLQ toPageIn: 5, Inflight: 0,
> pagedInMessages.size 0, pagedInPendingDispatch.size 0, enqueueCount: 0,
> dequeueCount: 0, memUsage:0 | org.apache.activemq.broker.region.Queue |
> ActiveMQ Broker[A] Scheduler
> 2016-02-15 12:09:20,440 | DEBUG | Checkpoint started. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:09:20,476 | DEBUG | Checkpoint done. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:09:20,488 | DEBUG | queue://q1.DLQ expiring messages done. |
> org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
>
>
>

-- 
Christian Schneider
http://www.liquid-reality.de

Open Source Architect
http://www.talend.com

Re: Failover very slow with kahadb while restart of master is fast

Posted by akhil <ak...@gmail.com>.

Hi Chris , 

How did you solve this caching issue while your brokers standing on master
slave ..?

Did you put out any settings in activemq xml to startup the broker in quick
time or making it quick failover ?

Thanks,
Akhil.



--
Sent from: http://activemq.2283324.n4.nabble.com/ActiveMQ-User-f2341805.html

Re: Failover very slow with kahadb while restart of master is fast

Posted by Christian Schneider <ch...@die-schneider.net>.

We have found the main cause for this issue. Current Master and new 
Master seem to do the exact same things on startup.
They read all the journals.

It seems that the OS is caching the files in memory in some cases.
The master seems to have all the jorunal files in the cache. So when it 
restarts it is fast.
The new master seems to not have the journals in cache so they have to 
be transfered over the network (~10MB/s) in our case. So this
causes the difference in startup time.

One big question remains:
Why are the whole journals read? Shouldn't a master that is shut down in 
a controlled way leave a clean state, so on startup only the meta data 
index is read?

Christian

On 15.02.2016 13:32, Christian Schneider wrote:
> At a customer we are experiencing a very strange behaviour of activemq.
>
> We have two brokers on separate machines A and B. Storage is on a 
> third machine and imported using nfsv4.
> Kahadb contains about 18 GB of journal files.
>
> Scenario 1:
> - Stop B
> - Stop A (master now)
> - Start A
>
> The start of A takes about 50 seconds.
>
> Scnario 2:
> - Stop A
> - Stop B (master now)
> - Start A
>
> The start of A takes about 5 minutes.
>
> In the slow case the log at debug level shows a big gap between adding 
> the last queue (12:04) and running the ActiveMQ Journal Checkpoint 
> Worker again after the queue creation (12:09).
> I have no idea what activemq is doing in the mean time but it seems to 
> take several minutes.
>
> Can anyone explain this behaviour or explain what activemq does in the 
> mean time? I would also be happy about pointer to the code.
> What I do not understand is why it makes a difference if the master is 
> started again or if the slave is becoming master. As they both share 
> the same kahadb over nfs I would expect the same startup
> behaviour.
>
> Christian
>
> -----
>
> Slow start
> 2016-02-15 12:03:01,943 | INFO  | Refreshing 
> org.apache.activemq.xbean.XBeanBrokerFactory$1@17b14695: startup date 
> [Mon Feb 15 12:03:01 CET 2016]; root of context hierarchy | 
> org.apache.activemq.xbean.XBeanBrokerFactory$1 | main
> ...
> 2016-02-15 12:03:03,018 | INFO  | 
> PListStore:[/path/activemq/data/A/tmp_storage] started | 
> org.apache.activemq.store.kahadb.plist.PListStoreImpl | main
> 2016-02-15 12:03:03,033 | INFO  | ignoring zero length, partially 
> initialised journal data file: db-1.log number = 1 , length = 0 | 
> org.apache.activemq.store.kahadb.disk.journal.Journal | main
> 2016-02-15 12:03:03,177 | INFO  | JobSchedulerStore:path/scheduler 
> started | 
> org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl | main
> 2016-02-15 12:03:03,180 | INFO  | JobScheduler using directory: 
> path/scheduler | org.apache.activemq.broker.BrokerService | main
> 2016-02-15 12:03:03,335 | INFO  | Using Persistence Adapter: 
> KahaDBPersistenceAdapter[/path] | 
> org.apache.activemq.broker.BrokerService | main
> 2016-02-15 12:03:03,350 | INFO  | JMX consoles can connect to 
> service:jmx:rmi:///jndi/rmi://localhost:1617/jmxrmi | 
> org.apache.activemq.broker.jmx.ManagementContext | JMX connector
> 2016-02-15 12:03:05,198 | INFO  | Corrupt journal records found in 
> 'path/db-1122.log' between offsets: 17893102..17893732 | 
> org.apache.activemq.store.kahadb.disk.journal.Journal | main
> 2016-02-15 12:03:47,925 | INFO  | KahaDB is version 5 | 
> org.apache.activemq.store.kahadb.MessageDatabase | main
> ....
> 2016-02-15 12:03:49,233 | INFO  | Recovering from the journal ... | 
> org.apache.activemq.store.kahadb.MessageDatabase | main
> 2016-02-15 12:03:49,237 | INFO  | Recovery replayed 20 operations from 
> the journal in 0.094 seconds. | 
> org.apache.activemq.store.kahadb.MessageDatabase | main
> 2016-02-15 12:04:26,863 | DEBUG | Checkpoint started. | 
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal 
> Checkpoint Worker
> 2016-02-15 12:04:26,932 | DEBUG | Checkpoint done. | 
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal 
> Checkpoint Worker
> 2016-02-15 12:04:26,940 | INFO  | Apache ActiveMQ 5.10.2 (A, 
> ID:A-59167-1455534183070-1:1) is starting | 
> org.apache.activemq.broker.BrokerService | main
> ...
> 2016-02-15 12:04:30,289 | DEBUG | A adding destination: queue://q1.DLQ 
> | org.apache.activemq.broker.region.AbstractRegion | main
> 2016-02-15 12:04:57,026 | DEBUG | queue://q1.DLQ expiring messages .. 
> | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
> 2016-02-15 12:04:57,029 | DEBUG | q1.DLQ toPageIn: 5, Inflight: 0, 
> pagedInMessages.size 0, pagedInPendingDispatch.size 0, enqueueCount: 
> 0, dequeueCount: 0, memUsage:0 | 
> org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
> 2016-02-15 12:09:20,440 | DEBUG | Checkpoint started. | 
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal 
> Checkpoint Worker
> 2016-02-15 12:09:20,476 | DEBUG | Checkpoint done. | 
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal 
> Checkpoint Worker
> 2016-02-15 12:09:20,488 | DEBUG | queue://q1.DLQ expiring messages 
> done. | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] 
> Scheduler
>
>


-- 
Christian Schneider
http://www.liquid-reality.de

Open Source Architect
http://www.talend.com

Re: Failover very slow with kahadb while restart of master is fast

Posted by Christian Schneider <ch...@die-schneider.net>.

I now took thread dumps of activemq while it was taking so long to 
start  up.

This is where it seemed to spend most of the time:
http://apaste.info/H1c

So it looks like it is reading the kahadb journals. Does anyone have an 
idea why it seems to matter if the master restarts (fast) or the dlave 
takes over (slow).
After all the files are in the same remote location in both cases.

Christian

# at 
org.apache.activemq.util.RecoverableRandomAccessFile.readFully(RecoverableRandomAccessFile.java:75)
# at 
org.apache.activemq.store.kahadb.disk.journal.DataFileAccessor.readFully(DataFileAccessor.java:98)
# at 
org.apache.activemq.store.kahadb.disk.journal.Journal.checkBatchRecord(Journal.java:302)
# at 
org.apache.activemq.store.kahadb.disk.journal.Journal.recoveryCheck(Journal.java:210)
# at 
org.apache.activemq.store.kahadb.disk.journal.Journal.start(Journal.java:170)
# -locked <0x000000070086c700>(a 
org.apache.activemq.store.kahadb.disk.journal.Journal)
# at 
org.apache.activemq.store.kahadb.MessageDatabase.open(MessageDatabase.java:378)



On 15.02.2016 15:15, Tim Bain wrote:
> You could use a profiler such as JVisualVM to tell you what work is being
> done.  You could get some of that information from thread dumps via jstack,
> but I think a profiler would give much more useful and detailed information
> for less effort.
>
> Tim
> On Feb 15, 2016 5:37 AM, "Jean-Baptiste Onofré" <jb...@nanthrax.net> wrote:
>
>> Hi Christian,
>>
>> what's the option on the NFS ?
>>
>> Most of the time, I use I use sync and I increase the rsize and wsize.
>>
>> Anyway, the difference is huge. The problem could be related to the
>> journal checksum check. Can you try checkForCorruptJournalFiles="false" on
>> the kahadb configuration ?
>>
>> Maybe increasing the kahadb jobs could help too:
>>
>>              <kahaDB directory="${karaf.data}/activemq/activemq/kahadb"
>>                  indexWriteBatchSize="1000"
>>                  indexCacheSize="2000"
>>                  journalMaxFileLength="32mb"
>>                  checkForCorruptJournalFiles="false"
>>                  maxAsyncJobs="5000"
>>                  concurrentStoreAndDispatchQueues="true"
>>                  concurrentStoreAndDispatchTopics="true"
>>                  enableJournalDiskSyncs="true"
>>                  enableIndexWriteAsync="true"/>
>>
>>
>> Regards
>> JB
>>
>> On 02/15/2016 01:32 PM, Christian Schneider wrote:
>>
>>> At a customer we are experiencing a very strange behaviour of activemq.
>>>
>>> We have two brokers on separate machines A and B. Storage is on a third
>>> machine and imported using nfsv4.
>>> Kahadb contains about 18 GB of journal files.
>>>
>>> Scenario 1:
>>> - Stop B
>>> - Stop A (master now)
>>> - Start A
>>>
>>> The start of A takes about 50 seconds.
>>>
>>> Scnario 2:
>>> - Stop A
>>> - Stop B (master now)
>>> - Start A
>>>
>>> The start of A takes about 5 minutes.
>>>
>>> In the slow case the log at debug level shows a big gap between adding
>>> the last queue (12:04) and running the ActiveMQ Journal Checkpoint
>>> Worker again after the queue creation (12:09).
>>> I have no idea what activemq is doing in the mean time but it seems to
>>> take several minutes.
>>>
>>> Can anyone explain this behaviour or explain what activemq does in the
>>> mean time? I would also be happy about pointer to the code.
>>> What I do not understand is why it makes a difference if the master is
>>> started again or if the slave is becoming master. As they both share the
>>> same kahadb over nfs I would expect the same startup
>>> behaviour.
>>>
>>> Christian
>>>
>>> -----
>>>
>>> Slow start
>>> 2016-02-15 12:03:01,943 | INFO  | Refreshing
>>> org.apache.activemq.xbean.XBeanBrokerFactory$1@17b14695: startup date
>>> [Mon Feb 15 12:03:01 CET 2016]; root of context hierarchy |
>>> org.apache.activemq.xbean.XBeanBrokerFactory$1 | main
>>> ...
>>> 2016-02-15 12:03:03,018 | INFO  |
>>> PListStore:[/path/activemq/data/A/tmp_storage] started |
>>> org.apache.activemq.store.kahadb.plist.PListStoreImpl | main
>>> 2016-02-15 12:03:03,033 | INFO  | ignoring zero length, partially
>>> initialised journal data file: db-1.log number = 1 , length = 0 |
>>> org.apache.activemq.store.kahadb.disk.journal.Journal | main
>>> 2016-02-15 12:03:03,177 | INFO  | JobSchedulerStore:path/scheduler
>>> started |
>>> org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl | main
>>> 2016-02-15 12:03:03,180 | INFO  | JobScheduler using directory:
>>> path/scheduler | org.apache.activemq.broker.BrokerService | main
>>> 2016-02-15 12:03:03,335 | INFO  | Using Persistence Adapter:
>>> KahaDBPersistenceAdapter[/path] |
>>> org.apache.activemq.broker.BrokerService | main
>>> 2016-02-15 12:03:03,350 | INFO  | JMX consoles can connect to
>>> service:jmx:rmi:///jndi/rmi://localhost:1617/jmxrmi |
>>> org.apache.activemq.broker.jmx.ManagementContext | JMX connector
>>> 2016-02-15 12:03:05,198 | INFO  | Corrupt journal records found in
>>> 'path/db-1122.log' between offsets: 17893102..17893732 |
>>> org.apache.activemq.store.kahadb.disk.journal.Journal | main
>>> 2016-02-15 12:03:47,925 | INFO  | KahaDB is version 5 |
>>> org.apache.activemq.store.kahadb.MessageDatabase | main
>>> ....
>>> 2016-02-15 12:03:49,233 | INFO  | Recovering from the journal ... |
>>> org.apache.activemq.store.kahadb.MessageDatabase | main
>>> 2016-02-15 12:03:49,237 | INFO  | Recovery replayed 20 operations from
>>> the journal in 0.094 seconds. |
>>> org.apache.activemq.store.kahadb.MessageDatabase | main
>>> 2016-02-15 12:04:26,863 | DEBUG | Checkpoint started. |
>>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>>> Checkpoint Worker
>>> 2016-02-15 12:04:26,932 | DEBUG | Checkpoint done. |
>>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>>> Checkpoint Worker
>>> 2016-02-15 12:04:26,940 | INFO  | Apache ActiveMQ 5.10.2 (A,
>>> ID:A-59167-1455534183070-1:1) is starting |
>>> org.apache.activemq.broker.BrokerService | main
>>> ...
>>> 2016-02-15 12:04:30,289 | DEBUG | A adding destination: queue://q1.DLQ |
>>> org.apache.activemq.broker.region.AbstractRegion | main
>>> 2016-02-15 12:04:57,026 | DEBUG | queue://q1.DLQ expiring messages .. |
>>> org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
>>> 2016-02-15 12:04:57,029 | DEBUG | q1.DLQ toPageIn: 5, Inflight: 0,
>>> pagedInMessages.size 0, pagedInPendingDispatch.size 0, enqueueCount: 0,
>>> dequeueCount: 0, memUsage:0 | org.apache.activemq.broker.region.Queue |
>>> ActiveMQ Broker[A] Scheduler
>>> 2016-02-15 12:09:20,440 | DEBUG | Checkpoint started. |
>>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>>> Checkpoint Worker
>>> 2016-02-15 12:09:20,476 | DEBUG | Checkpoint done. |
>>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>>> Checkpoint Worker
>>> 2016-02-15 12:09:20,488 | DEBUG | queue://q1.DLQ expiring messages done.
>>> | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
>>>
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>


-- 
Christian Schneider
http://www.liquid-reality.de

Open Source Architect
http://www.talend.com

Re: Failover very slow with kahadb while restart of master is fast

Posted by Tim Bain <tb...@alumni.duke.edu>.

You could use a profiler such as JVisualVM to tell you what work is being
done.  You could get some of that information from thread dumps via jstack,
but I think a profiler would give much more useful and detailed information
for less effort.

Tim
On Feb 15, 2016 5:37 AM, "Jean-Baptiste Onofré" <jb...@nanthrax.net> wrote:

> Hi Christian,
>
> what's the option on the NFS ?
>
> Most of the time, I use I use sync and I increase the rsize and wsize.
>
> Anyway, the difference is huge. The problem could be related to the
> journal checksum check. Can you try checkForCorruptJournalFiles="false" on
> the kahadb configuration ?
>
> Maybe increasing the kahadb jobs could help too:
>
>             <kahaDB directory="${karaf.data}/activemq/activemq/kahadb"
>                 indexWriteBatchSize="1000"
>                 indexCacheSize="2000"
>                 journalMaxFileLength="32mb"
>                 checkForCorruptJournalFiles="false"
>                 maxAsyncJobs="5000"
>                 concurrentStoreAndDispatchQueues="true"
>                 concurrentStoreAndDispatchTopics="true"
>                 enableJournalDiskSyncs="true"
>                 enableIndexWriteAsync="true"/>
>
>
> Regards
> JB
>
> On 02/15/2016 01:32 PM, Christian Schneider wrote:
>
>> At a customer we are experiencing a very strange behaviour of activemq.
>>
>> We have two brokers on separate machines A and B. Storage is on a third
>> machine and imported using nfsv4.
>> Kahadb contains about 18 GB of journal files.
>>
>> Scenario 1:
>> - Stop B
>> - Stop A (master now)
>> - Start A
>>
>> The start of A takes about 50 seconds.
>>
>> Scnario 2:
>> - Stop A
>> - Stop B (master now)
>> - Start A
>>
>> The start of A takes about 5 minutes.
>>
>> In the slow case the log at debug level shows a big gap between adding
>> the last queue (12:04) and running the ActiveMQ Journal Checkpoint
>> Worker again after the queue creation (12:09).
>> I have no idea what activemq is doing in the mean time but it seems to
>> take several minutes.
>>
>> Can anyone explain this behaviour or explain what activemq does in the
>> mean time? I would also be happy about pointer to the code.
>> What I do not understand is why it makes a difference if the master is
>> started again or if the slave is becoming master. As they both share the
>> same kahadb over nfs I would expect the same startup
>> behaviour.
>>
>> Christian
>>
>> -----
>>
>> Slow start
>> 2016-02-15 12:03:01,943 | INFO  | Refreshing
>> org.apache.activemq.xbean.XBeanBrokerFactory$1@17b14695: startup date
>> [Mon Feb 15 12:03:01 CET 2016]; root of context hierarchy |
>> org.apache.activemq.xbean.XBeanBrokerFactory$1 | main
>> ...
>> 2016-02-15 12:03:03,018 | INFO  |
>> PListStore:[/path/activemq/data/A/tmp_storage] started |
>> org.apache.activemq.store.kahadb.plist.PListStoreImpl | main
>> 2016-02-15 12:03:03,033 | INFO  | ignoring zero length, partially
>> initialised journal data file: db-1.log number = 1 , length = 0 |
>> org.apache.activemq.store.kahadb.disk.journal.Journal | main
>> 2016-02-15 12:03:03,177 | INFO  | JobSchedulerStore:path/scheduler
>> started |
>> org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl | main
>> 2016-02-15 12:03:03,180 | INFO  | JobScheduler using directory:
>> path/scheduler | org.apache.activemq.broker.BrokerService | main
>> 2016-02-15 12:03:03,335 | INFO  | Using Persistence Adapter:
>> KahaDBPersistenceAdapter[/path] |
>> org.apache.activemq.broker.BrokerService | main
>> 2016-02-15 12:03:03,350 | INFO  | JMX consoles can connect to
>> service:jmx:rmi:///jndi/rmi://localhost:1617/jmxrmi |
>> org.apache.activemq.broker.jmx.ManagementContext | JMX connector
>> 2016-02-15 12:03:05,198 | INFO  | Corrupt journal records found in
>> 'path/db-1122.log' between offsets: 17893102..17893732 |
>> org.apache.activemq.store.kahadb.disk.journal.Journal | main
>> 2016-02-15 12:03:47,925 | INFO  | KahaDB is version 5 |
>> org.apache.activemq.store.kahadb.MessageDatabase | main
>> ....
>> 2016-02-15 12:03:49,233 | INFO  | Recovering from the journal ... |
>> org.apache.activemq.store.kahadb.MessageDatabase | main
>> 2016-02-15 12:03:49,237 | INFO  | Recovery replayed 20 operations from
>> the journal in 0.094 seconds. |
>> org.apache.activemq.store.kahadb.MessageDatabase | main
>> 2016-02-15 12:04:26,863 | DEBUG | Checkpoint started. |
>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>> Checkpoint Worker
>> 2016-02-15 12:04:26,932 | DEBUG | Checkpoint done. |
>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>> Checkpoint Worker
>> 2016-02-15 12:04:26,940 | INFO  | Apache ActiveMQ 5.10.2 (A,
>> ID:A-59167-1455534183070-1:1) is starting |
>> org.apache.activemq.broker.BrokerService | main
>> ...
>> 2016-02-15 12:04:30,289 | DEBUG | A adding destination: queue://q1.DLQ |
>> org.apache.activemq.broker.region.AbstractRegion | main
>> 2016-02-15 12:04:57,026 | DEBUG | queue://q1.DLQ expiring messages .. |
>> org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
>> 2016-02-15 12:04:57,029 | DEBUG | q1.DLQ toPageIn: 5, Inflight: 0,
>> pagedInMessages.size 0, pagedInPendingDispatch.size 0, enqueueCount: 0,
>> dequeueCount: 0, memUsage:0 | org.apache.activemq.broker.region.Queue |
>> ActiveMQ Broker[A] Scheduler
>> 2016-02-15 12:09:20,440 | DEBUG | Checkpoint started. |
>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>> Checkpoint Worker
>> 2016-02-15 12:09:20,476 | DEBUG | Checkpoint done. |
>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>> Checkpoint Worker
>> 2016-02-15 12:09:20,488 | DEBUG | queue://q1.DLQ expiring messages done.
>> | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: Failover very slow with kahadb while restart of master is fast

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi Christian,

what's the option on the NFS ?

Most of the time, I use I use sync and I increase the rsize and wsize.

Anyway, the difference is huge. The problem could be related to the 
journal checksum check. Can you try checkForCorruptJournalFiles="false" 
on the kahadb configuration ?

Maybe increasing the kahadb jobs could help too:

             <kahaDB directory="${karaf.data}/activemq/activemq/kahadb"
                 indexWriteBatchSize="1000"
                 indexCacheSize="2000"
                 journalMaxFileLength="32mb"
                 checkForCorruptJournalFiles="false"
                 maxAsyncJobs="5000"
                 concurrentStoreAndDispatchQueues="true"
                 concurrentStoreAndDispatchTopics="true"
                 enableJournalDiskSyncs="true"
                 enableIndexWriteAsync="true"/>


Regards
JB

On 02/15/2016 01:32 PM, Christian Schneider wrote:
> At a customer we are experiencing a very strange behaviour of activemq.
>
> We have two brokers on separate machines A and B. Storage is on a third
> machine and imported using nfsv4.
> Kahadb contains about 18 GB of journal files.
>
> Scenario 1:
> - Stop B
> - Stop A (master now)
> - Start A
>
> The start of A takes about 50 seconds.
>
> Scnario 2:
> - Stop A
> - Stop B (master now)
> - Start A
>
> The start of A takes about 5 minutes.
>
> In the slow case the log at debug level shows a big gap between adding
> the last queue (12:04) and running the ActiveMQ Journal Checkpoint
> Worker again after the queue creation (12:09).
> I have no idea what activemq is doing in the mean time but it seems to
> take several minutes.
>
> Can anyone explain this behaviour or explain what activemq does in the
> mean time? I would also be happy about pointer to the code.
> What I do not understand is why it makes a difference if the master is
> started again or if the slave is becoming master. As they both share the
> same kahadb over nfs I would expect the same startup
> behaviour.
>
> Christian
>
> -----
>
> Slow start
> 2016-02-15 12:03:01,943 | INFO  | Refreshing
> org.apache.activemq.xbean.XBeanBrokerFactory$1@17b14695: startup date
> [Mon Feb 15 12:03:01 CET 2016]; root of context hierarchy |
> org.apache.activemq.xbean.XBeanBrokerFactory$1 | main
> ...
> 2016-02-15 12:03:03,018 | INFO  |
> PListStore:[/path/activemq/data/A/tmp_storage] started |
> org.apache.activemq.store.kahadb.plist.PListStoreImpl | main
> 2016-02-15 12:03:03,033 | INFO  | ignoring zero length, partially
> initialised journal data file: db-1.log number = 1 , length = 0 |
> org.apache.activemq.store.kahadb.disk.journal.Journal | main
> 2016-02-15 12:03:03,177 | INFO  | JobSchedulerStore:path/scheduler
> started |
> org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl | main
> 2016-02-15 12:03:03,180 | INFO  | JobScheduler using directory:
> path/scheduler | org.apache.activemq.broker.BrokerService | main
> 2016-02-15 12:03:03,335 | INFO  | Using Persistence Adapter:
> KahaDBPersistenceAdapter[/path] |
> org.apache.activemq.broker.BrokerService | main
> 2016-02-15 12:03:03,350 | INFO  | JMX consoles can connect to
> service:jmx:rmi:///jndi/rmi://localhost:1617/jmxrmi |
> org.apache.activemq.broker.jmx.ManagementContext | JMX connector
> 2016-02-15 12:03:05,198 | INFO  | Corrupt journal records found in
> 'path/db-1122.log' between offsets: 17893102..17893732 |
> org.apache.activemq.store.kahadb.disk.journal.Journal | main
> 2016-02-15 12:03:47,925 | INFO  | KahaDB is version 5 |
> org.apache.activemq.store.kahadb.MessageDatabase | main
> ....
> 2016-02-15 12:03:49,233 | INFO  | Recovering from the journal ... |
> org.apache.activemq.store.kahadb.MessageDatabase | main
> 2016-02-15 12:03:49,237 | INFO  | Recovery replayed 20 operations from
> the journal in 0.094 seconds. |
> org.apache.activemq.store.kahadb.MessageDatabase | main
> 2016-02-15 12:04:26,863 | DEBUG | Checkpoint started. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:04:26,932 | DEBUG | Checkpoint done. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:04:26,940 | INFO  | Apache ActiveMQ 5.10.2 (A,
> ID:A-59167-1455534183070-1:1) is starting |
> org.apache.activemq.broker.BrokerService | main
> ...
> 2016-02-15 12:04:30,289 | DEBUG | A adding destination: queue://q1.DLQ |
> org.apache.activemq.broker.region.AbstractRegion | main
> 2016-02-15 12:04:57,026 | DEBUG | queue://q1.DLQ expiring messages .. |
> org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
> 2016-02-15 12:04:57,029 | DEBUG | q1.DLQ toPageIn: 5, Inflight: 0,
> pagedInMessages.size 0, pagedInPendingDispatch.size 0, enqueueCount: 0,
> dequeueCount: 0, memUsage:0 | org.apache.activemq.broker.region.Queue |
> ActiveMQ Broker[A] Scheduler
> 2016-02-15 12:09:20,440 | DEBUG | Checkpoint started. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:09:20,476 | DEBUG | Checkpoint done. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:09:20,488 | DEBUG | queue://q1.DLQ expiring messages done.
> | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Failover very slow with kahadb while restart of master is fast

Posted by artnaseef <ar...@artnaseef.com>.

Does the shutdown of the broker complete cleanly in both cases?

A slow shutdown will lead to the shutdown script using kill -9, which
eliminates any possibility of the process continuing to ensure a clean
shutdown.  In that case, the contents of the file may be left in a dirty
state, which may lead to recovery when the broker starts up.

Also, in the "Stop master then stop second master" scenario, is the second
master (B) completely stopped before restarting the first master (A)?

One way to help track this down - take periodic stack dumps (jstack is good
for this) during the startup (perhaps one every 15 seconds), and look at the
stack trace of the main broker startup thread.  That should help determine
the cause.



--
View this message in context: http://activemq.2283324.n4.nabble.com/Failover-very-slow-with-kahadb-while-restart-of-master-is-fast-tp4707500p4708777.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.