Posted to users@activemq.apache.org by khandelwalanuj <kh...@gmail.com> on 2015/01/25 10:36:01 UTC

kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Hi,

Broker version: 5.10.0
Using a master-slave topology with shared KahaDB.

Today we are facing a very critical production issue due to KahaDB. We got
the error below in the broker logs. After that, the broker stopped its
transport connectors and its services, but it did not release the lock on
KahaDB, so the failover broker was not able to acquire the lock and could
not serve the clients.

The broker stayed in this state for a long time, until we manually restarted
it. The major concern here is that the master broker didn't release the lock
on KahaDB, because of which the failover broker was not able to get the lock
and become master.

Can you please let us know what caused this and why the master didn't
release the lock?


[20150124 10:36:58.665 EST (ActiveMQ Data File Writer) org.apache.activemq.store.kahadb.disk.journal.DataFileAppender#processQueue 382 INFO] - Journal failed while writing at: 1677639
[20150124 10:36:58.706 EST (ActiveMQ Journal Checkpoint Worker) org.apache.activemq.store.kahadb.MessageDatabase$3#run 364 ERROR] - Checkpoint failed
java.io.IOException: Input/output error
        at java.io.RandomAccessFile.write0(Native Method)
        at java.io.RandomAccessFile.write(RandomAccessFile.java:472)
        at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1028)
        at org.apache.activemq.util.RecoverableRandomAccessFile.writeLong(RecoverableRandomAccessFile.java:305)
        at org.apache.activemq.store.kahadb.disk.page.PageFile.writeBatch(PageFile.java:1062)
        at org.apache.activemq.store.kahadb.disk.page.PageFile.flush(PageFile.java:516)
        at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1512)
        at org.apache.activemq.store.kahadb.MessageDatabase$17.execute(MessageDatabase.java:1484)
        at org.apache.activemq.store.kahadb.disk.page.Transaction.execute(Transaction.java:779)
        at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1481)
        at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:929)
        at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:353)

Thanks,
Anuj



--
View this message in context: http://activemq.2283324.n4.nabble.com/kahadb-corruption-Checkpoint-failed-java-io-IOException-Input-output-error-tp4690378.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Posted by khandelwalanuj <kh...@gmail.com>.
Hi,

Yes, the broker process was still running. I verified it with the "ps"
command.
I have updated the JIRA with the details you mentioned in your last update.

Thanks,
Anuj




Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Posted by Tim Bain <tb...@alumni.duke.edu>.
You said the broker still held the file lock, but I assumed that the broker
process exited without releasing the lock (since it couldn't write to
disk).  Can you confirm that the master broker process really was still
running (as seen by ps, not just the state of the file lock)?

If the broker process really was still running, the problem is actually
that the broker tries to shut down but fails to do so when it can't write
to the disk that hosts KahaDB.  If the process exited but the disk still
thinks the master broker holds the lock because the broker couldn't write
to disk to release it, then the problem is that the slave broker isn't able
to detect that the master exited without releasing the lock.  Your JIRA
should be updated in either case, but you need to know whether the process
exited to know which update to make.
On Jan 27, 2015 11:20 PM, "khandelwalanuj" <kh...@gmail.com>
wrote:

> Hi,
>
> The master was not completely killed. It stopped its transport
> connectors and plugins, but it didn't release its lock on KahaDB.
>
> Thanks,
> Anuj
>
>
>
>

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Posted by khandelwalanuj <kh...@gmail.com>.
Hi,

The master was not completely killed. It stopped its transport
connectors and plugins, but it didn't release its lock on KahaDB.

Thanks,
Anuj




Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Posted by Tim Bain <tb...@alumni.duke.edu>.
I thought the master was killed completely, and the problem was solely that
the slave didn't take over.  Can you please describe how the master wasn't
killed completely, since you've never before mentioned that in either your
emails here or the JIRA you submitted?

On Tue, Jan 27, 2015 at 7:46 AM, khandelwalanuj <khandelwal.anuj90@gmail.com> wrote:

> Hi,
>
> There were some failures on the filer, because of which the application
> (ActiveMQ) was not able to read from or write to KahaDB.
>
> As you mentioned, KahaDB should handle this: if the master broker is not
> writing, then the failover broker should take over. I have logged a request:
> https://issues.apache.org/jira/browse/AMQ-5540
>
> To handle this, can ActiveMQ provide a configuration option in
> http://activemq.apache.org/configurable-ioexception-handling.html which
> can kill the master completely and let the failover broker take over?
>
> Thanks,
> Anuj
>
>
>
>

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Posted by khandelwalanuj <kh...@gmail.com>.
Hi,

There were some failures on the filer, because of which the application
(ActiveMQ) was not able to read from or write to KahaDB.

As you mentioned, KahaDB should handle this: if the master broker is not
writing, then the failover broker should take over. I have logged a request:
https://issues.apache.org/jira/browse/AMQ-5540

To handle this, can ActiveMQ provide a configuration option in
http://activemq.apache.org/configurable-ioexception-handling.html which can
kill the master completely and let the failover broker take over?
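
To make the request concrete, here is a rough sketch of the "kill the master
completely" behaviour being asked for. It is only an illustration:
FailFastOnIoError is a hypothetical class, not an ActiveMQ class, and wiring
it into the broker's IOException handling is not shown. It also assumes that
terminating the process lets the operating system drop the file lock, which
may not hold if the filer itself is failing.

```java
import java.io.IOException;
import java.util.function.IntConsumer;

/** Hypothetical fail-fast policy for store I/O errors; not part of ActiveMQ. */
public final class FailFastOnIoError {
    // The action that ends the process; injectable so the policy can be tested.
    private final IntConsumer terminator;

    public FailFastOnIoError(IntConsumer terminator) {
        this.terminator = terminator;
    }

    /**
     * Variant that halts the JVM immediately. halt() skips shutdown hooks,
     * so nothing can block on the failed disk during exit.
     */
    public static FailFastOnIoError haltingJvm() {
        return new FailFastOnIoError(code -> Runtime.getRuntime().halt(code));
    }

    /** Called when the store reports an I/O error: log it and terminate. */
    public void handle(IOException e) {
        System.err.println("Fatal store I/O error, terminating: " + e.getMessage());
        terminator.accept(1);
    }
}
```

The point of halting rather than shutting down gracefully is that a graceful
shutdown is exactly what got stuck here: the connectors stopped but the
process stayed alive and kept the lock.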

Thanks,
Anuj




Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Posted by Tim Bain <tb...@alumni.duke.edu>.
Submit this as a JIRA, stating that KahaDB can't fail over to the slave if
the master is unable to write to disk when it shuts down (because it
couldn't write to disk).  I'm not sure how feasible it'll be for the slave
to detect this (maybe there's a file modification timestamp that can be
used, or maybe something could be added to have the master write
periodically to a file so the slave can detect that the master is no longer
writing), but ideally KahaDB should handle this situation.
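
A minimal sketch of the heartbeat idea, assuming the master periodically
touches a file on the shared store and the slave checks its modification
time. HeartbeatCheck, the heartbeat file, and the maxAge threshold are all
hypothetical; nothing like this is claimed to exist in KahaDB. On a shared
filesystem, clock skew between hosts and timestamp caching would also need
to be accounted for.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical heartbeat helper; not part of ActiveMQ or KahaDB. */
public final class HeartbeatCheck {
    private HeartbeatCheck() {}

    /** Master side: create the heartbeat file if needed and set its mtime to now. */
    public static void touch(Path heartbeat, Instant now) throws IOException {
        if (Files.notExists(heartbeat)) {
            Files.createFile(heartbeat);
        }
        Files.setLastModifiedTime(heartbeat, FileTime.from(now));
    }

    /** Slave side: true when the heartbeat is unreadable or older than maxAge. */
    public static boolean isStale(Path heartbeat, Instant now, Duration maxAge) {
        try {
            Instant last = Files.getLastModifiedTime(heartbeat).toInstant();
            return Duration.between(last, now).compareTo(maxAge) > 0;
        } catch (IOException e) {
            // This sketch treats a missing or unreadable heartbeat as stale.
            // A real slave would need to rule out its own disk problems first.
            return true;
        }
    }
}
```

If the master can no longer write to disk, touch() fails and the timestamp
stops advancing, which is the signal the slave would look for.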

With that being said, have your sysadmins figured out why KahaDB was unable
to write to disk in a live production system and made sure it never happens
again?  Because KahaDB only had this problem because your infrastructure
had a terrible failure, and I really hope that the sysadmin's office, not
the ActiveMQ mailing list, was the first stop you made after you discovered
this problem.
On Jan 26, 2015 11:24 PM, "khandelwalanuj" <kh...@gmail.com>
wrote:

> Did anyone get a chance to look at this ?
>
>
>
>

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Posted by khandelwalanuj <kh...@gmail.com>.
Did anyone get a chance to look at this ? 




Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Posted by khandelwalanuj <kh...@gmail.com>.
Attaching the complete stack trace for more details. 


ActiveMQ_prod_25_Jan.txt
<http://activemq.2283324.n4.nabble.com/file/n4690379/ActiveMQ_prod_25_Jan.txt>  


