Posted to users@activemq.apache.org by "Gunawan, Rahman (GSFC-703.H)[Halvik Corp]" <ra...@nasa.gov.INVALID> on 2022/02/28 12:54:40 UTC

RE: [EXTERNAL] Re: Artemis file locking not released

The backup server knew that the primary server had a problem.  Below is an excerpt from the backup server's log:
ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to create netty connection: java.net.UnknownHostException

So I'm wondering whether, when the primary Artemis server loses its connection to NFS or the network, the backup server could detect this, unlock the file, and take over.  Please let me know if you have suggestions.
Thanks

Regards,
Rahman

-----Original Message-----
From: Clebert Suconic <cl...@gmail.com> 
Sent: Saturday, February 26, 2022 9:27 AM
To: users@activemq.apache.org
Subject: [EXTERNAL] Re: Artemis file locking not released

Could it be some configuration of the remote file system attributes?

On Fri, Feb 25, 2022 at 12:03 PM Gunawan, Rahman (GSFC-703.H)[Halvik Corp] <ra...@nasa.gov.invalid> wrote:

> I'm using Artemis 2.19.1 with a shared file configuration and am
> testing a scenario where the primary Artemis server is isolated from
> the network by disabling its network card.  Because the primary server
> lost communication with NFS, the file is never unlocked and the backup
> server waits for the lock indefinitely.  When we re-enable the network
> card on the primary server, the primary server is completely down.
> Below is the primary server log:
> "Reference Handler" Id=2 WAITING on java.lang.ref.Reference$Lock@64b6b3fc
>         at java.lang.Object.wait(Native Method)
>         -  waiting on java.lang.ref.Reference$Lock@64b6b3fc
>         at java.lang.Object.wait(Object.java:502)
>         at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
>         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
>
> ===============================================================================
> End Thread dump
>
> Is this a bug in the Artemis shared file configuration?
>
> Regards,
> Rahman
>
--
Clebert Suconic

RE: [EXTERNAL] Re: Artemis file locking not released

Posted by "Gunawan, Rahman (GSFC-703.H)[Halvik Corp]" <ra...@nasa.gov.INVALID>.

Attached is the Artemis v2.19.1 log from when it was terminated.  Should the server have gone to sleep when it lost its connection to NFS/the network and woken up when it recovered the connection?
In replication mode, the server went to sleep when it lost access to the network, then woke up when it regained access.

-----Original Message-----
From: Justin Bertram <jb...@apache.org>
Sent: Monday, February 28, 2022 1:49 PM
To: users@activemq.apache.org
Subject: Re: [EXTERNAL] Re: Artemis file locking not released

> Why was the primary server completely down when it was isolated from
> the network?

I can't really say since you've not really provided any details about this.
However, I would guess that since the journal is on NFS and since you killed the broker's network then it encountered a critical IO error and shut itself down. This is the expected behavior.

> I configured <network-check-list>, enabled ,
> <network-check-ping-command> and <network-check-ping6-command> so the
> primary server knew that the network was unhealthy as shown in below log...

I've not seen the network pinger enabled for a shared-store configuration as it was explicitly designed for the replicated (i.e. shared nothing) configuration to avoid split-brain. In the shared-store configuration the shared-store itself mitigates against split-brain (e.g. via file locks). I don't believe you need to configure the network pinger given your use of shared-store.


Justin

Re: [EXTERNAL] Re: Artemis file locking not released

Posted by Justin Bertram <jb...@apache.org>.

> Should the server have gone to sleep when it lost connection to
> NFS/network and woke up when the server recovered the connection to
> NFS/network?

The broker is written to shut itself down when it encounters IO errors like
this as such errors are deemed critical to the fundamental function of the
broker. It is not designed to go "to sleep" in these specific situations
and keep trying the failed operation. Even if it did the broker would still
be unavailable to clients which would functionally equate to downtime. I
would expect such a feature would be challenging to implement and not
substantially more useful than simply restarting the broker either
automatically (e.g. by running the broker as an OS "service") or
administratively. Such a feature makes even less sense in managed
environments (e.g. in the cloud) where infrastructure monitors services and
restarts them automatically in case of a failure.
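
As a rough illustration of the "restart automatically" option, here is a minimal Java sketch of a supervisor loop. It is only a stand-in for what an OS service manager would normally do. The class name RestartSupervisorSketch, the supervise method, and the restart limit and back-off values are all made up for this example and are not part of Artemis.

```java
import java.util.concurrent.Callable;

public class RestartSupervisorSketch {

    /**
     * Runs the task repeatedly until it exits with code 0 or the restart
     * budget is used up. Returns the number of restarts performed.
     */
    static int supervise(Callable<Integer> task, int maxRestarts, long backoffMillis)
            throws Exception {
        int restarts = 0;
        while (true) {
            int exitCode = task.call();
            if (exitCode == 0 || restarts >= maxRestarts) {
                return restarts;
            }
            restarts++;
            Thread.sleep(backoffMillis); // brief pause before the next attempt
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: RestartSupervisorSketch <command> [args...]");
            return;
        }
        // The command line is whatever the caller passes in (hypothetical here);
        // it would be the script that launches the broker in the foreground.
        ProcessBuilder pb = new ProcessBuilder(args).inheritIO();
        int restarts = supervise(() -> pb.start().waitFor(), 5, 2_000L);
        System.out.println("restarts performed: " + restarts);
    }
}
```

In a real deployment the same effect would more likely come from the service manager's own restart policy than from a hand-written loop like this.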

In any event, based on your description as well as the logs you provided I
would say everything is working as designed.


Justin

On Tue, Mar 1, 2022 at 11:04 AM Gunawan, Rahman (GSFC-703.H)[Halvik Corp]
<ra...@nasa.gov.invalid> wrote:

> Attached is the Artemis v2.19.1 log from when it was terminated.  Should the
> server have gone to sleep when it lost its connection to NFS/the network and
> woken up when it recovered the connection?
> In replication mode, the server went to sleep when it lost access to the
> network, then woke up when it regained access.

Re: [EXTERNAL] Re: Artemis file locking not released

Posted by Justin Bertram <jb...@apache.org>.

> Why was the primary server completely down when it was isolated from the
> network?

I can't really say since you've not really provided any details about this.
However, I would guess that since the journal is on NFS and since you
killed the broker's network then it encountered a critical IO error and
shut itself down. This is the expected behavior.

> I configured <network-check-list>, enabled , <network-check-ping-command>
> and <network-check-ping6-command> so the primary server knew that the
> network was unhealthy as shown in below log...

I've not seen the network pinger enabled for a shared-store configuration
as it was explicitly designed for the replicated (i.e. shared nothing)
configuration to avoid split-brain. In the shared-store configuration the
shared-store itself mitigates against split-brain (e.g. via file locks). I
don't believe you need to configure the network pinger given your use of
shared-store.


Justin

On Mon, Feb 28, 2022 at 11:34 AM Gunawan, Rahman (GSFC-703.H)[Halvik Corp]
<ra...@nasa.gov.invalid> wrote:

> We'll take a look at the NFS configuration.  Why was the primary server
> completely down when it was isolated from the network?  I configured
> <network-check-list>, enabled , <network-check-ping-command> and
> <network-check-ping6-command> so the primary server knew that the network
> was unhealthy, as shown in the log below:
> [org.apache.activemq.artemis.logs] AMQ201001: Network is unhealthy, stopping service ActiveMQServerImpl
>
> However, when we re-enabled the network card, the primary server was
> still completely down.  I had to start the primary server manually.
>
> Regards,
> Rahman

RE: [EXTERNAL] Re: Artemis file locking not released

Posted by "Gunawan, Rahman (GSFC-703.H)[Halvik Corp]" <ra...@nasa.gov.INVALID>.

We'll take a look at the NFS configuration.  Why was the primary server completely down when it was isolated from the network?  I configured <network-check-list>, enabled , <network-check-ping-command> and <network-check-ping6-command> so the primary server knew that the network was unhealthy, as shown in the log below:
[org.apache.activemq.artemis.logs] AMQ201001: Network is unhealthy, stopping service ActiveMQServerImpl

However, when we re-enabled the network card, the primary server was still completely down.  I had to start the primary server manually.

Regards,
Rahman

-----Original Message-----
From: Justin Bertram <jb...@apache.org> 
Sent: Monday, February 28, 2022 10:15 AM
To: users@activemq.apache.org
Subject: Re: [EXTERNAL] Re: Artemis file locking not released

The backup and the live do have a direct connection. This allows the backup to share its connection details with the live. The live then takes those details and passes them on to clients so that the clients will know where to connect in case the live fails.

However, if this connection breaks it is *not* possible for the backup to simply "unlock" the journal and take over. The only entities which can unlock the journal are the live broker (who created the lock in the first place) or NFS itself (e.g. in the case of some kind of connectivity failure). If the lock is not being released when the live broker's NFS connectivity fails then I would suggest you have a problem with your NFS configuration.

Justin

Re: [EXTERNAL] Re: Artemis file locking not released

Posted by Justin Bertram <jb...@apache.org>.

The backup and the live do have a direct connection. This allows the backup
to share its connection details with the live. The live then takes those
details and passes them on to clients so that the clients will know where
to connect in case the live fails.

However, if this connection breaks it is *not* possible for the backup to
simply "unlock" the journal and take over. The only entities which can
unlock the journal are the live broker (who created the lock in the first
place) or NFS itself (e.g. in the case of some kind of connectivity
failure). If the lock is not being released when the live broker's NFS
connectivity fails then I would suggest you have a problem with your NFS
configuration.
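
To illustrate the ownership point, here is a minimal sketch using plain java.nio file locks on a local temporary file. It is not Artemis code, and the class and method names (LockOwnershipSketch, tryAcquire, demo) are hypothetical. It only shows that a second party cannot take or clear a lock it does not hold and has to wait for the holder to release it. Both channels live in one JVM here, so contention shows up as OverlappingFileLockException; between separate processes tryLock() would return null instead. On NFS, whether a lock held by an unreachable client is ever released depends on the NFS version and lock service, which this local sketch does not model.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockOwnershipSketch {

    // Returns the lock, or null if another holder already has it.
    static FileLock tryAcquire(FileChannel channel) throws IOException {
        try {
            return channel.tryLock(); // null when another process holds the lock
        } catch (OverlappingFileLockException e) {
            return null;              // thrown when this JVM already holds it
        }
    }

    // Returns {live got lock, backup got lock while held, backup got lock after release}.
    static boolean[] demo(Path lockFile) throws IOException {
        try (FileChannel live = FileChannel.open(lockFile, StandardOpenOption.WRITE);
             FileChannel backup = FileChannel.open(lockFile, StandardOpenOption.WRITE)) {
            FileLock liveLock = tryAcquire(live);
            boolean liveGot = liveLock != null;

            // The "backup" has no way to clear the holder's lock; it can only retry.
            boolean backupWhileHeld = tryAcquire(backup) != null;

            // Only the holder releases the lock.
            if (liveLock != null) {
                liveLock.release();
            }
            FileLock backupLock = tryAcquire(backup);
            boolean backupAfterRelease = backupLock != null;
            if (backupLock != null) {
                backupLock.release();
            }
            return new boolean[] {liveGot, backupWhileHeld, backupAfterRelease};
        }
    }

    public static void main(String[] args) throws IOException {
        Path lockFile = Files.createTempFile("server", ".lock");
        try {
            boolean[] r = demo(lockFile);
            System.out.println("live holds lock: " + r[0]);
            System.out.println("backup acquired while live holds it: " + r[1]);
            System.out.println("backup acquired after release: " + r[2]);
        } finally {
            Files.deleteIfExists(lockFile);
        }
    }
}
```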


Justin

On Mon, Feb 28, 2022 at 6:55 AM Gunawan, Rahman (GSFC-703.H)[Halvik Corp]
<ra...@nasa.gov.invalid> wrote:

> The backup server knew that the primary server had a problem.  Below is an
> excerpt from the backup server's log:
> ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to create netty connection: java.net.UnknownHostException
>
> So I'm wondering whether, when the primary Artemis server loses its
> connection to NFS or the network, the backup server could detect this,
> unlock the file, and take over.
> Please let me know if you have suggestions.
> Thanks
>
> Regards,
> Rahman

Re: [EXTERNAL] Artemis file locking not released

Posted by Matt Pavlovich <ma...@gmail.com>.

Hi Rahman-

High likelihood that the Artemis server's NFSv3 services did not release the lock when the connectivity to the NFS server went away.

This sounds like the exact use case as to why NFSv4 is generally recommended these days.

-Matt Pavlovich

> On Feb 28, 2022, at 11:24 AM, Gunawan, Rahman (GSFC-703.H)[Halvik Corp] <ra...@nasa.gov.INVALID> wrote:
> 
> I'm using NFS v3.  What is the recommended version?  I don't see the minimum NFS requirement in https://activemq.apache.org/components/artemis/documentation/2.19.0/ha.html.
> 
> Thanks
> 
> Rahman

RE: [EXTERNAL] Re: Artemis file locking not released

Posted by Vilius Šumskas <vi...@rivile.lt>.

NFSv3 doesn't support lock expiration. You need v4.

ActiveMQ Classic actually lists this in its shared file system requirements: https://activemq.apache.org/shared-file-system-master-slave . For Artemis there is a small note regarding NFSv4 here: https://activemq.apache.org/components/artemis/documentation/latest/persistence.html . I agree, though, that the requirements could be clearer and all in one place.
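
As a rough illustration of what "lock expiration" buys you, below is a small conceptual model in Java. It is not the NFS protocol and not Artemis code: the class LeaseLockTable, its methods, the client names, and the 90-second lease are all assumptions chosen for the example. The idea is that a lock owner must keep renewing a lease, and once it stops (for example because its network card is disabled) another client can be granted the lock.

```java
import java.util.HashMap;
import java.util.Map;

/** Conceptual model of lease-based locking; hypothetical names, not the NFS protocol. */
final class LeaseLockTable {
    private final long leaseMillis;
    private final Map<String, String> ownerByFile = new HashMap<>();
    private final Map<String, Long> lastRenewalByClient = new HashMap<>();

    LeaseLockTable(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** A healthy client calls this periodically to keep its lease alive. */
    void renew(String client, long nowMillis) {
        lastRenewalByClient.put(client, nowMillis);
    }

    /** Grants the lock if it is free or if the current owner's lease has expired. */
    boolean tryLock(String file, String client, long nowMillis) {
        renew(client, nowMillis);
        String owner = ownerByFile.get(file);
        if (owner != null && !owner.equals(client)) {
            Long lastSeen = lastRenewalByClient.get(owner);
            if (lastSeen != null && nowMillis - lastSeen <= leaseMillis) {
                return false; // owner is still within its lease
            }
            // Otherwise the owner went silent for longer than the lease: reclaim.
        }
        ownerByFile.put(file, client);
        return true;
    }

    public static void main(String[] args) {
        LeaseLockTable server = new LeaseLockTable(90_000); // illustrative lease length
        System.out.println(server.tryLock("server.lock", "primary", 0));
        System.out.println(server.tryLock("server.lock", "backup", 30_000));
        System.out.println(server.tryLock("server.lock", "backup", 120_000));
    }
}
```

Constructing the table with Long.MAX_VALUE as the lease models a lock that never expires: the backup is refused no matter how long the primary has been silent, which is the behaviour described in this thread.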

-- 
    Vilius

-----Original Message-----
From: Gunawan, Rahman (GSFC-703.H)[Halvik Corp] <ra...@nasa.gov.INVALID> 
Sent: Monday, February 28, 2022 7:24 PM
To: users@activemq.apache.org
Subject: RE: [EXTERNAL] Re: Artemis file locking not released

I'm using NFS v3.  What is the recommended version?  I don't see the minimum NFS requirement in https://activemq.apache.org/components/artemis/documentation/2.19.0/ha.html.

Thanks

Rahman

Re: [EXTERNAL] Re: Artemis file locking not released

Posted by Justin Bertram <jb...@apache.org>.

> I'm using NFS v3.  What is the recommended version?

I'd recommend using NFS 4.1.


Justin

On Mon, Feb 28, 2022 at 11:24 AM Gunawan, Rahman (GSFC-703.H)[Halvik Corp]
<ra...@nasa.gov.invalid> wrote:

> I'm using NFS v3.  What is the recommended version?  I don't see the
> minimum NFS requirement in
> https://activemq.apache.org/components/artemis/documentation/2.19.0/ha.html
> .
>
> Thanks
>
> Rahman

RE: [EXTERNAL] Re: Artemis file locking not released

Posted by "Gunawan, Rahman (GSFC-703.H)[Halvik Corp]" <ra...@nasa.gov.INVALID>.

I'm using NFS v3.  What is the recommended version?  I don't see the minimum NFS requirement in https://activemq.apache.org/components/artemis/documentation/2.19.0/ha.html.

Thanks

Rahman

-----Original Message-----
From: Vilius Šumskas <vi...@rivile.lt> 
Sent: Monday, February 28, 2022 8:57 AM
To: users@activemq.apache.org
Subject: RE: [EXTERNAL] Re: Artemis file locking not released

Are you using NFS version 4.1, and what are your mount options?

-- 
    Vilius

RE: [EXTERNAL] Re: Artemis file locking not released

Posted by Vilius Šumskas <vi...@rivile.lt>.

Are you using NFS version 4.1, and what are your mount options?
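
If it helps to answer this, here is a small Java sketch that reads the mount table and reports the NFS version in effect for a given mount point. It assumes a Linux host where /proc/mounts is available and where the options column carries a vers= entry; the class name NfsMountInfo and the mount point /mnt/artemis are hypothetical placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper; assumes the Linux /proc/mounts layout. */
final class NfsMountInfo {

    /**
     * Parses one mount-table line (device, mount point, type, options, ...).
     * Returns the value of the vers= option if the line describes an NFS
     * mount at mountPoint, "unknown" if no vers= option is listed, and
     * empty for any other line.
     */
    static Optional<String> nfsVersion(String mountsLine, String mountPoint) {
        String[] fields = mountsLine.trim().split("\\s+");
        if (fields.length < 4
                || !fields[1].equals(mountPoint)
                || !fields[2].startsWith("nfs")) {
            return Optional.empty();
        }
        for (String option : fields[3].split(",")) {
            if (option.startsWith("vers=")) {
                return Optional.of(option.substring("vers=".length()));
            }
        }
        return Optional.of("unknown");
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; pass the real shared-store mount point as an argument.
        String mountPoint = args.length > 0 ? args[0] : "/mnt/artemis";
        for (String line : Files.readAllLines(Paths.get("/proc/mounts"))) {
            nfsVersion(line, mountPoint)
                    .ifPresent(v -> System.out.println(mountPoint + " NFS vers=" + v));
        }
    }
}
```

The full options column of the matching line is also worth posting, since settings beyond the version can affect how the client behaves when the server becomes unreachable.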

-- 
    Vilius

-----Original Message-----
From: Gunawan, Rahman (GSFC-703.H)[Halvik Corp] <ra...@nasa.gov.INVALID> 
Sent: Monday, February 28, 2022 2:55 PM
To: users@activemq.apache.org
Subject: RE: [EXTERNAL] Re: Artemis file locking not released

The backup server knew that the primary server had problem.  Below is from the log from the backup server:
ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to create netty connection: java.net.UnknownHostException

Thus, I'm thinking if the Artemis primary server lost connection to NFS or network, the backup server can detect, unlock the file and take over.  Please let me know if you have suggestions.
Thanks

Regards,
Rahman
