Posted to users@ace.apache.org by "Robert M. Mather" <ro...@gmail.com> on 2015/06/29 11:49:52 UTC

ACE server was unavailable, application was impacted strangely

Our hosting service where the ACE server runs was down for maintenance, so
our clients couldn't contact the ACE server for an extended period of time.
I've logged in to a few client sites (we have around 100) and I'm seeing
that the agents eventually blacklisted the server IP and stopped checking
there for updates, even though that's the only one we have. Once the server
came back online, they still didn't resume synchronizing with it. Is this
the correct behavior? Shouldn't the agent detect when the server is back
online and connect to it again?

I now see the "agent.discovery.checking" option, which I guess we should
set to false in the future.

The troubling part is that we have a DS component running in our client
application that pings the server periodically, but the clients all stopped
pinging after the server outage. In every log I checked, the pings stopped
immediately after the ACE agent blacklisted the server IP. The ping is just
a task running under the standard Java ScheduledExecutorService that POSTs
to our server every few minutes using the Apache HttpClient. Is it possible
that the ACE agent could interfere with that somehow? The service running
the ping task doesn't log that it got stopped or failed in any way. Other
services on the client are working normally.

After restarting a few client processes, they all reconnected to ACE
and started pinging normally again.

Re: ACE server was unavailable, application was impacted strangely

Posted by "Robert M. Mather" <ro...@gmail.com>.
On Mon, Jun 29, 2015 at 3:58 AM, Jan Willem Janssen <
janwillem.janssen@luminis.eu> wrote:

> Hi Robert,
>
> > On 29 Jun 2015, at 11:49, Robert M. Mather <ro...@gmail.com> wrote:
> >
> > Our hosting service where the ACE server runs was down for maintenance, so
> > our clients couldn't contact the ACE server for an extended period of time.
> > I've logged in to a few client sites (we have around 100) and I'm seeing
> > that the agents eventually blacklisted the server IP and stopped checking
> > there for updates, even though that's the only one we have. Once the server
> > came back online, they still didn't resume synchronizing with it. Is this
> > the correct behavior? Shouldn't the agent detect when the server is back
> > online and connect to it again?
> >
> > I now see the "agent.discovery.checking" option, which I guess we should
> > set to false in the future.
>
> IMO, this is a bug: it makes no sense to blacklist a server when there is only
> one the agent can talk to. Could you raise an issue for this on JIRA?
>

Sure, I'll file an issue. Until the bug is fixed, is there some way I can
prevent issues from occurring in the future if the ACE server becomes
unavailable again? Would setting "agent.discovery.checking=false" prevent
the blacklisting?

> (The idea of blacklisting is to create a crude form of failover: suppose you’ve
> multiple ACE servers up and running, a client could try each one of them in
> case one of them is not accessible.)
>
> > The troubling part is that we have a DS component running in our client
> > application that pings the server periodically, but the clients all stopped
> > pinging after the server outage. In every log I checked, the pings stopped
> > immediately after the ACE agent blacklisted the server IP. The ping is just
> > a task running under the standard Java ScheduledExecutorService that POSTs
> > to our server every few minutes using the Apache HttpClient. Is it possible
> > that the ACE agent could interfere with that somehow? The service running
> > the ping task doesn't log that it got stopped or failed in any way. Other
> > services on the client are working normally.
>
> How does your job obtain the server IP? Through the DiscoveryHandler of the
> agent itself? If so, then this might be the culprit as it no longer returns
> the IP of the server since it is blacklisted, and there are no alternative
> server IPs to return...
>

It's completely independent of the agent service, and I can't think of any
reason why this would happen without knowing more about the internals of
the agent.
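
One agent-independent possibility worth ruling out: ScheduledExecutorService documents that if any execution of a periodic task throws an exception, all subsequent executions are suppressed. The exception is stored in the returned future and nothing is logged, which would match pings that stop silently after the first failed request. Nothing in this thread confirms that this is what happened here. The sketch below only demonstrates the mechanism; PingSchedulerSketch, ping() and countAttempts() are hypothetical names, and ping() merely simulates a failing POST instead of calling HttpClient.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PingSchedulerSketch {

    /** Hypothetical stand-in for the HTTP POST: the second attempt fails, as if the server were down. */
    static void ping(AtomicInteger attempts) {
        if (attempts.incrementAndGet() == 2) {
            throw new IllegalStateException("simulated: server unreachable");
        }
    }

    /** Schedules the ping every periodMillis, waits waitMillis, and returns how many attempts ran. */
    static int countAttempts(boolean guarded, long periodMillis, long waitMillis) {
        AtomicInteger attempts = new AtomicInteger();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Runnable task;
        if (guarded) {
            task = () -> {
                try {
                    ping(attempts);
                } catch (RuntimeException e) {
                    // Catch and report, so the scheduler keeps running the task.
                    System.err.println("ping failed: " + e.getMessage());
                }
            };
        } else {
            // An exception escaping from here cancels every future run, silently.
            task = () -> ping(attempts);
        }
        scheduler.scheduleAtFixedRate(task, 0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(waitMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return attempts.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded attempts: " + countAttempts(false, 10, 300));
        System.out.println("guarded attempts:   " + countAttempts(true, 10, 300));
    }
}
```

The unguarded variant stops for good at the failing second attempt, while the guarded one keeps going. If the real ping task lets an exception escape from run(), wrapping its body in a try/catch that logs the failure should both keep it scheduled and show what went wrong.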

>
> HtH,
>
> --
> Met vriendelijke groeten | Kind regards
>
> Jan Willem Janssen | Software Architect
> +31 631 765 814
>
> My world is revolving around INAETICS and Amdatu
>
> Luminis Technologies B.V.
> Churchillplein 1
> 7314 BZ   Apeldoorn
> +31 88 586 46 00
>
> http://www.luminis-technologies.com
> http://www.luminis.eu
>
> KvK (CoC) 09 16 28 93
> BTW (VAT) NL8169.78.566.B.01
>
>

Re: ACE server was unavailable, application was impacted strangely

Posted by Jan Willem Janssen <ja...@luminis.eu>.
Hi Robert,

> On 29 Jun 2015, at 11:49, Robert M. Mather <ro...@gmail.com> wrote:
> 
> Our hosting service where the ACE server runs was down for maintenance, so
> our clients couldn't contact the ACE server for an extended period of time.
> I've logged in to a few client sites (we have around 100) and I'm seeing
> that the agents eventually blacklisted the server IP and stopped checking
> there for updates, even though that's the only one we have. Once the server
> came back online, they still didn't resume synchronizing with it. Is this
> the correct behavior? Shouldn't the agent detect when the server is back
> online and connect to it again?
> 
> I now see the "agent.discovery.checking" option, which I guess we should
> set to false in the future.

IMO, this is a bug: it makes no sense to blacklist a server when there is only
one the agent can talk to. Could you raise an issue for this on JIRA?

(The idea of blacklisting is to create a crude form of failover: suppose you’ve
multiple ACE servers up and running, a client could try each one of them in
case one of them is not accessible.)
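
To make the failover idea concrete, here is a minimal sketch of blacklist-based discovery that gives every server another chance once all known servers are blacklisted, so a lone server is retried after it comes back. This is not the ACE agent's actual code: FailoverDiscoverySketch, discover() and the reachable check are hypothetical names used for illustration only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class FailoverDiscoverySketch {
    private final List<String> servers;
    private final Set<String> blacklist = new HashSet<>();
    private final Predicate<String> reachable; // hypothetical connectivity check

    FailoverDiscoverySketch(List<String> servers, Predicate<String> reachable) {
        this.servers = new ArrayList<>(servers);
        this.reachable = reachable;
    }

    /** Returns the first usable server URL, or null if none answers right now. */
    String discover() {
        if (blacklist.containsAll(servers)) {
            // Nothing left to fail over to: forget the blacklist and retry everyone.
            blacklist.clear();
        }
        for (String url : servers) {
            if (blacklist.contains(url)) {
                continue; // skip servers that failed earlier
            }
            if (reachable.test(url)) {
                return url;
            }
            blacklist.add(url); // crude failover: do not try this one again for now
        }
        return null;
    }
}
```

With a single configured server, a failed check blacklists it and discover() returns null, but the next call clears the blacklist and probes it again, so the agent would reconnect once the server is back.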

> The troubling part is that we have a DS component running in our client
> application that pings the server periodically, but the clients all stopped
> pinging after the server outage. In every log I checked, the pings stopped
> immediately after the ACE agent blacklisted the server IP. The ping is just
> a task running under the standard Java ScheduledExecutorService that POSTs
> to our server every few minutes using the Apache HttpClient. Is it possible
> that the ACE agent could interfere with that somehow? The service running
> the ping task doesn't log that it got stopped or failed in any way. Other
> services on the client are working normally.

How does your job obtain the server IP? Through the DiscoveryHandler of the
agent itself? If so, then this might be the culprit as it no longer returns
the IP of the server since it is blacklisted, and there are no alternative
server IPs to return...

HtH,

--
Met vriendelijke groeten | Kind regards

Jan Willem Janssen | Software Architect
+31 631 765 814

My world is revolving around INAETICS and Amdatu

Luminis Technologies B.V.
Churchillplein 1
7314 BZ   Apeldoorn
+31 88 586 46 00

http://www.luminis-technologies.com
http://www.luminis.eu

KvK (CoC) 09 16 28 93
BTW (VAT) NL8169.78.566.B.01