You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@activemq.apache.org by clebertsuconic <gi...@git.apache.org> on 2017/08/05 05:02:09 UTC

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

GitHub user clebertsuconic opened a pull request:

    https://github.com/apache/activemq-artemis/pull/1443

    ARTEMIS-1324 Deadlock detection and health check of critical components

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/clebertsuconic/activemq-artemis critical-prototype

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/activemq-artemis/pull/1443.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1443
    
----
commit 58474c42b74d31e4727d4da58553c01533d4f4e6
Author: Clebert Suconic <cl...@apache.org>
Date:   2017-08-05T04:52:42Z

    ARTEMIS-1324 Deadlock detection and health check of critical components

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131771529
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/ActiveMQServerImpl.java ---
    @@ -479,12 +499,72 @@ public final synchronized void start() throws Exception {
           }
        }
     
    +   @Override
    +   public CriticalAnalyzer getCriticalAnalyzer() {
    +      return this.analyzer;
    +   }
    +
        private void internalStart() throws Exception {
           if (state != SERVER_STATE.STOPPED) {
              logger.debug("Server already started!");
              return;
           }
     
    +      /** Calling this for cases where the server was stopped and now is being restarted... failback, etc...*/
    +      this.analyzer.clear();
    +
    +      this.getCriticalAnalyzer().setCheckTime(configuration.getCriticalAnalyzerCheckPeriod()).setTimeout(configuration.getCriticalAnalyzerTimeout());
    +
    +      if (configuration.isCriticalAnalyzer()) {
    +         this.getCriticalAnalyzer().start();
    +      }
    +
    +      this.getCriticalAnalyzer().addAction((CriticalComponent c) -> {
    +         ActiveMQServerLogger.LOGGER.irresponsiveComponent(c);
    +         threadDump();
    +
    +         // on the case of a critical failure, -1 cannot simply means forever.
    +         // in case graceful is -1, we will set it to 30 seconds
    +         long timeout = configuration.getGracefulShutdownTimeout() < 0 ? 30000 : configuration.getGracefulShutdownTimeout();
    +
    +         Thread notificationSender = new Thread() {
    +            public void run () {
    --- End diff --
    
    should this need an Overrides?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131530870
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/ActiveMQServerImpl.java ---
    @@ -477,12 +484,44 @@ public final synchronized void start() throws Exception {
           }
        }
     
    +   @Override
    +   public CriticalAnalyzer getCriticalAnalyzer() {
    +      return this.analyzer;
    +   }
    +
        private void internalStart() throws Exception {
           if (state != SERVER_STATE.STOPPED) {
              logger.debug("Server already started!");
              return;
           }
     
    +      if (!configuration.isAnalyzeCritical()) {
    +         this.analyzer = new EmptyCriticalAnalyzer();
    +      }
    +
    +      /** Calling this for cases where the server was stopped and now is being restarted... failback, etc...*/
    +      this.analyzer.clear();
    +
    +      this.getCriticalAnalyzer().setCheckTime(configuration.getAnalyzeCriticalTimeout()).setTimeout(configuration.getAnalyzeCriticalTimeout());
    --- End diff --
    
    again why not do this in the constructor?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131532658
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    --- End diff --
    
    I don't need it to be atomic really. as soon as the dead lock stops updating this.. it would be enough to me to just have a volatile here.
    
    
    The only overhead is the call to System.currentTimeMillis(); I did some tests and didn't see any regressions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131773683
  
    --- Diff: docs/user-manual/en/critical-analysis.md ---
    @@ -0,0 +1,32 @@
    +# Critical Analysis of the broker
    +
    +There are a few things that can go wrong on a production environment:
    +
    +- Bugs, for more than we try they still happen! We always try to correct them, but that's the only constant in software development.
    +- IO Errors, disks and hardware can go bad
    +- Memory issues, the CPU can go crazy by another process
    +
    +For cases like this, we added a protection to the broker to shut itself down when bad things happen.
    +
    +We measure time response in places like:
    +
    +- Queue delivery (add to the queue)
    +- Journal storage
    +- Paging operations
    +
    --- End diff --
    
    something for later btw, just a thought.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131770829
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -294,6 +294,14 @@
     
        private String internalNamingPrefix = ActiveMQDefaultConfiguration.getInternalNamingPrefix();
     
    +   private boolean criticalAnalyzer = ActiveMQDefaultConfiguration.getCriticalAnalyzer();
    +
    +   private boolean criticalAnalyzerHalt = ActiveMQDefaultConfiguration.getCriticalAnalyzerHalt();
    +
    +   private long criticalAnalyzerTimeout = ActiveMQDefaultConfiguration.getCriticalAnalyzerTimeout();
    +
    +   private long criticalAnalyzerCheckPeriod = 0; // non set
    --- End diff --
    
    i was meaning make a static contstant ANALYZE_CHECK_PERIOD_NOT_SET=0 and use that, it make it less of a magic number.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131763449
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    +
    +   private volatile long checkTime;
    +
    +   @Override
    +   public void clear() {
    +      actions.clear();
    +      components.clear();
    +   }
    +
    +   private CopyOnWriteArrayList<CriticalAction> actions = new CopyOnWriteArrayList<>();
    +
    +   private Thread thread;
    +
    +   private final Semaphore running = new Semaphore(1);
    +
    +   private final ConcurrentHashSet<CriticalComponent> components = new ConcurrentHashSet<>();
    +
    +   @Override
    +   public void add(CriticalComponent component) {
    +      components.add(component);
    +   }
    +
    +   @Override
    +   public void remove(CriticalComponent component) {
    +      components.remove(component);
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setCheckTime(long timeout) {
    +      this.checkTime = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCheckTime() {
    +      if (checkTime == 0) {
    +         checkTime = getTimeout() / 2;
    +      }
    +      return checkTime;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setTimeout(long timeout) {
    +      this.timeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getTimeout() {
    +      if (timeout == 0) {
    +         timeout = TimeUnit.MINUTES.toMillis(2);
    +      }
    +      return timeout;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer addAction(CriticalAction action) {
    +      this.actions.add(action);
    +      return this;
    +   }
    +
    +   @Override
    +   public void check() {
    +      boolean retry = true;
    +      while (retry) {
    +         try {
    +            for (CriticalComponent component : components) {
    +
    +               int pathReturned = component.isExpired(timeout);
    +               if (pathReturned >= 0) {
    +                  fireAction(component, pathReturned);
    +                  // no need to keep running if there's already a component failed
    +                  return;
    +               }
    +            }
    +            retry = false; // got to the end of the list, no need to retry
    +         } catch (ConcurrentModificationException dontCare) {
    +            // lets retry on the loop
    +         }
    +      }
    +   }
    +
    +   private void fireAction(CriticalComponent component, int path) {
    +      for (CriticalAction action: actions) {
    +         try {
    +            action.run(component, path);
    +         } catch (Throwable e) {
    +            logger.warn(e.getMessage(), e);
    +         }
    +      }
    +   }
    +
    +   @Override
    +   public void start() {
    +
    +      if (!running.tryAcquire()) {
    +         // already running
    +         return;
    +      }
    +
    +      // we are not using any Thread Pool or any Scheduled Executors from the ArtemisServer
    +      // as that would defeat the purpose,
    +      // as in any deadlocks the schedulers may be starving for something not responding fast enough
    +      thread = new Thread("Artemis Critical Analyzer") {
    +         @Override
    +         public void run() {
    +            try {
    +               while (true) {
    +                  if (running.tryAcquire(getCheckTime(), TimeUnit.MILLISECONDS)) {
    +                     running.release();
    +                     // this means that the server has been stopped as we could acquire the semaphore... returning now
    +                     break;
    +                  }
    +                  check();
    +               }
    +            } catch (InterruptedException interrupted) {
    +               // i will just leave on that case
    +            }
    +         }
    +      };
    +
    +      thread.setDaemon(true);
    +
    +      thread.start();
    +   }
    +
    +   @Override
    +   public void stop() {
    --- End diff --
    
    this makes sense a bit more now, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131541803
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    +
    +   private volatile long checkTime;
    +
    +   @Override
    +   public void clear() {
    +      actions.clear();
    +      components.clear();
    +   }
    +
    +   private CopyOnWriteArrayList<CriticalAction> actions = new CopyOnWriteArrayList<>();
    +
    +   private Thread thread;
    +
    +   private final Semaphore running = new Semaphore(1);
    +
    +   private final ConcurrentHashSet<CriticalComponent> components = new ConcurrentHashSet<>();
    +
    +   @Override
    +   public void add(CriticalComponent component) {
    +      components.add(component);
    +   }
    +
    +   @Override
    +   public void remove(CriticalComponent component) {
    +      components.remove(component);
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setCheckTime(long timeout) {
    +      this.checkTime = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCheckTime() {
    +      if (checkTime == 0) {
    +         checkTime = getTimeout() / 2;
    +      }
    +      return checkTime;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setTimeout(long timeout) {
    +      this.timeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getTimeout() {
    +      if (timeout == 0) {
    +         timeout = TimeUnit.MINUTES.toMillis(2);
    +      }
    +      return timeout;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer addAction(CriticalAction action) {
    +      this.actions.add(action);
    +      return this;
    +   }
    +
    +   @Override
    +   public void check() {
    +      boolean retry = true;
    +      while (retry) {
    +         try {
    +            for (CriticalComponent component : components) {
    +
    +               int pathReturned = component.isExpired(timeout);
    +               if (pathReturned >= 0) {
    +                  fireAction(component, pathReturned);
    +                  // no need to keep running if there's already a component failed
    +                  return;
    +               }
    +            }
    +            retry = false; // got to the end of the list, no need to retry
    +         } catch (ConcurrentModificationException dontCare) {
    +            // lets retry on the loop
    +         }
    +      }
    +   }
    +
    +   private void fireAction(CriticalComponent component, int path) {
    +      for (CriticalAction action: actions) {
    +         try {
    +            action.run(component, path);
    +         } catch (Throwable e) {
    +            logger.warn(e.getMessage(), e);
    +         }
    +      }
    +   }
    +
    +   @Override
    +   public void start() {
    +
    +      if (!running.tryAcquire()) {
    +         // already running
    +         return;
    +      }
    +
    +      // we are not using any Thread Pool or any Scheduled Executors from the ArtemisServer
    +      // as that would defeat the purpose,
    +      // as in any deadlocks the schedulers may be starving for something not responding fast enough
    +      thread = new Thread("Artemis Critical Analyzer") {
    +         @Override
    +         public void run() {
    +            try {
    +               while (true) {
    +                  if (running.tryAcquire(getCheckTime(), TimeUnit.MILLISECONDS)) {
    +                     running.release();
    +                     // this means that the server has been stopped as we could acquire the semaphore... returning now
    +                     break;
    +                  }
    +                  check();
    +               }
    +            } catch (InterruptedException interrupted) {
    +               // i will just leave on that case
    +            }
    +         }
    +      };
    +
    +      thread.setDaemon(true);
    +
    +      thread.start();
    +   }
    +
    +   @Override
    +   public void stop() {
    --- End diff --
    
    Nope.  It's up to the components to measure themselves.  
    
    
    The analyzer just look at the times. (The volatile ones). It starts a thread. And stop it when server is stopped. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131530423
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/deployers/impl/FileConfigurationParser.java ---
    @@ -608,6 +608,14 @@ public void parseMainConfig(final Element e, final Configuration config) throws
     
           config.setNetworkCheckPingCommand(getString(e, "network-check-ping-command", config.getNetworkCheckPingCommand(), Validators.NO_CHECK));
     
    +      config.setAnalyzeCritical(getBoolean(e, "analyze-critical", config.isAnalyzeCritical()));
    +
    +      config.setAnalyzeCriticalTimeout(getLong(e, "analyze-critical-timeout", config.getAnalyzeCriticalTimeout(), Validators.GE_ZERO));
    +
    +      config.setAnalyzeCriticalCheckPeriod(getLong(e, "analyze-critical-check-period", config.getAnalyzeCriticalCheckPeriod(), Validators.GE_ZERO));
    --- End diff --
    
    as per other review comment should the default not be half the timeout  if it is not set, not some arbitarty number which then forces the section of code later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131530398
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -2064,6 +2072,53 @@ public Configuration setNetworkCheckPing6Command(String command) {
           return this;
        }
     
    +   @Override
    +   public boolean isAnalyzeCritical() {
    +      return analyzeCritical;
    +   }
    +
    +   @Override
    +   public Configuration setAnalyzeCritical(boolean analyzeCritical) {
    +      this.analyzeCritical = analyzeCritical;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getAnalyzeCriticalTimeout() {
    +      return analyzeCriticalTimeout;
    +   }
    +
    +   @Override
    +   public Configuration setAnalyzeCriticalTimeout(long timeout) {
    +      this.analyzeCriticalTimeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getAnalyzeCriticalCheckPeriod() {
    +      if (analyzeCriticalCheckPeriod == ActiveMQDefaultConfiguration.getAnalyzeCriticalCheckPeriod()) {
    --- End diff --
    
    how would you differentiate between it being the default and someone actually have set it, should this defaulting not occur in config/broker.xml reading where if null then analyzeCriticalCheckPeriod is defaulted to half timeout?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131534309
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    --- End diff --
    
    Is this one of these values not being updated by one thread and read by another then? I may have miss understood the code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis issue #1443: ARTEMIS-1324 Deadlock detection and health che...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/1443
  
    @clebertsuconic I've left more code nits, and user feedback around the docs. 
    
    Not really able to put the code through its paces, no doubt won't be able to do that till it hits a real environment, it seems reasonable the pseudo logic you describe. 
    
    Also the test cases added look like they cover some good scenarios.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131765096
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -294,6 +294,14 @@
     
        private String internalNamingPrefix = ActiveMQDefaultConfiguration.getInternalNamingPrefix();
     
    +   private boolean criticalAnalyzer = ActiveMQDefaultConfiguration.getCriticalAnalyzer();
    +
    +   private boolean criticalAnalyzerHalt = ActiveMQDefaultConfiguration.getCriticalAnalyzerHalt();
    +
    +   private long criticalAnalyzerTimeout = ActiveMQDefaultConfiguration.getCriticalAnalyzerTimeout();
    +
    +   private long criticalAnalyzerCheckPeriod = 0; // non set
    --- End diff --
    
    should this not be criticalAnalyzerCheckPeriod = ActiveMQDefaultConfiguration.getCriticalAnalyzerCheckPeriod(criticalAnalyzerTimeout)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131532685
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/ActiveMQServerImpl.java ---
    @@ -477,12 +484,44 @@ public final synchronized void start() throws Exception {
           }
        }
     
    +   @Override
    +   public CriticalAnalyzer getCriticalAnalyzer() {
    +      return this.analyzer;
    +   }
    +
        private void internalStart() throws Exception {
           if (state != SERVER_STATE.STOPPED) {
              logger.debug("Server already started!");
              return;
           }
     
    +      if (!configuration.isAnalyzeCritical()) {
    --- End diff --
    
    I thought the configuration was being parsed inside ActiveMQServerImpl. (Bad memory, I forget about stuff months after working on stuff :) )... I will move it to constructor.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131532700
  
    --- Diff: docs/user-manual/en/critical-analysis.md ---
    @@ -0,0 +1,32 @@
    +# Critical Analysis of the broker
    +
    +There are a few things that can go wrong on a production environment:
    +
    +- Bugs, for more than we try they still happen! We always try to correct them, but that's the only constant in software development.
    +- IO Errors, disks and hardware can go bad
    +- Memory issues, the CPU can go crazy by another process
    +
    +For cases like this, we added a protection to the broker to shut itself down when bad things happen.
    +
    +We measure time response in places like:
    +
    +- Queue delivery (add to the queue)
    +- Journal storage
    +- Paging operations
    +
    +If the response time goes beyond a configured timeout, the broker is considered unstable and an action will be taken to either shutdown the broker or halt the VM.
    +
    +You can use these following configuration options on broker.xml to configure how the critical analysis is performed.
    +
    +
    +Name | Description
    +:--- | :---
    +analyze-critical | Enable or disable the critical analysis (default true)
    +analyze-critical-timeout | Timeout used to do the critical analysis (default 120000 milliseconds)
    +analyze-critical-check-period | Time used to check the response times (default half of analyze-critical-timeout)
    +analyze-critical-halt | Should the VM be halted upon failures (default false)
    --- End diff --
    
    The idea here was to add monitoring on internal objects similar to what we do on pings / pongs through the protocols. I didn't want to make it any more complex than needed.
    
    This subject could be raised to become an independent project if you keep investing here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131532619
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/deployers/impl/FileConfigurationParser.java ---
    @@ -608,6 +608,14 @@ public void parseMainConfig(final Element e, final Configuration config) throws
     
           config.setNetworkCheckPingCommand(getString(e, "network-check-ping-command", config.getNetworkCheckPingCommand(), Validators.NO_CHECK));
     
    +      config.setAnalyzeCritical(getBoolean(e, "analyze-critical", config.isAnalyzeCritical()));
    +
    +      config.setAnalyzeCriticalTimeout(getLong(e, "analyze-critical-timeout", config.getAnalyzeCriticalTimeout(), Validators.GE_ZERO));
    +
    +      config.setAnalyzeCriticalCheckPeriod(getLong(e, "analyze-critical-check-period", config.getAnalyzeCriticalCheckPeriod(), Validators.GE_ZERO));
    --- End diff --
    
    I didn't want to use null here. On this case, check period=0 would mean not set.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/activemq-artemis/pull/1443


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis issue #1443: ARTEMIS-1324 Deadlock detection and health che...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on the issue:

    https://github.com/apache/activemq-artemis/pull/1443
  
    I will make some changes here tomorrow. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131532564
  
    --- Diff: docs/user-manual/en/critical-analysis.md ---
    @@ -0,0 +1,32 @@
    +# Critical Analysis of the broker
    +
    +There are a few things that can go wrong on a production environment:
    +
    +- Bugs, for more than we try they still happen! We always try to correct them, but that's the only constant in software development.
    +- IO Errors, disks and hardware can go bad
    +- Memory issues, the CPU can go crazy by another process
    +
    +For cases like this, we added a protection to the broker to shut itself down when bad things happen.
    +
    +We measure time response in places like:
    +
    +- Queue delivery (add to the queue)
    +- Journal storage
    +- Paging operations
    +
    +If the response time goes beyond a configured timeout, the broker is considered unstable and an action will be taken to either shutdown the broker or halt the VM.
    +
    +You can use these following configuration options on broker.xml to configure how the critical analysis is performed.
    +
    +
    +Name | Description
    +:--- | :---
    +analyze-critical | Enable or disable the critical analysis (default true)
    +analyze-critical-timeout | Timeout used to do the critical analysis (default 120000 milliseconds)
    +analyze-critical-check-period | Time used to check the response times (default half of analyze-critical-timeout)
    +analyze-critical-halt | Should the VM be halted upon failures (default false)
    --- End diff --
    
    Shutdown. Server.stop.  But there is no guarantee it would work on deadlocks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131530746
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    --- End diff --
    
    if this is on hot path, why not use an atomic or have use atomicfieldupdater (same with check time) 
    http://normanmaurer.me/blog/2013/10/28/Lesser-known-concurrent-classes-Part-1/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131769055
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -2064,6 +2072,53 @@ public Configuration setNetworkCheckPing6Command(String command) {
           return this;
        }
     
    +   @Override
    +   public boolean isCriticalAnalyzer() {
    +      return criticalAnalyzer;
    +   }
    +
    +   @Override
    +   public Configuration setCriticalAnalyzer(boolean CriticalAnalyzer) {
    +      this.criticalAnalyzer = CriticalAnalyzer;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCriticalAnalyzerTimeout() {
    +      return criticalAnalyzerTimeout;
    +   }
    +
    +   @Override
    +   public Configuration setCriticalAnalyzerTimeout(long timeout) {
    +      this.criticalAnalyzerTimeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCriticalAnalyzerCheckPeriod() {
    +      if (criticalAnalyzerCheckPeriod <= 0) {
    --- End diff --
    
    I can always merge what I have now.. and make changes later :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131766104
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -2064,6 +2072,53 @@ public Configuration setNetworkCheckPing6Command(String command) {
           return this;
        }
     
    +   @Override
    +   public boolean isCriticalAnalyzer() {
    +      return criticalAnalyzer;
    +   }
    +
    +   @Override
    +   public Configuration setCriticalAnalyzer(boolean CriticalAnalyzer) {
    +      this.criticalAnalyzer = CriticalAnalyzer;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCriticalAnalyzerTimeout() {
    +      return criticalAnalyzerTimeout;
    +   }
    +
    +   @Override
    +   public Configuration setCriticalAnalyzerTimeout(long timeout) {
    +      this.criticalAnalyzerTimeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCriticalAnalyzerCheckPeriod() {
    +      if (criticalAnalyzerCheckPeriod <= 0) {
    --- End diff --
    
    ditto i assume 0 should be constant DEFAULT_ANALYZE_CHECK_PERIOD


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131762073
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalComponentImpl.java ---
    @@ -0,0 +1,68 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.activemq.artemis.utils.critical;
    +
    +/**
    + * This is not abstract as it could be used through aggregations or extensions.
    + * This is only good for cases where you call leave within the same thread as you called enter.
    + */
    +public class CriticalComponentImpl implements CriticalComponent {
    +
    +   private final CriticalMeasure[] measures;
    +   private final CriticalAnalyzer analyzer;
    +
    +   public CriticalComponentImpl(CriticalAnalyzer analyzer, int numberOfPaths) {
    +      if (analyzer == null) {
    +         analyzer = EmptyCriticalAnalyzer.getInstance();
    +      }
    +      this.analyzer = analyzer;
    +
    +      if (analyzer.isMeasuring()) {
    +         measures = new CriticalMeasure[numberOfPaths];
    +         for (int i = 0; i < numberOfPaths; i++) {
    +            measures[i] = new CriticalMeasure();
    +         }
    +      } else {
    +         measures = null;
    +      }
    +   }
    +
    +   @Override
    +   public void enterCritical(int path) {
    +      if (analyzer.isMeasuring()) {
    +         measures[path].enterCritical();
    +      }
    +
    --- End diff --
    
    nit: empty line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131534319
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -2064,6 +2072,53 @@ public Configuration setNetworkCheckPing6Command(String command) {
           return this;
        }
     
    +   @Override
    +   public boolean isAnalyzeCritical() {
    +      return analyzeCritical;
    +   }
    +
    +   @Override
    +   public Configuration setAnalyzeCritical(boolean analyzeCritical) {
    +      this.analyzeCritical = analyzeCritical;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getAnalyzeCriticalTimeout() {
    +      return analyzeCriticalTimeout;
    +   }
    +
    +   @Override
    +   public Configuration setAnalyzeCriticalTimeout(long timeout) {
    +      this.analyzeCriticalTimeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getAnalyzeCriticalCheckPeriod() {
    +      if (analyzeCriticalCheckPeriod == ActiveMQDefaultConfiguration.getAnalyzeCriticalCheckPeriod()) {
    --- End diff --
    
    I was more getting at should the defaulting logic of using the timeout if the value is not set live in the config parser?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131769469
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    +
    +   private volatile long checkTime;
    +
    +   @Override
    +   public void clear() {
    +      actions.clear();
    +      components.clear();
    +   }
    +
    +   private CopyOnWriteArrayList<CriticalAction> actions = new CopyOnWriteArrayList<>();
    +
    +   private Thread thread;
    +
    +   private final Semaphore running = new Semaphore(1);
    +
    +   private final ConcurrentHashSet<CriticalComponent> components = new ConcurrentHashSet<>();
    +
    +   @Override
    +   public void add(CriticalComponent component) {
    +      components.add(component);
    +   }
    +
    +   @Override
    +   public void remove(CriticalComponent component) {
    +      components.remove(component);
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setCheckTime(long timeout) {
    +      this.checkTime = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCheckTime() {
    +      if (checkTime == 0) {
    +         checkTime = getTimeout() / 2;
    +      }
    +      return checkTime;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setTimeout(long timeout) {
    +      this.timeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getTimeout() {
    +      if (timeout == 0) {
    +         timeout = TimeUnit.MINUTES.toMillis(2);
    +      }
    +      return timeout;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer addAction(CriticalAction action) {
    +      this.actions.add(action);
    +      return this;
    +   }
    +
    +   @Override
    +   public void check() {
    +      boolean retry = true;
    +      while (retry) {
    +         try {
    +            for (CriticalComponent component : components) {
    +
    +               int pathReturned = component.isExpired(timeout);
    +               if (pathReturned >= 0) {
    +                  fireAction(component, pathReturned);
    +                  // no need to keep running if there's already a component failed
    +                  return;
    +               }
    +            }
    +            retry = false; // got to the end of the list, no need to retry
    +         } catch (ConcurrentModificationException dontCare) {
    +            // lets retry on the loop
    +         }
    +      }
    +   }
    +
    +   private void fireAction(CriticalComponent component, int path) {
    +      for (CriticalAction action: actions) {
    +         try {
    +            action.run(component, path);
    +         } catch (Throwable e) {
    +            logger.warn(e.getMessage(), e);
    +         }
    +      }
    +   }
    +
    +   @Override
    +   public void start() {
    +
    +      if (!running.tryAcquire()) {
    +         // already running
    +         return;
    +      }
    +
    +      // we are not using any Thread Pool or any Scheduled Executors from the ArtemisServer
    +      // as that would defeat the purpose,
    +      // as in any deadlocks the schedulers may be starving for something not responding fast enough
    +      thread = new Thread("Artemis Critical Analyzer") {
    +         @Override
    +         public void run() {
    +            try {
    +               while (true) {
    +                  if (running.tryAcquire(getCheckTime(), TimeUnit.MILLISECONDS)) {
    +                     running.release();
    +                     // this means that the server has been stopped as we could acquire the semaphore... returning now
    +                     break;
    +                  }
    +                  check();
    +               }
    +            } catch (InterruptedException interrupted) {
    +               // i will just leave on that case
    +            }
    +         }
    +      };
    +
    +      thread.setDaemon(true);
    +
    +      thread.start();
    +   }
    +
    +   @Override
    +   public void stop() {
    --- End diff --
    
    Management wasn't good for this. it has locks on critical components.. I'm adding a broker plugin for this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131769789
  
    --- Diff: docs/user-manual/en/critical-analysis.md ---
    @@ -0,0 +1,32 @@
    +# Critical Analysis of the broker
    +
    +There are a few things that can go wrong on a production environment:
    +
    +- Bugs, for more than we try they still happen! We always try to correct them, but that's the only constant in software development.
    +- IO Errors, disks and hardware can go bad
    +- Memory issues, the CPU can go crazy by another process
    +
    +For cases like this, we added a protection to the broker to shut itself down when bad things happen.
    +
    +We measure time response in places like:
    +
    +- Queue delivery (add to the queue)
    +- Journal storage
    +- Paging operations
    +
    --- End diff --
    
    I was just thinking the averages and percentiles of this, would make great things to be monitoring when healthy.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131771146
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -2064,6 +2072,53 @@ public Configuration setNetworkCheckPing6Command(String command) {
           return this;
        }
     
    +   @Override
    +   public boolean isCriticalAnalyzer() {
    +      return criticalAnalyzer;
    +   }
    +
    +   @Override
    +   public Configuration setCriticalAnalyzer(boolean CriticalAnalyzer) {
    +      this.criticalAnalyzer = CriticalAnalyzer;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCriticalAnalyzerTimeout() {
    +      return criticalAnalyzerTimeout;
    +   }
    +
    +   @Override
    +   public Configuration setCriticalAnalyzerTimeout(long timeout) {
    +      this.criticalAnalyzerTimeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCriticalAnalyzerCheckPeriod() {
    +      if (criticalAnalyzerCheckPeriod <= 0) {
    --- End diff --
    
    agreed what is there is fine, these are only code nits


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131552826
  
    --- Diff: docs/user-manual/en/critical-analysis.md ---
    @@ -0,0 +1,32 @@
    +# Critical Analysis of the broker
    +
    +There are a few things that can go wrong on a production environment:
    +
    +- Bugs, for more than we try they still happen! We always try to correct them, but that's the only constant in software development.
    +- IO Errors, disks and hardware can go bad
    +- Memory issues, the CPU can go crazy by another process
    +
    +For cases like this, we added a protection to the broker to shut itself down when bad things happen.
    +
    +We measure time response in places like:
    +
    +- Queue delivery (add to the queue)
    +- Journal storage
    +- Paging operations
    +
    +If the response time goes beyond a configured timeout, the broker is considered unstable and an action will be taken to either shutdown the broker or halt the VM.
    +
    +You can use these following configuration options on broker.xml to configure how the critical analysis is performed.
    +
    +
    +Name | Description
    +:--- | :---
    +analyze-critical | Enable or disable the critical analysis (default true)
    +analyze-critical-timeout | Timeout used to do the critical analysis (default 120000 milliseconds)
    +analyze-critical-check-period | Time used to check the response times (default half of analyze-critical-timeout)
    +analyze-critical-halt | Should the VM be halted upon failures (default false)
    --- End diff --
    
    i would imagine a health monitor/critical analysers threads are independent of core flow, as such it should detect and action if a component isn't responding or non healthy state. As such it should be able to declare that state back. 
    
    e.g. if JVM bad state due to deadlock, that deadlock should only be affecting the threads/flow where the deadlock is , as such this component should still be servicing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131766464
  
    --- Diff: docs/user-manual/en/critical-analysis.md ---
    @@ -0,0 +1,32 @@
    +# Critical Analysis of the broker
    +
    +There are a few things that can go wrong on a production environment:
    +
    +- Bugs, for more than we try they still happen! We always try to correct them, but that's the only constant in software development.
    +- IO Errors, disks and hardware can go bad
    +- Memory issues, the CPU can go crazy by another process
    +
    +For cases like this, we added a protection to the broker to shut itself down when bad things happen.
    +
    +We measure time response in places like:
    +
    +- Queue delivery (add to the queue)
    +- Journal storage
    +- Paging operations
    +
    +If the response time goes beyond a configured timeout, the broker is considered unstable and an action will be taken to either shutdown the broker or halt the VM.
    +
    +You can use these following configuration options on broker.xml to configure how the critical analysis is performed.
    +
    +
    +Name | Description
    +:--- | :---
    +critical-analyzer | Enable or disable the critical analysis (default true)
    --- End diff --
    
    maybe be explicit in what value enable and disable should take - as in:
    
    Enable (true) or disable (false) 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131711379
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    +
    +   private volatile long checkTime;
    +
    +   @Override
    +   public void clear() {
    +      actions.clear();
    +      components.clear();
    +   }
    +
    +   private CopyOnWriteArrayList<CriticalAction> actions = new CopyOnWriteArrayList<>();
    +
    +   private Thread thread;
    +
    +   private final Semaphore running = new Semaphore(1);
    +
    +   private final ConcurrentHashSet<CriticalComponent> components = new ConcurrentHashSet<>();
    +
    +   @Override
    +   public void add(CriticalComponent component) {
    +      components.add(component);
    +   }
    +
    +   @Override
    +   public void remove(CriticalComponent component) {
    +      components.remove(component);
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setCheckTime(long timeout) {
    +      this.checkTime = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCheckTime() {
    +      if (checkTime == 0) {
    +         checkTime = getTimeout() / 2;
    +      }
    +      return checkTime;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setTimeout(long timeout) {
    +      this.timeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getTimeout() {
    +      if (timeout == 0) {
    +         timeout = TimeUnit.MINUTES.toMillis(2);
    +      }
    +      return timeout;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer addAction(CriticalAction action) {
    +      this.actions.add(action);
    +      return this;
    +   }
    +
    +   @Override
    +   public void check() {
    +      boolean retry = true;
    +      while (retry) {
    +         try {
    +            for (CriticalComponent component : components) {
    +
    +               int pathReturned = component.isExpired(timeout);
    +               if (pathReturned >= 0) {
    +                  fireAction(component, pathReturned);
    +                  // no need to keep running if there's already a component failed
    +                  return;
    +               }
    +            }
    +            retry = false; // got to the end of the list, no need to retry
    +         } catch (ConcurrentModificationException dontCare) {
    +            // lets retry on the loop
    +         }
    +      }
    +   }
    +
    +   private void fireAction(CriticalComponent component, int path) {
    +      for (CriticalAction action: actions) {
    +         try {
    +            action.run(component, path);
    +         } catch (Throwable e) {
    +            logger.warn(e.getMessage(), e);
    +         }
    +      }
    +   }
    +
    +   @Override
    +   public void start() {
    +
    +      if (!running.tryAcquire()) {
    +         // already running
    +         return;
    +      }
    +
    +      // we are not using any Thread Pool or any Scheduled Executors from the ArtemisServer
    +      // as that would defeat the purpose,
    +      // as in any deadlocks the schedulers may be starving for something not responding fast enough
    +      thread = new Thread("Artemis Critical Analyzer") {
    +         @Override
    +         public void run() {
    +            try {
    +               while (true) {
    +                  if (running.tryAcquire(getCheckTime(), TimeUnit.MILLISECONDS)) {
    +                     running.release();
    +                     // this means that the server has been stopped as we could acquire the semaphore... returning now
    +                     break;
    +                  }
    +                  check();
    +               }
    +            } catch (InterruptedException interrupted) {
    +               // i will just leave on that case
    +            }
    +         }
    +      };
    +
    +      thread.setDaemon(true);
    +
    +      thread.start();
    +   }
    +
    +   @Override
    +   public void stop() {
    --- End diff --
    
    this is how it works:
    
    CriticalAnalyzer has a daemon thread, that will look for expired components...
    
    A component provides a method isExpired(long timeout)::boolean
    
    the critical analyzer will call isExpired.. if true, an action is taken...
    
    
    there's a default implementation right now that you can provide enter / leave critical, and all it does is set System.currenttimeMillis on enter and another on leave...
    
    all the implementations are calling leave within a finally.
    
    
    if a component is taking more than the timeout (default to 2 minutes) to respond, then there is an issue and the server needs to go with server.stop() or Vm.halt.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131765381
  
    --- Diff: artemis-core-client/src/main/java/org/apache/activemq/artemis/api/config/ActiveMQDefaultConfiguration.java ---
    @@ -477,6 +477,16 @@ public static String getDefaultHapolicyBackupStrategy() {
     
        public static int DEFAULT_QUORUM_SIZE = -1;
     
    +   public static final boolean DEFAULT_ANALYZE_CRITICAL = true;
    +
    +   public static final long DEFAULT_ANALYZE_CRITICAL_TIMEOUT = 120000;
    +
    +   // this will be 0, the implementation should return 1/2 of the configured critical timeout
    +   public static final long DEFAULT_ANALYZE_CHECK_PERIOD = 0;
    --- End diff --
    
    nope.. will remove it


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131541817
  
    --- Diff: docs/user-manual/en/critical-analysis.md ---
    @@ -0,0 +1,32 @@
    +# Critical Analysis of the broker
    +
    +There are a few things that can go wrong on a production environment:
    +
    +- Bugs, for more than we try they still happen! We always try to correct them, but that's the only constant in software development.
    +- IO Errors, disks and hardware can go bad
    +- Memory issues, the CPU can go crazy by another process
    +
    +For cases like this, we added a protection to the broker to shut itself down when bad things happen.
    +
    +We measure time response in places like:
    +
    +- Queue delivery (add to the queue)
    +- Journal storage
    +- Paging operations
    +
    +If the response time goes beyond a configured timeout, the broker is considered unstable and an action will be taken to either shutdown the broker or halt the VM.
    +
    +You can use these following configuration options on broker.xml to configure how the critical analysis is performed.
    +
    +
    +Name | Description
    +:--- | :---
    +analyze-critical | Enable or disable the critical analysis (default true)
    +analyze-critical-timeout | Timeout used to do the critical analysis (default 120000 milliseconds)
    +analyze-critical-check-period | Time used to check the response times (default half of analyze-critical-timeout)
    +analyze-critical-halt | Should the VM be halted upon failures (default false)
    --- End diff --
    
    I can use management.  
    
    I wasn't sure it would be useful. As if the JVM is in bad state would the message arrive ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131762773
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalComponentImpl.java ---
    @@ -0,0 +1,68 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.activemq.artemis.utils.critical;
    +
    +/**
    + * This is not abstract as it could be used through aggregations or extensions.
    + * This is only good for cases where you call leave within the same thread as you called enter.
    + */
    +public class CriticalComponentImpl implements CriticalComponent {
    +
    +   private final CriticalMeasure[] measures;
    +   private final CriticalAnalyzer analyzer;
    +
    +   public CriticalComponentImpl(CriticalAnalyzer analyzer, int numberOfPaths) {
    +      if (analyzer == null) {
    +         analyzer = EmptyCriticalAnalyzer.getInstance();
    +      }
    +      this.analyzer = analyzer;
    +
    +      if (analyzer.isMeasuring()) {
    +         measures = new CriticalMeasure[numberOfPaths];
    +         for (int i = 0; i < numberOfPaths; i++) {
    +            measures[i] = new CriticalMeasure();
    +         }
    +      } else {
    +         measures = null;
    +      }
    +   }
    +
    +   @Override
    +   public void enterCritical(int path) {
    +      if (analyzer.isMeasuring()) {
    +         measures[path].enterCritical();
    +      }
    +
    --- End diff --
    
    oops.. thx


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131764621
  
    --- Diff: artemis-core-client/src/main/java/org/apache/activemq/artemis/api/config/ActiveMQDefaultConfiguration.java ---
    @@ -477,6 +477,16 @@ public static String getDefaultHapolicyBackupStrategy() {
     
        public static int DEFAULT_QUORUM_SIZE = -1;
     
    +   public static final boolean DEFAULT_ANALYZE_CRITICAL = true;
    +
    +   public static final long DEFAULT_ANALYZE_CRITICAL_TIMEOUT = 120000;
    +
    +   // this will be 0, the implementation should return 1/2 of the configured critical timeout
    +   public static final long DEFAULT_ANALYZE_CHECK_PERIOD = 0;
    --- End diff --
    
    with the logic below now of timeout / 2, is this used?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131765975
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -294,6 +294,14 @@
     
        private String internalNamingPrefix = ActiveMQDefaultConfiguration.getInternalNamingPrefix();
     
    +   private boolean criticalAnalyzer = ActiveMQDefaultConfiguration.getCriticalAnalyzer();
    +
    +   private boolean criticalAnalyzerHalt = ActiveMQDefaultConfiguration.getCriticalAnalyzerHalt();
    +
    +   private long criticalAnalyzerTimeout = ActiveMQDefaultConfiguration.getCriticalAnalyzerTimeout();
    +
    +   private long criticalAnalyzerCheckPeriod = 0; // non set
    --- End diff --
    
    I guess 0 should be a reference to constant DEFAULT_ANALYZE_CHECK_PERIOD i called out earlier as having no reference.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131534303
  
    --- Diff: docs/user-manual/en/critical-analysis.md ---
    @@ -0,0 +1,32 @@
    +# Critical Analysis of the broker
    +
    +There are a few things that can go wrong on a production environment:
    +
    +- Bugs, for more than we try they still happen! We always try to correct them, but that's the only constant in software development.
    +- IO Errors, disks and hardware can go bad
    +- Memory issues, the CPU can go crazy by another process
    +
    +For cases like this, we added a protection to the broker to shut itself down when bad things happen.
    +
    +We measure time response in places like:
    +
    +- Queue delivery (add to the queue)
    +- Journal storage
    +- Paging operations
    +
    +If the response time goes beyond a configured timeout, the broker is considered unstable and an action will be taken to either shutdown the broker or halt the VM.
    +
    +You can use these following configuration options on broker.xml to configure how the critical analysis is performed.
    +
    +
    +Name | Description
    +:--- | :---
    +analyze-critical | Enable or disable the critical analysis (default true)
    +analyze-critical-timeout | Timeout used to do the critical analysis (default 120000 milliseconds)
    +analyze-critical-check-period | Time used to check the response times (default half of analyze-critical-timeout)
    +analyze-critical-halt | Should the VM be halted upon failures (default false)
    --- End diff --
    
    I was more saying:
    
    how does an outside observer view the state. Eg many people rely on logs or JMX 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131530496
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    +
    +   private volatile long checkTime;
    +
    +   @Override
    +   public void clear() {
    +      actions.clear();
    +      components.clear();
    +   }
    +
    +   private CopyOnWriteArrayList<CriticalAction> actions = new CopyOnWriteArrayList<>();
    +
    +   private Thread thread;
    +
    +   private final Semaphore running = new Semaphore(1);
    +
    +   private final ConcurrentHashSet<CriticalComponent> components = new ConcurrentHashSet<>();
    +
    +   @Override
    +   public void add(CriticalComponent component) {
    +      components.add(component);
    +   }
    +
    +   @Override
    +   public void remove(CriticalComponent component) {
    +      components.remove(component);
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setCheckTime(long timeout) {
    +      this.checkTime = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCheckTime() {
    +      if (checkTime == 0) {
    +         checkTime = getTimeout() / 2;
    +      }
    +      return checkTime;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setTimeout(long timeout) {
    +      this.timeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getTimeout() {
    +      if (timeout == 0) {
    +         timeout = TimeUnit.MINUTES.toMillis(2);
    +      }
    +      return timeout;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer addAction(CriticalAction action) {
    +      this.actions.add(action);
    +      return this;
    +   }
    +
    +   @Override
    +   public void check() {
    +      boolean retry = true;
    +      while (retry) {
    +         try {
    +            for (CriticalComponent component : components) {
    +
    +               int pathReturned = component.isExpired(timeout);
    +               if (pathReturned >= 0) {
    +                  fireAction(component, pathReturned);
    +                  // no need to keep running if there's already a component failed
    +                  return;
    +               }
    +            }
    +            retry = false; // got to the end of the list, no need to retry
    +         } catch (ConcurrentModificationException dontCare) {
    +            // lets retry on the loop
    +         }
    +      }
    +   }
    +
    +   private void fireAction(CriticalComponent component, int path) {
    +      for (CriticalAction action: actions) {
    +         try {
    +            action.run(component, path);
    +         } catch (Throwable e) {
    +            logger.warn(e.getMessage(), e);
    +         }
    +      }
    +   }
    +
    +   @Override
    +   public void start() {
    +
    +      if (!running.tryAcquire()) {
    +         // already running
    +         return;
    +      }
    +
    +      // we are not using any Thread Pool or any Scheduled Executors from the ArtemisServer
    +      // as that would defeat the purpose,
    +      // as in any deadlocks the schedulers may be starving for something not responding fast enough
    +      thread = new Thread("Artemis Critical Analyzer") {
    +         @Override
    +         public void run() {
    +            try {
    +               while (true) {
    +                  if (running.tryAcquire(getCheckTime(), TimeUnit.MILLISECONDS)) {
    +                     running.release();
    +                     // this means that the server has been stopped as we could acquire the semaphore... returning now
    +                     break;
    +                  }
    +                  check();
    +               }
    +            } catch (InterruptedException interrupted) {
    +               // i will just leave on that case
    +            }
    +         }
    +      };
    +
    +      thread.setDaemon(true);
    +
    +      thread.start();
    +   }
    +
    +   @Override
    +   public void stop() {
    --- End diff --
    
    having start and stop, is a little risky for someone to miss remember to put the stop in a try, finally block in code which if forgot the stop could be quite fatal. 
    
    An option (but may not be the right solution to the above problem) - is should it accept a lamda function / runnable / callable being the section to monitor. as such the try final can be in the core analyser and less likely or people to forget/miss later.
    
    The other option is that theres some static code analysis check to ensure if started there is a try finally stop within a method


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131774529
  
    --- Diff: docs/user-manual/en/configuration-index.md ---
    @@ -118,6 +118,11 @@ system-property-prefix | Prefix for replacing configuration settings using Bean
     [network-check-list](network-isolation.md) | The list of pings to be used on ping or InetAddress.isReacheable
     [network-check-ping-command](network-isolation.md) | The command used to oping IPV4 addresses
     [network-check-ping6-command](network-isolation.md) | The command used to oping IPV6 addresses
    +[critical-analyzer](critical-analysis.md) | Enable or disable the critical analysis (default true)
    --- End diff --
    
    @michaelandrepearce it's just I didn't understand.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131773637
  
    --- Diff: docs/user-manual/en/configuration-index.md ---
    @@ -118,6 +118,11 @@ system-property-prefix | Prefix for replacing configuration settings using Bean
     [network-check-list](network-isolation.md) | The list of pings to be used on ping or InetAddress.isReacheable
     [network-check-ping-command](network-isolation.md) | The command used to oping IPV4 addresses
     [network-check-ping6-command](network-isolation.md) | The command used to oping IPV6 addresses
    +[critical-analyzer](critical-analysis.md) | Enable or disable the critical analysis (default true)
    --- End diff --
    
    agreed this is just a nit, something that can be done later


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131764161
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalMeasure.java ---
    @@ -0,0 +1,52 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import org.jboss.logging.Logger;
    +
    +public class CriticalMeasure {
    +
    +   private static final Logger logger = Logger.getLogger(CriticalMeasure.class);
    +
    +   private volatile long timeEnter;
    +   private volatile long timeLeft;
    +
    +   public void enterCritical() {
    +      timeEnter = System.currentTimeMillis();
    +   }
    +
    +   public void leaveCritical() {
    +      timeLeft = System.currentTimeMillis();
    +   }
    +
    +   public boolean isExpired(long timeout) {
    +      if (timeEnter > timeLeft) {
    +         return System.currentTimeMillis() - timeEnter > timeout;
    +      }
    +
    --- End diff --
    
    nit: empty line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis issue #1443: ARTEMIS-1324 Deadlock detection and health che...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/1443
  
    What is the over head of this, latency and throughput? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131530954
  
    --- Diff: docs/user-manual/en/critical-analysis.md ---
    @@ -0,0 +1,32 @@
    +# Critical Analysis of the broker
    +
    +There are a few things that can go wrong on a production environment:
    +
    +- Bugs, for more than we try they still happen! We always try to correct them, but that's the only constant in software development.
    +- IO Errors, disks and hardware can go bad
    +- Memory issues, the CPU can go crazy by another process
    +
    +For cases like this, we added a protection to the broker to shut itself down when bad things happen.
    +
    +We measure time response in places like:
    +
    +- Queue delivery (add to the queue)
    +- Journal storage
    +- Paging operations
    +
    +If the response time goes beyond a configured timeout, the broker is considered unstable and an action will be taken to either shutdown the broker or halt the VM.
    +
    +You can use these following configuration options on broker.xml to configure how the critical analysis is performed.
    +
    +
    +Name | Description
    +:--- | :---
    +analyze-critical | Enable or disable the critical analysis (default true)
    +analyze-critical-timeout | Timeout used to do the critical analysis (default 120000 milliseconds)
    +analyze-critical-check-period | Time used to check the response times (default half of analyze-critical-timeout)
    +analyze-critical-halt | Should the VM be halted upon failures (default false)
    --- End diff --
    
    what occurs if you have this to no halt? does it log out, also is it available as a jmx endpoint so users monitoring systems using the exposed jmx endpoints, could be configured to check and alert also to their operational support teams.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131768837
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -294,6 +294,14 @@
     
        private String internalNamingPrefix = ActiveMQDefaultConfiguration.getInternalNamingPrefix();
     
    +   private boolean criticalAnalyzer = ActiveMQDefaultConfiguration.getCriticalAnalyzer();
    +
    +   private boolean criticalAnalyzerHalt = ActiveMQDefaultConfiguration.getCriticalAnalyzerHalt();
    +
    +   private long criticalAnalyzerTimeout = ActiveMQDefaultConfiguration.getCriticalAnalyzerTimeout();
    +
    +   private long criticalAnalyzerCheckPeriod = 0; // non set
    --- End diff --
    
    I will just set 0.. meaning not set. I don't want to make this an Object just for this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131765567
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -294,6 +294,14 @@
     
        private String internalNamingPrefix = ActiveMQDefaultConfiguration.getInternalNamingPrefix();
     
    +   private boolean criticalAnalyzer = ActiveMQDefaultConfiguration.getCriticalAnalyzer();
    +
    +   private boolean criticalAnalyzerHalt = ActiveMQDefaultConfiguration.getCriticalAnalyzerHalt();
    +
    +   private long criticalAnalyzerTimeout = ActiveMQDefaultConfiguration.getCriticalAnalyzerTimeout();
    +
    +   private long criticalAnalyzerCheckPeriod = 0; // non set
    --- End diff --
    
    ignore, just worked out whats going on here. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131773734
  
    --- Diff: docs/user-manual/en/critical-analysis.md ---
    @@ -0,0 +1,32 @@
    +# Critical Analysis of the broker
    +
    +There are a few things that can go wrong on a production environment:
    +
    +- Bugs, for more than we try they still happen! We always try to correct them, but that's the only constant in software development.
    +- IO Errors, disks and hardware can go bad
    +- Memory issues, the CPU can go crazy by another process
    +
    +For cases like this, we added a protection to the broker to shut itself down when bad things happen.
    +
    +We measure time response in places like:
    +
    +- Queue delivery (add to the queue)
    +- Journal storage
    +- Paging operations
    +
    --- End diff --
    
    It would need to be a different feature I think. 
    
    Monitoring has been on my radar since @franz1981 raised that PR on the CLI.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131768028
  
    --- Diff: artemis-core-client/src/main/java/org/apache/activemq/artemis/api/config/ActiveMQDefaultConfiguration.java ---
    @@ -477,6 +477,16 @@ public static String getDefaultHapolicyBackupStrategy() {
     
        public static int DEFAULT_QUORUM_SIZE = -1;
     
    +   public static final boolean DEFAULT_ANALYZE_CRITICAL = true;
    +
    +   public static final long DEFAULT_ANALYZE_CRITICAL_TIMEOUT = 120000;
    +
    +   // this will be 0, the implementation should return 1/2 of the configured critical timeout
    +   public static final long DEFAULT_ANALYZE_CHECK_PERIOD = 0;
    --- End diff --
    
    @clebertsuconic i put some other comments, maybe keep it and not have magic numbers in the other code, I've pointed out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131773379
  
    --- Diff: docs/user-manual/en/configuration-index.md ---
    @@ -118,6 +118,11 @@ system-property-prefix | Prefix for replacing configuration settings using Bean
     [network-check-list](network-isolation.md) | The list of pings to be used on ping or InetAddress.isReacheable
     [network-check-ping-command](network-isolation.md) | The command used to oping IPV4 addresses
     [network-check-ping6-command](network-isolation.md) | The command used to oping IPV6 addresses
    +[critical-analyzer](critical-analysis.md) | Enable or disable the critical analysis (default true)
    --- End diff --
    
    I'm not sure what you mean here. I'm adding some doc changes. Can you change it after merged?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131766984
  
    --- Diff: docs/user-manual/en/configuration-index.md ---
    @@ -118,6 +118,11 @@ system-property-prefix | Prefix for replacing configuration settings using Bean
     [network-check-list](network-isolation.md) | The list of pings to be used on ping or InetAddress.isReacheable
     [network-check-ping-command](network-isolation.md) | The command used to oping IPV4 addresses
     [network-check-ping6-command](network-isolation.md) | The command used to oping IPV6 addresses
    +[critical-analyzer](critical-analysis.md) | Enable or disable the critical analysis (default true)
    --- End diff --
    
    just a note if we change descriptions in critical-analysis based on review comments, probably should make these match. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131766851
  
    --- Diff: docs/user-manual/en/critical-analysis.md ---
    @@ -0,0 +1,32 @@
    +# Critical Analysis of the broker
    +
    +There are a few things that can go wrong on a production environment:
    +
    +- Bugs, for more than we try they still happen! We always try to correct them, but that's the only constant in software development.
    +- IO Errors, disks and hardware can go bad
    +- Memory issues, the CPU can go crazy by another process
    +
    +For cases like this, we added a protection to the broker to shut itself down when bad things happen.
    +
    +We measure time response in places like:
    +
    +- Queue delivery (add to the queue)
    +- Journal storage
    +- Paging operations
    +
    +If the response time goes beyond a configured timeout, the broker is considered unstable and an action will be taken to either shutdown the broker or halt the VM.
    +
    +You can use these following configuration options on broker.xml to configure how the critical analysis is performed.
    +
    +
    +Name | Description
    +:--- | :---
    +critical-analyzer | Enable or disable the critical analysis (default true)
    +critical-analyzer-timeout | Timeout used to do the critical analysis (default 120000 milliseconds)
    +critical-analyzer-check-period | Time used to check the response times (default half of critical-analyzer-timeout)
    +critical-analyzer-halt | Should the VM be halted upon failures (default false)
    --- End diff --
    
    describe behaviour if true?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131532611
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -2064,6 +2072,53 @@ public Configuration setNetworkCheckPing6Command(String command) {
           return this;
        }
     
    +   @Override
    +   public boolean isAnalyzeCritical() {
    +      return analyzeCritical;
    +   }
    +
    +   @Override
    +   public Configuration setAnalyzeCritical(boolean analyzeCritical) {
    +      this.analyzeCritical = analyzeCritical;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getAnalyzeCriticalTimeout() {
    +      return analyzeCriticalTimeout;
    +   }
    +
    +   @Override
    +   public Configuration setAnalyzeCriticalTimeout(long timeout) {
    +      this.analyzeCriticalTimeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getAnalyzeCriticalCheckPeriod() {
    +      if (analyzeCriticalCheckPeriod == ActiveMQDefaultConfiguration.getAnalyzeCriticalCheckPeriod()) {
    --- End diff --
    
    if 0, then the we would calculate the check-period.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131765286
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/config/impl/ConfigurationImpl.java ---
    @@ -294,6 +294,14 @@
     
        private String internalNamingPrefix = ActiveMQDefaultConfiguration.getInternalNamingPrefix();
     
    +   private boolean criticalAnalyzer = ActiveMQDefaultConfiguration.getCriticalAnalyzer();
    +
    +   private boolean criticalAnalyzerHalt = ActiveMQDefaultConfiguration.getCriticalAnalyzerHalt();
    +
    +   private long criticalAnalyzerTimeout = ActiveMQDefaultConfiguration.getCriticalAnalyzerTimeout();
    +
    +   private long criticalAnalyzerCheckPeriod = 0; // non set
    --- End diff --
    
    @michaelandrepearce nope.. because the value will only be determined during runtime (when getCriticalAnalyzerCheckPeriod is called.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131774705
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    +
    +   private volatile long checkTime;
    +
    +   @Override
    +   public void clear() {
    +      actions.clear();
    +      components.clear();
    +   }
    +
    +   private CopyOnWriteArrayList<CriticalAction> actions = new CopyOnWriteArrayList<>();
    +
    +   private Thread thread;
    +
    +   private final Semaphore running = new Semaphore(1);
    +
    +   private final ConcurrentHashSet<CriticalComponent> components = new ConcurrentHashSet<>();
    +
    +   @Override
    +   public void add(CriticalComponent component) {
    +      components.add(component);
    +   }
    +
    +   @Override
    +   public void remove(CriticalComponent component) {
    +      components.remove(component);
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setCheckTime(long timeout) {
    +      this.checkTime = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCheckTime() {
    +      if (checkTime == 0) {
    +         checkTime = getTimeout() / 2;
    +      }
    +      return checkTime;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setTimeout(long timeout) {
    +      this.timeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getTimeout() {
    +      if (timeout == 0) {
    +         timeout = TimeUnit.MINUTES.toMillis(2);
    +      }
    +      return timeout;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer addAction(CriticalAction action) {
    +      this.actions.add(action);
    +      return this;
    +   }
    +
    +   @Override
    +   public void check() {
    +      boolean retry = true;
    +      while (retry) {
    +         try {
    +            for (CriticalComponent component : components) {
    +
    +               int pathReturned = component.isExpired(timeout);
    +               if (pathReturned >= 0) {
    +                  fireAction(component, pathReturned);
    +                  // no need to keep running if there's already a component failed
    +                  return;
    +               }
    +            }
    +            retry = false; // got to the end of the list, no need to retry
    +         } catch (ConcurrentModificationException dontCare) {
    +            // lets retry on the loop
    +         }
    +      }
    +   }
    +
    +   private void fireAction(CriticalComponent component, int path) {
    +      for (CriticalAction action: actions) {
    +         try {
    +            action.run(component, path);
    +         } catch (Throwable e) {
    +            logger.warn(e.getMessage(), e);
    +         }
    +      }
    +   }
    +
    +   @Override
    +   public void start() {
    +
    +      if (!running.tryAcquire()) {
    +         // already running
    +         return;
    +      }
    +
    +      // we are not using any Thread Pool or any Scheduled Executors from the ArtemisServer
    +      // as that would defeat the purpose,
    +      // as in any deadlocks the schedulers may be starving for something not responding fast enough
    +      thread = new Thread("Artemis Critical Analyzer") {
    +         @Override
    +         public void run() {
    +            try {
    +               while (true) {
    +                  if (running.tryAcquire(getCheckTime(), TimeUnit.MILLISECONDS)) {
    +                     running.release();
    +                     // this means that the server has been stopped as we could acquire the semaphore... returning now
    +                     break;
    +                  }
    +                  check();
    +               }
    +            } catch (InterruptedException interrupted) {
    +               // i will just leave on that case
    +            }
    +         }
    +      };
    +
    +      thread.setDaemon(true);
    +
    +      thread.start();
    +   }
    +
    +   @Override
    +   public void stop() {
    --- End diff --
    
    @michaelandrepearce I thought about using managementServier.sendNotification();
    
    but that is using PostOfficeImpl and tons of locks.. it's exposed through JMX I think.. so I didn't want to use the same thingy.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131774052
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    +
    +   private volatile long checkTime;
    +
    +   @Override
    +   public void clear() {
    +      actions.clear();
    +      components.clear();
    +   }
    +
    +   private CopyOnWriteArrayList<CriticalAction> actions = new CopyOnWriteArrayList<>();
    +
    +   private Thread thread;
    +
    +   private final Semaphore running = new Semaphore(1);
    +
    +   private final ConcurrentHashSet<CriticalComponent> components = new ConcurrentHashSet<>();
    +
    +   @Override
    +   public void add(CriticalComponent component) {
    +      components.add(component);
    +   }
    +
    +   @Override
    +   public void remove(CriticalComponent component) {
    +      components.remove(component);
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setCheckTime(long timeout) {
    +      this.checkTime = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCheckTime() {
    +      if (checkTime == 0) {
    +         checkTime = getTimeout() / 2;
    +      }
    +      return checkTime;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setTimeout(long timeout) {
    +      this.timeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getTimeout() {
    +      if (timeout == 0) {
    +         timeout = TimeUnit.MINUTES.toMillis(2);
    +      }
    +      return timeout;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer addAction(CriticalAction action) {
    +      this.actions.add(action);
    +      return this;
    +   }
    +
    +   @Override
    +   public void check() {
    +      boolean retry = true;
    +      while (retry) {
    +         try {
    +            for (CriticalComponent component : components) {
    +
    +               int pathReturned = component.isExpired(timeout);
    +               if (pathReturned >= 0) {
    +                  fireAction(component, pathReturned);
    +                  // no need to keep running if there's already a component failed
    +                  return;
    +               }
    +            }
    +            retry = false; // got to the end of the list, no need to retry
    +         } catch (ConcurrentModificationException dontCare) {
    +            // lets retry on the loop
    +         }
    +      }
    +   }
    +
    +   private void fireAction(CriticalComponent component, int path) {
    +      for (CriticalAction action: actions) {
    +         try {
    +            action.run(component, path);
    +         } catch (Throwable e) {
    +            logger.warn(e.getMessage(), e);
    +         }
    +      }
    +   }
    +
    +   @Override
    +   public void start() {
    +
    +      if (!running.tryAcquire()) {
    +         // already running
    +         return;
    +      }
    +
    +      // we are not using any Thread Pool or any Scheduled Executors from the ArtemisServer
    +      // as that would defeat the purpose,
    +      // as in any deadlocks the schedulers may be starving for something not responding fast enough
    +      thread = new Thread("Artemis Critical Analyzer") {
    +         @Override
    +         public void run() {
    +            try {
    +               while (true) {
    +                  if (running.tryAcquire(getCheckTime(), TimeUnit.MILLISECONDS)) {
    +                     running.release();
    +                     // this means that the server has been stopped as we could acquire the semaphore... returning now
    +                     break;
    +                  }
    +                  check();
    +               }
    +            } catch (InterruptedException interrupted) {
    +               // i will just leave on that case
    +            }
    +         }
    +      };
    +
    +      thread.setDaemon(true);
    +
    +      thread.start();
    +   }
    +
    +   @Override
    +   public void stop() {
    --- End diff --
    
    im lost, what do you mean here? i see no locks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131530890
  
    --- Diff: artemis-server/src/test/resources/ConfigurationTest-full-config.xml ---
    @@ -57,6 +57,10 @@
           <global-max-size>1234567</global-max-size>
           <max-disk-usage>37</max-disk-usage>
           <disk-scan-period>123</disk-scan-period>
    +      <analyze-critical-halt>true</analyze-critical-halt>
    --- End diff --
    
    between documentation you name this Critical Analysis, then in config etc you switch round the words to analyse critical. suggest keeping naming constant. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131530849
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/ActiveMQServerImpl.java ---
    @@ -477,12 +484,44 @@ public final synchronized void start() throws Exception {
           }
        }
     
    +   @Override
    +   public CriticalAnalyzer getCriticalAnalyzer() {
    +      return this.analyzer;
    +   }
    +
        private void internalStart() throws Exception {
           if (state != SERVER_STATE.STOPPED) {
              logger.debug("Server already started!");
              return;
           }
     
    +      if (!configuration.isAnalyzeCritical()) {
    --- End diff --
    
    why not do this within object construction? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131534288
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    +
    +   private volatile long checkTime;
    +
    +   @Override
    +   public void clear() {
    +      actions.clear();
    +      components.clear();
    +   }
    +
    +   private CopyOnWriteArrayList<CriticalAction> actions = new CopyOnWriteArrayList<>();
    +
    +   private Thread thread;
    +
    +   private final Semaphore running = new Semaphore(1);
    +
    +   private final ConcurrentHashSet<CriticalComponent> components = new ConcurrentHashSet<>();
    +
    +   @Override
    +   public void add(CriticalComponent component) {
    +      components.add(component);
    +   }
    +
    +   @Override
    +   public void remove(CriticalComponent component) {
    +      components.remove(component);
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setCheckTime(long timeout) {
    +      this.checkTime = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCheckTime() {
    +      if (checkTime == 0) {
    +         checkTime = getTimeout() / 2;
    +      }
    +      return checkTime;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setTimeout(long timeout) {
    +      this.timeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getTimeout() {
    +      if (timeout == 0) {
    +         timeout = TimeUnit.MINUTES.toMillis(2);
    +      }
    +      return timeout;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer addAction(CriticalAction action) {
    +      this.actions.add(action);
    +      return this;
    +   }
    +
    +   @Override
    +   public void check() {
    +      boolean retry = true;
    +      while (retry) {
    +         try {
    +            for (CriticalComponent component : components) {
    +
    +               int pathReturned = component.isExpired(timeout);
    +               if (pathReturned >= 0) {
    +                  fireAction(component, pathReturned);
    +                  // no need to keep running if there's already a component failed
    +                  return;
    +               }
    +            }
    +            retry = false; // got to the end of the list, no need to retry
    +         } catch (ConcurrentModificationException dontCare) {
    +            // lets retry on the loop
    +         }
    +      }
    +   }
    +
    +   private void fireAction(CriticalComponent component, int path) {
    +      for (CriticalAction action: actions) {
    +         try {
    +            action.run(component, path);
    +         } catch (Throwable e) {
    +            logger.warn(e.getMessage(), e);
    +         }
    +      }
    +   }
    +
    +   @Override
    +   public void start() {
    +
    +      if (!running.tryAcquire()) {
    +         // already running
    +         return;
    +      }
    +
    +      // we are not using any Thread Pool or any Scheduled Executors from the ArtemisServer
    +      // as that would defeat the purpose,
    +      // as in any deadlocks the schedulers may be starving for something not responding fast enough
    +      thread = new Thread("Artemis Critical Analyzer") {
    +         @Override
    +         public void run() {
    +            try {
    +               while (true) {
    +                  if (running.tryAcquire(getCheckTime(), TimeUnit.MILLISECONDS)) {
    +                     running.release();
    +                     // this means that the server has been stopped as we could acquire the semaphore... returning now
    +                     break;
    +                  }
    +                  check();
    +               }
    +            } catch (InterruptedException interrupted) {
    +               // i will just leave on that case
    +            }
    +         }
    +      };
    +
    +      thread.setDaemon(true);
    +
    +      thread.start();
    +   }
    +
    +   @Override
    +   public void stop() {
    --- End diff --
    
    From what I understand start , begins some timer, and stop ends it. If the timer > x action y will occur.
    
    As such if someone in code accidentally forgets to call stop or forgets to put it in a finally block so that when exceptions occur it also stops the timer. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131552744
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    +
    +   private volatile long checkTime;
    +
    +   @Override
    +   public void clear() {
    +      actions.clear();
    +      components.clear();
    +   }
    +
    +   private CopyOnWriteArrayList<CriticalAction> actions = new CopyOnWriteArrayList<>();
    +
    +   private Thread thread;
    +
    +   private final Semaphore running = new Semaphore(1);
    +
    +   private final ConcurrentHashSet<CriticalComponent> components = new ConcurrentHashSet<>();
    +
    +   @Override
    +   public void add(CriticalComponent component) {
    +      components.add(component);
    +   }
    +
    +   @Override
    +   public void remove(CriticalComponent component) {
    +      components.remove(component);
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setCheckTime(long timeout) {
    +      this.checkTime = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCheckTime() {
    +      if (checkTime == 0) {
    +         checkTime = getTimeout() / 2;
    +      }
    +      return checkTime;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setTimeout(long timeout) {
    +      this.timeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getTimeout() {
    +      if (timeout == 0) {
    +         timeout = TimeUnit.MINUTES.toMillis(2);
    +      }
    +      return timeout;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer addAction(CriticalAction action) {
    +      this.actions.add(action);
    +      return this;
    +   }
    +
    +   @Override
    +   public void check() {
    +      boolean retry = true;
    +      while (retry) {
    +         try {
    +            for (CriticalComponent component : components) {
    +
    +               int pathReturned = component.isExpired(timeout);
    +               if (pathReturned >= 0) {
    +                  fireAction(component, pathReturned);
    +                  // no need to keep running if there's already a component failed
    +                  return;
    +               }
    +            }
    +            retry = false; // got to the end of the list, no need to retry
    +         } catch (ConcurrentModificationException dontCare) {
    +            // lets retry on the loop
    +         }
    +      }
    +   }
    +
    +   private void fireAction(CriticalComponent component, int path) {
    +      for (CriticalAction action: actions) {
    +         try {
    +            action.run(component, path);
    +         } catch (Throwable e) {
    +            logger.warn(e.getMessage(), e);
    +         }
    +      }
    +   }
    +
    +   @Override
    +   public void start() {
    +
    +      if (!running.tryAcquire()) {
    +         // already running
    +         return;
    +      }
    +
    +      // we are not using any Thread Pool or any Scheduled Executors from the ArtemisServer
    +      // as that would defeat the purpose,
    +      // as in any deadlocks the schedulers may be starving for something not responding fast enough
    +      thread = new Thread("Artemis Critical Analyzer") {
    +         @Override
    +         public void run() {
    +            try {
    +               while (true) {
    +                  if (running.tryAcquire(getCheckTime(), TimeUnit.MILLISECONDS)) {
    +                     running.release();
    +                     // this means that the server has been stopped as we could acquire the semaphore... returning now
    +                     break;
    +                  }
    +                  check();
    +               }
    +            } catch (InterruptedException interrupted) {
    +               // i will just leave on that case
    +            }
    +         }
    +      };
    +
    +      thread.setDaemon(true);
    +
    +      thread.start();
    +   }
    +
    +   @Override
    +   public void stop() {
    --- End diff --
    
    Im a bit confused now then if I'm honest. I've obviously overlooked or miss-understood what the intent of this is.
    
    How is it detecting hot path being accidental dead locked like we've seen before due to code bug? To detect that i always assumed the code / check would need to surround the component not rely on the component to report, as if it was in deadlock how would it report it is in such a state?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] activemq-artemis pull request #1443: ARTEMIS-1324 Deadlock detection and hea...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/1443#discussion_r131532629
  
    --- Diff: artemis-commons/src/main/java/org/apache/activemq/artemis/utils/critical/CriticalAnalyzerImpl.java ---
    @@ -0,0 +1,182 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.activemq.artemis.utils.critical;
    +
    +import java.util.ConcurrentModificationException;
    +import java.util.concurrent.CopyOnWriteArrayList;
    +import java.util.concurrent.Semaphore;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;
    +import org.jboss.logging.Logger;
    +
    +public class CriticalAnalyzerImpl implements CriticalAnalyzer {
    +
    +   private final Logger logger = Logger.getLogger(CriticalAnalyzer.class);
    +
    +   private volatile long timeout;
    +
    +   private volatile long checkTime;
    +
    +   @Override
    +   public void clear() {
    +      actions.clear();
    +      components.clear();
    +   }
    +
    +   private CopyOnWriteArrayList<CriticalAction> actions = new CopyOnWriteArrayList<>();
    +
    +   private Thread thread;
    +
    +   private final Semaphore running = new Semaphore(1);
    +
    +   private final ConcurrentHashSet<CriticalComponent> components = new ConcurrentHashSet<>();
    +
    +   @Override
    +   public void add(CriticalComponent component) {
    +      components.add(component);
    +   }
    +
    +   @Override
    +   public void remove(CriticalComponent component) {
    +      components.remove(component);
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setCheckTime(long timeout) {
    +      this.checkTime = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getCheckTime() {
    +      if (checkTime == 0) {
    +         checkTime = getTimeout() / 2;
    +      }
    +      return checkTime;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer setTimeout(long timeout) {
    +      this.timeout = timeout;
    +      return this;
    +   }
    +
    +   @Override
    +   public long getTimeout() {
    +      if (timeout == 0) {
    +         timeout = TimeUnit.MINUTES.toMillis(2);
    +      }
    +      return timeout;
    +   }
    +
    +   @Override
    +   public CriticalAnalyzer addAction(CriticalAction action) {
    +      this.actions.add(action);
    +      return this;
    +   }
    +
    +   @Override
    +   public void check() {
    +      boolean retry = true;
    +      while (retry) {
    +         try {
    +            for (CriticalComponent component : components) {
    +
    +               int pathReturned = component.isExpired(timeout);
    +               if (pathReturned >= 0) {
    +                  fireAction(component, pathReturned);
    +                  // no need to keep running if there's already a component failed
    +                  return;
    +               }
    +            }
    +            retry = false; // got to the end of the list, no need to retry
    +         } catch (ConcurrentModificationException dontCare) {
    +            // lets retry on the loop
    +         }
    +      }
    +   }
    +
    +   private void fireAction(CriticalComponent component, int path) {
    +      for (CriticalAction action: actions) {
    +         try {
    +            action.run(component, path);
    +         } catch (Throwable e) {
    +            logger.warn(e.getMessage(), e);
    +         }
    +      }
    +   }
    +
    +   @Override
    +   public void start() {
    +
    +      if (!running.tryAcquire()) {
    +         // already running
    +         return;
    +      }
    +
    +      // we are not using any Thread Pool or any Scheduled Executors from the ArtemisServer
    +      // as that would defeat the purpose,
    +      // as in any deadlocks the schedulers may be starving for something not responding fast enough
    +      thread = new Thread("Artemis Critical Analyzer") {
    +         @Override
    +         public void run() {
    +            try {
    +               while (true) {
    +                  if (running.tryAcquire(getCheckTime(), TimeUnit.MILLISECONDS)) {
    +                     running.release();
    +                     // this means that the server has been stopped as we could acquire the semaphore... returning now
    +                     break;
    +                  }
    +                  check();
    +               }
    +            } catch (InterruptedException interrupted) {
    +               // i will just leave on that case
    +            }
    +         }
    +      };
    +
    +      thread.setDaemon(true);
    +
    +      thread.start();
    +   }
    +
    +   @Override
    +   public void stop() {
    --- End diff --
    
    this stop here is like we have in many other components.
    
    
    start means, start the thread.
    stop means shutdown the thread.
    
    I'm not sure what you mean?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---