Posted to gitbox@activemq.apache.org by GitBox <gi...@apache.org> on 2021/05/18 09:31:09 UTC
[GitHub] [activemq-artemis] franz1981 edited a comment on pull request #3584: ARTEMIS-3303 Default thread pool size is too generous

franz1981 edited a comment on pull request #3584:
URL: https://github.com/apache/activemq-artemis/pull/3584#issuecomment-843013239


   All fair points, and indeed I believe this should be a cautious and more conservative change. Still, there are historical motivations and experimental facts showing that what we set by default is no longer valid/useful, and that it was not optimized for context switches, nor for throughput or latency trade-offs.
   
   1. Re the Netty event loop sizing
   
   - historical facts: HornetQ and earlier versions of Artemis were blocking Netty threads, but that's no longer true. We can even choose to use `BlockHound` to enforce/check it on our CI, see https://github.com/netty/netty/pull/9687
   - experimental facts: generating a uniformly distributed load with clients >= cores using Core clients showed that the default configuration of the Netty thread pool (3X the number of cores) prevents scaling and hurts both throughput and latency. See https://github.com/apache/activemq-artemis/pull/3572#issuecomment-841788187 for some more details about it.
   
   The explanation for the experimental facts seems related to how a Netty event loop group works:
   - Netty assigns client connections in round-robin fashion to the configured Netty threads
   - each client connection can issue write/read events on the event loop's (single) selector, waking it up for any work to do
   - if the number of Netty threads exceeds the number of cores and the number of clients is <= Netty threads, each time such a notification happens there is some chance (2 out of 3 with the 3X default) that the thread that's going to handle it won't be on CPU (because the threads outnumber the cores), so the OS is forced to deschedule some (random) thread in order to run the Netty thread responsible for handling the interrupt, causing unnecessary context switches.
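   
   To make the 2/3 estimate above concrete, here is a minimal back-of-the-envelope sketch. `EventLoopOversubscription` and its methods are hypothetical helpers, not part of Netty or Artemis, and the model assumes every event loop thread is equally likely to be the one that is off CPU when its event arrives.
   
   ```java
   import java.util.Arrays;
   
   // Hypothetical helper for illustration only; not a Netty or Artemis API.
   final class EventLoopOversubscription {
   
      // Round-robin assignment: connection i is served by event loop (i % eventLoops).
      static int[] connectionsPerEventLoop(int connections, int eventLoops) {
         int[] counts = new int[eventLoops];
         for (int i = 0; i < connections; i++) {
            counts[i % eventLoops]++;
         }
         return counts;
      }
   
      // Fraction of event loop threads that cannot be on a core at the same instant.
      // Assumption: all event loop threads compete equally for the available cores.
      static double offCpuFraction(int cores, int eventLoops) {
         if (eventLoops <= cores) {
            return 0.0;
         }
         return 1.0 - (double) cores / eventLoops;
      }
   
      public static void main(String[] args) {
         int cores = 12;
         int eventLoops = 3 * cores; // the 3X default discussed above
         System.out.println(Arrays.toString(connectionsPerEventLoop(30, eventLoops)));
         System.out.println(offCpuFraction(cores, eventLoops));
      }
   }
   ```
   
   With 30 clients and 36 event loops each connection lands on its own thread, so nearly every thread is a wake-up candidate while only a third of them can be running at once.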
   
   The Netty default is 2X the number of cores, for applications that rely heavily on event loop processing alone, but Artemis is not one of them: even AMQP uses I/O threads, and the broker needs GC threads, compiler threads and sometimes global threads to do its job. Using 3X is just a waste of resources for the current Artemis version.
   
   2. Re the global thread pool sizing
   
   That's a bit more complex and depends on how `ActiveMQThreadPoolExecutor` works.
   Writing a simple program helps to spot the problem with it (very similar to the Netty one, but not the same).
   ```java
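      // Imports needed by the snippet (the Artemis classes come from artemis-commons)
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.ConcurrentMap;
      import java.util.concurrent.ThreadFactory;
      import java.util.concurrent.ThreadPoolExecutor;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.atomic.AtomicLong;
      import org.apache.activemq.artemis.utils.ActiveMQThreadPoolExecutor;
      import org.apache.activemq.artemis.utils.ExecutorFactory;
      import org.apache.activemq.artemis.utils.actors.ArtemisExecutor;
      import org.apache.activemq.artemis.utils.actors.OrderedExecutorFactory;
      import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;

      public static void main(String[] args) throws InterruptedException {
         // Global pool sized like the current default (max 30 threads); log every thread creation
         ThreadPoolExecutor executor = new ActiveMQThreadPoolExecutor(0, 30, 60L, TimeUnit.SECONDS, new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
               Thread t = new Thread(r);
               System.err.println("created new thread: " + t);
               return t;
            }
         });
         ExecutorFactory factory = new OrderedExecutorFactory(executor);
         final int clients = 30;
         int bursts = 100;
         // One ordered executor per simulated client, plus the set of pool threads that served it
         ConcurrentHashSet[] executingThreads = new ConcurrentHashSet[clients];
         ArtemisExecutor[] artemisExecutor = new ArtemisExecutor[clients];
         for (int i = 0; i < clients; i++) {
            artemisExecutor[i] = factory.getExecutor();
            executingThreads[i] = new ConcurrentHashSet();
         }
         // Number of tasks executed by each pool thread
         ConcurrentMap<Thread, AtomicLong> executingT = new ConcurrentHashMap<>();
         for (int j = 0; j < bursts; j++) {
            // Each burst submits one ~1 ms task per client
            for (int i = 0; i < clients; i++) {
               ConcurrentHashSet threadsSeen = executingThreads[i];
               artemisExecutor[i].execute(() -> {
                  try {
                     TimeUnit.MILLISECONDS.sleep(1);
                  } catch (InterruptedException e) {
                     e.printStackTrace();
                  }
                  threadsSeen.add(Thread.currentThread());
                  // Only the current thread updates its own counter, so get/put is safe here
                  AtomicLong counter = executingT.get(Thread.currentThread());
                  if (counter == null) {
                     executingT.put(Thread.currentThread(), new AtomicLong(1));
                  } else {
                     counter.lazySet(counter.get() + 1);
                  }
               });
            }
            // Simulated pause between bursts
            System.out.println("GC pause");
            Thread.sleep(100);
         }
         // Wait for every ordered executor to drain before shutting the pool down
         for (int i = 0; i < clients; i++) {
            artemisExecutor[i].flush(60, TimeUnit.SECONDS);
         }
         executor.shutdown();
         executor.awaitTermination(70, TimeUnit.SECONDS);
         System.out.println("Executing threads: " + executingT);
         System.out.println("Workload distribution per artemis executor:");
         for (int i = 0; i < clients; i++) {
            System.out.println("[" + (i + 1) + "] - " + executingThreads[i].size());
         }
      }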
   ```
   On my machine (12 logical cores with HT, 6 physical cores) it prints 30 times
   ```created new thread: ...```
   and then
   ```
   Executing threads: 
   {Thread[Thread-1,5,]=103, 
   Thread[Thread-20,5,]=99, 
   Thread[Thread-17,5,]=99, 
   Thread[Thread-11,5,]=101, 
   Thread[Thread-18,5,]=99, 
   Thread[Thread-14,5,]=100, 
   Thread[Thread-13,5,]=100, 
   Thread[Thread-21,5,]=99, 
   Thread[Thread-24,5,]=98, 
   Thread[Thread-28,5,]=98, 
   Thread[Thread-5,5,]=103, 
   Thread[Thread-30,5,]=97, 
   Thread[Thread-27,5,]=97, 
   Thread[Thread-6,5,]=103, 
   Thread[Thread-4,5,]=102, 
   Thread[Thread-23,5,]=98, 
   Thread[Thread-25,5,]=98, 
   Thread[Thread-8,5,]=102, 
   Thread[Thread-7,5,]=102,
   Thread[Thread-3,5,]=103,
   Thread[Thread-9,5,]=101, 
   Thread[Thread-10,5,]=102, 
   Thread[Thread-19,5,]=99, 
   Thread[Thread-12,5,]=101, 
   Thread[Thread-15,5,]=100, 
   Thread[Thread-26,5,]=97, 
   Thread[Thread-29,5,]=97, 
   Thread[Thread-2,5,]=103,
   Thread[Thread-16,5,]=100, 
   Thread[Thread-22,5,]=99}
   Workload distribution per artemis executor:
   [1] - 17
   [2] - 18
   [3] - 17
   [4] - 15
   [5] - 13
   [6] - 13
   [7] - 17
   [8] - 17
   [9] - 18
   [10] - 14
   [11] - 13
   [12] - 17
   [13] - 15
   [14] - 14
   [15] - 17
   [16] - 18
   [17] - 16
   [18] - 12
   [19] - 17
   [20] - 16
   [21] - 16
   [22] - 14
   [23] - 17
   [24] - 17
   [25] - 17
   [26] - 21
   [27] - 20
   [28] - 19
   [29] - 22
   [30] - 18
   ```
   This gives some important information on how this thread pool works.
   With bursts of short tasks (but not that short, ~1 ms each), issued by several Core clients (30 for this test) and separated by pauses (100 ms, to simulate a GC pause; note that the G1 default pause target is actually 200 ms):
   
   - the load is spread among all the threads, i.e. each thread gets ~100 tasks
   - each executor (client) gets its tasks executed by many different threads (12 to 22 out of the 30 available)
   - the number of created threads depends on how busy the existing ones are
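   
   The last point can be reproduced without Artemis on the classpath. The sketch below is only an approximation under stated assumptions: a plain JDK `ThreadPoolExecutor` with a `SynchronousQueue` stands in for `ActiveMQThreadPoolExecutor` (the two are not the same implementation), and `BusyDrivenPoolGrowth` is a hypothetical name. With this setup a new thread is created whenever a task arrives and no existing thread is idle, up to the max pool size.
   
   ```java
   import java.util.concurrent.SynchronousQueue;
   import java.util.concurrent.ThreadPoolExecutor;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicInteger;
   
   // Hypothetical JDK-only stand-in; it does not use ActiveMQThreadPoolExecutor.
   final class BusyDrivenPoolGrowth {
   
      // Submits bursts of ~1 ms tasks and returns {threads created, tasks executed}.
      static int[] run(int maxThreads, int bursts, int tasksPerBurst, long pauseMillis) {
         AtomicInteger created = new AtomicInteger();
         AtomicInteger executed = new AtomicInteger();
         ThreadPoolExecutor pool = new ThreadPoolExecutor(0, maxThreads, 60L, TimeUnit.SECONDS,
               new SynchronousQueue<>(),                    // hand-off only: no idle thread means a new thread
               r -> {
                  created.incrementAndGet();                // count every thread the pool asks for
                  return new Thread(r);
               },
               new ThreadPoolExecutor.CallerRunsPolicy());  // safety net if all threads are busy
         try {
            for (int b = 0; b < bursts; b++) {
               for (int t = 0; t < tasksPerBurst; t++) {
                  pool.execute(() -> {
                     try {
                        TimeUnit.MILLISECONDS.sleep(1);
                     } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                     }
                     executed.incrementAndGet();
                  });
               }
               Thread.sleep(pauseMillis);                   // pause between bursts
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
         } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
         }
         return new int[] {created.get(), executed.get()};
      }
   
      public static void main(String[] args) {
         int[] result = run(30, 100, 30, 100);
         System.out.println("threads created: " + result[0] + ", tasks executed: " + result[1]);
      }
   }
   ```
   
   The number of threads created varies from run to run, since it depends on how many earlier tasks are still sleeping when the next one is submitted, but it never exceeds the max pool size.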
   
   In short, if the global thread pool is going to perform mostly non-blocking operations (NOTE: the I/O executor is responsible for blocking I/O operations), then with enough clients (clients > available cores) we're going to use all the threads configured on the pool.
   But if the max pool size exceeds the available cores we will end up, similarly to the Netty case, descheduling some threads at random just to wake up the next one in charge of handling a specific task.
   
   There are a few assumptions to be verified (what if an `ArtemisExecutor` keeps a specific thread busy for too long? Can global thread pool tasks really never block? etc.) and more tests to be performed, but this shouldn't stop us from searching for a better adaptive (based on the machine spec) default IMO.
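   
   As a rough illustration of what a machine-based default could look like, here is a minimal sketch. `AdaptivePoolDefaults`, `boundedByCores` and the factors used in `main` are hypothetical and only illustrative; they are not the values proposed in this PR nor an existing Artemis API.
   
   ```java
   // Hypothetical helper for illustration only; not an Artemis API.
   final class AdaptivePoolDefaults {
   
      // Caps a requested pool size at (cores * factor), never returning less than 1.
      static int boundedByCores(int requested, int cores, int factor) {
         int cap = Math.max(1, cores * factor);
         return Math.max(1, Math.min(requested, cap));
      }
   
      public static void main(String[] args) {
         int cores = Runtime.getRuntime().availableProcessors();
         // Illustrative factors: 1X cores for both pools, compared against the
         // current 3X (Netty) and 30-thread (global pool) defaults discussed above.
         System.out.println("netty threads: " + boundedByCores(3 * cores, cores, 1));
         System.out.println("global pool max size: " + boundedByCores(30, cores, 1));
      }
   }
   ```
   
   The cap only lowers the size on small machines: on a host with more cores than the requested size, the requested value is kept.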
   
   
   
   

