Posted to dev@activemq.apache.org by qihongxu <gi...@git.apache.org> on 2018/12/29 08:15:13 UTC

[GitHub] activemq-artemis pull request #2484: Use a specific executor for pageSyncTim...

GitHub user qihongxu opened a pull request:

    https://github.com/apache/activemq-artemis/pull/2484

    Use a specific executor for pageSyncTimer

    Improve paging throughput by using a specific executor for pageSyncTimer
    
    Improving throughput on paging mode is one of our concerns since our cluster uses paging a lot.
    We found that pageSyncTimer in PagingStoreImpl shared the same executor with pageCursorProvider, drawn from the broker's thread pool. In heavy-load scenarios, such as hundreds of consumers receiving messages simultaneously, it became difficult for pageSyncTimer to get hold of the executor due to contention. As a result page sync was delayed and producers suffered low throughput.
    
    To achieve higher throughput we assigned a specific executor to pageSyncTimer to avoid this contention, and we ran a small-scale test on a single modified broker.
    
    Broker: 4C/8G/500G SSD
    Producer: 200 threads, non-transactional send
    Consumer: 200 threads, transactional receive
    Message text size: 100-200 bytes randomly
    AddressFullPolicy: PAGE
    
    Test result:
    
    Version | Only Send TPS | Only Receive TPS | Send&Receive TPS
    -- | -- | -- | --
    Original ver | 38k | 33k | 3k/30k
    Modified ver | 38k | 34k | 30k/12.5k
    
    
    The table above shows that on the modified broker send TPS improves from “poor” to “extremely fast”, while receive TPS drops from “extremely fast” to “not bad” under heavy load. Considering consumer systems usually have a long processing chain after receiving messages, we do not need extremely fast receive TPS; instead, we want to guarantee send TPS to cope with traffic peaks and lower the producers’ delay. Moreover, send and receive TPS in total rises from 33k to about 43k. Given all of the above, this trade-off seems beneficial and acceptable.
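
    For illustration of the executor change described above, here is a minimal sketch of the idea (not the actual patch; all names are hypothetical): give page-sync work its own executor instead of letting it compete with cursor/cleanup work on a shared pool.

    ```java
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical sketch: page sync gets a dedicated executor so its tasks are
    // never queued behind long-running depage/cleanup tasks from the shared pool.
    public class PageSyncExecutorSketch {

       // Stands in for the broker's shared thread pool used by cursor/cleanup work.
       private final ExecutorService sharedPool = Executors.newFixedThreadPool(8);

       // Dedicated single-threaded executor reserved for page sync only.
       private final ExecutorService pageSyncExecutor = Executors.newSingleThreadExecutor();

       public void scheduleCursorWork(Runnable depageTask) {
          sharedPool.execute(depageTask);     // competes with all other pooled work
       }

       public void schedulePageSync(Runnable syncTask) {
          pageSyncExecutor.execute(syncTask); // cannot be starved by cursor work
       }
    }
    ```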


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/qihongxu/activemq-artemis pageSyncTimer_executor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/activemq-artemis/pull/2484.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2484
    
----
commit 01a09f2f2bee98643df06a4eb93588047fea6527
Author: Qihong Xu <qi...@...>
Date:   2018-12-28T11:59:41Z

    Use a specific executor for pageSyncTimer

----


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > If we return `true` from the dirty read we can just return it, while if we find it `false` we could attempt to enter the read lock and validate that it's not paging for real.
    
    I've literally gone through every case. What occurs is we call isPaging within an if statement and then do some logic after it, so any action we take within these if statements will be based off a stale state anyway.
    
    I'm starting to think we should just make isPaging not use a read lock (aka make it dirty), as it's only used in QueueImpl as mentioned and for QueueControl (aka the admin GUI).


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce 
    > I'm starting to feel like Alice here and we're going to end up going down a rabbit hole ;) and will end up with the original isPaging just being dirty.
    
    +1 same exact feeling 
    Quoting myself twice for @wy96f:
    > LivePageCacheImpl (in violet) is now a major contention point
    
    and 
    
    > Compaction is stealing a lot of CPU and I/O
    
    Just a thought: we can choose where to be less stale and where not.
    If we return `true` from the dirty read we can just return it, while if we find it `false` we could attempt to enter the read lock and validate that it's not paging for real.
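
    A hedged sketch of that idea (a standalone illustrative class, not the PR code): a stale `true` is returned as-is, while a `false` is re-validated under the read lock.

    ```java
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Sketch only: dirty read for the fast path, read lock only to confirm "not paging".
    class PagingState {
       private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
       private volatile boolean paging;

       boolean isPagingDirtyRead() {
          return paging;                    // unsynchronized, possibly stale
       }

       boolean isPagingLessStale() {
          if (isPagingDirtyRead()) {
             return true;                   // trust the positive answer
          }
          lock.readLock().lock();           // pay for the lock only on the "false" path
          try {
             return paging;                 // validated under the read lock
          } finally {
             lock.readLock().unlock();
          }
       }

       void setPaging(boolean value) {
          lock.writeLock().lock();          // writers still coordinate via the write lock
          try {
             paging = value;
          } finally {
             lock.writeLock().unlock();
          }
       }
    }
    ```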
    



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by qihongxu <gi...@git.apache.org>.
Github user qihongxu commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > Instead of transaction consumer you could use client acknowledge or even individual acknowledge.
    
    It seems that both client-acknowledge and individual-acknowledge modes end up using a Transaction on the server side. Considering these modes show no obvious difference in performance, we chose to use transactions as they are more reliable and support rollback.



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    Looking at the CPU graph instead I can see many odd things:
    
    - Page writing is double copying the heap ByteBuffer into a native ByteBuffer to write on the page file (see the sketch after this list):
    ![image](https://user-images.githubusercontent.com/13125299/50661999-308bce80-0fa5-11e9-9ee4-053327c36cb1.png)
    - Compaction is stealing a lot of CPU and I/O:
    ![image](https://user-images.githubusercontent.com/13125299/50662028-4ef1ca00-0fa5-11e9-9e13-613b569e4c58.png)
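
    For context, a hedged sketch of the double-copy pattern being flagged (plain file I/O, illustrative only, not the Artemis page writer): the first variant encodes into a heap buffer and then copies the whole buffer into a native one just to write it, while the second encodes straight into a reusable direct buffer.

    ```java
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Illustrative sketch of "double copy" vs. encoding directly into a native buffer.
    public final class PageWriteSketch {

       // Double copy: the record is already encoded in a heap buffer and is copied
       // wholesale into a direct buffer only to perform the write.
       static void writeWithExtraCopy(FileChannel channel, ByteBuffer heapEncoded,
                                      ByteBuffer reusableDirect) throws IOException {
          reusableDirect.clear();
          reusableDirect.put(heapEncoded);   // the copy the profiler highlights
          reusableDirect.flip();
          channel.write(reusableDirect);
       }

       // Single copy: encode the record fields straight into the reusable direct buffer.
       static void writeDirect(FileChannel channel, ByteBuffer reusableDirect,
                               long messageId, byte[] body) throws IOException {
          reusableDirect.clear();
          reusableDirect.putLong(messageId);
          reusableDirect.putInt(body.length);
          reusableDirect.put(body);
          reusableDirect.flip();
          channel.write(reusableDirect);
       }
    }
    ```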



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    The receiver could be slower because of more IO. That’s not a bad thing. 
    
    
    I’m on vacation this week. I won’t be able to get to this.  Just saying why the receiver could be slower. 


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by qihongxu <gi...@git.apache.org>.
Github user qihongxu commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    
    
    
    > @qihongxu @michaelandrepearce
    > I'm running a CI job now: it will take some time, but when it comes back fine I will merge this
    > @qihongxu After all the relevant bits re paging are merged I will send another PR with the same 2 commits I have sent to your branch: are you available to help check the effects of that PR on your tests?
    
    
    > @qihongxu big thanks for all the effort on this!! And providing the testing time
    
    
    My pleasure :) We are glad to see any boost in perf, especially on paging mode.
    
    I will keep a close watch on the new PR you mentioned and run some more tests, as we have done in this issue, if needed.
    
    Also, thank you all for the reviews and work on this PR!


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    The version I have shown is just master, i.e. every read lock is still there!


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > Em... could you please tell us which issues? We need to verify how they affect our cluster.
    
    The big issue I'm referring to, which became a nightmare for my organisation, was that under high concurrency (a high-throughput, low-latency broker setup) the buffers could get mixed up, causing index-out-of-bounds errors.
    
    Fixes were multiple:
    https://github.com/apache/activemq-artemis/commit/024db5bd3c1656d265daf60c9e3a362d53b9088b
    https://github.com/apache/activemq-artemis/commit/da7fb89037481fb6343c760010d4553ff28ac87e
    
    I'm also aware there have been some other concurrency fixes for smaller issues.



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by qihongxu <gi...@git.apache.org>.
Github user qihongxu commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce @franz1981 
    After we ran tests on both versions (one with no lock and the other with the new LivePageCache & no lock), the result chart is below.
    
    Version | Send&Receive | In Total
    -- | -- | --
    Ver Original | 30k/12.5k | 32.5k
    Ver Modified checkDepage | 31.1k/11.3k | 42.4k
    Ver Modified hasNext | 23.8k/23.5k | 47.3k
    Ver no lock | **22.9k/26.5k** | **49.4k**
    Ver no lock & new livePageCache | **24.5k/22k** | **46.5k**
    
    Tests were run under the same conditions as before. Clients consumed and produced on the same queue, which had 10 million messages in it.
    All versions are based on ARTEMIS-2214, which caches values in PagedReference.
    



---

[GitHub] activemq-artemis pull request #2484: ARTEMIS-2216 Use a specific executor fo...

Posted by wy96f <gi...@git.apache.org>.
Github user wy96f commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2484#discussion_r245183888
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/paging/impl/PagingStoreImpl.java ---
    @@ -278,21 +293,26 @@ public boolean isPaging() {
           lock.readLock().lock();
     
           try {
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    -            return false;
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.FAIL) {
    -            return isFull();
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.DROP) {
    -            return isFull();
    -         }
    -         return paging;
    +         return isPagingDirtyRead();
           } finally {
              lock.readLock().unlock();
           }
        }
     
    +   @Override
    +   public boolean isPagingDirtyRead() {
    +      if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    --- End diff --
    
    addressFullMessagePolicy can be changed if the address setting is reapplied, so we need to load the value each time.


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce Done, the PR has been sent, now we can just wait for the perf results on it :)
    I have improved the live page cache behaviour/reliability quite a bit (especially on OOME), but sadly I see that the most-called method `getMessage` cannot be improved any further without making the lock-free code a real nightmare.
    The original version was O(n) depending on which message was queried, because it needed to walk the linked list of paged messages.
    In my version I have amortized the cost by using an interesting hybrid between an ArrayList and a LinkedList, similar to https://en.wikipedia.org/wiki/Unrolled_linked_list, but (very) optimized for addition.
    I'm mentioning this because for a long time I have wanted to design a single-threaded version of this same data structure to be used as the main data structure inside QueueImpl.
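
    For readers unfamiliar with the idea, here is a hedged single-threaded sketch of such a chunked list (illustrative only, not the actual LivePageCache code): an append-only list of fixed-size arrays, so add() is amortized O(1) and get(i) is O(1), unlike a plain linked list where get(i) is O(n).

    ```java
    import java.util.ArrayList;

    // Sketch: append-only "unrolled" list optimized for addition and indexed reads.
    class ChunkedList<E> {
       private static final int CHUNK_SIZE = 32;        // power of two for cheap indexing
       private final ArrayList<E[]> chunks = new ArrayList<>();
       private int size;

       @SuppressWarnings("unchecked")
       void add(E element) {
          int offset = size & (CHUNK_SIZE - 1);
          if (offset == 0) {                            // current chunk is full: append a new one
             chunks.add((E[]) new Object[CHUNK_SIZE]);
          }
          chunks.get(chunks.size() - 1)[offset] = element;
          size++;
       }

       E get(int index) {
          if (index < 0 || index >= size) {
             throw new IndexOutOfBoundsException("index " + index + ", size " + size);
          }
          return chunks.get(index / CHUNK_SIZE)[index & (CHUNK_SIZE - 1)];
       }

       int size() {
          return size;
       }
    }
    ```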



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > The version I have shown is just master, i.e. every read lock is still there!
    
    I get that, I'm more reflecting on this change in the PR.


---

[GitHub] activemq-artemis pull request #2484: ARTEMIS-2216 Use a specific executor fo...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2484#discussion_r245239811
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/paging/impl/PagingStoreImpl.java ---
    @@ -278,21 +293,26 @@ public boolean isPaging() {
           lock.readLock().lock();
     
           try {
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    -            return false;
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.FAIL) {
    -            return isFull();
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.DROP) {
    -            return isFull();
    -         }
    -         return paging;
    +         return isPagingDirtyRead();
           } finally {
              lock.readLock().unlock();
           }
        }
     
    +   @Override
    +   public boolean isPagingDirtyRead() {
    +      if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    --- End diff --
    
    @wy96f what @franz1981 is trying to say is that we can do the volatile read just once, by adding one line, e.g.
    
    
             AddressFullMessagePolicy addressFullMessagePolicy = this.addressFullMessagePolicy;
             if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
                return false;
             }
             if (addressFullMessagePolicy == AddressFullMessagePolicy.FAIL) {
                return isFull();
             }
             if (addressFullMessagePolicy == AddressFullMessagePolicy.DROP) {
                return isFull();
             }
             return paging;


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @franz1981 nice graphs, looks like essentially isPaging in QueueImpl, unless everything is dirty-read, just moves the problem somewhere else; I think we probably need to make a more general call on whether all of these are safe to dirty-read or not. (I'm starting to feel like Alice here and we're going to end up going down a rabbit hole ;) and will end up with the original isPaging just being dirty.)


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by qihongxu <gi...@git.apache.org>.
Github user qihongxu commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > @michaelandrepearce @qihongxu Ok, checked: the latest version of this PR + my branch https://github.com/franz1981/activemq-artemis/tree/lock-free-live-page-cache is fully non-blocking :)
    
    Nice work!!
    
    > Be great to see a final stat in @qihongxu's test env
    
    As soon as I'm back on Monday I will try to run the same tests on both versions (this one and franz's).


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > > Instead of transaction consumer you could use client acknowledge or even individual acknowledge.
    > 
    > It seems that both client-acknowledge and individual-acknowledge modes end up using a Transaction on the server side. Considering these modes show no obvious difference in performance, we chose to use transactions as they are more reliable and support rollback.
    
    @qihongxu you're using the JMS API, not the Core API, then?


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by qihongxu <gi...@git.apache.org>.
Github user qihongxu commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce As presented in the test results, adding a specific executor does improve producer rates, and it also means the page() method in PagingStoreImpl will acquire the write lock more often. At the same time QueueImpl::deliver calls isPaging() to check whether the queue is paging, which acquires the read lock and creates lock contention. This may explain why the increase in producer rates is accompanied by a decrease in consumer rates.
    Based on this we have an internal branch that removes the isPaging() method's read lock in PagingStoreImpl, along with adding pageSyncTimer's specific executor. After that we performed the same test and got a result of 21k/25k Send&Receive, 46k TPS in total. This version runs smoothly in our use case but we are still exploring potential side effects.
    The reason we use transactional receive is to make sure every message is indeed delivered and consumed. If we do a non-tx receive and the consumer system fails somewhere in the following processing chain, the message may be delivered but effectively “lost”. So we let the consumer decide when to ack and commit, instead of auto-committing acks.
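
    As a concrete illustration of that consumption pattern (a hedged sketch using the standard JMS API; the broker URL and queue name are placeholders): the consumer commits only after its own processing finishes, so a failure triggers rollback and redelivery instead of a “delivered but lost” message.

    ```java
    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

    // Sketch of a transacted JMS receive: the ack becomes durable only on commit.
    public class TransactedReceiveSketch {
       public static void main(String[] args) throws Exception {
          ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616");
          try (Connection connection = cf.createConnection()) {
             connection.start();
             // transacted session: receives are part of a local transaction
             Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
             Queue queue = session.createQueue("exampleQueue");
             MessageConsumer consumer = session.createConsumer(queue);

             Message message = consumer.receive(1000);
             if (message != null) {
                try {
                   process(message);   // downstream processing chain
                   session.commit();   // message is acknowledged only now
                } catch (Exception e) {
                   session.rollback(); // message will be redelivered
                }
             }
          }
       }

       private static void process(Message message) {
          // application-specific work
       }
    }
    ```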


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    If you could supply more complete stats with ramp-ups, e.g. perf at 1k, 2k, 3k... 5k... 15k... 20k, 21k, 22k... 25k etc. over time, rather than just a summary snapshot, that would be good.


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @qihongxu looks good to me.
    
    @franz1981 what does the current PR look like in heat maps?


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @qihongxu Just as a confirmation, I have used https://github.com/jvm-profiling-tools/async-profiler on lock events, i.e. time spent waiting to enter a lock, and I have found the exact same behaviours explained above:
    QueueImpl::checkDepage
    ![image](https://user-images.githubusercontent.com/13125299/50647451-de34b880-0f78-11e9-99bc-101c830cbda2.png)
    QueueImpl::DepageRunner
    ![image](https://user-images.githubusercontent.com/13125299/50647474-ef7dc500-0f78-11e9-857f-48d190a6e9c6.png)
    
    both are calling `PagingStoreImpl::isPaging` and are blocking each other
    



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @qihongxu,
    
    re:
    "
    Based on this we have an internal branch that removes the isPaging() method's read lock in PagingStoreImpl, along with adding pageSyncTimer's specific executor. After that we performed the same test and got a result of 21k/25k Send&Receive, 46k TPS in total. This version runs smoothly in our use case but we are still exploring potential side effects.
    "
    
    Agreed that, since QueueImpl::checkDepage (which is called from the deliver method) only schedules another depage asynchronously, the read lock there is overkill, as the result doesn't have to be strongly consistent.
    
    I think changing the current method could be dangerous unless we do a fuller analysis of all its uses, but I think we could easily add a new method named something like "isPagingDirtyRead" which can return without the read lock. It will mean the returned value isn't always 100% accurate, but in cases such as checkDepage we don't need it to be; it's just a signal, and we can call it where we know it's safe to.
    
    Would you be happy adding this (it will need to be exposed via PageSubscription) and updating checkDepage to use that one?
    
    
    



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @franz1981 based on this, shall we merge this PR as is, which is quite an impressive result at a combined 49.4k?
    
    And then work on the LivePageCache changes separately?
    
    



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by wy96f <gi...@git.apache.org>.
Github user wy96f commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > > CursorIterator:hasNext
    > 
    > I'm a bit concerned with this doing a dirty read, as this isn't something that just triggers an async action; in actual fact hasNext is purposefully synchronized. (Especially as some concurrency issues have recently been found in paging (e.g. Howard's current PR), I'm sensitive to us being too haphazard here.)
    
    
    I don't get why isPaging() in hasNext() needs to be consistent. The paging status can change after isPaging() unless we read-lock isPaging() and the subsequent operations. Without the read lock, a) if isPaging() returns true but the other thread sets it to false, the subsequent next() call reads no data and returns null; b) if isPaging() returns false but the other thread sets it to true (a new message coming), deliverAsync would be called later anyway.


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @franz1981 did you send a PR to @qihongxu's branch so he can merge it and this PR picks it up?
    
    Be great to see a final stat in @qihongxu's test env


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    The bit around lock contention on the paging check is very interesting: where @qihongxu removed it, performance rose. This would suggest the consumer perf change isn't related to IO but to lock contention in the code. It would definitely be good to see if it's safe to remove.


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by clebertsuconic <gi...@git.apache.org>.
Github user clebertsuconic commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    During initial development years ago I needed to be careful about when to leave paging. This needed some sync points to make sure depage would set it synchronously.
    
    I'm not saying anything against the change (I'm still in vacation mode, I'm back on Monday), just explaining the original intent when it was written.


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce Good idea! let me do it now



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by wy96f <gi...@git.apache.org>.
Github user wy96f commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    
    > I'd agree, I'm just cautious as we've been hit a few times with concurrency issues that have been a nightmare to find (and as recently as last month!).
    Em... could you please tell us which issues? We need to verify how they affect our cluster.
    



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > > > CursorIterator:hasNext
    > > 
    > > 
    > > I'm a bit concerned with this doing a dirty read, as this isn't something that just triggers an async action; in actual fact hasNext is purposefully synchronized. (Especially as some concurrency issues have recently been found in paging (e.g. Howard's current PR), I'm sensitive to us being too haphazard here.)
    > 
    > I don't get why isPaging() in hasNext() needs to be consistent. The paging status can change after isPaging() unless we read-lock isPaging() and the subsequent operations. Without the read lock, a) if isPaging() returns true but the other thread sets it to false, the subsequent next() call reads no data and returns null; b) if isPaging() returns false but the other thread sets it to true (a new message coming), deliverAsync would be called later anyway.
    
    If that's the case then the sync method could be removed as well, though I think @clebertsuconic has given some further info on why this is the way it is.


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > what does the current PR look like in heat maps?
    
    I'm finishing something to fix `LivePageCacheImpl` too, hopefully today, and I will post it: impl-wise it seems good to me as well :+1: 



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    I have taken a look at the version used for this PR + the change suggested by @michaelandrepearce
     on checkDepage, and I have found that:
    LivePageCacheImpl (in violet) is now a major contention point:
    ![image](https://user-images.githubusercontent.com/13125299/50660994-fa991b00-0fa1-11e9-93cf-88218f07a2de.png)
    While re `isPaging`:
    ![image](https://user-images.githubusercontent.com/13125299/50661277-df7adb00-0fa2-11e9-9ed1-94823381102a.png)
    and
    ![image](https://user-images.githubusercontent.com/13125299/50661299-eefa2400-0fa2-11e9-9463-45a9b0a57548.png)
    
    are now sources of contention.
    
    
    
    



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > Looking at the CPU graph instead I can see many odd things, i.e. Compaction is stealing a lot of CPU and I/O:
    
    Nice find.


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    Yes, if @qihongxu agrees to give a hand :P Re this PR, it probably makes sense to include at least the second commit of my PR, which was unrelated to the new cache impl and was just avoiding an unnecessary query that was always performed...


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce @qihongxu Just a lil OT but I'm getting these warns on master:
    ```
    2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,136 during compact replay
    2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,139 during compact replay
    2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,142 during compact replay
    2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,145 during compact replay
    2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,148 during compact replay
    2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,151 during compact replay
    2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,154 during compact replay
    ```


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    FYI: we are working on fixing the CI machines now, so ATM I won't be able to run any job on them :(
    I hope to have the CI machines up and running soon to continue :+1: 


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    So out of the two, checkDepage is the safer one to use a dirty read, so I'd expect to see that changed first, before anything else.


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @qihongxu @michaelandrepearce 
    I'm running a CI job now: it will take some time, but when it comes back fine I will merge this :+1: 
    @qihongxu After all the relevant bits re paging are merged I will send another PR with the same 2 commits I have sent to your branch: are you available to help check the effects of that PR on your tests?


---

[GitHub] activemq-artemis pull request #2484: ARTEMIS-2216 Use a specific executor fo...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2484#discussion_r245224854
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/paging/impl/PagingStoreImpl.java ---
    @@ -278,21 +293,26 @@ public boolean isPaging() {
           lock.readLock().lock();
     
           try {
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    -            return false;
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.FAIL) {
    -            return isFull();
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.DROP) {
    -            return isFull();
    -         }
    -         return paging;
    +         return isPagingDirtyRead();
           } finally {
              lock.readLock().unlock();
           }
        }
     
    +   @Override
    +   public boolean isPagingDirtyRead() {
    +      if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    --- End diff --
    
    Yes, but we can do the volatile load just once before checking its value 3 times, on each call of isPagingDirtyRead


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @qihongxu I don't see checkDepage using the dirtyRead in the current commit


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @qihongxu 
    
    > Ver no lock & new livePageCache
    
    Is it making use of the last 2 commits I have sent as a PR to your repository?
    It should be a lot faster than `Ver no lock`


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @franz1981 I think we'll have to do some more real-world testing with it (the difference between an isolated bench and a full e2e test), with @qihongxu's help hopefully; it might be that by releasing some performance it causes an odd bottleneck somewhere else.
    
    Are you OK if, once full CI passes, we merge as is and continue this on a separate PR?


---

[GitHub] activemq-artemis pull request #2484: ARTEMIS-2216 Use a specific executor fo...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2484#discussion_r245048241
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/paging/impl/PagingStoreImpl.java ---
    @@ -278,21 +293,26 @@ public boolean isPaging() {
           lock.readLock().lock();
     
           try {
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    -            return false;
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.FAIL) {
    -            return isFull();
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.DROP) {
    -            return isFull();
    -         }
    -         return paging;
    +         return isPagingDirtyRead();
           } finally {
              lock.readLock().unlock();
           }
        }
     
    +   @Override
    +   public boolean isPagingDirtyRead() {
    +      if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    --- End diff --
    
    we can read it just once and save it in a local variable, avoiding 3 volatile loads: the same can be done on the original version too


---

[GitHub] activemq-artemis pull request #2484: ARTEMIS-2216 Use a specific executor fo...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2484#discussion_r245095523
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/paging/cursor/impl/PageSubscriptionImpl.java ---
    @@ -1350,7 +1350,7 @@ public synchronized boolean hasNext() {
                 return true;
              }
     
    -         if (!pageStore.isPaging()) {
    +         if (!pageStore.isPagingDirtyRead()) {
    --- End diff --
    
    The concern here is that this isn't an async case.
    
    BTW I cannot see the change in QueueImpl to use the new paging dirty read.


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce @qihongxu Ok, checked: the latest version of this PR + my branch https://github.com/franz1981/activemq-artemis/tree/lock-free-live-page-cache is never blocked :)


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > @michaelandrepearce @qihongxu Just a lil OT but I'm getting these warns on master:
    > 
    > ```
    > 2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,136 during compact replay
    > 2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,139 during compact replay
    > 2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,142 during compact replay
    > 2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,145 during compact replay
    > 2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,148 during compact replay
    > 2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,151 during compact replay
    > 2019-01-03 17:36:44,408 WARN  [org.apache.activemq.artemis.journal] AMQ142007: Can not find record 8,103,154 during compact replay
    > ```
    
    Do you get this on master? If not then that is a BIG worry


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > A possible explanation might be that in our case consumers were reading old pages while producers were writing new pages. There is no intersection between these two groups, such as a common page being written & read by both of them.
    
    So why is the consumer side being slowed down? 
    I need to check, but I remember that consumers were querying the live page cache; I could be wrong.


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce @qihongxu I have re-implemented the LivePageCache to be completely lock-free: https://github.com/franz1981/activemq-artemis/tree/lock-free-live-page-cache
    Feel free to try it in this PR too, and I can do a PR to your branch.
    I will soon provide some charts of the contention state :+1: 


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @qihongxu big thanks for all the effort on this!!


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by qihongxu <gi...@git.apache.org>.
Github user qihongxu commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > @qihongxu,
    > 
    > re:
    > "
    > Based on this we have an internal branch that removes the isPaging() method's read lock in PagingStoreImpl, along with adding pageSyncTimer's specific executor. After that we performed the same test and got a result of 21k/25k Send&Receive, 46k TPS in total. This version runs smoothly in our use case but we are still exploring potential side effects.
    > "
    > 
    > Agreed that, since QueueImpl::checkDepage (which is called from the deliver method) only schedules another depage asynchronously, the read lock there is overkill, as the result doesn't have to be strongly consistent.
    > 
    > I think changing the current method could be dangerous unless we do a fuller analysis of all its uses, but I think we could easily add a new method named something like "isPagingDirtyRead" which can return without the read lock. It will mean the returned value isn't always 100% accurate, but in cases such as checkDepage we don't need it to be; it's just a signal, and we can call it where we know it's safe to.
    > 
    > Would you be happy adding this (it will need to be exposed via PageSubscription) and updating checkDepage to use that one?
    
    
    As you suggested we tried updating checkDepage with a new "isPagingDirtyRead" method which can return without the read lock. But after running the same tests on this version it seems that checkDepage did not affect receive performance much. Instead, CursorIterator#hasNext in PageSubscriptionImpl also calls isPaging() and had a significant impact on receive performance. Based on this we updated hasNext() and forced it to use the new “isPagingDirtyRead”. Below is the result chart.
    (P.S. the original ver is a control group, not exactly the “original” master-branch build. It has the specific pageSyncTimer executor applied, and caches the durable flag in PagedReference - see PR ARTEMIS-2214.)
    
    
    Version | Send&Receive | In Total
    -- | -- | --
    Ver Original | 30k/12.5k | 32.5k
    Ver Modified checkDepage | 31.1k/11.3k | 42.4k
    Ver Modified hasNext | 23.8k/23.5k | 47.3k
    
    A detailed TPS chart can be seen in the attachment. (Sorry, we didn't have enough time to run more tests; I just got back to work today.)
    [TPS chart.docx](https://github.com/apache/activemq-artemis/files/2723731/TPS.chart.docx)



---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by qihongxu <gi...@git.apache.org>.
Github user qihongxu commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce 
    Removed the read lock from isPaging().
    Also, as @franz1981 suggested, addressFullMessagePolicy is now volatile-loaded only once on each call.
    Please review and let me know if any more tests are needed :)


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @qihongxu Sorry to ask twice, but I haven't understood whether `Ver no lock & new livePageCache` was making use of this PR: https://github.com/qihongxu/activemq-artemis/pull/1 


---

[GitHub] activemq-artemis pull request #2484: ARTEMIS-2216 Use a specific executor fo...

Posted by wy96f <gi...@git.apache.org>.
Github user wy96f commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2484#discussion_r245252912
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/paging/impl/PagingStoreImpl.java ---
    @@ -278,21 +293,26 @@ public boolean isPaging() {
           lock.readLock().lock();
     
           try {
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    -            return false;
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.FAIL) {
    -            return isFull();
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.DROP) {
    -            return isFull();
    -         }
    -         return paging;
    +         return isPagingDirtyRead();
           } finally {
              lock.readLock().unlock();
           }
        }
     
    +   @Override
    +   public boolean isPagingDirtyRead() {
    +      if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    --- End diff --
    
    > Yes, but we can do the volatile load just once before checking its value 3 times, on each call of isPagingDirtyRead
    
    Got it. Nice catch :+1: 


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by qihongxu <gi...@git.apache.org>.
Github user qihongxu commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce 
    > @qihongxu you're using the JMS API, not the Core API, then?
    
    Yes (sometimes for compatibility concerns)
    
    > @qihongxu I don't see checkDepage using the dirtyRead in the current commit
    
    Sorry, I probably did not make it clear enough in my last comment. As the chart shows, we tried checkDepage using the dirty read but found it made no difference in perf. We even changed the isPaging() method in PageSubscriptionImpl from `return pageStore.isPaging();` to `return pageStore.isPagingDirtyRead();`. In this way not only checkDepage but also the other methods that call PageSubscription.isPaging() (as @franz1981 showed in the flame charts above) all use the dirty read. But still, same perf as before.
    Given all of the above we thought it safer to leave the others unchanged and only force CursorIterator::hasNext to use the dirty read, since it affects perf a lot. My latest PR update is based on these concerns.


---

[GitHub] activemq-artemis pull request #2484: ARTEMIS-2216 Use a specific executor fo...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2484#discussion_r245094809
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/paging/impl/PagingStoreImpl.java ---
    @@ -278,21 +293,26 @@ public boolean isPaging() {
           lock.readLock().lock();
     
           try {
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    -            return false;
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.FAIL) {
    -            return isFull();
    -         }
    -         if (addressFullMessagePolicy == AddressFullMessagePolicy.DROP) {
    -            return isFull();
    -         }
    -         return paging;
    +         return isPagingDirtyRead();
           } finally {
              lock.readLock().unlock();
           }
        }
     
    +   @Override
    +   public boolean isPagingDirtyRead() {
    +      if (addressFullMessagePolicy == AddressFullMessagePolicy.BLOCK) {
    --- End diff --
    
    nice idea!


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce 
    > Do you get this on master or this PR (I mean, is that a typo)?
    
    I've got that on master!


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @qihongxu ping! :) I'm too curious :)


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > CursorIterator:hasNext
    
    I'm a bit concerned with this doing a dirty read, as this isn't something that just triggers an async action; in actual fact hasNext is purposefully synchronized. 
    
    @franz1981 if checkDepage is removed from the lock, why would DepageRunner now be locking it up so badly? 


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    Comments on this are:
    
    We have shared executors and thread pools as a way to pool these resources; it's part of the Artemis design so that they can be controlled, e.g. the number of threads is configurable. There are use cases for why, and we don't just add executors. If we add a new pool it should be configurable like the others.
    
    Whilst in your use case this significantly improves things, as you note this change isn't entirely positive for all users, as it does impact use cases where people care more about the consumer side (I am one of those). If anything, when the broker is saturated it's more important for us that consumers can dequeue and that producers get backpressure, because if consumers can't keep up and the producer peak continues you can very soon end up in a situation where the paging depth will just grow and grow.
    
    It would therefore be good to get more stats on how, as producer rates ramp up, consumer rates ramp down, and on the critical crossover points, along with the number of remote consumers and producers. All of this could be moot if there is a way to eliminate the negative effect this change has.


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by michaelandrepearce <gi...@git.apache.org>.
Github user michaelandrepearce commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    > @michaelandrepearce
    > 
    > > Do you get this on master or this PR (I mean, is that a typo)?
    > 
    > I've got that on master!
    
    OK, so I think we need not worry about this within the realms of this PR. (It probably needs looking into, but doesn't need to be solved in this PR - IMO.)


---

[GitHub] activemq-artemis issue #2484: ARTEMIS-2216 Use a specific executor for pageS...

Posted by franz1981 <gi...@git.apache.org>.
Github user franz1981 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2484
  
    @michaelandrepearce I would like first to trigger a CI job of some kind, maybe @clebertsuconic can help with his superbox (just this time) to get an answer sooner?
    
    Re the cache, I was already thinking of sending another PR, but I have verified that it is virtually impossible that it's the reason for the consumer slow-down. These are the numbers of the bench comparing it with the original version:
    ```
    Benchmark               (size)      (type)   Mode  Cnt          Score          Error  Units
    CacheBench.getMessage1      32     chunked  thrpt   10  150039261.251 ± 12504804.694  ops/s
    CacheBench.getMessage1      32  linkedlist  thrpt   10   31776755.611 ±  1405838.635  ops/s
    CacheBench.getMessage1    1024     chunked  thrpt   10   31433127.788 ±  3902498.303  ops/s
    CacheBench.getMessage1    1024  linkedlist  thrpt   10    2638404.341 ±   119171.758  ops/s
    CacheBench.getMessage1  102400     chunked  thrpt   10     344799.911 ±    27267.965  ops/s
    CacheBench.getMessage1  102400  linkedlist  thrpt   10      20020.925 ±     5392.418  ops/s
    CacheBench.getMessage3      32     chunked  thrpt   10  384605640.192 ± 35164543.632  ops/s
    CacheBench.getMessage3      32  linkedlist  thrpt   10   14124979.510 ±  2875341.272  ops/s
    CacheBench.getMessage3    1024     chunked  thrpt   10   90418506.375 ±  4593688.556  ops/s
    CacheBench.getMessage3    1024  linkedlist  thrpt   10    1562687.000 ±    91433.926  ops/s
    CacheBench.getMessage3  102400     chunked  thrpt   10     978575.016 ±    44800.161  ops/s
    CacheBench.getMessage3  102400  linkedlist  thrpt   10      21614.717 ±     5344.742  ops/s
    ```
    Where `getMessage1` is `LivePageCacheImpl::getMessage` called @ random positions by 1 thread and 
    `getMessage3` is `LivePageCacheImpl::getMessage` called @ random positions by 3 threads.
    `chunked` is the new version and `linkedlist` the original version: the difference is quite large and the new version just scales linearly...
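
    For readers who haven't used JMH, a hedged sketch of how a benchmark shaped like the one above could look (the real CacheBench is not part of this thread; the cache here is a trivial stand-in):

    ```java
    import java.util.concurrent.ThreadLocalRandom;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.annotations.Threads;

    // Sketch only: getMessage called at random positions, parameterized by cache size.
    // The real bench additionally parameterizes the cache type (chunked vs linkedlist)
    // and runs a 3-thread variant (getMessage3), e.g. via @Threads(3).
    @State(Scope.Benchmark)
    public class CacheBenchSketch {

       public interface PageCache {
          Object getMessage(int position);
       }

       @Param({"32", "1024", "102400"})
       public int size;

       public PageCache cache;

       @Setup
       public void setup() {
          Object[] messages = new Object[size];
          for (int i = 0; i < size; i++) {
             messages[i] = new Object();
          }
          cache = position -> messages[position];   // stand-in for the real implementations
       }

       @Benchmark
       @Threads(1)
       public Object getMessage1() {
          return cache.getMessage(ThreadLocalRandom.current().nextInt(size));
       }
    }
    ```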


---