Posted to notifications@james.apache.org by GitBox <gi...@apache.org> on 2020/12/01 09:03:32 UTC

[GitHub] [james-project] chibenwa edited a comment on pull request #255: JAMES-3435 Cassandra: No longer rely on LWT for domain and users

chibenwa edited a comment on pull request #255:
URL: https://github.com/apache/james-project/pull/255#issuecomment-736297452


   For one of our upcoming deployments, we are performing a load-testing campaign against a testing infrastructure. This load-testing campaign aims at finding the limits of the aforementioned platform.
   
   We successfully loaded the James JMAP endpoint up to a breaking point at 5400 users (in isolation).
   
   Above that number, evidence suggests that we are CPU bound (requests).
   
   From a Cassandra standpoint, there is high CPU usage (load of 10) that we linked to the usage of lightweight transactions / Paxos for ACLs [1] [2] [3] [4]. A detailed analysis is in the references.
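   
   To make the concern concrete, here is a minimal sketch (using the DataStax Java driver against a hypothetical `acl` table and keyspace, NOT the actual James schema) contrasting a conditional LWT write, which triggers a full Paxos round, with a plain write:
   
   ```java
   import com.datastax.oss.driver.api.core.CqlSession;
   
   public class AclWriteSketch {
       public static void main(String[] args) {
           // Hypothetical keyspace and table names, not the actual James schema.
           try (CqlSession session = CqlSession.builder().withKeyspace("sketch").build()) {
               // LWT / conditional write: every execution runs a Paxos round
               // (prepare, propose, commit) and touches the system.paxos table,
               // which is where the extra CPU and compaction load comes from.
               session.execute(
                   "UPDATE acl SET rights = 'lrw' "
                   + "WHERE mailbox_id = 42 AND username = 'alice' IF rights = 'lr'");
   
               // Plain write: a single last-write-wins mutation, no Paxos round.
               // This is roughly what turning LWT off means, at the cost of losing
               // the compare-and-set guarantee on concurrent updates.
               session.execute(
                   "UPDATE acl SET rights = 'lrw' "
                   + "WHERE mailbox_id = 42 AND username = 'alice'");
           }
       }
   }
   ```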
   
   Once again, this is a topic I have been arguing for months [5]; I need all of your support so that we can take a strong decision here and enforce it.
   
   Infrastructure:
    - 3x Cassandra nodes (8 cores, 32 GB RAM, 200 GB SSD)
    - 4x James servers (4 cores, 8 GB RAM)
    - ElasticSearch servers: not measured.
   
   # Actions to conduct
   
    - Perform a test run with ACL Paxos turned off.
     -> This aims at confirming the deleterious impact of their usage
     -> Benoit & René are responsible for deploying and testing a modified instance of James on PRE-PROD, with ACL Paxos turned off
     -> Benoit will continue lobbying AGAINST the usage of strong consistency in the community [5], which is overall a Cassandra bad practice and a poor fit.
     -> If conclusive, Benoit will present a data-race-proof ACL implementation on top of Cassandra leveraging CRDTs and eventual consistency (see the sketch after this list).
   
    - Perform a run with more James CPU (4 * 6 CPUs?) (René & Benoit)
     -> The goal is to see whether we are James CPU bound or Cassandra CPU bound
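   
   As an illustration of what such an eventually consistent approach could look like, here is a minimal sketch (again DataStax Java driver, with a hypothetical `acl_v2` table; this is NOT the actual proposal in this PR) storing rights as a CQL set, so that grants and revocations become per-element mutations that commute instead of a read-modify-write cycle protected by LWT:
   
   ```java
   import com.datastax.oss.driver.api.core.CqlSession;
   
   public class EventuallyConsistentAclSketch {
       public static void main(String[] args) {
           // Hypothetical keyspace and table, not the actual James schema or PR code.
           try (CqlSession session = CqlSession.builder().withKeyspace("sketch").build()) {
               session.execute(
                   "CREATE TABLE IF NOT EXISTS acl_v2 ("
                   + "mailbox_id bigint, username text, rights set<text>, "
                   + "PRIMARY KEY (mailbox_id, username))");
   
               // Granting a right is a per-element set addition: it commutes with
               // concurrent grants/revocations and needs no read-before-write, no Paxos.
               session.execute(
                   "UPDATE acl_v2 SET rights = rights + {'w'} "
                   + "WHERE mailbox_id = 42 AND username = 'alice'");
   
               // Revoking a right is likewise a per-element removal; conflicts are
               // resolved per element by write timestamp (last write wins).
               session.execute(
                   "UPDATE acl_v2 SET rights = rights - {'w'} "
                   + "WHERE mailbox_id = 42 AND username = 'alice'");
           }
       }
   }
   ```
   
   The exact schema would need design work; the point is only that per-element mutations remove the read-modify-write cycle that forces LWT in the first place.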
   
   # Run details
   
   ![4000-stats](https://user-images.githubusercontent.com/6928740/100713233-8dd56800-33e6-11eb-8c23-2dbe90436ab9.png)
   
   ![4000-latency](https://user-images.githubusercontent.com/6928740/100713229-8c0ba480-33e6-11eb-8328-a29252ca1a1e.png)
   
   [6] [7] show a (successful!) run of the JMAP scenario alone on top of James.
   
   ![6000-stats](https://user-images.githubusercontent.com/6928740/100713248-97f76680-33e6-11eb-831d-8a50539a6844.png)
   
   ![6000-latency](https://user-images.githubusercontent.com/6928740/100713264-9d54b100-33e6-11eb-85b7-ed64c6e40459.png)
   
   [8] [9] show a run hitting a throughput limit (5400 simultaneous users, 320 req/s) beyond which performance degrades sharply. This is the system's breaking point.
   
   # References
   
   [1] https://blog.pythian.com/lightweight-transactions-cassandra/ documents the CPU / memory / bandwidth impact of using LWT.
   
   [dstat-cassandra.txt](https://github.com/apache/james-project/files/5621066/dstat-cassandra.txt)
   
   [2] dstat-cassandra.txt highlights CPU over-usage on the Cassandra nodes. This behavior is NOT NORMAL: read-heavy workloads are not supposed to be CPU-bound.
   
   [cassandra-tablestats.txt](https://github.com/apache/james-project/files/5621067/cassandra-tablestats.txt)
   
   [3] cassandra-tablestats.txt shows table usage. We can notice that BY FAR our most used table is the system.paxos table.
   
   [compaction-history.txt](https://github.com/apache/james-project/files/5621070/compaction-history.txt)
   
   [4] compaction-history.txt highlights how often we compact the paxos system table in comparison to other tables, further highlighting it as a hot spot.
   
   
   [5] Benoit's proposition to review lightweight transaction / Paxos usage in James: https://github.com/apache/james-project/pull/255
   
   [6] 4000-stats.png shows good statistics for a run with 4000 users
   [7] 4000-latency.png shows the latency evolution with regard to the number of users, for a run with 4000 users
   [8] 6000-stats.png shows good statistics for a run with 6000 users
   [9] 6000-latency.png shows the latency evolution with regard to the number of users, for a run with 6000 users. The performance breakage can be seen at 5400 users.

