Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2022/04/15 04:44:00 UTC
[jira] [Created] (JAMES-3749) Better metrics for RabbitMQ
Benoit Tellier created JAMES-3749:
-------------------------------------
Summary: Better metrics for RabbitMQ
Key: JAMES-3749
URL: https://issues.apache.org/jira/browse/JAMES-3749
Project: James Server
Issue Type: Improvement
Components: Metrics, rabbitmq
Affects Versions: 3.7.0
Reporter: Benoit Tellier
Fix For: 3.8.0
To my surprise, IMAP performance tests were highly limited by RabbitMQ.
We lacked decent metrics on RabbitMQ / the event bus to clearly audit this.
I added a few metrics; here are the results:
{code:java}
name=rabbit-acquire, count=52615, min=0.010816, max=2197.815295, mean=14.692384926275777, stddev=84.45677147245601, p50=0.075775, p75=0.203775, p95=63.176703, p98=199.229439, p99=375.390207, p999=1216.348159, m1_rate=203.7148778352276, m5_rate=119.5444112071225, m15_rate=51.27213833578196, mean_rate=83.30809197633765, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-dispatch, count=27858, min=0.333824, max=2365.587455, mean=54.42362489080336, stddev=132.51032578954067, p50=15.466495, p75=43.253759, p95=229.638143, p98=432.013311, p99=633.339903, p999=1753.219071, m1_rate=109.2818061840995, m5_rate=70.86014994542329, m15_rate=40.57835311284083, mean_rate=104.18175115750351, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-register, count=2976, min=9.633792, max=5603.590143, mean=179.2821071827957, stddev=508.63321381095363, p50=50.331647, p75=103.809023, p95=687.865855, p98=2013.265919, p99=3003.121663, p999=5100.273663, m1_rate=3.7740876538564017, m5_rate=9.38568671432365, m15_rate=9.574694038543646, mean_rate=11.166515283444058, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-release, count=52600, min=6.64E-4, max=131.596287, mean=0.12847917764258554, stddev=1.7017175408795067, p50=0.006111, p75=0.010303, p95=0.035583, p98=1.269759, p99=2.719743, p999=17.432575, m1_rate=204.4922821701763, m5_rate=136.5136219818052, m15_rate=81.73827938617151, mean_rate=197.1284478914709, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-unregister, count=449, min=10.878976, max=2466.250751, mean=190.00389787082403, stddev=380.5671338872364, p50=51.118079, p75=135.266303, p95=1010.827263, p98=1702.887423, p99=1912.602623, p999=2466.250751, m1_rate=9.012783767745082, m5_rate=5.543710748795059, m15_rate=4.918174526687269, mean_rate=19.889486577715797, rate_unit=events/second, duration_unit=milliseconds
{code}
Analysis:
- dispatch takes a really long time and negatively impacts all other operations
- the channel pool was undersized (contention to get a channel)
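To make the contention point concrete, here is a minimal sketch of how the wait to obtain a channel could be timed. It uses a plain `Semaphore` as a stand-in for the channel pool; the `TimedPool` class and its method names are hypothetical and are not the James implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical bounded pool that records how long each caller waited for a slot. */
final class TimedPool {
    private final Semaphore permits;
    private final List<Long> acquireNanos = Collections.synchronizedList(new ArrayList<>());

    TimedPool(int size) {
        this.permits = new Semaphore(size);
    }

    /** Runs the work while holding one slot, timing only the wait for the slot. */
    <T> T withChannel(Supplier<T> work) {
        long start = System.nanoTime();
        permits.acquireUninterruptibly();            // blocks when the pool is exhausted
        acquireNanos.add(System.nanoTime() - start); // the "acquire" duration
        try {
            return work.get();                       // a slow dispatch keeps the slot busy
        } finally {
            permits.release();
        }
    }

    /** Snapshot of the recorded acquire durations, in nanoseconds. */
    List<Long> acquireNanos() {
        return new ArrayList<>(acquireNanos);
    }
}
```

With a small pool and slow work inside `withChannel`, the recorded waits grow, which is the shape seen in the rabbit-acquire percentiles above.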
I tried the following:
- https://issues.apache.org/jira/browse/JAMES-3747 reactive implementation for the RabbitMQ channel pool.
Better reactive code but not a game changer to be honest.
- Shorter routing key (don't include the full FQDN) -> small performance gains...
- Disable publish confirms: game changer! Dispatch went from a mean of 50ms+ and a p99 of 500ms+ to a mean of 1ms and a p99 of 8ms. All other metrics (bind / unbind) improved as well, and contention to acquire a channel is effectively gone.
- Turning off durability on notification channels unlocked further gains.
{code:java}
name=rabbit-acquire, count=1380387, min=0.005824, max=132.120575, mean=0.120752084973272, stddev=0.47552513438388405, p50=0.056831, p75=0.096767, p95=0.354303, p98=0.692223, p99=1.122303, p999=4.915199, m1_rate=1.6804637345701686E-238, m5_rate=9.96453889901133E-47, m15_rate=9.66533044449863E-15, mean_rate=32.71739356870932, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-dispatch, count=757489, min=0.063232, max=245.366783, mean=0.6006688857527964, stddev=1.0950763726058712, p50=0.456703, p75=0.610303, p95=1.310719, p98=2.064383, p99=2.949119, p999=9.764863, m1_rate=9.8844762264434E-239, m5_rate=5.952316454575153E-47, m15_rate=5.761608528831366E-15, mean_rate=17.954172496560656, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-register, count=18810, min=3.6864, max=1317.011455, mean=21.051024209250397, stddev=41.433421890807836, p50=12.058623, p75=19.529727, p95=66.322431, p98=106.954751, p99=155.189247, p999=507.510783, m1_rate=1.5731326523774949E-260, m5_rate=2.4205436310646514E-54, m15_rate=3.643125827717571E-18, mean_rate=0.4463666568673792, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-release, count=1380385, min=6.6E-4, max=131.596287, mean=0.00816027294848901, stddev=0.12039944630360619, p50=0.006143, p75=0.009087, p95=0.013183, p98=0.021759, p99=0.034047, p999=0.230399, m1_rate=1.6486782760303405E-238, m5_rate=9.925571881590765E-47, m15_rate=9.652806233445519E-15, mean_rate=32.71825157947973, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-unregister, count=18810, min=3.11296, max=1761.607679, mean=28.082487391812865, stddev=64.54031379534226, p50=12.582911, p75=20.971519, p95=100.139007, p98=192.937983, p99=287.309823, p999=805.306367, m1_rate=7.66299030904417E-260, m5_rate=1.8971818595887257E-53, m15_rate=8.637433599041584E-18, mean_rate=0.44693388443373244, rate_unit=events/second, duration_unit=milliseconds
{code}
I was able to double the request count in my IMAP benches and still get a threefold latency reduction.
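For anyone who wants to compare the two snapshots field by field, the reporter lines can be parsed with a few lines of Java. This is only a sketch: the `MetricLine` helper is hypothetical, and the sample strings are abbreviated copies of the two rabbit-dispatch lines above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that parses one "key=value, key=value" reporter line. */
final class MetricLine {
    private MetricLine() {}

    /** Splits the line on commas and keeps the fields in their original order. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : line.split(",\\s*")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                fields.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return fields;
    }

    /** Before/after ratio for a numeric field such as "mean" or "p99". */
    static double ratio(Map<String, String> before, Map<String, String> after, String field) {
        return Double.parseDouble(before.get(field)) / Double.parseDouble(after.get(field));
    }

    public static void main(String[] args) {
        // Abbreviated copies of the rabbit-dispatch lines from the two runs.
        Map<String, String> before = parse(
            "name=rabbit-dispatch, count=27858, mean=54.42362489080336, p99=633.339903");
        Map<String, String> after = parse(
            "name=rabbit-dispatch, count=757489, mean=0.6006688857527964, p99=2.949119");
        System.out.printf("dispatch mean ratio: %.1f, p99 ratio: %.1f%n",
            ratio(before, after, "mean"), ratio(before, after, "p99"));
    }
}
```

Note that the second snapshot was taken with both publish confirms and notification durability turned off, so the ratios reflect the combined effect of both changes.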
h3. Proposals
- Offer an option to disable publish confirms. This new James 3.7.0 behaviour brings nice resiliency semantics but is definitely harmful for scalability. We can imagine some users wanting to turn it off.
- Offer a way to turn off durability on notifications. Notifications are likely not critical, and their loss is acceptable.
- Add those cool RabbitMQ metrics.
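As a rough illustration of the first two proposals, the toggles could be read from a properties file with safe defaults. The property names and the `EventBusOptions` class below are assumptions made for this sketch, not existing James configuration.

```java
import java.util.Properties;

/** Hypothetical holder for the two proposed toggles; the key names are illustrative only. */
final class EventBusOptions {
    // Assumed property names, not taken from the James code base.
    static final String PUBLISH_CONFIRM_KEY = "event.bus.publish.confirm.enabled";
    static final String NOTIFICATION_DURABILITY_KEY = "event.bus.notification.durability.enabled";

    final boolean publishConfirmEnabled;
    final boolean notificationDurabilityEnabled;

    private EventBusOptions(boolean publishConfirmEnabled, boolean notificationDurabilityEnabled) {
        this.publishConfirmEnabled = publishConfirmEnabled;
        this.notificationDurabilityEnabled = notificationDurabilityEnabled;
    }

    /** Both options default to true, so nothing changes unless a user opts out. */
    static EventBusOptions from(Properties properties) {
        return new EventBusOptions(
            Boolean.parseBoolean(properties.getProperty(PUBLISH_CONFIRM_KEY, "true")),
            Boolean.parseBoolean(properties.getProperty(NOTIFICATION_DURABILITY_KEY, "true")));
    }
}
```

Defaulting both to true keeps the safer behaviour for existing deployments and makes the throughput trade-off an explicit opt-in.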
And of course, invest in an alternative to RabbitMQ that does not force us to choose between throughput and safety. One idea: Pulsar.