Posted to dev@directory.apache.org by Marc Boorshtein <mb...@gmail.com> on 2023/01/20 14:57:07 UTC

Network performance issues under heavy load

We're using ApacheDS as a frontend for MyVD, running 2.0.0.AM27-SNAPSHOT.
We're finding that under heavy load (~300 concurrent connections) we'll
periodically get "broken pipe" errors from the client.  I can reproduce
this pretty easily with JMeter's LDAP module.  The errors tend to come in
bunches, and they seem to coincide with garbage collection events (under
really heavy loads you can see the logs slow down momentarily, and then the
errors occur).

My test bed is a Mac M2 running Java 17; the server, however, is an Amazon
m5a running the Corretto Java 18 JVM with the ZGC garbage collector.
Here are the JVM switches:

-Xms4g -Xmx4g -XX:+UnlockExperimentalVMOptions -XX:+UseZGC
-Dsun.net.client.defaultConnectTimeout=10000
-Dsun.net.client.defaultReadTimeout=20000
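
To check whether the errors really line up with GC activity, unified GC
logging could be added alongside these switches (the log file path below is
just an example):

-Xlog:gc*:file=gc.log:time,uptime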


 We're not seeing any issues in the MyVD portion of the system, nor in the
downstream directories being proxied.  Also, if we add an artificial
bottleneck by dropping the connection pool size from 300 to 50 while still
maintaining 300 clients, the issue decreases dramatically.

Any thoughts as to where I can start debugging this issue?  A thread dump
analysis doesn't show any deadlocks.
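
For reference, the dumps can be grabbed with jcmd (with <pid> standing in for
the ApacheDS process id):

jcmd <pid> Thread.print > threads.txt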

Thanks
Marc

Re: Network performance issues under heavy load

Posted by Emmanuel Lécharny <el...@gmail.com>.
> 
> On Fri, Jan 20, 2023 at 1:29 PM Marc Boorshtein <mboorshtein@gmail.com> wrote:
> 
> 
>         I would say that we only have a limited number of threads dedicated
>         to processing the incoming messages, and this number is computed
>         based on the number of cores you have on your machine.
> 
> 
>     Is this configurable?  I'd like to be able to adjust it to figure out
>     if it makes an impact.
> 

There is an ads-transportNbThreads parameter that defaults to 3, which is
pretty low. You should be able to tweak it:


dn: ads-transportid=ldap,ou=transports,ads-serverId=ldapServer,ou=servers,ads-directoryServiceId=default,ou=config
ads-systemport: 10389
ads-transportnbthreads: 8      <--------- Here
ads-transportaddress: 0.0.0.0
ads-transportid: ldap
objectclass: ads-transport
objectclass: ads-tcpTransport
objectClass: ads-base
objectclass: top
ads-enabled: TRUE
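
For example, it can be modified through the config partition with a standard
ldapmodify (a sketch: it assumes the default uid=admin,ou=system admin
account, the 10389 port shown above, and a server restart afterwards so the
transport picks up the new value):

# nbthreads.ldif
dn: ads-transportid=ldap,ou=transports,ads-serverId=ldapServer,ou=servers,ads-directoryServiceId=default,ou=config
changetype: modify
replace: ads-transportnbthreads
ads-transportnbthreads: 8

ldapmodify -H ldap://localhost:10389 -D uid=admin,ou=system -W -f nbthreads.ldif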



> 
>         First thing: is your server topping out at 100% CPU? With ZGC it
>         should not stop anything...
> 
> 
>     No, not even close
> 
> 
>         Have you profiled the server while running under your stress
>         test (not
>         tracing, but sampling)?
> 
> 
>     Do you have a tool you could recommend?


I'm a user of YourKit, but any profiler will do.
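
If no profiler is at hand, JDK Flight Recorder can also capture a sampling
profile (a sketch, assuming a JFR-capable JDK; <pid> is the server process id):

jcmd <pid> JFR.start name=stress settings=profile duration=120s filename=apacheds-stress.jfr

The resulting .jfr file can then be opened in JDK Mission Control.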


> 
>         Second thing: are you sure that it occurs when the GC kicks in?
> 
> 
>     It's a guess.

-- 
*Emmanuel Lécharny - CTO* 205 Promenade des Anglais – 06200 NICE
T. +33 (0)4 89 97 36 50
P. +33 (0)6 08 33 32 61
emmanuel.lecharny@busit.com https://www.busit.com/



Re: Network performance issues under heavy load

Posted by Emmanuel Lécharny <el...@gmail.com>.

On 20/01/2023 20:28, Marc Boorshtein wrote:
> Here's an additional datapoint.  Enabling TLS eliminates the issue 
> entirely (not LDAP+StartTLS, but just straight LDAPS).  Once I enabled 
> TLS I was able to hammer MyVD with 300+ inbound connections and 300 
> outbound connections and it worked great!


This is extra weird...

TLS is managed entirely by MINA, and I can't see how enabling something
that eats CPU can make the server run faster :/

At this point, a thread dump could help...
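
Ideally a few of them, taken while the errors are occurring, e.g. (a sketch,
with <pid> standing in for the ApacheDS process id):

for i in 1 2 3 4 5; do jcmd <pid> Thread.print > td-$i.txt; sleep 2; done

That should show whether the MINA I/O processor threads or the worker threads
are all stuck in the same place when the broken pipes appear.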



-- 
*Emmanuel Lécharny - CTO* 205 Promenade des Anglais – 06200 NICE
T. +33 (0)4 89 97 36 50
P. +33 (0)6 08 33 32 61
emmanuel.lecharny@busit.com https://www.busit.com/



Re: Network performance issues under heavy load

Posted by Marc Boorshtein <mb...@gmail.com>.
Here's an additional datapoint.  Enabling TLS eliminates the issue entirely
(not LDAP+StartTLS, but just straight LDAPS).  Once I enabled TLS I was
able to hammer MyVD with 300+ inbound connections and 300 outbound
connections and it worked great!

So while I'd love to say "we'll just use LDAPS", the customer would prefer
not to, so I need to keep digging into this.
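
One thing worth checking is the connection states on the listener, to see
whether sockets pile up in CLOSE-WAIT or get reset when the errors hit. A
sketch, assuming a Linux host with the ss tool and the default 10389 port:

ss -tan '( sport = :10389 )' | awk 'NR>1 {print $1}' | sort | uniq -c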


Re: Network performance issues under heavy load

Posted by Emmanuel Lécharny <el...@gmail.com>.
Hi Marc,

not any obvious clue.

I would say that we only have a limited number of threads dedicated to
processing the incoming messages, and this number is computed based on the
number of cores you have on your machine.

First thing: is your server topping out at 100% CPU? With ZGC it should not
stop anything...
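
Per-thread CPU usage can be checked with, for instance (<pid> being the
ApacheDS process id):

top -H -p <pid>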

Have you profiled the server while running under your stress test (not 
tracing, but sampling)?

Second thing: are you sure that it occurs when the GC kicks in?


-- 
*Emmanuel Lécharny - CTO* 205 Promenade des Anglais – 06200 NICE
T. +33 (0)4 89 97 36 50
P. +33 (0)6 08 33 32 61
emmanuel.lecharny@busit.com https://www.busit.com/
