Posted to user@ignite.apache.org by Brent Williams <br...@gmail.com> on 2019/03/25 19:53:39 UTC

Java Thin Client TCP Connections

All,

I am running Apache Ignite 2.7.0. I have 3 nodes in my cluster; CPU,
memory, and GC are all tuned properly. I have even raised the file limit
to 65k open files. I have 8 client nodes connecting to the 3-node
cluster, and for the most part everything works fine. However, we see
spikes in connections, we blow past the file limit, we get "too many open
files" errors, and all client nodes hang.

When I check the connections per client on one of the server nodes, I am
seeing 5,500+ established TCP connections per host, which is roughly
44,000+ in total. My questions: what should the file limits be? Why so
many TCP connections per host? And how do we control this, since it is
causing our production cluster to hang?

--Brent

Re: Java Thin Client TCP Connections

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I can already see that you are not closing the old IgniteClient when you
reconnect, which means its resources are never freed.
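
For example, a reconnect() along these lines (an untested sketch that
assumes the rest of your IgniteCacheManager stays as posted) would release
the old connection before opening a new one:

public void reconnect() {
    // Close the old client first so its TCP connection and buffers are
    // freed. IgniteClient implements AutoCloseable.
    try {
        if (igniteClient != null)
            igniteClient.close();
    } catch (Exception e) {
        // The old connection is likely already broken; log and move on.
    }
    igniteClient = Ignition.startClient(cfg);
}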

Have you considered using Ignite (with Ignition.setClientMode(true))
instead of IgniteClient?
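
A client node joins the cluster topology once and reuses its connections
for the life of the process. As a rough sketch (the config path and cache
name here are placeholders, not your actual ones):

Ignition.setClientMode(true);
try (Ignite ignite = Ignition.start("client-config.xml")) {
    IgniteCache<String, Object> cache = ignite.getOrCreateCache("myCache");
    // ... use the cache as before ...
}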

Regards,
-- 
Ilya Kasnacheev


Re: Java Thin Client TCP Connections

Posted by Brent Williams <br...@gmail.com>.
Igor,

Thanks for responding.

I have two Java singletons that I use. The first is the IgniteCacheManager,
which starts the client instance. The second is another singleton that
caches the repository name + CacheClient so we can reuse them throughout
the process.

/**
 * This is a singleton for Apache Ignite; it starts the client instance.
 */
public class IgniteCacheManager {
    private static IgniteCacheManager instance;
    private IgniteClient igniteClient;
    private ClientConfiguration cfg;

    public static IgniteCacheManager getInstance(YamlConfig cacheConfig) {
        if (instance == null) instance = new IgniteCacheManager(cacheConfig);
        return instance;
    }

    private IgniteCacheManager(YamlConfig cacheConfig) {
        String hostsMap = cacheConfig.getString("hosts");
        String[] hosts = null;
        if (hostsMap != null) {
            hosts = hostsMap.split(",");
        } else {
            hosts = new String[] { "localhost:10800" };
        }
        cfg = new ClientConfiguration().setAddresses(hosts)
                .setTimeout(cacheConfig.getInteger("cache.timeout"));
        igniteClient = Ignition.startClient(cfg);
    }

    public IgniteClient getClient() {
        return this.igniteClient;
    }

    public void reconnect() {
        igniteClient = Ignition.startClient(cfg);
    }
}


public class CacheFactory {
    private static CacheFactory instance;
    private YamlConfig cacheConfig;
    private Map<String, CacheClient<?>> clientCache = new ConcurrentHashMap<>();

    public static CacheFactory getInstance(YamlConfig cacheConfig) {
        if (instance == null) instance = new CacheFactory(cacheConfig);
        return instance;
    }

    /**
     * This is the main factory method for pulling a cache instance to
     * begin caching.
     */
    public <T> CacheClient<T> getCacheProvider(Class<T> t, String repository) {
        CacheClient<T> client = null;
        if (clientCache.containsKey(repository)) {
            client = (CacheClient<T>) clientCache.get(repository);
        } else {
            try {
                client = new IgniteCacheClientWrapper<T>(cacheConfig, repository);
                clientCache.put(repository, client);
            } catch (Exception ex) {
                // If we encounter any errors, return null and let the
                // caller decide how to act on the null response.
                LOG.error(ex);
            }
        }
        return client;
    }
}

To call this, we use this method inside all of our request threads:

public <T> CacheClient<T> getCacheClient(Class<T> t, String key) {
    factory = CacheFactory.getInstance(cacheConfig);
    return factory.getCacheProvider(t, key);
}

...
getCacheClient(StorageContainer.class, partner).get(id);
...


The spikes are unpredictable: we see normal load on all 3 nodes, but we do
see a huge spike in these errors around the time the hosts lock up.

Mar 25 06:25:04 prd-cache001 service.sh[10538]: java.io.IOException: Connection reset by peer
Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at sun.nio.ch.IOUtil.read(IOUtil.java:197)
Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1104)
Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2389)
Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2156)
Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1797)
Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at java.lang.Thread.run(Thread.java:748)
Mar 25 06:25:04 prd-cache001 service.sh[10538]: [06:25:04,742][SEVERE][grid-nio-worker-client-listener-1-#30][ClientListenerProcessor] Failed to process selector key [ses=GridSelectorNioSessionImpl [worker=ByteBufferNioClientWorker [readBuf=java.nio.HeapByteBuffer[pos=0 lim=8192 cap=8192], super=AbstractNioClientWorker [idx=1, bytesRcvd=0, bytesSent=0, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-client-listener-1, igniteInstanceName=null, finished=false, heartbeatTs=1553520191555, hashCode=1720789126, interrupted=false, runner=grid-nio-worker-client-listener-1-#30]]], writeBuf=null, readBuf=null, inRecovery=null, outRecovery=null, super=GridNioSessionImpl [locAddr=/10.132.52.64:10800, rmtAddr=/10.132.52.59:49105, createTime=1553519004533, closeTime=0, bytesSent=5, bytesRcvd=12, bytesSent0=0, bytesRcvd0=0, sndSchedTime=1553519005007, lastSndTime=1553519005525, lastRcvTime=1553519005007, readsPaused=false, filterChain=FilterChain[filters=[GridNioAsyncNotifyFilter, GridNioCodecFilter [parser=ClientListenerBufferedParser, directMode=false]], accepted=true, markedForClose=false]]]

Then we see the TCP connection count go way up, we get "too many open
files", and request times take forever. One thing I have changed is the
client timeout: I had it at 100 ms but increased it to 250 ms. I am not
sure whether load causes connections to time out so the client spawns more
connections, or whether GC causes the connections to hang.

My 3 nodes have 2 CPUs and 8 GB RAM; during load peaks we see averages of
6% CPU with 30% memory utilization. Here are the settings I have for my GC:

/usr/bin/java -server -Xms1g -Xmx1g -XX:+AlwaysPreTouch -XX:+UseG1GC
-XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxMetaspaceSize=256m
-Djava.net.preferIPv4Stack=true -DIGNITE_QUIET=true
-DIGNITE_SUCCESS_FILE=/usr/share/apache-ignite/work/ignite_success_274df869-bebf-47d0-8c9e-6b2da78f1f09
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=49112
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-DIGNITE_HOME=/usr/share/apache-ignite
-DIGNITE_PROG_NAME=/usr/share/apache-ignite/bin/ignite.sh -cp
/usr/share/apache-ignite/libs/*:/usr/share/apache-ignite/libs/ignite-indexing/*:/usr/share/apache-ignite/libs/ignite-rest-http/*:/usr/share/apache-ignite/libs/ignite-spring/*:/usr/share/apache-ignite/libs/licenses/*
org.apache.ignite.startup.cmdline.CommandLineStartup
/etc/apache-ignite/default-config.xml




Re: Java Thin Client TCP Connections

Posted by Igor Sapego <is...@apache.org>.
That's really weird; there should not be so many connections. Normally, a
thin client opens at most one TCP connection per node, and in many cases
there is going to be only one connection.
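
For reference, the intended usage is a single long-lived IgniteClient
shared by the whole process, e.g. (minimal sketch; the addresses and cache
name are placeholders):

import org.apache.ignite.Ignition;
import org.apache.ignite.client.ClientCache;
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.configuration.ClientConfiguration;

ClientConfiguration cfg = new ClientConfiguration()
        .setAddresses("host1:10800", "host2:10800", "host3:10800");

// One client per process; close it only on shutdown.
try (IgniteClient client = Ignition.startClient(cfg)) {
    ClientCache<Integer, String> cache = client.getOrCreateCache("myCache");
    cache.put(1, "value");
}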

Do you create the IgniteClient in your application once, or do you start
it several times? Could it be that your code is leaking IgniteClient
instances?

Can you provide a minimal reproducer so we can debug the issue?

Best Regards,
Igor

