Posted to user@ignite.apache.org by adipro <ad...@zohocorp.com> on 2020/06/23 05:09:04 UTC

Thread starvations

We are getting these warnings quite often

```
./ignite-3fb86334.0.log:[09:48:02,663][WARNING][grid-timeout-worker-#71][IgniteKernal] Possible thread pool starvation detected (no task completed in last 30000ms, is query thread pool size large enough?)
./ignite-3fb86334.5.log:[03:50:26,683][WARNING][grid-timeout-worker-#71][G] >>> Possible starvation in striped pool.
./ignite-3fb86334.5.log:[03:50:56,687][WARNING][grid-timeout-worker-#71][G] >>> Possible starvation in striped pool.
./ignite-3fb86334.5.log:[03:51:26,690][WARNING][grid-timeout-worker-#71][G] >>> Possible starvation in striped pool.
./ignite-3fb86334.5.log:[03:51:26,692][WARNING][grid-timeout-worker-#71][G] >>> Possible starvation in striped pool.
./ignite-3fb86334.5.log:[03:51:56,695][WARNING][grid-timeout-worker-#71][G] >>> Possible starvation in striped pool.
./ignite-3fb86334.5.log:[03:52:24,302][WARNING][grid-timeout-worker-#71][diagnostic] First 10 long running cache futures [total=1]
./ignite-3fb86334.5.log:[03:52:24,302][WARNING][grid-timeout-worker-#71][diagnostic] >>> Future [startTime=03:49:57.774, curTime=03:52:24.237, fut=GridDhtAtomicUpdateFuture [updateCntr=49148, super=GridDhtAtomicAbstractUpdateFuture [futId=158766398, resCnt=0, addedReader=false, dhtRes=TransformMapView {43935ea9-e7cc-43d7-8895-3b16fc865f0a=[res=false, size=49148, nearSize=0]}]]]
./ignite-3fb86334.5.log:[03:52:26,702][WARNING][grid-timeout-worker-#71][G] >>> Possible starvation in striped pool.
./ignite-3fb86334.5.log:[04:02:26,797][WARNING][grid-timeout-worker-#71][G] >>> Possible starvation in striped pool.
./ignite-3fb86334.5.log:[04:17:24,415][WARNING][grid-timeout-worker-#71][diagnostic] First 10 long running cache futures [total=1]
./ignite-3fb86334.5.log:[04:17:24,415][WARNING][grid-timeout-worker-#71][diagnostic] >>> Future [startTime=04:16:21.004, curTime=04:17:24.412, fut=GridDhtAtomicUpdateFuture [updateCntr=49100, super=GridDhtAtomicAbstractUpdateFuture [futId=159106604, resCnt=0, addedReader=false, dhtRes=TransformMapView {43935ea9-e7cc-43d7-8895-3b16fc865f0a=[res=false, size=49100, nearSize=0]}]]]
./ignite-3fb86334.5.log:[04:46:57,195][WARNING][grid-timeout-worker-#71][G] >>> Possible starvation in striped pool.
./ignite-3fb86334.5.log:[04:55:05,029][WARNING][sys-stripe-1-#2][GridContinuousProcessor] Failed to wait for ack message. [node=92ef74a3-d6c6-48e2-a786-67dd5cbe2132, routine=0501eb73-1217-446b-8a96-4574a2483cc6]
./ignite-3fb86334.5.log:[05:01:47,172][WARNING][checkpoint-runner-#222][GridCacheDatabaseSharedManager] 1 checkpoint pages were not written yet due to unsuccessful page write lock acquisition and will be retried
./ignite-3fb86334.5.log:[05:01:57,451][WARNING][grid-timeout-worker-#71][G] >>> Possible starvation in striped pool.
./ignite-3fb86334.5.log:[05:15:29,983][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 792 milliseconds.
./ignite-3fb86334.5.log:[05:16:57,678][WARNING][grid-timeout-worker-#71][G] >>> Possible starvation in striped pool.
./ignite-3fb86334.5.log:[05:17:57,694][WARNING][grid-timeout-worker-#71][G] >>> Possible starvation in striped pool.
./ignite-3fb86334.5.log:[05:25:45,857][WARNING][sys-stripe-0-#1][GridContinuousProcessor] Failed to wait for ack message. [node=92ef74a3-d6c6-48e2-a786-67dd5cbe2132, routine=0501eb73-1217-446b-8a96-4574a2483cc6]
```


The setup ->

Two client servers, each running 500 threads simultaneously without locks,
eagerly querying the Ignite DB.
Two DB servers with 30GB RAM (20GB off-heap) and 2.3GHz, 64-core CPUs.

Important points:
-> At most we get around 300 connections at a time to the DB servers.
-> All queries are optimised.
-> All caches are in REPLICATED mode with FULL_SYNC write synchronization. We
use SQL querying too.
-> On both DB servers, all thread pool sizes are the defaults, i.e. equal to
the number of CPU cores.

We have done almost all of the performance optimisations mentioned in the
Ignite docs but are still seeing these warnings. If we need to withstand heavy
load like the above scenario, what should we be doing?




Re: Thread starvations

Posted by adipro <ad...@zohocorp.com>.
Sorry, the above log excerpt doesn't contain all the info. Please check this file:

Logs.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2763/Logs.zip>  




Re: Thread starvations

Posted by adipro <ad...@zohocorp.com>.
Thanks for the reply.

We've done query analysis and found that pretty much all queries take around
1-5ms of execution time, which we consider fine for our operations.

But we have one query which inserts rows into a table. We receive at most 5000
records at a time and use the IgniteDataStreamer API with a JCache
QueryEntity-defined table to insert the rows. We don't use JDBC. For all
querying we use either SqlFieldsQuery or the JCache getAll call. We are not
using compute jobs either.

Given the above, in real time we get around 1500-2000 bulk row insertions from
each thread. All 300-400 threads insert at once without any locking mechanism.
This gives us an insert time of around 40-50 ms per thread.
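
For reference, the insertion path looks roughly like the sketch below
(simplified; the cache name, binary type name and fields are placeholders, not
our actual schema):

```
import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.binary.BinaryObject;

public class BulkInsertSketch {
    // 'rows' stands in for one ~5000-record batch; names and fields are placeholders.
    static void insertBatch(Ignite ignite, List<long[]> rows) {
        try (IgniteDataStreamer<Long, BinaryObject> streamer = ignite.dataStreamer("RECORDS_CACHE")) {
            streamer.allowOverwrite(false); // pure inserts, no updates of existing keys
            streamer.keepBinary(true);      // stream BinaryObjects rather than deserialized POJOs

            for (long[] r : rows) {
                BinaryObject row = ignite.binary().builder("RecordType") // value type declared via QueryEntity
                    .setField("id", r[0])
                    .setField("value", r[1])
                    .build();

                streamer.addData(r[0], row);
            }
        } // close() flushes any remaining buffered entries to the cluster
    }
}
```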

We suspect this is why so many thread starvations are happening.

Weirdly, most of the log entries show read query delays, yet no delays are
observed on the client application side.

I'm sending you the entire logs of the coordinator node, covering a run time of
around two days. Please find them attached.

Logs.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2763/Logs.zip>  

Unfortunately I am not the system administrator and couldn't run jstack on that
system, but I'll try to get a thread dump if possible.

Version - We are using GridGain's libraries, version 8.7.10.
Client heap size is 2GB and server heap size is 4GB.
Config -> Persistence is ON. Checkpointing frequency is every 15 minutes. WAL
archiving is off. The off-heap data region size is 20GB. Persistence usage
would be around 25GB in real time for our case. All caches are in REPLICATED
mode, with FULL_SYNC between the two server nodes. The WAL segment size is
256MB. All caches use ATOMIC atomicity mode.
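
Roughly, that config corresponds to the Java sketch below (only the settings
listed above are shown; the cache name is a placeholder):

```
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerConfigSketch {
    static IgniteConfiguration config() {
        DataStorageConfiguration storage = new DataStorageConfiguration()
            .setWalSegmentSize(256 * 1024 * 1024)     // 256MB WAL segments
            .setCheckpointFrequency(15 * 60 * 1000L); // checkpoint every 15 minutes

        storage.getDefaultDataRegionConfiguration()
            .setPersistenceEnabled(true)              // persistence is ON
            .setMaxSize(20L * 1024 * 1024 * 1024);    // 20GB off-heap data region

        CacheConfiguration<Object, Object> cache = new CacheConfiguration<>("EXAMPLE_CACHE")
            .setCacheMode(CacheMode.REPLICATED)
            .setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC)
            .setAtomicityMode(CacheAtomicityMode.ATOMIC);

        return new IgniteConfiguration()
            .setDataStorageConfiguration(storage)
            .setCacheConfiguration(cache);
    }
}
```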

Topology -> Two clients connected to two servers, 4 nodes in total. All thread
pool sizes are the defaults, i.e. equal to the number of CPU cores. Client
machines have 40-core CPUs; server machines have 64-core CPUs.

The system is actually working, but we are worried that these warnings might
lead to issues in the future, such as cluster-wide data corruption. We want to
know whether we are doing something wrong and, if so, we would like to correct it.

We would be grateful if you could go through the entire log. Thank you.




Re: Thread starvations

Posted by akorensh <al...@gmail.com>.
Hi,
   In your case, the striped pool starvation messages show that some
Ignite-related tasks are taking longer than a preconfigured threshold. (See
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/IgniteConfiguration.html#getFailureDetectionTimeout--)
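
   That timeout can be tuned on IgniteConfiguration. A minimal sketch (the
values shown are just the defaults/illustrative, and setSystemWorkerBlockedTimeout
is assumed to be available in your Ignite/GridGain version):

```
import org.apache.ignite.configuration.IgniteConfiguration;

public class TimeoutSketch {
    static IgniteConfiguration withTimeouts() {
        return new IgniteConfiguration()
            // Failure detection timeout referenced above (10 seconds is the default).
            .setFailureDetectionTimeout(10_000L)
            // Threshold after which a blocked system worker is reported; assumes
            // this setter exists in your version (adjust or omit if not).
            .setSystemWorkerBlockedTimeout(30_000L);
    }
}
```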
   
   As long as the system works through all the tasks w/out any errors, no
data is being corrupted.
   
   Looking at the thread stacks below, I see that most of the slow tasks
   are related to I/O and memory allocation.
   Per the metrics, you are running at the bounds of your configured default
   data region. I would increase the amount of memory given to this data region.
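   For example (a sketch; the 24GB figure is purely illustrative and you need
to leave headroom for the JVM heap and the OS):

```
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class DataRegionSketch {
    static IgniteConfiguration withBiggerDefaultRegion() {
        DataStorageConfiguration storage = new DataStorageConfiguration();

        // Grow the default data region beyond the current 20GB; the figure below
        // is only an illustration, size it to what the box can actually spare.
        storage.getDefaultDataRegionConfiguration()
            .setPersistenceEnabled(true)
            .setMaxSize(24L * 1024 * 1024 * 1024);

        return new IgniteConfiguration().setDataStorageConfiguration(storage);
    }
}
```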
   
   
    Follow these guides to optimize I/O and durable memory:
	https://apacheignite.readme.io/docs/performance-tips
	https://apacheignite.readme.io/v2.7.0/docs/durable-memory-tuning
	
    If you are using putAll in your code:
https://apacheignite.readme.io/docs/jcache#basic-operations
    If batch operations (such as IgniteCache#putAll, IgniteCache#invokeAll, etc.)
    are performed in parallel, then keys should be ordered in the same way to
    avoid deadlock. Use TreeMap instead of HashMap to guarantee consistent ordering.
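    For instance (a sketch, assuming a hypothetical IgniteCache<Long, String>
    named "myCache"):

```
import java.util.Map;
import java.util.TreeMap;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;

public class OrderedPutAllSketch {
    // Batch updates keyed through a TreeMap so that concurrent putAll() calls
    // lock keys in the same (ascending) order and cannot deadlock each other.
    static void putBatch(Ignite ignite, Map<Long, String> unordered) {
        IgniteCache<Long, String> cache = ignite.cache("myCache"); // hypothetical cache name

        Map<Long, String> ordered = new TreeMap<>(unordered);      // consistent key ordering

        cache.putAll(ordered);
    }
}
```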
	
Thanks, Alex	

----------------------------------------------------------------------------------
    
Look at the free indicators for the default memory region:
    ^-- Off-heap [used=20084MB, free=3.35%, comm=20680MB]
    ^--   sysMemPlc region [used=0MB, free=99.98%, comm=100MB]
    ^--   default region [used=20084MB, free=1.93%, comm=20480MB]
	

I/O operation
Thread [name="sys-stripe-2-#3", id=28, state=WAITING, blockCnt=120, waitCnt=5301307]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
        at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
        at o.a.i.i.util.future.GridFutureAdapter.getUninterruptibly(GridFutureAdapter.java:145)
        at o.a.i.i.processors.cache.persistence.file.AsyncFileIO.read(AsyncFileIO.java:128)
        at o.a.i.i.processors.cache.persistence.file.AbstractFileIO$2.run(AbstractFileIO.java:89)
        at o.a.i.i.processors.cache.persistence.file.AbstractFileIO.fully(AbstractFileIO.java:52)
        at o.a.i.i.processors.cache.persistence.file.AbstractFileIO.readFully(AbstractFileIO.java:87)
        at o.a.i.i.processors.cache.persistence.file.FilePageStore.readWithFailover(FilePageStore.java:791)
    
off-heap memory related operation.
Thread [name="sys-stripe-1-#2", id=27, state=RUNNABLE, blockCnt=70, waitCnt=6227939]
        at o.a.i.i.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:735)
        at o.a.i.i.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:698)
        at o.a.i.i.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:687)
        at o.a.i.i.processors.cache.persistence.DataStructure.acquirePage(DataStructure.java:156)
        at o.a.i.i.processors.cache.persistence.freelist.PagesList.put(PagesList.java:686)
        at 




Re: Thread starvations

Posted by R S S Aditya Harish <ad...@zohocorp.com>.
Hi Alex, can you please check this, or ask someone to check this scenario?



---- On Tue, 23 Jun 2020 20:18:51 +0530 akorensh <al...@gmail.com> wrote ----




Re: Thread starvations

Posted by akorensh <al...@gmail.com>.
Hi,
   Starvation messages contain queue information, showing which tasks are
pending.
   Looks like you are overloading the system with too many requests/queries
that are taking a long time.

   Send over log/reproducer/config/version #s/heap sizes and I'll take a
look. Take a thread dump using jstack at several points in time to see what is
happening and include it in your message. Include gc logs:
https://apacheignite.readme.io/docs/jvm-and-system-tuning#detailed-garbage-collection-stats


List the steps you take to get these messages. Are you sending SqlQuery
requests from each of your threads (all 300 at once)? How many nodes are in
your setup? How long does each query take to complete? Do you use a thick
client, a thin client, or JDBC to send requests, or are you doing it on the
server directly?

  These messages are informational; the system could still work through all
requests, given enough time.
  Is that your case? Do you have a certain time limit you want these
requests to complete in?

  see: https://apacheignite.readme.io/docs/thread-pools#overview
   Try to vary the query/striped pool sizes to see whether it makes a difference.
   Also adjust memory/gc:
https://apacheignite.readme.io/docs/jvm-and-system-tuning#garbage-collection-tuning
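
   The query/striped pool sizes, for example, can be set like this (a sketch;
the figures are placeholders to experiment with, not recommendations):

```
import org.apache.ignite.configuration.IgniteConfiguration;

public class PoolSizeSketch {
    static IgniteConfiguration withTunedPools() {
        return new IgniteConfiguration()
            // Both pools default to the number of CPU cores.
            .setQueryThreadPoolSize(32)
            .setStripedPoolSize(32);
    }
}
```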

Thanks, Alex
   

  


