You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@geode.apache.org by "Darrel Schneider (JIRA)" <ji...@apache.org> on 2017/03/10 19:22:05 UTC

[jira] [Resolved] (GEODE-654) Interaction of GII and LIFO eviction can cause high CPU and recovery issues

     [ https://issues.apache.org/jira/browse/GEODE-654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Darrel Schneider resolved GEODE-654.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.0.0-incubating

The fix is on geode in git rev: f7670e1688a95a026a0376724bfe2277cb16eb94

> Interaction of GII and LIFO eviction can cause high CPU and recovery issues
> ---------------------------------------------------------------------------
>
>                 Key: GEODE-654
>                 URL: https://issues.apache.org/jira/browse/GEODE-654
>             Project: Geode
>          Issue Type: Bug
>          Components: core
>            Reporter: Vincent Ford
>            Assignee: Vincent Ford
>             Fix For: 1.0.0-incubating
>
>
> This was seen in a customer engagement with GemFire but also effects Geode and the usage of  LIFO eviction algorithm.  The GemFire support team led by David Wisler identified the following pattern. In the specific case the GII was blocked from completing processing and also blocked a put into the processing queue of a Gateway ( not part of Geode but would effect any usage of LIFO eviction ).
> (** 1 **)
> "Pooled High Priority Message Processor 4" daemon prio=10 tid=0x00002ba27c008800 nid=0x3fc5 waiting for monitor entry [0x00002ba1894f2000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at com.gemstone.gemfire.internal.cache.InitialImageOperation$RequestImageMessage.chunkEntries(InitialImageOperation.java:1555)
> 	- waiting to lock <0x00000005b81413d0> (a com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)
> 1554	synchronized(mapEntry) { // bug #46042 must sync to make sure the tag goes with the value
> 1555		VersionSource<?> id = stamp.getMemberID();
> 1556		if (id == null) { id = myId; }
> 1557			foundIds.add(id);
> 1558			// if the recipient passed a version vector, use it to filter out
> 1559			// entries the recipient already has
> (** 2 **)
> "pool-26-thread-4" prio=10 tid=0x00002ba390042800 nid=0x4073 waiting for monitor entry [0x00002ba18b4d0000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:205)
> 	- waiting to lock <0x00000005b7624058> (a com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:440)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:416)
> 	at com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:964)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.notifyGatewayHubs(LocalRegion.java:6295)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5858)
> 	at com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3203)
> 	- locked <0x00000005b81413d0> (a com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)
> Below it is clear that we are not getting out of the eviction loop, for some reason the method "evictEntry" is returning 0 as the amount of evicted bytes and that's why following  thread keeps swapping between BLOCKED and RUNNABLE and making no progress.
> [2015-09-08 12:39:09]
> "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f waiting for monitor entry [0x00002ba18b34b000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.evictEntry(AbstractLRURegionMap.java:302)
> 	- locked <0x00000005b9b7c8c8> (a com.gemstone.gemfire.internal.cache.VersionedThinDiskLRURegionEntry)
> 	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:522)
> 	at com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
> 	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
> 	at com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
> 	at com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
> 	at com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
> 	- locked <0x00000005b7624058> (a com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> [2015-09-08 12:40:14]
> "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f runnable [0x00002ba18b34b000]
>    java.lang.Thread.State: RUNNABLE
> 	at com.gemstone.gemfire.internal.cache.lru.NewLIFOClockHand.getLRUEntry(NewLIFOClockHand.java:40)
> 	- locked <0x00000005b8c55c90> (a com.gemstone.gemfire.internal.cache.lru.NewLRUClockHand$HeadLock)
> 	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:520)
> 	at com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
> 	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
> 	at com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
> 	at com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
> 	at com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
> 	- locked <0x00000005b7624058> (a com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> From the AbstractLRURegionMap...
> 527		while (_getCCHelper().mustEvict(stats, _getOwner(), bytesToEvict)) {
> 528			LRUEntry removalEntry = (LRUEntry)_getLruList().getLRUEntry();
> 529			if (removalEntry != null) {
> 530				if (evictEntry(removalEntry, stats) != 0) {
> 531					if (debug) {
> 532						debugLogging("evicted entry key(2)=" + removalEntry.getKey()
> 533							+ " total entry size is now: " + getTotalEntrySize()
> 534							+ " bytesToEvict :" + bytesToEvict);
> 535			//                if (NewLRUClockHand.debug && logWriter != null) {
> 536			//                  _getLruList().dumpList(logWriter);
> 537			//                }
> 538					}
> 539					stats.incEvictions();
> 540					if (_isOwnerALocalRegion()) {
> 541						if (_getOwner() instanceof BucketRegion) {
> 542							((BucketRegion)_getOwner()).incEvictions(1);
> 543						}
> 544					}
> 545					if (debug)
> 546						debugLogging("evictions=" + stats.getEvictions());
> 547						_getCCHelper().afterEviction();
> 548
> 549				}
> 550			}
> 551
> 552			else {
> 553				if (debug && getTotalEntrySize() != 0) {
> 554					debugLogging("leaving evict loop early");
> 555				}
> 556
> 557				break;
> 558			}
> 559		}
> Summary---
> 1. Serial gateway sender queues use LIFO_MEMORY as their eviction algorithm and others may choose this option as well
> 2. Due to the above, when we mustEvict, we call getLRUEntry implementation of NewLIFEClockHand 
> 3. Unfortunately, this LIFO based implementation doesn't seem to work in all cases during gII. This implementation always returns simply the last entry in the queue without directly removing it. It is possible to have the last element on the list (returned by getLRUEntry for LIFO_MEMORY) be in a state that causes us to not evict any bytes, but then just continue looping forever expecting state to change.
> 4. This appears to be a corner case race condition and will be difficult to test or reproduce given the timings and data distribution required to encounter the issue. For example to encounter the distributed system created by Geode must be in GII between members, have entry in queue that can not be evicted ( part of transaction, certain internal tokens...), queue is in state requiring eviction. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)