Posted to dev@openjpa.apache.org by "Will Dazey (JIRA)" <ji...@apache.org> on 2018/05/01 17:56:00 UTC

[jira] [Comment Edited] (OPENJPA-2646) Sporadic NullPointerException occurs under heavy load when QuerySQLCache is enabled.

    [ https://issues.apache.org/jira/browse/OPENJPA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459791#comment-16459791 ] 

Will Dazey edited comment on OPENJPA-2646 at 5/1/18 5:55 PM:
-------------------------------------------------------------

Hello!

After digging further into this issue, which Heath already investigated extensively (thanks for that, by the way!), I have found another problem that needs to be addressed. This new issue revolves around the "overflow" strategy employed by the CacheMap. Let me see if I can explain the situation as well as Heath did:

The story starts with how the CacheMap is set up to balance entries between a hard cache ('cacheMap') and a soft cache ('softMap'). When the hard cache reaches its maximum size, this strategy shifts a random value from the hard cache to the soft cache (and when the soft cache gets full, it simply evicts something and logs a warning that this is happening). However, there is a timing and locking problem: the two events, 1) remove from the hard cache and 2) put into the soft cache, leave a gap in time where the value is in neither Map! This leads to the same NullPointerException that Heath was describing.
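
To make the hand-off concrete before diving into the OpenJPA code, here is a minimal, self-contained sketch of the idea (plain HashMaps and invented names, not the actual CacheMap classes): the migration from the hard map to the soft map happens as two separate steps, and a lookup that runs between those steps sees the entry in neither map.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Simplified two-tier cache sketch (illustrative only, not the OpenJPA CacheMap).
public class TwoTierCacheSketch {

    private final Map<String, Object> hardCache = new HashMap<>();
    private final Map<String, Object> softCache = new HashMap<>();
    private final int softMax = 2;

    // Called when the hard cache is full: move one entry over to the soft cache.
    void overflowToSoft(String victimKey) {
        Object value = hardCache.remove(victimKey); // step 1: entry leaves the hard cache
        // <-- gap: a reader that checks both maps right here finds nothing
        if (value != null) {
            if (softCache.size() < softMax) {
                softCache.put(victimKey, value);    // step 2: entry lands in the soft cache
            } else {
                System.err.println("warning: soft cache full, dropping " + victimKey);
            }
        }
    }

    // Two-tier lookup: hard cache first, then soft cache.
    Object lookup(String key) {
        Object value = hardCache.get(key);
        if (value == null) {
            value = softCache.get(key);             // misses if called inside the gap above
        }
        return value;
    }
}
{code}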

To take a look at the issue in the code, here is the first part: the put() implementation of OpenJPA's ConcurrentHashMap:

org.apache.openjpa.lib.util.concurrent.ConcurrentHashMap.put(Object key, Object value)

{code:java}
     public Object put(Object key, Object value) {
        Object k = maskNull(key);
        int hash = hash(k);
        synchronized (this) {
            int i = hash & (table.length - 1);

            for (Entry e = table[i]; e != null; e = e.next) {
                if (e.hash == hash && eq(k, e.key)) {
                    Object oldValue = e.value;
                    e.value = value;
                    return oldValue;
                }
            }

            if (maxSize != Integer.MAX_VALUE)
                removeOverflow(maxSize - 1);
            table[i] = createEntry(hash, k, value, table[i]);
            if (size++ >= threshold) resize(2 * table.length);
        }
        return null;
    }
{code}

The portion of interest here is the call to {{removeOverflow(maxSize - 1)}}. This call is made only when the object being inserted into the map is completely new, which makes sense since this is a map: if the key already exists, the backing size of the map doesn't actually change and nothing needs to be evicted. However, when a new key is introduced and maxSize is not set to unlimited (Integer.MAX_VALUE), we remove the overflow. So let's take a look at that implementation:

org.apache.openjpa.lib.util.concurrent.ConcurrentHashMap.removeOverflow(int maxSize)
{code:java}
    private void removeOverflow(int maxSize) {
        while (size > maxSize) {
            Map.Entry entry = removeRandom();
            if (entry == null)
                break;
            overflowRemoved(entry.getKey(), entry.getValue());
        }
    }
{code}

Observe here how we first call removeRandom(), and then, as long as a value was actually removed, we call overflowRemoved(). This sequencing is the key to the issue. The method removeRandom() is pretty straightforward in that it finds a random index in the backing array and removes that entry. One thing that is special about removeRandom() is that it uses synchronization! This is good: while the value is being removed, other threads should be blocked from accessing the map (assuming they also use synchronization, which is doubtful). However, after removeRandom() returns, that synchronization is released, and only then do we call overflowRemoved()...
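
Before looking at what overflowRemoved() does in CacheMap, here is a small, self-contained sketch (plain HashMaps and CountDownLatches, not the OpenJPA classes) that forces exactly this window: the victim has already been removed from the first map, the hand-off to the second map has not happened yet, and a reader that checks both maps in between comes up empty.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

// Forces the "in flight" window with latches: remove from the first map,
// let a reader look, then finish the hand-off to the second map.
public class EvictionWindowDemo {
    static final Map<String, String> hardMap = new HashMap<>();
    static final Map<String, String> softMap = new HashMap<>();

    public static void main(String[] args) throws InterruptedException {
        hardMap.put("sql-key", "prepared-query");
        CountDownLatch removed = new CountDownLatch(1);
        CountDownLatch readerDone = new CountDownLatch(1);

        Thread writer = new Thread(() -> {
            String victim = hardMap.remove("sql-key"); // analogue of removeRandom()
            removed.countDown();                       // entry is now in neither map
            await(readerDone);                         // let the reader observe the gap
            softMap.put("sql-key", victim);            // analogue of overflowRemoved()
        });

        Thread reader = new Thread(() -> {
            await(removed);
            String value = hardMap.get("sql-key");
            if (value == null) {
                value = softMap.get("sql-key");        // still null: spurious cache miss
            }
            System.out.println("reader saw: " + value); // prints "reader saw: null"
            readerDone.countDown();
        });

        writer.start();
        reader.start();
        writer.join();
        reader.join();
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}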

This overflowRemoved() method is a bit more complex. The default implementation of this method actually does nothing; it's empty. However, in CacheMap, when creating our ConcurrentHashMap, we override this specific method, providing the following implementation:

org.apache.openjpa.util.CacheMap (constructor and cacheMapOverflowRemoved(Object key, Object value))
{code:java}

    public CacheMap(boolean lru, int max, int size, float load,
        int concurrencyLevel) {
        ...
        cacheMap = new ConcurrentHashMap(size, load) {
                public void overflowRemoved(Object key, Object value) {
                    cacheMapOverflowRemoved(key, value);
                }
            };
        ...
    }

    protected void cacheMapOverflowRemoved(Object key, Object value) {
        if (softMap.size() < softMap.getMaxSize())
            put(softMap, key, value);
        else
            entryRemoved(key, value, true);
    }
{code}

So, what this means is that when an overflow entry is removed from the hard cache, there is a brief window in which that entry exists in neither the hard nor the soft cache. Since non-LRU maps do not appear to take readLocks (as described in Heath's analysis), there is nothing preventing a thread from reading during this overflow-migration window and finding that the value is null. This, in fact, is the root cause of the NullPointerException Heath analyzed above:

org.apache.openjpa.persistence.QueryImpl.preExecute(Map params)
{code:java}
    private boolean preExecute(Map params) {
        ...
        //cache read might find the value and return that the _id is registered/cached
        Boolean registered = cache.register(_id, _query, fetch);
        boolean alreadyCached = (registered == null);
        String lang = _query.getLanguage();
        QueryStatistics<String> stats = cache.getStatistics();
        if (alreadyCached && LANG_PREPARED_SQL.equals(lang)) {
            //Now the query is being randomly evicted by some other thread
            //This means we could read here, but the value is in between caches, causing a miss
            PreparedQuery pq = _em.getPreparedQuery(_id);
            if (pq.isInitialized()) {
        ...
    }
{code}
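
To tie the pieces together, here is a condensed, self-contained simulation of this failure path (stub types and a plain map stand in for the real PreparedQueryCacheImpl and EntityManager, so all names here are illustrative rather than the OpenJPA API): register() reports the query as already cached, another thread then evicts the entry into its "in flight" state, and the follow-up lookup returns null, so the call to isInitialized() throws the NullPointerException seen in the stack trace.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Condensed simulation of the preExecute() failure path (illustrative stubs,
// not the OpenJPA API).
public class PreparedQueryMissSketch {
    static class PreparedQueryStub {
        boolean isInitialized() { return true; }
    }

    // Stand-in for the prepared-query cache backed by the two-tier CacheMap.
    static final Map<String, PreparedQueryStub> cache = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        String id = "select o from MyEntity o where o.id = 1";
        cache.put(id, new PreparedQueryStub());

        // register(): the key is present, so report "already cached" by returning null.
        Boolean registered = cache.containsKey(id) ? null : Boolean.TRUE;
        boolean alreadyCached = (registered == null);

        // Another thread evicts the entry from the hard cache; it is now "in flight"
        // and temporarily visible in neither map (simulated here by a plain remove).
        cache.remove(id);

        if (alreadyCached) {
            PreparedQueryStub pq = cache.get(id);   // getPreparedQuery(_id) analogue: null
            System.out.println(pq.isInitialized()); // NullPointerException, as in the JIRA
        }
    }
}
{code}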


> Sporadic NullPointerException occurs under heavy load when QuerySQLCache is enabled.
> ------------------------------------------------------------------------------------
>
>                 Key: OPENJPA-2646
>                 URL: https://issues.apache.org/jira/browse/OPENJPA-2646
>             Project: OpenJPA
>          Issue Type: Bug
>          Components: query
>    Affects Versions: 2.2.3, 2.4.1
>            Reporter: Heath Thomann
>            Assignee: Heath Thomann
>            Priority: Major
>         Attachments: MyEntity.java, PreparedQueryCacheTest.java, PreparedQueryTest.debug
>
>
> Hello!  This is a follow-up/continuation of JIRA OPENJPA-2609.  I will assume the reader slogged through that one, as some of the info in that JIRA is very relevant here.  The fix for that JIRA was necessary to plug a hole in the PreparedQuery (PQ) cache.  The fix did reduce my customer's occurrences of the exception by around 90%.  However, they still saw the reported exception, and in addition they see the following exception:
> Caused by: java.lang.NullPointerException
> at org.apache.openjpa.persistence.QueryImpl.preExecute(QueryImpl.java:557)
> at org.apache.openjpa.persistence.QueryImpl.execute(QueryImpl.java:285)
> at org.apache.openjpa.persistence.QueryImpl.getResultList(QueryImpl.java:302)
> at com.ibm.ws.persistence.QueryImpl.getResultList(QueryImpl.java:118)
> .....
>  
> As I mentioned in OPENJPA-2609, I was not able to exactly recreate the issue.  I described some of the strategic breakpoints I had to use, and the assumptions I had to make.  At last, I was able to recreate this issue with a multi-threaded test and a specific query string (among other things)!  This JIRA will take the reader on another journey through the laborious process I went through to recreate the issue, and finally track down the very, I mean very, narrow window where the CCEx or NPE can occur.  So, settle in and let's start our fun adventure!
> First, let me give an overview of the issue and look at the code to illustrate how the exceptions can occur.  After that I'll dive into the recreate.  
> The exception can occur when an item (a query string mapped to a PQ in my case) in the CacheMap is moved from the  "soft" cache to the main cache.  Please view the CacheMap code to see what I mean by the main cache (see variable 'cacheMap') and what I mean by the soft cache (see variable 'softMap').  When an item exists in the soft cache, and a thread requests that item, the item is moved out of the soft cache and put in the main cache.  If the main cache is full, another item is kicked out of the main cache.  The item which is removed from the main cache is put into a soft cache.  As this move from soft cache to main cache occurs, there is a narrow window where the item appears to not be in the caches at all.  In other words, if a (reader) thread asks (CacheMap.get) for item X, where another (writer) thread is moving item X from the soft cache to the main cache, the reader thread will not see item X and think the item is not cached.
> To see this better, let's look at CacheMap.get:
>  public Object get(Object key) {
>         readLock();
>         try {
>             // Check the main map first
>             Object  val = cacheMap.get(key);
>             if (val == null) {
>                 // if we find the key in the soft map, move it back into
>                 // the primary map
>                 val = softMap.get(key);
>                 if (val != null){
>                     put(key, val);
>                 }else{
>                     val = pinnedMap.get(key);
>                 }
>             }
>            
>             return val;
>         } finally {
>             readUnlock();
>         }
>     }
> As you can see, we first look in the main cache for a given key (in my scenario the key is a query string).  If it is not found, we look in the soft map.  If it is in the soft map, we call the 'put' method.  The 'put' method is going to take the key out of the soft map and put it in the main cache map.  Let me paste the relevant parts of the put method (I'll leave it to the reader to view the method in its entirety):
> public Object put(Object key, Object value) {
>         writeLock();
>         try {
> ........
>             // otherwise, put the value into the map and clear it from the
>             // soft map
>             val = put(cacheMap, key, value);
>             if (val == null) {
>                 val = remove(softMap, key);
> ........
>         } finally {
>             writeUnlock();
>         }
> As you can see here, this 'put' method calls to another 'put' method which simply puts the key/value into the specified map.  If the map didn't already contain a value for the given key (which I think should always be the case for the cacheMap, or at least for my scenario), null will be returned.  When null is returned, remove is called to remove the key/value from the soft map.  
> Given this, let's look more closely at the 'get' method with concurrency in mind.  Take the case where two threads enter the 'get' method at the same time.  Next, assume both threads are moving in lock step and they both call 'cacheMap.get' with the same key, and assume the key is not in the main cache.  So at this time both threads will have a 'val' of null.  Next assume one thread gets focus (so the other thread suspends for a moment) and it calls the soft map with the key.  Assume the soft map contains the key.  In this case the thread will move (via the call to 'put') the key/value from the soft cache to the main cache.  This thread will then have a non-null 'val' and will return.  Next assume the suspended thread gets focus and continues.  Since its 'val' is null, it will ask the soft cache for the key.  However, we know that the other thread just moved the key/value from the soft to the main cache!  As such, the thread will not find the key in the soft cache and will return a 'val' of null!!!  This is at the heart of the issue!  That is, one thread thinks there is nothing in the cache for its key!
> The observant reader may notice that the 'get' and 'put' have a 'readLock' and 'writeLock', respectively, which attempts to get a ReentrantLock.  So one would think that there is proper read/write access of the caches to avoid such situations.  However, one must look at the read/writeLock methods.  If you look at the readLock method, you can see that it locks using a '_readLock' variable.  This variable is null for the CacheMap instance created by the PreparedQueryCacheImpl.  Therefore, the readLock method is a no-op.  This means that a reader thread can do a read at any time without any locking concerns.  
> Now that we know the heart of the issue (i.e. one thread received an incorrect cache miss), let's look at the implication of this incorrect cache miss.  What I haven't stated is where the threads are when they call 'get' on the cache.  This is easier to see when looking at the attached test.  Basically, though, one thread is in PreparedQueryCacheImpl.register, specifically this line:
> PreparedQuery cached = get(id);
> Here is the stack from my test which I will describe:
> PreparedQueryCacheImpl.get(String) line: 193	
> PreparedQueryCacheImpl.register(String, Query, FetchConfiguration) line: 92	
> QueryImpl<X>.preExecute(Map) line: 554	
> If you look at the 'register' method, you can see that if the PreparedQuery is not found in the cache (i.e. 'get(id)' returns null), one will be created and an attempt made to put it into the cache.  If you look at  PreparedQueryCacheImpl.cache, you can see a request is made a second time to see if the value is in the cache.  More than likely this time the value will be found in the cache, in which case true is returned.  If not, false is returned.  Either way, it doesn't matter if true or false is returned.  To see why, look at org.apache.openjpa.persistence.QueryImpl.preExecute.  This is the code which has called PreparedQueryCacheImpl.register (see above stack).  Again, this might be more clear when looking at the provided test in the debugger.  You can see in 'org.apache.openjpa.persistence.QueryImpl.preExecute' that if anything other than null is returned from 'register', the code that calls 
> PreparedQuery.reparametrize is skipped!  This is at the heart of the issue.  To understand why, you need to go back to JIRA OPENJPA-2609 and understand the part about PreparedSQLStoreQuery$PreparedSQLExecutor, or understand the fact that a PreparedSQLStoreQuery$PreparedSQLExecutor is the executor. That is, the threads involved here (again, easier to see in the provided test) are using a PreparedSQLStoreQuery$PreparedSQLExecutor.  Given this, the parameter must be reparametrized; in other words, the parameter needs to be converted to its parameter index (an int).  Because this is skipped, we see the CCEx when trying to convert a string parameter to an expected Integer.  That is, when you look at this line in org.apache.openjpa.kernel.QueryImpl.execute:
> StoreQuery.Executor ex = (isInMemory(operation))
> The next couple of lines below this are where the CCEx occurs, because the parameters have not been converted.  Note that when we are dealing with an 'ExpressionStoreQuery$DataStoreExecutor' (i.e. the very first time a query is executed - not cached - or caching is disabled), the reparameterization doesn't need to occur because an 'ExpressionStoreQuery$DataStoreExecutor' expects string parameters, rather than a parameter (int) index.  If you run with the debug module I provide (see below), I add debug output to print the 'executor', making it a little easier to see when each of these two executors is used.
> Now, that describes the CCEx, but I've said nothing about the NPE.  Well, if you followed the above, as I'll explain, the NPE is very easy to see.  Again, just like the CCEx stack, the NPE stack is in the org.apache.openjpa.persistence.QueryImpl.preExecute method.  Just as was the case above for the CCEx, this method goes to PreparedQueryCacheImpl.register.  This time though, the 'register' returns null, meaning the PQ is already in the cache.  A few lines after the 'register' call, we see these lines:
> PreparedQuery pq = _em.getPreparedQuery(_id);
>     if (pq.isInitialized()) {
> The code is going back to the cache for the actual PQ.  Using the same "cache miss" scenario as described above, a null PQ will be returned.....obviously when we call pq.isInitialized() on a null, an NPE will occur!
> While this JIRA describes one way to cause the CCEx, I feel there are other holes that could cause the CCEx, or other exceptions.  For example, if you look at 'org.apache.openjpa.persistence.QueryImpl.ignorePreparedQuery()', you can see that it goes to the cache and, if an entry is not found, it returns without doing the recompile.  I see a scenario where this is a problem.  That is, take a look once again at 'QueryImpl.preExecute'.  Before the call to 'PreparedQueryCacheImpl.register' there is a test for null parameters.  If there is a null parameter, the 'ignorePreparedQuery()' method is called.  If you step into this method you can see that a call is made to the cache.  If we go through the same "cache miss" scenario, we will see that an "ignore" is not performed (by "ignore" it appears the Query is recompiled, and in so doing the 'PreparedSQLStoreQuery$PreparedSQLExecutor' would be removed and an 'ExpressionStoreQuery$DataStoreExecutor' used instead).  When the "cache miss" occurs and the "ignore" is not performed, we can see that yet again the call to PreparedQuery.reparametrize is skipped, thus leading to the same CCEx.  I think we need to fix the case where a "cache miss" could occur in CacheMap, or look at all the cases where the cache is called and account for the possibility of a "cache miss".  The former is likely the best option.
> Finally, let me describe the test I've created and attached.  It is named PreparedQueryCacheTest.java and uses the provided entity (MyEntity.java).  This test can be run as a standalone JSE JUnit test.  To execute it, simply point your classpath to the OpenJPA snapshot jar of your choice.  You can use a simple persistence.xml file (no properties are necessary).  To make it much easier for the test to recreate the issue, you will need to edit the PreparedQueryCacheImpl.java class.  Therefore, this does assume the user can edit and recompile the OpenJPA code.  The user should change the CacheMap in PreparedQueryCacheImpl to:
> _delegate = new CacheMap(false,2);
> from:
> _delegate = new CacheMap();
> Again, doing this makes it far more likely the issue will occur and is a necessity when running the debug test.  If you look at the test, there is a test method to run when running in a debugger, and a test method to run to automatically recreate the exception.  Each test method is commented with enough text to explain how to recreate the exception.  Especially for the debug test, the comments in that method detail exactly where to add breakpoints and the flow necessary to recreate the issue in a debugger.  With my test you'll see a CCEx like this:
> Thread (Thread-32): Caught the following exception: <openjpa-2.2.3-SNAPSHOT-r422266:1737410 nonfatal user error> org.apache.openjpa.persistence.ArgumentException: Failed to execute query "select o from MyEntity o where o.comments = :mrn and o.id = 1". Check the query syntax for correctness. See nested exception for details.
>   With cause: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
> In some cases you may see the NPE reported in the beginning of the JIRA.  However, the CCEx occurs more often than the NPE.  
> In addition to the test, I've attached a debug module which has system println's to help demonstrate the issue.  The debug module is named PreparedQueryTest.debug. 
> Thanks,
> Heath Thomann



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)