Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2022/11/01 10:03:33 UTC

[GitHub] [solr] tboeghk opened a new pull request, #1155: [SOLR-16497] Allow finer grained locking in SolrCores

tboeghk opened a new pull request, #1155:
URL: https://github.com/apache/solr/pull/1155

   https://issues.apache.org/jira/browse/SOLR-16497
   
   Access to loaded `SolrCore` instances is a synchronized read and write operation in `SolrCores#getCoreFromAnyList`. Every HTTP request touches this method, since each request is assigned the `SolrCore` it operates on.
   
   ### Background
   
   Under heavy load we discovered that application halts inside Solr become a serious problem in high-traffic environments. Using Java Flight Recorder we found high accumulated application halt times on the `modifyLock` in `SolrCores`. In our case this meant we could only utilize our machines up to 25% CPU usage. With the fix applied, a utilization of up to 80% is perfectly doable.
   
   > In our case this specific locking problem was masked by another locking problem in the `SlowCompositeReaderWrapper`.
   > We'll submit our fix to the locking problem in the `SlowCompositeReaderWrapper` in a following issue.
   
   ### Description
   
   Our Solr instances utilize the `collapse` component heavily. The instances run with 32 CPU cores and 32 GB of Java heap on a rather small index (4 GB). The instances scale out at 50% CPU load. We take Java Flight Recorder snapshots of 60 seconds as soon as the CPU usage exceeds 50%.
   
   <img width="993" alt="solr-issues-solrcores-locking" src="https://user-images.githubusercontent.com/557264/196221505-c0d819ea-574a-41f5-80ae-1b0218225b32.png">
   
   During our 60-second Java Flight Recorder snapshot, the ~2k Jetty acceptor threads accumulated more than 12 hours of locking time inside `SolrCores` on the `modifyLock` instance, which is used as a synchronized lock (see screenshot). With this fix applied, locking is reduced to write accesses only. We validated this using another JFR snapshot:
   
   <img width="990" alt="solr-issues-solrcores-after" src="https://user-images.githubusercontent.com/557264/196221528-362e5d7f-022a-4aa8-9cd7-844f59a61102.png">
   
   We ran this code for a couple of weeks in our live environment, backported to Solr 8.11.2. This fix is built against the `main` branch.
   
   ### Solution
   
   The synchronized `modifyLock` is replaced by a [`ReentrantReadWriteLock`](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.html). This allows concurrent reads from the internal `SolrCore` instance list (`cores`) but grants exclusive access to write operations on the instance list (`cores`).
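
   The core idea can be sketched as follows (a minimal, self-contained illustration, not the actual `SolrCores` code; `CoreRegistry` is a made-up name):

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   // Illustrative sketch: a registry of cores where lookups take the shared
   // read lock (many readers in parallel) and mutations take the exclusive
   // write lock (blocks all readers and other writers).
   class CoreRegistry<V> {
     private final Map<String, V> cores = new LinkedHashMap<>();
     private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

     V get(String name) {
       lock.readLock().lock(); // shared: concurrent with other readers
       try {
         return cores.get(name);
       } finally {
         lock.readLock().unlock();
       }
     }

     void put(String name, V core) {
       lock.writeLock().lock(); // exclusive access to the instance list
       try {
         cores.put(name, core);
       } finally {
         lock.writeLock().unlock();
       }
     }
   }
   ```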
   
   In Solr 9.x, the cache inside `TransientSolrCoreCacheFactoryDefault` adds overflow handling to the size-based internal cache ([SOLR-15964](https://issues.apache.org/jira/browse/SOLR-15964)). As soon as `SolrCore`s are evicted from the internal cache, the cache behaviour changes from size-based to reference-based via the cache's eviction handler: `SolrCore` instances that are still referenced are inserted back into the cache. This means that write operations to the cache (inserting a `SolrCore`) can be issued during read operations in `SolrCores`. These operations hold only a read lock, which cannot be upgraded to a write lock (deadlock).
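
   The upgrade restriction can be demonstrated in isolation (a small stand-alone demo, unrelated to the Solr code):

   ```java
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   // A ReentrantReadWriteLock cannot be upgraded: a thread that holds the
   // read lock and asks for the write lock will never get it.
   public class LockUpgradeDemo {
     // Returns whether the write lock could be acquired while the same
     // thread still holds the read lock; this is always false.
     static boolean tryUpgrade() throws InterruptedException {
       ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
       lock.readLock().lock();
       try {
         // tryLock with a timeout instead of lock(), so the demo terminates
         // instead of deadlocking.
         return lock.writeLock().tryLock(100, TimeUnit.MILLISECONDS);
       } finally {
         lock.readLock().unlock();
       }
     }
   }
   ```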
   
   To overcome this, we moved the cache maintenance (including the eviction handler) in `TransientSolrCoreCacheFactoryDefault` to a separate thread. That thread can safely acquire the write lock, but running eviction there introduces a ping-pong in the eviction handler whenever the cache is full of still-referenced `SolrCore`s: a core is evicted, re-inserted because it is still referenced, and evicted again. To avoid this, we made the overflow behaviour explicit by adding an `overflowCores` instance, which holds cores evicted from the `transientCores` cache while they are still referenced.
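
   The overflow policy can be illustrated with stdlib types only (the actual implementation uses a Caffeine cache; the names and the `refCount` field here are made up):

   ```java
   import java.util.HashMap;
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Illustrative sketch of the overflow idea: when the size-bounded
   // transient map evicts a core that is still referenced, the core moves
   // to an unbounded overflow map instead of being closed.
   class TransientOverflowSketch {
     static class Core {
       int refCount;
       Core(int refCount) { this.refCount = refCount; }
     }

     final Map<String, Core> overflowCores = new HashMap<>();
     final Map<String, Core> transientCores;

     TransientOverflowSketch(int maxSize) {
       transientCores = new LinkedHashMap<String, Core>(16, 0.75f, true) {
         @Override
         protected boolean removeEldestEntry(Map.Entry<String, Core> eldest) {
           if (size() > maxSize) {
             if (eldest.getValue().refCount > 0) {
               // still referenced: keep it cached in the overflow map
               overflowCores.put(eldest.getKey(), eldest.getValue());
             } // else: the core would be queued for closing here
             return true; // evict from the size-bounded map
           }
           return false;
         }
       };
     }
   }
   ```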
   
   Furthermore, we need to ensure that only a single `transientSolrCoreCache` is created inside `TransientSolrCoreCacheFactoryDefault`. Since multiple reader threads are now allowed, we call `getTransientCacheHandler()` once from `load()` while holding the write lock; only this initial call (which creates the cache) needs the write lock. For all subsequent calls a read lock is sufficient, and `getTransientCacheHandler()` acquires one by default. When a write lock is needed (e.g. for core creation), `getTransientCacheHandlerInternal()` is called instead; it deliberately acquires no lock itself, leaving the caller free to choose between a read lock and a write lock. This guarantees that a single instance of `transientSolrCoreCache` is created.
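
   The initialization scheme described above can be sketched like this (illustrative names, not the actual Solr code):

   ```java
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   // Sketch: the cache is created exactly once under the write lock during
   // load(); afterwards the field is only ever read, so a read lock suffices.
   class CacheHolder {
     private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
     private Object cache; // stands in for transientSolrCoreCache

     // Called once at startup: creates the cache under the write lock.
     void load() {
       lock.writeLock().lock();
       try {
         getHandlerInternal();
       } finally {
         lock.writeLock().unlock();
       }
     }

     // Fast path: a read lock is enough because load() already ran.
     Object getHandler() {
       lock.readLock().lock();
       try {
         return getHandlerInternal();
       } finally {
         lock.readLock().unlock();
       }
     }

     // Deliberately lock-free so callers choose read or write lock.
     private Object getHandlerInternal() {
       if (cache == null) {
         cache = new Object();
       }
       return cache;
     }
   }
   ```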
   
   The lock signaling between `SolrCore` and `CoreContainer` is replaced by a [`Condition`](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Condition.html) tied to the write lock.
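
   A minimal sketch of this wait/notify replacement (illustrative names; note that only the write lock of a `ReentrantReadWriteLock` supports conditions, the read lock does not):

   ```java
   import java.util.ArrayDeque;
   import java.util.Queue;
   import java.util.concurrent.locks.Condition;
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   // Sketch: a queue of pending items guarded by the write lock, with a
   // Condition standing in for the old wait/notifyAll on modifyLock.
   class PendingCloses<T> {
     private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
     private final Condition notEmpty = lock.writeLock().newCondition();
     private final Queue<T> pending = new ArrayDeque<>();

     void enqueue(T item) {
       lock.writeLock().lock();
       try {
         pending.add(item);
         notEmpty.signalAll(); // replaces modifyLock.notifyAll()
       } finally {
         lock.writeLock().unlock();
       }
     }

     T awaitNext() throws InterruptedException {
       lock.writeLock().lock();
       try {
         while (pending.isEmpty()) {
           notEmpty.await(); // replaces modifyLock.wait()
         }
         return pending.poll();
       } finally {
         lock.writeLock().unlock();
       }
     }
   }
   ```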
   
   This change allows finer-grained access to the list of open `SolrCores`. The reduced blocking on read access shows up as noticeably lower blocking times in the Solr application (see screenshots).
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Reference Guide](https://github.com/apache/solr/tree/main/solr/solr-ref-guide).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


Re: [PR] [SOLR-16497] Allow finer grained locking in SolrCores [solr]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-2089311201

   This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the dev@solr.apache.org mailing list. Thank you for your contribution!




[GitHub] [solr] debe commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
debe commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1337880470

   Basically the changes around the Handler are needed to get the read-write locking correct. It's an all-or-nothing change: even the smallest mistake could lead to a deadlock or missing cores. We were very careful in analyzing the relevant code paths; otherwise you risk deadlocks or undefined behaviour, e.g. a thread switch while in a critical block. Luckily the test suite is very good and revealed some problems.




[GitHub] [solr] debe commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
debe commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1344569141

   > @debe thank you for taking a look!
   > 
   > First off, mockito is troublesome. we can take it out completely if there is a way to setup solr without it. I added it only for the transient cores part. same for the lookup misses, it was just an attempt to reproduce the results.
   > 
   > > I think you actually need to do something while holding the lock like it is in reality.
   > 
   > I don't completely follow there. what else happens inside the lock block from getCoreFromAnyList? ideally we would test the code as is, I don't see anything else inside that method. by introducing the consumer you are artificially increasing the time spent inside the lock, is this a better mirror of a running system?
   
   Ahhh, no, I think this is a misunderstanding. I exactly copied the work that is actually done. CoreContainer.getCore() sets incRefCount to true and triggers SolrCore.open(). If you set incRefCount to false, the call to core.open() never happens, and neither does anything inside open(), which is several MDC.put calls (I've counted them) and an AtomicInteger increment.
   > 
   > > I think one should limit the number of threads to the number of CPU cores, otherwise you're benchmarking context switching of OS threads.
   > 
   > I agree with this point, anything we can tweak on the benchmark settings to get it closer to real life is good for me.
   > 
   > > To get an overview which public methods from SolrCores and CoreContainer are actually called and how frequent, I've checked a recent flight recorder image from one of our production machines.
   > 
   > thank you for this data. I would like to use it as a reference for the benchmark, but ideally we would setup the core without any interventions (still not 100% sold on the extra consumer you added).
   
   




[GitHub] [solr] DennisBerger1984 commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
DennisBerger1984 commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1363701253

   @stillalex Hi Alex, before I go into the Christmas holidays (we're back in January), we'd like to comment on your recent results.
   We think synchronized is currently faster because some substantial work inside the synchronized block is missing compared to a real-world scenario: the cloudDescriptor is null, so the 3 `MDC.put()` calls in `MDCLoggingContext.setCoreDescriptor()` are not executed, and we've noticed that the core doesn't get closed anymore, although it should probably be closed inside the benchmark.
   
   First, we'd like to thank you again for bringing up the discussion about how to prove that certain code paths perform better or worse, and for increasing the understanding of locking in the Solr community. We think this was a great idea; the question behind it is basically "does this PR help at all to make things better". Nevertheless, we have difficulties drawing conclusions from the outcome of this benchmark. What it basically answers is whether a call to `getCoreFromAnyList()` without thread congestion and side effects performs better or worse using synchronized vs. ReentrantReadWriteLock. The answer is that synchronized performs similarly or better depending on the number of CPU cores, and that it uses far fewer resources. We suspect that the adaptive spinning of the lightweight locks used by synchronized in the JVM learns pretty quickly that the benchmark consists mostly of a synchronized block, that there is not much that can run in parallel anyway, and how to align the calls to `getCoreFromAnyList()` to get away with no heavyweight locking at all.
   
   The real problem is the behaviour of synchronized across an entire Solr instance when thread congestion, side effects like a big ThreadContext, other method calls grabbing the lock, and possibly IO are all happening. Then you can't guarantee that fast behaviour any more, and it is very difficult to construct a micro benchmark that covers such cases. What ReentrantReadWriteLock can guarantee is that it will never block read access, even when you have congestion and lots of side effects like slow MDC.put() calls, a big ThreadContext, and so on. That's its most valuable advantage. In the new year we can run a macro (e.g. HTTP) benchmark on our side again to verify a benefit and extract flight recorder dumps, plus we plan to measure timings at each synchronized entrance and exit. Maybe we can find the root cause of long waits for the lock. But then again, it's not that our use case will be much faster; it's that you can construct a Solr usage pattern which will severely suffer from synchronized blocks but which will never suffer using a ReentrantReadWriteLock.
   
   Sincerely,
   @mpetris (Marco) and @debe (Dennis)
   




[GitHub] [solr] stillalex commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
stillalex commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1334309402

   I have attempted to write a benchmark for this change to verify some of the perf gains. Based on a limited understanding of the code, the PR looks reasonable to me, but the results I got are a bit different than expected. I am assuming the benchmark itself does not correctly mirror the setup used to perform this analysis.
   
   I am maintaining 2 branches, [one with only the benchmark](https://github.com/apache/solr/compare/main...stillalex:SOLR-16497-locks-bench?expand=1) and [one with this PR + same benchmark](https://github.com/apache/solr/compare/main...stillalex:SOLR-16497-locks-bench-and-patch?expand=1).
   
   * Results on main branch with no changes
   
   ```
   Iteration   1: 9129555.788 ops/s
   Iteration   2: 9203237.026 ops/s
   Iteration   3: 9232071.659 ops/s
   Iteration   4: 9187238.556 ops/s
   
   Result "org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList":
     9188025.757 ±(99.9%) 278959.948 ops/s [Average]
     (min, avg, max) = (9129555.788, 9188025.757, 9232071.659), stdev = 43169.361
     CI (99.9%): [8909065.809, 9466985.706] (assumes normal distribution)
   
   Benchmark                               Mode  Cnt        Score        Error  Units
   SolrCoresBenchTest.getCoreFromAnyList  thrpt    4  9188025.757 ± 278959.948  ops/s
   ```
   
   * Results on patch
   
   ```
   Iteration   1: 1488901.065 ops/s
   Iteration   2: 1503247.470 ops/s
   Iteration   3: 1531084.510 ops/s
   Iteration   4: 1464588.927 ops/s
   
   Result "org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList":
     1496955.493 ±(99.9%) 179578.485 ops/s [Average]
     (min, avg, max) = (1464588.927, 1496955.493, 1531084.510), stdev = 27789.969
     CI (99.9%): [1317377.008, 1676533.978] (assumes normal distribution)
   
   Benchmark                               Mode  Cnt        Score        Error  Units
   SolrCoresBenchTest.getCoreFromAnyList  thrpt    4  1496955.493 ± 179578.485  ops/s
   ```
   
   We effectively drop from 9M ops/s to 1.5M ops/s.
   The main observation here is that the benchmark does not mirror the PR setup. This is obvious from the fact that there are no transient cores (as far as I understand, loading and unloading cores will increase the locking time under heavy load).
   
   I would like to understand how common a setup using transient cores is; the Solr Guide mentions that they are not recommended under SolrCloud (see the transient property on the [Core Discovery](https://solr.apache.org/guide/solr/latest/configuration-guide/core-discovery.html) page).
   Also if there is interest I can spend more time tweaking this benchmark, I think having some way to verify/reproduce any perf change is very beneficial.
   
   




[GitHub] [solr] bruno-roustant commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1048676693


##########
solr/core/src/java/org/apache/solr/core/TransientSolrCoreCacheDefault.java:
##########
@@ -47,6 +49,14 @@ public class TransientSolrCoreCacheDefault extends TransientSolrCoreCache {
    */
   protected final Cache<String, SolrCore> transientCores;
 
+  /**
+   * This explicitly models capacity overflow of transientCores with cores that are still in use. A
+   * cores cache which holds cores evicted from transientCores because of limited size but which
+   * shall remain cached because they are still referenced or because they are part of an ongoing
+   * operation (pending ops).
+   */
+  protected final Cache<String, SolrCore> overflowCores;

Review Comment:
   Ok. So I propose that this PR keeps the overflowCores, but changes its type to a Map<String, SolrCore> (unlimited size) instead of a Cache.
   This way we don't block the performance improvement of the read and write locks. On the other hand we temporarily lose the transient core cache size limit, but I volunteer to work on a separate Jira issue to redesign the transient core cache. What is your opinion @dsmiley?





[GitHub] [solr] dsmiley commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
dsmiley commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1352467817

   If this sounds good, I volunteer to do the refactoring and then you could re-do your PR on the result of that.  The impact of the refactoring on transient cores might be back-compat as well, but I think it's too little used to worry much.




[GitHub] [solr] debe commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
debe commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1344368444

   Hi @stillalex,
   I've changed your proposed benchmark a little bit. See [main](https://github.com/apache/solr/compare/main...tboeghk:solr:main-bench?expand=1) and [PR1155](https://github.com/tboeghk/solr/compare/main...tboeghk:solr:issues/solrcores-finer-locking-bench?expand=1).
   
   Main points are:
   
   1. I think you actually need to do something while holding the lock, like in reality. So I've tried to mimic the behaviour of open() and close(). I set the parameter incRefCount to true in order to do some work, and I set doAnswer on the SolrCore mock, but it didn't work as expected: you basically can't call a method intercepted by Mockito inside your benchmarking function, because the method interceptors and stacktrace objects it allocates all want to be GCed and ruin the measurement. I avoided that by introducing a consumer that does some work while the lock is held:
   ```java 
   SolrCore getCoreFromAnyList(String name, boolean incRefCount, UUID coreId, Consumer<SolrCore> callback) {
       READ_WRITE_LOCK.readLock().lock();
       try {
         SolrCore core = getCoreFromAnyList(name, incRefCount, coreId);
         callback.accept(core);
         return core;
       } finally {
         READ_WRITE_LOCK.readLock().unlock();
       }
     }
   ```
   2. I think one should limit the number of threads to the number of CPU cores, otherwise you're benchmarking context switching of OS threads.
   
   To get an overview which public methods from SolrCores and CoreContainer are actually called and how frequent, I've checked a recent flight recorder image from one of our production machines. I was wondering if benchmarking getCoreFromAnyList() is sufficient to model the behaviour of our system. This is a 20 Minute frame without creation of a new Core. Here the distribution is in relation to the number of request calls.
     1. Requests 100%
     2. CoreContainer.getCore(String) 57% <--- this calls solrCores.getCoreFromAnyList
     3. CoreContainer.getRequestHandler(String) 36% <--- doesn't lock
     4. SolrCores.getCoreDescriptor(String) 12% <--- locks
     5. CoreContainer.getCore(String, UUID) 9% <--- this calls solrCores.getCoreFromAnyList
   
   Here are my benchmark result runs:
   
   The synchronized one:
   ```
   $ cat results-synchronized.txt 
   running JMH with args: -prof jfr:dir=profile-results;configName=jfr-profile.jfc SolrCoresBenchTest
   # Detecting actual CPU count: 12 detected
   # JMH version: 1.32
   # VM version: JDK 17.0.2, OpenJDK 64-Bit Server VM, 17.0.2+8-jvmci-22.0-b05
   # VM invoker: /home/db/.sdkman/candidates/java/22.0.0.2.r17-grl/bin/java
   # VM options: -Djmh.shutdownTimeout=5 -Djmh.shutdownTimeout.step=3 -Djava.security.egd=file:/dev/./urandom -XX:-UseBiasedLocking -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints --add-opens=java.base/java.lang.reflect=ALL-UNNAMED -Dlog4jConfigurationFile=./log4j2-bench.xml -Dlog4j2.is.webapp=false -Dlog4j2.garbagefreeThreadContextMap=true -Dlog4j2.enableDirectEncoders=true -Dlog4j2.enable.threadlocals=true -XX:+UseG1GC -XX:+ParallelRefProcEnabled
   # Blackhole mode: full + dont-inline hint
   # Warmup: 3 iterations, 10 s each
   # Measurement: 4 iterations, 20 s each
   # Timeout: 300 s per iteration
   # Threads: 12 threads, will synchronize iterations
   # Benchmark mode: Throughput, ops/time
   # Benchmark: org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList
   
   # Run progress: 0,00% complete, ETA 00:01:50
   # Fork: 1 of 1
   # Preparing profilers: JavaFlightRecorderProfiler 
   OpenJDK 64-Bit Server VM warning: Option UseBiasedLocking was deprecated in version 15.0 and will likely be removed in a future release.
   # Warmup Iteration   1: --> benchmark random seed: 6624420638116043983
   1835172,373 ops/s
   # Warmup Iteration   2: 1821486,883 ops/s
   # Warmup Iteration   3: 1835960,182 ops/s
   Iteration   1: 1650271,601 ops/s
   Iteration   2: 1662880,927 ops/s
   Iteration   3: 1661900,359 ops/s
   Iteration   4: 1661879,729 ops/s
                    ·jfr: (text only)
   
   # Processing profiler results: JavaFlightRecorderProfiler 
   
   
   Result "org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList":
     1659233,154 ±(99.9%) 38724,163 ops/s [Average]
     (min, avg, max) = (1650271,601, 1659233,154, 1662880,927), stdev = 5992,607
     CI (99.9%): [1620508,991, 1697957,317] (assumes normal distribution)
   
   Secondary result "org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList:·jfr":
   JFR profiler results:
     profile-results/org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList-Throughput/profile.jfr
   
   ```
   
   The rwlock one:
   ```
   $ cat results-rwlock.txt 
   running JMH with args: -prof jfr:dir=profile-results;configName=jfr-profile.jfc SolrCoresBenchTest
   # Detecting actual CPU count: 12 detected
   # JMH version: 1.32
   # VM version: JDK 17.0.2, OpenJDK 64-Bit Server VM, 17.0.2+8-jvmci-22.0-b05
   # VM invoker: /home/db/.sdkman/candidates/java/22.0.0.2.r17-grl/bin/java
   # VM options: -Djmh.shutdownTimeout=5 -Djmh.shutdownTimeout.step=3 -Djava.security.egd=file:/dev/./urandom -XX:-UseBiasedLocking -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints --add-opens=java.base/java.lang.reflect=ALL-UNNAMED -Dlog4jConfigurationFile=./log4j2-bench.xml -Dlog4j2.is.webapp=false -Dlog4j2.garbagefreeThreadContextMap=true -Dlog4j2.enableDirectEncoders=true -Dlog4j2.enable.threadlocals=true -XX:+UseG1GC -XX:+ParallelRefProcEnabled
   # Blackhole mode: full + dont-inline hint
   # Warmup: 3 iterations, 10 s each
   # Measurement: 4 iterations, 20 s each
   # Timeout: 300 s per iteration
   # Threads: 12 threads, will synchronize iterations
   # Benchmark mode: Throughput, ops/time
   # Benchmark: org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList
   
   # Run progress: 0,00% complete, ETA 00:01:50
   # Fork: 1 of 1
   # Preparing profilers: JavaFlightRecorderProfiler 
   OpenJDK 64-Bit Server VM warning: Option UseBiasedLocking was deprecated in version 15.0 and will likely be removed in a future release.
   # Warmup Iteration   1: --> benchmark random seed: 6624420638116043983
   2415802,757 ops/s
   # Warmup Iteration   2: 2426055,238 ops/s
   # Warmup Iteration   3: 2417938,908 ops/s
   Iteration   1: 2342669,342 ops/s
   Iteration   2: 2360751,931 ops/s
   Iteration   3: 2343344,162 ops/s
   Iteration   4: 2363582,478 ops/s
                    ·jfr: (text only)
   
   # Processing profiler results: JavaFlightRecorderProfiler 
   
   
   Result "org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList":
     2352586,978 ±(99.9%) 71895,415 ops/s [Average]
     (min, avg, max) = (2342669,342, 2352586,978, 2363582,478), stdev = 11125,895
     CI (99.9%): [2280691,563, 2424482,393] (assumes normal distribution)
   
   Secondary result "org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList:·jfr":
   JFR profiler results:
     profile-results/org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList-Throughput/profile.jfr
   
   ```




[GitHub] [solr] debe commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
debe commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1336824353

   Just as a side note, we are not using transient cores at all. It's just about the lock contention when working with normal cores. Many thanks for providing the benchmark, I'll try to review, run and play with it this week. At first glance it doesn't make that much sense that it's 9 times slower, as the repeated function is more or less the same besides the lock. :-)




[GitHub] [solr] ben-manes commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
ben-manes commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1044256199


##########
solr/core/src/java/org/apache/solr/core/TransientSolrCoreCacheDefault.java:
##########
@@ -70,10 +80,15 @@ public TransientSolrCoreCacheDefault(CoreContainer coreContainer) {
     Caffeine<String, SolrCore> transientCoresCacheBuilder =
         Caffeine.newBuilder()
             .initialCapacity(initialCapacity)
-            // Use the current thread to queue evicted cores for closing. This ensures the
-            // cache max size is respected (with a different thread the max size would be
-            // respected asynchronously only eventually).
-            .executor(Runnable::run)
+            // Do NOT use the current thread to queue evicted cores for closing. Although this
+            // ensures the cache max size is respected only eventually, it should be a very
+            // short period of time until the cache maintenance task kicks in.
+            // The onEvict method needs the writeLock from SolrCores to operate safely.
+            // However, the current thread most probably has acquired a read-lock already
+            // somewhere up the call stack and would deadlock.
+            // Note that Caffeine cache has an internal maintenance task rescheduling logic which
+            // explicitly checks on the common pool.
+            .executor(ForkJoinPool.commonPool())

Review Comment:
   It’s too bad the API is not like a managed resource, such as a stream or JDBC connection. Then you could return a proxy that pins the core when in use and unpins it when released. The proxy doing the callbacks, like a JDBC pool or transaction manager, keeps the pinning hidden from the application logic, which never has to handle it explicitly.
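
   For illustration, such a handle could look like this (a hypothetical API sketch, not part of Solr):

   ```java
   import java.util.concurrent.atomic.AtomicInteger;

   // Sketch of the managed-resource idea: a handle pins the core while in
   // use and unpins it when released, try-with-resources style, so the
   // cache never evicts a core a request is still using.
   class PinnedCore implements AutoCloseable {
     private final AtomicInteger pinCount;

     PinnedCore(AtomicInteger pinCount) {
       this.pinCount = pinCount;
       pinCount.incrementAndGet(); // pin on acquisition
     }

     @Override
     public void close() {
       pinCount.decrementAndGet(); // unpin on release
     }
   }
   ```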



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] mpetris commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
mpetris commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1044515521


##########
solr/core/src/java/org/apache/solr/core/SolrCores.java:
##########
@@ -555,23 +658,41 @@ public boolean isCoreLoading(String name) {
   }
 
   public void queueCoreToClose(SolrCore coreToClose) {
-    synchronized (modifyLock) {
+    READ_WRITE_LOCK.writeLock().lock();
+    try {
       pendingCloses.add(coreToClose); // Essentially just queue this core up for closing.
-      modifyLock.notifyAll(); // Wakes up closer thread too
+      WRITE_LOCK_CONDITION.signalAll(); // Wakes up closer thread too
+    } finally {
+      READ_WRITE_LOCK.writeLock().unlock();
     }
   }
 
   /**
    * @return the cache holding the transient cores; never null.
    */
   public TransientSolrCoreCache getTransientCacheHandler() {
-    synchronized (modifyLock) {
-      if (transientSolrCoreCacheFactory == null) {
-        throw new SolrException(
-            SolrException.ErrorCode.SERVER_ERROR,
-            getClass().getName() + " not loaded; call load() before using it");
-      }
-      return transientSolrCoreCacheFactory.getTransientSolrCoreCache();
+    READ_WRITE_LOCK.writeLock().lock();

Review Comment:
   Hi @bruno-roustant, that's right. A read lock is sufficient here because we bypassed the lazy initialization of the `transientSolrCoreCache` by calling `getInternalTransientCacheHandler` right after the creation of the `transientSolrCoreCacheFactory` in `SolrCores::load`. We will change that and document accordingly.
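The eager-initialization pattern described here can be sketched as follows; class and field names are illustrative, not the actual Solr code. Because the handler is published once under the write lock in `load()`, the getter only reads an already-initialized field, so a shared read lock suffices.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch: eager creation under the write lock in load(),
// read-only access under the read lock afterwards.
class EagerInitSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private Object transientCacheHandler; // published under the write lock in load()

  void load() {
    lock.writeLock().lock();
    try {
      transientCacheHandler = new Object(); // eager creation, no lazy init later
    } finally {
      lock.writeLock().unlock();
    }
  }

  Object getTransientCacheHandler() {
    lock.readLock().lock();
    try {
      if (transientCacheHandler == null) {
        throw new IllegalStateException("not loaded; call load() first");
      }
      return transientCacheHandler; // read-only access, shared lock suffices
    } finally {
      lock.readLock().unlock();
    }
  }
}
```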





[GitHub] [solr] debe commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
debe commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1352876178

   Should we change the overflow cache to a Map in the meantime, to get this PR going? What do you think?
   
   cheers, -dennis




[GitHub] [solr] stillalex commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
stillalex commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1344809731

   > Ahhh, no I think this is a misunderstanding. I exactly copied the work that is actually done. CoreContainer.getCore() sets incRefCount to true and triggers SolrCore.open(). If you set incRefCount to false, the call to core.open() never happens, and therefore neither does anything inside open(), which is just MDC.put calls (I've counted them) and an AtomicInteger increment.
   
   @debe I see what you mean now. I think you are right: the things you added are equivalent to what is happening inside the lock. One would have to wonder why they are inside the lock in the first place (`refCount.incrementAndGet();` does not strike me as needing synchronization).
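The point about `refCount.incrementAndGet()` is easy to verify in isolation: `AtomicInteger` performs a lock-free atomic read-modify-write, so concurrent increments lose no updates without any surrounding `synchronized` block. This is a standalone demonstration, not Solr code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Standalone demonstration that AtomicInteger.incrementAndGet() needs no
// external lock: increments from many threads are never lost.
class RefCountDemo {
  static int countWithThreads(int numThreads, int perThread) {
    AtomicInteger refCount = new AtomicInteger();
    Thread[] threads = new Thread[numThreads];
    for (int i = 0; i < numThreads; i++) {
      threads[i] = new Thread(() -> {
        for (int j = 0; j < perThread; j++) {
          refCount.incrementAndGet(); // atomic read-modify-write, no lock
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return refCount.get();
  }
}
```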
   
   




[GitHub] [solr] stillalex commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
stillalex commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1344548353

   @debe  thank you for taking a look!
   
   First off, Mockito is troublesome. We can take it out completely if there is a way to set up Solr without it. I added it only for the transient cores part. Same for the lookup misses; it was just an attempt to reproduce the results.
   
   > I think you actually need to do something while holding the lock like it is in reality.
   
   I don't completely follow there. What else happens inside the lock block in getCoreFromAnyList? Ideally we would test the code as is; I don't see anything else inside that method. By introducing the consumer you are artificially increasing the time spent inside the lock; is this a better mirror of a running system?
   
   > I think one should limit the number of threads to the number of CPU cores, otherwise you're benchmarking context switching of OS threads.
   
   I agree with this point; anything we can tweak in the benchmark settings to get it closer to real life is good for me.
   
   > To get an overview which public methods from SolrCores and CoreContainer are actually called and how frequent, I've checked a recent flight recorder image from one of our production machines.
   
   Thank you for this data. I would like to use it as a reference for the benchmark, but ideally we would set up the core without any interventions (still not 100% sold on the extra consumer you added).
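Independent of the benchmark settings, the behavioral difference the benchmark is after can be shown deterministically: two threads can hold a `ReentrantReadWriteLock` read lock at the same time, whereas a `synchronized (modifyLock)` block admits one thread at a time. The sketch below is standalone illustration, not the benchmark code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Two reader threads hold the read lock simultaneously. With a synchronized
// block instead, the first thread would sit at await() while holding the
// monitor and the second could never enter, so this pattern would hang.
class ReadLockDemo {
  static boolean twoConcurrentReaders() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    CountDownLatch bothInside = new CountDownLatch(2);
    Runnable reader = () -> {
      lock.readLock().lock();
      try {
        bothInside.countDown();
        // Returns only once both threads are inside the read lock at once.
        bothInside.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        lock.readLock().unlock();
      }
    };
    Thread t1 = new Thread(reader);
    Thread t2 = new Thread(reader);
    t1.start();
    t2.start();
    try {
      t1.join();
      t2.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return bothInside.getCount() == 0;
  }
}
```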




[GitHub] [solr] dsmiley commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
dsmiley commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1336723354

   Transient cores are not supported in SolrCloud.  I know this because I have a hack to make them sort-of work in SolrCloud at work.   I've been meaning to review your PR finally; I keep telling myself that, and hopefully it'll come true now-ish :-)




[GitHub] [solr] ben-manes commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
ben-manes commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1043745746


##########
solr/core/src/java/org/apache/solr/core/TransientSolrCoreCacheDefault.java:
##########
@@ -70,10 +80,15 @@ public TransientSolrCoreCacheDefault(CoreContainer coreContainer) {
     Caffeine<String, SolrCore> transientCoresCacheBuilder =
         Caffeine.newBuilder()
             .initialCapacity(initialCapacity)
-            // Use the current thread to queue evicted cores for closing. This ensures the
-            // cache max size is respected (with a different thread the max size would be
-            // respected asynchronously only eventually).
-            .executor(Runnable::run)
+            // Do NOT use the current thread to queue evicted cores for closing. Although this
+            // ensures the cache max size is respected only eventually, it should be a very
+            // short period of time until the cache maintenance task kicks in.
+            // The onEvict method needs the writeLock from SolrCores to operate safely.
+            // However, the current thread most probably has acquired a read-lock already
+            // somewhere up the call stack and would deadlock.
+            // Note that Caffeine cache has an internal maintenance task rescheduling logic which
+            // explicitly checks on the common pool.
+            .executor(ForkJoinPool.commonPool())

Review Comment:
   I am still reluctant to introduce an eviction approver as I believe it is an easy but wrong answer. It is perfectly pragmatic for a private api where the caveats are known, but for a library it is a poor api with footguns that will have a large blast radius.
   
   I would use a secondary cache with weak-referenced values for automatic cleanup. That looks similar to the overflow cache here, except that by using reference counting it is discarded by the GC. This way there are no memory leaks, and the same instance is retrievable if any other thread holds a strong reference to it, to avoid duplications due to races. If simpler, the second cache can be inclusive, since it is based on what is GC-reachable rather than on a size/time bound, so juggling the exclusivity might be useless logic that does not save any memory. By using Map computations the interactions between caches can be atomic, e.g. to load from the weak cache if absent from the primary one.
   
   A caveat is that a weak value means that a listener on that cache would see a null value (as GC'd), so it may not be immediately compatible with `SolrCores.queueCoreToClose(core)`. It may be that refactoring to coordinate the close based on reachability could solve that and remove the complexity that `SolrCores` must go through to safely shut down an instance.
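A rough illustration of the weak-value idea, using plain `WeakReference`s instead of Caffeine's `weakValues()` so it stands alone; names are illustrative, not Solr APIs. An entry stays retrievable only while something else (e.g. a ref-counted in-use core) holds a strong reference to the value.

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a weak-valued overflow map: values the GC has cleared are
// purged lazily on lookup, so nothing leaks and no explicit eviction
// policy is needed.
class WeakOverflow<K, V> {
  private final ConcurrentHashMap<K, WeakReference<V>> map = new ConcurrentHashMap<>();

  void put(K key, V value) {
    map.put(key, new WeakReference<>(value));
  }

  V get(K key) {
    WeakReference<V> ref = map.get(key);
    if (ref == null) {
      return null;
    }
    V value = ref.get();
    if (value == null) {
      map.remove(key); // the GC cleared it; drop the stale entry
    }
    return value;
  }
}
```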





[GitHub] [solr] mpetris commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
mpetris commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1352824165

   Refactoring SolrCores with transient-cores support and SolrCores without it into separate implementations is really a good idea. +1
   We would redo our PR on the result of that. It would be great if we could still make it in time for 9.2.




[GitHub] [solr] bruno-roustant commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1352812182

   +1 to extract the support of transient cores as a subclass of SolrCores.




[GitHub] [solr] bruno-roustant commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1044250179


##########
solr/core/src/java/org/apache/solr/core/TransientSolrCoreCacheDefault.java:
##########
@@ -47,6 +49,14 @@ public class TransientSolrCoreCacheDefault extends TransientSolrCoreCache {
    */
   protected final Cache<String, SolrCore> transientCores;
 
+  /**
+   * This explicitly models capacity overflow of transientCores with cores that are still in use. A
+   * cores cache which holds cores evicted from transientCores because of limited size but which
+   * shall remain cached because they are still referenced or because they are part of an ongoing
+   * operation (pending ops).
+   */
+  protected final Cache<String, SolrCore> overflowCores;

Review Comment:
   I don't want to block the PR, whose primary goal is to separate read and write locks, with questions about how to manage the transient core cache.
   I propose that this PR does not use an overflowCores structure, but simply puts cores back into the transientCores cache, as the current code does. I expect the ping-pong behavior mentioned above to be just transient, so not a problem?





[GitHub] [solr] bruno-roustant commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1044244530


##########
solr/core/src/java/org/apache/solr/core/TransientSolrCoreCacheDefault.java:
##########
@@ -70,10 +80,15 @@ public TransientSolrCoreCacheDefault(CoreContainer coreContainer) {
     Caffeine<String, SolrCore> transientCoresCacheBuilder =
         Caffeine.newBuilder()
             .initialCapacity(initialCapacity)
-            // Use the current thread to queue evicted cores for closing. This ensures the
-            // cache max size is respected (with a different thread the max size would be
-            // respected asynchronously only eventually).
-            .executor(Runnable::run)
+            // Do NOT use the current thread to queue evicted cores for closing. Although this
+            // ensures the cache max size is respected only eventually, it should be a very
+            // short period of time until the cache maintenance task kicks in.
+            // The onEvict method needs the writeLock from SolrCores to operate safely.
+            // However, the current thread most probably has acquired a read-lock already
+            // somewhere up the call stack and would deadlock.
+            // Note that Caffeine cache has an internal maintenance task rescheduling logic which
+            // explicitly checks on the common pool.
+            .executor(ForkJoinPool.commonPool())

Review Comment:
   Here SolrCore instances are not usual objects held in memory: they can still be in use, they are ref-counted, and there can be pending operations on them. We have to close them carefully, not let them be GCed as weak references.
   It becomes clearer to me that this transient core cache cannot have a strict limit, because it must manage the cores until they are not used anymore and unloaded. It's more of a target size that the cache should aim for when there is no heavy activity on the cores.
   If we go with an overflow structure, it is probably not a cache but an unlimited-size map, with a maintenance thread to monitor the cores and close them when they are not used anymore.
   Or we keep a single cache from which we do not remove cores while they are still in use (so not Caffeine). But we would still need a maintenance thread to reduce the cache size down to the target size when possible.
   I'll take some time to reflect on this and probably create a Jira issue on this subject.
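The "target size plus maintenance pass" idea could look roughly like this; all names are illustrative, and a real version would need `SolrCore`'s actual ref-count and close semantics. The map is unbounded, and a periodic maintenance pass closes idle cores until the map is back at the target size, never touching in-use cores.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of an unbounded core map with a target size enforced by a
// maintenance pass rather than a strict cache limit.
class TargetSizeCoreMap {
  static final class Core {
    final AtomicInteger refCount = new AtomicInteger();
    volatile boolean closed;

    boolean inUse() {
      return refCount.get() > 0;
    }
  }

  private final Map<String, Core> cores = new ConcurrentHashMap<>();
  private final int targetSize;

  TargetSizeCoreMap(int targetSize) {
    this.targetSize = targetSize;
  }

  void put(String name, Core core) {
    cores.put(name, core);
  }

  int size() {
    return cores.size();
  }

  // One maintenance pass: close and remove idle cores while above target.
  void maintain() {
    Iterator<Map.Entry<String, Core>> it = cores.entrySet().iterator();
    while (cores.size() > targetSize && it.hasNext()) {
      Map.Entry<String, Core> entry = it.next();
      if (!entry.getValue().inUse()) {
        entry.getValue().closed = true; // stands in for core.close()
        it.remove();
      }
    }
  }
}
```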





[GitHub] [solr] bruno-roustant commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1044259391


##########
solr/core/src/java/org/apache/solr/core/TransientSolrCoreCacheDefault.java:
##########
@@ -70,10 +80,15 @@ public TransientSolrCoreCacheDefault(CoreContainer coreContainer) {
     Caffeine<String, SolrCore> transientCoresCacheBuilder =
         Caffeine.newBuilder()
             .initialCapacity(initialCapacity)
-            // Use the current thread to queue evicted cores for closing. This ensures the
-            // cache max size is respected (with a different thread the max size would be
-            // respected asynchronously only eventually).
-            .executor(Runnable::run)
+            // Do NOT use the current thread to queue evicted cores for closing. Although this
+            // ensures the cache max size is respected only eventually, it should be a very
+            // short period of time until the cache maintenance task kicks in.
+            // The onEvict method needs the writeLock from SolrCores to operate safely.
+            // However, the current thread most probably has acquired a read-lock already
+            // somewhere up the call stack and would deadlock.
+            // Note that Caffeine cache has an internal maintenance task rescheduling logic which
+            // explicitly checks on the common pool.
+            .executor(ForkJoinPool.commonPool())

Review Comment:
   Reviewing the API to be able to pin is indeed another possible approach.
   When we pin a Caffeine cache entry (with weight 0), what operations are involved internally? Does it reorder some kind of structure (priority queue) for eviction? Or is it very cheap?





[GitHub] [solr] mpetris commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
mpetris commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1044557278


##########
solr/core/src/java/org/apache/solr/core/SolrCores.java:
##########
@@ -555,23 +658,41 @@ public boolean isCoreLoading(String name) {
   }
 
   public void queueCoreToClose(SolrCore coreToClose) {
-    synchronized (modifyLock) {
+    READ_WRITE_LOCK.writeLock().lock();
+    try {
       pendingCloses.add(coreToClose); // Essentially just queue this core up for closing.
-      modifyLock.notifyAll(); // Wakes up closer thread too
+      WRITE_LOCK_CONDITION.signalAll(); // Wakes up closer thread too
+    } finally {
+      READ_WRITE_LOCK.writeLock().unlock();
     }
   }
 
   /**
    * @return the cache holding the transient cores; never null.
    */
   public TransientSolrCoreCache getTransientCacheHandler() {
-    synchronized (modifyLock) {
-      if (transientSolrCoreCacheFactory == null) {
-        throw new SolrException(
-            SolrException.ErrorCode.SERVER_ERROR,
-            getClass().getName() + " not loaded; call load() before using it");
-      }
-      return transientSolrCoreCacheFactory.getTransientSolrCoreCache();
+    READ_WRITE_LOCK.writeLock().lock();
+    try {
+      return getInternalTransientCacheHandler();
+    } finally {
+      READ_WRITE_LOCK.writeLock().unlock();
+    }
+  }
+
+  /**
+   * Explicitly does not use a lock to provide the flexibility to choose between a read-lock and a
+   * write-lock before calling this method. Choosing a lock and locking before calling this method
+   * is mandatory. Note: Using always a write-lock would be costly. Using always a read-lock would
+   * block all calls which already hold a write-lock.
+   *
+   * @return the cache holding the transient cores; never null.
+   */
+  private TransientSolrCoreCache getInternalTransientCacheHandler() {

Review Comment:
   OK, but getTransientCacheWithLock as a name for the private `getInternalTransientCacheHandler` sounds to me as if the method acquired the lock on its own. How about getTransientCacheHandlerNonLocked()?





[GitHub] [solr] dsmiley commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
dsmiley commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1352464498

   I wish transient cores were better isolated to users that opt in to use them.  I'm thinking "SolrCores" could be pluggable, with a subclass that supports transient cores; the default (SolrCores itself) would not.  Then we could do something simple for the default implementation and delay deciding how to handle transient cores.  For example, this read/write lock stuff could go in the default/base impl while we leave the transient-cache implementation alone (without this optimization, at least for now).  This doesn't seem hard and lets us punt on the ramifications of transient-core complexity.  It makes the default case easier to understand (no transient-core code interaction) and, with the PR here, removes the bottleneck you saw.
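The proposed split might look roughly like this, with illustrative class shapes rather than the actual Solr classes: a base implementation with no transient-core interaction (where the simpler locking applies), and an opt-in subclass layering transient-core support on top.

```java
// Sketch of a pluggable SolrCores: the base class has no transient-core
// code path; only the opt-in subclass consults a transient cache.
class BaseSolrCores {
  boolean supportsTransientCores() {
    return false; // default: no transient cache to consult on any code path
  }

  Object getTransientCacheHandler() {
    throw new UnsupportedOperationException("transient cores not supported");
  }
}

class TransientCapableSolrCores extends BaseSolrCores {
  private final Object transientCacheHandler = new Object(); // stand-in for the real cache

  @Override
  boolean supportsTransientCores() {
    return true;
  }

  @Override
  Object getTransientCacheHandler() {
    return transientCacheHandler;
  }
}
```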




[GitHub] [solr] bruno-roustant commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1043152924


##########
solr/core/src/java/org/apache/solr/core/TransientSolrCoreCacheDefault.java:
##########
@@ -47,6 +49,14 @@ public class TransientSolrCoreCacheDefault extends TransientSolrCoreCache {
    */
   protected final Cache<String, SolrCore> transientCores;
 
+  /**
+   * This explicitly models capacity overflow of transientCores with cores that are still in use. A
+   * cores cache which holds cores evicted from transientCores because of limited size but which
+   * shall remain cached because they are still referenced or because they are part of an ongoing
+   * operation (pending ops).
+   */
+  protected final Cache<String, SolrCore> overflowCores;

Review Comment:
   As I understand, the cores evicted from the cache but still in use are put in this overflowCores. There is a hook called when the core is closed, but there is no action enqueued to close the core when possible, so the core will remain open until it is closed explicitly. This overflowCores cache can grow, but no action is taken when it reaches its size limit, and the cores evicted from overflowCores are not counted anymore.
   
   The PR description says _"a separate thread will schedule a ping-pong behaviour in the eviction handler on a full cache with SolrCores still referenced"_. Could you be more specific?
   Did you see this ping pong behavior, with a core being evicted, then put back, then evicted again, and so on?
   
   I believe we should not have this additional overflowCores, which brings complexity and fragility. We should either find a solution to work with the Caffeine cache transientCores (see my question below), or we should use another cache.



##########
solr/core/src/java/org/apache/solr/core/SolrCores.java:
##########
@@ -555,23 +658,41 @@ public boolean isCoreLoading(String name) {
   }
 
   public void queueCoreToClose(SolrCore coreToClose) {
-    synchronized (modifyLock) {
+    READ_WRITE_LOCK.writeLock().lock();
+    try {
       pendingCloses.add(coreToClose); // Essentially just queue this core up for closing.
-      modifyLock.notifyAll(); // Wakes up closer thread too
+      WRITE_LOCK_CONDITION.signalAll(); // Wakes up closer thread too
+    } finally {
+      READ_WRITE_LOCK.writeLock().unlock();
     }
   }
 
   /**
    * @return the cache holding the transient cores; never null.
    */
   public TransientSolrCoreCache getTransientCacheHandler() {
-    synchronized (modifyLock) {
-      if (transientSolrCoreCacheFactory == null) {
-        throw new SolrException(
-            SolrException.ErrorCode.SERVER_ERROR,
-            getClass().getName() + " not loaded; call load() before using it");
-      }
-      return transientSolrCoreCacheFactory.getTransientSolrCoreCache();
+    READ_WRITE_LOCK.writeLock().lock();
+    try {
+      return getInternalTransientCacheHandler();
+    } finally {
+      READ_WRITE_LOCK.writeLock().unlock();
+    }
+  }
+
+  /**
+   * Explicitly does not use a lock to provide the flexibility to choose between a read-lock and a
+   * write-lock before calling this method. Choosing a lock and locking before calling this method
+   * is mandatory. Note: Using always a write-lock would be costly. Using always a read-lock would
+   * block all calls which already hold a write-lock.
+   *
+   * @return the cache holding the transient cores; never null.
+   */
+  private TransientSolrCoreCache getInternalTransientCacheHandler() {

Review Comment:
   The javadoc should state that either lock must be held when calling this method.
   Would it be clearer to have this method named getTransientCacheWithLock()?
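
   For readers following along, the synchronized-to-ReadWriteLock migration under review can be sketched in isolation. This is a simplified stand-in, not the actual SolrCores code; the class and member names are illustrative. One detail worth noting: the Condition must hang off the write lock, because ReentrantReadWriteLock read locks do not support conditions.

   ```java
   import java.util.ArrayDeque;
   import java.util.Queue;
   import java.util.concurrent.locks.Condition;
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   // Minimal sketch of the locking pattern in the diff above.
   // All names (PendingCloseQueue, etc.) are illustrative, not Solr's.
   class PendingCloseQueue {
     private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
     // Conditions can only be created from the write lock; the read lock
     // throws UnsupportedOperationException on newCondition().
     private final Condition pendingWork = rwLock.writeLock().newCondition();
     private final Queue<String> pendingCloses = new ArrayDeque<>();

     void queueCoreToClose(String coreName) {
       rwLock.writeLock().lock();
       try {
         pendingCloses.add(coreName);
         pendingWork.signalAll(); // wakes up the closer thread
       } finally {
         rwLock.writeLock().unlock();
       }
     }

     String awaitAndPoll() throws InterruptedException {
       rwLock.writeLock().lock(); // await() requires the write lock
       try {
         while (pendingCloses.isEmpty()) {
           pendingWork.await();
         }
         return pendingCloses.poll();
       } finally {
         rwLock.writeLock().unlock();
       }
     }

     int pendingCount() {
       rwLock.readLock().lock(); // concurrent readers do not block each other
       try {
         return pendingCloses.size();
       } finally {
         rwLock.readLock().unlock();
       }
     }
   }
   ```

   The read-mostly methods (here `pendingCount()`) are the ones that benefit: many request threads can hold the read lock at once, where `synchronized (modifyLock)` serialized them.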



##########
solr/core/src/java/org/apache/solr/core/SolrCores.java:
##########
@@ -555,23 +658,41 @@ public boolean isCoreLoading(String name) {
   }
 
   public void queueCoreToClose(SolrCore coreToClose) {
-    synchronized (modifyLock) {
+    READ_WRITE_LOCK.writeLock().lock();
+    try {
       pendingCloses.add(coreToClose); // Essentially just queue this core up for closing.
-      modifyLock.notifyAll(); // Wakes up closer thread too
+      WRITE_LOCK_CONDITION.signalAll(); // Wakes up closer thread too
+    } finally {
+      READ_WRITE_LOCK.writeLock().unlock();
     }
   }
 
   /**
    * @return the cache holding the transient cores; never null.
    */
   public TransientSolrCoreCache getTransientCacheHandler() {
-    synchronized (modifyLock) {
-      if (transientSolrCoreCacheFactory == null) {
-        throw new SolrException(
-            SolrException.ErrorCode.SERVER_ERROR,
-            getClass().getName() + " not loaded; call load() before using it");
-      }
-      return transientSolrCoreCacheFactory.getTransientSolrCoreCache();
+    READ_WRITE_LOCK.writeLock().lock();

Review Comment:
   I think this method should take the read lock, and document it.
   Anyway the caller would need to take the write lock to modify it outside of this method. So this method does not have to enforce the write lock.



##########
solr/core/src/java/org/apache/solr/core/TransientSolrCoreCacheDefault.java:
##########
@@ -70,10 +80,15 @@ public TransientSolrCoreCacheDefault(CoreContainer coreContainer) {
     Caffeine<String, SolrCore> transientCoresCacheBuilder =
         Caffeine.newBuilder()
             .initialCapacity(initialCapacity)
-            // Use the current thread to queue evicted cores for closing. This ensures the
-            // cache max size is respected (with a different thread the max size would be
-            // respected asynchronously only eventually).
-            .executor(Runnable::run)
+            // Do NOT use the current thread to queue evicted cores for closing. Although this
+            // ensures the cache max size is respected only eventually, it should be a very
+            // short period of time until the cache maintenance task kicks in.
+            // The onEvict method needs the writeLock from SolrCores to operate safely.
+            // However, the current thread most probably has acquired a read-lock already
+            // somewhere up the call stack and would deadlock.
+            // Note that Caffeine cache has an internal maintenance task rescheduling logic which
+            // explicitly checks on the common pool.
+            .executor(ForkJoinPool.commonPool())

Review Comment:
   @ben-manes we had a discussion on this subject some time ago. You told us about a way to pin entries. But in our case here, the is-in-use status of an entry is transient; we cannot change the weight of many entries each time they are used.
   So we decided to un-evict (put back) an entry to the cache if it is in use. But we now see the limit of this approach when we have to manage a read lock and a write lock, and a different thread to put back.
   
   The solution I see would be to have an eviction approver (hard no) before the entry is selected for eviction (so we prefer not to evict rather than putting back). In this case we accept the O(n) scan for the entry selection, and we also accept that the cache size exceeds the limit, knowing that Caffeine maintenance tasks will kick in periodically to try to evict entries again later. Would that be a possible option for Caffeine?
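
   The deadlock mentioned in the diff comment above follows from ReentrantReadWriteLock not supporting lock upgrades: a thread that holds the read lock and then requests the write lock waits on itself. A minimal stdlib illustration, independent of Caffeine (a timed tryLock is used so the sketch terminates instead of hanging):

   ```java
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   // Demonstrates why an eviction callback running on the *calling* thread
   // (Caffeine's executor(Runnable::run)) can deadlock: if the caller is
   // already inside a read-locked section, the read->write upgrade that
   // onEvict() would need is not supported and never succeeds.
   class UpgradeDeadlockDemo {
     static boolean tryUpgrade() throws InterruptedException {
       ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
       rw.readLock().lock(); // stand-in for the read lock held up the call stack
       try {
         // Stand-in for onEvict() acquiring the write lock: blocked by the
         // active reader (this very thread), so the timed attempt fails.
         return rw.writeLock().tryLock(100, TimeUnit.MILLISECONDS);
       } finally {
         rw.readLock().unlock();
       }
     }
   }
   ```

   Handing the eviction work to a different thread (the common pool) sidesteps this, at the cost of the max size only being enforced eventually.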



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


Re: [PR] [SOLR-16497] Allow finer grained locking in SolrCores [solr]

Posted by "debe (via GitHub)" <gi...@apache.org>.
debe commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1951334636

   We are actively rewriting this fix to bring it to 9.X, 9.5.0 and main. I'd keep this open for now. Our progress is visible here https://github.com/otto-de/solr and we'll contribute it back from that repo. Cheers, -dennis




[GitHub] [solr] debe commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
debe commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1337878673

   Basically the changes around the Handler are needed to get the read-write locking correct. It's an all-or-nothing change. Even the smallest mistake could lead to a deadlock or missing cores. We were very careful in analyzing the relevant code paths; otherwise you risk deadlocks or undefined behaviour, e.g. a thread switch while in a critical block. Luckily the test suite is very good and revealed some problems.




[GitHub] [solr] stillalex commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
stillalex commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1337824454

   I was surprised at the results too, but I figured it's just my limited understanding of this code :). I think the benchmark is too basic in its approach to cover the PR correctly, so please take a look and let me know what could be done to make it closer to real-life operations.
   Good to rule out the transient cores, but if they do not come into play, why all the changes around `getInternalTransientCacheHandler`?
   




[GitHub] [solr] bruno-roustant commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1043219211


##########
solr/core/src/java/org/apache/solr/core/TransientSolrCoreCacheDefault.java:
##########
@@ -70,10 +80,15 @@ public TransientSolrCoreCacheDefault(CoreContainer coreContainer) {
     Caffeine<String, SolrCore> transientCoresCacheBuilder =
         Caffeine.newBuilder()
             .initialCapacity(initialCapacity)
-            // Use the current thread to queue evicted cores for closing. This ensures the
-            // cache max size is respected (with a different thread the max size would be
-            // respected asynchronously only eventually).
-            .executor(Runnable::run)
+            // Do NOT use the current thread to queue evicted cores for closing. Although this
+            // ensures the cache max size is respected only eventually, it should be a very
+            // short period of time until the cache maintenance task kicks in.
+            // The onEvict method needs the writeLock from SolrCores to operate safely.
+            // However, the current thread most probably has acquired a read-lock already
+            // somewhere up the call stack and would deadlock.
+            // Note that Caffeine cache has an internal maintenance task rescheduling logic which
+            // explicitly checks on the common pool.
+            .executor(ForkJoinPool.commonPool())

Review Comment:
   @ben-manes we had a [discussion](https://github.com/apache/solr/pull/580) on this subject some time ago. You told us about a way to pin entries. But in our case here, the is-in-use status of an entry is transient; we cannot change the weight of many entries each time they are used.
   So we decided to un-evict (put back) an entry to the cache if it is in use. But we now see the limit of this approach when we have to manage a read lock and a write lock, and a different thread to put back.
   
   The solution I see would be to have an eviction approver (hard no) before the entry is selected for eviction (so we prefer not to evict rather than putting back). In this case we accept the O(n) scan for the entry selection, and we also accept that the cache size exceeds the limit, knowing that Caffeine maintenance tasks will kick in periodically to try to evict entries again later. Would that be a possible option for Caffeine?
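
   As an aside, the JDK ships a tiny analog of this "hard no" approver: `LinkedHashMap.removeEldestEntry` can veto an eviction, letting the map temporarily exceed its bound. Unlike the proposal it only ever offers the eldest entry (no O(n) scan, no periodic retry), so this is just a sketch of the shape; `inUse` is a hypothetical stand-in for Solr's core reference counting:

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.Set;

   // Miniature model of an eviction approver: the candidate entry is
   // offered for eviction and the policy may answer with a hard no,
   // letting the cache exceed its bound instead of evict-then-put-back.
   class VetoCacheDemo {
     static Map<String, String> newCache(int maxSize, Set<String> inUse) {
       // accessOrder=true gives LRU iteration order, like a size-bounded cache
       return new LinkedHashMap<String, String>(16, 0.75f, true) {
         @Override
         protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
           // Hard no: refuse to evict an in-use entry. No put-back, so no
           // ping-pong; the size simply overflows until the entry is free.
           return size() > maxSize && !inUse.contains(eldest.getKey());
         }
       };
     }
   }
   ```

   The limitation is visible too: while the in-use eldest entry blocks, nothing else gets evicted either, which is exactly why a real approver would need to scan past vetoed candidates.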





[GitHub] [solr] ben-manes commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
ben-manes commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1044281175


##########
solr/core/src/java/org/apache/solr/core/TransientSolrCoreCacheDefault.java:
##########
@@ -70,10 +80,15 @@ public TransientSolrCoreCacheDefault(CoreContainer coreContainer) {
     Caffeine<String, SolrCore> transientCoresCacheBuilder =
         Caffeine.newBuilder()
             .initialCapacity(initialCapacity)
-            // Use the current thread to queue evicted cores for closing. This ensures the
-            // cache max size is respected (with a different thread the max size would be
-            // respected asynchronously only eventually).
-            .executor(Runnable::run)
+            // Do NOT use the current thread to queue evicted cores for closing. Although this
+            // ensures the cache max size is respected only eventually, it should be a very
+            // short period of time until the cache maintenance task kicks in.
+            // The onEvict method needs the writeLock from SolrCores to operate safely.
+            // However, the current thread most probably has acquired a read-lock already
+            // somewhere up the call stack and would deadlock.
+            // Note that Caffeine cache has an internal maintenance task rescheduling logic which
+            // explicitly checks on the common pool.
+            .executor(ForkJoinPool.commonPool())

Review Comment:
   Caffeine only uses amortized O(1) data structures, so it is relatively cheap. The entry is on an LRU doubly-linked list that is managed in a non-blocking fashion. Currently the zero weights are left on the LRU and skipped over during eviction, but they could later be moved on/off if that traversal became an issue. Modifying an entry's weight is done only by write operations, e.g. on a put, so it is more expensive than a read but still [very fast](https://github.com/ben-manes/caffeine/wiki/Benchmarks#write-100-1) (45M/s).





[GitHub] [solr] mpetris commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
mpetris commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1044522720


##########
solr/core/src/java/org/apache/solr/core/SolrCores.java:
##########
@@ -555,23 +658,41 @@ public boolean isCoreLoading(String name) {
   }
 
   public void queueCoreToClose(SolrCore coreToClose) {
-    synchronized (modifyLock) {
+    READ_WRITE_LOCK.writeLock().lock();
+    try {
       pendingCloses.add(coreToClose); // Essentially just queue this core up for closing.
-      modifyLock.notifyAll(); // Wakes up closer thread too
+      WRITE_LOCK_CONDITION.signalAll(); // Wakes up closer thread too
+    } finally {
+      READ_WRITE_LOCK.writeLock().unlock();
     }
   }
 
   /**
    * @return the cache holding the transient cores; never null.
    */
   public TransientSolrCoreCache getTransientCacheHandler() {
-    synchronized (modifyLock) {
-      if (transientSolrCoreCacheFactory == null) {
-        throw new SolrException(
-            SolrException.ErrorCode.SERVER_ERROR,
-            getClass().getName() + " not loaded; call load() before using it");
-      }
-      return transientSolrCoreCacheFactory.getTransientSolrCoreCache();
+    READ_WRITE_LOCK.writeLock().lock();
+    try {
+      return getInternalTransientCacheHandler();
+    } finally {
+      READ_WRITE_LOCK.writeLock().unlock();
+    }
+  }
+
+  /**
+   * Explicitly does not use a lock to provide the flexibility to choose between a read-lock and a
+   * write-lock before calling this method. Choosing a lock and locking before calling this method
+   * is mandatory. Note: Using always a write-lock would be costly. Using always a read-lock would
+   * block all calls which already hold a write-lock.
+   *
+   * @return the cache holding the transient cores; never null.
+   */
+  private TransientSolrCoreCache getInternalTransientCacheHandler() {

Review Comment:
   Hi @bruno-roustant, the private method you quoted above, `getInternalTransientCacheHandler`, does in fact explicitly mention the need for a lock in its javadoc, whereas the public `getTransientCacheHandler` holds the lock itself and would indeed benefit from being renamed to `getTransientCacheWithLock`. We will change the name accordingly and document that a lock is being held.





[GitHub] [solr] bruno-roustant commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1044526330


##########
solr/core/src/java/org/apache/solr/core/SolrCores.java:
##########
@@ -555,23 +658,41 @@ public boolean isCoreLoading(String name) {
   }
 
   public void queueCoreToClose(SolrCore coreToClose) {
-    synchronized (modifyLock) {
+    READ_WRITE_LOCK.writeLock().lock();
+    try {
       pendingCloses.add(coreToClose); // Essentially just queue this core up for closing.
-      modifyLock.notifyAll(); // Wakes up closer thread too
+      WRITE_LOCK_CONDITION.signalAll(); // Wakes up closer thread too
+    } finally {
+      READ_WRITE_LOCK.writeLock().unlock();
     }
   }
 
   /**
    * @return the cache holding the transient cores; never null.
    */
   public TransientSolrCoreCache getTransientCacheHandler() {
-    synchronized (modifyLock) {
-      if (transientSolrCoreCacheFactory == null) {
-        throw new SolrException(
-            SolrException.ErrorCode.SERVER_ERROR,
-            getClass().getName() + " not loaded; call load() before using it");
-      }
-      return transientSolrCoreCacheFactory.getTransientSolrCoreCache();
+    READ_WRITE_LOCK.writeLock().lock();
+    try {
+      return getInternalTransientCacheHandler();
+    } finally {
+      READ_WRITE_LOCK.writeLock().unlock();
+    }
+  }
+
+  /**
+   * Explicitly does not use a lock to provide the flexibility to choose between a read-lock and a
+   * write-lock before calling this method. Choosing a lock and locking before calling this method
+   * is mandatory. Note: Using always a write-lock would be costly. Using always a read-lock would
+   * block all calls which already hold a write-lock.
+   *
+   * @return the cache holding the transient cores; never null.
+   */
+  private TransientSolrCoreCache getInternalTransientCacheHandler() {

Review Comment:
   I wouldn't rename the public method.





[GitHub] [solr] stillalex commented on pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
stillalex commented on PR #1155:
URL: https://github.com/apache/solr/pull/1155#issuecomment-1355742412

   @debe I have finally had some time to update the benchmark to remove Mockito and only use pure SolrCore code for verification. I think I have taken all the noise out of it, but unfortunately this is back to my initial results, showing a slowdown with the PR. Could you take another look?
   
   [main branch](https://github.com/apache/solr/compare/main...stillalex:SOLR-16497-locks-bench?expand=1):
   
   ```Result "org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList":
     7013249.176 ±(99.9%) 420959.068 ops/s [Average]
     (min, avg, max) = (6986952.973, 7013249.176, 7030112.165), stdev = 23074.187
     CI (99.9%): [6592290.109, 7434208.244] (assumes normal distribution)
   ```
   
   PR ([main branch + pr patch](https://github.com/apache/solr/compare/main...stillalex:SOLR-16497-locks-bench-and-patch?expand=1)):
   
   ```Result "org.apache.solr.core.SolrCoresBenchTest.getCoreFromAnyList":
     1492787.497 ±(99.9%) 567915.344 ops/s [Average]
     (min, avg, max) = (1468715.133, 1492787.497, 1527941.373), stdev = 31129.356
     CI (99.9%): [924872.153, 2060702.841] (assumes normal distribution)
   ```
   
   @dsmiley is the benchmark useful in its current form? I would like to contribute it along with the PR (seeing the results, it probably still needs a few tweaks). Let me know if this is something you are interested in; otherwise I can leave it out of this conversation.
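
   A single-threaded slowdown on the read path is plausible even before contention: an uncontended `synchronized` block is cheap in HotSpot, while `ReentrantReadWriteLock.readLock()` pays a CAS on shared state for every acquire and release. A crude stdlib comparison — no warmup control, no forking, so no substitute for the JMH benchmark above:

   ```java
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   // Crude single-threaded comparison of an uncontended synchronized block
   // vs. a read-lock acquire/release. Illustrative only: it shows the read
   // lock is not free, not a rigorous measurement of the two strategies.
   class LockCostDemo {
     static final Object monitor = new Object();
     static final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
     static long sink; // prevents the loop bodies from being optimized away

     static long timeSynchronized(int iterations) {
       long start = System.nanoTime();
       for (int i = 0; i < iterations; i++) {
         synchronized (monitor) { sink += i; }
       }
       return System.nanoTime() - start;
     }

     static long timeReadLock(int iterations) {
       long start = System.nanoTime();
       for (int i = 0; i < iterations; i++) {
         rw.readLock().lock();
         try { sink += i; } finally { rw.readLock().unlock(); }
       }
       return System.nanoTime() - start;
     }
   }
   ```

   The read-write lock still wins where the original JFR recordings showed the pain: many threads parked on one monitor. A single-threaded benchmark cannot show that effect, which may explain the gap between these numbers and the production observations.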




[GitHub] [solr] mpetris commented on a diff in pull request #1155: [SOLR-16497] Allow finer grained locking in SolrCores

Posted by GitBox <gi...@apache.org>.
mpetris commented on code in PR #1155:
URL: https://github.com/apache/solr/pull/1155#discussion_r1045286686


##########
solr/core/src/java/org/apache/solr/core/TransientSolrCoreCacheDefault.java:
##########
@@ -47,6 +49,14 @@ public class TransientSolrCoreCacheDefault extends TransientSolrCoreCache {
    */
   protected final Cache<String, SolrCore> transientCores;
 
+  /**
+   * This explicitly models capacity overflow of transientCores with cores that are still in use. A
+   * cores cache which holds cores evicted from transientCores because of limited size but which
+   * shall remain cached because they are still referenced or because they are part of an ongoing
+   * operation (pending ops).
+   */
+  protected final Cache<String, SolrCore> overflowCores;

Review Comment:
   > The PR description says _"a separate thread will schedule a ping-pong behaviour in the eviction handler on a full cache with SolrCores still referenced"_. Could you be more specific?
   > Did you see this ping pong behavior, with a core being evicted, then put back, then evicted again, and so on?
   
   Since we don't use transient cores in our environment, I tested the behaviour via the tests in [TestLazyCores](https://github.com/apache/solr/blob/main/solr/core/src/test/org/apache/solr/core/TestLazyCores.java). I observed the ping-pong behaviour with the separation into read and write locks in place in SolrCores and the original behaviour still in TransientSolrCoreCacheDefault, with the difference that the onEvict() method uses the new write lock from SolrCores and the executor of the Caffeine cache is set to ForkJoinPool.commonPool() (to avoid the deadlock when the Caffeine maintenance task and the eviction happen during a get on the cache while a read lock is already held).
   
   In this scenario the [testCreateTransientFromAdmin](https://github.com/apache/solr/blob/37c34664bab9a5d6df9ba6f3544d09432845aa51/solr/core/src/test/org/apache/solr/core/TestLazyCores.java#L486) test shows, [after holding references to five cores](https://github.com/apache/solr/blob/37c34664bab9a5d6df9ba6f3544d09432845aa51/solr/core/src/test/org/apache/solr/core/TestLazyCores.java#L508) with a max size of four, that the ForkJoinPool.commonPool()-based Caffeine maintenance task tries over and over again to evict cores that get put back by `onEvict()`, until the first core gets closed in the [TestThread](https://github.com/apache/solr/blob/37c34664bab9a5d6df9ba6f3544d09432845aa51/solr/core/src/test/org/apache/solr/core/TestLazyCores.java#L545).
   
   So yes, this behaviour is transient, but it seems to be present as long as the max size is exceeded. Not sure how long that can last, but it felt bad to let the maintenance task go crazy during that time. That's why we came up with the overflowCores structure.
   
   Note that it might be different when the executor is set to a custom pool. In Caffeine's BoundedLocalCache the maintenance task is scheduled more frequently when ForkJoinPool.commonPool() is used. But introducing a custom pool seemed to make things even more complicated, so we didn't explore that road.
   


