You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/04/29 03:41:35 UTC

[GitHub] [pulsar] dinghram opened a new issue #10433: NPE in broker: EntryImpl.getLength()

dinghram opened a new issue #10433:
URL: https://github.com/apache/pulsar/issues/10433


   During a CHO test, one broker crashed with a NPE. We lost the logs due to a cleanup of files, but we have the general stack trace:
   managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java: 132: return data.readableBytes();
   managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryCacheImpl.java: 115: manager.entryAdded(entry.getLength());


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-828976205


   I wonder if something in this direction could be happening:
   
   Let's say we have 2 threads, T1 and T2 and a single object instance O1.
   
   1. T1 gets O1 from the pool
   2. T1 sets fields on O1
   3. T1 shares O1 to T2
   4. T2 set O1 fields to null and returns O1 to the pool
   5. T1 gets O1 from the pool
   6. T1 sets fields on O1
   7. **T2 reads O1 fields. The values that T1 set in step 6 aren't visible to T2. The visible values are the null values set in step 4.**
   
   I've been recently watching [Aleksey Shipilëv's Java Memory Model Unlearning Experience presentation](https://www.youtube.com/watch?v=TK-7GCCDF_I).
   
   In the presentation, [at about 21 minutes 50 seconds](https://www.youtube.com/watch?v=TK-7GCCDF_I&t=21m50s), there's an interesting example:
   ![image](https://user-images.githubusercontent.com/66864/116509894-bb0e3f80-a8cc-11eb-9d02-981ac3e33732.png)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-830017705


   Since the reported problem could be related to Netty Recycler usage, I'd recommend doing the testing with Netty Recycler completely turned off (this was already proposed in the first comment). 
   
   You can find positive reports about disabling the Netty Recycler at https://github.com/netty/netty/pull/5968#issuecomment-399578989 and https://github.com/netty/netty/pull/5968#issuecomment-352823429 . Elasticsearch [also disabled Netty Recycler completely](https://github.com/elastic/elasticsearch/blob/5baabff6670a8ed49297488ca8cac8ec12a2078d/distribution/tools/launchers/src/main/java/org/elasticsearch/tools/launchers/SystemJvmOptions.java#L50) because of the problems it causes. 
   
   @dinghram Would you be able to run your CHO (Continuous Hours of Operation) tests with Pulsar configured with `-Dio.netty.recycler.maxCapacityPerThread=0` ? 
   It would be very useful if you could also share the possible impact to performance (throughput, latency, resource utilization) after disabling Netty Recycler by setting `-Dio.netty.recycler.maxCapacityPerThread=0` (in `PULSAR_EXTRA_OPTS` as explained [in the first comment](https://github.com/apache/pulsar/issues/10433#issuecomment-828953935)).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-831790136


   UPDATE: The following description of a race condition doesn't seem to be possible since a counter in the base class of EntryImpl protects against it. ([AbstractCASReferenceCounted.release0](https://github.com/apache/pulsar/blob/5e446a67fcacf340d352eeafdce15a78fb971f19/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/util/AbstractCASReferenceCounted.java#L95-L110
   ))
   
   --- 
   
   ~~My current assumption of the problem is that a race condition is possible when cache invalidation happens with a certain timing.~~
   
   The possible race condition:
   
   Code in EntryCacheImpl, asyncReadEntry0 method (code is for branch-2.7):
   
   https://github.com/apache/pulsar/blob/dcaa1d350fec920ef0ba32b4aeecd2e9604d4824/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryCacheImpl.java#L272-L283
   
   Code for `entries.getRange`, referenced on line 272 in above code snippet:
   
   https://github.com/apache/pulsar/blob/45ee784f5ef92fc25e970bbf0be34412472173b1/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/util/RangeCache.java#L108-L122
   
   explanation of a possible race condition:
   
   - one thread reads entries from the cache. the entries get marked for use with the `value.retain();` call (line 115 in RangeCache).
   - in the meantime, the cache entry gets invalidated and invalidation calls  `release()` for the entry 
   - the EntryImpl instance gets returned to the Recycler object pool, and another thread picks it put for usage. 
   - the thread that read entries from the cache calls `.release()` for the instance that is now used by another thread. (line 282 in EntryCacheImpl)
   - since release has been called, it can cause the NPE on `managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java: 132: return data.readableBytes();` reported in this issue.
   
   
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-829919875


   There's a Netty bug that [has been fixed in 4.1.60.Final by this commit](https://github.com/netty/netty/commit/dde82f62). Pulsar 2.7.2 will be released with Netty 4.1.60.Final . Upgrade to Netty 4.1.60.Final in Pulsar was made as part of #10073 .
   
   @dinghram Would you be able to re-test with a build of Pulsar from branch-2.7 since that would include Netty 4.1.60.Final ?
   The [instructions for building custom docker images is in README](https://github.com/apache/pulsar#build-custom-docker-images).
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-1058890863


   The issue had no activity for 30 days, mark with Stale label.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] merlimat commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
merlimat commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-828964821


   the only explanation would be for the `entry` passed in (https://github.com/apache/pulsar/blob/8ea4a39dc8bf6f2f23a160688bb70a80f6acfd4d/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryCacheImpl.java#L86) to be released while the cache insert is still happening.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-831790136


   My current assumption of the problem is that a race condition is possible when cache invalidation happens with a certain timing. 
   
   The possible race condition:
   
   Code in EntryCacheImpl, asyncReadEntry0 method (code is for branch-2.7):
   
   https://github.com/apache/pulsar/blob/dcaa1d350fec920ef0ba32b4aeecd2e9604d4824/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryCacheImpl.java#L272-L283
   
   Code for `entries.getRange`, referenced on line 272 in above code snippet:
   
   https://github.com/apache/pulsar/blob/45ee784f5ef92fc25e970bbf0be34412472173b1/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/util/RangeCache.java#L108-L122
   
   explanation of a possible race condition:
   
   - one thread reads entries from the cache. the entries get marked for use with the `value.retain();` call (line 115 in RangeCache).
   - in the meantime, the cache entry gets invalidated and invalidation calls  `release()` for the entry 
   - the EntryImpl instance gets returned to the Recycler object pool, and another thread picks it put for usage. 
   - the thread that read entries from the cache calls `.release()` for the instance that is now used by another thread. (line 282 in EntryCacheImpl)
   - since release has been called, it can cause the NPE on `managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java: 132: return data.readableBytes();` reported in this issue.
   
   
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-831885032


   When looking more at the possible race condition, the previous explained scenario seems to be possible with a slight variation to it. It seems that invalidation while the entry is in use is possible, but the remaining steps are different. It seems that is would require 2 concurrent invalidations so that the problem would manifest itself. 
   
   It might sound like a very unlikely situation, but it seems that the problem of this issue is also very rare. With enough operations, a very unlikely call path will eventually get executed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-832414104


   > @lhotari Let me poke into this theory:
   > 
   > Initial refCnt of the entry in cache == 1
   > 
   > > one thread reads entries from the cache. the entries get marked for use with the value.retain(); call (line 115 in RangeCache).
   > 
   > A: value.retain() succeeds: refCnt++ == 2
   > B: value.retain() throws (other thread released it already somehow): entry is not returned.
   > 
   > > in the meantime, the cache entry gets invalidated and invalidation calls release() for the entry
   > 
   > A: refCnt-- == 1
   > B: entry was not returned, could not be used
   > 
   > "the EntryImpl instance gets returned to the Recycler object pool" does not happen in case A.
   > 
   > Did I miss something there?
   
   @dlg99 yes, you are right that it's not possible in the way that I first explained it. There's a more recent comment about the races in invalidating the entries which could lead to a somewhat similar situation. The referenced PR contains more discussion about the possible explanations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-828958550


   > Though it won't disable the call to recycle, in where we're resetting the fields to nulls.
   
   @merlimat I guess the `data` field shouldn't be null in this case at all. EntryImpl.create is setting the `data` field by calling `EntryImpl.create(position, cachedData)` on line 112.
   
   https://github.com/apache/pulsar/blob/8ea4a39dc8bf6f2f23a160688bb70a80f6acfd4d/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryCacheImpl.java#L112-L115
   
   `EntryImpl.create`:
   
   https://github.com/apache/pulsar/blob/8ea4a39dc8bf6f2f23a160688bb70a80f6acfd4d/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java#L81-L90
   
   The reported NPE is "impossible". :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] merlimat commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
merlimat commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-828973043


   But that too doesn’t look to be possible....


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-828953935


   @dinghram Thanks for reporting. 
   
   Did I understand the problem report correctly, that the NPE is happening on EntryImpl.java, line 132?
   
   https://github.com/apache/pulsar/blob/8ea4a39dc8bf6f2f23a160688bb70a80f6acfd4d/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java#L130-L133
   
   Called from EntryCacheImpl.java, line 115:
   
   https://github.com/apache/pulsar/blob/8ea4a39dc8bf6f2f23a160688bb70a80f6acfd4d/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryCacheImpl.java#L112-L115
   
   
   
   **Please share the Pulsar version and deployment type (plain/docker/k8s helm)**. The target Java JVM version would also be useful if you aren't deploying with official k8s helm chart and Docker images.
   
   One experiment that could be useful in tracking down the problem would be to add `-Dio.netty.recycler.maxCapacityPerThread=0` to the Pulsar Broker JVM options by adding it to the `PULSAR_EXTRA_OPTS` environment variable. 
   
   For example, In the k8s Helm chart deployment, an environment variable can be set for the broker by adding it under `broker.configData` in the `values.yaml` for the deployment.
   ```
   broker:
     configData:
       PULSAR_EXTRA_OPTS: -Dio.netty.recycler.maxCapacityPerThread=0
   ```
   
   The `-Dio.netty.recycler.maxCapacityPerThread=0` setting disables object instance reuse that might be causing the issue that you have reported. Pulsar uses the Netty framework's [Recycler](https://netty.io/4.1/api/io/netty/util/Recycler.html) class to minimize allocations of some object instances.
   
   **Would it be possible for you to do a test run with `PULSAR_EXTRA_OPTS=-Dio.netty.recycler.maxCapacityPerThread=0` to see if the problem reproduces with this setting?**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-828958550


   > Though it won't disable the call to recycle, in where we're resetting the fields to nulls.
   
   @merlimat I guess the `data` field shouldn't be null in this case at all. EntryImpl.create is setting the `data` field by calling `EntryImpl.create(position, cachedData)` on line 112.
   
   https://github.com/apache/pulsar/blob/8ea4a39dc8bf6f2f23a160688bb70a80f6acfd4d/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java#L130-L133
   
   `EntryImpl.create`:
   
   https://github.com/apache/pulsar/blob/8ea4a39dc8bf6f2f23a160688bb70a80f6acfd4d/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java#L81-L90
   
   The reported NPE is "impossible". :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] dlg99 commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
dlg99 commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-832083532


   @lhotari there are other players there:
   ReadEntriesCallback, ReadEntryCallback - implementations can call extra .release()
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-1019983963


   This problem might be fixed by #13328 . Netty Recycler was rewritten in 4.1.71.Final with PR https://github.com/netty/netty/pull/11858 . Pulsar uses Netty Recycler extensively and the improvements to Netty Recycler might fix various thread-safety problems.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-829036198


   @codelipenghui Thanks for pointing to the OpAddEntry code location. I initially didn't realize that the code had both `entry` and `cachedEntry` and that it's the `entry.data` that is null and the source for the field is `OpAddEntry.data` as you pointed out.
   
   https://github.com/apache/pulsar/blob/8ea4a39dc8bf6f2f23a160688bb70a80f6acfd4d/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpAddEntry.java#L162-L176
   
   Even with this updated information, the problem is "impossible"... :) This seems to be a nice Java Concurrency puzzler.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-831790136


   UPDATE: The following description of a race condition doesn't seem to be possible since a counter in the base class of EntryImpl protects against it. ([AbstractCASReferenceCounted.release0](https://github.com/apache/pulsar/blob/5e446a67fcacf340d352eeafdce15a78fb971f19/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/util/AbstractCASReferenceCounted.java#L95-L110
   ))
   
   My current assumption of the problem is that a race condition is possible when cache invalidation happens with a certain timing. 
   
   The possible race condition:
   
   Code in EntryCacheImpl, asyncReadEntry0 method (code is for branch-2.7):
   
   https://github.com/apache/pulsar/blob/dcaa1d350fec920ef0ba32b4aeecd2e9604d4824/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryCacheImpl.java#L272-L283
   
   Code for `entries.getRange`, referenced on line 272 in above code snippet:
   
   https://github.com/apache/pulsar/blob/45ee784f5ef92fc25e970bbf0be34412472173b1/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/util/RangeCache.java#L108-L122
   
   explanation of a possible race condition:
   
   - one thread reads entries from the cache. the entries get marked for use with the `value.retain();` call (line 115 in RangeCache).
   - in the meantime, the cache entry gets invalidated and invalidation calls  `release()` for the entry 
   - the EntryImpl instance gets returned to the Recycler object pool, and another thread picks it put for usage. 
   - the thread that read entries from the cache calls `.release()` for the instance that is now used by another thread. (line 282 in EntryCacheImpl)
   - since release has been called, it can cause the NPE on `managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java: 132: return data.readableBytes();` reported in this issue.
   
   
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] eolivelli commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
eolivelli commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-831796478


   this comment "// Value was already destroyed between get() and retain() " is very bad.
   catching a Throwable may lead to totally unpredictable behaviour...OutOfMemory? NPE ? IllegalXXXException ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-831906333


   I created a minimal fix without any new tests in #10480 . 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] dinghram commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
dinghram commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-829275756


   Pulsar version: 2.7.0 + cherry-picked fixes
   Deployment: helm
   Java version: 
     openjdk version "1.8.0_282"
     OpenJDK Runtime Environment (IcedTea 3.18.0) (build 1.8.0_282-b07 suse-lp152.2.9.1-x86_64)
     OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
   OS: openSUSE Leap 15.2 + patches


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-829019325


   @lhotari The pool here we are using is a thread-local pool, and always use the the same thread for a managed ledger to insert entry to the cache https://github.com/apache/pulsar/blob/6704f12104219611164aa2bb5bbdfc929613f1bf/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpAddEntry.java#L188. Seems the entry does not have an opportunity to share with other thread.
   
   And if passed here https://github.com/apache/pulsar/blob/8ea4a39dc8bf6f2f23a160688bb70a80f6acfd4d/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryCacheImpl.java#L112 means the data are not null at this point, otherwise will throw NPE here. It's very interesting. If the entry does not insert into the cache, the cache eviction also can get a chance to access the entry.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-831201017


   #2993 seems related.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] dinghram commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
dinghram commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-865107202


   @lhotari Thanks for the feedback. We tested with the -Dio.netty.recycler.maxCapacityPerThread=0 option and saw no negative performance impact. In fact, the CPU usage was lower with this option set. However, this was a single test and not definitive of how Pulsar would behave long term with this setting. As such, it seems we can run with this option set if a customer begins seeing the NPE occur (we have only seen it once). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] merlimat commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
merlimat commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-828954901


   > The -Dio.netty.recycler.maxCapacityPerThread=0 setting disables object instance reuse that might be causing the issue that you have reported
   
   Though it won't disable the call to recycle, in where we're resetting the fields to nulls. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-831885032


   When looking more at the possible race condition, the previous explained scenario seems to be possible with a slight variation to it. 
   
   explanation of a possible race condition:
   
   - one thread reads entries from the cache. the entries get marked for use with the `value.retain();` call (line 115 in RangeCache).
   - in the meantime, the **cache entry gets invalidated 2 times in different ways and both invalidation call paths call `release()` for the entry** 
    - the EntryImpl instance gets returned to the Recycler object pool, and another thread picks it up for usage while it's used in another thread. 
   
   It might sound like a very unlikely situation, but it seems that the problem of this issue is also very rare. With enough operations, a very unlikely call path will eventually get executed.
   
   I was looking for different ways how entries could get invalidated and whether there are locks in place to ensure that an entry doesn't simulaneously get invalidated twice. There isn't a solution for this. I believe that the fix would be to ensure that an entry cannot be invalidated twice. The method for invalidation should be different than calling `release` so that it's possible to track when the entry is invalidated. I'll work on a PR with a proposal to have a simple fix for the problem this way. The changes will be fairly simple.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-829994866


   Another NPE bug that is related to the Netty Recycler: #5727 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-831885032


   When looking more at the possible race condition, the previous explained scenario seems to be possible with a slight variation to it. It seems that invalidation while the entry is in use is possible, but the remaining steps are different.
   
   It might sound like a very unlikely situation, but it seems that the problem of this issue is also very rare. With enough operations, a very unlikely call path will eventually get executed.
   
   I was looking for different ways how entries could get invalidated and whether there are locks in place to ensure that an entry doesn't simulaneously get invalidated twice. There isn't a solution for this. I believe that the fix would be to ensure that an entry cannot be invalidated twice. The method for invalidation should be different than calling `release` so that it's possible to track when the entry is invalidated. I'll work on a PR with a proposal to have a simple fix for the problem this way. The changes will be fairly simple.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-831885032


   When looking more at the possible race condition, the previous explained scenario seems to be possible with a slight variation to it. 
   
   explanation of a possible race condition:
   
   - one thread reads entries from the cache. the entries get marked for use with the `value.retain();` call (line 115 in RangeCache).
   - in the meantime, the **cache entry gets invalidated 2 times in different ways and both invalidation call paths call `release()` for the entry** 
    - the EntryImpl instance gets returned to the Recycler object pool, and another thread picks it put for usage. 
   - the thread that read entries from the cache calls `.release()` for the instance that is now used by another thread. (line 282 in EntryCacheImpl)
   - since release has been called, it can cause the NPE on `managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java: 132: return data.readableBytes();` reported in this issue.
   
   It might sound like a very unlikely situation, but it seems that the problem of this issue is also very rare. With enough operations, a very unlikely call path will eventually get executed.
   
   I was looking for different ways how entries could get invalidated and whether there are locks in place to ensure that an entry doesn't simulaneously get invalidated twice. There isn't a solution for this. I believe that the fix would be to ensure that an entry cannot be invalidated twice. The method for invalidation should be different than calling `release` so that it's possible to track when the entry is invalidated. I'll work on a PR with a proposal to have a simple fix for the problem this way. The changes will be fairly simple.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] dlg99 commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
dlg99 commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-832074255


   @lhotari Let me poke into this theory: 
   
   Initial refCnt of the entry in cache == 1
   
   > one thread reads entries from the cache. the entries get marked for use with the value.retain(); call (line 115 in RangeCache).
   
   A: value.retain() succeeds: refCnt++ == 2
   B: value.retain() throws (other thread released it already somehow): entry is not returned.
   
   > in the meantime, the cache entry gets invalidated and invalidation calls release() for the entry
   
   A: refCnt-- ==  1
   B: entry was not returned, could not be used
   
   "the EntryImpl instance gets returned to the Recycler object pool" does not happen in case A.
   
   Did I miss something there?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-829047679


   > The pool here we are using is a thread-local pool, and always use the the same thread for a managed ledger to insert entry to the cache
   
   @codelipenghui it seems that [Netty's Recycler ](https://github.com/netty/netty/blob/netty-4.1.51.Final/common/src/main/java/io/netty/util/Recycler.java)is thread-local for lookup and creation, but an object can be returned to the pool in any thread. The multi-thread access seems to lead to visibility problems which are possible in the JMM. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] dinghram commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
dinghram commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-865107202


   @lhotari Thanks for the feedback. We tested with the -Dio.netty.recycler.maxCapacityPerThread=0 option and saw no negative performance impact. In fact, the CPU usage was lower with this option set. However, this was a single test and not definitive of how Pulsar would behave long term with this setting. As such, it seems we can run with this option set if a customer begins seeing the NPE occur (we have only seen it once). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #10433: NPE in broker: EntryImpl.getLength()

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #10433:
URL: https://github.com/apache/pulsar/issues/10433#issuecomment-829036198


   @codelipenghui The pool is a thread-local pool, but the instance might be returned to a pool where it wasn't initially created, no?
   
   Thanks for pointing to the OpAddEntry code location. I initially didn't realize that the code had both `entry` and `cachedEntry` and that it's the `entry.data` that is null and the source for the field is `OpAddEntry.data` as you pointed out.
   
   https://github.com/apache/pulsar/blob/8ea4a39dc8bf6f2f23a160688bb70a80f6acfd4d/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpAddEntry.java#L162-L176
   
   Even with this updated information, the problem is "impossible"... :) This seems to be a nice Java Concurrency puzzler.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org