You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/11/04 04:58:41 UTC

[GitHub] [pulsar] lhotari edited a comment on pull request #12606: [ML] Avoid passing OpAddEntry across a thread boundary in asyncAddEntry

lhotari edited a comment on pull request #12606:
URL: https://github.com/apache/pulsar/pull/12606#issuecomment-960458045

> This change makes sense to me. Do you have any evidence of a problem that this PR is fixing?

Direct evidence in a form of a failing test is hard for a lot of thread safety issues.

In many cases, it is worth checking that code follows the rules of the Java Memory Model (JMM), specified in the Java Language Specification (JLS). In Java on 64-bit JVMs, there isn't a problem of corruption of memory pointers since "word tearing" problems are prevented. The main issue is possible data consistency and data race issues. ["Incorrectly Synchronized Programs May Exhibit Surprising Behavior"](https://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html#jls-17.4-A) is how JLS explains it. I guess the verification activity is like a model checking exercise, but at a different level.

My assumption is that the use of recycled objects and sharing them between multiple threads leads to "incorrectly synchronized programs" within Pulsar code base. It doesn't corrupt memory, but we might see "surprising behavior". One of the surprising behavior comes from the possibility of a field value changing even between subsequent reads while the field value hasn't been changed. This is explained in [Aleksey Shipilëv's Java Memory Model Unlearning Experience presentation](https://www.youtube.com/watch?v=TK-7GCCDF_I). [at about 21 minutes 50 seconds](https://www.youtube.com/watch?v=TK-7GCCDF_I&t=21m50s), there's an interesting example:
![image](https://user-images.githubusercontent.com/66864/116509894-bb0e3f80-a8cc-11eb-9d02-981ac3e33732.png)

In this example, the first thread makes a single change to field x and sets it to 1. The second thread reads this field twice. On the first read, the value is 1 and on the second read, the value is 0. This is something that one would assume that "is impossible". This is one example of surprising behavior of incorrectly synchronized programs.

There are multiple ways to mitigate the issue. The safest general approach is to ensure that recycled objects are exposed only in a single thread.

The main motivation of this PR is to reduce the likelyhood of "surprising behavior" related to the use of Netty recycled object, OpAddEntry. The Netty Recycler pool is thread local. When the creation (and lookup from the Netty Recycler) of OpAddEntry is moved in the same thread where the OpAddEntry is mainly mutated and read, it reduces the likelyhood of possible data races caused by sharing objects across threads without proper synchronization. This isn't perfect, but it's an improvement. Currently ManagedLedgerImpl uses a single thread per ledger. There are gaps in the solution. The goal of my draft PR #11387 would be to continue the improvements and eventually fix the current thread safety issues that the use of recycled OpAddEntry and OpReadEntry instances introduce in Pulsar.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org