You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/02/05 08:04:53 UTC

[GitHub] [flink] StephanEwen opened a new pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

StephanEwen opened a new pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018
 
 
   ## What is the purpose of the change
   
   Fixes the race condition when releasing `OpaqueMemoryResource`between the following lines: 
   ```java
   final boolean allDisposed = sharedResources.release(type, leaseHolder);
   if (allDisposed) {
   	releaseMemory(type, MemoryType.OFF_HEAP, size);
   }
   ```
     - The shared resource release happens under lock
     - Releasing the memory to the memory manager is outside the lock
     - Another thread can find the resource released, but the memory not yet returned
   
   The solution is to push the memory release (via a lambda) into the `sharedResources` such that the release also happens under the lock.
   
   ## Verifying this change
   
   This extends the existing unit tests.
   Thread safety is not verified in unit tests, as this is not easily possible.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): **no**
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no**
     - The serializers: **no**
     - The runtime per-record code paths (performance sensitive): **no**
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: **no**
     - The S3 file system connector: **no**
   
   ## Documentation
   
     - Does this pull request introduce a new feature? **no**
     - If yes, how is the feature documented? **not applicable**
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#issuecomment-582293770
 
 
   <!--
   Meta data
   Hash:4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Status:FAILURE URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4849 TriggerType:PUSH TriggerID:4ea977f18fb6bd5da47b80f96d899ef5d05193a4
   Hash:4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Status:PENDING URL:https://travis-ci.com/flink-ci/flink/builds/147504848 TriggerType:PUSH TriggerID:4ea977f18fb6bd5da47b80f96d899ef5d05193a4
   -->
   ## CI report:
   
   * 4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Travis: [PENDING](https://travis-ci.com/flink-ci/flink/builds/147504848) Azure: [FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4849) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#issuecomment-582293770
 
 
   <!--
   Meta data
   Hash:4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Status:FAILURE URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4849 TriggerType:PUSH TriggerID:4ea977f18fb6bd5da47b80f96d899ef5d05193a4
   Hash:4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Status:FAILURE URL:https://travis-ci.com/flink-ci/flink/builds/147504848 TriggerType:PUSH TriggerID:4ea977f18fb6bd5da47b80f96d899ef5d05193a4
   Hash:556091e68e5d84192ab586b96eec5a4bc889b0aa Status:PENDING URL:https://travis-ci.com/flink-ci/flink/builds/147512868 TriggerType:PUSH TriggerID:556091e68e5d84192ab586b96eec5a4bc889b0aa
   Hash:556091e68e5d84192ab586b96eec5a4bc889b0aa Status:PENDING URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4850 TriggerType:PUSH TriggerID:556091e68e5d84192ab586b96eec5a4bc889b0aa
   -->
   ## CI report:
   
   * 4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Travis: [FAILURE](https://travis-ci.com/flink-ci/flink/builds/147504848) Azure: [FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4849) 
   * 556091e68e5d84192ab586b96eec5a4bc889b0aa Travis: [PENDING](https://travis-ci.com/flink-ci/flink/builds/147512868) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4850) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
StephanEwen commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#discussion_r375133066
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/memory/SharedResources.java
 ##########
 @@ -80,27 +81,37 @@
 		}
 	}
 
+	/**
+	 * Releases a lease (identified by the lease holder object) for the given type.
+	 * If no further leases exist, the resource is disposed.
+	 */
+	void release(String type, Object leaseHolder) throws Exception {
+		release(type, leaseHolder, (value) -> {});
+	}
+
 	/**
 	 * Releases a lease (identified by the lease holder object) for the given type.
 	 * If no further leases exist, the resource is disposed.
 	 *
-	 * @return True, if this was the last lease holder and the resource was disposed.
+	 * <p>This method takes an additional hook that is called when the resource is disposed.
 	 */
-	boolean release(String type, Object leaseHolder) throws Exception {
+	void release(String type, Object leaseHolder, Consumer<Long> releaser) throws Exception {
 		lock.lock();
 		try {
-			final LeasedResource resource = reservedResources.get(type);
+			final LeasedResource<?> resource = reservedResources.get(type);
 			if (resource == null) {
-				return false;
+				return;
 			}
 
 			if (resource.removeLeaseHolder(leaseHolder)) {
-				reservedResources.remove(type);
-				resource.dispose();
-				return true;
+				try {
+					reservedResources.remove(type);
+					resource.dispose();
+				}
+				finally {
+					releaser.accept(resource.size());
+				}
 
 Review comment:
   I would like to keep the `finally` block, because of the following:
     - It would be possibly for the "release resource" call to fail
     - if we do not release the memory back, we must fail the process, because it leaved the TM in a "corrupt" state. It would not be possible to ever use that slot again.
   
   Double releasing should not be possible, guarded by the code of `SharedResources`.
   
   For clarity, I changed the `dispose()` method to not throw an exception on "double release".

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#issuecomment-582293770
 
 
   <!--
   Meta data
   Hash:4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Status:FAILURE URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4849 TriggerType:PUSH TriggerID:4ea977f18fb6bd5da47b80f96d899ef5d05193a4
   Hash:4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Status:FAILURE URL:https://travis-ci.com/flink-ci/flink/builds/147504848 TriggerType:PUSH TriggerID:4ea977f18fb6bd5da47b80f96d899ef5d05193a4
   Hash:556091e68e5d84192ab586b96eec5a4bc889b0aa Status:FAILURE URL:https://travis-ci.com/flink-ci/flink/builds/147512868 TriggerType:PUSH TriggerID:556091e68e5d84192ab586b96eec5a4bc889b0aa
   Hash:556091e68e5d84192ab586b96eec5a4bc889b0aa Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4850 TriggerType:PUSH TriggerID:556091e68e5d84192ab586b96eec5a4bc889b0aa
   -->
   ## CI report:
   
   * 4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Travis: [FAILURE](https://travis-ci.com/flink-ci/flink/builds/147504848) Azure: [FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4849) 
   * 556091e68e5d84192ab586b96eec5a4bc889b0aa Travis: [FAILURE](https://travis-ci.com/flink-ci/flink/builds/147512868) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4850) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
carp84 commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#discussion_r375111995
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/memory/MemoryManager.java
 ##########
 @@ -576,7 +577,9 @@ public void releaseAllMemory(Object owner, MemoryType memoryType) {
 		// if we need to allocate the resource (no shared resource allocated, yet), this would be the size to use
 		final long numBytes = computeMemorySize(fractionToInitializeWith);
 
-		// the initializer attempt to reserve the memory before actual initialization
+		// intitializer and releaser as functions that are pushed into the SharedResources,
 
 Review comment:
   Minor: intitializer -> initializer

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen commented on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
StephanEwen commented on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#issuecomment-582373419
 
 
   Thanks, closing and rebasing and merging this then.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#issuecomment-582293770
 
 
   <!--
   Meta data
   Hash:4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Status:FAILURE URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4849 TriggerType:PUSH TriggerID:4ea977f18fb6bd5da47b80f96d899ef5d05193a4
   Hash:4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Status:FAILURE URL:https://travis-ci.com/flink-ci/flink/builds/147504848 TriggerType:PUSH TriggerID:4ea977f18fb6bd5da47b80f96d899ef5d05193a4
   Hash:556091e68e5d84192ab586b96eec5a4bc889b0aa Status:FAILURE URL:https://travis-ci.com/flink-ci/flink/builds/147512868 TriggerType:PUSH TriggerID:556091e68e5d84192ab586b96eec5a4bc889b0aa
   Hash:556091e68e5d84192ab586b96eec5a4bc889b0aa Status:PENDING URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4850 TriggerType:PUSH TriggerID:556091e68e5d84192ab586b96eec5a4bc889b0aa
   -->
   ## CI report:
   
   * 4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Travis: [FAILURE](https://travis-ci.com/flink-ci/flink/builds/147504848) Azure: [FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4849) 
   * 556091e68e5d84192ab586b96eec5a4bc889b0aa Travis: [FAILURE](https://travis-ci.com/flink-ci/flink/builds/147512868) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4850) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
carp84 commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#discussion_r375122419
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/memory/SharedResources.java
 ##########
 @@ -80,27 +81,37 @@
 		}
 	}
 
+	/**
+	 * Releases a lease (identified by the lease holder object) for the given type.
+	 * If no further leases exist, the resource is disposed.
+	 */
+	void release(String type, Object leaseHolder) throws Exception {
+		release(type, leaseHolder, (value) -> {});
+	}
+
 	/**
 	 * Releases a lease (identified by the lease holder object) for the given type.
 	 * If no further leases exist, the resource is disposed.
 	 *
-	 * @return True, if this was the last lease holder and the resource was disposed.
+	 * <p>This method takes an additional hook that is called when the resource is disposed.
 	 */
-	boolean release(String type, Object leaseHolder) throws Exception {
+	void release(String type, Object leaseHolder, Consumer<Long> releaser) throws Exception {
 		lock.lock();
 		try {
-			final LeasedResource resource = reservedResources.get(type);
+			final LeasedResource<?> resource = reservedResources.get(type);
 			if (resource == null) {
-				return false;
+				return;
 			}
 
 			if (resource.removeLeaseHolder(leaseHolder)) {
-				reservedResources.remove(type);
-				resource.dispose();
-				return true;
+				try {
+					reservedResources.remove(type);
+					resource.dispose();
+				}
+				finally {
+					releaser.accept(resource.size());
+				}
 
 Review comment:
   I'm not sure whether calling the hook even when `resource.dispose` failed is a good idea. From the existing implementation of `SharedResources$LeasedResource#dispose`, it throws exception when `disposed` is `true` which means the resource is already disposed, following this logic the invocation of the hook may cause a memory "double release" ? Or if we assume `SharedResources$LeasedResource` will only be used with the lock (yes with current implementation), maybe we could remove the protection of multi-threading there as well as the exception declaration of the `dispose` method?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] tillrohrmann commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#discussion_r375184200
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/memory/SharedResources.java
 ##########
 @@ -80,27 +81,37 @@
 		}
 	}
 
+	/**
+	 * Releases a lease (identified by the lease holder object) for the given type.
+	 * If no further leases exist, the resource is disposed.
+	 */
+	void release(String type, Object leaseHolder) throws Exception {
+		release(type, leaseHolder, (value) -> {});
+	}
+
 	/**
 	 * Releases a lease (identified by the lease holder object) for the given type.
 	 * If no further leases exist, the resource is disposed.
 	 *
-	 * @return True, if this was the last lease holder and the resource was disposed.
+	 * <p>This method takes an additional hook that is called when the resource is disposed.
 	 */
-	boolean release(String type, Object leaseHolder) throws Exception {
+	void release(String type, Object leaseHolder, Consumer<Long> releaser) throws Exception {
 		lock.lock();
 		try {
-			final LeasedResource resource = reservedResources.get(type);
+			final LeasedResource<?> resource = reservedResources.get(type);
 			if (resource == null) {
-				return false;
+				return;
 			}
 
 			if (resource.removeLeaseHolder(leaseHolder)) {
-				reservedResources.remove(type);
-				resource.dispose();
-				return true;
+				try {
+					reservedResources.remove(type);
+					resource.dispose();
+				}
+				finally {
+					releaser.accept(resource.size());
+				}
 
 Review comment:
   I guess my question is more related to the overall memory release logic than to this change.
   
   What is the intended behaviour of Flink if `resource.dispose` fails? According to the contract `dispose` can throw an `Exception` and I assume it wouldn't be guaranteed that `resource` would have released all of its resources (native memory). 
   
   If one follows the chain of close calls, then the release of the `SharedResources` could be triggered by `RocksDBKeyedStateBackend#dispose` where we swallow the exception via `IOUtils.closeQuietly`. Wouldn't this leave Flink in a problematic state where we have less memory than we think we have? I think it is dangerous to swallow the exceptions and it should be handled correctly. In the case of memory leakage I guess we should fail fatally.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen commented on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
StephanEwen commented on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#issuecomment-582369183
 
 
   I added the following comment to the `MemoryManager.getSharedMemoryResourceForManagedMemory(...)` method to clarify that:
   
   ```
   <b>Important:</b> The failure semantics are as follows: If the memory manager fails to reserve
   the memory, the external resource initializer will not be called. If an exception is thrown when the
   opaque resource is closed (last lease is released), the memory manager will still un-reserve the
   memory to make sure its own accounting is clean. The exception will need to be handled by the caller of
   {@link OpaqueMemoryResource#close()}. For example, if this indicates that native memory was not released
   and the process might thus have a memory leak, the caller can decide to kill the process as a result.
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
StephanEwen commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#discussion_r375185846
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/memory/SharedResources.java
 ##########
 @@ -80,27 +81,37 @@
 		}
 	}
 
+	/**
+	 * Releases a lease (identified by the lease holder object) for the given type.
+	 * If no further leases exist, the resource is disposed.
+	 */
+	void release(String type, Object leaseHolder) throws Exception {
+		release(type, leaseHolder, (value) -> {});
+	}
+
 	/**
 	 * Releases a lease (identified by the lease holder object) for the given type.
 	 * If no further leases exist, the resource is disposed.
 	 *
-	 * @return True, if this was the last lease holder and the resource was disposed.
+	 * <p>This method takes an additional hook that is called when the resource is disposed.
 	 */
-	boolean release(String type, Object leaseHolder) throws Exception {
+	void release(String type, Object leaseHolder, Consumer<Long> releaser) throws Exception {
 		lock.lock();
 		try {
-			final LeasedResource resource = reservedResources.get(type);
+			final LeasedResource<?> resource = reservedResources.get(type);
 			if (resource == null) {
-				return false;
+				return;
 			}
 
 			if (resource.removeLeaseHolder(leaseHolder)) {
-				reservedResources.remove(type);
-				resource.dispose();
-				return true;
+				try {
+					reservedResources.remove(type);
+					resource.dispose();
+				}
+				finally {
+					releaser.accept(resource.size());
+				}
 
 Review comment:
   The double release of the "reservation" is actually guarded twice:
     - `if (resource.removeLeaseHolder(leaseHolder))`: Only the last release of a lease can trigger that.
     - The memory manager itself has this under an "owner", so if no memory is reserved under that owner, none can be released.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen commented on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
StephanEwen commented on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#issuecomment-582352549
 
 
   The `finally` block should make sure that we don't corrupt the memory managers accounting.
   The exception handling for the resource itself is up to the caller.
   
   If the shared cache is not properly disposed (RocksDB swallows this in `closeQuietly()`) can in theory lead to the fact that native memory is not released and we have a leak.
   
   In practice, the RocksDB code throws no exceptions when problems occur, so we cannot react to that anyways.
   
   In any case, the change of logic would be required in the RocksDB code, not int the code of this PR.
   So far, the code's contract is "memory manager makes sure its part stays consistent" and the external resource user needs to decide what to do when that disposal fails.
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot commented on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
flinkbot commented on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#issuecomment-582288319
 
 
   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 4ea977f18fb6bd5da47b80f96d899ef5d05193a4 (Wed Feb 05 08:07:04 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#issuecomment-582293770
 
 
   <!--
   Meta data
   Hash:4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Status:FAILURE URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4849 TriggerType:PUSH TriggerID:4ea977f18fb6bd5da47b80f96d899ef5d05193a4
   Hash:4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Status:FAILURE URL:https://travis-ci.com/flink-ci/flink/builds/147504848 TriggerType:PUSH TriggerID:4ea977f18fb6bd5da47b80f96d899ef5d05193a4
   Hash:556091e68e5d84192ab586b96eec5a4bc889b0aa Status:UNKNOWN URL:TBD TriggerType:PUSH TriggerID:556091e68e5d84192ab586b96eec5a4bc889b0aa
   -->
   ## CI report:
   
   * 4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Travis: [FAILURE](https://travis-ci.com/flink-ci/flink/builds/147504848) Azure: [FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4849) 
   * 556091e68e5d84192ab586b96eec5a4bc889b0aa UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen closed pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
StephanEwen closed pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot commented on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
flinkbot commented on issue #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#issuecomment-582293770
 
 
   <!--
   Meta data
   Hash:4ea977f18fb6bd5da47b80f96d899ef5d05193a4 Status:UNKNOWN URL:TBD TriggerType:PUSH TriggerID:4ea977f18fb6bd5da47b80f96d899ef5d05193a4
   -->
   ## CI report:
   
   * 4ea977f18fb6bd5da47b80f96d899ef5d05193a4 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource

Posted by GitBox <gi...@apache.org>.
carp84 commented on a change in pull request #11018: [FLINK-15905][runtime] Fix race condition between allocation and release of OpaqueMemoryResource
URL: https://github.com/apache/flink/pull/11018#discussion_r375180729
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/memory/SharedResources.java
 ##########
 @@ -80,27 +81,37 @@
 		}
 	}
 
+	/**
+	 * Releases a lease (identified by the lease holder object) for the given type.
+	 * If no further leases exist, the resource is disposed.
+	 */
+	void release(String type, Object leaseHolder) throws Exception {
+		release(type, leaseHolder, (value) -> {});
+	}
+
 	/**
 	 * Releases a lease (identified by the lease holder object) for the given type.
 	 * If no further leases exist, the resource is disposed.
 	 *
-	 * @return True, if this was the last lease holder and the resource was disposed.
+	 * <p>This method takes an additional hook that is called when the resource is disposed.
 	 */
-	boolean release(String type, Object leaseHolder) throws Exception {
+	void release(String type, Object leaseHolder, Consumer<Long> releaser) throws Exception {
 		lock.lock();
 		try {
-			final LeasedResource resource = reservedResources.get(type);
+			final LeasedResource<?> resource = reservedResources.get(type);
 			if (resource == null) {
-				return false;
+				return;
 			}
 
 			if (resource.removeLeaseHolder(leaseHolder)) {
-				reservedResources.remove(type);
-				resource.dispose();
-				return true;
+				try {
+					reservedResources.remove(type);
+					resource.dispose();
+				}
+				finally {
+					releaser.accept(resource.size());
+				}
 
 Review comment:
   Would there be any chance that the resource is already disposed, but we call the hook twice? By "double releasing" I meant the accounting in `MemoryManager`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services