Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/04/01 11:47:22 UTC

[GitHub] [flink] XComp opened a new pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

XComp opened a new pull request #19327:
URL: https://github.com/apache/flink/pull/19327


   ## What is the purpose of the change
   
   This livelock can occur when an entry has been marked for deletion
   but has not been deleted yet. There is no reason to retry
   in case of a concurrent deletion. See the description of FLINK-26987 for
   further analysis.
   
   ## Brief change log
   
   * Removes the goto-like `continue retry` statement from `getAllAndLock`
   
   ## Verifying this change
   
   * Adds a test to cover that case (this test would run into an infinite loop with the old implementation)
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19327:
URL: https://github.com/apache/flink/pull/19327#issuecomment-1085807864


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34111",
       "triggerID" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124",
       "triggerID" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124",
       "triggerID" : "1086820523",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * e77b43243436c1e21ffe5bb617dafd12d4454213 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19327:
URL: https://github.com/apache/flink/pull/19327#issuecomment-1085807864


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34111",
       "triggerID" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124",
       "triggerID" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e77b43243436c1e21ffe5bb617dafd12d4454213 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124) 
   





[GitHub] [flink] flinkbot edited a comment on pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19327:
URL: https://github.com/apache/flink/pull/19327#issuecomment-1085807864


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34111",
       "triggerID" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fdbd0e9998dea2b98133e3f4ec256a2b07d4b241 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34111) 
   





[GitHub] [flink] flinkbot edited a comment on pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19327:
URL: https://github.com/apache/flink/pull/19327#issuecomment-1085807864


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34111",
       "triggerID" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124",
       "triggerID" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124",
       "triggerID" : "1086820523",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * e77b43243436c1e21ffe5bb617dafd12d4454213 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124) 
   





[GitHub] [flink] flinkbot edited a comment on pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19327:
URL: https://github.com/apache/flink/pull/19327#issuecomment-1085807864


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34111",
       "triggerID" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fdbd0e9998dea2b98133e3f4ec256a2b07d4b241 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34111) 
   * e77b43243436c1e21ffe5bb617dafd12d4454213 UNKNOWN
   





[GitHub] [flink] flinkbot edited a comment on pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19327:
URL: https://github.com/apache/flink/pull/19327#issuecomment-1085807864


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34111",
       "triggerID" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124",
       "triggerID" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124",
       "triggerID" : "1086820523",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * e77b43243436c1e21ffe5bb617dafd12d4454213 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124) 
   





[GitHub] [flink] XComp commented on a change in pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
XComp commented on a change in pull request #19327:
URL: https://github.com/apache/flink/pull/19327#discussion_r840713203



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java
##########
@@ -411,8 +410,7 @@ private static boolean isNotMarkedForDeletion(Stat stat) {
                         final RetrievableStateHandle<T> stateHandle = getAndLock(path);
                         stateHandles.add(new Tuple2<>(stateHandle, path));
                     } catch (NotExistException ignored) {
-                        // Concurrent deletion, retry
-                        continue retry;
+                        // entry is subject for deletion and can be ignored

Review comment:
       Good point about failing fast. I expanded the comment a bit to cover the two cases. 👍







[GitHub] [flink] flinkbot edited a comment on pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19327:
URL: https://github.com/apache/flink/pull/19327#issuecomment-1085807864


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34111",
       "triggerID" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124",
       "triggerID" : "e77b43243436c1e21ffe5bb617dafd12d4454213",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fdbd0e9998dea2b98133e3f4ec256a2b07d4b241 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34111) 
   * e77b43243436c1e21ffe5bb617dafd12d4454213 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34124) 
   





[GitHub] [flink] XComp commented on pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
XComp commented on pull request #19327:
URL: https://github.com/apache/flink/pull/19327#issuecomment-1086821026


   I created FLINK-27033 to cover the CI failure





[GitHub] [flink] XComp commented on pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
XComp commented on pull request #19327:
URL: https://github.com/apache/flink/pull/19327#issuecomment-1086820523


   @flinkbot run azure





[GitHub] [flink] flinkbot commented on pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #19327:
URL: https://github.com/apache/flink/pull/19327#issuecomment-1085807864


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "fdbd0e9998dea2b98133e3f4ec256a2b07d4b241",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fdbd0e9998dea2b98133e3f4ec256a2b07d4b241 UNKNOWN
   





[GitHub] [flink] dmvk commented on a change in pull request #19327: [FLINK-26987][runtime] Fixes getAllAndLock livelock

Posted by GitBox <gi...@apache.org>.
dmvk commented on a change in pull request #19327:
URL: https://github.com/apache/flink/pull/19327#discussion_r840574253



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java
##########
@@ -411,8 +410,7 @@ private static boolean isNotMarkedForDeletion(Stat stat) {
                         final RetrievableStateHandle<T> stateHandle = getAndLock(path);
                         stateHandles.add(new Tuple2<>(stateHandle, path));
                     } catch (NotExistException ignored) {
-                        // Concurrent deletion, retry
-                        continue retry;
+                        // entry is subject for deletion and can be ignored

Review comment:
       Just a note for understanding the previous behavior.
   
   This means either that the node is marked for deletion (we can't acquire the lock) or that it has already been deleted (concurrent deletion).
   
   The intention of the `goto retry` was to fail fast here, as we already knew that the `cversion` of the root node had changed. This assumption no longer holds, as the node could simply have been marked for deletion. If a concurrent deletion is happening, we'll still be able to catch it later on through the `cversion` check.
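
For illustration only, the control flow after this change can be sketched against an in-memory stand-in for ZooKeeper. `FakeStore`, its methods, and the string "handles" are hypothetical and not part of Flink; only the shape of the loop (skip entries that raise `NotExistException`, then compare the root's `cversion` before and after) follows the discussion above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GetAllAndLockSketch {

    /** Hypothetical stand-in for the store's "node does not exist" exception. */
    static class NotExistException extends Exception {}

    /** Hypothetical in-memory replacement for the root node and its children. */
    static class FakeStore {
        // path -> whether the entry is marked for deletion
        private final Map<String, Boolean> entries = new LinkedHashMap<>();
        // counts changes to the set of children, like the root node's cversion
        private int cversion = 0;

        void add(String path) {
            entries.put(path, false);
            cversion++;
        }

        // Marking does not add or remove a child, so cversion stays unchanged.
        void markForDeletion(String path) {
            entries.replace(path, true);
        }

        void delete(String path) {
            if (entries.remove(path) != null) {
                cversion++;
            }
        }

        int cversion() {
            return cversion;
        }

        List<String> children() {
            return new ArrayList<>(entries.keySet());
        }

        // Fails for entries that are gone or merely marked for deletion.
        String getAndLock(String path) throws NotExistException {
            Boolean marked = entries.get(path);
            if (marked == null || marked) {
                throw new NotExistException();
            }
            return "handle:" + path;
        }
    }

    static List<String> getAllAndLock(FakeStore store) {
        while (true) {
            List<String> handles = new ArrayList<>();
            int initialCversion = store.cversion();
            for (String path : store.children()) {
                try {
                    handles.add(store.getAndLock(path));
                } catch (NotExistException ignored) {
                    // Marked for deletion or already deleted: skip the entry.
                    // A real concurrent deletion is caught by the check below.
                }
            }
            // Retry only if the set of children changed while iterating.
            if (initialCversion == store.cversion()) {
                return handles;
            }
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.add("/a");
        store.add("/b");
        store.markForDeletion("/a"); // marked, but never actually deleted
        System.out.println(getAllAndLock(store)); // prints [handle:/b]
    }
}
```

In this sketch, restarting the loop from the `catch` block instead of skipping the entry would never terminate for the store built in `main`, because `/a` stays marked for deletion on every pass.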
   

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java
##########
@@ -411,8 +410,7 @@ private static boolean isNotMarkedForDeletion(Stat stat) {
                         final RetrievableStateHandle<T> stateHandle = getAndLock(path);
                         stateHandles.add(new Tuple2<>(stateHandle, path));
                     } catch (NotExistException ignored) {
-                        // Concurrent deletion, retry
-                        continue retry;
+                        // entry is subject for deletion and can be ignored

Review comment:
       Maybe it would be nice to expand the comment along these lines; it's rather difficult to understand why this can be ignored.



