Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/03/11 14:38:45 UTC

[GitHub] [flink] rkhachatryan opened a new pull request #19062: [FLINK-26592][state/changelog] Use mailbox in FsStateChangelogWriter instead of a lock

rkhachatryan opened a new pull request #19062:
URL: https://github.com/apache/flink/pull/19062


   This is an alternative version of #19050 that solves the problem using `MailboxExecutor`.
   
   ## What is the purpose of the change
   
   ```
   When a task thread tries to schedule an upload, it might wait for available capacity.
   Capacity is released by the uploading thread on upload completion. After releasing,
   it must notify the task thread about the completion.
   Both the task thread and the uploading thread acquire FsStateChangelogWriter.lock. That
   causes a deadlock if the uploader releases capacity insufficient for the task thread to proceed.
   
   This change removes the lock and makes the uploader thread use mailbox actions.
   ```
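
To make the pattern concrete, here is a minimal sketch of the idea, not the actual Flink code. It assumes a simplified model in which capacity has its own small monitor and the completion callback is queued for the task thread instead of taking a writer lock. Every name in it (`MailboxSketch`, `reserve`, `onUploadCompleted`, `drainMailbox`) is hypothetical, and a plain `BlockingQueue` stands in for Flink's `MailboxExecutor`.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical, simplified model; none of these names come from Flink. */
class MailboxSketch {
    // Stand-in for the task mailbox: queued actions run later on the task thread.
    private final BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
    // Capacity has its own small monitor, separate from any writer state.
    private final Object capacityMonitor = new Object();
    private long availableBytes;  // guarded by capacityMonitor
    private int confirmedUploads; // touched only by the task thread, so no lock

    MailboxSketch(long capacityBytes) {
        this.availableBytes = capacityBytes;
    }

    /** Task thread: blocks until at least some capacity is available. */
    void reserve(long bytes) throws InterruptedException {
        synchronized (capacityMonitor) {
            while (availableBytes <= 0) {
                capacityMonitor.wait();
            }
            availableBytes -= bytes;
        }
    }

    /** Uploader thread: releases capacity, then enqueues the notification. */
    void onUploadCompleted(long bytes) {
        synchronized (capacityMonitor) {
            availableBytes += bytes;
            capacityMonitor.notifyAll();
        }
        // No writer lock is taken here; the callback is deferred to the task thread.
        mailbox.add(() -> confirmedUploads++);
    }

    /** Task thread: runs whatever the uploader has queued so far. */
    int drainMailbox() {
        Runnable action;
        while ((action = mailbox.poll()) != null) {
            action.run();
        }
        return confirmedUploads;
    }

    /** Runs one upload round trip and returns the number of confirmed uploads. */
    static int demo() {
        MailboxSketch writer = new MailboxSketch(10);
        try {
            writer.reserve(10); // capacity is now exhausted
            Thread uploader = new Thread(() -> writer.onUploadCompleted(10));
            uploader.start();
            writer.reserve(10); // waits until the uploader releases capacity
            uploader.join();    // ensures the notification has been enqueued
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return writer.drainMailbox();
    }

    public static void main(String[] args) {
        System.out.println("confirmed uploads: " + demo());
    }
}
```

Because the uploader never needs a lock that the waiting task thread might hold, the second `reserve` call can always be woken up. The writer's bookkeeping is updated later, on the task thread, when the mailbox is drained.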
   
   ## Verifying this change
   
   `FsStateChangelogStorageTest.testDeadlockOnUploadCompletion`
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] curcur edited a comment on pull request #19062: [FLINK-26592][state/changelog] Use mailbox in FsStateChangelogWriter instead of a lock

Posted by GitBox <gi...@apache.org>.
curcur edited a comment on pull request #19062:
URL: https://github.com/apache/flink/pull/19062#issuecomment-1073638054


   I am fine with the fix. The deadlock happens when a task thread waits for enough capacity, but FsStateChangelogWriter#handleUploadSuccess can never get the lock (held by the task thread waiting for enough capacity). The fix is straightforward.
   
   But @rkhachatryan , please take a look at my previous comment.





[GitHub] [flink] rkhachatryan merged pull request #19062: [FLINK-26592][state/changelog] Use mailbox in FsStateChangelogWriter instead of a lock

Posted by GitBox <gi...@apache.org>.
rkhachatryan merged pull request #19062:
URL: https://github.com/apache/flink/pull/19062


   





[GitHub] [flink] curcur commented on pull request #19062: [FLINK-26592][state/changelog] Use mailbox in FsStateChangelogWriter instead of a lock

Posted by GitBox <gi...@apache.org>.
curcur commented on pull request #19062:
URL: https://github.com/apache/flink/pull/19062#issuecomment-1073473660


   Hey Roman, I am a bit confused by the logic of `BatchingStateChangeUploadScheduler#upload` (sorry, this is not quite related to this fix, but it does relate to why the deadlock happens).
   
   Before the task thread schedules an `UploadTask`, it requires `uploadThrottle` to have **some** capacity to upload. But regardless of whether it has **enough** capacity (to upload the full size of the `UploadTask`), the `UploadTask` is scheduled anyway. Then in `BatchingStateChangeUploadScheduler#scheduleUploadIfNeeded`, the scheduled uploading is canceled if not having enough capacity and goes into the retrying logic.
   
   My question is why not before scheduling an upload task, just make sure it has **enough** capacity? 
   
   
   





[GitHub] [flink] curcur commented on pull request #19062: [FLINK-26592][state/changelog] Use mailbox in FsStateChangelogWriter instead of a lock

Posted by GitBox <gi...@apache.org>.
curcur commented on pull request #19062:
URL: https://github.com/apache/flink/pull/19062#issuecomment-1073638054


   I am fine with the fix. The deadlock happens when a task thread waits for enough capacity, but FsStateChangelogWriter#handleUploadSuccess can never get the lock (held by the task thread waiting for enough capacity).
   
   But @rkhachatryan , please take a look at my previous comment.





[GitHub] [flink] flinkbot edited a comment on pull request #19062: [FLINK-26592][state/changelog] Use mailbox in FsStateChangelogWriter instead of a lock

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19062:
URL: https://github.com/apache/flink/pull/19062#issuecomment-1065178629


   ## CI report:
   
   * bf5eb61e8156db51c58b7c65b2775403c0bf2d85 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=32924) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] rkhachatryan commented on pull request #19062: [FLINK-26592][state/changelog] Use mailbox in FsStateChangelogWriter instead of a lock

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on pull request #19062:
URL: https://github.com/apache/flink/pull/19062#issuecomment-1073704791


   Thanks for the review @curcur, 
   
   I think your [question](https://github.com/apache/flink/pull/19062#issuecomment-1073473660) is not directly related to this PR.
   I'll try to answer it, but let's move the discussion to a separate ticket or offline to unblock the fix.
   > Then in BatchingStateChangeUploadScheduler#scheduleUploadIfNeeded, the scheduled uploading is canceled if not having enough capacity
   
   I think you misread the code: `scheduleUploadIfNeeded` doesn't check the capacity; it checks the thresholds, and if they are reached, the upload is scheduled immediately instead of waiting for `scheduleDelayMs`.
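
As a rough illustration of that decision, and not a copy of the actual method, the sketch below reduces it to a single function. It assumes there is only one size threshold; `ScheduleDecisionSketch`, `delayBeforeUploadMs`, `pendingBytes`, and `sizeThresholdBytes` are hypothetical names, and only `scheduleDelayMs` is taken from the comment above.

```java
/** Hypothetical sketch; only scheduleDelayMs is named in the discussion. */
class ScheduleDecisionSketch {

    /** Returns how long to wait before uploading the pending batch. */
    static long delayBeforeUploadMs(
            long pendingBytes, long sizeThresholdBytes, long scheduleDelayMs) {
        // Threshold reached: upload right away.
        // Otherwise keep batching for the configured delay.
        return pendingBytes >= sizeThresholdBytes ? 0L : scheduleDelayMs;
    }

    public static void main(String[] args) {
        // Small batch: wait for the delay so more changes can be batched.
        System.out.println(delayBeforeUploadMs(10, 100, 500));
        // Threshold reached: no delay.
        System.out.println(delayBeforeUploadMs(100, 100, 500));
    }
}
```

Note that capacity does not appear anywhere in this decision; it only affects whether the upload may start, not when it is scheduled.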
   
   >  and goes into the retrying logic.
   
   Uploads **always** go through the retry logic, but inside `drainAndSave`.
   
   > My question is why not before scheduling an upload task, just make sure it has enough capacity?
   
   Capacity **is** checked before starting an upload (and not afterwards).
   However, it is enough to have at least **some** capacity to proceed; otherwise, an upload that is too big would never start.
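
The difference between the two admission rules can be sketched as follows. This is an illustration under assumptions, not Flink's throttle: `SomeCapacityThrottle` and its methods are hypothetical, and the byte limits are made up.

```java
/** Hypothetical throttle; the names and limits are illustrative, not Flink's. */
class SomeCapacityThrottle {
    private final long maxBytesInFlight;
    private long inFlightBytes;

    SomeCapacityThrottle(long maxBytesInFlight) {
        this.maxBytesInFlight = maxBytesInFlight;
    }

    /** Admits an upload whenever any capacity is left, even if it overshoots. */
    boolean tryStart(long uploadBytes) {
        if (inFlightBytes >= maxBytesInFlight) {
            return false;
        }
        inFlightBytes += uploadBytes;
        return true;
    }

    /** Alternative rule: requires room for the whole upload. */
    boolean wouldFitEntirely(long uploadBytes) {
        return uploadBytes <= maxBytesInFlight - inFlightBytes;
    }

    /** Returns the bytes of a finished upload to the pool. */
    void complete(long uploadBytes) {
        inFlightBytes -= uploadBytes;
    }

    public static void main(String[] args) {
        SomeCapacityThrottle throttle = new SomeCapacityThrottle(100);
        System.out.println(throttle.wouldFitEntirely(250)); // false even when idle
        System.out.println(throttle.tryStart(250));         // true: some capacity is free
        System.out.println(throttle.tryStart(1));           // false: limit exceeded
        throttle.complete(250);
        System.out.println(throttle.tryStart(1));           // true again
    }
}
```

With a limit of 100 bytes, a 250-byte upload never satisfies the "enough capacity" rule, even on an idle throttle, while the "some capacity" rule lets it start and then blocks further uploads until it completes.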





[GitHub] [flink] flinkbot edited a comment on pull request #19062: [FLINK-26592][state/changelog] Use mailbox in FsStateChangelogWriter instead of a lock

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19062:
URL: https://github.com/apache/flink/pull/19062#issuecomment-1065178629


   ## CI report:
   
   * bf5eb61e8156db51c58b7c65b2775403c0bf2d85 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=32924) 
   





[GitHub] [flink] flinkbot commented on pull request #19062: [FLINK-26592][state/changelog] Use mailbox in FsStateChangelogWriter instead of a lock

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #19062:
URL: https://github.com/apache/flink/pull/19062#issuecomment-1065178629


   ## CI report:
   
   * bf5eb61e8156db51c58b7c65b2775403c0bf2d85 UNKNOWN
   

