You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/10/17 06:56:49 UTC

[GitHub] [flink] zhougit86 opened a new pull request, #21080: [FLINK-29545][runtime] add netty idle state handler

zhougit86 opened a new pull request, #21080:
URL: https://github.com/apache/flink/pull/21080

   
   
   ## What is the purpose of the change
   
   detect network connection problem
   
   
   ## Brief change log
   
   add an idle state handler for netty protocol
   
   
   ## Verifying this change
   
   Please make sure both new and modified tests in this PR follows the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
   
   
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): ( no)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
     - The serializers: (no )
     - The runtime per-record code paths (performance sensitive): (no)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
     - The S3 file system connector: (no )
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes)
     - If yes, how is the feature documented? (not applicable)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink] zhougit86 closed pull request #21080: [FLINK-29545][runtime] add netty idle state handler

Posted by "zhougit86 (via GitHub)" <gi...@apache.org>.

zhougit86 closed pull request #21080: [FLINK-29545][runtime] add netty idle state handler
URL: https://github.com/apache/flink/pull/21080


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink] pnowojski commented on pull request #21080: [FLINK-29545][runtime] add netty idle state handler

Posted by GitBox <gi...@apache.org>.

pnowojski commented on PR #21080:
URL: https://github.com/apache/flink/pull/21080#issuecomment-1295072883

   Hi @zhougit86 , I'm still not sure what's the value of this proposed improvement/fix. It:
   1. adds a bit of complexity to the already pretty complex system
   2. adds a new publicly exposed configuration option, that very very few people will actually know that it exists and even fewer how to use, while we would be forced to maintain forever
   
   Just in order to add built in Flink an unhealthy network detector and print potentially false warning? 🤔 Note that our netty handlers are not non blocking. In some cases, we are doing blocking I/O calls from the netty threads, so the heartbeats can be blocked for non negligible amount of time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink] zhougit86 commented on pull request #21080: [FLINK-29545][runtime] add netty idle state handler

Posted by GitBox <gi...@apache.org>.

zhougit86 commented on PR #21080:
URL: https://github.com/apache/flink/pull/21080#issuecomment-1312507297

   Looks like some discussion has taken place before. I think my case is a little similar to the issue I attached. And we got the root cause in our company is our cloud provider has some HW issue my cause the network packet consistent loss. 
   
   Anyway , I understand this is a rare case. maybe someday in the future we want to work on this situation, let me take the ticket


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink] flinkbot commented on pull request #21080: [FLINK-29545][runtime] add netty idle state handler

Posted by GitBox <gi...@apache.org>.

flinkbot commented on PR #21080:
URL: https://github.com/apache/flink/pull/21080#issuecomment-1280382081

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "a4e5cd5f7677788edfbcc9d065bd62aac25373b7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a4e5cd5f7677788edfbcc9d065bd62aac25373b7",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a4e5cd5f7677788edfbcc9d065bd62aac25373b7 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink] zhougit86 commented on pull request #21080: [FLINK-29545][runtime] add netty idle state handler

Posted by GitBox <gi...@apache.org>.

zhougit86 commented on PR #21080:
URL: https://github.com/apache/flink/pull/21080#issuecomment-1288691163

   > Could you elaborate more on the motivation behind this? I'm not sure how useful is the idle information provided by this PR. From one hand, if there is some data waiting to be sent, and it is not being sent, that's clearly visible via a number of metrics (backpressured status, number of bytes sent, queues lengths etc). So this is a bit redundant. On the other hand, there can be many different reasons behind this timeout being triggered, like for example:
   > 
   > * idling operator not producing any data
   > * operator aggregating for a longer period of time (window)
   > * filtering out all of the records
   > * operator busy doing some very heavy work for a long period
   > * sorted shuffle service
   > * some unhealthy JVM/TM state (long GC pauses, memory swapping, long blocking IO)
   > 
   > All of the above would produce a false warning that would be misleading.
   
   Hi Master:
   
   I have updated this PR, I send heartbeat periodically in the netty client side. Which can avoid misalarm when situation you listed above happen. And trust me, in our k8s environment, we can discover the WriteAndFlushNextMessageIfPossibleListener success for a lot of times and the netty server idle handler still give the idle alarm.
   
   Please help review, thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink] zhougit86 commented on pull request #21080: [FLINK-29545][runtime] add netty idle state handler

Posted by GitBox <gi...@apache.org>.

zhougit86 commented on PR #21080:
URL: https://github.com/apache/flink/pull/21080#issuecomment-1281021051

   
   > Could you elaborate more on the motivation behind this? I'm not sure how useful is the idle information provided by this PR. From one hand, if there is some data waiting to be sent, and it is not being sent, that's clearly visible via a number of metrics (backpressured status, number of bytes sent, queues lengths etc). So this is a bit redundant. On the other hand, there can be many different reasons behind this timeout being triggered, like for example:
   > 
   > * idling operator not producing any data
   > * operator aggregating for a longer period of time (window)
   > * filtering out all of the records
   > * operator busy doing some very heavy work for a long period
   > * sorted shuffle service
   > * some unhealthy JVM/TM state (long GC pauses, memory swapping, long blocking IO)
   > 
   > All of the above would produce a false warning that would be misleading.
   
   Hi Master, I have modified the commit a little bit. the client side will send a heartbeat frame, if it detects there is no packet was sent within some time.
   
   thus the server can detect the network related issue, even the business flow is stopped for a while. 
   
   And this commit is to detect the consuming stop issue haunted us for long. But with this commit, the consuming could stop immediately.... It would be very kind of you, if you can just only take a few mins review it and give suggestions, thus we could use this in our own env quickly.
   
   thanks a lot for your time, it will be valued!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org