You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/08/25 10:21:42 UTC

[GitHub] [flink] StephanEwen opened a new pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

StephanEwen opened a new pull request #13239:
URL: https://github.com/apache/flink/pull/13239


   ## What is the purpose of the change
   
   This is a series of cleanups and refactorings with the ultimate goal to prepare the `ResultPartition` interface for a possible switch to be *"record-oriented"* rather than *"buffer-oriented with pre-partitioned buffers"* as it is now.
   
   A *record-oriented* `ResultPartition` is more flexible and supports, for example, implementations like a sorted/clustered shuffle where records are gathered in a joint buffer, then sorted/clustered by partition and then written to a joint file. This in turn helps reduce memory and file requirements for large-scale  shuffles.
   
   
   ## Brief change log
   
     - 5ea80cb8051c87658a604588c41a83349f061f67 removes the behavior of the SpanningRecordSerializer to prune its internal serialization buffer under special circumstances.
       Previously, the buffer was pruned when:
         - The buffer becomes larger than a certain threshold (5MB)
         - The full record end lines up exactly with a full buffer length (this change got introduced at some point, it is not clear what the purpose is)
   
       This optimization virtually never kicks in (because of the second condition) and also is unnecessary. There is only a single serializer on the sender side, so this will not help to reduce the maximum memory footprint needed in any way.
   
     - e41b74d83c44abc9b0f4511156d896455ff7a9dc removes the `releaseMemory()` call in the `ResultSubpartition`. There is currently no implementation of this (it is only there because of past (and possible future) implementations)
   
       Future versions where memory may have to be released will quite possibly not implement that on a `ResultPartition` level, rather than on a subpartition level. For example, a sort based shuffle has the buffers on a `ResultPartition` level.
   
     - d7c17e269b28f562b0264fe1574894c2099f0fa6 removes the config option `taskmanager.network.partition.force-release-on-consumption`
   
       This option was a fallback/safeguard option introduced when the fine-grained failover for batch was added. With the fine-grained batch failover, the batch result partitions were no longer immediately cleaned up, but retained for recovery until the scheduler decided that they were no longer needed. The purpose of the flag was to restore the old "immediate cleanup" behavior should it be needed.
   
       Because the batch failover has proven stable and this fallback option can be removed now.
   
     - cd0374117bf0d5e5e2bf2172a02e2b302fcc3bcc introduces separate classes for different ResultPartition types
   
       Given that the partitions behave differently regarding checkpoints, releasing, etc. the code is cleaner separated by introducing dedicated classes for the `ResultPartition` based on the type.
       **This is also an important preparation to later have more different implementations, like sort-based shuffles.**
   
       _Important:_ These new classes will not override any performance critical methods (like adding a buffer to the result). They merely specialize certain behaviors around checkpointing and cleanup.
   
     - b36f28884f445d81f9903fbc8e930295130136be moves the unaligned checkpoint methods from `ResultPartition` to a separate interface.
   
       This improves the problem that `ResultPartitions` have the unaligned checkpointing methods, but some do not support checkpoints and throw an UnsupportedOperationException. This segregates the interfaces and puts the methods relating to unaligned checkpoints in a dedicated interface.
   
   
   ## Verifying this change
   
   This change is only refactoring and cleanup, it does not change behavior. The Unit Tests are adjusted.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): **no**
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no**
     - The serializers: **no**
     - The runtime per-record code paths (performance sensitive): **no**
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: **no**
     - The S3 file system connector: **no**
   
   ## Documentation
   
     - Does this pull request introduce a new feature? **no**
     - If yes, how is the feature documented? **not applicable**
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13239:
URL: https://github.com/apache/flink/pull/13239#issuecomment-679946229


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "af13efd2abeedd3f3d3a7db9e28d6a32456f8e52",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5853",
       "triggerID" : "af13efd2abeedd3f3d3a7db9e28d6a32456f8e52",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * af13efd2abeedd3f3d3a7db9e28d6a32456f8e52 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5853) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13239:
URL: https://github.com/apache/flink/pull/13239#issuecomment-679946229


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "af13efd2abeedd3f3d3a7db9e28d6a32456f8e52",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "af13efd2abeedd3f3d3a7db9e28d6a32456f8e52",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * af13efd2abeedd3f3d3a7db9e28d6a32456f8e52 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] StephanEwen commented on pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
StephanEwen commented on pull request #13239:
URL: https://github.com/apache/flink/pull/13239#issuecomment-682461355


   Thank you for the reviews.
   
   These are good points, let me fix those two issues.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] wsry commented on pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
wsry commented on pull request #13239:
URL: https://github.com/apache/flink/pull/13239#issuecomment-680884418


   @StephanEwen Thanks for cleanups and refactorings. These are really nice changes. The code LGTM. Only two small questions:
   1. Should we also remove ResultPartitionWriter#getSubpartition which was also introduced by unaligned checkpoint and it may be not necessary for sorted-resultpartition to implementation it.
   2. We removed ReleaseOnConsumptionResultPartition in commit d7c17e2 ([FLINK-19045][network] Remove obsolete 'taskmanager.network.partitionforce-release-on-consumption' option.) which make the commit can be compiled, should we remove it in commit cd03741 ([FLINK-19046][network] Introduce separate classes for ResultPartitionTypes)?
   
   Looking forward to the PR to be merged soon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] guoweiM commented on pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
guoweiM commented on pull request #13239:
URL: https://github.com/apache/flink/pull/13239#issuecomment-680893869


   Thanks @StephanEwen.  LGTM.
   Looking forward to the PR to be merged soon too. :)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13239:
URL: https://github.com/apache/flink/pull/13239#issuecomment-679946229


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "af13efd2abeedd3f3d3a7db9e28d6a32456f8e52",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5853",
       "triggerID" : "af13efd2abeedd3f3d3a7db9e28d6a32456f8e52",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fbf7c39bb2d6cb79f52a93efabe310cb2bafdee9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "fbf7c39bb2d6cb79f52a93efabe310cb2bafdee9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * af13efd2abeedd3f3d3a7db9e28d6a32456f8e52 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5853) 
   * fbf7c39bb2d6cb79f52a93efabe310cb2bafdee9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] wsry commented on a change in pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
wsry commented on a change in pull request #13239:
URL: https://github.com/apache/flink/pull/13239#discussion_r477304810



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/io/network/api/writer/RecordWriter.java
##########
@@ -120,9 +120,7 @@ protected void emit(T record, int targetChannel) throws IOException, Interrupted
 		serializer.serializeRecord(record);
 
 		// Make sure we don't hold onto the large intermediate serialization buffer for too long

Review comment:
       We may also remove this comment




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13239:
URL: https://github.com/apache/flink/pull/13239#issuecomment-679940706


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit af13efd2abeedd3f3d3a7db9e28d6a32456f8e52 (Tue Aug 25 10:24:36 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13239:
URL: https://github.com/apache/flink/pull/13239#issuecomment-679946229


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "af13efd2abeedd3f3d3a7db9e28d6a32456f8e52",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5853",
       "triggerID" : "af13efd2abeedd3f3d3a7db9e28d6a32456f8e52",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fbf7c39bb2d6cb79f52a93efabe310cb2bafdee9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5858",
       "triggerID" : "fbf7c39bb2d6cb79f52a93efabe310cb2bafdee9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fbf7c39bb2d6cb79f52a93efabe310cb2bafdee9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5858) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] wsry edited a comment on pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
wsry edited a comment on pull request #13239:
URL: https://github.com/apache/flink/pull/13239#issuecomment-680884418


   @StephanEwen Thanks for cleanups and refactorings. These are really nice changes. The code LGTM. Only two small questions:
   1. Should we also remove ResultPartitionWriter#getSubpartition which was also introduced by unaligned checkpoint and it may be not necessary for sorted-resultpartition to implementation it.
   2. We removed ReleaseOnConsumptionResultPartition in commit d7c17e2 ([FLINK-19045][network] Remove obsolete 'taskmanager.network.partitionforce-release-on-consumption' option.) which make the commit can not be compiled, should we remove it in commit cd03741 ([FLINK-19046][network] Introduce separate classes for ResultPartitionTypes)?
   
   Looking forward to the PR to be merged soon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13239:
URL: https://github.com/apache/flink/pull/13239#issuecomment-679946229


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "af13efd2abeedd3f3d3a7db9e28d6a32456f8e52",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5853",
       "triggerID" : "af13efd2abeedd3f3d3a7db9e28d6a32456f8e52",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fbf7c39bb2d6cb79f52a93efabe310cb2bafdee9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5858",
       "triggerID" : "fbf7c39bb2d6cb79f52a93efabe310cb2bafdee9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * af13efd2abeedd3f3d3a7db9e28d6a32456f8e52 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5853) 
   * fbf7c39bb2d6cb79f52a93efabe310cb2bafdee9 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5858) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] asfgit closed pull request #13239: [FLINK-19021][network] Various cleanups and refactorings of the ResultPartition

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #13239:
URL: https://github.com/apache/flink/pull/13239


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org