You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/11/01 06:00:39 UTC

[GitHub] [flink] reswqa opened a new pull request, #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

reswqa opened a new pull request, #21209:
URL: https://github.com/apache/flink/pull/21209

   
   ## What is the purpose of the change
   
   *Fix the unstable test HsResultPartitionTest.testAvailability*
   
   
   ## Brief change log
   
     - *Fix an incorrect test case in NetworkBufferTest*
     - *Calculate the release number instead of the survival number to avoid failing to release enough memory for full spilling strategy.*
     - *fix the unstable test HsResultPartitionTest.testAvailability*
   
   
   ## Verifying this change
   
   This change is already covered by existing tests*.
   
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] reswqa commented on a diff in pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
reswqa commented on code in PR #21209:
URL: https://github.com/apache/flink/pull/21209#discussion_r1014315054


##########
flink-runtime/src/test/java/org/apache/flink/runtime/io/network/partition/hybrid/HsFullSpillingStrategyTest.java:
##########
@@ -149,9 +147,9 @@ void testDecideActionWithGlobalInfo() {
 
         Map<Integer, List<BufferIndexAndChannel>> expectedReleaseBuffers = new HashMap<>();
         expectedReleaseBuffers.put(
-                subpartition1, new ArrayList<>(subpartitionBuffers1.subList(0, 2)));
+                subpartition1, new ArrayList<>(subpartitionBuffers1.subList(0, 3)));
         expectedReleaseBuffers.put(
-                subpartition2, new ArrayList<>(subpartitionBuffers2.subList(1, 3)));
+                subpartition2, new ArrayList<>(subpartitionBuffers2.subList(1, 4)));

Review Comment:
   > And please check createBufferIndexAndChannelsList. It's creating memory segment and network buffer for nothing.
   good catch, a previous refactoring changed `BufferWithIdentity` into `BufferIndexAndChannel`, so there is no need to create a real buffer. Unfortunately, I forgot to remove the useless code here.



##########
flink-runtime/src/test/java/org/apache/flink/runtime/io/network/partition/hybrid/HsFullSpillingStrategyTest.java:
##########
@@ -149,9 +147,9 @@ void testDecideActionWithGlobalInfo() {
 
         Map<Integer, List<BufferIndexAndChannel>> expectedReleaseBuffers = new HashMap<>();
         expectedReleaseBuffers.put(
-                subpartition1, new ArrayList<>(subpartitionBuffers1.subList(0, 2)));
+                subpartition1, new ArrayList<>(subpartitionBuffers1.subList(0, 3)));
         expectedReleaseBuffers.put(
-                subpartition2, new ArrayList<>(subpartitionBuffers2.subList(1, 3)));
+                subpartition2, new ArrayList<>(subpartitionBuffers2.subList(1, 4)));

Review Comment:
   > And please check createBufferIndexAndChannelsList. It's creating memory segment and network buffer for nothing.
   
   good catch, a previous refactoring changed `BufferWithIdentity` into `BufferIndexAndChannel`, so there is no need to create a real buffer. Unfortunately, I forgot to remove the useless code here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] pnowojski commented on pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
pnowojski commented on PR #21209:
URL: https://github.com/apache/flink/pull/21209#issuecomment-1303250441

   @xintongsong , could you take a look so that we can merge it ASAP? It's one of the tests that's failing master constantly. If not, let's revert FLINK-28889.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] reswqa commented on pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
reswqa commented on PR #21209:
URL: https://github.com/apache/flink/pull/21209#issuecomment-1298391990

   @flinkbot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] xintongsong commented on pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
xintongsong commented on PR #21209:
URL: https://github.com/apache/flink/pull/21209#issuecomment-1303535913

   @pnowojski,
   My bad. I thought this was an instability rather than a constant failure. Checking this right now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] reswqa commented on pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
reswqa commented on PR #21209:
URL: https://github.com/apache/flink/pull/21209#issuecomment-1298440580

   @flinkbot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] pnowojski commented on pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
pnowojski commented on PR #21209:
URL: https://github.com/apache/flink/pull/21209#issuecomment-1303539339

   I'm not sure if it's a constant failure, but I've had it twice in a row on the same PR, so it's pretty frequent.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] xintongsong commented on a diff in pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
xintongsong commented on code in PR #21209:
URL: https://github.com/apache/flink/pull/21209#discussion_r1014058947


##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/hybrid/HsFullSpillingStrategy.java:
##########
@@ -128,28 +128,32 @@ private void checkRelease(
             return;
         }
 
-        int survivedNum = (int) (poolSize - poolSize * releaseBufferRatio);
+        int releaseNum = (int) (poolSize * releaseBufferRatio);
         int numSubpartitions = spillingInfoProvider.getNumSubpartitions();
-        int subpartitionSurvivedNum = survivedNum / numSubpartitions;
-
+        int expectedSubpartitionReleaseNum = releaseNum / numSubpartitions;
         TreeMap<Integer, Deque<BufferIndexAndChannel>> bufferToRelease = new TreeMap<>();
 
         for (int subpartitionId = 0; subpartitionId < numSubpartitions; subpartitionId++) {
             Deque<BufferIndexAndChannel> buffersInOrder =
                     spillingInfoProvider.getBuffersInOrder(
                             subpartitionId, SpillStatus.SPILL, ConsumeStatusWithId.ALL_ANY);
-            // if the number of subpartition buffers less than survived buffers, reserved all of
-            // them.
-            int releaseNum = Math.max(0, buffersInOrder.size() - subpartitionSurvivedNum);
-            while (releaseNum-- != 0) {
+            // if the number of subpartition spilling buffers less than expected release number,
+            // release all of  them.
+            int subpartitionReleaseNum =
+                    Math.min(buffersInOrder.size(), expectedSubpartitionReleaseNum);
+            int subpartitionSurvivedNum = buffersInOrder.size() - subpartitionReleaseNum;
+            while (subpartitionSurvivedNum-- != 0) {
                 buffersInOrder.pollLast();
             }
             bufferToRelease.put(subpartitionId, buffersInOrder);
         }
 
         // collect results in order
         for (int i = 0; i < numSubpartitions; i++) {
-            builder.addBufferToRelease(i, bufferToRelease.getOrDefault(i, new ArrayDeque<>()));
+            Deque<BufferIndexAndChannel> bufferIndexAndChannels = bufferToRelease.get(i);
+            if (bufferIndexAndChannels != null && !bufferIndexAndChannels.isEmpty()) {
+                builder.addBufferToRelease(i, bufferToRelease.getOrDefault(i, new ArrayDeque<>()));
+            }

Review Comment:
   I don't quite get the differences between calculating the survive vs. release numbers. What's the purpose?
   
   It seems before this change, there's a bug that we are releasing number of survive buffers.



##########
flink-runtime/src/test/java/org/apache/flink/runtime/io/network/partition/hybrid/HsFullSpillingStrategyTest.java:
##########
@@ -149,9 +147,9 @@ void testDecideActionWithGlobalInfo() {
 
         Map<Integer, List<BufferIndexAndChannel>> expectedReleaseBuffers = new HashMap<>();
         expectedReleaseBuffers.put(
-                subpartition1, new ArrayList<>(subpartitionBuffers1.subList(0, 2)));
+                subpartition1, new ArrayList<>(subpartitionBuffers1.subList(0, 3)));
         expectedReleaseBuffers.put(
-                subpartition2, new ArrayList<>(subpartitionBuffers2.subList(1, 3)));
+                subpartition2, new ArrayList<>(subpartitionBuffers2.subList(1, 4)));

Review Comment:
   And please check `createBufferIndexAndChannelsList`. It's creating memory segment and network buffer for nothing.



##########
flink-runtime/src/test/java/org/apache/flink/runtime/io/network/partition/hybrid/HsFullSpillingStrategyTest.java:
##########
@@ -149,9 +147,9 @@ void testDecideActionWithGlobalInfo() {
 
         Map<Integer, List<BufferIndexAndChannel>> expectedReleaseBuffers = new HashMap<>();
         expectedReleaseBuffers.put(
-                subpartition1, new ArrayList<>(subpartitionBuffers1.subList(0, 2)));
+                subpartition1, new ArrayList<>(subpartitionBuffers1.subList(0, 3)));
         expectedReleaseBuffers.put(
-                subpartition2, new ArrayList<>(subpartitionBuffers2.subList(1, 3)));
+                subpartition2, new ArrayList<>(subpartitionBuffers2.subList(1, 4)));

Review Comment:
   For this test case, I think we no longer need the progresses.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] reswqa commented on pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
reswqa commented on PR #21209:
URL: https://github.com/apache/flink/pull/21209#issuecomment-1303943003

   @xintongsong Thanks for the review.
   > I tried the fix without the questioned hotfix commit. Unfortunately, it doesn't work. 
   
   As you said, there's a bug that we are releasing number of survive buffers, so even if the third commit set the `releaseBufferRatio` to 0, the buffer that has been spilled will still be released, so the test will may still fail.
   
   >I don't quite get the differences between calculating the survive vs. release numbers. What's the purpose?
   
   The change lies in the algorithm for surviving buffers. The previous algorithm will make the number of actual release buffers smaller than the expected value, thus causing the release operation to not release enough space in time. As a result, subsequent buffer request will frequently reach the threshold of `onMemoryUsageChanged`, causing a lot of overhead.
   
   For full spilling strategy, we set a ratio (0.4 by default) and we exactly hope that triggering the release will drop 40% of the memory space, but the previous algorithm does not meet this requirement. Because we have a precondition: only the buffers in the `spill` state can be released, which makes it easy to see that the total number of buffers that can be released (e.g. `buffersInOrder. size()`) is smaller than the expected number of surviving buffers, thus causing no buffer to be released.
   The new version will first calculate the number of buffers to be released. If the `expectedSubpartitionReleaseNum` is not 0 and the current subpartition has buffers that can be released, then some buffers must be released.So I changed the algorithm here.
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] flinkbot commented on pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
flinkbot commented on PR #21209:
URL: https://github.com/apache/flink/pull/21209#issuecomment-1298070041

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "028d3672b5ffff53885c08b327f3638241101b03",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "028d3672b5ffff53885c08b327f3638241101b03",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 028d3672b5ffff53885c08b327f3638241101b03 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] xintongsong closed pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
xintongsong closed pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability
URL: https://github.com/apache/flink/pull/21209


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] MartijnVisser commented on pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on PR #21209:
URL: https://github.com/apache/flink/pull/21209#issuecomment-1314990609

   @reswqa @xintongsong What's the status of this PR, given that it's linked a Blocker Jira?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] reswqa commented on pull request #21209: [FLINK-29818] Fix the unstable test HsResultPartitionTest.testAvailability

Posted by GitBox <gi...@apache.org>.
reswqa commented on PR #21209:
URL: https://github.com/apache/flink/pull/21209#issuecomment-1315028915

   @MartijnVisser This PR is still in review. But I think it is no longer a blocker as this test has been temporarily disabled. We can keep this Jira open, but lower its level, and close it after the PR is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org