Posted to issues@flink.apache.org by "jeyhunkarimov (via GitHub)" <gi...@apache.org> on 2024/04/01 22:28:32 UTC

[PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

jeyhunkarimov opened a new pull request, #24600:
URL: https://github.com/apache/flink/pull/24600

   ## What is the purpose of the change
   
   Fix the OutOfMemoryError that occurs with large queries when `table.optimizer.dynamic-filtering.enabled` is set to `true`.
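   
   For context, a minimal sketch of toggling that option on a batch `TableEnvironment` (the option key is the real one; the setup around it is illustrative):
   
   ```
   import org.apache.flink.table.api.EnvironmentSettings;
   import org.apache.flink.table.api.TableEnvironment;
   
   public class DynamicFilteringSetup {
       public static void main(String[] args) {
           TableEnvironment tEnv =
                   TableEnvironment.create(EnvironmentSettings.inBatchMode());
           // Dynamic filtering is enabled by default; set explicitly for clarity.
           tEnv.getConfig().set("table.optimizer.dynamic-filtering.enabled", "true");
       }
   }
   ```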
   
   ## Brief change log
   
     - Fix the OOM cause in `DynamicPartitionPruningUtils`
     - Add tests
   
   
   ## Verifying this change
   
   Added the test `DynamicPartitionPruningProgramTest#testLargeQueryPlanShouldNotOutOfMemory`.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (no)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
     - The serializers: (no)
     - The runtime per-record code paths (performance sensitive): (no)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
     - The S3 file system connector: (no)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (no)
   




Re: [PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

Posted by "lsyldliu (via GitHub)" <gi...@apache.org>.
lsyldliu commented on code in PR #24600:
URL: https://github.com/apache/flink/pull/24600#discussion_r1579356007


##########
flink-table/flink-table-planner/src/test/java/org/apache/flink/table/planner/plan/optimize/program/DynamicPartitionPruningProgramTest.java:
##########
@@ -81,6 +87,42 @@ void setup() {
                                 + ")");
     }
 
+    @Test
+    void testLargeQueryPlanShouldNotOutOfMemory() {
+        // TABLE_OPTIMIZER_DYNAMIC_FILTERING_ENABLED is already enabled
+        List<String> strings = new ArrayList<>();
+        for (int i = 0; i < 100; i++) {
+            util.tableEnv()
+                    .executeSql(
+                            "CREATE TABLE IF NOT EXISTS table"
+                                    + i
+                                    + "(att STRING,filename STRING) "
+                                    + "with("
+                                    + "     'connector' = 'values', "
+                                    + "     'runtime-source' = 'NewSource', "
+                                    + "     'bounded' = 'true'"
+                                    + ")");
+            strings.add("select att,filename from table" + i);
+        }
+
+        final String countName = "CNM";
+        Table allUnionTable = util.tableEnv().sqlQuery(String.join(" UNION ALL ", strings));
+        Table res =

Review Comment:
   Could you rewrite this test to use pure SQL queries instead of the Table API?
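   
   For illustration, a sketch of what a pure-SQL tail could look like (the `all_union` view name and the aggregation are hypothetical guesses based on `countName = "CNM"`, not the PR's actual code):
   
   ```
   // Hypothetical: register the union as a view, then express the rest in SQL.
   util.tableEnv()
           .createTemporaryView(
                   "all_union",
                   util.tableEnv().sqlQuery(String.join(" UNION ALL ", strings)));
   Table res =
           util.tableEnv()
                   .sqlQuery(
                           "SELECT filename, COUNT(att) AS CNM FROM all_union GROUP BY filename");
   ```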



##########
flink-table/flink-table-planner/src/test/java/org/apache/flink/table/planner/plan/optimize/program/DynamicPartitionPruningProgramTest.java:
##########
@@ -81,6 +87,42 @@ void setup() {
                                 + ")");
     }
 
+    @Test
+    void testLargeQueryPlanShouldNotOutOfMemory() {
+        // TABLE_OPTIMIZER_DYNAMIC_FILTERING_ENABLED is already enabled
+        List<String> strings = new ArrayList<>();

Review Comment:
   strings -> subQueries?





Re: [PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

Posted by "jeyhunkarimov (via GitHub)" <gi...@apache.org>.
jeyhunkarimov commented on PR #24600:
URL: https://github.com/apache/flink/pull/24600#issuecomment-2081151799

   Hi @lsyldliu, thanks for the review. I addressed your comments. Could you please take another look when you have time? Thanks!




Re: [PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

Posted by "jeyhunkarimov (via GitHub)" <gi...@apache.org>.
jeyhunkarimov commented on code in PR #24600:
URL: https://github.com/apache/flink/pull/24600#discussion_r1581891157


##########
flink-table/flink-table-planner/src/test/java/org/apache/flink/table/planner/plan/optimize/program/DynamicPartitionPruningProgramTest.java:
##########
@@ -81,6 +87,42 @@ void setup() {
                                 + ")");
     }
 
+    @Test
+    void testLargeQueryPlanShouldNotOutOfMemory() {
+        // TABLE_OPTIMIZER_DYNAMIC_FILTERING_ENABLED is already enabled
+        List<String> strings = new ArrayList<>();

Review Comment:
   Nice catch, I just copy-pasted the code from the Jira issue. Fixed.



Re: [PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

Posted by "jeyhunkarimov (via GitHub)" <gi...@apache.org>.
jeyhunkarimov commented on code in PR #24600:
URL: https://github.com/apache/flink/pull/24600#discussion_r1601116454


##########
flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/utils/DynamicPartitionPruningUtils.java:
##########
@@ -236,6 +238,9 @@ private void setTables(ContextResolvedTable catalogTable) {
                 tables.add(catalogTable);
             } else {
                 for (ContextResolvedTable thisTable : new ArrayList<>(tables)) {
+                    if (tables.contains(catalogTable)) {

Review Comment:
   Hi @mumuhhh, thanks for the ping and your suggestion! I think you are right. Now that I have looked at it in more detail, in addition to your suggestion, I think we can also remove the first `if` check in the method. I filed the patch: https://github.com/apache/flink/pull/24788
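   
   For illustration, a sketch of where that could end up (assuming a `Set` keyed by `ObjectIdentifier`; the actual change is in #24788, and this is not its literal code):
   
   ```
   import java.util.HashSet;
   import java.util.Set;
   import org.apache.flink.table.catalog.ContextResolvedTable;
   import org.apache.flink.table.catalog.ObjectIdentifier;
   
   class DimSideCheckerSketch {
       // Hypothetical: key the set by identifier so Set semantics handle the dedup.
       private final Set<ObjectIdentifier> tableIdentifiers = new HashSet<>();
   
       void setTables(ContextResolvedTable catalogTable) {
           // add() is a no-op when the identifier is already present, so neither
           // the initial emptiness check nor the manual scan is needed.
           tableIdentifiers.add(catalogTable.getIdentifier());
       }
   }
   ```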




Re: [PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

Posted by "flinkbot (via GitHub)" <gi...@apache.org>.
flinkbot commented on PR #24600:
URL: https://github.com/apache/flink/pull/24600#issuecomment-2030689199

   ## CI report:
   
   * 06e59ca12ef6650b79e82fb513c47e53d90f052e UNKNOWN
   
   Bot commands: the @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build




Re: [PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

Posted by "lsyldliu (via GitHub)" <gi...@apache.org>.
lsyldliu commented on code in PR #24600:
URL: https://github.com/apache/flink/pull/24600#discussion_r1581993829


##########
flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/utils/DynamicPartitionPruningUtils.java:
##########
@@ -236,6 +238,9 @@ private void setTables(ContextResolvedTable catalogTable) {
                 tables.add(catalogTable);
             } else {
                 for (ContextResolvedTable thisTable : new ArrayList<>(tables)) {
+                    if (tables.contains(catalogTable)) {

Review Comment:
   I think we can use a boolean flag here, so we don't need to call the `contains` method on every iteration, which is O(N) time complexity.
   
   ```
                   boolean hasAdded = false;
                   for (ContextResolvedTable thisTable : new ArrayList<>(tables)) {
                       if (hasAdded) {
                           break;
                       }
                       if (!thisTable.getIdentifier().equals(catalogTable.getIdentifier())) {
                           tables.add(catalogTable);
                           hasAdded = true;
                       }
                   }
   ```





Re: [PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

Posted by "lsyldliu (via GitHub)" <gi...@apache.org>.
lsyldliu merged PR #24600:
URL: https://github.com/apache/flink/pull/24600




Re: [PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

Posted by "mumuhhh (via GitHub)" <gi...@apache.org>.
mumuhhh commented on code in PR #24600:
URL: https://github.com/apache/flink/pull/24600#discussion_r1600841072


##########
flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/utils/DynamicPartitionPruningUtils.java:
##########
@@ -115,7 +117,7 @@ private static class DppDimSideChecker {
         private final RelNode relNode;
         private boolean hasFilter;
         private boolean hasPartitionedScan;
-        private final List<ContextResolvedTable> tables = new ArrayList<>();
+        private final Set<ContextResolvedTable> tables = new HashSet<>();

Review Comment:
   Why is the traversal comparison written that way? I think it should look like this:
   ```
                   boolean hasAdded = false;
                   for (ContextResolvedTable thisTable : new ArrayList<>(tables)) {
                       if (thisTable.getIdentifier().equals(catalogTable.getIdentifier())) {
                           hasAdded = true;
                           break;
                       }
                   }
                   if (!hasAdded) {
                       tables.add(catalogTable);
                   }
   ```





Re: [PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

Posted by "lsyldliu (via GitHub)" <gi...@apache.org>.
lsyldliu commented on code in PR #24600:
URL: https://github.com/apache/flink/pull/24600#discussion_r1579386652


##########
flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/utils/DynamicPartitionPruningUtils.java:
##########
@@ -115,7 +117,7 @@ private static class DppDimSideChecker {
         private final RelNode relNode;
         private boolean hasFilter;
         private boolean hasPartitionedScan;
-        private final List<ContextResolvedTable> tables = new ArrayList<>();
+        private final Set<ContextResolvedTable> tables = new HashSet<>();

Review Comment:
   While we are at it, I think we can optimize this for loop to reduce the time complexity: if the `catalogTable` has already been added to the `tables` collection, we can exit the loop early without doing the subsequent comparisons.





Re: [PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

Posted by "jeyhunkarimov (via GitHub)" <gi...@apache.org>.
jeyhunkarimov commented on code in PR #24600:
URL: https://github.com/apache/flink/pull/24600#discussion_r1546924713


##########
flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/utils/DynamicPartitionPruningUtils.java:
##########
@@ -115,7 +117,7 @@ private static class DppDimSideChecker {
         private final RelNode relNode;
         private boolean hasFilter;
         private boolean hasPartitionedScan;
-        private final List<ContextResolvedTable> tables = new ArrayList<>();
+        private final Set<ContextResolvedTable> tables = new HashSet<>();

Review Comment:
   The OOM happens because of
   ```
   for (ContextResolvedTable thisTable : new ArrayList<>(tables)) {
       if (!thisTable.getIdentifier().equals(catalogTable.getIdentifier())) {
           tables.add(catalogTable);
       }
   }
   ```
   
   in the `setTables` method. That is, `tables.add` is called once per non-matching entry, without checking whether `tables` already contains the `catalogTable`, so the same table is added over and over and the collection grows without bound.
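   
   To make the growth concrete, a standalone sketch of the same pattern with plain strings (not Flink code): each call adds the incoming value once per non-matching existing entry, so the list doubles on every call.
   
   ```
   import java.util.ArrayList;
   import java.util.List;
   
   public class UnboundedGrowthDemo {
       public static void main(String[] args) {
           List<String> tables = new ArrayList<>();
           tables.add("table0");
           for (int i = 1; i <= 20; i++) {
               String incoming = "table" + i;
               // Buggy pattern: adds once per non-matching entry, not once overall.
               for (String existing : new ArrayList<>(tables)) {
                   if (!existing.equals(incoming)) {
                       tables.add(incoming);
                   }
               }
               System.out.println("call " + i + " -> size " + tables.size());
           }
           // The size doubles on each call (~2^20 entries after 20 calls), which
           // is how a query plan with many scans can exhaust the heap.
       }
   }
   ```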





Re: [PR] [FLINK-34379][table] Fix OutOfMemoryError with large queries [flink]

Posted by "mumuhhh (via GitHub)" <gi...@apache.org>.
mumuhhh commented on code in PR #24600:
URL: https://github.com/apache/flink/pull/24600#discussion_r1600843684


##########
flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/utils/DynamicPartitionPruningUtils.java:
##########
@@ -236,6 +238,9 @@ private void setTables(ContextResolvedTable catalogTable) {
                 tables.add(catalogTable);
             } else {
                 for (ContextResolvedTable thisTable : new ArrayList<>(tables)) {
+                    if (tables.contains(catalogTable)) {

Review Comment:
   > I think we can use a boolean flag here, so we don't need to call the `contains` method on every iteration, which is O(N) time complexity.
   > 
   > ```
   >                 boolean hasAdded = false;
   >                 for (ContextResolvedTable thisTable : new ArrayList<>(tables)) {
   >                     if (hasAdded) {
   >                         break;
   >                     }
   >                     if (!thisTable.getIdentifier().equals(catalogTable.getIdentifier())) {
   >                         tables.add(catalogTable);
   >                         hasAdded = true;
   >                     }
   >                 }
   > ```
   
   I think we should modify the traversal logic instead: the version above adds `catalogTable` as soon as the first entry has a different identifier, even if a matching entry appears later in the list. The membership check and the add need to be separated:
   ```
                   boolean hasAdded = false;
                   for (ContextResolvedTable thisTable : new ArrayList<>(tables)) {
                       if (thisTable.getIdentifier().equals(catalogTable.getIdentifier())) {
                           hasAdded = true;
                           break;
                       }
                   }
                   if (!hasAdded) {
                       tables.add(catalogTable);
                   }
   ```
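   
   As a side note, the same scan can also be written with a stream; a sketch of an equivalent fragment (not part of the PR):
   
   ```
   boolean alreadyPresent =
           tables.stream()
                   .anyMatch(t -> t.getIdentifier().equals(catalogTable.getIdentifier()));
   if (!alreadyPresent) {
       tables.add(catalogTable);
   }
   ```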


