You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "XComp (via GitHub)" <gi...@apache.org> on 2023/03/01 09:46:30 UTC

[GitHub] [flink] XComp opened a new pull request, #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package

XComp opened a new pull request, #22052:
URL: https://github.com/apache/flink/pull/22052

   ## What is the purpose of the change
   
   This is meant as a temporary change to support investigating the OOM in flink-core.
   
   ## Brief change log
   
   It overrides the surefire plugin configuration for `flink-core` enabling sequential execution of the unit tests with a new JVM process being instantiated per process. Additionally, the heapdump generation is enabled.
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] XComp commented on a diff in pull request #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package

Posted by "XComp (via GitHub)" <gi...@apache.org>.
XComp commented on code in PR #22052:
URL: https://github.com/apache/flink/pull/22052#discussion_r1121429475


##########
flink-core/pom.xml:
##########
@@ -146,6 +146,30 @@ under the License.
 				</configuration>
 			</plugin>
 
+			<!-- The following configuration is used to debug https://issues.apache.org/jira/browse/FLINK-31278.
+			Overriding the surefire config can be removed after FLINK-31278 is resolved. -->
+			<plugin>
+				<groupId>org.apache.maven.plugins</groupId>
+				<artifactId>maven-surefire-plugin</artifactId>
+				<executions>
+					<execution>
+						<id>default-tests</id>
+						<phase>test</phase>
+						<goals>
+							<goal>test</goal>
+						</goals>
+						<configuration>
+							<includes>
+								<include>${test.unit.pattern}</include>
+							</includes>
+							<argLine>${flink.surefire.baseArgLine} -Xmx${flink.XmxUnitTest} -XX:+HeapDumpOnOutOfMemoryError</argLine>

Review Comment:
   @zentol I feel like I'm missing something here: Isn't `-XX:+HeapDumpOnOutOfMemoryError` a good thing to have in general in the surefire configuration? I couldn't find anywhere that this is enabled by default. But I cannot imagine that we haven't considered it before with all the OOMs we had to deal with in the past.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] flinkbot commented on pull request #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package

Posted by "flinkbot (via GitHub)" <gi...@apache.org>.
flinkbot commented on PR #22052:
URL: https://github.com/apache/flink/pull/22052#issuecomment-1449736293

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "90925bfa3697f09d8984e1d40aa978c4cb66076e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "90925bfa3697f09d8984e1d40aa978c4cb66076e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 90925bfa3697f09d8984e1d40aa978c4cb66076e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] zentol commented on pull request #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package

Posted by "zentol (via GitHub)" <gi...@apache.org>.
zentol commented on PR #22052:
URL: https://github.com/apache/flink/pull/22052#issuecomment-1474170781

   Still I don't see what this gets you. The test running into the OOM isn't necessarily who's causing it. If some test allocates 99% of memory but finishes, then some subsequent test may end up failing. You'd kinda end up in the same situation you are right now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] zentol commented on pull request #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package

Posted by "zentol (via GitHub)" <gi...@apache.org>.
zentol commented on PR #22052:
URL: https://github.com/apache/flink/pull/22052#issuecomment-1474157633

   What makes you think they are on azure-owned hosts? They are both scheduled builds in flink-ci, which do use the alibaba machines.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] zentol commented on pull request #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package

Posted by "zentol (via GitHub)" <gi...@apache.org>.
zentol commented on PR #22052:
URL: https://github.com/apache/flink/pull/22052#issuecomment-1474164411

   wait a minute. Why is the basic nightly run using `'ubuntu-20.04'` as the pool`?
   A because of FLINK-18370...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] XComp closed pull request #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package

Posted by "XComp (via GitHub)" <gi...@apache.org>.
XComp closed pull request #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package
URL: https://github.com/apache/flink/pull/22052


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] XComp commented on a diff in pull request #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package

Posted by "XComp (via GitHub)" <gi...@apache.org>.
XComp commented on code in PR #22052:
URL: https://github.com/apache/flink/pull/22052#discussion_r1121429475


##########
flink-core/pom.xml:
##########
@@ -146,6 +146,30 @@ under the License.
 				</configuration>
 			</plugin>
 
+			<!-- The following configuration is used to debug https://issues.apache.org/jira/browse/FLINK-31278.
+			Overriding the surefire config can be removed after FLINK-31278 is resolved. -->
+			<plugin>
+				<groupId>org.apache.maven.plugins</groupId>
+				<artifactId>maven-surefire-plugin</artifactId>
+				<executions>
+					<execution>
+						<id>default-tests</id>
+						<phase>test</phase>
+						<goals>
+							<goal>test</goal>
+						</goals>
+						<configuration>
+							<includes>
+								<include>${test.unit.pattern}</include>
+							</includes>
+							<argLine>${flink.surefire.baseArgLine} -Xmx${flink.XmxUnitTest} -XX:+HeapDumpOnOutOfMemoryError</argLine>

Review Comment:
   ```suggestion
   							<argLine>${flink.surefire.baseArgLine} -Xmx${flink.XmxUnitTest} -XX:+HeapDumpOnOutOfMemoryError</argLine>
   ```
   @zentol I feel like I'm missing something here: Isn't `-XX:+HeapDumpOnOutOfMemoryError` a good thing to have in general in the surefire configuration? I couldn't find anywhere that this is enabled by default. But I cannot imagine that we haven't considered it before with all the OOMs we had to deal with in the past.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] XComp commented on a diff in pull request #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package

Posted by "XComp (via GitHub)" <gi...@apache.org>.
XComp commented on code in PR #22052:
URL: https://github.com/apache/flink/pull/22052#discussion_r1121747886


##########
flink-core/pom.xml:
##########
@@ -146,6 +146,30 @@ under the License.
 				</configuration>
 			</plugin>
 
+			<!-- The following configuration is used to debug https://issues.apache.org/jira/browse/FLINK-31278.
+			Overriding the surefire config can be removed after FLINK-31278 is resolved. -->
+			<plugin>
+				<groupId>org.apache.maven.plugins</groupId>
+				<artifactId>maven-surefire-plugin</artifactId>
+				<executions>
+					<execution>
+						<id>default-tests</id>
+						<phase>test</phase>
+						<goals>
+							<goal>test</goal>
+						</goals>
+						<configuration>
+							<includes>
+								<include>${test.unit.pattern}</include>
+							</includes>
+							<argLine>${flink.surefire.baseArgLine} -Xmx${flink.XmxUnitTest} -XX:+HeapDumpOnOutOfMemoryError</argLine>

Review Comment:
   ok, I don't know why it didn't cross my mind earlier (because I went over this code line as well), but we're setting this parameter through `JAVA_TOOL_OPTIONS` in [apache/flink:tools/ci/test_controller:67](https://github.com/apache/flink/blob/7e37d59f834bca805f5fbee99db87eb909d1814f/tools/ci/test_controller.sh#L67).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] XComp commented on pull request #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package

Posted by "XComp (via GitHub)" <gi...@apache.org>.
XComp commented on PR #22052:
URL: https://github.com/apache/flink/pull/22052#issuecomment-1471786148

   Thanks @zentol for sharing your thoughts. The two OOM errors that were reported in FLINK-31278 happened on Azure-owned hosts rather than the Alibaba machines. I understand that for the latter one, we can't rely on the failed module actually causing the error because of other modules that are being executed concurrently on the same Alibaba machine. But for the Azure-owned agents, I assumed that this is actually not the case. I would think that these agents are properly separated from each other. Am I wrong with my assumption here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] zentol commented on pull request #22052: [FLINK-31278] Disables fork reuse and parallel execution for flink-core package

Posted by "zentol (via GitHub)" <gi...@apache.org>.
zentol commented on PR #22052:
URL: https://github.com/apache/flink/pull/22052#issuecomment-1474175255

   I'd suggest to instead somewhat profile the runtime tests instead.
   
   Add a special test that runs last (see https://stackoverflow.com/questions/57624495/junit-test-class-order) that triggers a heap dump (and whatever else we'd like (e.g., memory consumption), run a sub-section of tests, analyze dump.
   Maybe even as a resource that runs after every single test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org