Posted to issues@flink.apache.org by "Caizhi Weng (Jira)" <ji...@apache.org> on 2020/05/06 08:47:00 UTC

[jira] [Comment Edited] (FLINK-16636) TableEnvironmentITCase is crashing on Travis

    [ https://issues.apache.org/jira/browse/FLINK-16636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100592#comment-17100592 ] 

Caizhi Weng edited comment on FLINK-16636 at 5/6/20, 8:46 AM:
--------------------------------------------------------------

Hi,

After some more investigation, I'm afraid I have to conclude that this is not a bug. The memory size of our testing container is simply too small.

I used the [native memory tracking tool|https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html] to track the native memory usage across all test cases; the final memory usage is posted below. See [here|https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr022.html] for an explanation of each category.

{code}
RSS: 2928128 (in 1KB blocks)

Total: reserved=4679436KB, committed=3198040KB
-                 Java Heap (reserved=2097152KB, committed=1740800KB)
                            (mmap: reserved=2097152KB, committed=1740800KB)

-                     Class (reserved=1257953KB, committed=248449KB)
                            (classes #29687)
                            (malloc=25057KB #113855)
                            (mmap: reserved=1232896KB, committed=223392KB)

-                    Thread (reserved=56902KB, committed=56902KB)
                            (thread #56)
                            (stack: reserved=55456KB, committed=55456KB)
                            (malloc=167KB #287)
                            (arena=1279KB #110)

-                      Code (reserved=279028KB, committed=180816KB)
                            (malloc=29428KB #38259)
                            (mmap: reserved=249600KB, committed=151388KB)

-                        GC (reserved=139125KB, committed=125901KB)
                            (malloc=28533KB #81265)
                            (mmap: reserved=110592KB, committed=97368KB)

-                  Compiler (reserved=179KB, committed=179KB)
                            (malloc=48KB #444)
                            (arena=131KB #3)

-                  Internal (reserved=801672KB, committed=801664KB)
                            (malloc=801632KB #83153)
                            (mmap: reserved=40KB, committed=32KB)

-                    Symbol (reserved=33730KB, committed=33730KB)
                            (malloc=32059KB #274943)
                            (arena=1670KB #1)

-    Native Memory Tracking (reserved=9283KB, committed=9283KB)
                            (malloc=21KB #254)
                            (tracking overhead=9262KB)

-               Arena Chunk (reserved=316KB, committed=316KB)
                            (malloc=316KB)

-                   Unknown (reserved=4096KB, committed=0KB)
                            (mmap: reserved=4096KB, committed=0KB)
{code}

We see that besides heap memory, about 1.4GB of native memory is committed (3198040KB in total minus 1740800KB of Java heap). The most suspicious category is "Internal", which takes up about 800MB of native memory, but I don't know what "Internal" covers (the category documentation explains it only roughly), and a more detailed stack trace doesn't give me any information either. Besides, the "Internal" usage drops to a small value from time to time, so I don't think there is a native memory leak here.
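
To make the arithmetic behind the non-heap figure easy to re-check, here is a minimal Python sketch that parses the category lines of the report above and sums the committed sizes. The helper name `committed_by_category` and the regular expression are my own illustration, not part of any JDK tooling, and the sketch only assumes the line layout shown in this report.

```python
import re

# Category lines copied from the NMT summary posted above.
NMT_SUMMARY = """\
Total: reserved=4679436KB, committed=3198040KB
-                 Java Heap (reserved=2097152KB, committed=1740800KB)
-                     Class (reserved=1257953KB, committed=248449KB)
-                    Thread (reserved=56902KB, committed=56902KB)
-                      Code (reserved=279028KB, committed=180816KB)
-                        GC (reserved=139125KB, committed=125901KB)
-                  Compiler (reserved=179KB, committed=179KB)
-                  Internal (reserved=801672KB, committed=801664KB)
-                    Symbol (reserved=33730KB, committed=33730KB)
-    Native Memory Tracking (reserved=9283KB, committed=9283KB)
-               Arena Chunk (reserved=316KB, committed=316KB)
-                   Unknown (reserved=4096KB, committed=0KB)
"""

# Matches lines such as "-   Java Heap (reserved=...KB, committed=...KB)".
# The "Total:" line does not start with "-" and is therefore skipped.
CATEGORY_LINE = re.compile(r"^-\s+(.+?) \(reserved=(\d+)KB, committed=(\d+)KB\)$")


def committed_by_category(summary):
    """Return {category name: committed size in KB} (hypothetical helper)."""
    result = {}
    for line in summary.splitlines():
        match = CATEGORY_LINE.match(line)
        if match:
            result[match.group(1)] = int(match.group(3))
    return result


committed = committed_by_category(NMT_SUMMARY)
total_kb = sum(committed.values())
non_heap_kb = total_kb - committed["Java Heap"]

# Print the categories from largest to smallest committed size.
for name, kb in sorted(committed.items(), key=lambda item: -item[1]):
    print(f"{name:>24}: {kb / 1024:8.1f} MB")
print(f"{'non-heap committed':>24}: {non_heap_kb / 1024:8.1f} MB")
```

The per-category values add up to the reported total of 3198040KB, so the non-heap part is 1457240KB, of which "Internal" alone accounts for 801664KB.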

We also have somewhat large "Code" and "Class" memory usage, but this is expected, as we generate a lot of Java code when running SQL.

Note that besides the two surefire processes, the Maven process and other processes also consume memory. So it seems that we should either enlarge the memory size of the container, lower the heap size limit, or run these test cases in a single process.
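
As a rough illustration of why two forks can exceed a container limit while one does not, the following Python sketch multiplies the RSS reported above by the number of forks. It assumes that both forks reach a similar RSS at the same time, which I have not measured. The function names and `example_limit_kb` are hypothetical; the real container limit is not stated in this thread and must be substituted.

```python
KB_PER_GB = 1024 * 1024

# RSS of one forked test JVM, taken from the report above (in KB).
RSS_PER_FORK_KB = 2928128


def estimated_demand_kb(rss_per_fork_kb, fork_count, other_processes_kb=0):
    """Estimate total memory demand; assumes every fork peaks at the same RSS."""
    return rss_per_fork_kb * fork_count + other_processes_kb


def fits(container_limit_kb, demand_kb):
    """True if the estimated demand stays within the container limit."""
    return demand_kb <= container_limit_kb


two_forks_kb = estimated_demand_kb(RSS_PER_FORK_KB, fork_count=2)
one_fork_kb = estimated_demand_kb(RSS_PER_FORK_KB, fork_count=1)

# Hypothetical limit for illustration only, NOT the actual CI container size.
example_limit_kb = 4 * KB_PER_GB

print(f"two forks: {two_forks_kb / KB_PER_GB:.2f} GB, "
      f"fits: {fits(example_limit_kb, two_forks_kb)}")
print(f"one fork:  {one_fork_kb / KB_PER_GB:.2f} GB, "
      f"fits: {fits(example_limit_kb, one_fork_kb)}")
```

The estimate leaves `other_processes_kb` at zero because I have no measurement for the Maven process or other processes; any real figure would only increase the demand.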

It's true that JDK8 has some native memory leaks, but they're not a big deal for tests that run for about 30 minutes. Those leaks would take tens of hours to eat up the native memory.


> TableEnvironmentITCase is crashing on Travis
> --------------------------------------------
>
>                 Key: FLINK-16636
>                 URL: https://issues.apache.org/jira/browse/FLINK-16636
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / Planner
>    Affects Versions: 1.11.0
>            Reporter: Jark Wu
>            Assignee: Caizhi Weng
>            Priority: Blocker
>              Labels: pull-request-available, test-stability
>             Fix For: 1.11.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Here is the instance and exception stack: https://api.travis-ci.org/v3/job/663408376/log.txt
> But there is not much helpful information there; it may be an accidental Maven problem.
> {code}
> 09:55:07.703 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.1:test (integration-tests) on project flink-table-planner-blink_2.11: There are test failures.
> 09:55:07.703 [ERROR] 
> 09:55:07.703 [ERROR] Please refer to /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire-reports for the individual test results.
> 09:55:07.703 [ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
> 09:55:07.703 [ERROR] ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
> 09:55:07.703 [ERROR] Command was /bin/sh -c cd /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=1 -XX:+UseG1GC -jar /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire/surefirebooter714252487017838305.jar /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire 2020-03-17T09-34-41_826-jvmRun1 surefire4625103637332937565tmp surefire_43192129054983363633tmp
> 09:55:07.703 [ERROR] Error occurred in starting fork, check output in log
> 09:55:07.703 [ERROR] Process Exit Code: 137
> 09:55:07.703 [ERROR] Crashed tests:
> 09:55:07.703 [ERROR] org.apache.flink.table.api.TableEnvironmentITCase
> 09:55:07.703 [ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
> 09:55:07.703 [ERROR] Command was /bin/sh -c cd /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=1 -XX:+UseG1GC -jar /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire/surefirebooter714252487017838305.jar /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire 2020-03-17T09-34-41_826-jvmRun1 surefire4625103637332937565tmp surefire_43192129054983363633tmp
> 09:55:07.703 [ERROR] Error occurred in starting fork, check output in log
> 09:55:07.703 [ERROR] Process Exit Code: 137
> 09:55:07.703 [ERROR] Crashed tests:
> 09:55:07.703 [ERROR] org.apache.flink.table.api.TableEnvironmentITCase
> 09:55:07.703 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:510)
> 09:55:07.704 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:382)
> 09:55:07.704 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:297)
> 09:55:07.704 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:246)
> 09:55:07.704 [ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183)
> 09:55:07.704 [ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011)
> 09:55:07.704 [ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857)
> 09:55:07.704 [ERROR] at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132)
> 09:55:07.704 [ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
> 09:55:07.704 [ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> 09:55:07.704 [ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> 09:55:07.704 [ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> 09:55:07.704 [ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> 09:55:07.704 [ERROR] at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> 09:55:07.704 [ERROR] at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
> 09:55:07.704 [ERROR] at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355)
> 09:55:07.704 [ERROR] at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155)
> 09:55:07.704 [ERROR] at org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
> 09:55:07.704 [ERROR] at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216)
> 09:55:07.704 [ERROR] at org.apache.maven.cli.MavenCli.main(MavenCli.java:160)
> 09:55:07.704 [ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 09:55:07.704 [ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 09:55:07.704 [ERROR] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 09:55:07.704 [ERROR] at java.lang.reflect.Method.invoke(Method.java:498)
> 09:55:07.704 [ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> 09:55:07.704 [ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> 09:55:07.704 [ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> 09:55:07.704 [ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> 09:55:07.704 [ERROR] Caused by: org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
> 09:55:07.704 [ERROR] Command was /bin/sh -c cd /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=1 -XX:+UseG1GC -jar /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire/surefirebooter714252487017838305.jar /home/travis/build/apache/flink/flink-table/flink-table-planner-blink/target/surefire 2020-03-17T09-34-41_826-jvmRun1 surefire4625103637332937565tmp surefire_43192129054983363633tmp
> 09:55:07.704 [ERROR] Error occurred in starting fork, check output in log
> 09:55:07.704 [ERROR] Process Exit Code: 137
> 09:55:07.704 [ERROR] Crashed tests:
> 09:55:07.704 [ERROR] org.apache.flink.table.api.TableEnvironmentITCase
> 09:55:07.704 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:669)
> 09:55:07.704 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.access$600(ForkStarter.java:115)
> 09:55:07.704 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:371)
> 09:55:07.704 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:347)
> 09:55:07.704 [ERROR] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 09:55:07.704 [ERROR] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 09:55:07.704 [ERROR] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 09:55:07.704 [ERROR] at java.lang.Thread.run(Thread.java:748)
> 09:55:07.704 [ERROR] -> [Help 1]
> 09:55:07.704 [ERROR] 
> 09:55:07.704 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> 09:55:07.704 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> 09:55:07.704 [ERROR] 
> 09:55:07.704 [ERROR] For more information about the errors and possible solutions, please read the following articles:
> 09:55:07.704 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> 09:55:07.704 [ERROR] 
> 09:55:07.704 [ERROR] After correcting the problems, you can resume the build with the command
> 09:55:07.704 [ERROR]   mvn <goals> -rf :flink-table-planner-blink_2.11
> {code}
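
The "Process Exit Code: 137" in the log above follows the usual shell convention that an exit status above 128 means the process was terminated by signal number (status - 128). A minimal Python sketch of that decoding follows; the helper name `signal_from_exit_code` is hypothetical, and the signal names come from the platform's `signal` module, so the result assumes a POSIX system such as Linux.

```python
import signal


def signal_from_exit_code(exit_code):
    """Return the signal name encoded in a shell-style exit code, or None."""
    if exit_code <= 128:
        return None  # ordinary exit status, no signal involved
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        return None  # not a signal number known on this platform


print(signal_from_exit_code(137))
```

On Linux, 137 decodes to SIGKILL (128 + 9), which is what one would expect if the forked JVM was killed from outside, for example when a container exceeds its memory limit, rather than crashing by itself.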



--
This message was sent by Atlassian Jira
(v8.3.4#803005)