Posted to issues@flink.apache.org by "Robert Metzger (JIRA)" <ji...@apache.org> on 2015/09/01 11:54:45 UTC

[jira] [Updated] (FLINK-2601) IOManagerAsync may produce NPE during shutdown

     [ https://issues.apache.org/jira/browse/FLINK-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Metzger updated FLINK-2601:
----------------------------------
    Description: 
While analyzing a failed YARN test, I found that it failed because the following exception appeared in the logs:

taskmanager-stderr:
{code}
Exception in thread "I/O manager shutdown hook" java.lang.NullPointerException
	at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync.shutdown(IOManagerAsync.java:144)
	at org.apache.flink.runtime.io.disk.iomanager.IOManager$1.run(IOManager.java:103)
{code}

taskmanager.log:
{code}
18:45:00,812 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
18:45:00,819 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: testing-worker-linux-docker-56ee9bbf-3203-linux-2.prod.travis-ci.org/172.17.9.129, server port: 38689, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
18:45:00,822 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
18:45:00,825 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/home/travis/build/rmetzger/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-localDir-nm-1_0/usercache/travis/appcache/application_1441046584836_0007': total 15 GB, usable 7 GB (46.67% usable)
18:45:00,929 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
18:45:01,186 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed memory (236 MB).
18:45:01,755 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /home/travis/build/rmetzger/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-localDir-nm-1_0/usercache/travis/appcache/application_1441046584836_0007/flink-io-1befed3c-89c5-4b5e-9043-1b92c4c047d4 for spill files.
18:45:01,831 ERROR org.apache.flink.yarn.appMaster.YarnTaskManagerRunner         - RECEIVED SIGNAL 15: SIGTERM
18:45:01,833 ERROR org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync     - Error while shutting down IO Manager reader thread.
java.lang.NullPointerException
	at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync.shutdown(IOManagerAsync.java:133)
	at org.apache.flink.runtime.io.disk.iomanager.IOManager$1.run(IOManager.java:103)
18:45:01,841 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager removed spill file directory /home/travis/build/rmetzger/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-localDir-nm-1_0/usercache/travis/appcache/application_1441046584836_0007/flink-io-1befed3c-89c5-4b5e-9043-1b92c4c047d4
{code}

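It looks like the TM (TaskManager) is being shut down while it is still starting up: SIGTERM arrives less than 100 ms after the I/O manager logs its spill directory, and the shutdown hook then hits the NPE. Hardening this should be easy.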

https://s3.amazonaws.com/archive.travis-ci.org/jobs/78052378/log.txt
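
One possible shape for the hardening is sketched below. This is not the actual IOManagerAsync code and not a proposed patch; the class GuardedShutdownSketch, its Worker threads, and the method names are all hypothetical stand-ins. The sketch assumes the NPE comes from shutdown() touching fields that the startup path has not assigned yet, which the stack traces suggest but do not prove. The idea is to publish the worker array only once it is fully built, and to let shutdown() tolerate a missing array and run at most once.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for an I/O manager; NOT Flink's IOManagerAsync. */
final class GuardedShutdownSketch {

    /** Hypothetical worker, standing in for a reader or writer thread. */
    static final class Worker extends Thread {
        private volatile boolean running = true;

        Worker(String name) {
            super(name);
            setDaemon(true); // do not keep the JVM alive in this sketch
        }

        @Override
        public void run() {
            while (running) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    return; // interrupted by shutdown()
                }
            }
        }

        void shutdown() {
            running = false;
            interrupt();
        }
    }

    // Ensures the shutdown logic runs at most once, even if the shutdown hook
    // and an explicit shutdown() call race with each other.
    private final AtomicBoolean isShutdown = new AtomicBoolean(false);

    // Stays null until start() has built and started every worker.
    private volatile Worker[] readers;

    /** Starts n workers and publishes the array only when it is complete. */
    public void start(int n) {
        Worker[] created = new Worker[n];
        for (int i = 0; i < n; i++) {
            created[i] = new Worker("sketch-reader-" + i);
            created[i].start();
        }
        readers = created;
    }

    /** Stops all known workers; returns how many were signalled. */
    public int shutdown() {
        if (!isShutdown.compareAndSet(false, true)) {
            return 0; // already shut down
        }
        Worker[] current = readers;
        if (current == null) {
            return 0; // shutdown arrived before startup finished
        }
        int stopped = 0;
        for (Worker w : current) {
            if (w != null) {
                w.shutdown();
                stopped++;
            }
        }
        return stopped;
    }
}
```

With this shape, calling shutdown() on an instance that was never started returns 0 instead of throwing, and calling it a second time is a no-op. In the sketch, workers that start() has already launched but not yet published are not stopped by an early shutdown; it relies on daemon threads for that case, and a real fix would need to decide how to handle it.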

  was:
While analyzing a failed YARN test, I detected that it failed because it found the following exception in the logs:

taskmanager-stderr:
{code}
Exception in thread "I/O manager shutdown hook" java.lang.NullPointerException
	at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync.shutdown(IOManagerAsync.java:144)
	at org.apache.flink.runtime.io.disk.iomanager.IOManager$1.run(IOManager.java:103)
{code}

taskmanager.log
{code}
18:45:00,812 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
18:45:00,819 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: testing-worker-linux-docker-56ee9bbf-3203-linux-2.prod.travis-ci.org/172.17.9.129, server port: 38689, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
18:45:00,822 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
18:45:00,825 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/home/travis/build/rmetzger/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-localDir-nm-1_0/usercache/travis/appcache/application_1441046584836_0007': total 15 GB, usable 7 GB (46.67% usable)
18:45:00,929 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
18:45:01,186 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed memory (236 MB).
18:45:01,755 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /home/travis/build/rmetzger/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-localDir-nm-1_0/usercache/travis/appcache/application_1441046584836_0007/flink-io-1befed3c-89c5-4b5e-9043-1b92c4c047d4 for spill files.
18:45:01,831 ERROR org.apache.flink.yarn.appMaster.YarnTaskManagerRunner         - RECEIVED SIGNAL 15: SIGTERM
18:45:01,833 ERROR org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync     - Error while shutting down IO Manager reader thread.
java.lang.NullPointerException
	at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync.shutdown(IOManagerAsync.java:133)
	at org.apache.flink.runtime.io.disk.iomanager.IOManager$1.run(IOManager.java:103)
18:45:01,841 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager removed spill file directory /home/travis/build/rmetzger/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-localDir-nm-1_0/usercache/travis/appcache/application_1441046584836_0007/flink-io-1befed3c-89c5-4b5e-9043-1b92c4c047d4
{code}

Looks like the TM is shutting down while still starting up. Hardening this should be easy.


> IOManagerAsync may produce NPE during shutdown
> ----------------------------------------------
>
>                 Key: FLINK-2601
>                 URL: https://issues.apache.org/jira/browse/FLINK-2601
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 0.10
>            Reporter: Robert Metzger
>            Priority: Minor
>              Labels: test-stability



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)