Posted to user@flink.apache.org by Greg Hogan <co...@greghogan.com> on 2018/03/15 16:54:14 UTC

Re: TaskManager crashes with PageRank algorithm in Gelly

Termination of the TaskManager by the Linux OOM killer indicates an overallocation of memory: you have set "taskmanager.heap.mb: 139264" (136 GB) on machines with 136 GB of physical RAM, so the heap, off-heap managed memory, and network buffers together claim everything the machine has.

Even though you were able to (temporarily?) resolve the issue by enabling preallocation, you may see degraded performance if system processes (e.g. prefetch) have no memory to work with.
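For reference, here is how those configured values appear to add up, using the -Xmx25395M and the network memory cap shown in your own TaskManager log (this is my reading of the Flink 1.4 memory calculation, so treat the split as approximate):

    total (taskmanager.heap.mb)                        139264 MB  (136 GB)
    network buffers (taskmanager.network.memory.max)    12288 MB  (12 GB)
    remainder                                           126976 MB
      off-heap managed memory (fraction 0.8)          ~101581 MB
      JVM heap (remaining 0.2)                          25395 MB  (matches -Xmx25395M in the log)

That already accounts for all 136 GB of physical RAM before JVM metaspace, thread stacks, the OS, and anything else running on the box, so the first additional native allocation can tip the machine into an out-of-memory condition.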

https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#managed-memory
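As a sketch only (the exact numbers are hypothetical and should be sized for your jobs), a flink-conf.yaml that reserves headroom on a 136 GB machine might look like:

    # hypothetical sizing for a 136 GB machine; leaves roughly 16 GB for JVM overhead and the OS
    taskmanager.heap.mb: 122880
    taskmanager.numberOfTaskSlots: 64
    taskmanager.memory.off-heap: true
    taskmanager.memory.fraction: 0.8
    taskmanager.memory.preallocate: true    # allocate managed memory up front for batch/Gelly jobs
    taskmanager.network.memory.min: 4294967296
    taskmanager.network.memory.max: 12884901888

With preallocation enabled the managed segments are allocated once at startup, which matches what you already observed helping.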

Greg


> On Feb 21, 2018, at 1:14 PM, santoshg <sa...@uber.com> wrote:
> 
> Folks,
> 
> We are running a simple PageRank algorithm in Gelly with about 1M edges, and
> we are seeing that one of the TaskManagers just crashes. We suspect it is a
> configuration issue because each TaskManager has a total of 136 GB of memory
> and we have 8 of them, so the total memory should be more than enough.
> 
> Here is an excerpt from the TaskManager log:
> 
> 2018-02-21 17:52:24,610 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -
> --------------------------------------------------------------------------------
> 2018-02-21 17:52:24,626 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -  Starting
> TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
> 2018-02-21 17:52:24,626 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -  OS current
> user: flink-user
> 2018-02-21 17:52:24,626 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -  Current
> Hadoop/Kerberos user: <no hadoop dependency found>
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -  JVM:
> OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b14
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum
> heap size: 25400 MiBytes
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME:
> /usr/lib/jvm/jre-1.8.0-openjdk.x86_64
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -  No Hadoop
> Dependency available
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -  JVM
> Options:
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -    
> -Xms25395M
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -    
> -Xmx25395M
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -    
> -XX:MaxDirectMemorySize=8388607T
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -    
> -XX:+UseG1GC
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -    
> -XX:+PrintSafepointStatistics
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -    
> -XX:+HeapDumpOnOutOfMemoryError
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -    
> -Dlog.file=/home/flink-user/flink-1.4.0/log/flink-flink-user-taskmanager-0-ip-10-10-1-59.log
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -    
> -Dlog4j.configuration=file:/home/flink-user/flink-1.4.0/conf/log4j.properties
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -    
> -Dlogback.configurationFile=file:/home/flink-user/flink-1.4.0/conf/logback.xml
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -  Program
> Arguments:
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -    
> --configDir
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -    
> /home/flink-user/flink-1.4.0/conf
> 2018-02-21 17:52:24,627 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath:
> /home/flink-user/flink-1.4.0/lib/flink-gelly_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-gelly-scala_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-python_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-s3-fs-hadoop-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-s3-fs-presto-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/log4j-1.2.17.jar:/home/flink-user/flink-1.4.0/lib/slf4j-log4j12-1.7.7.jar:/home/flink-user/flink-1.4.0/lib/flink-dist_2.11-1.4.0.jar:::
> 2018-02-21 17:52:24,628 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              -
> --------------------------------------------------------------------------------
> 2018-02-21 17:52:24,629 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              - Registered
> UNIX signal handlers for [TERM, HUP, INT]
> 2018-02-21 17:52:24,667 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              - Maximum
> number of open file descriptors is 768000
> 2018-02-21 17:52:24,728 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              - Loading
> configuration from /home/flink-user/flink-1.4.0/conf
> 2018-02-21 17:52:24,746 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: jobmanager.rpc.address, 10.10.1.242
> 2018-02-21 17:52:24,746 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: jobmanager.rpc.port, 6123
> 2018-02-21 17:52:24,746 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: jobmanager.heap.mb, 131072
> 2018-02-21 17:52:24,746 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.heap.mb, 139264
> 2018-02-21 17:52:24,746 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.numberOfTaskSlots, 64
> 2018-02-21 17:52:24,747 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.memory.preallocate, false
> 2018-02-21 17:52:24,747 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.memory.off-heap, true
> 2018-02-21 17:52:24,747 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.memory.fraction, 0.8
> 2018-02-21 17:52:24,747 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.network.memory.min, 4294967296
> 2018-02-21 17:52:24,747 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.network.memory.max, 12884901888
> 2018-02-21 17:52:24,747 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: parallelism.default, 512
> 2018-02-21 17:52:24,748 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: web.port, 8081
> 2018-02-21 17:52:24,748 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.tmp.dirs, /home/flink-user/flink-tmp-dir
> 2018-02-21 17:52:24,748 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: env.java.home, /usr/lib/jvm/jre-1.8.0-openjdk.x86_64
> 2018-02-21 17:52:24,749 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: env.java.opts, -XX:+UseG1GC
> -XX:+PrintSafepointStatistics -XX:+HeapDumpOnOutOfMemoryError
> 2018-02-21 17:52:24,749 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: akka.framesize, 201326591b
> 2018-02-21 17:52:24,749 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: akka.log.lifecycle.events, true
> 2018-02-21 17:52:24,749 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: akka.client.timeout, 300 s
> 2018-02-21 17:52:24,849 INFO  org.apache.flink.core.fs.FileSystem                          
> - Hadoop is not in the classpath/dependencies. The extended set of supported
> File Systems via Hadoop is not available.
> 2018-02-21 17:52:24,965 INFO 
> org.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot
> create Hadoop Security Module because Hadoop cannot be found in the
> Classpath.
> 2018-02-21 17:52:25,188 INFO 
> org.apache.flink.runtime.security.SecurityUtils               - Cannot
> install HadoopSecurityContext because Hadoop cannot be found in the
> Classpath.
> 2018-02-21 17:52:25,347 INFO 
> org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying to
> select the network interface and address to use by connecting to the leading
> JobManager.
> 2018-02-21 17:52:25,348 INFO 
> org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager
> will try to connect for 10000 milliseconds before falling back to heuristics
> 2018-02-21 17:52:25,350 INFO  org.apache.flink.runtime.net.ConnectionUtils                 
> - Retrieved new target address /10.10.1.242:6123.
> 2018-02-21 17:52:25,367 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager
> will use hostname/address 'ip-10-10-1-59' (10.10.1.59) for communication.
> 2018-02-21 17:52:25,405 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              - Starting
> TaskManager
> 2018-02-21 17:52:25,406 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              - Starting
> TaskManager actor system at ip-10-10-1-59:40949.
> 2018-02-21 17:52:25,408 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              - Trying to
> start actor system at ip-10-10-1-59:40949
> 2018-02-21 17:52:26,493 INFO  akka.event.slf4j.Slf4jLogger                                 
> - Slf4jLogger started
> 2018-02-21 17:52:26,553 INFO  akka.remote.Remoting                                         
> - Starting remoting
> 2018-02-21 17:52:27,021 INFO  akka.remote.Remoting                                         
> - Remoting started; listening on addresses
> :[akka.tcp://flink@ip-10-10-1-59:40949]
> 2018-02-21 17:52:27,022 INFO  akka.remote.Remoting                                         
> - Remoting now listens on addresses: [akka.tcp://flink@ip-10-10-1-59:40949]
> 2018-02-21 17:52:27,029 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              - Actor system
> started at akka.tcp://flink@ip-10-10-1-59:40949
> 2018-02-21 17:52:27,067 INFO 
> org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics
> reporter configured, no metrics will be exposed/reported.
> 2018-02-21 17:52:27,084 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager              - Starting
> TaskManager actor
> 
> 
> ---------------------
> 
> Here is the dump from the hs_err_pid file:
> 
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map 12288 bytes for committing
> reserved memory.
> # Possible reasons:
> #   The system is out of physical RAM or swap space
> #   In 32 bit mode, the process size limit was hit
> # Possible solutions:
> #   Reduce memory load on the system
> #   Increase physical memory or swap space
> #   Check if swap backing store is full
> #   Use 64 bit Java on a 64 bit OS
> #   Decrease Java heap size (-Xmx/-Xms)
> #   Decrease number of Java threads
> #   Decrease Java thread stack sizes (-Xss)
> #   Set larger code cache with -XX:ReservedCodeCacheSize=
> # This output file may be truncated or incomplete.
> #
> #  Out of Memory Error (os_linux.cpp:2651), pid=2439, tid=0x00007fc4b7efe700
> #
> # JRE version: OpenJDK Runtime Environment (8.0_161-b14) (build
> 1.8.0_161-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.161-b14 mixed mode linux-amd64
> compressed oops)
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> 
> ---------------  T H R E A D  ---------------
> 
> Current thread (0x00007fb5afff8260):
> 
> 
> --------------
> 
> In the JobManager we see the following:
> 
> 2018-02-21 17:55:52,380 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to
> restart or fail the job Flink Java Job at Wed Feb 21 17:53:30 UTC 2018
> (d55f327901087350c24e2a8c34937db1) if no longer possible.
> 2018-02-21 17:55:52,380 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink
> Java Job at Wed Feb 21 17:53:30 UTC 2018 (d55f327901087350c24e2a8c34937db1)
> switched from state FAILING to FAILED.
> java.lang.Exception: The data preparation for task 'Reduce (Sum)' , caused
> an error: Error obtaining the sorted input: Thread 'SortMerger Reading
> Thread' terminated due to an exception: Connection unexpectedly closed by
> remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate
> that the remote task manager was lost.
>        at
> org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:466)
>        at
> org.apache.flink.runtime.iterative.task.AbstractIterativeTask.run(AbstractIterativeTask.java:145)
>        at
> org.apache.flink.runtime.iterative.task.IterationIntermediateTask.run(IterationIntermediateTask.java:93)
>        at
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
>        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>        at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input:
> Thread 'SortMerger Reading Thread' terminated due to an exception:
> Connection unexpectedly closed by remote task manager
> 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task
> manager was lost.
>        at
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
>        at
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1095)
>        at
> org.apache.flink.runtime.operators.ReduceDriver.prepare(ReduceDriver.java:95)
>        at
> org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:460)
>        ... 5 more
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread'
> terminated due to an exception: Connection unexpectedly closed by remote
> task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the
> remote task manager was lost.
>        at
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
> Caused by:
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Connection unexpectedly closed by remote task manager
> 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task
> manager was lost.
> 
> 
> -------------
> 
> Here are the TaskManager settings:
> 
> # The heap size for the TaskManager JVM
> 
> taskmanager.heap.mb: 139264
> 
> 
> # The number of task slots that each TaskManager offers. Each slot runs one
> parallel pipeline.
> 
> taskmanager.numberOfTaskSlots: 64
> 
> # Specify whether TaskManager memory should be allocated when starting up
> (true) or when
> # memory is required in the memory manager (false)
> # Important Note: For pure streaming setups, we highly recommend to set this
> value to `false`
> # as the default state backends currently do not use the managed memory.
> 
> taskmanager.memory.preallocate: false
> taskmanager.memory.off-heap: true
> taskmanager.memory.fraction: 0.8
> 
> #taskmanager.network.memory.fraction: 0.1
> taskmanager.network.memory.min: 4294967296
> taskmanager.network.memory.max: 12884901888
> 
> #taskmanager.network.numberOfBuffers: 8192
> #taskmanager.debug.memory.startLogThread: true
> #taskmanager.debug.memory.logIntervalMs: 500
> 
> # The parallelism used for programs that did not specify any other
> parallelism.
> 
> parallelism.default: 512
> 
> -----------
> 
> So, what are we doing wrong here?
> 
> 
> 
> 
> 
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/