Posted to user@flink.apache.org by "yinhua.dai" <yi...@outlook.com> on 2019/03/21 09:27:56 UTC

Job crashed very soon

Hi Community,

I am trying to run a large batch job that uses JDBCInputFormat to retrieve a
large amount of data from a MySQL database and performs some joins in Flink.
The environment is AWS EMR. The job always fails very quickly.

I'm running Flink 1.6.1 on YARN. My cluster has 1000 GB of memory, and my
job parameters are:
-yD akka.ask.timeout=60s -yD akka.framesize=300m -yn 50 -ys 2 -yjm 8192 -ytm 8192 -p 40
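
As a rough sketch of what these flags add up to, assuming -yn is the number of TaskManager containers, -ys the slots per TaskManager, and -yjm / -ytm the JobManager and TaskManager memory in MB, the totals can be worked out as follows. The class and method names are made up for illustration, and YARN container rounding and overhead are not included.

```java
public class YarnResourceSketch {
    // -yn: number of TaskManager containers, -ys: slots per TaskManager
    public static int totalSlots(int containers, int slotsPerTaskManager) {
        return containers * slotsPerTaskManager;
    }

    // -ytm and -yjm are given in MB; one JobManager is assumed
    public static long totalRequestedMb(int containers, int taskManagerMb, int jobManagerMb) {
        return (long) containers * taskManagerMb + jobManagerMb;
    }

    public static void main(String[] args) {
        // Values taken from the command line quoted above
        int yn = 50, ys = 2, yjm = 8192, ytm = 8192, p = 40;

        int slots = totalSlots(yn, ys);
        long requestedMb = totalRequestedMb(yn, ytm, yjm);

        // Prints: slots=100 requestedMb=417792 defaultParallelism=40
        System.out.println("slots=" + slots
                + " requestedMb=" + requestedMb
                + " defaultParallelism=" + p);
    }
}
```

Under these assumptions the job asks for 100 slots and roughly 408 GB of heap in total, which is well below the 1000 GB the cluster has.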

I have 6 data sources reading from different tables, and most of them are set
to a parallelism of 100.
I can only see the WARN logs below in the aggregated YARN logs; the full
log is too big to include.

2019-03-21 09:10:16,430 WARN  org.apache.flink.runtime.taskmanager.Task                    
- Task 'CHAIN DataSource (at createInput(ExecutionEnvironment.java:548)
(org.apache.flink.api.java.io.jdbc.JDBCInputFormat)) -> FlatMap (where:
(AND(=(VALUE, _UTF-16LE'true'), =(ATTRIBUTENAME,
_UTF-16LE'QuoteCurrencyId'))), select: (QUOTEID)) (10/50)' did not react to
cancelling signal for 30 seconds, but is stuck in method:
 java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
java.net.SocketInputStream.read(SocketInputStream.java:171)
java.net.SocketInputStream.read(SocketInputStream.java:141)
sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
sun.security.ssl.InputRecord.read(InputRecord.java:503)
sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
java.io.FilterInputStream.read(FilterInputStream.java:133)
com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:64)
com.mysql.cj.protocol.a.SimplePacketReader.readMessage(SimplePacketReader.java:108)
com.mysql.cj.protocol.a.SimplePacketReader.readMessage(SimplePacketReader.java:45)
com.mysql.cj.protocol.a.TimeTrackingPacketReader.readMessage(TimeTrackingPacketReader.java:57)
com.mysql.cj.protocol.a.TimeTrackingPacketReader.readMessage(TimeTrackingPacketReader.java:41)
com.mysql.cj.protocol.a.MultiPacketReader.readMessage(MultiPacketReader.java:61)
com.mysql.cj.protocol.a.MultiPacketReader.readMessage(MultiPacketReader.java:44)
com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:75)
com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:42)
com.mysql.cj.protocol.a.NativeProtocol.read(NativeProtocol.java:1685)
com.mysql.cj.protocol.a.TextResultsetReader.read(TextResultsetReader.java:87)
com.mysql.cj.protocol.a.TextResultsetReader.read(TextResultsetReader.java:48)
com.mysql.cj.protocol.a.NativeProtocol.read(NativeProtocol.java:1698)
com.mysql.cj.protocol.a.NativeProtocol.readAllResults(NativeProtocol.java:1752)
com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:1041)
com.mysql.cj.NativeSession.execSQL(NativeSession.java:1157)
com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:947)
com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery(ClientPreparedStatement.java:1020)
org.apache.flink.api.java.io.jdbc.JDBCInputFormat.open(JDBCInputFormat.java:238)
org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:170)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
java.lang.Thread.run(Thread.java:748)

2019-03-21 09:10:16,883 WARN  akka.remote.ReliableDeliverySupervisor                       
- Association with remote system
[akka.tcp://flink@ip-10-97-33-195.tr-fr-nonprod.aws-int.thomsonreuters.com:41133]
has failed, address is now gated for [50] ms. Reason: [Disassociated] 
2019-03-21 09:10:18,447 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                 
- RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2019-03-21 09:10:18,448 INFO 
org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  -
Shutting down TaskExecutorLocalStateStoresManager.
2019-03-21 09:10:18,448 INFO 
org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting
down BLOB cache
2019-03-21 09:10:18,448 INFO 
org.apache.flink.runtime.blob.TransientBlobCache              - Shutting
down BLOB cache


2019-03-21 09:08:36,913 WARN  akka.remote.transport.netty.NettyTransport                   
- Remote connection to [/10.97.33.195:39282] failed with
java.io.IOException: Connection reset by peer
2019-03-21 09:08:36,913 WARN  akka.remote.ReliableDeliverySupervisor                       
- Association with remote system
[akka.tcp://flink@ip-10-97-33-195.tr-fr-nonprod.aws-int.thomsonreuters.com:45423]
has failed, address is now gated for [50] ms. Reason: [Disassociated] 
2019-03-21 09:08:37,020 WARN  org.apache.flink.yarn.YarnResourceManager                    
- Discard registration from TaskExecutor
container_1553143971811_0015_01_000059 at
(akka.tcp://flink@ip-10-97-36-43.tr-fr-nonprod.aws-int.thomsonreuters.com:44685/user/taskmanager_0)
because the framework did not recognize it
2019-03-21 09:08:37,020 WARN  org.apache.flink.yarn.YarnResourceManager                    
- Discard registration from TaskExecutor
container_1553143971811_0015_01_000067 at
(akka.tcp://flink@ip-10-97-36-43.tr-fr-nonprod.aws-int.thomsonreuters.com:45931/user/taskmanager_0)
because the framework did not recognize it
2019-03-21 09:08:37,008 WARN  akka.remote.transport.netty.NettyTransport                   
- Remote connection to [null] failed with java.net.ConnectException:
Connection refused:
ip-10-97-36-43.tr-fr-nonprod.aws-int.thomsonreuters.com/10.97.36.43:38139


I would appreciate it if someone could take a look. Thanks.
