Posted to user@flink.apache.org by Stefan Bunk <st...@googlemail.com> on 2015/04/25 16:22:45 UTC

java.io.EOFException: Premature EOF: no length prefix available

Hi,

I have an error/exception in the distributed runtime. The job sometimes
works, sometimes does not.

The Web UI says:
Error: java.lang.Exception: The slot in which the task was scheduled has
been killed (probably loss of TaskManager).

Looking in the taskmanager logs, I find:
25.Apr. 00:12:42 WARN  DFSClient            - DFSOutputStream
ResponseProcessor exception  for block
BP-1944967336-172.16.21.111-1412785070309:blk_1075716650_1975996
java.io.EOFException: Premature EOF: no length prefix available
        at
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492)
        at
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:116)
        at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:721)
25.Apr. 00:12:43 WARN  DFSClient            - DataStreamer Exception
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
        at
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at
org.apache.hadoop.hdfs.DFSOutputStream$Packet.writeTo(DFSOutputStream.java:278)
        at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:568)
25.Apr. 00:12:42 WARN  DFSClient            - DFSOutputStream
ResponseProcessor exception  for block
BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988
java.io.IOException: Bad response ERROR for block
BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988 from
datanode 172.16.19.81:50010
        at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:732)
25.Apr. 00:12:46 WARN  DFSClient            - Error Recovery for block
BP-1944967336-172.16.21.111-1412785070309:blk_1075716650_1975996 in
pipeline 172.16.20.112:50010, 172.16.19.81:50010, 172.16.19.109:50010: bad
datanode 172.16.20.112:50010
25.Apr. 00:12:47 WARN  DFSClient            - Error Recovery for block
BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988 in
pipeline 172.16.20.112:50010, 172.16.19.81:50010, 172.16.20.105:50010: bad
datanode 172.16.20.112:50010
25.Apr. 00:12:48 WARN  RemoteWatcher        - Detected unreachable:
[akka.tcp://flink@172.16.21.111:6123]
25.Apr. 00:12:53 INFO  TaskManager          - Disconnecting from
JobManager: JobManager is no longer reachable
25.Apr. 00:12:53 INFO  TaskManager          - Cancelling all computations
and discarding all cached data.

The jobmanager's logs are saying:
25.Apr. 00:07:37 WARN  RemoteWatcher        - Detected unreachable:
[akka.tcp://flink@172.16.20.112:41265]
25.Apr. 00:07:37 INFO  JobManager           - Task manager akka.tcp://
flink@172.16.20.112:41265/user/taskmanager terminated.
25.Apr. 00:07:37 INFO  InstanceManager      - Unregistered task manager
akka.tcp://flink@172.16.20.112:41265. Number of registered task managers 9.
Number of available slots 18.
25.Apr. 00:07:37 INFO  JobManager           - Status of job
5f021c291483cdf7e7fae3271bfeacb1 (Wikipedia Extraction (dataset = full))
changed to FAILING The slot in which the task was scheduled has been killed
(probably loss of TaskManager)..

Any idea what I can do? Should I change some config settings?

I already have:
taskmanager.heartbeat-interval: 10000
jobmanager.max-heartbeat-delay-before-failure.sec: 90

Just in case you think this might be correlated with FLINK-1916, which I
reported/posted a while ago: It's a different job, running on different
data.


Best,
Stefan

Re: java.io.EOFException: Premature EOF: no length prefix available

Posted by Stephan Ewen <se...@apache.org>.
To add to Robert's answer:

The two parameters  "taskmanager.heartbeat-interval: 10000" and "
jobmanager.max-heartbeat-delay-before-failure.sec: 90" are deprecated in
the current 0.9 master.

In 0.9, we need to use the Akka parameters for the heartbeat instead. Example:

akka.watch.heartbeat.interval: 10s
akka.watch.heartbeat.pause: 90s
akka.watch.threshold: 5

(The threshold basically controls how much fluctuation from the heartbeat
history is tolerated before a longer pause counts as a failure. The lower
the value, the more aggressively the JobManager marks TaskManagers as
failed.)
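
For intuition, here is a rough, simplified sketch of the phi-accrual idea
behind these settings. It is not Akka's actual implementation; the constants,
history size and the normal-distribution approximation are only illustrative.
The detector keeps a history of heartbeat inter-arrival times and computes a
suspicion value "phi" from how far the current silence exceeds what that
history, shifted by the acceptable pause, predicts; the peer is considered
failed once phi exceeds the configured threshold.

import java.util.ArrayDeque;
import java.util.Deque;

// Rough sketch of a phi-accrual style failure detector.
// heartbeatPauseMs plays the role of akka.watch.heartbeat.pause,
// threshold the role of akka.watch.threshold.
public class PhiAccrualSketch {

    private final Deque<Long> gaps = new ArrayDeque<Long>(); // recent heartbeat gaps (ms)
    private final long heartbeatPauseMs;
    private final double threshold;
    private long lastHeartbeat = -1;

    public PhiAccrualSketch(long heartbeatPauseMs, double threshold) {
        this.heartbeatPauseMs = heartbeatPauseMs;
        this.threshold = threshold;
    }

    // Record an incoming heartbeat and remember the gap since the previous one.
    public void heartbeat(long nowMs) {
        if (lastHeartbeat >= 0) {
            gaps.addLast(nowMs - lastHeartbeat);
            if (gaps.size() > 200) {
                gaps.removeFirst(); // bounded history
            }
        }
        lastHeartbeat = nowMs;
    }

    // Phi grows the longer the silence exceeds what the history predicts.
    public double phi(long nowMs) {
        if (gaps.isEmpty()) {
            return 0.0;
        }
        double mean = 0.0;
        for (long g : gaps) {
            mean += g;
        }
        mean /= gaps.size();
        double variance = 0.0;
        for (long g : gaps) {
            variance += (g - mean) * (g - mean);
        }
        // Floor the std deviation so a perfectly regular history does not blow up.
        double stdDev = Math.max(Math.sqrt(variance / gaps.size()), 10.0);
        double elapsed = nowMs - lastHeartbeat;
        // Logistic approximation of P(gap > elapsed) for a normal distribution
        // whose mean is shifted by the acceptable heartbeat pause.
        double y = (elapsed - (mean + heartbeatPauseMs)) / stdDev;
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        double pLater = e / (1.0 + e);
        return -Math.log10(Math.max(pLater, 1e-16));
    }

    // A lower threshold reports failures sooner but tolerates less jitter.
    public boolean isAvailable(long nowMs) {
        return phi(nowMs) < threshold;
    }
}

With the values above, raising either the pause or the threshold makes the
JobManager wait longer before it declares a silent TaskManager dead.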


Re: java.io.EOFException: Premature EOF: no length prefix available

Posted by Robert Metzger <rm...@apache.org>.
Hi,

it looks like the TaskManager on 172.16.20.112:41265 was suffering
from a network disruption (because the HDFS connection seems to be affected
as well).

Are the clocks on the JobManager and the TaskManager out of sync? The
JobManager complains about the missing TM at 00:07, but the TM's own log
entries are from 00:12.

Do you have any monitoring tools in your cluster that would allow you to
find out whether there was a network disruption? (Is this error only
occurring with one job, while all the others are working fine?)
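
A quick way to check the clock question, just as a sketch: the small program
below asks an NTP server for the local clock offset. It assumes the Apache
Commons Net library is on the classpath and that pool.ntp.org is reachable;
neither is part of Flink. Running it on the JobManager and TaskManager hosts
and comparing the offsets would tell you whether the 00:07 vs. 00:12
discrepancy is just clock skew.

import java.net.InetAddress;
import org.apache.commons.net.ntp.NTPUDPClient;
import org.apache.commons.net.ntp.TimeInfo;

// Prints this machine's clock offset against an NTP server.
// Requires commons-net on the classpath.
public class ClockOffsetCheck {
    public static void main(String[] args) throws Exception {
        String ntpHost = args.length > 0 ? args[0] : "pool.ntp.org";
        NTPUDPClient client = new NTPUDPClient();
        client.setDefaultTimeout(5000);
        try {
            TimeInfo info = client.getTime(InetAddress.getByName(ntpHost));
            info.computeDetails(); // fills in offset and round-trip delay
            System.out.println(InetAddress.getLocalHost().getHostName()
                    + ": offset vs " + ntpHost + " = " + info.getOffset() + " ms"
                    + " (round-trip delay " + info.getDelay() + " ms)");
        } finally {
            client.close();
        }
    }
}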

