Posted to user@hadoop.apache.org by Robert Schmidtke <ro...@gmail.com> on 2015/11/11 09:03:54 UTC

LeaseExpiredException during TestDFSIO on HDFS

Hi everyone,

I've been running the TestDFSIO benchmark on HDFS with the following
setup: 8 nodes (1 namenode with a co-located resource manager, 7 data nodes
with co-located node managers), an HDFS block size of 32M, a replication
factor of 1, and 21 files of 1G each (i.e. 3 mappers per data node). I run
TestDFSIO ten times in a row (as a cycle of write, read and cleanup
operations), and in some of the runs (though never the first) I get a
LeaseExpiredException. Below is a stack trace with some context. I was
hoping you could point me to where I might have gone wrong in my
configuration. My HDFS config files are pretty vanilla, and I am using
Hadoop 2.7.1.
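
For reference, this is roughly how I drive the benchmark. The jar path is a
placeholder for my install, and the -write/-read/-clean, -nrFiles and -size
flags are what I believe the 2.7.1 TestDFSIO usage lists, so please
double-check them against your build:

JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar
# ten write/read/cleanup cycles in a row
for run in $(seq 1 10); do
  hadoop jar "$JAR" TestDFSIO -write -nrFiles 21 -size 1GB
  hadoop jar "$JAR" TestDFSIO -read  -nrFiles 21 -size 1GB
  hadoop jar "$JAR" TestDFSIO -clean
done

The only non-default HDFS settings are dfs.blocksize=33554432 (i.e. 32M)
and dfs.replication=1 in hdfs-site.xml.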

...
15/11/10 11:44:15 INFO mapreduce.Job: Running job: job_1447152143064_0003
15/11/10 11:44:21 INFO mapreduce.Job: Job job_1447152143064_0003 running in uber mode : false
15/11/10 11:44:21 INFO mapreduce.Job:  map 0% reduce 0%
15/11/10 11:44:27 INFO mapreduce.Job:  map 5% reduce 0%
15/11/10 11:44:28 INFO mapreduce.Job:  map 38% reduce 0%
15/11/10 11:44:29 INFO mapreduce.Job:  map 48% reduce 0%
15/11/10 11:44:30 INFO mapreduce.Job:  map 57% reduce 0%
15/11/10 11:44:35 INFO mapreduce.Job:  map 73% reduce 0%
15/11/10 11:44:37 INFO mapreduce.Job:  map 86% reduce 0%
15/11/10 11:44:38 INFO mapreduce.Job:  map 86% reduce 19%
15/11/10 11:44:47 INFO mapreduce.Job: Task Id : attempt_1447152143064_0003_m_000008_0, Status : FAILED
Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /benchmarks/TestDFSIO/io_data/test_io_18 (inode 16554): File does not exist. Holder DFSClient_attempt_1447152143064_0003_m_000008_0_690388761_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3431)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3236)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3074)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3034)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:723)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy12.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1430)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1226)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)

15/11/10 11:44:48 INFO mapreduce.Job:  map 83% reduce 19%
15/11/10 11:44:50 INFO mapreduce.Job:  map 89% reduce 22%
15/11/10 11:44:51 INFO mapreduce.Job:  map 100% reduce 22%
15/11/10 11:44:52 INFO mapreduce.Job:  map 100% reduce 100%
15/11/10 11:44:53 INFO mapreduce.Job: Job job_1447152143064_0003 completed successfully
15/11/10 11:44:53 INFO mapreduce.Job: Counters: 51
...

I am also seeing an extremely high standard deviation for the read rate (up
to almost 100%), and the running times of the read jobs vary widely
(between 20s and 160s). Data-local placement is also only about 15 out of
the 21 map tasks.
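
For the locality figure I am counting the "Data-local map tasks" job
counter against the 21 launched maps in the saved job client output,
roughly like this (the log file name is just an example from my own
scripts):

grep -E "Launched map tasks|Data-local map tasks|Rack-local map tasks" testdfsio_read_run1.log
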
Could this be related to the above exception(s)? Thanks a lot in advance;
I'm happy to supply any more information if you need it.

Robert

-- 
My GPG Key ID: 336E2680

Re: LeaseExpiredException during TestDFSIO on HDFS

Posted by Robert Schmidtke <ro...@gmail.com>.
I should add that I've been running TestDFSIO on the same hardware on
XtreemFS (a distributed file system that, much like HDFS, supports
replication, striping across nodes, locality for file splits, etc.) with
the same configuration (32M block size, replication factor of 1, 21 files
of 1G each), and I'm not seeing any exceptions there. The measured IO rates
are lower than HDFS's, but with almost no standard deviation, very
consistent running times, and 20 out of 21 data-local placements. I mention
this because I think it rules out hardware problems and may hint at which
part of the system is at fault here.

Thanks
Robert

-- 
My GPG Key ID: 336E2680
