Posted to user@hadoop.apache.org by Bogdan Raducanu <lr...@gmail.com> on 2015/08/21 19:05:45 UTC

hdfs: weird lease expiration issue

I have an application that continuously appends to an HDFS file and keeps
it open for a long time.
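For reference, the write path looks roughly like this (a simplified sketch; the real path, payload, and flush interval are different, and error handling is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Appender {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/path/to/my/file");       // placeholder path
    // The file is opened for append once and the stream (and its lease)
    // is then kept for a long time.
    FSDataOutputStream out = fs.append(file);
    while (true) {
      out.write("one record\n".getBytes("UTF-8"));  // placeholder payload
      out.hsync();                                  // this is why the writes show up as fsync in the NN log
      Thread.sleep(1000);                           // placeholder interval
    }
  }
}
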
At some point the application crashed and left the file open.
It was then restarted and resumed normal operation, completing some writes
(appends to the file). But an hour after the crash it started getting write
errors because its lease had been removed.
Digging through the NN log I found the following odd behavior.

Events timeline:

1. 15:25: application crashes

2. 15:28: application restarted, writing doesn't start immediately

3. 15:37: first write

4. some more writes

5. new block needed: 2015-08-11 15:52:59,223 INFO
org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /path/to/my/file.
... blk_1079708083_9361735{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]...|RBW],
ReplicaUnderConstruction[[DISK]...|RBW],
ReplicaUnderConstruction[[DISK]...|RBW]]}

6. some more writes; application uses hsync so we can see the writes in the
nn log: 2015-08-11 15:52:59,234 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* fsync: /path/to/my/file for DFSClient_NONMAPREDUCE_-1953764790_1

7. 60 minutes after crash: 2015-08-11 16:25:18,397 INFO
org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder:
DFSClient_NONMAPREDUCE_830713991_1, pendingcreates: 1] has expired hard
limit

8. 2015-08-11 16:25:18,398 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.
Holder: DFSClient_NONMAPREDUCE_830713991_1, pendingcreates: 1],
src=/path/to/my/file

9. 2015-08-11 16:25:18,398 INFO BlockStateChange: BLOCK*
blk_1079708083_9361735{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0,
replicas=[ReplicaUnderConstruction[[DISK]...|RBW],
ReplicaUnderConstruction[[DISK]...|RBW],
ReplicaUnderConstruction[[DISK]...|RBW]]} recovery started,
primary=ReplicaUnderConstruction[[DISK]...|RBW]

10. 2015-08-11 16:25:18,398 WARN org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.internalReleaseLease: File /path/to/my/file has not been closed.
Lease recovery is in progress. RecoveryId = 9361840 for block
blk_1079708083_9361735{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0,
replicas=...

So, somehow the crashed client's lease remained, and an hour after the crash
it was removed. This happened even though, during that hour, another client
obtained a lease and appended to the file.
Also, there is no "startFile: recover lease" log message when the new client
opens the file; it is as if the old lease is not seen until the 1-hour hard
limit expires.
Any idea how this could happen? This is on a distribution based on 2.6.0,
with HA enabled.

Re: hdfs: weird lease expiration issue

Posted by Bogdan Raducanu <lr...@gmail.com>.
Yes, I know how the lease is supposed to work. But it seems that in this
case it's not working as expected. The "second client" is the only running
client (the previous one having crashed). It fails later, when trying to
append again, with the error:

16:52:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
Lease mismatch on /path/to/my/file (inode 7963186) owned by HDFS_NameNode
but is accessed by DFSClient_NONMAPREDUCE_-1953764790_1
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3587)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3474)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:678)
        .......
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1265)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)

So it seems the NN doesn't release its lease after taking over the file
during the hard limit expiration handling.
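
In case it helps anyone else hitting this: a restarted client can also force the old lease out explicitly before re-opening the file for append, along these lines (just a sketch, assuming the default filesystem is HDFS; the path and polling interval are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecoverLeaseOnRestart {
  public static void main(String[] args) throws Exception {
    Path file = new Path("/path/to/my/file");                  // placeholder path
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    // Ask the NN to start lease recovery for the previous writer, then wait
    // until the file is actually closed before appending again.
    boolean closed = dfs.recoverLease(file);
    while (!closed) {
      Thread.sleep(1000);                                      // placeholder interval
      closed = dfs.isFileClosed(file);
    }
    // dfs.append(file) should be safe from here on
  }
}

As far as I can tell, recoverLease only returns true once the file is actually closed, so the isFileClosed loop is there for the case where block recovery is still in flight.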

On Fri, Aug 21, 2015 at 8:17 PM, Ravi Prakash <ra...@ymail.com> wrote:

> Hi Bogdan!
>
> This is because the second application attempt appears to HDFS as a new
> client. Are you sure the second client experienced write errors because
> *its* lease was removed?
>
> Yongjun has a great writeup :
> http://blog.cloudera.com/blog/2015/02/understanding-hdfs-recovery-processes-part-1/
> (Thanks Yongjun). To quote
> "The lease manager maintains a soft limit (1 minute) and hard limit (1
> hour) for the expiration time (these limits are currently
> non-configurable), and all leases maintained by the lease manager abide by
> the same soft and hard limits. Before the soft limit expires, the client
> holding the lease of a file has exclusive write access to the file. If the
> soft limit expires and the client has not renewed the lease or closed the
> file (the lease of a file is released when the file is closed), another
> client can forcibly take over the lease. If the hard limit expires and the
> client has not renewed the lease, HDFS assumes that the client has quit and
> will automatically close the file on behalf of the client, thereby
> recovering the lease."
>
> HTH
> Ravi
>
>
>
>
> On Friday, August 21, 2015 10:05 AM, Bogdan Raducanu <lr...@gmail.com>
> wrote:
>
>
> I have an application that continuously appends to an hdfs file and keeps
> it open a long time.
> At some point the application crashed and left the file open.
> It was then restarted and it resumed normal operation, completing some
> writes (appending to the file). But, an hour after the crash it experienced
> write errors because its lease was removed.
> Digging in the NN log I found this weird behavior.
>
> Events timeline:
>
> 1. 15:25: application crashes
>
> 2. 15:28: application restarted, writing doesn't start immediately
>
> 3. 15:37 first write
>
> 4. some more writes
>
> 5. new block needed: 2015-08-11 15:52:59,223 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /path/to/my/file.
> ... blk_1079708083_9361735{blockUCState=UNDER_CONSTRUCTION,
> primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]...|RBW],
> ReplicaUnderConstruction[[DISK]...|RBW],
> ReplicaUnderConstruction[[DISK]...|RBW]]}
>
> 6. some more writes; application uses hsync so we can see the writes in
> the nn log: 2015-08-11 15:52:59,234 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /path/to/my/file for
> DFSClient_NONMAPREDUCE_-1953764790_1
>
> 7. 60 minutes after crash: 2015-08-11 16:25:18,397 INFO
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder:
> DFSClient_NONMAPREDUCE_830713991_1, pendingcreates: 1] has expired hard
> limit
>
> 8. 2015-08-11 16:25:18,398 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.
> Holder: DFSClient_NONMAPREDUCE_830713991_1, pendingcreates: 1],
> src=/path/to/my/file
>
> 9. 2015-08-11 16:25:18,398 INFO BlockStateChange: BLOCK*
> blk_1079708083_9361735{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0,
> replicas=[ReplicaUnderConstruction[[DISK]...|RBW],
> ReplicaUnderConstruction[[DISK]...|RBW],
> ReplicaUnderConstruction[[DISK]...|RBW]]} recovery started,
> primary=ReplicaUnderConstruction[[DISK]...|RBW]
>
> 10. 2015-08-11 16:25:18,398 WARN org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.internalReleaseLease: File /path/to/my/file has not been closed.
> Lease recovery is in progress. RecoveryId = 9361840 for block
> blk_1079708083_9361735{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0,
> replicas=...
>
> So, somehow the crashed client's lease remained and an hour after the
> crash it was removed.
> This happened even though during this hour another client obtained a lease
> and appended to the file.
> Also, there is no "startFile: recover lease" log message when the new
> client opens the file. It is like the old lease is not seen until the 1
> hour hard limit expires.
> Any idea how this could happen? This is on a distribution based on 2.6.0,
> with HA
>
>
>

Re: hdfs: weird lease expiration issue

Posted by Ravi Prakash <ra...@ymail.com>.
Hi Bogdan!
This is because the second application attempt appears to HDFS as a new client. Are you sure the second client experienced write errors because *its* lease was removed?
Yongjun has a great writeup: http://blog.cloudera.com/blog/2015/02/understanding-hdfs-recovery-processes-part-1/ (Thanks Yongjun). To quote:
"The lease manager maintains a soft limit (1 minute) and hard limit (1 hour) for the expiration time (these limits are currently non-configurable), and all leases maintained by the lease manager abide by the same soft and hard limits. Before the soft limit expires, the client holding the lease of a file has exclusive write access to the file. If the soft limit expires and the client has not renewed the lease or closed the file (the lease of a file is released when the file is closed), another client can forcibly take over the lease. If the hard limit expires and the client has not renewed the lease, HDFS assumes that the client has quit and will automatically close the file on behalf of the client, thereby recovering the lease."
HTH
Ravi

On Friday, August 21, 2015 10:05 AM, Bogdan Raducanu <lr...@gmail.com> wrote:

I have an application that continuously appends to an hdfs file and keeps it open a long time. At some point the application crashed and left the file open. It was then restarted and it resumed normal operation, completing some writes (appending to the file). But, an hour after the crash it experienced write errors because its lease was removed. Digging in the NN log I found this weird behavior.
Events timeline:
1. 15:25: application crashes
2. 15:28: application restarted, writing doesn't start immediately
3. 15:37 first write
4. some more writes
5. new block needed: 2015-08-11 15:52:59,223 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /path/to/my/file. ... blk_1079708083_9361735{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]...|RBW], ReplicaUnderConstruction[[DISK]...|RBW], ReplicaUnderConstruction[[DISK]...|RBW]]}
6. some more writes; application uses hsync so we can see the writes in the nn log: 2015-08-11 15:52:59,234 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /path/to/my/file for DFSClient_NONMAPREDUCE_-1953764790_1
7. 60 minutes after crash: 2015-08-11 16:25:18,397 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder: DFSClient_NONMAPREDUCE_830713991_1, pendingcreates: 1] has expired hard limit
8. 2015-08-11 16:25:18,398 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_830713991_1, pendingcreates: 1], src=/path/to/my/file
9. 2015-08-11 16:25:18,398 INFO BlockStateChange: BLOCK* blk_1079708083_9361735{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]...|RBW], ReplicaUnderConstruction[[DISK]...|RBW], ReplicaUnderConstruction[[DISK]...|RBW]]} recovery started, primary=ReplicaUnderConstruction[[DISK]...|RBW]
10. 2015-08-11 16:25:18,398 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /path/to/my/file has not been closed. Lease recovery is in progress. RecoveryId = 9361840 for block blk_1079708083_9361735{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, replicas=...
So, somehow the crashed client's lease remained and an hour after the crash it was removed. This happened even though during this hour another client obtained a lease and appended to the file. Also, there is no "startFile: recover lease" log message when the new client opens the file. It is like the old lease is not seen until the 1 hour hard limit expires. Any idea how this could happen? This is on a distribution based on 2.6.0, with HA