Posted to common-dev@hadoop.apache.org by "Igor Bolotin (JIRA)" <ji...@apache.org> on 2006/03/27 19:06:23 UTC

[jira] Created: (HADOOP-107) Namenode errors "Failed to complete filename.crc because dir.getFile()==null and null"

Namenode errors "Failed to complete filename.crc  because dir.getFile()==null and null" 
----------------------------------------------------------------------------------------

         Key: HADOOP-107
         URL: http://issues.apache.org/jira/browse/HADOOP-107
     Project: Hadoop
        Type: Bug
  Components: dfs  
 Environment: Linux
    Reporter: Igor Bolotin


We're getting a lot of these errors, and here is what I see in the namenode log: 

060327 002016 Removing lease [Lease.  Holder: DFSClient_1897466025, heldlocks: 0, pendingcreates: 0], leases remaining: 1
060327 002523 Block report from member2.local:50010: 91895 blocks.
060327 003238 Block report from member1.local:50010: 91895 blocks.
060327 005830 Failed to complete /feedback/.feedback_10.1.10.102-33877.log.crc  because dir.getFile()==null and null
060327 005830 Server handler 1 on 50000 call error: java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
        at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)
        at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:237)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:216)

I can't be 100% sure, but it looks like these errors happen with checksum files for very small data files. 
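
For context, the writing side looks roughly like this (a simplified, illustrative sketch only; the class name, the fetchRecords() helper, and the path are placeholders, and the FileSystem calls follow the general API rather than this exact release):

    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FeedbackLogWriter {                          // illustrative name
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());  // DFS as the default file system
        OutputStream out = fs.create(new Path("/feedback/feedback.log")); // example path
        for (byte[] record : fetchRecords()) {                // records trickle in slowly,
          out.write(record);                                  // so minutes can pass between writes
        }
        out.close();                                          // the error shows up around close time
      }
      private static byte[][] fetchRecords() { return new byte[0][]; } // placeholder
    }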


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Resolved: (HADOOP-107) Namenode errors "Failed to complete filename.crc because dir.getFile()==null and null"

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-107?page=all ]
     
Doug Cutting resolved HADOOP-107:
---------------------------------

    Fix Version: 0.1
     Resolution: Fixed
      Assign To: Doug Cutting

I just committed a fix for this.

> Namenode errors "Failed to complete filename.crc  because dir.getFile()==null and null"
> ---------------------------------------------------------------------------------------
>
>          Key: HADOOP-107
>          URL: http://issues.apache.org/jira/browse/HADOOP-107
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>  Environment: Linux
>     Reporter: Igor Bolotin
>     Assignee: Doug Cutting
>      Fix For: 0.1
>  Attachments: writeLocal.patch
>
> We're getting lot of these errors and here is what I see in namenode log: 
> 060327 002016 Removing lease [Lease.  Holder: DFSClient_1897466025, heldlocks: 0, pendingcreates: 0], leases remaining: 1
> 060327 002523 Block report from member2.local:50010: 91895 blocks.
> 060327 003238 Block report from member1.local:50010: 91895 blocks.
> 060327 005830 Failed to complete /feedback/.feedback_10.1.10.102-33877.log.crc  because dir.getFile()==null and null
> 060327 005830 Server handler 1 on 50000 call error: java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
> java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
>         at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)
>         at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:237)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:216)
> I can't be 100% sure, but it looks like these errors happen with checksum files for very small data files. 


[jira] Commented: (HADOOP-107) Namenode errors "Failed to complete filename.crc because dir.getFile()==null and null"

Posted by "Igor Bolotin (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-107?page=comments#action_12372041 ] 

Igor Bolotin commented on HADOOP-107:
-------------------------------------

Just tested the patch and now it works as expected. 
Thanks!

> Namenode errors "Failed to complete filename.crc  because dir.getFile()==null and null"
> ---------------------------------------------------------------------------------------
>
>          Key: HADOOP-107
>          URL: http://issues.apache.org/jira/browse/HADOOP-107
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>  Environment: Linux
>     Reporter: Igor Bolotin
>  Attachments: writeLocal.patch
>
> We're getting lot of these errors and here is what I see in namenode log: 
> 060327 002016 Removing lease [Lease.  Holder: DFSClient_1897466025, heldlocks: 0, pendingcreates: 0], leases remaining: 1
> 060327 002523 Block report from member2.local:50010: 91895 blocks.
> 060327 003238 Block report from member1.local:50010: 91895 blocks.
> 060327 005830 Failed to complete /feedback/.feedback_10.1.10.102-33877.log.crc  because dir.getFile()==null and null
> 060327 005830 Server handler 1 on 50000 call error: java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
> java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
>         at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)
>         at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:237)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:216)
> I can't be 100% sure, but it looks like these errors happen with checksum files for very small data files. 


Re: [jira] Commented: (HADOOP-107) Namenode errors "Failed to complete filename.crc because dir.getFile()==null and null"

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
One should be able to renew a lease. How does one renew a lease?
It seems like adding a byte to the block should do that... This seems
very brittle as it is. If you are doing something CPU-intensive that
produces output, you don't want the system to time out and potentially
throw out a lot of your work.

On Mar 27, 2006, at 1:00 PM, Konstantin Shvachko (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/HADOOP-107?page=comments#action_12372024 ]
>
> Konstantin Shvachko commented on HADOOP-107:
> --------------------------------------------
>
> It looks like your write to a file takes too long.
> The client has 1 minute to complete one block write until
> the lease issued for that client expires. When the lease expires the
> namenode thinks the block is abandoned. If your files are small,
> consisting of only 1 block, then the file will be considered abandoned
> as well. So the namenode removes the file before the client reports
> its completion.
> Lease duration is not configurable, so you cannot control that.
> But you can retry everything starting from file creation when you
> receive that exception.
> Is it true that your writes take longer than a minute?
>
>
>> Namenode errors "Failed to complete filename.crc  because dir.getFile()==null and null"
>> ---------------------------------------------------------------------------------------
>>
>>          Key: HADOOP-107
>>          URL: http://issues.apache.org/jira/browse/HADOOP-107
>>      Project: Hadoop
>>         Type: Bug
>>   Components: dfs
>>  Environment: Linux
>>     Reporter: Igor Bolotin
>
>>
>> We're getting lot of these errors and here is what I see in namenode log:
>> 060327 002016 Removing lease [Lease.  Holder: DFSClient_1897466025, heldlocks: 0, pendingcreates: 0], leases remaining: 1
>> 060327 002523 Block report from member2.local:50010: 91895 blocks.
>> 060327 003238 Block report from member1.local:50010: 91895 blocks.
>> 060327 005830 Failed to complete /feedback/.feedback_10.1.10.102-33877.log.crc  because dir.getFile()==null and null
>> 060327 005830 Server handler 1 on 50000 call error: java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
>> java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
>>         at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)
>>         at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:585)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:237)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:216)
>> I can't be 100% sure, but it looks like these errors happen with checksum files for very small data files.
>


[jira] Commented: (HADOOP-107) Namenode errors "Failed to complete filename.crc because dir.getFile()==null and null"

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-107?page=comments#action_12372024 ] 

Konstantin Shvachko commented on HADOOP-107:
--------------------------------------------

It looks like your writes to the file take too long.
The client has one minute to complete a block write before the lease
issued to that client expires. When the lease expires, the namenode
considers the block abandoned. If your files are small, consisting of
only one block, then the file is considered abandoned as well, so the
namenode removes the file before the client reports its completion.
Lease duration is not configurable, so you cannot control that. But you
can retry everything, starting from file creation, when you receive that
exception.
Is it true that your writes take longer than a minute?
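
As a rough illustration of that retry-from-creation approach (a sketch only; the class name, writeOnce(), and MAX_RETRIES are placeholders, not part of the DFS client API):

    import java.io.IOException;
    import java.io.OutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class RetryingWriter {                            // illustrative class name
      static void writeWithRetry(FileSystem fs, Path path, byte[] data) throws IOException {
        final int MAX_RETRIES = 3;                    // arbitrary illustration value
        for (int attempt = 1; ; attempt++) {
          try {
            writeOnce(fs, path, data);                // create + write + close in one go
            return;                                   // finished before the lease expired
          } catch (IOException e) {
            if (attempt >= MAX_RETRIES) throw e;      // give up after a few attempts
            // otherwise start over from file creation, as described above
          }
        }
      }

      static void writeOnce(FileSystem fs, Path path, byte[] data) throws IOException {
        OutputStream out = fs.create(path);           // re-creates the file on each attempt
        try { out.write(data); } finally { out.close(); }
      }
    }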


> Namenode errors "Failed to complete filename.crc  because dir.getFile()==null and null"
> ---------------------------------------------------------------------------------------
>
>          Key: HADOOP-107
>          URL: http://issues.apache.org/jira/browse/HADOOP-107
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>  Environment: Linux
>     Reporter: Igor Bolotin

>
> We're getting lot of these errors and here is what I see in namenode log: 
> 060327 002016 Removing lease [Lease.  Holder: DFSClient_1897466025, heldlocks: 0, pendingcreates: 0], leases remaining: 1
> 060327 002523 Block report from member2.local:50010: 91895 blocks.
> 060327 003238 Block report from member1.local:50010: 91895 blocks.
> 060327 005830 Failed to complete /feedback/.feedback_10.1.10.102-33877.log.crc  because dir.getFile()==null and null
> 060327 005830 Server handler 1 on 50000 call error: java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
> java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
>         at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)
>         at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:237)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:216)
> I can't be 100% sure, but it looks like these errors happen with checksum files for very small data files. 


[jira] Commented: (HADOOP-107) Namenode errors "Failed to complete filename.crc because dir.getFile()==null and null"

Posted by "Igor Bolotin (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-107?page=comments#action_12372028 ] 

Igor Bolotin commented on HADOOP-107:
-------------------------------------

This is correct - I tried to write log files directly to DFS, and depending on activity it could take a pretty long time between calls. As a workaround, I changed my code to write to local output first and move the file to DFS only after it is closed (startLocalOutput/completeLocalOutput), and the problem went away.

The question is whether this behavior is expected. I thought that with buffered output the blocks should not be requested until a flush. If the block is requested too soon and the output takes a while for whatever reason, it's practically guaranteed to hit this problem, right? Also, I never saw this happening with data files, only with checksum files.
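
For reference, the workaround looks roughly like this (a sketch; the class name and paths are examples, and the exact signatures of startLocalOutput/completeLocalOutput in this release may differ):

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class LocalFirstWriter {                                  // illustrative name
      static void writeViaLocal(FileSystem fs, byte[] data) throws Exception {
        Path dfsFile  = new Path("/feedback/feedback.log");   // example DFS target
        Path localTmp = new Path("/tmp/feedback.log");        // example local temp file

        // Write everything locally first; no DFS lease is held during this phase.
        Path tmp = fs.startLocalOutput(dfsFile, localTmp);
        OutputStream out = new FileOutputStream(tmp.toString());
        try { out.write(data); } finally { out.close(); }

        // Move the finished file into DFS in one short step, so the lease
        // has no time to expire mid-write.
        fs.completeLocalOutput(dfsFile, localTmp);
      }
    }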

> Namenode errors "Failed to complete filename.crc  because dir.getFile()==null and null"
> ---------------------------------------------------------------------------------------
>
>          Key: HADOOP-107
>          URL: http://issues.apache.org/jira/browse/HADOOP-107
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>  Environment: Linux
>     Reporter: Igor Bolotin
>  Attachments: writeLocal.patch
>
> We're getting lot of these errors and here is what I see in namenode log: 
> 060327 002016 Removing lease [Lease.  Holder: DFSClient_1897466025, heldlocks: 0, pendingcreates: 0], leases remaining: 1
> 060327 002523 Block report from member2.local:50010: 91895 blocks.
> 060327 003238 Block report from member1.local:50010: 91895 blocks.
> 060327 005830 Failed to complete /feedback/.feedback_10.1.10.102-33877.log.crc  because dir.getFile()==null and null
> 060327 005830 Server handler 1 on 50000 call error: java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
> java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
>         at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)
>         at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:237)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:216)
> I can't be 100% sure, but it looks like these errors happen with checksum files for very small data files. 


[jira] Updated: (HADOOP-107) Namenode errors "Failed to complete filename.crc because dir.getFile()==null and null"

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-107?page=all ]

Doug Cutting updated HADOOP-107:
--------------------------------

    Attachment: writeLocal.patch

A connection to a datanode is opened for the checksum file as soon as a file is opened.  Then lots of data is written to the main file and only a little to the parallel checksum file, so the checksum file might not get touched for up to a minute.

The last block of every file (checksum & main) is teed to a temporary local file, so that if the network connection dies the block can be re-transmitted to another datanode.

This patch changes things so that connections to datanodes are not initiated until a block is complete.  All writes initially go to the local temporary file and are copied to a datanode only when the block is complete.
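
In outline, the new write path behaves something like this (a conceptual sketch of the approach, not the actual contents of writeLocal.patch; all names are illustrative):

    // Conceptual sketch only - not the identifiers used in writeLocal.patch.
    void write(byte[] buf, int off, int len) throws IOException {
      backupStream.write(buf, off, len);   // always tee into the local temp file first
      bytesWrittenToBlock += len;
      if (bytesWrittenToBlock >= BLOCK_SIZE) {
        endBlock();                        // block is full: only now involve a datanode
      }
    }

    void endBlock() throws IOException {
      // The datanode connection is opened here and the whole block is streamed
      // from the local temp file, so no connection sits idle (and no lease is
      // at risk) while the application dribbles data in slowly.
      sendBackupFileToDatanode();
      bytesWrittenToBlock = 0;
    }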

> Namenode errors "Failed to complete filename.crc  because dir.getFile()==null and null"
> ---------------------------------------------------------------------------------------
>
>          Key: HADOOP-107
>          URL: http://issues.apache.org/jira/browse/HADOOP-107
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>  Environment: Linux
>     Reporter: Igor Bolotin
>  Attachments: writeLocal.patch
>
> We're getting lot of these errors and here is what I see in namenode log: 
> 060327 002016 Removing lease [Lease.  Holder: DFSClient_1897466025, heldlocks: 0, pendingcreates: 0], leases remaining: 1
> 060327 002523 Block report from member2.local:50010: 91895 blocks.
> 060327 003238 Block report from member1.local:50010: 91895 blocks.
> 060327 005830 Failed to complete /feedback/.feedback_10.1.10.102-33877.log.crc  because dir.getFile()==null and null
> 060327 005830 Server handler 1 on 50000 call error: java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
> java.io.IOException: Could not complete write to file /feedback/.feedback_10.1.10.102-33877.log.crc by DFSClient_1897466025
>         at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)
>         at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:237)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:216)
> I can't be 100% sure, but it looks like these errors happen with checksum files for very small data files. 
