Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2007/01/24 01:59:49 UTC

[jira] Created: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes
----------------------------------------------------------------------------------------

                 Key: HADOOP-923
                 URL: https://issues.apache.org/jira/browse/HADOOP-923
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.10.1
            Reporter: dhruba borthakur
         Assigned To: dhruba borthakur


The datanode sends a heartbeat to the namenode every 3 seconds. The namenode processes the heartbeat and sends a list of blocks-to-be-replicated and blocks-to-be-deleted as part of the heartbeat response.

At times when a couple of datanodes fail, heartbeat processing on the namenode becomes quite heavyweight. It acquires the global FSNamesystem lock, traverses the neededReplication structure, generates a list of blocks to be replicated and responds to the heartbeat message. Determining the list of blocks-to-be-replicated takes plenty of CPU and blocks the processing of other heartbeats because of the global FSNamesystem lock.

It would improve scalability considerably if heartbeat processing did not require the FSNamesystem lock. In fact, the pre-existing "heartbeat" lock is already there for this purpose.

I propose that the heartbeat message be separated from the "retrieve blocks-to-replicate and blocks-to-delete" message. The datanode can continue to heartbeat once every 3 seconds while retrieving blocks-to-replicate at a much coarser interval. Heartbeat processing on the namenode will then be fast because it does not require the global FSNamesystem lock. Moreover, a datanode failure will not aggravate heartbeat processing time on the namenode.
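
A minimal sketch of what such a split might look like (the interface name, method signatures and types below are illustrative placeholders, not the actual Hadoop DatanodeProtocol):

    // Hypothetical sketch only; names, parameters and types are illustrative
    // and are not taken from the Hadoop source.
    public interface DatanodeHeartbeatSketch {
      // Lightweight liveness report sent every 3 seconds.  On the namenode
      // this would be processed under the "heartbeat" lock only.
      void sendHeartbeat(String datanodeId, long capacityBytes, long remainingBytes)
          throws java.io.IOException;

      // Heavier call invoked at a much coarser interval that returns the blocks
      // this datanode should replicate or delete.  On the namenode this is where
      // the global FSNamesystem lock would be taken.
      String[] getBlockWork(String datanodeId) throws java.io.IOException;
    }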



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467924 ] 

Doug Cutting commented on HADOOP-923:
-------------------------------------

Overall this sounds like a great direction.

> new DatanodeProtocol call named sendBlockModifications().

This seems more get-like than send-like. So maybe it should be called 'getBlockModifications()' or 'getBlockInstructions()'?



[jira] Updated: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-923:
------------------------------------

    Attachment: scalableheartbeats1.patch

A first version of the patch. Review comments needed.

I compared two implementations:
Approach 1: keep a single heartbeat RPC and have the namenode transparently return the list-of-blocks-to-be-transferred once every nth heartbeat call.
Approach 2: introduce a new RPC call that the datanode invokes to retrieve the list-of-blocks-to-be-transferred (every 20 seconds by default).

The current patch implements Approach 2. This approach allows for a dedicated namenode thread to process heartbeats (should the need arise).
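
An illustrative datanode-side loop for Approach 2 follows; the 3-second and 20-second intervals come from this comment, while the class, interface and method names are placeholders rather than the patch's actual code:

    class DatanodeLoopSketch {
      // Placeholder for the namenode side of the protocol.
      interface Namenode {
        void sendHeartbeat(String datanodeId) throws java.io.IOException;
        String[] getBlockWork(String datanodeId) throws java.io.IOException;
      }

      void offerService(Namenode namenode, String datanodeId) throws Exception {
        final long heartbeatIntervalMs = 3 * 1000;   // heartbeat every 3 seconds
        final long blockWorkIntervalMs = 20 * 1000;  // fetch block work every 20 seconds
        long lastBlockWork = 0;
        while (true) {
          namenode.sendHeartbeat(datanodeId);        // cheap liveness report
          long now = System.currentTimeMillis();
          if (now - lastBlockWork >= blockWorkIntervalMs) {
            // heavier, coarser-grained call: replicate or delete the returned blocks
            String[] work = namenode.getBlockWork(datanodeId);
            lastBlockWork = now;
          }
          Thread.sleep(heartbeatIntervalMs);
        }
      }
    }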



[jira] Commented: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467507 ] 

dhruba borthakur commented on HADOOP-923:
-----------------------------------------

I gathered simulated measurements of the time to process a single heartbeat on the namenode versus the number of blocks pending replication. Here is the data:

pending blocks to replicate     time to process one heartbeat (ms)
100,000                         2
500,000                         5
600,000                         6

100,000 blocks typically corresponds to about 12 TB, roughly the capacity of three typical datanodes. Thus, if three datanodes go down at the same time, the namenode spends about 2 ms of CPU time to process a single incoming heartbeat, and the global FSNamesystem lock is held for that entire 2 ms.

In the current implementation, heartbeats are sent by each datanode once every 3 seconds. A 1500-node cluster will therefore cause the namenode to spend about 3 seconds (2 ms * 1500) per heartbeat interval processing just heartbeat requests. The current DFS's scalability could be limited to about 1500 datanodes.

The above results could vary depending on the type of hardware and communication link that is being used.
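
A quick back-of-the-envelope check of the 1500-node estimate, using only the figures quoted in this comment:

    public class HeartbeatSaturation {
      public static void main(String[] args) {
        double msPerHeartbeat = 2.0;    // measured with ~100,000 pending block replications
        int datanodes = 1500;
        double heartbeatPeriodSec = 3.0;
        double busySecPerPeriod = msPerHeartbeat * datanodes / 1000.0;  // = 3.0 seconds
        // The namenode would spend the entire 3-second heartbeat interval holding the
        // FSNamesystem lock just to answer heartbeats, leaving no headroom for other work.
        System.out.println(busySecPerPeriod + " s of every " + heartbeatPeriodSec + " s");
      }
    }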

 





[jira] Updated: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-923:
------------------------------------

    Attachment: pendingTransferThread2.patch

Incorporated review comments. The change from the previous version is in the method FSNamesystem.computeDatanodeWork().



[jira] Updated: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-923:
------------------------------------

    Attachment: pendingTransferThread.patch

A background thread computes pendingTransfers and chooseTargets and stores the computed work in the relevant DatanodeDescriptor. A heartbeat from a datanode then retrieves this pre-computed work from its DatanodeDescriptor.
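
A hedged sketch of this idea: a background thread computes the replication work and parks it on a per-datanode descriptor, while the heartbeat path only drains pre-computed work. The names below are placeholders, not the patch's code:

    class DatanodeDescriptorSketch {
      private final java.util.Queue<String> pendingWork =
          new java.util.concurrent.ConcurrentLinkedQueue<String>();

      // Called by the background thread after computing pendingTransfers/chooseTargets;
      // this is the only place the expensive work happens.
      void addWork(String blockAndTargets) { pendingWork.offer(blockAndTargets); }

      // Called from the heartbeat path: cheap, and no global FSNamesystem lock required.
      String pollWork() { return pendingWork.poll(); }
    }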



[jira] Updated: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-923:
------------------------------------

    Status: Patch Available  (was: Open)

Code has been reviewed by Hairong.



[jira] Updated: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-923:
------------------------------------

    Attachment:     (was: scalableheartbeats1.patch)



[jira] Commented: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471803 ] 

Hairong Kuang commented on HADOOP-923:
--------------------------------------

Two comments:
1. I feel that it is not necessary to balance the number of transfers when the heartbeat thread picks up the replication work. First, the background thread that computes pendingTransfers has already balanced the load. Second, block replication work needs to be done as soon as possible to avoid data loss. Since the datanode has been assigned the block replication work, no other datanode is able to pick it up. If the work does not get sent to the datanode in the current heartbeat, it has to wait at least another heartbeat interval.

2. The background thread that computes pendingTransfers scans only 100 datanodes per iteration and then sleeps for 3 seconds. I feel that this approach does not scale well. For example, when the cluster size reaches 2000, a datanode's work gets computed only every 2000/100 * 3 s = 1 minute (ignoring the computation overhead), which is far less frequent than what we do now (every 3 seconds); see the sketch below. Another minor flaw is that the thread uses an index to record the next node to be checked, but if the heartbeat queue is updated between two consecutive iterations, the index may no longer point to the right node.
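
The scan-rate arithmetic in comment 2, as a tiny sketch (the 100-nodes-per-iteration and 3-second sleep come from the comment above; the rest is illustrative):

    public class ScanRate {
      public static void main(String[] args) {
        int nodesPerIteration = 100;  // datanodes scanned per pass
        int sleepSeconds = 3;         // sleep between passes
        int clusterSize = 2000;
        int secondsBetweenVisits = clusterSize / nodesPerIteration * sleepSeconds;
        // = 60 seconds: each datanode's replication work is recomputed only once a
        // minute instead of every 3 seconds, and the gap grows with cluster size.
        System.out.println(secondsBetweenVisits + " seconds between visits");
      }
    }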



[jira] Updated: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-923:
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.12.0
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Dhruba!



[jira] Updated: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-923:
------------------------------------

    Attachment:     (was: pendingTransferThread.patch)



[jira] Commented: (HADOOP-923) DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467867 ] 

dhruba borthakur commented on HADOOP-923:
-----------------------------------------

Introduce a new DatanodeProtocol call named sendBlockModifications(). The namenode returns the blocks that are to be replicated or deleted as part of this call. The existing method sendHeartbeat() just updates the heartbeat array in the namenode; it does not send back the list of blocks that are pending replication or the blocks that are to be deleted.

The datanode invokes the sendHeartbeat RPC once every 3 seconds and the sendBlockModifications RPC once every 10 heartbeats.

The namenode acquires only the heartbeat lock while processing the sendHeartbeat call, and the global FSNamesystem lock while processing the sendBlockModifications call.

This change ensures that heartbeat processing time does not depend on the number of blocks that are pending replication.
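
An illustrative namenode-side view of this locking split; the lock fields and method bodies below are placeholders, not the actual FSNamesystem code:

    class NamenodeLockingSketch {
      private final Object heartbeatLock = new Object();     // cheap per-heartbeat lock
      private final Object fsNamesystemLock = new Object();  // global namesystem lock

      void sendHeartbeat(String datanodeId) {
        synchronized (heartbeatLock) {
          // only update the last-heard-from timestamp; no replication work is computed
        }
      }

      String[] sendBlockModifications(String datanodeId) {
        synchronized (fsNamesystemLock) {
          // traverse neededReplication, choose targets and build the block commands
          return new String[0];
        }
      }
    }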

