Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2007/03/07 18:59:24 UTC

[jira] Created: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

DFS Scalability: optimize processing time of block reports
----------------------------------------------------------

                 Key: HADOOP-1079
                 URL: https://issues.apache.org/jira/browse/HADOOP-1079
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
            Reporter: dhruba borthakur


I have a cluster that has 1800 datanodes. Each datanode has around 50000 blocks and sends a block report to the namenode once every hour. This means that the namenode processes a block report once every 2 seconds. Each block report contains all blocks that the datanode currently hosts. This makes the namenode compare a huge number of blocks that remain practically unchanged between two consecutive reports. This wastes CPU on the namenode.

The problem becomes worse when the number of datanodes increases.

One proposal is to make succeeding block reports (after a successful send of a full block report) be incremental. This will make the namenode process only those blocks that were added/deleted in the last period.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Created: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by Nigel Daley <nd...@yahoo-inc.com>.
> This means that the namenode processes a block report once every 2  
> seconds.

This is an average.  In actual fact, aren't the DNs all trying to send  
them at around the same time, since they're started at roughly the same  
time?  Perhaps a DN's first block report (after registration) should  
be sent at something like

   (blockInterval + randomInt) % blockInterval

seconds and then its subsequent block reports sent every  
blockInterval seconds after its previous successful block report.
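The scheme above can be sketched in Java. Note that `(blockInterval + randomInt) % blockInterval` reduces to a uniform random offset in `[0, blockInterval)`, which is exactly what spreads the first reports out. Class and method names here are illustrative only, not actual Hadoop APIs.

```java
import java.util.Random;

// Sketch of the randomized block-report scheduling suggested above.
// All names are hypothetical, not actual Hadoop APIs.
public class BlockReportScheduler {
    private final long blockIntervalMs;
    private final Random random = new Random();

    public BlockReportScheduler(long blockIntervalMs) {
        this.blockIntervalMs = blockIntervalMs;
    }

    // (blockInterval + randomInt) % blockInterval is simply a uniform
    // offset in [0, blockInterval), spreading the first reports of
    // simultaneously started datanodes across the whole window.
    public long firstReportDelayMs() {
        return Math.floorMod(random.nextLong(), blockIntervalMs);
    }

    // Subsequent reports are anchored to the previous *successful* report,
    // so the per-datanode phase set by the first random delay is preserved.
    public long nextReportTimeMs(long lastSuccessfulReportMs) {
        return lastSuccessfulReportMs + blockIntervalMs;
    }
}
```

With an hourly interval, 1800 datanodes would then arrive spread over the hour instead of in a burst after a cluster restart.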




[jira] Commented: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479412 ] 

Raghu Angadi commented on HADOOP-1079:
--------------------------------------


With deltas, how do we handle the case where a namenode request to remove a block is lost? Two sides can't be in sync with only one side maintaining diffs... I think.

> In my test case, there were no two succeeding hourly block reports that were identical. 

This is normal. Did most of the reports result in deletion or addition of blocks during the report? 

> I had 1800 data nodes and was running randomWriter. In this scenario, using a hash to identify identical block reports might not buy us anything. 

The hash is not to identify whether the block report has changed. Both sides keep an up-to-date hash; if the hashes are the same, then the namenode and datanode have the same set of blocks. This has no relation to the previous block report. There might be some fixes needed to make sure both sides see the same set during a block report.





[jira] Commented: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12496433 ] 

Raghu Angadi commented on HADOOP-1079:
--------------------------------------


Wouldn't this result in a positive feedback loop for load on the Namenode? That is, the most likely reason for a heartbeat to fail is heavy load on the Namenode; in that case the patch will make every datanode send a blockReport, greatly increasing the load on the Namenode. Am I missing something?






[jira] Commented: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12496425 ] 

dhruba borthakur commented on HADOOP-1079:
------------------------------------------

The BlockReport exists because of the following reasons. 

1. User/admin manually deleted a bunch of blk-xxxx files on the datanode.
     This happens rarely, and a daily block report is good enough to rectify this situation.
          
2. A heartbeat response from the namenode is lost. 
      When this occurs, the datanode will remember this condition. The next block report will occur immediately after the next successful heartbeat.

Given the above two cases, we can make the default block report period 1 day. This will reduce the CPU load on the namenode tremendously, especially on large clusters.







[jira] Assigned: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur reassigned HADOOP-1079:
----------------------------------------

    Assignee: dhruba borthakur



[jira] Commented: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479224 ] 

Raghu Angadi commented on HADOOP-1079:
--------------------------------------


Namenode already knows the blocks added and deleted between block reports. I think the block report exists only to catch some exceptions.

How does a delta help catch differences between namenode and datanode? I mean, how can the two sides be sure that they have the exact same set? Maybe they can exchange some hash of all the block ids and revert to a normal report only if there is a mismatch. Also, this hash could be order independent, so that the namenode updates it with each block added instead of iterating over the set when required.
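The order-independent, incrementally updatable hash described above can be sketched with XOR: it is commutative (order does not matter) and self-inverse (removing a block undoes adding it). The class and the mixing constants are illustrative assumptions, not anything from Hadoop.

```java
// Sketch of an order-independent hash over a set of block ids.
// Names and constants are illustrative, not actual Hadoop APIs.
public class BlockSetHash {
    private long hash;

    // XOR is commutative, so insertion order does not matter; it is also
    // self-inverse, so remove() exactly undoes add(). The namenode can
    // therefore maintain the hash one block at a time instead of
    // iterating over the whole set when a report arrives.
    public void add(long blockId)    { hash ^= mix(blockId); }
    public void remove(long blockId) { hash ^= mix(blockId); }

    public long value() { return hash; }

    // 64-bit finalizer (constants borrowed from MurmurHash3) so that
    // structured, nearly sequential block ids do not cancel under XOR.
    private static long mix(long x) {
        x ^= x >>> 33; x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33; x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }
}
```

If the datanode sends its hash with each heartbeat or delta, the namenode can compare it against its own and request a full report only on mismatch.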




[jira] Commented: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479413 ] 

Raghu Angadi commented on HADOOP-1079:
--------------------------------------

> There might be some fixes needed to make sure both sides see same set during a block report.

in the normal case.




[jira] Commented: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481644 ] 

Raghu Angadi commented on HADOOP-1079:
--------------------------------------

> Actually I don't know what happens in such a case now. Whatever the datanode has is the master copy.
> Not sure what happens to blocks added after the report is sent but before it is processed.

This does lead to real problems; see HADOOP-1093.



[jira] Commented: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479369 ] 

dhruba borthakur commented on HADOOP-1079:
------------------------------------------

You are right in that block reports exist only to catch exception scenarios. Suppose a datanode fails to successfully deliver a blockReceived message to the namenode. The namenode won't know about it till the next block report. Also, if a rogue application deletes blk-* files on the datanode, or a portion of the data directories goes bad, this will be detected by the next block report.

My proposal is to make the datanode remember the report that it sent to the namenode the last time. Only those blocks that got added/deleted since the last block report will be sent in the next incremental block report. This is a mechanism to ensure that the namenode and datanode stay in sync as far as block information is concerned.

In my test case, no two succeeding hourly block reports were identical. I had 1800 datanodes and was running randomWriter. In this scenario, using a hash to identify identical block reports might not buy us anything.
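The incremental report above amounts to two set differences against the last successfully sent report. A minimal sketch, with hypothetical names (nothing here is an actual Hadoop API):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: the datanode keeps the set of block ids it last reported
// successfully and sends only the difference next time.
public class IncrementalBlockReport {
    public final Set<Long> added;
    public final Set<Long> deleted;

    public IncrementalBlockReport(Set<Long> lastReported, Set<Long> current) {
        added = new HashSet<>(current);
        added.removeAll(lastReported);   // present now, absent last time
        deleted = new HashSet<>(lastReported);
        deleted.removeAll(current);      // reported last time, gone now
    }
}
```

The namenode would apply `added` and `deleted` to its replica map. Note Raghu's objection still applies: if either side loses one update, the two sets silently diverge, so a periodic full report (or an exchanged set hash) remains necessary.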





[jira] Updated: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-1079:
-------------------------------------

    Attachment: blockReportPeriod.patch

Here is a sample patch that increases the blockReport period from 1 hour to 1 day. It also causes a blockReport to be sent after a failed heartbeat.

I would like some comments/feedback on this approach.



[jira] Commented: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12496443 ] 

dhruba borthakur commented on HADOOP-1079:
------------------------------------------

A short discussion with Sameer revealed that the case where an admin or rogue process manually deleted some blk-xxx files cannot be dealt with by a daily block report. A proposal is to make the datanode compare its in-memory data structures with what is on disk. If it finds inconsistencies, then it sends a complete block report to the namenode. 

In response to Raghu's comments: I agree that sending a block report "immediately" after a successful heartbeat (that was preceded by a failed heartbeat) could add more load on the namenode. It is enough if we can "hasten" the next block report rather than sending it immediately. I will work on this fix.
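"Hastening" rather than reporting immediately can be sketched as pulling the next report forward to a random point in a short window, so recovering datanodes do not all report at once to an already loaded namenode. All names here are hypothetical, not actual Hadoop APIs.

```java
import java.util.Random;

// Sketch of hastening the next block report after a recovered heartbeat.
// Randomizing within a short window avoids the positive feedback loop
// where every datanode reports simultaneously to a loaded namenode.
public class HastenedBlockReport {
    // Move the next report to a random point within hastenWindowMs of now,
    // but never later than the time already scheduled.
    static long hastenedReportTimeMs(long nowMs, long scheduledMs,
                                     long hastenWindowMs, Random rnd) {
        long hastened = nowMs + Math.floorMod(rnd.nextLong(), hastenWindowMs);
        return Math.min(hastened, scheduledMs);
    }
}
```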



[jira] Commented: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479417 ] 

Raghu Angadi commented on HADOOP-1079:
--------------------------------------

> With deltas, how do we handle the case where a namenode request to remove a block is lost?

Actually I don't know what happens in such a case now. Whatever the datanode has is the master copy.
Not sure what happens to blocks added after the report is sent but before it is processed.





[jira] Commented: (HADOOP-1079) DFS Scalability: optimize processing time of block reports

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12496640 ] 

Hairong Kuang commented on HADOOP-1079:
---------------------------------------

When a namenode thinks a datanode is dead, it starts to replicate all the blocks belonging to that datanode, which is very costly. A quick block report will help avoid these unnecessary block replications when the datanode is actually alive.
