
Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2007/09/14 20:54:32 UTC

[jira] Created: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

the heartbeat and task event queries interval should be set dynamically by the JobTracker
-----------------------------------------------------------------------------------------

                 Key: HADOOP-1900
                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
            Reporter: Owen O'Malley
            Assignee: Owen O'Malley


The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535840 ] 

Amareshwari Sri Ramadasu commented on HADOOP-1900:
--------------------------------------------------

bq. I wonder if instead we should just make it clusterSize/50+1? That way, small clusters will get a heartbeat of just one second, which should make them more responsive.

Agreed.

bq. Currently, MapEventsFetcherThread polls the JobTracker for completed map tasks every 5 seconds (MIN_POLL_INTERVAL). Should we also scale the polling interval in the same fashion as the heartbeat interval? If we do, some reduce tasks could sit idle for longer.

MapEventsFetcherThread can poll the JobTracker for completed map tasks in the same fashion as the heartbeat interval, with MIN_POLL_INTERVAL=5secs and MIN_POLL_INTERVAL_MAX=30secs. Whenever the TaskTracker finds that there are no map events and a reduce task is waiting for them, it will wake up the thread to fetch map events from the JobTracker.
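
The wake-up behaviour described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not taken from TaskTracker.java or from the patch, and the poll interval passed in is assumed to lie somewhere between the 5-second and 30-second bounds mentioned above.

```java
// Hypothetical sketch of a poll loop that can be woken early; not code from TaskTracker.java.
final class MapEventsPollerSketch {
  private final Object lock = new Object();
  private boolean fetchRequested = false;

  /** Called when a reduce task is waiting and no map events are available locally. */
  void requestImmediateFetch() {
    synchronized (lock) {
      fetchRequested = true;
      lock.notifyAll();
    }
  }

  /**
   * Waits until pollIntervalMillis elapses or a fetch is requested, whichever comes first.
   * Returns true if the wait ended early because a fetch was requested.
   */
  boolean awaitNextPoll(long pollIntervalMillis) {
    long deadline = System.currentTimeMillis() + pollIntervalMillis;
    synchronized (lock) {
      while (!fetchRequested) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          return false; // interval elapsed: poll on the normal schedule
        }
        try {
          lock.wait(remaining); // the loop re-checks the flag after spurious wakeups
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt for the caller
          return false;
        }
      }
      fetchRequested = false;
      return true;
    }
  }

  public static void main(String[] args) {
    MapEventsPollerSketch poller = new MapEventsPollerSketch();
    poller.requestImmediateFetch();
    System.out.println("woken early: " + poller.awaitNextPoll(5000L));
  }
}
```

With this shape, a long polling interval would not leave a waiting reduce task idle, because the request flag ends the wait immediately.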


[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1900:
----------------------------------

    Status: Open  (was: Patch Available)

Amareshwari, could you please move {{TaskCompletionEventResponse}} to a separate TaskCompletionEventResponse.java? It is currently in JobTracker.java, and it shouldn't be a 'public' class either.

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535651 ] 

Doug Cutting commented on HADOOP-1900:
--------------------------------------

> For every additional 500 nodes we increase heartbeat interval by 10 secs.

I wonder if instead we should just make it clusterSize/50+1?  That way, small clusters will get a heartbeat of just one second, which should make them more responsive.
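
A minimal sketch of the formula suggested above, assuming integer division and an interval measured in seconds. The class and method names are hypothetical and are not taken from the patch.

```java
// Hypothetical helper illustrating the clusterSize/50+1 suggestion; not code from the patch.
final class HeartbeatIntervalSketch {
  // Assumption: one extra second of heartbeat interval per 50 TaskTrackers.
  static final int NODES_PER_SECOND = 50;

  /** Returns the heartbeat interval in seconds for the given cluster size. */
  static int heartbeatSeconds(int clusterSize) {
    if (clusterSize < 0) {
      throw new IllegalArgumentException("clusterSize must be non-negative");
    }
    // Integer division: clusters of fewer than 50 nodes get a 1-second heartbeat.
    return clusterSize / NODES_PER_SECOND + 1;
  }

  public static void main(String[] args) {
    for (int size : new int[] {10, 120, 390, 2000}) {
      System.out.println(size + " nodes -> " + heartbeatSeconds(size) + " s");
    }
  }
}
```

Under these assumptions a 120-node cluster would heartbeat every 3 seconds and a 390-node cluster every 8 seconds.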


[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Patch Available  (was: Open)

Submitting the patch again after testing on a 390-node cluster.
Fixed a deadlock issue in the previous patch.

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542670 ] 

Devaraj Das commented on HADOOP-1900:
-------------------------------------

One of the requirements that this issue is supposed to address is JobTracker busyness (see the issue description). By basing the heartbeat purely on the cluster size, we are not addressing that requirement (unless we say that the cluster-size-based frequency would handle it). Just wanted to bring it to everyone's notice.


[jira] Issue Comment Edited: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542122 ] 

amareshwari edited comment on HADOOP-1900 at 11/13/07 3:57 AM:
----------------------------------------------------------------------------

With the patch attached, I ran sort benchmarks on a 390-node cluster and a 120-node cluster. The performance is almost the same as with the trunk.

To simulate busyness at the JobTracker, I ran the sort benchmarks on the 120-node cluster with number of handlers=4 and max queue size per handler=10, but there are drops and lost TaskTrackers both with and without the patch.

Thus the cluster size factor (clusterSize/50+1) is fine, but the busy factor has to be tuned more.

I propose the following to tune busy factor:

We have threshouldDropCount = clusterSize/10;

We increment busyFactor by HEARTBEAT_BUSY_FACTOR (say 2 seconds) for every 10% of the cluster size in drops.
if(dropCount > threshouldDropCount) {
  busyFactor += (dropCount/threshouldDropCount)*HEARTBEAT_BUSY_FACTOR;
}

If the JobTracker is not busy for '_notBusyPeriod_', then we will decrement busyFactor by HEARTBEAT_BUSY_FACTOR.
There are 2 RPCs to be processed at the JobTracker, i.e. heartbeat and task completion events. Let the processing time per RPC be 2 seconds.
Here, notBusyPeriod is calculated as:
notBusyPeriod = (clusterSize/#handlers)*processingTime*2;

To stabilize,
consider that we have drops at t, and we increment the heartbeat interval by the busy factor, _b_. If we notice that _b_ is small enough to get decremented and we don't see drops at the new interval, then we stabilize.

Thoughts?
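
As a worked illustration of the notBusyPeriod formula above, here is a small sketch. It assumes integer division and seconds as the unit; the class and method names are hypothetical and do not come from the patch.

```java
// Hypothetical helper for the notBusyPeriod formula; names and units are assumptions.
final class NotBusyPeriodSketch {
  // Two RPC types per TaskTracker (heartbeat and task completion events), as stated in the comment.
  static final int RPC_TYPES = 2;

  /** (clusterSize / handlers) * processingTime * 2, with the result in seconds. */
  static int notBusyPeriodSeconds(int clusterSize, int handlers, int processingTimeSeconds) {
    if (handlers <= 0) {
      throw new IllegalArgumentException("handlers must be positive");
    }
    // Integer division is an assumption; the comment does not say how to round.
    return (clusterSize / handlers) * processingTimeSeconds * RPC_TYPES;
  }

  public static void main(String[] args) {
    System.out.println(notBusyPeriodSeconds(120, 4, 2) + " s");
  }
}
```

For the 120-node test setup with 4 handlers and the assumed 2-second processing time, the formula gives (120/4)*2*2 = 120 seconds.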


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539740 ] 

Hadoop QA commented on HADOOP-1900:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12368873/patch-1900.txt
against trunk revision r591389.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs -1.  The patch appears to introduce 1 new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1055/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1055/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1055/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1055/console

This message is automatically generated.


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542599 ] 

Sameer Paranjpye commented on HADOOP-1900:
------------------------------------------

> But I don't see the importance of faster heartbeats on faster hardware, especially if it adds complexity to the code.

+1


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542535 ] 

Owen O'Malley commented on HADOOP-1900:
---------------------------------------

Oops, clearly I meant max instead of min. I still think an adaptive response is overkill for this. 


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547040 ] 

Hadoop QA commented on HADOOP-1900:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12370618/patch-1900.txt
against trunk revision r599703.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1214/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1214/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1214/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1214/console

This message is automatically generated.

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535894 ] 

Devaraj Das commented on HADOOP-1900:
-------------------------------------

bq. So, one way to take this into account might be to maintain an average time-to-complete for all tasks in the system (of current jobs) and factor that into the scaling of the intervals.

The TaskTracker currently pings the JobTracker asking for a task as soon as it finishes executing a task. I think that should be the behavior to keep the utilization of the tasktrackers optimal (of course, in general we could do better by sending it a bunch of tasks every time it asks for a new task, but that's the subject of another jira).

bq. Also, while we are at this, I say we should start to consider busy-ness of JobTracker too, along with the cluster-size. So, for e.g., if the individual tasks are taking in the order of minutes, then it might not matter much if we send one every 20s or so, in some cases it might. I know that the sort's map tasks take around 40s each... 

I propose a change to the status message in the heartbeat - the tasktracker can compare the current task status with the previous one and if it finds the status to be the same, it doesn't send the complete status object to the JobTracker, but just a flag saying it is a duplicate or something to that effect. That will reduce the data per RPC considerably for long running tasks whose statuses don't change frequently and also reduce the processing load on the JobTracker.

Thoughts?
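
The duplicate-status idea above might look roughly like the following sketch. All types here are hypothetical stand-ins for illustration; they are not Hadoop's TaskStatus or heartbeat classes, and the field set is an assumption.

```java
import java.util.Objects;

// Hypothetical sketch: these types are illustrative and are not Hadoop classes.
final class StatusDedupSketch {

  /** Minimal stand-in for a task status report. */
  static final class StatusSnapshot {
    final String taskId;
    final float progress;
    final String state;

    StatusSnapshot(String taskId, float progress, String state) {
      this.taskId = taskId;
      this.progress = progress;
      this.state = state;
    }

    @Override
    public boolean equals(Object other) {
      if (!(other instanceof StatusSnapshot)) {
        return false;
      }
      StatusSnapshot that = (StatusSnapshot) other;
      return taskId.equals(that.taskId)
          && Float.compare(progress, that.progress) == 0
          && state.equals(that.state);
    }

    @Override
    public int hashCode() {
      return Objects.hash(taskId, progress, state);
    }
  }

  /** What the tasktracker would put in the heartbeat for one task. */
  static final class Payload {
    final boolean duplicate;
    final StatusSnapshot status; // null when duplicate is true

    Payload(boolean duplicate, StatusSnapshot status) {
      this.duplicate = duplicate;
      this.status = status;
    }
  }

  private StatusSnapshot lastSent; // last full status sent for this task

  Payload nextPayload(StatusSnapshot current) {
    if (current.equals(lastSent)) {
      return new Payload(true, null); // unchanged: send only the flag
    }
    lastSent = current;
    return new Payload(false, current); // changed: send the full status
  }
}
```

A long-running task whose status does not change between heartbeats would then contribute only the flag to each RPC.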

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546175 ] 

Amareshwari Sri Ramadasu commented on HADOOP-1900:
--------------------------------------------------

The javadoc warning was on hbase -
  [javadoc] /export/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/contrib/hbase/src/java/org/apache/hadoop/hbase/shell/CreateCommand.java:77: warning - @param argument "table" is not a parameter name.
 
The contrib tests also failed in hbase.
I'm submitting the patch again for Hudson.

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539651 ] 

Sameer Paranjpye commented on HADOOP-1900:
------------------------------------------

Do we have any benchmarks to show that this helps?


[jira] Issue Comment Edited: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542122 ] 

amareshwari edited comment on HADOOP-1900 at 11/13/07 8:43 PM:
----------------------------------------------------------------------------

With the patch attached, I ran sort benchmarks on a 390-node cluster and a 120-node cluster. The performance is almost the same as with the trunk.

To simulate busyness at the JobTracker, I ran the sort benchmarks on the 120-node cluster with number of handlers=4 and max queue size per handler=10, but there are drops and lost TaskTrackers both with and without the patch.

Thus the cluster size factor (clusterSize/50+1) should be fine. But the busyFactor has to be better tuned.

I propose the following to tune busy factor:

We have threshouldDropCount = clusterSize/10;

We increment busyFactor by HEARTBEAT_BUSY_FACTOR (say 2 seconds) for every 10% of the cluster size in drops.
if(dropCount > threshouldDropCount) {
  busyFactor += (dropCount/threshouldDropCount)*HEARTBEAT_BUSY_FACTOR;
}

For example, on a 100 node cluster, if we see 40 drops, busyFactor is incremented by 8 seconds (40/10*2).

If the JobTracker is not busy for 'observationInterval', then we will decrement busyFactor by HEARTBEAT_BUSY_FACTOR.

To calculate observationInterval:
There are 2 RPCs to be processed at the JobTracker, i.e. heartbeat and task completion events. Let the processing time per RPC be 2 seconds.
Here, observationInterval is calculated as:
observationInterval = (clusterSize/#handlers)*processingTime*2;

Assuming that we don't see drops at a certain observationInterval (and the corresponding busyFactor), we decrement the busyFactor by HEARTBEAT_BUSY_FACTOR. This can be done in a loop, until we see drops. When we see drops, we increment it by the constant HEARTBEAT_BUSY_FACTOR, and stabilize there until we see drops again.

For example, on a 100-node cluster, we start with a 2-second heartbeat interval.
We see 40 drops, then busyFactor = 8; then, new interval = (2+8) = 10;
We don't see drops for 40 seconds; new interval = 10-2 = 8;
We don't see drops for 40 seconds; new interval = 8-2 = 6;
We don't see drops for 40 seconds; new interval = 6-2 = 4;
We see drops; then new interval = 6;
Say we don't see drops for a long time; we stabilize here.
Say we see 30 drops after some time, busyFactor = 6; new interval = 6+6 = 12;
And the loop repeats.

Thoughts?
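
The loop in the example above can be sketched as a small state machine. This is a sketch under stated assumptions, not code from the patch: the class name is hypothetical, the floor of zero on the busy factor and the guard for clusters of fewer than 10 nodes are additions, and the step where the example only says "we see drops" is assumed to be 15 drops so that exactly one HEARTBEAT_BUSY_FACTOR is added.

```java
// Hypothetical sketch of the proposed busy-factor loop; not code from the patch.
final class BusyFactorSketch {
  static final int HEARTBEAT_BUSY_FACTOR = 2; // seconds, the "say 2 seconds" value from the comment

  private final int clusterSize;
  private final int baseIntervalSeconds;
  private int busyFactor = 0;

  BusyFactorSketch(int clusterSize, int baseIntervalSeconds) {
    this.clusterSize = clusterSize;
    this.baseIntervalSeconds = baseIntervalSeconds;
  }

  /** Applies the increment rule for the drops seen in one observation interval. */
  void onDrops(int dropCount) {
    int threshold = clusterSize / 10; // 10% of the cluster
    // Assumption: clusters of fewer than 10 nodes (threshold 0) never trigger the rule.
    if (threshold > 0 && dropCount > threshold) {
      busyFactor += (dropCount / threshold) * HEARTBEAT_BUSY_FACTOR;
    }
  }

  /** Applies the decrement rule after one observation interval without drops. */
  void onQuietInterval() {
    // Assumption: the busy factor never goes below zero.
    busyFactor = Math.max(0, busyFactor - HEARTBEAT_BUSY_FACTOR);
  }

  int intervalSeconds() {
    return baseIntervalSeconds + busyFactor;
  }

  public static void main(String[] args) {
    BusyFactorSketch tracker = new BusyFactorSketch(100, 2);
    tracker.onDrops(40);
    System.out.println("after 40 drops: " + tracker.intervalSeconds());
    for (int i = 0; i < 3; i++) {
      tracker.onQuietInterval();
      System.out.println("after quiet interval: " + tracker.intervalSeconds());
    }
    tracker.onDrops(15); // assumed drop count for the unspecified "we see drops" step
    System.out.println("after 15 drops: " + tracker.intervalSeconds());
    tracker.onDrops(30);
    System.out.println("after 30 drops: " + tracker.intervalSeconds());
  }
}
```

With a 100-node cluster and a 2-second base interval, this walks through the intervals 10, 8, 6, 4, 6 and 12 seconds, matching the example.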


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539067 ] 

Amareshwari Sri Ramadasu commented on HADOOP-1900:
--------------------------------------------------

In the attached patch, the job tracker periodically recalculates the heartbeat interval from both the cluster size and how busy the jobtracker is. If the jobtracker is busy, the interval is incremented by a busyFactor. If it is not busy for two consecutive periods, the interval is decremented by the busyFactor.

The map events polling interval is derived from the heartbeat interval, so it needs no separate recalculation. It is calculated as follows:
polling_interval = heartbeat_interval/3;
if polling_interval < MIN_POLLING_INTERVAL, then polling_interval = MIN_POLLING_INTERVAL;
if polling_interval > MAX_POLLING_INTERVAL, then polling_interval = MAX_POLLING_INTERVAL;
MapEventsFetcherThread is notified if a reduce task doesn't find map events at the tasktracker.
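
The clamping rule above can be sketched in Java as follows. This is only an illustration, not the code from the patch: the class and method names are hypothetical, and the bounds are passed in as parameters because this comment does not state their values (the usage note assumes the 5 and 30 seconds mentioned earlier in the thread).

```java
/** Hypothetical illustration of the polling-interval rule; not taken from the patch. */
final class PollingIntervalSketch {

    /**
     * Derives the map-events polling interval from the heartbeat interval
     * and clamps it to [minPolling, maxPolling]. All values are in seconds.
     */
    static int pollingInterval(int heartbeatInterval, int minPolling, int maxPolling) {
        int polling = heartbeatInterval / 3; // integer division, as in the formula above
        if (polling < minPolling) {
            polling = minPolling;
        }
        if (polling > maxPolling) {
            polling = maxPolling;
        }
        return polling;
    }
}
```

Assuming bounds of 5 and 30 seconds, a 30-second heartbeat interval gives a 10-second polling interval, while a 3-second heartbeat interval is raised to the 5-second minimum.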

bq.I propose a change to the status message in the heartbeat - the tasktracker can compare the current task status with the previous one and if it finds the status to be the same, it doesn't send the complete status object to the JobTracker, but just a flag saying it is a duplicate or something to that effect. That will reduce the data per RPC considerably for long running tasks whose statuses don't change frequently and also reduce the processing load on the JobTracker.

This will be addressed in a separate JIRA issue.




[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Patch Available  (was: Open)

Patch with review comments incorporated and tested.

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.


[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Attachment: patch-1900.txt



[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Attachment: patch-1900.txt



[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545940 ] 

Hadoop QA commented on HADOOP-1900:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12370276/patch-1900.txt
against trunk revision r598555.

    @author +1.  The patch does not contain any @author tags.

    javadoc -1.  The javadoc tool appears to have generated  messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests -1.  The patch failed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1175/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1175/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1175/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1175/console

This message is automatically generated.



[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542122 ] 

Amareshwari Sri Ramadasu commented on HADOOP-1900:
--------------------------------------------------


With the patch attached, I ran sort benchmarks on a 390 node cluster and a 120 node cluster. The performance is almost the same as with the trunk.

To simulate busyness at the job tracker, I ran the sort benchmarks on the 120 node cluster with number of handlers = 4 and max queue size per handler = 10, but there are drops and lost task trackers both with the patch and without.

Thus the cluster size factor of (clusterSize/50+1) is fine, but the busy factor has to be tuned more.

I propose the following to tune busy factor:

threshouldDropCount = clusterSize/10;
if(dropCount > threshouldDropCount) {
  busyFactor += (dropCount/threshouldDropCount)*HEARTBEAT_BUSY_FACTOR;
}

If the job tracker is not busy for '_notBusyPeriod_', then we will decrement busyFactor by HEARTBEAT_BUSY_FACTOR.
We have 2 RPCs to be processed at the jobtracker, i.e. heartbeat and task completion events. Let the processing time per RPC be 2 seconds.
Here, notBusyPeriod is calculated as:
notBusyPeriod = (clusterSize/#handlers)*processingTime*2;

To stabilize:
Consider that we have drops at time t, and we increment the heartbeat interval by the busy factor, _b_. If _b_ is small enough and we don't see drops at the new interval, then we stabilize.

Thoughts?



[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547776 ] 

Hadoop QA commented on HADOOP-1900:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12370812/patch-1900.txt
against trunk revision r600244.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1242/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1242/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1242/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1242/console

This message is automatically generated.



[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1900:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Amareshwari!



[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547797 ] 

Devaraj Das commented on HADOOP-1900:
-------------------------------------

Sorry, one more comment: the "if (diff > minWait)" condition should check for greater-than-or-equal-to.




[jira] Issue Comment Edited: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542122 ] 

amareshwari edited comment on HADOOP-1900 at 11/13/07 4:53 AM:
----------------------------------------------------------------------------

With the patch attached, I ran sort benchmarks on a 390 node cluster and a 120 node cluster. The performance is almost the same as with the trunk.

To simulate busyness at the job tracker, I ran the sort benchmarks on the 120 node cluster with number of handlers = 4 and max queue size per handler = 10, but there are drops and lost task trackers both with the patch and without.

Thus Cluster size factor as (clusterSize/50+1) should be fine. But the busyFactor has to be better tuned.

I propose the following to tune busy factor:

We have threshouldDropCount = clusterSize/10;

We increment busyFactor by HEARTBEAT_BUSY_FACTOR (say 2 secs) for every 10% of the cluster size in drops.
if(dropCount > threshouldDropCount) {
  busyFactor += (dropCount/threshouldDropCount)*HEARTBEAT_BUSY_FACTOR;
}

For example, on a 100 node cluster, if we see 40 drops, busyFactor is incremented by 8 seconds (40/10*2).
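
The increment rule can be written out as a small Java sketch. This is only an illustration under stated assumptions: the class and method names are hypothetical, HEARTBEAT_BUSY_FACTOR is fixed at the 2 seconds suggested above, and the guard against a zero threshold (clusters with fewer than 10 nodes) is an addition that is not part of the proposal.

```java
/** Hypothetical illustration of the proposed busyFactor increment; not taken from the patch. */
final class BusyFactorSketch {

    /** Step size in seconds; the proposal suggests "say 2secs". */
    static final int HEARTBEAT_BUSY_FACTOR = 2;

    /**
     * Returns the busyFactor after observing dropCount dropped calls.
     * thresholdDropCount corresponds to threshouldDropCount in the proposal.
     */
    static int increasedBusyFactor(int busyFactor, int dropCount, int clusterSize) {
        int thresholdDropCount = clusterSize / 10;
        // Assumption: skip the rule when the threshold is 0 to avoid dividing by zero.
        if (thresholdDropCount > 0 && dropCount > thresholdDropCount) {
            // Integer division: only whole multiples of the threshold count.
            busyFactor += (dropCount / thresholdDropCount) * HEARTBEAT_BUSY_FACTOR;
        }
        return busyFactor;
    }
}
```

With these assumptions, increasedBusyFactor(0, 40, 100) gives 8 and increasedBusyFactor(0, 30, 100) gives 6, matching the figures used in this comment.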

If the job tracker is not busy for 'observationInterval', then we will decrement busyFactor by HEARTBEAT_BUSY_FACTOR.

To calculate observationInterval:
We have 2 RPCs to be processed at the jobtracker, i.e. heartbeat and task completion events. Let the processing time per RPC be 2 seconds.
Here, observationInterval is calculated as:
observationInterval = (clusterSize/#handlers)*processingTime*2;
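
As a rough check of the quiet-period formula above, here is a minimal Java sketch. The class and method names are hypothetical, and the handler count of 10 in the usage note is an assumption, chosen because it reproduces the 40 seconds used in the example for a 100 node cluster.

```java
/** Hypothetical illustration of the quiet-period formula; not taken from the patch. */
final class ObservationIntervalSketch {

    /**
     * Seconds without drops after which busyFactor may be decremented.
     * The trailing factor of 2 accounts for the two RPC types
     * (heartbeat and task completion events).
     */
    static int observationIntervalSeconds(int clusterSize, int handlers, int processingTimeSeconds) {
        return (clusterSize / handlers) * processingTimeSeconds * 2;
    }
}
```

For a 100 node cluster with an assumed 10 handlers and 2 seconds per RPC, this gives (100/10)*2*2 = 40 seconds. With the 4 handlers used in the 120 node test it gives (120/4)*2*2 = 120 seconds.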

Assuming that we don't see drops during an observationInterval (at the corresponding busyFactor), we decrement the busyFactor by HEARTBEAT_BUSY_FACTOR. This can be repeated in a loop until we see drops. When we see drops, we increment it by the constant HEARTBEAT_BUSY_FACTOR and stabilize there until we see drops again.

For example, on a 100 node cluster, we start with a 2 second heartbeat interval.
We see 40 drops; then busyFactor = 8 and new interval = (2+8) = 10;
We don't see drops for 40 seconds; new interval = 10-2 = 8;
We don't see drops for 40 seconds; new interval = 8-2 = 6;
We don't see drops for 40 seconds; new interval = 6-2 = 4;
We see drops; then new interval = 6;
Say we don't see drops for a long time; we stabilize here.
Say we see 30 drops after some time; busyFactor = 6; new interval = 6+6 = 12;
And the loop repeats.

Thoughts?

      was (Author: amareshwari):
    With the patch attached, I ran sort benchmarks on 390 node cluster and 120 node cluster. The performance is almost the same as with the trunk.

To simulate busyness at the job tracker, I ran the sort benchmarks on 120 node cluster with number of handlers=4 and with max queue size per handler =10, but there are drops and lost task trackers with the patch and without. 

Thus the cluster size factor of (clusterSize/50+1) is fine, but the busy factor has to be tuned more.

I propose the following to tune busy factor:

We have threshouldDropCount = clusterSize/10;

We increment busyFactor by HEARTBEAT_BUSY_FACTOR (say 2 secs) for 10% of the cluster size in drops.
if(dropCount > threshouldDropCount) {
  busyFactor += (dropCount/threshouldDropCount)*HEARTBEAT_BUSY_FACTOR;
}

If the job tracker is not busy for '_notBusyPeriod_', then we will decrement busyFactor by HEARTBEAT_BUSY_FACTOR.
We have 2 RPCs to be processed at the jobtracker, i.e. heartbeat and task completion events. Let the processing time per RPC be 2 seconds.
Here, notBusyPeriod is calculated as:
notBusyPeriod = (clusterSize/#handlers)*processingTime*2;

To stabilize:
Consider that we have drops at time t, and we increment the heartbeat interval by the busy factor, _b_. If _b_ is small enough to get decremented and we don't see drops at the new interval, then we stabilize.

Thoughts?
  


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542530 ] 

Devaraj Das commented on HADOOP-1900:
-------------------------------------

Owen, by what you suggested, it appears that the heartbeat interval would be 2 seconds for all cluster configurations with more than 20 nodes. This seems way too much.

After some thought, I am tending to agree with Owen that backoff may be difficult to control. So here is a simplified proposal:

1) Monitor the average time we take to process an RPC.

2) Assuming that every RPC can be processed within millisecond(s), the average number of RPCs that the server can process per minute (RPC-processed-per-minute) is (60000 / time-per-rpc), with time-per-rpc in milliseconds. Assuming time-per-rpc is ~10 msec, ~6000 RPCs can be processed in a minute. Since the heartbeat RPC invocation locks the JobTracker, the number of handlers actually doesn't matter much.

3) The heartbeat interval should be (clustersize/RPC-processed-per-minute) minutes.
For example, if ClusterSize = 1000, the heartbeat interval is set to 1000/6000 min = 10 sec.

4) taskCompletionEvents: this RPC is treated no differently than the heartbeat RPC. In addition to regular polling, this RPC also happens on demand, i.e., a TaskTracker invokes this RPC whenever a ReduceTask asks for MapcompletionEvents and the TaskTracker has nothing to give back (a lower cap of 5 seconds is set between two on-demand RPCs). This is similar to the way heartbeat RPCs work: whenever tasks finish, the TaskTracker sends a heartbeat.
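
Steps 2 and 3 above can be sketched in Java as follows. This is an illustration of the arithmetic only: the class and method names are hypothetical, and how the average time per RPC would actually be measured is left out.

```java
/** Hypothetical illustration of the RPC-rate based heartbeat interval; not taken from the patch. */
final class RpcRateHeartbeatSketch {

    /**
     * Heartbeat interval in seconds for a cluster of the given size,
     * given the measured average time to process one RPC in milliseconds.
     */
    static double heartbeatIntervalSeconds(int clusterSize, double timePerRpcMillis) {
        double rpcsPerMinute = 60000.0 / timePerRpcMillis;    // step 2
        double intervalMinutes = clusterSize / rpcsPerMinute; // step 3
        return intervalMinutes * 60.0;
    }
}
```

For ClusterSize = 1000 and ~10 msec per RPC this gives about 10 seconds, as in the example above. Algebraically the interval is just clusterSize multiplied by time-per-rpc, i.e. the time needed to serve one heartbeat from every TaskTracker back to back.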

What do others think?



[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542539 ] 

Devaraj Das commented on HADOOP-1900:
-------------------------------------

I think that in my last proposal the adaptiveness is not going to be that much of an overkill, since I am only suggesting that we monitor the average time to process an RPC. That will take care of things like beefy hardware vs. rudimentary hardware.



[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Patch Available  (was: Open)

Tested on a 500 node cluster.



[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Attachment: patch-1900.txt

bq. Amareshwari, could you please move TaskCompletionEventResponse to a separate TaskCompletionEventResponse.java? It is currently in JobTracker.java, and it shouldn't be a 'public' class either.

I removed the class TaskCompletionEventResponse. Since we want to use the heartbeat interval as the polling interval, we don't need this class any more.



[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548239 ] 

Hudson commented on HADOOP-1900:
--------------------------------

Integrated in Hadoop-Nightly #322 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/322/])


[jira] Issue Comment Edited: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535840 ] 

amareshwari edited comment on HADOOP-1900 at 10/18/07 3:07 AM:
----------------------------------------------------------------------------

bq. I wonder if instead we should just make it clusterSize/50+1? That way, small clusters will get a heartbeat of just one second, which should make them more responsive.

Agreed.

bq. Now, MapEventsFetcherThread polls jobtracker for completed map tasks for every 5 secs (MIN_POLL_INTERVAL). Shall we change polling interval also in the similar fashion as heartbeat interval? But, here some reduce tasks could be idle for longer time.

We can have MapEventsFetcherThread poll the jobtracker for completed map tasks in a similar fashion to the heartbeat interval, with MIN_POLL_INTERVAL=5secs and MAX_POLL_INTERVAL=30secs. Whenever the tasktracker finds that there are no map events and a reduce task is waiting for them, it will wake up the thread to fetch map events from the jobtracker.

      was (Author: amareshwari):
    bq. I wonder if instead we should just make it clusterSize/50+1? That way, small clusters will get a heartbeat of just one second, which should make them more responsive.

Agreed.

bq. Now, MapEventsFetcherThread polls jobtracker for completed map tasks for every 5 secs (MIN_POLL_INTERVAL). Shall we change polling interval also in the similar fashion as heartbeat interval? But, here some reduce tasks could be idle for longer time.

We can have MapEventsFetcherThread polling jobtracker for completed map tasks  in the similar fashion as heartbeat interval with MIN_POLL_INTERVAL=5secs and MIN_POLL_INTERVAL_MAX=30secs.  And Whenever tasktracker finds there are no mapevents and reduce task is waiting for the map events, it will wakeup the thread to fetch map events from job tracker.
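
The wake-up behaviour described in this comment can be sketched with a plain Java monitor. This is only an illustration under stated assumptions, not the code from the attached patch: the class name MapEventsPollSketch and its methods are hypothetical, and only the 5 s and 30 s bounds come from the comment.

```java
/** Hypothetical sketch of a polling wait that can be cut short; not from patch-1900.txt. */
final class MapEventsPollSketch {
    static final long MIN_POLL_INTERVAL_MS = 5_000L;   // 5 s lower bound from the comment
    static final long MAX_POLL_INTERVAL_MS = 30_000L;  // 30 s upper bound from the comment

    private final Object lock = new Object();
    private boolean fetchRequested = false;

    /** Clamp a suggested interval into [MIN_POLL_INTERVAL_MS, MAX_POLL_INTERVAL_MS]. */
    static long clampInterval(long suggestedMs) {
        return Math.max(MIN_POLL_INTERVAL_MS, Math.min(suggestedMs, MAX_POLL_INTERVAL_MS));
    }

    /** Called when a reduce task is waiting and no map events are available locally. */
    void requestFetch() {
        synchronized (lock) {
            fetchRequested = true;
            lock.notifyAll();
        }
    }

    /**
     * Called by the fetcher thread between polls. Blocks for at most waitMs and
     * returns true if a fetch was requested, false if the full interval elapsed.
     */
    boolean awaitNextPoll(long waitMs) {
        long deadline = System.nanoTime() + waitMs * 1_000_000L;
        synchronized (lock) {
            while (!fetchRequested) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    return false;  // interval elapsed without a request
                }
                try {
                    lock.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();  // preserve the interrupt status
                    return false;
                }
            }
            fetchRequested = false;  // consume the request
            return true;
        }
    }
}
```

In this sketch the fetcher thread would call awaitNextPoll(clampInterval(...)) in its loop, and the code path that serves a waiting reduce task would call requestFetch() when it finds no events, so no caller other than the fetcher thread ever sleeps.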
  
> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542232 ] 

Owen O'Malley commented on HADOOP-1900:
---------------------------------------

I really really don't think the complexity is justified. If we have too many levels of back offs and retries it will be very hard to control the performance. I'd propose

{code}
   heartbeatPeriodSeconds = max(2, slaveNodes / 10);
{code}

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543270 ] 

Hadoop QA commented on HADOOP-1900:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12369658/patch-1900.txt
against trunk revision r595563.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1111/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1111/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1111/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1111/console

This message is automatically generated.

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Assigned: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu reassigned HADOOP-1900:
------------------------------------------------

    Assignee: Amareshwari Sri Ramadasu  (was: Owen O'Malley)

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Attachment: patch-1900.txt

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535879 ] 

Arun C Murthy commented on HADOOP-1900:
---------------------------------------

bq. I wonder if instead we should just make it clusterSize/50+1? That way, small clusters will get a heartbeat of just one second, which should make them more responsive.

+1

I'd like to see some numbers about how long it takes to process a heartbeat etc. before we decide on the actual scaling factors (both up and down). Given that we've run so far on clusters of 2000 nodes with heartbeat-interval of 10s, I'd suspect scaling it up by 10s for every 500 nodes is too conservative... anyway I'll believe the numbers when we have them.

Also, while we are at this, I say we should start to consider the *busy-ness* of the JobTracker too, along with the cluster size. For example, if the individual tasks take on the order of minutes, then it might not matter much if we send a heartbeat every 20s or so; in some cases it might. I know that the sort's map tasks take around 40s each...

So, one way to take this into account might be to maintain an average time-to-complete for all tasks in the system (of current jobs) and factor that into the scaling of the intervals.



> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Open  (was: Patch Available)

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547802 ] 

Amareshwari Sri Ramadasu commented on HADOOP-1900:
--------------------------------------------------

bq. "if (diff > minWait) " condition should check for greater-than-or-equal-to
changed

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541755 ] 

Arun C Murthy commented on HADOOP-1900:
---------------------------------------

After more thought, some observations:

1. Heartbeat Interval Ranges

I believe an initial heartbeat interval of (clustersize/50) is too aggressive for small clusters, e.g. it leads to 1s for a 50-node cluster, 2s for 100 nodes, etc. I state this with care, since there isn't much that tasks can accomplish in a 2-3 second interval. Instead, speaking from experience, I'd like to see the chosen algorithm achieve the following intervals for the given cluster sizes:

|| Cluster Size || Heartbeat Interval (in secs) ||
| < 100 | 5s |
| 100-500 | 5s-10s |
| 500-1000 | 10s-15s |
| 1000-1500 | 15s-20s |
| 1500-2000 | 20s+ |

These numbers are in line with observed performance on real-world clusters, also keeping in mind that any interval <5s is probably not going to be able to update much.

2. Dynamic Scaling of HeartBeat Intervals

I propose we model the back-off strategy loosely on TCP's _slow start_, i.e. put reliability above performance. When we notice a significant number of dropped RPCs, the first priority is to ensure that it doesn't occur again. Keeping that in mind, I propose we double the current heartbeat interval (up to the above limits, section 1), and keep doubling until we see no more dropped calls. Once we achieve that reliability goal, I propose we decrease the heartbeat interval slowly (say by 1s at a time) until we achieve stability, i.e. no more dropped calls.

E.g. 

Cluster size of 100 nodes.

|| Time || Noticed Behaviour || Reaction on Heartbeat Interval ||
| t0 | | 5s |
| t1 | dropped calls (say 10% of cluster-size i.e. 10 dropped calls) | increase to 10s |
| t2 | no more dropped calls | decrease to 9s |
| t3 | no more dropped calls | decrease to 8s |
| t4 | no more dropped calls | decrease to 7s |
| t5 | dropped calls | increase to 8s |
| t6 | no more dropped calls | stabilize at 8s |
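
The behaviour in the example table can be sketched as a small state machine. This is a hedged illustration only, not code from any attached patch: the class HeartbeatBackoffSketch, its update method, and the boolean "significant drops" input are hypothetical names, and the decision to resume doubling if drops reappear after stabilizing is an assumption that the proposal does not spell out.

```java
/** Hypothetical sketch of the slow-start style back-off; not from any attached patch. */
final class HeartbeatBackoffSketch {
    private final int minSeconds;  // lower limit for this cluster size (assumed input)
    private final int maxSeconds;  // upper limit for this cluster size (assumed input)
    private int interval;          // current heartbeat interval in seconds
    private boolean descending;    // true while probing downward 1 s at a time
    private boolean stable;        // true once a downward probe has failed

    HeartbeatBackoffSketch(int minSeconds, int maxSeconds) {
        this.minSeconds = minSeconds;
        this.maxSeconds = maxSeconds;
        this.interval = minSeconds;
    }

    /** Feed one observation window; returns the interval to use next. */
    int update(boolean significantDrops) {
        if (significantDrops) {
            if (descending) {
                // The last 1 s decrease went too far: step back up and hold.
                interval = Math.min(interval + 1, maxSeconds);
                descending = false;
                stable = true;
            } else {
                // Reliability first: double, up to the limit (assumed also after stabilizing).
                interval = Math.min(interval * 2, maxSeconds);
                stable = false;
            }
        } else if (!stable && interval > minSeconds) {
            // No significant drops: probe downward slowly.
            interval -= 1;
            descending = true;
        }
        return interval;
    }
}
```

With limits of 5 s and 10 s for a 100-node cluster, feeding the observations from the table (drops, three clean windows, drops, one clean window) yields 10, 9, 8, 7, 8, 8 seconds.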


Thoughts?

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Attachment: patch-1900.txt

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Attachment: patch-1900.txt

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542547 ] 

Doug Cutting commented on HADOOP-1900:
--------------------------------------

I don't see why finding an optimal heartbeat time is critical.  Scaling as the cluster grows seems reasonable, and using a safe scaling factor so that the jobtracker is not overwhelmed even on slow hardware seems prudent.  But I don't see the importance of faster heartbeats on faster hardware, especially if it adds complexity to the code.  Unless someone can provide a reason why this is important, I also feel an adaptive mechanism is overkill.


> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542804 ] 

Owen O'Malley commented on HADOOP-1900:
---------------------------------------

Devaraj, it does address job tracker busyness because it assumes that the job tracker should spend N% of its time processing heartbeats. Since the total heartbeat load is proportional to the number of nodes, scaling this way accomplishes it.
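
One way to make this argument concrete, as a rough sketch: assume each heartbeat costs the job tracker a fixed processing time c (a symbol introduced here for illustration, not used in the thread), let n be the number of nodes, and let T(n) be the heartbeat interval in seconds. Ignoring integer rounding and the minimum interval, the fraction f(n) of job tracker time spent on heartbeats under the clusterSize/50+1 rule is:

```latex
f(n) = \frac{n\,c}{T(n)}, \qquad
T(n) = \frac{n}{50} + 1
\;\Longrightarrow\;
f(n) = \frac{50\,n\,c}{n + 50} < 50\,c .
```

Under these assumptions the heartbeat load is bounded by 50c regardless of cluster size. For example, a hypothetical cost of c = 1 ms per heartbeat would keep it below 5% of the job tracker's time.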

Amareshwari +1

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547138 ] 

Devaraj Das commented on HADOOP-1900:
-------------------------------------

In the patch, I see that TaskTracker.initialize is no longer synchronized. Is there any reason to remove the synchronization for the method as part of this patch? Also, should we use int rather than long for the serialization/deserialization of the heartbeatInterval? Finally, the sleep in getMapEvents should be elsewhere, since we don't want to put an RPC handler to sleep (the handler will invoke this method eventually). (Sorry for the late comment.)

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535509 ] 

Amareshwari Sri Ramadasu commented on HADOOP-1900:
--------------------------------------------------

Here is a proposal for changing heartbeat interval dynamically.

We will initialize the heartbeat interval to HEARTBEAT_INTERVAL (10 secs) in the task tracker.
Once the tasktracker transmits a heartbeat, the jobtracker's response will carry the next heartbeat interval.

JobTracker calculates next heart beat interval as follows:

1. Using cluster size (number of task trackers):
    nextInterval = 10secs * (cluster_size/500 +1)
    i.e. for every additional 500 nodes we increase the heartbeat interval by 10 secs.

2. if (nextInterval > HEARTBEAT_INTERVAL_MAX) nextInterval = HEARTBEAT_INTERVAL_MAX;
    i.e. if the result exceeds HEARTBEAT_INTERVAL_MAX (60 seconds), the next heartbeat interval is 60 seconds.

3. If the drop count of heartbeats is greater than a threshold drop count, increase the interval by 10 more seconds.
   The threshold drop count can be 'cluster_size/10', i.e. if there are 500 nodes in the cluster and more than 50 heartbeats are dropped, then we increase the next heartbeat interval by 10 more seconds.

   Thus, the next interval calculation can be the following:
{code}
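// More than cluster_size/10 dropped heartbeats marks the JobTracker as busy.
int threshold_dropcount = cluster_size / 10;
int isBusy = dropCount > threshold_dropcount ? 1 : 0;
int nextInterval = HEARTBEAT_INTERVAL * (cluster_size / 500 + 1)
                 + HEARTBEAT_INTERVAL * isBusy;
// Cap the result at the maximum interval.
if (nextInterval > HEARTBEAT_INTERVAL_MAX) nextInterval = HEARTBEAT_INTERVAL_MAX;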
{code}
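
For readers who want to try the numbers, the calculation above can be wrapped in a self-contained Java method. This is a minimal sketch, not the code from the attached patch: the class and method names are hypothetical, and the constants are simply the values quoted in the proposal (10 s base interval, 60 s cap, threshold of one tenth of the cluster).

```java
/** Hypothetical sketch of the proposed calculation; not taken from patch-1900.txt. */
final class HeartbeatIntervalSketch {
    static final int HEARTBEAT_INTERVAL = 10;      // seconds, base interval from the proposal
    static final int HEARTBEAT_INTERVAL_MAX = 60;  // seconds, upper bound from the proposal

    /** Returns the next heartbeat interval in seconds. */
    static int nextInterval(int clusterSize, int dropCount) {
        int thresholdDropCount = clusterSize / 10;            // integer division
        int isBusy = dropCount > thresholdDropCount ? 1 : 0;  // 1 adds one extra base interval
        int next = HEARTBEAT_INTERVAL * (clusterSize / 500 + 1)
                 + HEARTBEAT_INTERVAL * isBusy;
        return Math.min(next, HEARTBEAT_INTERVAL_MAX);        // cap at the maximum
    }
}
```

Under these assumptions, nextInterval(100, 0) gives 10 s, nextInterval(500, 51) gives 30 s (20 s for the cluster size plus 10 s for being busy), and nextInterval(3000, 0) is capped at 60 s.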

Now, MapEventsFetcherThread polls jobtracker for completed map tasks for every 5 secs (MIN_POLL_INTERVAL).  Shall we change polling interval also in the similar fashion as heartbeat interval? But, here some reduce tasks could be idle for longer time.

Any thoughts?


> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Open  (was: Patch Available)

Some more comments from Devaraj.

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Open  (was: Patch Available)

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how the busy it is and the size of the cluster.

[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542737 ] 

Amareshwari Sri Ramadasu commented on HADOOP-1900:
--------------------------------------------------

Now considering only cluster size for varying the heartbeat interval, the proposal is as follows:

1. Heartbeat interval = max(2, clusterSize/50+1)
i.e. for every 50 nodes, we increase the heartbeat interval by 1 second.

2. Map completion events polling interval can take the same value as heartbeat interval.
mapEventsFetcherThread will be notified if a reduce task doesn't find map events at the tasktracker.
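
The interval rule in item 1 can be sketched as a one-line Java method. This is an illustration only: the class, method, and constant names are hypothetical and not taken from the patch, and the same value would also serve as the map completion events polling interval under item 2.

```java
/** Hypothetical sketch of the cluster-size-only rule; names are not from the patch. */
final class ProportionalHeartbeat {
    static final int MIN_INTERVAL_SECONDS = 2;  // floor from the proposal
    static final int NODES_PER_SECOND = 50;     // one extra second per 50 nodes

    /** Heartbeat interval in seconds for a cluster of the given size. */
    static int heartbeatIntervalSeconds(int clusterSize) {
        // Integer division: every full group of 50 nodes adds one second.
        return Math.max(MIN_INTERVAL_SECONDS, clusterSize / NODES_PER_SECOND + 1);
    }
}
```

With this rule a 10-node cluster gets 2 s, a 100-node cluster 3 s, and a 2000-node cluster 41 s.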

Thoughts?

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541905 ] 

Doug Cutting commented on HADOOP-1900:
--------------------------------------

This sounds complicated.  Is it really required?  On a small cluster, is there any harm in reporting every second?  I'd rather try the simple proportional approach first and see if it is not sufficient.

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Fix Version/s: 0.16.0
           Status: Patch Available  (was: Open)

Here is a patch with the proposed design.

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Open  (was: Patch Available)

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Open  (was: Patch Available)

Some more comments from Arun and Devaraj:
1. Replace the 'if' check with the max method.
2. Declare the heartbeatInterval variable as volatile.
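
As an illustration of these two review comments, the sketch below shows what they might look like together. It is an assumption-laden example, not the patch itself: only the field name heartbeatInterval comes from the comment above, while the class name, method names, the millisecond unit, and the 2000 ms floor are hypothetical.

```java
/** Hypothetical holder for the interval; only the field name comes from the thread. */
final class HeartbeatIntervalHolder {
    /** Assumed lower bound in milliseconds (hypothetical value). */
    static final int MIN_INTERVAL_MS = 2000;

    // volatile: the thread that sleeps between heartbeats always sees the
    // most recent value written by the thread that processes the response.
    private volatile int heartbeatInterval = MIN_INTERVAL_MS;

    /** Stores the interval suggested by the JobTracker, clamped to the floor. */
    void onHeartbeatResponse(int suggestedIntervalMs) {
        // Before: if (suggestedIntervalMs < MIN_INTERVAL_MS) { ... } else { ... }
        // After:  a single Math.max call expresses the same clamp.
        heartbeatInterval = Math.max(MIN_INTERVAL_MS, suggestedIntervalMs);
    }

    int currentIntervalMs() {
        return heartbeatInterval;
    }
}
```

A volatile int is enough here because the field is only ever overwritten with a fresh value, never read and then modified in a separate step.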

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Patch Available  (was: Open)

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Attachment: patch-1900.txt

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Issue Comment Edited: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542737 ] 

amareshwari edited comment on HADOOP-1900 at 11/15/07 3:20 AM:
----------------------------------------------------------------------------

Considering only cluster size when varying the heartbeat interval, the proposal is as follows:

1. Heartbeat interval = max(2, clusterSize/50+1)
i.e. for every 50 nodes, we increase the heartbeat interval by 1 second.

2. Map completion events polling interval can take the same value as heartbeat interval.
Apart from polling, the tasktracker will also fetch map events from the JobTracker when a reduce task asks for events and it has nothing to give (this is similar to the way a tasktracker asks for a new task whenever it finishes executing one).

Thoughts?
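
The "poll on an interval, but fetch early on demand" behaviour in point 2 can be sketched with a timed wait that a request cuts short. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, the actual call to the JobTracker is replaced by a counter, and nothing here is taken from the attached patch.

```java
/** Hypothetical sketch of a fetcher that polls periodically and on demand. */
final class MapEventsFetcherSketch {
    private final Object lock = new Object();
    private boolean fetchRequested = false; // guarded by lock
    private int fetchCount = 0;             // guarded by lock

    /** Called when a reduce task asked for map events and none were available. */
    void requestFetchNow() {
        synchronized (lock) {
            fetchRequested = true;
            lock.notifyAll();
        }
    }

    /**
     * Waits until the poll interval elapses or a fetch is requested, then
     * performs one (simulated) fetch. Returns true if it was woken early.
     */
    boolean awaitNextFetch(long pollIntervalMs) {
        long deadline = System.nanoTime() + pollIntervalMs * 1_000_000L;
        synchronized (lock) {
            try {
                while (!fetchRequested) {
                    long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                    if (remainingMs <= 0) {
                        break; // regular poll interval elapsed
                    }
                    lock.wait(remainingMs);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
            boolean wokenEarly = fetchRequested;
            fetchRequested = false;
            fetchCount++; // stand-in for asking the JobTracker for map events
            return wokenEarly;
        }
    }

    int fetchCount() {
        synchronized (lock) {
            return fetchCount;
        }
    }
}
```

A fetcher thread would call awaitNextFetch in a loop with the current heartbeat interval as pollIntervalMs, while the code serving reduce tasks calls requestFetchNow whenever it has no events to return.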

      was (Author: amareshwari):
    Considering only cluster size when varying the heartbeat interval, the proposal is as follows:

1. Heartbeat interval = max(2, clusterSize/50+1)
i.e. for every 50 nodes, we increase the heartbeat interval by 1 second.

2. Map completion events polling interval can take the same value as heartbeat interval.
mapEventsFetcherThread will be notified if a reduce task doesn't find map events at the tasktracker.

Thoughts?
  
> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Updated: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Amareshwari Sri Ramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sri Ramadasu updated HADOOP-1900:
---------------------------------------------

    Status: Patch Available  (was: Open)

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>             Fix For: 0.16.0
>
>         Attachments: patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.


[jira] Commented: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539089 ] 

Hadoop QA commented on HADOOP-1900:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12368760/patch-1900.txt
against trunk revision r590273.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs -1.  The patch appears to introduce 1 new Findbugs warnings.

    core tests -1.  The patch failed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1039/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1039/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1039/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1039/console

This message is automatically generated.

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.
