You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by "Matei Zaharia (JIRA)" <ji...@apache.org> on 2008/11/16 00:52:44 UTC

[jira] Created: (HADOOP-4665) Add preemption to the fair scheduler

Add preemption to the fair scheduler
------------------------------------

Key: HADOOP-4665
URL: https://issues.apache.org/jira/browse/HADOOP-4665
Project: Hadoop Core
Issue Type: New Feature
Components: contrib/fair-share
Reporter: Matei Zaharia

Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).

T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.

When deciding which tasks to kill to make room for the job, we will use the following heuristics:
- Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
- For maps: kill tasks that have run for the least amount of time (limiting wasted time).
- For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688825#action_12688825 ] 

Hadoop QA commented on HADOOP-4665:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12402607/hadoop-4665-v3.patch
  against trunk revision 757958.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/128/console

This message is automatically generated.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: hadoop-4665-v7.patch

Attaching a new patch which includes updated documentation.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: hadoop-4665-v7b.patch

Thanks, that makes sense. I added the license to the .java file.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch, hadoop-4665-v7b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Amr Awadallah (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675122#action_12675122 ] 

Amr Awadallah commented on HADOOP-4665:
---------------------------------------

 > Does the middle ground option make sense ? It may help all of us with 
only a little compromise, no ?

sure, thanks for being accommodating. What is rough eta for when 0.20.1 
would be out?

-- amr


> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: hadoop-4665-v4.patch

New patch that applies to trunk.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716091#action_12716091 ] 

Hadoop QA commented on HADOOP-4665:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12409635/hadoop-4665-v7.patch
  against trunk revision 781343.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 release audit.  The applied patch generated 494 release audit warnings (more than the trunk's current 492 warnings).

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/457/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/457/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/457/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/457/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/457/console

This message is automatically generated.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Open  (was: Patch Available)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-4665:
------------------------------------

    Release Note: Introduced preemption for fair scheduler.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Open  (was: Patch Available)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674878#action_12674878 ] 

Hemanth Yamijala commented on HADOOP-4665:
------------------------------------------

Dhruba, we're planning on a release very soon based on 0.20. This would reset our testing cycle significantly. So, I would request you to move this to 0.21. Since you're already using this, I am assuming it will not impact you that much, I hope.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664655#action_12664655 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

Those are good points Joydeep. I will remove the getting conf variables every time, that was a mistake.

About subtracting the waits: This is just a question of how we interpret the parameters. Maybe we want nodeLocalWait and rackLocalWaits to be two different times that get added up for some "total wait". I originally meant for rackLocalWait to always be bigger than nodeLocalWait and thus capture the maximum delay. But since that is confusing and can lead to misconfiguration, I will make them add up as you said.

For your third point, the idea was as follows: When it still has a lot of maps left to launch, a job will almost always have node-local tasks, so waiting is fine. However, when there are only a few maps left, there will be fewer nodes on which it can launch node-local tasks, and these may be busy running long tasks or something. So, when it's waited for nodeLocalWait amount of time, the job will start being allowed to launch rack-local tasks instead. Once it has launched such a task, it is allowed to launch more rack-local tasks rather than having to begin the waiting all over again so that it doesn't drastically slow down the rate at which it can launch tasks if the nodes with node-local data still aren't becoming free. However, we remember the locality level of the last map launched, so if the job ever *does* manage to launch a node-local task again, we begin the wait period again. There's a similar story for going from rack-local to off-rack: the idea is that once you've had to wait so long as to launch an off-rack task, you probably have very few opportunities left for launching rack-local or node-local tasks, so you might as well be allowed to launch more off-rack tasks and finish the job rather than having your launch rate slowed to a trickle.

Now it's possible that just using shorter waits but requiring the wait to happen every time you need to launch a non-local task will work too. I don't know, but it seemed that in my gridmix tests the current implementation worked fine even for very small jobs (going from 2% to 75-80% node locality for 3-map tasks on a 100-node cluster).


> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707589#action_12707589 ] 

Hadoop QA commented on HADOOP-4665:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12407422/hadoop-4665-v5.patch
  against trunk revision 772960.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/310/console

This message is automatically generated.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694663#action_12694663 ] 

Matei Zaharia edited comment on HADOOP-4665 at 4/1/09 11:21 AM:
----------------------------------------------------------------

Hi Vinod,

A few comments:

* Is the 2-3 heartbeats for preemption really necessary? I imagine timeouts will be on the order of minutes, so a few seconds won't make a big difference. Although thinking of the timeout as an SLA is nice, it's also equally easy to think of it as "this is when it can start killing tasks". To me, putting this extra 2-3 heartbeats thing in seems like unnecessary complexity.
* The reason the preemption-enabled check is deep inside the method is to give you the ability to turn preemption off but see the SHOULD_PREEMPT log messages to figure out when your cluster *would* preempt tasks given certain settings. We wanted this at Facebook so that we can add some timeouts, count the SHOULD_PREEMPT messages over a week, and be sure that the settings chosen are good without actually losing a lot of tasks if there's a mistake. I think this is a good feature to keep for other people who are thinking of turning on preemption.
* There actually is a way to set a default preemption timeout as in Joydeep's comment - you can set defaultMinSharePreemptionTimeout in the XML file. The code for this is in PoolManager.
* The default settings of preemptionEnabled=true and no timeouts are to make preemption easy to configure gradually. We expect that most people will start out not wanting preemption, because it creates an extra worry of "have we set it too low". Then as people start running pools with "production" jobs (with min share set), they may want to enable preemption just for these jobs. They would be able to do that by just adding a preemptionTimeout entry to those pools in the config, and it would be active without needing to restart the JobTracker. Then if they see non-production jobs suffering, they could enable the fairSharePreemptionTimeout, again without requiring a cluster restart. The only reason to also provide a preemptionEnabled setting in the jobconf is for the testing purpose I mentioned above, where an organization switching over to preemption in production can figure out first whether it will kill too many tasks. Overall, my goal with all the fair scheduler config is to make it as easy as possible to use "out of the box". You don't need to define pools in advance, you don't need to define min shares or weights in advance, you don't need to decide when to use preemption in advance, etc, and the only setting you need in mapred-site.xml is the one that sets Hadoop to use the fair scheduler. Then as you decide you want the more advanced features, you enable them gradually. I actually think there are strong advantages to this over your proposal of having preemptionEnabled=false and having non-infinite default timeouts so again I'd like more motivation before making such a large code change. The other factor is that Facebook has been using the current version of the preemption code and found the current features useful.

I'll take a look at your other comments later this week. Regarding the code reuse in preemptTasks, it is indeed based on the one in the capacity scheduler, but I'd like to make refactoring that a separate issue from this JIRA. The right thing might be to have some of that functionality in TaskScheduler.

      was (Author: matei):
    Hi Vivek,

A few comments:

* Is the 2-3 heartbeats for preemption really necessary? I imagine timeouts will be on the order of minutes, so a few seconds won't make a big difference. Although thinking of the timeout as an SLA is nice, it's also equally easy to think of it as "this is when it can start killing tasks". To me, putting this extra 2-3 heartbeats thing in seems like unnecessary complexity.
* The reason the preemption-enabled check is deep inside the method is to give you the ability to turn preemption off but see the SHOULD_PREEMPT log messages to figure out when your cluster *would* preempt tasks given certain settings. We wanted this at Facebook so that we can add some timeouts, count the SHOULD_PREEMPT messages over a week, and be sure that the settings chosen are good without actually losing a lot of tasks if there's a mistake. I think this is a good feature to keep for other people who are thinking of turning on preemption.
* There actually is a way to set a default preemption timeout as in Joydeep's comment - you can set defaultMinSharePreemptionTimeout in the XML file. The code for this is in PoolManager.
* The default settings of preemptionEnabled=true and no timeouts are to make preemption easy to configure gradually. We expect that most people will start out not wanting preemption, because it creates an extra worry of "have we set it too low". Then as people start running pools with "production" jobs (with min share set), they may want to enable preemption just for these jobs. They would be able to do that by just adding a preemptionTimeout entry to those pools in the config, and it would be active without needing to restart the JobTracker. Then if they see non-production jobs suffering, they could enable the fairSharePreemptionTimeout, again without requiring a cluster restart. The only reason to also provide a preemptionEnabled setting in the jobconf is for the testing purpose I mentioned above, where an organization switching over to preemption in production can figure out first whether it will kill too many tasks. Overall, my goal with all the fair scheduler config is to make it as easy as possible to use "out of the box". You don't need to define pools in advance, you don't need to define min shares or weights in advance, you don't need to decide when to use preemption in advance, etc, and the only setting you need in mapred-site.xml is the one that sets Hadoop to use the fair scheduler. Then as you decide you want the more advanced features, you enable them gradually. I actually think there are strong advantages to this over your proposal of having preemptionEnabled=false and having non-infinite default timeouts so again I'd like more motivation before making such a large code change. The other factor is that Facebook has been using the current version of the preemption code and found the current features useful.

I'll take a look at your other comments later this week. Regarding the code reuse in preemptTasks, it is indeed based on the one in the capacity scheduler, but I'd like to make refactoring that a separate issue from this JIRA. The right thing might be to have some of that functionality in TaskScheduler.
  
> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Open  (was: Patch Available)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch, hadoop-4665-v7b.patch, hadoop-4665-v7c.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664639#action_12664639 ] 

Joydeep Sen Sarma commented on HADOOP-4665:
-------------------------------------------

Dhruba and I looked at this together and got stuck on getAllowedLocalityLevel()

- why subtract nodeLocalWait from rackLocalWait 
- why getting config variables each time
- if we were not rack/node local last time - why don't we wait for locality next time? Seems like once we lose locality - we are anyway going to run up a deficit and then schedule a boatload of non-local tasks. seems like the whether we want to wait for locality or not should be based on how much deficit we are incurring and whether it's still warranted to wait for locality (as opposed to whether we were able to schedule the last task locally)

thoughts?

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: fs-preemption-v0.patch

Here is an initial version of the patch for review. The main thing missing is unit tests.

The patch adds two things. First there's the preemption, which works as described in the issue - jobs may preempt others if either they aren't receiving their guaranteed share for some time, or they are at below half their fair share and negative deficit for some time. The times can be configured in the fair scheduler config file and thus modified at runtime, and the guaranteed share timeouts are per pool. On top of this, to aid with debugging and development of the fair scheduler in the future, there is a scheduler event log, which is disabled by default but creates some event logs in tab-separated format in $hadoop.log.dir/fairscheduler if you turn it on. These are meant to be nitty-gritty detailed logs with machine-parsable event types rather than the "human-readable" logs that go into the standard log4j log for the JobTracker. They are also potentially much larger on a large cluster, which is why they're off by default.

I'm running this through hudson to see whether there are complaints from findbugs, checkstyle, etc, but I will include some unit tests in the final patch.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: hadoop-4665-v1b.patch

Small update to change the fair scheduler log dir jobconf variable to a consistent format with the other event log related variables.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661860#action_12661860 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

Yikes, forgot to add, this patch also includes HADOOP-4789 because it is dependent on it. That might make it more confusing to read than it needs to be.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716308#action_12716308 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

The test failure seems to be unrelated to my patch (HADOOP-5869).

I'm not sure what the release audit warning means. This is what it says:

{code}
5d4
<      [java]  !????? /home/hudson/hudson-slave/workspace/Hadoop-Patch-vesta.apache.org/trunk/build/hadoop-781343_HADOOP-4665_PATCH-12409635/conf/fair-scheduler.xml
300d298
<      [java]  !????? /home/hudson/hudson-slave/workspace/Hadoop-Patch-vesta.apache.org/trunk/build/hadoop-781343_HADOOP-4665_PATCH-12409635/src/contrib/fairscheduler/src/java/org/apache/hadoop/mapred/FairSchedulerEventLog.java
{code}

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Assignee: Matei Zaharia
      Status: Patch Available  (was: Open)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689078#action_12689078 ] 

Vinod K V commented on HADOOP-4665:
-----------------------------------

I've started looking at the patch. If I understand correctly the way preemption is done, the jobs are sorted in the order of deficits and tasks are taken away from the jobs at the bottom of the list (positive deficits) and given to the jobs with negative deficits at the top. By doing this, though we are catering to job's fair share, it doesn't look like we are serving a queue/pool 's guaranteed shares.

So, a queue/pool can be running only its min-maps/min-reduces, but still tasks can be preempted from jobs of this queue. Is that correct? If that's the case, then that doesn't sound correct to me.

Can you throw some light on this?

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718326#action_12718326 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

Hi Vinod,

Sorry for not addressing the half fair share point earlier, it looks like I forgot to post that. The difference between isStarvedForFairShare and tasksDueToFairShare is intentional. I want the threshold for triggering fair share preemption to be a lot lower than the fair share so that it doesn't trigger unless something is going horribly wrong in the cluster. The reason is that in standard use of the fair scheduler, we expect any critical ("production") jobs to have min shares set, which are enforced much more precisely (and potentially with a smaller timeout). The fair share is for those jobs that are not critical and that we're okay with being a little unfair to if that reduces wasted work. So the service model is that we only do preemption if the job is being starved very badly. However, when we do preemption, we do bring you up to your full fair share, because at that point it's clear that you've been starved badly for a long time. Once you are at your full fair share, it will be easy for you to remain there as you'll be given chances to reuse those slots when your tasks finish. If some users request for a stricter enforcement of fair shares, we can make the "half" part configurable later, but we decided this model is a good way to prevent unnecessary preemption and exchange of slots back and forth between jobs, while also not being too unfair.

I'll make a patch with the other changes sometime in the next few days or maybe after I see some of your comments.

The changes in test cases and docs are indeed huge. The request was huge ;) (and important), and I took this opportunity to clean up the fair scheduler docs overall.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch, hadoop-4665-v7b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Patch Available  (was: Open)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch, hadoop-4665-v7b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721991#action_12721991 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4665:
------------------------------------------------

It would be great if we can have a unit test testing [the case|https://issues.apache.org/jira/browse/HADOOP-5701?focusedCommentId=12700284&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12700284] given in HADOOP-5701.  Not sure if you already have it.  I did not check the patch.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch, hadoop-4665-v7b.patch, hadoop-4665-v7c.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Open  (was: Patch Available)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch, hadoop-4665-v7b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689344#action_12689344 ] 

Hadoop QA commented on HADOOP-4665:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12403578/hadoop-4665-v4.patch
  against trunk revision 758425.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/138/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/138/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/138/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/138/console

This message is automatically generated.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Amr Awadallah (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674907#action_12674907 ] 

Amr Awadallah commented on HADOOP-4665:
---------------------------------------

Hemanth,

  Facebook is not the only company that benefits from preemption, the 
community at large will benefit from this and the sooner that we get it 
out there the better. Also fact that Facebook has been running this for 
a number of weeks means that it will most likely not lead to any 
significant testing problems.

  Please consider fitting this in the 0.20 release, and let us know if 
there is anyway we can help to expedite.

Thanks,

-- amr
 


> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694526#action_12694526 ] 

Vinod K V commented on HADOOP-4665:
-----------------------------------

Some more miscellaneous points:
 - With the introduction of new timeouts, I think the importance of a template file for the allocations increases. I remember you saying something about it on some jira. Have you filed one?

 - This patch adds FairSchedulerEventLog for logging various events in the scheduler in machine-readable format. But there is no place from where utilities can determine the format of the log records: How should we track the event log records' format, add some schema file? Or alter the logs to be a list of key-value pairs similar to JobHistory instead of just values?

 - There is a lot of common code between {code}int org.apache.hadoop.mapred.FairScheduler.preemptTasks(JobInProgress job, TaskType type, int maxToPreempt){code} and {code}int org.apache.hadoop.mapred.CapacityTaskScheduler.MapSchedulingMgr.killTasksFromJob(JobInProgress job, int tasksToKill){code}. In fact most of it is the same. I think we should somehow try to refactor this common code. Don't know if we want to do it in this jira itself or not.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694529#action_12694529 ] 

Vinod K V commented on HADOOP-4665:
-----------------------------------

Oh, and Joydeep's previous comment isn't addressed yet.
bq. default preemption timeout(s) for pools. right now unless we configure on a per pool basis - the default timeout is infinite.
Because we already have a different plug for disabling preemption altogether, I think this default timeout as well as fairSharePreemptionTimeout should have some finite value.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675148#action_12675148 ] 

Nigel Daley commented on HADOOP-4665:
-------------------------------------

We've got to stabilize 0.20.  We're currently working on fixing bugs.  Adding new features to contrib components at this point would not get my vote.  We're already 2.5 months past 0.20 core feature freeze.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694663#action_12694663 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

Hi Vivek,

A few comments:

* Is the 2-3 heartbeats for preemption really necessary? I imagine timeouts will be on the order of minutes, so a few seconds won't make a big difference. Although thinking of the timeout as an SLA is nice, it's also equally easy to think of it as "this is when it can start killing tasks". To me, putting this extra 2-3 heartbeats thing in seems like unnecessary complexity.
* The reason the preemption-enabled check is deep inside the method is to give you the ability to turn preemption off but see the SHOULD_PREEMPT log messages to figure out when your cluster *would* preempt tasks given certain settings. We wanted this at Facebook so that we can add some timeouts, count the SHOULD_PREEMPT messages over a week, and be sure that the settings chosen are good without actually losing a lot of tasks if there's a mistake. I think this is a good feature to keep for other people who are thinking of turning on preemption.
* There actually is a way to set a default preemption timeout as in Joydeep's comment - you can set defaultMinSharePreemptionTimeout in the XML file. The code for this is in PoolManager.
* The default settings of preemptionEnabled=true and no timeouts are to make preemption easy to configure gradually. We expect that most people will start out not wanting preemption, because it creates an extra worry of "have we set it too low". Then as people start running pools with "production" jobs (with min share set), they may want to enable preemption just for these jobs. They would be able to do that by just adding a preemptionTimeout entry to those pools in the config, and it would be active without needing to restart the JobTracker. Then if they see non-production jobs suffering, they could enable the fairSharePreemptionTimeout, again without requiring a cluster restart. The only reason to also provide a preemptionEnabled setting in the jobconf is for the testing purpose I mentioned above, where an organization switching over to preemption in production can figure out first whether it will kill too many tasks. Overall, my goal with all the fair scheduler config is to make it as easy as possible to use "out of the box". You don't need to define pools in advance, you don't need to define min shares or weights in advance, you don't need to decide when to use preemption in advance, etc, and the only setting you need in mapred-site.xml is the one that sets Hadoop to use the fair scheduler. Then as you decide you want the more advanced features, you enable them gradually. I actually think there are strong advantages to this over your proposal of having preemptionEnabled=false and having non-infinite default timeouts so again I'd like more motivation before making such a large code change. The other factor is that Facebook has been using the current version of the preemption code and found the current features useful.

I'll take a look at your other comments later this week. Regarding the code reuse in preemptTasks, it is indeed based on the one in the capacity scheduler, but I'd like to make refactoring that a separate issue from this JIRA. The right thing might be to have some of that functionality in TaskScheduler.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Patch Available  (was: Open)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673484#action_12673484 ] 

dhruba borthakur commented on HADOOP-4665:
------------------------------------------

+1. code looks good.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697629#action_12697629 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

Vinod, have you had a chance to look at my arguments above? I'd really like to get this patch committed soon because I have others that depend on it.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Patch Available  (was: Open)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Patch Available  (was: Open)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch, hadoop-4665-v7b.patch, hadoop-4665-v7c.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Patch Available  (was: Open)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: hadoop-4665-v1.patch

Here is an updated version of this patch. It includes some fixes from testing this at Facebook, as well as Joydeep's comments (change event log to use Log4j rolling file appender, and provide a way to set default preemption timeouts).

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694524#action_12694524 ] 

Vinod K V commented on HADOOP-4665:
-----------------------------------

Here's a more comprehensive review:

Major points:
 - As it's a new feature, I think preemption(mapred.fairscheduler.preemption) should be disabled(false) by default.
 - Can we do something like above stated proposal: to preempt a task when a job's fairshare/minshare are not met within PREEMPTION_TIMEOUT leaving 2/3 heartbeats for the task to actually get killed.
 - DUMP_INTERVAL and PREEMPTION_INTERVAL should be configurable. The variables themselves can be package-private instead of public.
 - The class FairSchedulerEventLog can just be package-private. So do all the methods inside - init, log, shutdown and isEnabled - they don't need to be public as of now.

 - FairScheduler.preemptTasksIfNecessary() method:
   -- This method does a Collections.reverse(jobs) after sorting the jobs. We can just traverse the list in reverse order to get the desired effect here.
   --  The check as to whether FairScheduler will do preemption(preemptionEnabled && !useFifo) is done deep inside - all the stats are calculated and then only preemption is skipped if not needed. Can we take this check, may be, to the beginning of preemptTasksIfNecessary() method or inside update() method itself.

 - FairScheduler.tasksToPreempt() method:
   -- The count tasksDueToFairShare seems to be calculated to see if full fair-share of slots are allotted or not instead of the advertised half of fairshare. I think this is a mistake as isStarvedForFairShare() is checking for half of fair-share. Or am I missing something?
   -- EventLogs in this method are a bit confusing if the job is short on both minshare and fairshare. In this case, it is roughly giving an impression that we are preempting twice. I think it would be clear to just log that the job is short by so many slots w.r.t minshare and w.r.t fairshare. And in the end, just before returning we can simply say the exact number of slots we are going to preempt.

 - FairScheduler.start() method:
   -- I think that the call loadMgr.setEventLog(eventLog) should be after eventLog is initialized with a new FairSchedulerEvenLog object.

 - TestFairScheduler.java:
  -- You have added setup and cleanup tasks creation code in initTasks() method. Is there any specific reason for doing this? In any case, much of it duplicates code from JobInProgress.initTasks(), if we really want, we can refactor this code into a new method say JobInProgress.createSpecialTasks().
  -- There are no new test-cases related to preemption. I think we should have one/some.

Minor nits:
 - Javadoc for getFairSharePreemptionTimeout() is incomplete. Also, if I am understanding correctly FairSharePreemptionTimeout is the same for all pools/jobs.
 - The xml tag for minSharePreemptionTimeouts is currently preemptionTimeout. It can better be minSharePreemptionTimeout.
 - update thread:  The log message "Failed to update fair share calculations" can better be "Exception in Update thread". This because,now , update thread does more than just updating calculations.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674930#action_12674930 ] 

Hemanth Yamijala commented on HADOOP-4665:
------------------------------------------

Amr, that's a good point about the benefit to the community.

One middle ground option could be to make it part of Hadoop 0.20.1. so that the community doesn't wait until 0.21 (which could be way out right now).

One specific concern I had about this patch (from a cursory glance) was that it was locking the JobTracker, (the taskTrackerManager instance) in the update method which runs frequently. In general, this could interfere with regular processing in the jobtracker, like heartbeats (which are also synchronized on the same instance). We've in the past seen issues of this nature on large clusters. When the JT is locked up, tasktrackers could get lost and tasks could fail arbitrarily. HADOOP-5280 seems to be one specific instance of this (though we've not ascertained the reason JT got locked up). 

Does the middle ground option make sense ? It may help all of us with only a little compromise, no ?


> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689127#action_12689127 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

The patch actually makes sure that no job is ever brought below its fair or min share. This happens in preemptTasks, where we compute tasksToLeave and make sure we leave the job with at least that many. Since each pool's min and fair share is distributed among the jobs in that pool, this will ensure that pool shares are also kept. So the deficit thing is just a global ordering to service the jobs that have been starving the most first, but in following it, we also ensure that we never bring jobs below their shares.

Actually when we remove deficits from the scheduler (which I already have working code for; it's just too big a change to push that, 4665 and 4667 in the same JIRA), the logic will be simpler. We'll just look for jobs that are far below their fair share (as a ratio of share) and preempt from jobs that are far above their fair share (again as a ratio). Then there won't be this confusion about whether we can have a service order based on deficits or not. This patch is just an attempt to get some of that code into the scheduler before pushing the big change.. If you'd prefer that I post a patch for the new non-deficit-based fair scheduler instead, I can do that too.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Tom White (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716314#action_12716314 ] 

Tom White commented on HADOOP-4665:
-----------------------------------

These files are missing Apache license headers. Config files in Hadoop don't seem to include a header, so you can probably skip the one for fair-scheduler.xml.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Patch Available  (was: Open)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Fix Version/s:     (was: 0.20.0)
                   0.21.0

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Patch Available  (was: Open)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669739#action_12669739 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

We can definitely take out the flush if you are okay with possibly missing the end of the event log. Right now the event log was meant only as a debugging tool (which is why it's off by default), but maybe it's more useful to make it possible to always keep it on.

I will add a parameter for default min share preemption timeout as well.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673542#action_12673542 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

There's no reason to put it in 0.20 if 0.20 is in the testing stage right now. The issue was just opened against 0.20 because there was no 0.21 back then. It's true that the preemption is disabled by default but there's no need to complicate things.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Open  (was: Patch Available)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675140#action_12675140 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

That sounds good. I'm soon going to remove that lock by the way, and ensure that we only lock the JT when there are tasks to kill.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-4665:
----------------------------------

    Status: Open  (was: Patch Available)

The current patch no longer applies to trunk

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671725#action_12671725 ] 

Hadoop QA commented on HADOOP-4665:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12399789/hadoop-4665-v1b.patch
  against trunk revision 742171.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3813/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3813/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3813/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3813/console

This message is automatically generated.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678427#action_12678427 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

Robert, this is not going into 0.20, is it? I thought Nigel said we should not do that.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674762#action_12674762 ] 

dhruba borthakur commented on HADOOP-4665:
------------------------------------------

We haven been running this code in production for a few weeks now. I would have ideally liked it to go into 0.20 (because 0.20 is not yet released), but if that is difficult that I am fine putting it in 0.21

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Open  (was: Patch Available)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718099#action_12718099 ] 

Vinod K V commented on HADOOP-4665:
-----------------------------------

Matei, I have (re)started looking at the patch. The changes look good overall, except the following points:
 - preemptionInterval variable is initialized to 30000 whereas the default value is 15000. Shouldn't they be consistent?
- EagerTaskInitializationListener is not removed from the list of listeners in the terminate method 
- You seem to have missed one of my earlier points:
bq. The count tasksDueToFairShare seems to be calculated to see if full fair-share of slots are allotted or not instead of the advertised half of fairshare. I think this is a mistake as isStarvedForFairShare() is checking for half of fair-share. Or am I missing something?

The changes in test-cases and documentation seem to be huge. It'll take till tomorrow for me to complete the review of test-cases and documentation. Thanks for the patience.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch, hadoop-4665-v7b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: hadoop-4665-v3.patch

Updated the patch to work with the current trunk.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693990#action_12693990 ] 

Vinod K V commented on HADOOP-4665:
-----------------------------------

Sorry for the late reply, was on a long weekend.

Your explanation w.r.t the min-shares of pools clarified my doubts. Thanks!

I have discussed the overall approach of this patch with Hemant, Sreekanth and Rahul offline. We concluded on one slight improvement - the preemption patch of capacity-scheduler treats the preemption timeouts to be a kind of SLA for the pool/queue and so leaves a couple of heartbeats for slots to become free after it preempts a task. Can we do something like that here? - Essentially, the proposal is to preempt a task when a job's fairshare/minshare are not met within PREEMPTION_TIMEOUT- 2/3 heartbeats.

Few other code comments I have:
 - Can we make PREEMPTION_INTERVAL configurable?
 - The check as to whether FairScheduler will do preemption(preemptionEnabled && !useFifo) is done deep inside - all the stats are calculated and then only preemption is skipped if not needed. Can we take this check, may be, to the beginning of preemptTasksIfNecessary() method or inside update() method itself.
 - The class FairSchedulerEventLog can just be package-private. So do all the methods inside - init, log, shutdown and isEnabled - they don't need to be public as of now.

Again, sorry for the late reply. Appreciate your patience.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688564#action_12688564 ] 

Vinod K V commented on HADOOP-4665:
-----------------------------------

The latest patch hadoop-4665-v3.patch is not applying cleanly to trunk, perhaps because of HADOOP-4788.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699047#action_12699047 ] 

Vinod K V commented on HADOOP-4665:
-----------------------------------

Agree with Hemanth on having two knobs - one for completely disabling preemption and one for enabling a dry run.

Regarding leaving some heartbeats before killing tasks, I agree with your arguments and am fine with leaving that for now as the timeouts may indeed turn out to be in the order of minutes like you are saying. If it becomes a need later, we can do it in future in another JIRA issue.

I'm +1 to the rest of the arguments. So I'll be fine with the patch once the rest of the code comments are incorporated along with a knob for disabling preemption altogether.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4665:
-------------------------------------

    Fix Version/s: 0.20.0
     Hadoop Flags: [Reviewed]

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673516#action_12673516 ] 

Hemanth Yamijala commented on HADOOP-4665:
------------------------------------------

Dhruba, we have been testing the FairScheduler of Hadoop 0.20 release quite heavily in the past couple of weeks as part of an evaluation we are doing. I cannot say how much committing this to Hadoop 0.20 could effect what we've tested. Is it really necessary for the 0.20 release ? Can we get it committed to trunk only instead ?

Also, I'd looked at this code cursorily. I think pre-emption is an optional feature here, right ?

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673809#action_12673809 ] 

Hemanth Yamijala commented on HADOOP-4665:
------------------------------------------

Thanks, Matei. That would be very convenient. Dhruba, does this sound ok ? If you agree, we could set the fix in version to 0.21.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698187#action_12698187 ] 

Hemanth Yamijala commented on HADOOP-4665:
------------------------------------------

Matei, I've not been looking at this code much, but based on the discussion, I have only one comment: regarding the turning off of pre-emption. 

Your use case of organizations wanting to try out with pre-emption disabled, but still seeing when pre-emption would happen seems to me like a dry-run mode that you can see in utilities like an RPM update. As you've explained, looks like there are use cases for this.

>From our work with the capacity scheduler, we've found there are environments where pre-emption is indeed not necessary. Even when it exists, it has proved to be a complex feature to reason about. From this perspective, it seems like it may make sense to provide an option to completely turn it off and have reasonable confidence that nothing related to pre-emption would be in effect, including any additional computation that it requires.

Hence, my suggestion is the following: have a flag to truly turn off pre-emption and have a variable that allows a dry-run of pre-emption when it is enabled. I believe this may not be a very difficult change ? (Indeed, I've been thinking of cases where a dry-run of the entire scheduling logic makes sense  - for e.g. to get a 'scheduling log' that can be replayed). 

The flip side of my proposal is an additional configuration option. But depending on what we think the right defaults are, we can still make the configuration easy for end users, no ? To that extent, your arguments about the proposed default values are fine with me.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693987#action_12693987 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

The test failures seem to be unrelated to my patch - they happen in unmodified trunk too. One of them is HADOOP-5068. Is it okay for me to commit my patch?

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669702#action_12669702 ] 

Joydeep Sen Sarma commented on HADOOP-4665:
-------------------------------------------

a couple of other points (let me get back on the comments above):

- default preemption timeout(s) for pools. right now unless we configure on a per pool basis - the default timeout is infinite.
- eventlog - this looks very expensive - there's a flush on every log() call. can we just take out this flush()?



> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661864#action_12661864 ] 

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

There is also one change here that is outside of the fair scheduler that I'd like feedback on. In order to allow the scheduler to kill tasks, I added a killTask method to the TaskTrackerManager interface. This means that you don't need to assume that the TaskTrackerManager is always a JobTracker, so you can write unit tests for it with a mock object. There was already a killJob in there. Is it okay to add killTask or is there a reason not to?

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: hadoop-4665-v6.patch

Updated patch to keep up with trunk. Would appreciate a review.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Open  (was: Patch Available)

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: hadoop-4665-v7c.patch

I've attached a new patch taking into account Vinod's two comments and keeping up with trunk. Vinod, do you have a rough ETA on when you'll finish looking at this?

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, hadoop-4665-v7.patch, hadoop-4665-v7b.patch, hadoop-4665-v7c.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: hadoop-4665-v2.patch

Here is an updated patch that removes the locking of the JobTracker in update(). Instead, we only lock the JT every 30 seconds when preemptTasksIfNecessary decides that it is time to check for tasks to preempt. This should improve JT scalability.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688465#action_12688465 ] 

dhruba borthakur commented on HADOOP-4665:
------------------------------------------

We should not put it in the 0.20 release, especially because the branch is already cut.

It should be committed to trunk for now. If, at a later time (after making the 0.20 release), the community decides to pull it into the 0.20.1 release, then we can do that.

Paatch looks good. +1. Can you pl resubmit the patch so that HadoopQA gets to process it? Thanks.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: hadoop-4665-v5.patch

Here's a new patch. It incorporates Vinod's comments, except for changing the event log format to key-value pairs. The event log in its current state is meant to be used only for debugging and has a relatively simple tab-separated format, so I didn't want to complicate the API to it. We can have another JIRA for adding a parser class for it and turning it into more than a debug tool if there's demand for that. It also adds five unit tests for preemption and a default config file for the fair scheduler. The patch also makes the update interval configurable (since I was did that with the other two periodic check intervals).

Included in this patch is a fairly significant evolution of the fair scheduler unit testing framework, which adds tracking of tasks in FakeJobInProgress to allow for preemption to be tested meaningfully. The FakeJobInProgress and FakeTaskInProgress have some commonalities with the ones in the capacity scheduler, but unfortunately I wasn't able to use those directly because some of the classes used in the fair scheduler, such as Clock, don't have equivalents there. It would be nice to create a common testing framework for schedulers, but that should be a separate JIRA. I also think that the ultimate solution for that is not to make an elaborate FakeJobInProgress and related classes, but rather to make MiniMRCluster more user-friendly and switch all tests into it. We can also consider making the Clock class be used by all the MR code so that tests can run at an accelerated rate on these simulated clusters.

One other thing I will need to add in this patch is documentation for the preemption params. However, the other changes can be reviewed right now.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch, hadoop-4665-v5.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4665) Add preemption to the fair scheduler

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Status: Patch Available  (was: Open)

Passing patch through Hudson.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>             Fix For: 0.20.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.