Posted to dev@yunikorn.apache.org by "Adam Novak (Jira)" <ji...@apache.org> on 2022/04/25 21:18:00 UTC

[jira] [Created] (YUNIKORN-1185) Small applications starve large ones in the same FIFO queue

Adam Novak created YUNIKORN-1185:
------------------------------------

             Summary: Small applications starve large ones in the same FIFO queue
                 Key: YUNIKORN-1185
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1185
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
            Reporter: Adam Novak


Even when I set my queue to use a {{fifo}} application sort policy, applications that enter the queue later are able to run before applications that are submitted earlier; the queue does not behave like a first-in, first-out queue.

Specifically, this happens when the later applications are smaller than the earlier ones. If enough small applications are available in the queue to immediately fill any space that opens up, they will be scheduled as soon as that space is available. YuniKorn doesn't wait for enough space to become free to schedule the waiting large applications, no matter how much older they are than the applications passing them in the queue.

The result of this is that a steady supply of small applications can keep a larger application waiting indefinitely, causing starvation.

The relevant code seems to be [in Queue's tryAllocate method|https://github.com/apache/yunikorn-core/blob/73d55282f052f53852cc156d626c155ca5dddca2/pkg/scheduler/objects/queue.go#L1069-L1070]. YuniKorn goes through all the applications in the queue in order, and greedily schedules work items until no more fit. If no space large enough to fit any work from the first application currently exists, it will always fill what space there is with work from applications later in the queue. It will never wait to drain out space on a node to fit work from that first application.
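
To illustrate what I mean (this is just a simplified sketch with made-up types and names, not the actual yunikorn-core code), a greedy FIFO allocation pass behaves roughly like this:
{code:go}
// Simplified sketch of the greedy allocation pass described above. The types and
// function names here are illustrative only, not the real yunikorn-core code:
// applications are visited in FIFO order, but whichever application's ask fits
// the currently free space gets it, so a large ask at the head of the queue can
// be passed over indefinitely.
package main

import "fmt"

type app struct {
    id     string
    askCPU int // cores still requested by this application
}

// tryAllocateOnce walks the FIFO-ordered applications and greedily allocates
// against the free capacity of a single node. It returns the id of the
// application that received the allocation, or "" if nothing fit.
func tryAllocateOnce(apps []*app, freeCPU *int) string {
    for _, a := range apps { // FIFO order: oldest application first
        if a.askCPU == 0 {
            continue // already satisfied
        }
        if a.askCPU <= *freeCPU {
            *freeCPU -= a.askCPU
            a.askCPU = 0
            return a.id
        }
        // The large ask does not fit, but instead of waiting for space to
        // drain, the pass simply moves on to younger, smaller applications.
    }
    return ""
}

func main() {
    // A 96-core node with 50 cores already busy: 46 cores free.
    free := 46
    apps := []*app{
        {id: "middle", askCPU: 50}, // oldest application, large ask
        {id: "after-1", askCPU: 10},
        {id: "after-2", askCPU: 10},
        {id: "after-3", askCPU: 10},
        {id: "after-4", askCPU: 10},
    }
    for {
        got := tryAllocateOnce(apps, &free)
        if got == "" {
            break
        }
        fmt.Printf("allocated %s, %d cores still free\n", got, free)
    }
    // The four 10-core "after" asks all schedule; the 50-core "middle" ask
    // never does, even though it was first in the queue.
}
{code}
With a steady stream of new small asks arriving as old ones finish, the head-of-queue application in this sketch never sees enough contiguous free space to run, which matches the starvation I'm seeing.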

How can I configure or modify YuniKorn to prevent starvation, and make the applications in a queue execute in order, or at least not arbitrarily far out of order?

(I already tried the {{stateaware}} queue sort, but it doesn't seem to work well with applications as small as mine. It appeared to run only one application at a time, because my applications finish so fast.)

h4. Replication

First, have a Kubernetes cluster with a node {{k1.kube}} with 96 cores.

Next, set up YuniKorn 0.12.2 with this {{values.yml}} for the Helm chart:

 
{code:yaml}
embedAdmissionController: false
configuration: |
  partitions:
    -
      name: default
      placementrules:
        - name: tag
          value: namespace
          create: true
      queues:
        - name: root
          submitacl: '*'
          childtemplate:
            properties:
              application.sort.policy: fifo
{code}
Then, run this script:
{code:bash}
#!/usr/bin/env bash
# test-yunikorn.sh: Make sure YuniKorn prevents starvation
set -e

# Set this to annotate jobs other than the middle job
OTHER_JOB_ANNOTATIONS=''
# And similarly for the middle job
MIDDLE_JOB_ANNOTATIONS=''

# Where should we run?
#NODE_SELECTOR='nodeSelector: {"kubernetes.io/hostname": "k1.kube"}'
NODE_SELECTOR='affinity: {"nodeAffinity": {"requiredDuringSchedulingIgnoredDuringExecution": {"nodeSelectorTerms": [{"matchExpressions": [{"key": "kubernetes.io/hostname", "operator": "In", "values": ["k1.kube", "k2.kube", "k3.kube"]}]}]}}}'

# How many 10-core jobs do we need to fill everywhere we will run?
SCALE="30"

# Clean up
kubectl delete job -l app=yunikorntest || true

# Make 10-core jobs that will block out our test job for at least 2 minutes
# Make sure they don't all finish at once.
rm -f jobs_before.yml
for NUM in $(seq 1 ${SCALE}) ; do
cat >>jobs_before.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: presleep${NUM}
  labels:
    app: yunikorntest
  ${OTHER_JOB_ANNOTATIONS}
spec:
  template:
    metadata:
      labels:
        app: yunikorntest
        applicationId: before-${NUM}
    spec:
      schedulerName: yunikorn
      ${NODE_SELECTOR}
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep",  "$(( $RANDOM % 20 + 120 ))"]
        resources:
          limits:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
      restartPolicy: Never
  ttlSecondsAfterFinished: 1000
---
EOF
done

# How many jobs do we need to fill the cluster to compete against?
COMPETING_JOBS="$((SCALE*20))"

# And 10-core jobs that, if they all pass it, will keep it blocked out for 20 minutes
# We expect it really to be blocked like 5-7-10 minutes if the SLA plugin is working.
rm -f jobs_after.yml
for NUM in $(seq 1 ${COMPETING_JOBS}) ; do
cat >>jobs_after.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: postsleep${NUM}
  labels:
    app: yunikorntest
  ${OTHER_JOB_ANNOTATIONS}
spec:
  template:
    metadata:
      labels:
        app: yunikorntest
        applicationId: after-${NUM}
    spec:
      schedulerName: yunikorn
      ${NODE_SELECTOR}
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep",  "$(( $RANDOM % 20 + 60 ))"]
        resources:
          limits:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
      restartPolicy: Never
  ttlSecondsAfterFinished: 1000
---
EOF
done

# And the test job itself between them.
rm -f job_middle.yml
cat >job_middle.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: middle
  labels:
    app: yunikorntest
  ${MIDDLE_JOB_ANNOTATIONS}
spec:
  template:
    metadata:
      labels:
        app: yunikorntest
        applicationId: middle
    spec:
      schedulerName: yunikorn
      ${NODE_SELECTOR}
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep", "1"]
        resources:
          limits:
            memory: 300M
            cpu: 50000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 50000m
            ephemeral-storage: 1G
      restartPolicy: Never
  ttlSecondsAfterFinished: 1000
EOF

kubectl apply -f jobs_before.yml
sleep 10
kubectl apply -f job_middle.yml
sleep 10
CREATION_TIME="$(kubectl get job middle -o jsonpath='{.metadata.creationTimestamp}')"
kubectl apply -f jobs_after.yml
# Wait for it to finish
echo "Waiting for middle job to finish..."
COMPLETION_TIME=""
while [[ -z "${COMPLETION_TIME}" ]] ; do
    sleep 10
    JOB_STATE="$(kubectl get job middle -o jsonpath='{.status.succeeded}' || true)"
    if [[ "${JOB_STATE}" == "1" ]] ; then
        COMPLETION_TIME="$(kubectl get job middle -o jsonpath='{.status.completionTime}' || true)"
    fi
done
echo "Test large job was created at ${CREATION_TIME} and completed at ${COMPLETION_TIME}"
{code}
You will see that YuniKorn will run the vast majority of the "postsleep" jobs before allowing the "middle" job to schedule and run, even though the "middle" job was submitted to the queue first. By increasing the number of "postsleep" jobs submitted, you can starve the "middle" job for an arbitrarily long amount of time.
