You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@nuttx.apache.org by je...@apache.org on 2020/12/10 07:34:07 UTC
[incubator-nuttx] 02/04: arch,
sched: Fix global IRQ control logics for SMP
This is an automated email from the ASF dual-hosted git repository.
jerpelea pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-nuttx.git
commit 409c65ce0bc27df685bd595f5100ad03692d674a
Author: Masayuki Ishikawa <ma...@gmail.com>
AuthorDate: Wed Nov 25 11:18:24 2020 +0900
arch, sched: Fix global IRQ control logics for SMP
Summary:
- This commit fixes global IRQ control logic
- In previous implementation, g_cpu_irqset for a remote CPU was
set in sched_add_readytorun(), sched_remove_readytorun() and
up_schedule_sigaction()
- In this implementation, they are removed.
- Instead, in the pause handler, call enter_critical_setion()
which will call up_cpu_paused() then acquire g_cpu_irqlock
- So if a new task with irqcount > 1 restarts on the remote CPU,
the CPU will only hold a critical section. Thus, the issue such as
'POSSIBLE FOR TWO CPUs TO HOLD A CRITICAL SECTION' could be resolved.
- Fix nxsched_resume_scheduler() so that it does not call spin_clrbit()
if a CPU does not hold a g_cpu_irqset
- Fix nxtask_exit() so that it acquires g_cpu_irqlock
- Update TODO
Impact:
- All SMP implementations
Testing:
- Tested with smp, ostest with the following configurations
- Tested with spresense:wifi_smp (NCPUS=2,4)
- Tested with sabre-6quad:smp (QEMU, dev board)
- Tested with maix-bit:smp (QEMU)
- Tested with esp32-core:smp (QEMU)
- Tested with lc823450-xgevk:rndis
Signed-off-by: Masayuki Ishikawa <Ma...@jp.sony.com>
---
TODO | 68 +-------------------------
arch/arm/src/armv7-a/arm_cpupause.c | 17 ++++++-
arch/arm/src/armv7-a/arm_schedulesigaction.c | 14 ++----
arch/arm/src/armv7-m/arm_schedulesigaction.c | 14 ++----
arch/arm/src/cxd56xx/cxd56_cpupause.c | 32 ++++++++++--
arch/arm/src/lc823450/lc823450_cpupause.c | 17 ++++++-
arch/risc-v/src/k210/k210_cpupause.c | 17 ++++++-
arch/risc-v/src/k210/k210_schedulesigaction.c | 14 ++----
arch/xtensa/src/common/xtensa_cpupause.c | 17 ++++++-
arch/xtensa/src/common/xtensa_schedsigaction.c | 14 ++----
sched/sched/sched_addreadytorun.c | 43 ++--------------
sched/sched/sched_removereadytorun.c | 43 ++--------------
sched/sched/sched_resumescheduler.c | 7 ++-
sched/task/task_exit.c | 7 +++
14 files changed, 131 insertions(+), 193 deletions(-)
diff --git a/TODO b/TODO
index acc6e5b..00bb931 100644
--- a/TODO
+++ b/TODO
@@ -1,4 +1,4 @@
-NuttX TODO List (Last updated November 20, 2020)
+NuttX TODO List (Last updated November 26, 2020)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This file summarizes known NuttX bugs, limitations, inconsistencies with
@@ -10,7 +10,7 @@ issues related to each board port.
nuttx/:
(16) Task/Scheduler (sched/)
- (3) SMP
+ (2) SMP
(1) Memory Management (mm/)
(0) Power Management (drivers/pm)
(5) Signals (sched/signal, arch/)
@@ -485,70 +485,6 @@ o SMP
an bugs caused by this. But I believe that failures are
possible.
- Title: POSSIBLE FOR TWO CPUs TO HOLD A CRITICAL SECTION?
- Description: The SMP design includes logic that will support multiple
- CPUs holding a critical section. Is this necessary? How
- can that occur? I think it can occur in the following
- situation:
-
- The log below was reported is NuttX running on two cores
- Cortex-A7 architecture in SMP mode. You can notice see that
- when nxsched_add_readytorun() was called, the g_cpu_irqset is 3.
-
- nxsched_add_readytorun: irqset cpu 1, me 0 btcbname init, irqset 1 irqcount 2.
- nxsched_add_readytorun: nxsched_add_readytorun line 338 g_cpu_irqset = 3.
-
- This can happen, but only under a very certain condition.
- g_cpu_irqset only exists to support this certain condition:
-
- a. A task running on CPU 0 takes the critical section. So
- g_cpu_irqset == 0x1.
-
- b. A task exits on CPU 1 and a waiting, ready-to-run task
- is re-started on CPU 1. This new task also holds the
- critical section. So when the task is re-restarted on
- CPU 1, we than have g_cpu_irqset == 0x3
-
- So we are in a very perverse state! There are two tasks
- running on two different CPUs and both hold the critical
- section. I believe that is a dangerous situation and there
- could be undiscovered bugs that could happen in that case.
- However, as of this moment, I have not heard of any specific
- problems caused by this weird behavior.
-
- A possible solution would be to add a new task state that
- would exist only for SMP.
-
- - Add a new SMP-only task list and state. Say,
- g_csection_wait[]. It should be prioritized.
- - When a task acquires the critical section, all tasks in
- g_readytorun[] that need the critical section would be
- moved to g_csection_wait[].
- - When any task is unblocked for any reason and moved to the
- g_readytorun[] list, if that unblocked task needs the
- critical section, it would also be moved to the
- g_csection_wait[] list. No task that needs the critical
- section can be in the ready-to-run list if the critical
- section is not available.
- - When the task releases the critical section, all tasks in
- the g_csection_wait[] needs to be moved back to
- g_readytorun[].
- - This may result in a context switch. The tasks should be
- moved back to g_readytorun[] highest priority first. If a
- context switch occurs and the critical section to re-taken
- by the re-started task, the lower priority tasks in
- g_csection_wait[] must stay in that list.
-
- That is really not as much work as it sounds. It is
- something that could be done in 2-3 days of work if you know
- what you are doing. Getting the proper test setup and
- verifying the change would be the more difficult task.
-
-Status: Open
-Priority: Unknown. Might be high, but first we would need to confirm
- that this situation can occur and that is actually causes
- a failure.
-
o Memory Management (mm/)
^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/arch/arm/src/armv7-a/arm_cpupause.c b/arch/arm/src/armv7-a/arm_cpupause.c
index 4725418..8b2c56f 100644
--- a/arch/arm/src/armv7-a/arm_cpupause.c
+++ b/arch/arm/src/armv7-a/arm_cpupause.c
@@ -212,9 +212,22 @@ int arm_pause_handler(int irq, FAR void *context, FAR void *arg)
* been processed then g_cpu_paused[cpu] will not be locked.
*/
- if (spin_islocked(&g_cpu_paused[cpu]))
+ if (up_cpu_pausereq(cpu))
{
- return up_cpu_paused(cpu);
+ /* NOTE: The following enter_critical_section() will call
+ * up_cpu_paused() to process a pause request to break a deadlock
+ * because the caller held a critical section. Once up_cpu_paused()
+ * finished, the caller will proceed and release the g_cpu_irqlock.
+ * Then this CPU will acquire g_cpu_irqlock in the function.
+ */
+
+ irqstate_t flags = enter_critical_section();
+
+ /* NOTE: the pause request should not exist here */
+
+ DEBUGVERIFY(!up_cpu_pausereq(cpu));
+
+ leave_critical_section(flags);
}
return OK;
diff --git a/arch/arm/src/armv7-a/arm_schedulesigaction.c b/arch/arm/src/armv7-a/arm_schedulesigaction.c
index 424ba0d..8b44b32 100644
--- a/arch/arm/src/armv7-a/arm_schedulesigaction.c
+++ b/arch/arm/src/armv7-a/arm_schedulesigaction.c
@@ -314,17 +314,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver)
DEBUGASSERT(tcb->irqcount < INT16_MAX);
tcb->irqcount++;
- /* In an SMP configuration, the interrupt disable logic also
- * involves spinlocks that are configured per the TCB irqcount
- * field. This is logically equivalent to
- * enter_critical_section(). The matching call to
- * leave_critical_section() will be performed in
- * arm_sigdeliver().
+ /* NOTE: If the task runs on another CPU(cpu), adjusting
+ * global IRQ controls will be done in the pause handler
+ * on the CPU(cpu) by taking a critical section.
+ * If the task is scheduled on this CPU(me), do nothing
+ * because this CPU already took a critical section
*/
- spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
- &g_cpu_irqlock);
-
/* RESUME the other CPU if it was PAUSED */
if (cpu != me)
diff --git a/arch/arm/src/armv7-m/arm_schedulesigaction.c b/arch/arm/src/armv7-m/arm_schedulesigaction.c
index 96aaf3f..7376939 100644
--- a/arch/arm/src/armv7-m/arm_schedulesigaction.c
+++ b/arch/arm/src/armv7-m/arm_schedulesigaction.c
@@ -362,17 +362,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver)
DEBUGASSERT(tcb->irqcount < INT16_MAX);
tcb->irqcount++;
- /* In an SMP configuration, the interrupt disable logic also
- * involves spinlocks that are configured per the TCB irqcount
- * field. This is logically equivalent to
- * enter_critical_section(). The matching call to
- * leave_critical_section() will be performed in
- * arm_sigdeliver().
+ /* NOTE: If the task runs on another CPU(cpu), adjusting
+ * global IRQ controls will be done in the pause handler
+ * on the CPU(cpu) by taking a critical section.
+ * If the task is scheduled on this CPU(me), do nothing
+ * because this CPU already took a critical section
*/
- spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
- &g_cpu_irqlock);
-
/* RESUME the other CPU if it was PAUSED */
if (cpu != me)
diff --git a/arch/arm/src/cxd56xx/cxd56_cpupause.c b/arch/arm/src/cxd56xx/cxd56_cpupause.c
index c8bd410..925450a 100644
--- a/arch/arm/src/cxd56xx/cxd56_cpupause.c
+++ b/arch/arm/src/cxd56xx/cxd56_cpupause.c
@@ -283,6 +283,7 @@ int up_cpu_paused(int cpu)
int arm_pause_handler(int irq, void *c, FAR void *arg)
{
int cpu = up_cpu_index();
+ int ret = OK;
DPRINTF("cpu%d will be paused \n", cpu);
@@ -295,12 +296,37 @@ int arm_pause_handler(int irq, void *c, FAR void *arg)
* interrupt by calling up_cpu_paused.
*/
- if (spin_islocked(&g_cpu_paused[cpu]))
+ if (up_cpu_pausereq(cpu))
{
- return up_cpu_paused(cpu);
+ /* NOTE: The following enter_critical_section() would call
+ * up_cpu_paused() to process a pause request to break a deadlock
+ * because the caller held a critical section. Once up_cpu_paused()
+ * finished, the caller will proceed and release the g_cpu_irqlock.
+ * Then this CPU will acquire g_cpu_irqlock in the function.
+ */
+
+ irqstate_t flags = enter_critical_section();
+
+ /* NOTE: Normally, we do not call up_cpu_paused() here because
+ * the above enter_critical_setion() would call up_cpu_paused()
+ * inside because the caller holds a crtical section.
+ * Howerver, cxd56's remote IRQ control logic also uses this handler
+ * and a caller might not take a critical section to avoid a deadlock
+ * during up_enable_irq() and up_disable_irq(). This is allowed
+ * because IRQ control logic does not interact wtih the scheduler.
+ * This means that if the request was not handled above, we need
+ * to call up_cpu_puased() here again.
+ */
+
+ if (up_cpu_pausereq(cpu))
+ {
+ ret = up_cpu_paused(cpu);
+ }
+
+ leave_critical_section(flags);
}
- return OK;
+ return ret;
}
/****************************************************************************
diff --git a/arch/arm/src/lc823450/lc823450_cpupause.c b/arch/arm/src/lc823450/lc823450_cpupause.c
index 2714fc3..af8165c 100644
--- a/arch/arm/src/lc823450/lc823450_cpupause.c
+++ b/arch/arm/src/lc823450/lc823450_cpupause.c
@@ -209,9 +209,22 @@ int lc823450_pause_handler(int irq, void *c, FAR void *arg)
* interrupt by calling up_cpu_paused.
*/
- if (spin_islocked(&g_cpu_paused[cpu]))
+ if (up_cpu_pausereq(cpu))
{
- return up_cpu_paused(cpu);
+ /* NOTE: The following enter_critical_section() will call
+ * up_cpu_paused() to process a pause request to break a deadlock
+ * because the caller held a critical section. Once up_cpu_paused()
+ * finished, the caller will proceed and release the g_cpu_irqlock.
+ * Then this CPU will acquire g_cpu_irqlock in the function.
+ */
+
+ irqstate_t flags = enter_critical_section();
+
+ /* NOTE: the pause request should not exist here */
+
+ DEBUGVERIFY(!up_cpu_pausereq(cpu));
+
+ leave_critical_section(flags);
}
return OK;
diff --git a/arch/risc-v/src/k210/k210_cpupause.c b/arch/risc-v/src/k210/k210_cpupause.c
index 88af783..957d115 100644
--- a/arch/risc-v/src/k210/k210_cpupause.c
+++ b/arch/risc-v/src/k210/k210_cpupause.c
@@ -215,9 +215,22 @@ int riscv_pause_handler(int irq, void *c, FAR void *arg)
* interrupt by calling up_cpu_paused.
*/
- if (spin_islocked(&g_cpu_paused[cpu]))
+ if (up_cpu_pausereq(cpu))
{
- return up_cpu_paused(cpu);
+ /* NOTE: The following enter_critical_section() will call
+ * up_cpu_paused() to process a pause request to break a deadlock
+ * because the caller held a critical section. Once up_cpu_paused()
+ * finished, the caller will proceed and release the g_cpu_irqlock.
+ * Then this CPU will acquire g_cpu_irqlock in the function.
+ */
+
+ irqstate_t flags = enter_critical_section();
+
+ /* NOTE: the pause request should not exist here */
+
+ DEBUGVERIFY(!up_cpu_pausereq(cpu));
+
+ leave_critical_section(flags);
}
return OK;
diff --git a/arch/risc-v/src/k210/k210_schedulesigaction.c b/arch/risc-v/src/k210/k210_schedulesigaction.c
index 26a6c77..05aabbd 100644
--- a/arch/risc-v/src/k210/k210_schedulesigaction.c
+++ b/arch/risc-v/src/k210/k210_schedulesigaction.c
@@ -353,17 +353,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver)
DEBUGASSERT(tcb->irqcount < INT16_MAX);
tcb->irqcount++;
- /* In an SMP configuration, the interrupt disable logic also
- * involves spinlocks that are configured per the TCB irqcount
- * field. This is logically equivalent to
- * enter_critical_section(). The matching call to
- * leave_critical_section() will be performed in
- * up_sigdeliver().
+ /* NOTE: If the task runs on another CPU(cpu), adjusting
+ * global IRQ controls will be done in the pause handler
+ * on the CPU(cpu) by taking a critical section.
+ * If the task is scheduled on this CPU(me), do nothing
+ * because this CPU already took a critical section
*/
- spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
- &g_cpu_irqlock);
-
/* RESUME the other CPU if it was PAUSED */
if (cpu != me)
diff --git a/arch/xtensa/src/common/xtensa_cpupause.c b/arch/xtensa/src/common/xtensa_cpupause.c
index 5b59200..7c58176 100644
--- a/arch/xtensa/src/common/xtensa_cpupause.c
+++ b/arch/xtensa/src/common/xtensa_cpupause.c
@@ -190,9 +190,22 @@ void xtensa_pause_handler(void)
* interrupt by calling up_cpu_paused.
*/
- if (spin_islocked(&g_cpu_paused[cpu]))
+ if (up_cpu_pausereq(cpu))
{
- up_cpu_paused(cpu);
+ /* NOTE: The following enter_critical_section() will call
+ * up_cpu_paused() to process a pause request to break a deadlock
+ * because the caller held a critical section. Once up_cpu_paused()
+ * finished, the caller will proceed and release the g_cpu_irqlock.
+ * Then this CPU will acquire g_cpu_irqlock in the function.
+ */
+
+ irqstate_t flags = enter_critical_section();
+
+ /* NOTE: the pause request should not exist here */
+
+ DEBUGVERIFY(!up_cpu_pausereq(cpu));
+
+ leave_critical_section(flags);
}
}
diff --git a/arch/xtensa/src/common/xtensa_schedsigaction.c b/arch/xtensa/src/common/xtensa_schedsigaction.c
index 9888557..62f4bfd 100644
--- a/arch/xtensa/src/common/xtensa_schedsigaction.c
+++ b/arch/xtensa/src/common/xtensa_schedsigaction.c
@@ -339,17 +339,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver)
DEBUGASSERT(tcb->irqcount < INT16_MAX);
tcb->irqcount++;
- /* In an SMP configuration, the interrupt disable logic also
- * involves spinlocks that are configured per the TCB irqcount
- * field. This is logically equivalent to
- * enter_critical_section().
- * The matching call to leave_critical_section() will be
- * performed in up_sigdeliver().
+ /* NOTE: If the task runs on another CPU(cpu), adjusting
+ * global IRQ controls will be done in the pause handler
+ * on the CPU(cpu) by taking a critical section.
+ * If the task is scheduled on this CPU(me), do nothing
+ * because this CPU already took a critical section
*/
- spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
- &g_cpu_irqlock);
-
/* RESUME the other CPU if it was PAUSED */
if (cpu != me)
diff --git a/sched/sched/sched_addreadytorun.c b/sched/sched/sched_addreadytorun.c
index 0d610b5..15077d9 100644
--- a/sched/sched/sched_addreadytorun.c
+++ b/sched/sched/sched_addreadytorun.c
@@ -318,47 +318,12 @@ bool nxsched_add_readytorun(FAR struct tcb_s *btcb)
&g_cpu_schedlock);
}
- /* Adjust global IRQ controls. If irqcount is greater than zero,
- * then this task/this CPU holds the IRQ lock
+ /* NOTE: If the task runs on another CPU(cpu), adjusting global IRQ
+ * controls will be done in the pause handler on the new CPU(cpu).
+ * If the task is scheduled on this CPU(me), do nothing because
+ * this CPU already has a critical section
*/
- if (btcb->irqcount > 0)
- {
- /* Yes... make sure that scheduling logic on other CPUs knows
- * that we hold the IRQ lock.
- */
-
- spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
- &g_cpu_irqlock);
- }
-
- /* No.. This CPU will be relinquishing the lock. But this works
- * differently if we are performing a context switch from an
- * interrupt handler and the interrupt handler has established
- * a critical section. We can detect this case when
- * g_cpu_nestcount[me] > 0.
- */
-
- else if (g_cpu_nestcount[me] <= 0)
- {
- /* Do nothing here
- * NOTE: spin_clrbit() will be done in sched_resumescheduler()
- */
- }
-
- /* Sanity check. g_cpu_netcount should be greater than zero
- * only while we are within the critical section and within
- * an interrupt handler. If we are not in an interrupt handler
- * then there is a problem; perhaps some logic previously
- * called enter_critical_section() with no matching call to
- * leave_critical_section(), leaving the non-zero count.
- */
-
- else
- {
- DEBUGASSERT(up_interrupt_context());
- }
-
/* If the following task is not locked to this CPU, then it must
* be moved to the g_readytorun list. Since it cannot be at the
* head of the list, we can do this without invoking any heavy
diff --git a/sched/sched/sched_removereadytorun.c b/sched/sched/sched_removereadytorun.c
index 963a259..87b6cbd 100644
--- a/sched/sched/sched_removereadytorun.c
+++ b/sched/sched/sched_removereadytorun.c
@@ -260,47 +260,12 @@ bool nxsched_remove_readytorun(FAR struct tcb_s *rtcb)
&g_cpu_schedlock);
}
- /* Adjust global IRQ controls. If irqcount is greater than zero,
- * then this task/this CPU holds the IRQ lock
+ /* NOTE: If the task runs on another CPU(cpu), adjusting global IRQ
+ * controls will be done in the pause handler on the new CPU(cpu).
+ * If the task is scheduled on this CPU(me), do nothing because
+ * this CPU already has a critical section
*/
- if (nxttcb->irqcount > 0)
- {
- /* Yes... make sure that scheduling logic on other CPUs knows
- * that we hold the IRQ lock.
- */
-
- spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
- &g_cpu_irqlock);
- }
-
- /* No.. This CPU will be relinquishing the lock. But this works
- * differently if we are performing a context switch from an
- * interrupt handler and the interrupt handler has established
- * a critical section. We can detect this case when
- * g_cpu_nestcount[me] > 0.
- */
-
- else if (g_cpu_nestcount[me] <= 0)
- {
- /* Do nothing here
- * NOTE: spin_clrbit() will be done in sched_resumescheduler()
- */
- }
-
- /* Sanity check. g_cpu_netcount should be greater than zero
- * only while we are within the critical section and within
- * an interrupt handler. If we are not in an interrupt handler
- * then there is a problem; perhaps some logic previously
- * called enter_critical_section() with no matching call to
- * leave_critical_section(), leaving the non-zero count.
- */
-
- else
- {
- DEBUGASSERT(up_interrupt_context());
- }
-
nxttcb->task_state = TSTATE_TASK_RUNNING;
/* All done, restart the other CPU (if it was paused). */
diff --git a/sched/sched/sched_resumescheduler.c b/sched/sched/sched_resumescheduler.c
index fc9fe26..88c354b 100644
--- a/sched/sched/sched_resumescheduler.c
+++ b/sched/sched/sched_resumescheduler.c
@@ -136,8 +136,11 @@ void nxsched_resume_scheduler(FAR struct tcb_s *tcb)
{
/* Release our hold on the IRQ lock. */
- spin_clrbit(&g_cpu_irqset, me, &g_cpu_irqsetlock,
- &g_cpu_irqlock);
+ if ((g_cpu_irqset & (1 << me)) != 0)
+ {
+ spin_clrbit(&g_cpu_irqset, me, &g_cpu_irqsetlock,
+ &g_cpu_irqlock);
+ }
}
#endif /* CONFIG_SMP */
}
diff --git a/sched/task/task_exit.c b/sched/task/task_exit.c
index d570e45..64daf88 100644
--- a/sched/task/task_exit.c
+++ b/sched/task/task_exit.c
@@ -164,6 +164,13 @@ int nxtask_exit(void)
if (rtcb->irqcount == 0)
{
+ /* NOTE: Need to aquire g_cpu_irqlock here again before
+ * calling spin_setbit() becauses it was released in
+ * the above nxsched_resume_scheduler()
+ */
+
+ DEBUGVERIFY(irq_waitlock(this_cpu()));
+
spin_setbit(&g_cpu_irqset, this_cpu(), &g_cpu_irqsetlock,
&g_cpu_irqlock);
}