Posted to dev@nuttx.apache.org by Mårten Svanfeldt <ma...@actia.se> on 2023/03/03 08:48:06 UTC

Socketcan reordered packets and packet stalls

Hello,


We are using NuttX and its SocketCAN support (on top of iMXRT hardware), but have run into some problems related to packet reordering and possible soft locks/stalls on the RX side.


The symptom is that the upper layer, an ISO-TP implementation, receives packets out of order and sometimes does not receive the last packet in a burst. The out-of-order problem occurs on the order of a few packets per 10k, and the missing last packet has about the same prevalence (a few per 10k bursts).


Our analysis so far is that the issue stems from SocketCAN, unlike all other network drivers, doing packet ingestion without holding net_lock(). This creates a window in which can_recvmsg() holds the lock and has checked the read-ahead buffer, but has not yet set up the connection callback and started waiting. If a packet arrives within this window, the low-level driver calls can_callback() from the ISR, net_trylock() fails, and the newly received packet is put into the read-ahead buffer. can_recvmsg() then sets up the callback, and the next received packet is delivered to the application ahead of the buffered one, resulting in reordering. If the packet placed in the read-ahead buffer is the last in a burst (each burst has to be acked before the sender will send more packets), the callback is never called and can_recvmsg() waits forever.


We have solved the out-of-order delivery with a local patch (basically, once we get a callback we process the read-ahead buffer again before returning the newly received packet), but it is not very clean and does not solve the last-packet-in-burst problem. I am looking into ways to solve this properly and contribute the fix upstream to NuttX, so I would like input on what would be considered an acceptable solution. Right now I see two possible solutions:


  1.  Keep our current way of dealing with reordering, and add an enqueued work function that calls the callback when net_trylock() fails (data is immediately put on the read-ahead queue, and a work item later invokes the callback to notify the application). This will probably work, but it results in two notification chains and two places where read-ahead is handled.

  2.  Change SocketCAN to use the same pattern as all other network drivers and always ingest packets in a work function. The ISR disables interrupts and enqueues a work function that reads the packets out of the hardware mailboxes and sends them into the network layer (under net_lock()). This solves both problems cleanly; however, it could be an issue for hardware that has very few mailboxes at high packet rates. In our case with iMXRT we potentially have up to 60 hardware mailboxes, so it is not a problem, but on other hardware the driver may need to implement a software FIFO. This is currently my preferred solution.

Any input from anyone involved in the original implementation is greatly appreciated.

Regards
Marten Svanfeldt

Re: Socketcan reordered packets and packet stalls

Posted by Mårten Svanfeldt <ma...@actia.se>.
Thanks for your input Peter,

> The IMXRT SocketCAN driver is based on the S32K1XX driver, which had a limited number of mailboxes.
> So they had to be cleared as quickly as possible to make sure we don't drop packets if all mailboxes get full.
I did gather that from the git history; however, I hope we can find a way not to unnecessarily cripple newer, more capable hardware while still supporting the more limited case.

>
> On the IMXRT it's different since in non-FD mode you've got a lot of mailboxes.
> Indeed, the solution would be to schedule a workqueue to empty the mailboxes and, instead of net_trylock(), do a net_lock(), so the workqueue suspends if it doesn't get the lock but, once it gets it, immediately clears the mailboxes and releases the lock.
> Still, with the workqueue you have to be careful that you don't drop packets.
>
> The better solution is to utilize the new features of the IOB buffer rewrite.
> I haven't looked into the details, but it allows pre-allocating IOBs so that the workqueue/IRQ can immediately push the packets to the network.
> This would work with hardware with a low number of mailboxes and would even open up the possibility of using DMA to transfer the CAN packets.

I did look into it a bit, and it is possible to use the IOB buffers to handle the "temporary storage", or rather to remove the need for it. Unfortunately, at this point we have released software based on NuttX 10.2 (before the IOB rewrite of the network layer) and need to implement this as a bugfix for that version.

I did review the changes between the latest master and 10.2, and even though things have changed so that an IOB can be passed from the low-level driver to the upper layers, the SocketCAN drivers all still do net_trylock() and call can_input()/can_callback() from interrupt context, which means they miss notifying the socket recv code in can_recvmsg(). How does the following sound as a way forward:

 - I implement (locally in our repo) the change from in-interrupt processing to workqueue-based processing; can_input()/can_callback() are changed to expect being called from a workqueue and can therefore use net_lock() etc., which handles both the reordering and stall issues right now.

 - Once we bump our local NuttX version, expected within a few months, I'll forward-port this patch and contribute it upstream. It will then work out of the box for processors such as iMXRT that have many mailboxes. For others such as the S32K1XX, or in general as an optimization, the driver can pre-allocate IOBs, queue the data into an IOB within the ISR, and let the workqueue pass the IOB (chain) to the upper layers. I could implement it for all SocketCAN drivers, there are not that many, but I would need help testing on non-iMXRT hardware as I don't have access to it.

What I don't want to do is implement a local solution now, then bump NuttX and forward-port it, only to have the solution rejected outright when I submit it upstream. Tweaks, adjustments, etc. are fine, of course, but I hope the general approach can be considered sound.

Regards
Marten Svanfeldt


Re: Socketcan reordered packets and packet stalls

Posted by Peter van der Perk <pe...@nxp.com>.
Hi Marten,

The IMXRT SocketCAN driver is based on the S32K1XX driver, which had a limited number of mailboxes.
So they had to be cleared as quickly as possible to make sure we don't drop packets if all mailboxes get full.

On the IMXRT it's different since in non-FD mode you've got a lot of mailboxes.
Indeed, the solution would be to schedule a workqueue to empty the mailboxes and, instead of net_trylock(), do a net_lock(), so the workqueue suspends if it doesn't get the lock but, once it gets it, immediately clears the mailboxes and releases the lock.
Still, with the workqueue you have to be careful that you don't drop packets.

The better solution is to utilize the new features of the IOB buffer rewrite.
I haven't looked into the details, but it allows pre-allocating IOBs so that the workqueue/IRQ can immediately push the packets to the network.
This would work with hardware with a low number of mailboxes and would even open up the possibility of using DMA to transfer the CAN packets.

Yours sincerely,

Peter van der Perk
