You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nuttx.apache.org by Carlos Sanchez <ca...@geotab.com.INVALID> on 2023/01/03 17:45:15 UTC

d_len/d_buf arbitration for s32k1xx_flexcan

Hi all,

I am observing an extrange behavior: under heavy-error CAN TX scenario (no
acks so TX fails always), usually after the second call to write() my
writes fail. This is expected as s32k1xx_flexcan has two TX mailboxes and
from my understanding of the code there is no other buffering (on this
specific CAN driver at least).

However, if I enable CAN errors, and depending on runtime sync conditions
(basically, if I put a breakpoint on s32k1xx_txpoll) then all the writes
after the first silently fail, without really trying to send anything (I
see on ESR2 register than second TX mailbox does not really become active).

After some debugging, I have seen that the CANWORK=LPWORK thread has
scheduled calls to s32k1xx_error_work, overwrites d_buf/d_len to send the
error frames in. But TX polling does not always happen in CANWORK thread:
it does when it comes from s32k1xx_txdone_work, but not when it comes from
s32k1xx_txavail_work, which, despite the name, is called directly on
s32k1xx_txavail context (which is the application context). What is
happening in my case, is txavail_work->...->devif_poll->can_poll is setting
d_buf/d_len to the packed to be sent, but before s32k1xx_txpoll checks it
(due to my breakpoint), s32k1xx_error kicks in, "steals" d_buf/d_len to
setup the error frame and calls can_input. The frame which was set up by
the polling sequence gets silently discarded.

I have tried setting s32k1xx_txavail_work inside CANWORK, but this fails
because can_sendmsg checks immediately for non-blocking writes; placing the
polling in the work queue would cause all non-blocking writes to fail.

Related to this, I also fail to see the arbitration between TX/RX for
d_buf/d_len. From what I see, the same problem I am describing could happen
by s32k1xx_receive "stealing" d_buf/d_len, same as s32k1xx_error is doing
in my case. But this is only a thought, I have not observed it.

A possible clean solution is to use another buffer, but it is complex and
would mean losing the direct connection between the write and the HW TX
(which might be useful in general and it is for my use case). A quicker
solution would be for s32k1xx_error to lock the network, forcing it to wait
until txavail_work is done. This would solve my case. My second concern is
more difficult to solve as comments in the code explicitly say RX cannot be
delayed to the work queue or CAN frames would be lost.

Any ideas or anything I might be missing here?

Thanks,

Carlos

-- 

Carlos Sanchez (he, him, his)
Geotab

Embedded Systems Developer Team Lead | Europe

Visit

www.geotab.com

Twitter <https://twitter.com/geotab> | Facebook
<https://www.facebook.com/Geotab> | YouTube
<https://www.youtube.com/user/MyGeotab> | LinkedIn
<https://www.linkedin.com/company/geotab/>

Re: Re: d_len/d_buf arbitration for s32k1xx_flexcan

Posted by Carlos Sanchez <ca...@geotab.com.INVALID>.
The PR which solved the interaction between main thread and error
generation: https://github.com/apache/nuttx/pull/8060

This affects the stm32 CAN driver, actually. I mentioned s32x_error()
because we have ported the error generation code to the s32k1xx CAN driver,
because we needed that. The drivers are quite similar, so error generation
(which is only present in base stm32 drivers) can be ported quickly to
stm32h7 and s32k1xx. We have done both. But this problem is present on the
base stm32 CAN driver.

Mateusz (raiden00pl) you might want to have a look.

Thanks,

Carlos

On Wed, Jan 4, 2023 at 2:56 PM Carlos Sanchez <ca...@geotab.com>
wrote:

> Hi Peter,
>
>
>> It seems that calling can_input directly from IRQ got broken since the
>> IOB rewrite.
>> Before can_input only used dev->d_appdata, but now can_input overwrites
>> the dev->d_buf pointer as well.
>>
>> https://github.com/apache/nuttx/blob/779a610ca3ba495640b49d6c36bce89784955e0d/net/can/can_input.c#L231
>> dev-
>> <https://github.com/apache/nuttx/blob/779a610ca3ba495640b49d6c36bce89784955e0d/net/can/can_input.c#L231dev->>d_len
>> always has been fixed to sizeof(struct can_frame) or sizeof(struct
>> canfd_frame) depending on kconfig setting.
>>
>
> Please note dev->d_len might change depending on the timestamping setting
> for each socket.
>
> Short term you could either go back to an older version of NuttX, or try
>> to schedule the workqueue for can_input and see if you can enough
>> throughput.
>>
>
> I have added net_lock()/net_unlock() around s32k1xx_error() call and this
> solves the interaction between the application thread and CANWORK thread
> for error frame injection. Other _work handlers do lock the network, so I
> think this just slipped by and it should have been there since the
> beginning. I have found a couple of other problems in the driver, I
> will create a PR with all this.
>
> The interactions with interrupts are a little bit more problematic, though.
>
> Best regards,
>
> Carlos
>
> --
>
> Carlos Sanchez (he, him, his)
> Geotab
>
> Embedded Systems Developer Team Lead | Europe
>
> Visit
>
> www.geotab.com
>
> Twitter <https://twitter.com/geotab> | Facebook
> <https://www.facebook.com/Geotab> | YouTube
> <https://www.youtube.com/user/MyGeotab> | LinkedIn
> <https://www.linkedin.com/company/geotab/>
>


-- 

Carlos Sanchez (he, him, his)
Geotab

Embedded Systems Developer Team Lead | Europe

Visit

www.geotab.com

Twitter <https://twitter.com/geotab> | Facebook
<https://www.facebook.com/Geotab> | YouTube
<https://www.youtube.com/user/MyGeotab> | LinkedIn
<https://www.linkedin.com/company/geotab/>

Re: Re: d_len/d_buf arbitration for s32k1xx_flexcan

Posted by Carlos Sanchez <ca...@geotab.com.INVALID>.
Hi Peter,


> It seems that calling can_input directly from IRQ got broken since the IOB
> rewrite.
> Before can_input only used dev->d_appdata, but now can_input overwrites
> the dev->d_buf pointer as well.
>
> https://github.com/apache/nuttx/blob/779a610ca3ba495640b49d6c36bce89784955e0d/net/can/can_input.c#L231
> dev-
> <https://github.com/apache/nuttx/blob/779a610ca3ba495640b49d6c36bce89784955e0d/net/can/can_input.c#L231dev->>d_len
> always has been fixed to sizeof(struct can_frame) or sizeof(struct
> canfd_frame) depending on kconfig setting.
>

Please note dev->d_len might change depending on the timestamping setting
for each socket.

Short term you could either go back to an older version of NuttX, or try to
> schedule the workqueue for can_input and see if you can enough throughput.
>

I have added net_lock()/net_unlock() around s32k1xx_error() call and this
solves the interaction between the application thread and CANWORK thread
for error frame injection. Other _work handlers do lock the network, so I
think this just slipped by and it should have been there since the
beginning. I have found a couple of other problems in the driver, I
will create a PR with all this.

The interactions with interrupts are a little bit more problematic, though.

Best regards,

Carlos

-- 

Carlos Sanchez (he, him, his)
Geotab

Embedded Systems Developer Team Lead | Europe

Visit

www.geotab.com

Twitter <https://twitter.com/geotab> | Facebook
<https://www.facebook.com/Geotab> | YouTube
<https://www.youtube.com/user/MyGeotab> | LinkedIn
<https://www.linkedin.com/company/geotab/>

Re: Re: d_len/d_buf arbitration for s32k1xx_flexcan

Posted by Carlos Sanchez <ca...@geotab.com.INVALID>.
Hi Peter,

I have submitted a PR with a couple of fixes to s32k1xx_flexcan.c:
https://github.com/apache/nuttx/pull/8058

Among those, there is a fix to MAXMB field initialization in MCR. The
effects of that wrong init is the HW will use mailboxes the driver does not
configure nor check, causing RX data loss. This use gets more frequent as
the CAN frame arrival rate overcomes the interrupt handler frequency, as
the first mailboxes are not empty. I am wondering if this might have
affected your measurements when you decided not to use a work queue for RX
because you observed frame loss.

As I do not have the means to test heavy CAN traffic flow, I am just
pointing it to you. In my code, I have changed to a work queue based RX and
I have seen no problems so far (but IMHO it would be wiser to retest on
whatever setup you detected the frame loss originally).

BR

Carlos

On Wed, Jan 4, 2023 at 1:37 PM Peter van der Perk <pe...@nxp.com>
wrote:

> Hi,
>
> It seems that calling can_input directly from IRQ got broken since the IOB
> rewrite.
> Before can_input only used dev->d_appdata, but now can_input overwrites
> the dev->d_buf pointer as well.
>
> https://github.com/apache/nuttx/blob/779a610ca3ba495640b49d6c36bce89784955e0d/net/can/can_input.c#L231
> dev-
> <https://github.com/apache/nuttx/blob/779a610ca3ba495640b49d6c36bce89784955e0d/net/can/can_input.c#L231dev->>d_len
> always has been fixed to sizeof(struct can_frame) or sizeof(struct
> canfd_frame) depending on kconfig setting.
>
> Once the whole IOB rewrite gets in, is mature and is fully documented I
> can took a look at the mechanism again.
>
> Short term you could either go back to an older version of NuttX, or try
> to schedule the workqueue for can_input and see if you can enough
> throughput.
>
> Yours sincerely,
>
> Peter van der Perk
>
> -----Original Message-----
> From: raiden00pl <ra...@gmail.com>
> Sent: Wednesday, January 4, 2023 10:01 AM
> To: dev@nuttx.apache.org
> Subject: Re: d_len/d_buf arbitration for s32k1xx_flexcan
>
> Related issue:
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fnuttx%2Fissues%2F5142&data=05%7C01%7Cpeter.vanderperk%40nxp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=HIk%2Fza58GP94ZSc1ANtynWEz3fD3o6PV05Kv90ORYig%3D&reserved=0
> .
> Not using a work queue to handle CAN RX is basically wrong. Unfortunately
> all NXP SocketCAN drivers are affected.
>
> wt., 3 sty 2023 o 19:38 Xiang Xiao <xi...@gmail.com> napisał(a):
>
> >   Sorry, "you must do..." may confuse you. What I mean is the CAN driver.
> >
> > On Wed, Jan 4, 2023 at 2:30 AM Carlos Sanchez
> > <ca...@geotab.com.invalid> wrote:
> >
> > > Hi Xiang,
> > >
> > > Please note what I describe is not caused by my code using multiple
> > > threads, but is happening on Nuttx upstream. My code is single
> > > threaded, but s32k1xx_flexcan driver (and several other Socket CAN
> > > drivers as they all seem to be derived from the same code base) does
> > > some things on the thread I call write() from, and some other things
> > > on CANWORK work queue thread.
> > > My understanding is that net code is structured so work queue
> > > threads are used, generally, but in Socket CAN drivers this was
> > > "waived" to avoid
> > data
> > > loss, causing the problem I describe.
> > >
> > > Thanks,
> > >
> > > Carlos
> > >
> > > On Tue, Jan 3, 2023 at 6:56 PM Xiang Xiao
> > > <xi...@gmail.com>
> > > wrote:
> > >
> > > > Since tx/rx share the same d_len/d_buf, you must do send/recv in
> > > > one
> > and
> > > > only thread(either by system work thread or driver dedicated
> > > > thread) to avoid the race condition you describe below.
> > > >
> > > > On Wed, Jan 4, 2023 at 1:45 AM Carlos Sanchez
> > > > <ca...@geotab.com.invalid> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I am observing an extrange behavior: under heavy-error CAN TX
> > scenario
> > > > (no
> > > > > acks so TX fails always), usually after the second call to
> > > > > write() my writes fail. This is expected as s32k1xx_flexcan has
> > > > > two TX mailboxes
> > > and
> > > > > from my understanding of the code there is no other buffering
> > > > > (on
> > this
> > > > > specific CAN driver at least).
> > > > >
> > > > > However, if I enable CAN errors, and depending on runtime sync
> > > conditions
> > > > > (basically, if I put a breakpoint on s32k1xx_txpoll) then all
> > > > > the
> > > writes
> > > > > after the first silently fail, without really trying to send
> > > > > anything
> > > (I
> > > > > see on ESR2 register than second TX mailbox does not really
> > > > > become
> > > > active).
> > > > >
> > > > > After some debugging, I have seen that the CANWORK=LPWORK thread
> > > > > has scheduled calls to s32k1xx_error_work, overwrites
> > > > > d_buf/d_len to send
> > > the
> > > > > error frames in. But TX polling does not always happen in
> > > > > CANWORK
> > > thread:
> > > > > it does when it comes from s32k1xx_txdone_work, but not when it
> > > > > comes
> > > > from
> > > > > s32k1xx_txavail_work, which, despite the name, is called
> > > > > directly on s32k1xx_txavail context (which is the application
> > > > > context). What is happening in my case, is
> > > > > txavail_work->...->devif_poll->can_poll is
> > > > setting
> > > > > d_buf/d_len to the packed to be sent, but before s32k1xx_txpoll
> > checks
> > > it
> > > > > (due to my breakpoint), s32k1xx_error kicks in, "steals"
> > > > > d_buf/d_len
> > to
> > > > > setup the error frame and calls can_input. The frame which was
> > > > > set up
> > > by
> > > > > the polling sequence gets silently discarded.
> > > > >
> > > > > I have tried setting s32k1xx_txavail_work inside CANWORK, but
> > > > > this
> > > fails
> > > > > because can_sendmsg checks immediately for non-blocking writes;
> > placing
> > > > the
> > > > > polling in the work queue would cause all non-blocking writes to
> > fail.
> > > > >
> > > > > Related to this, I also fail to see the arbitration between
> > > > > TX/RX for d_buf/d_len. From what I see, the same problem I am
> > > > > describing could
> > > > happen
> > > > > by s32k1xx_receive "stealing" d_buf/d_len, same as s32k1xx_error
> > > > > is
> > > doing
> > > > > in my case. But this is only a thought, I have not observed it.
> > > > >
> > > > > A possible clean solution is to use another buffer, but it is
> > > > > complex
> > > and
> > > > > would mean losing the direct connection between the write and
> > > > > the HW
> > TX
> > > > > (which might be useful in general and it is for my use case). A
> > quicker
> > > > > solution would be for s32k1xx_error to lock the network, forcing
> > > > > it
> > to
> > > > wait
> > > > > until txavail_work is done. This would solve my case. My second
> > concern
> > > > is
> > > > > more difficult to solve as comments in the code explicitly say
> > > > > RX
> > > cannot
> > > > be
> > > > > delayed to the work queue or CAN frames would be lost.
> > > > >
> > > > > Any ideas or anything I might be missing here?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Carlos
> > > > >
> > > > > --
> > > > >
> > > > > Carlos Sanchez (he, him, his)
> > > > > Geotab
> > > > >
> > > > > Embedded Systems Developer Team Lead | Europe
> > > > >
> > > > > Visit
> > > > >
> > > > > https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2
> > > > > Fwww.geotab.com%2F&data=05%7C01%7Cpeter.vanderperk%40nxp.com%7C9
> > > > > 9afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5c3016
> > > > > 35%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIjoi
> > > > > MC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C300
> > > > > 0%7C%7C%7C&sdata=V7yT6O%2FNT%2B4zbERMFnffrayp59Kgjlqo9o4U4bF92H4
> > > > > %3D&reserved=0
> > > > >
> > > > > Twitter
> > > > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F
> > > > > %2Ftwitter.com%2Fgeotab&data=05%7C01%7Cpeter.vanderperk%40nxp.co
> > > > > m%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5
> > > > > c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJ
> > > > > WIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%
> > > > > 7C3000%7C%7C%7C&sdata=1DlNx9Iy%2BvxErS2AsgU43uut0IMDF29QDeL5usCu
> > > > > Lto%3D&reserved=0> | Facebook
> > > > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F
> > > > > %2Fwww.facebook.com%2FGeotab&data=05%7C01%7Cpeter.vanderperk%40n
> > > > > xp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92c
> > > > > d99c5c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3
> > > > > d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn
> > > > > 0%3D%7C3000%7C%7C%7C&sdata=BJKh9GlKnnMUZ1d31mW%2B%2FNrWoVUS5lL6t
> > > > > 8SEhVdt2XM%3D&reserved=0> | YouTube
> > > > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F
> > > > > %2Fwww.youtube.com%2Fuser%2FMyGeotab&data=05%7C01%7Cpeter.vander
> > > > > perk%40nxp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b
> > > > > 4c6fa92cd99c5c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTW
> > > > > FpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLC
> > > > > JXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Yw5fWwK9J0sfLkwzWqdvqj7Bn%2BZ
> > > > > xwfcsMvjsPxOf9fg%3D&reserved=0> | LinkedIn
> > > > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F
> > > > > %2Fwww.linkedin.com%2Fcompany%2Fgeotab%2F&data=05%7C01%7Cpeter.v
> > > > > anderperk%40nxp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d
> > > > > 3bc2b4c6fa92cd99c5c301635%7C0%7C1%7C638084196858222395%7CUnknown
> > > > > %7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1ha
> > > > > WwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=u5FaRsEC2aUMRWfAlt98v%2F
> > > > > 1dmbCNn7%2FMgvNXoSmzyUU%3D&reserved=0>
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Carlos Sanchez (he, him, his)
> > > Geotab
> > >
> > > Embedded Systems Developer Team Lead | Europe
> > >
> > > Visit
> > >
> > > https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
> > > .geotab.com%2F&data=05%7C01%7Cpeter.vanderperk%40nxp.com%7C99afe46e1
> > > 5344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C1%7
> > > C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQ
> > > IjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=V7y
> > > T6O%2FNT%2B4zbERMFnffrayp59Kgjlqo9o4U4bF92H4%3D&reserved=0
> > >
> > > Twitter
> > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Ft
> > > witter.com%2Fgeotab&data=05%7C01%7Cpeter.vanderperk%40nxp.com%7C99af
> > > e46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%
> > > 7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDA
> > > iLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdat
> > > a=1DlNx9Iy%2BvxErS2AsgU43uut0IMDF29QDeL5usCuLto%3D&reserved=0> |
> > > Facebook
> > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fw
> > > ww.facebook.com%2FGeotab&data=05%7C01%7Cpeter.vanderperk%40nxp.com%7
> > > C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5c301635
> > > %7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLj
> > > AwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
> > > &sdata=BJKh9GlKnnMUZ1d31mW%2B%2FNrWoVUS5lL6t8SEhVdt2XM%3D&reserved=0
> > > > | YouTube
> > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fw
> > > ww.youtube.com%2Fuser%2FMyGeotab&data=05%7C01%7Cpeter.vanderperk%40n
> > > xp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c
> > > 5c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIj
> > > oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%
> > > 7C%7C%7C&sdata=Yw5fWwK9J0sfLkwzWqdvqj7Bn%2BZxwfcsMvjsPxOf9fg%3D&rese
> > > rved=0> | LinkedIn
> > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fw
> > > ww.linkedin.com%2Fcompany%2Fgeotab%2F&data=05%7C01%7Cpeter.vanderper
> > > k%40nxp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92
> > > cd99c5c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8e
> > > yJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C
> > > 3000%7C%7C%7C&sdata=u5FaRsEC2aUMRWfAlt98v%2F1dmbCNn7%2FMgvNXoSmzyUU%
> > > 3D&reserved=0>
> > >
> >
>


-- 

Carlos Sanchez (he, him, his)
Geotab

Embedded Systems Developer Team Lead | Europe

Visit

www.geotab.com

Twitter <https://twitter.com/geotab> | Facebook
<https://www.facebook.com/Geotab> | YouTube
<https://www.youtube.com/user/MyGeotab> | LinkedIn
<https://www.linkedin.com/company/geotab/>

RE: Re: d_len/d_buf arbitration for s32k1xx_flexcan

Posted by Peter van der Perk <pe...@nxp.com>.
Hi,

It seems that calling can_input directly from IRQ got broken since the IOB rewrite.
Before can_input only used dev->d_appdata, but now can_input overwrites the dev->d_buf pointer as well.
https://github.com/apache/nuttx/blob/779a610ca3ba495640b49d6c36bce89784955e0d/net/can/can_input.c#L231
dev->d_len always has been fixed to sizeof(struct can_frame) or sizeof(struct canfd_frame) depending on kconfig setting.

Once the whole IOB rewrite gets in, is mature and is fully documented I can took a look at the mechanism again.

Short term you could either go back to an older version of NuttX, or try to schedule the workqueue for can_input and see if you can enough throughput.

Yours sincerely,

Peter van der Perk

-----Original Message-----
From: raiden00pl <ra...@gmail.com> 
Sent: Wednesday, January 4, 2023 10:01 AM
To: dev@nuttx.apache.org
Subject: Re: d_len/d_buf arbitration for s32k1xx_flexcan

Related issue: https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fnuttx%2Fissues%2F5142&data=05%7C01%7Cpeter.vanderperk%40nxp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=HIk%2Fza58GP94ZSc1ANtynWEz3fD3o6PV05Kv90ORYig%3D&reserved=0.
Not using a work queue to handle CAN RX is basically wrong. Unfortunately all NXP SocketCAN drivers are affected.

wt., 3 sty 2023 o 19:38 Xiang Xiao <xi...@gmail.com> napisał(a):

>   Sorry, "you must do..." may confuse you. What I mean is the CAN driver.
>
> On Wed, Jan 4, 2023 at 2:30 AM Carlos Sanchez 
> <ca...@geotab.com.invalid> wrote:
>
> > Hi Xiang,
> >
> > Please note what I describe is not caused by my code using multiple 
> > threads, but is happening on Nuttx upstream. My code is single 
> > threaded, but s32k1xx_flexcan driver (and several other Socket CAN 
> > drivers as they all seem to be derived from the same code base) does 
> > some things on the thread I call write() from, and some other things 
> > on CANWORK work queue thread.
> > My understanding is that net code is structured so work queue 
> > threads are used, generally, but in Socket CAN drivers this was 
> > "waived" to avoid
> data
> > loss, causing the problem I describe.
> >
> > Thanks,
> >
> > Carlos
> >
> > On Tue, Jan 3, 2023 at 6:56 PM Xiang Xiao 
> > <xi...@gmail.com>
> > wrote:
> >
> > > Since tx/rx share the same d_len/d_buf, you must do send/recv in 
> > > one
> and
> > > only thread(either by system work thread or driver dedicated 
> > > thread) to avoid the race condition you describe below.
> > >
> > > On Wed, Jan 4, 2023 at 1:45 AM Carlos Sanchez 
> > > <ca...@geotab.com.invalid> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I am observing an extrange behavior: under heavy-error CAN TX
> scenario
> > > (no
> > > > acks so TX fails always), usually after the second call to 
> > > > write() my writes fail. This is expected as s32k1xx_flexcan has 
> > > > two TX mailboxes
> > and
> > > > from my understanding of the code there is no other buffering 
> > > > (on
> this
> > > > specific CAN driver at least).
> > > >
> > > > However, if I enable CAN errors, and depending on runtime sync
> > conditions
> > > > (basically, if I put a breakpoint on s32k1xx_txpoll) then all 
> > > > the
> > writes
> > > > after the first silently fail, without really trying to send 
> > > > anything
> > (I
> > > > see on ESR2 register than second TX mailbox does not really 
> > > > become
> > > active).
> > > >
> > > > After some debugging, I have seen that the CANWORK=LPWORK thread 
> > > > has scheduled calls to s32k1xx_error_work, overwrites 
> > > > d_buf/d_len to send
> > the
> > > > error frames in. But TX polling does not always happen in 
> > > > CANWORK
> > thread:
> > > > it does when it comes from s32k1xx_txdone_work, but not when it 
> > > > comes
> > > from
> > > > s32k1xx_txavail_work, which, despite the name, is called 
> > > > directly on s32k1xx_txavail context (which is the application 
> > > > context). What is happening in my case, is 
> > > > txavail_work->...->devif_poll->can_poll is
> > > setting
> > > > d_buf/d_len to the packed to be sent, but before s32k1xx_txpoll
> checks
> > it
> > > > (due to my breakpoint), s32k1xx_error kicks in, "steals" 
> > > > d_buf/d_len
> to
> > > > setup the error frame and calls can_input. The frame which was 
> > > > set up
> > by
> > > > the polling sequence gets silently discarded.
> > > >
> > > > I have tried setting s32k1xx_txavail_work inside CANWORK, but 
> > > > this
> > fails
> > > > because can_sendmsg checks immediately for non-blocking writes;
> placing
> > > the
> > > > polling in the work queue would cause all non-blocking writes to
> fail.
> > > >
> > > > Related to this, I also fail to see the arbitration between 
> > > > TX/RX for d_buf/d_len. From what I see, the same problem I am 
> > > > describing could
> > > happen
> > > > by s32k1xx_receive "stealing" d_buf/d_len, same as s32k1xx_error 
> > > > is
> > doing
> > > > in my case. But this is only a thought, I have not observed it.
> > > >
> > > > A possible clean solution is to use another buffer, but it is 
> > > > complex
> > and
> > > > would mean losing the direct connection between the write and 
> > > > the HW
> TX
> > > > (which might be useful in general and it is for my use case). A
> quicker
> > > > solution would be for s32k1xx_error to lock the network, forcing 
> > > > it
> to
> > > wait
> > > > until txavail_work is done. This would solve my case. My second
> concern
> > > is
> > > > more difficult to solve as comments in the code explicitly say 
> > > > RX
> > cannot
> > > be
> > > > delayed to the work queue or CAN frames would be lost.
> > > >
> > > > Any ideas or anything I might be missing here?
> > > >
> > > > Thanks,
> > > >
> > > > Carlos
> > > >
> > > > --
> > > >
> > > > Carlos Sanchez (he, him, his)
> > > > Geotab
> > > >
> > > > Embedded Systems Developer Team Lead | Europe
> > > >
> > > > Visit
> > > >
> > > > https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2
> > > > Fwww.geotab.com%2F&data=05%7C01%7Cpeter.vanderperk%40nxp.com%7C9
> > > > 9afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5c3016
> > > > 35%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIjoi
> > > > MC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C300
> > > > 0%7C%7C%7C&sdata=V7yT6O%2FNT%2B4zbERMFnffrayp59Kgjlqo9o4U4bF92H4
> > > > %3D&reserved=0
> > > >
> > > > Twitter 
> > > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F
> > > > %2Ftwitter.com%2Fgeotab&data=05%7C01%7Cpeter.vanderperk%40nxp.co
> > > > m%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5
> > > > c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJ
> > > > WIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%
> > > > 7C3000%7C%7C%7C&sdata=1DlNx9Iy%2BvxErS2AsgU43uut0IMDF29QDeL5usCu
> > > > Lto%3D&reserved=0> | Facebook 
> > > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F
> > > > %2Fwww.facebook.com%2FGeotab&data=05%7C01%7Cpeter.vanderperk%40n
> > > > xp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92c
> > > > d99c5c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3
> > > > d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn
> > > > 0%3D%7C3000%7C%7C%7C&sdata=BJKh9GlKnnMUZ1d31mW%2B%2FNrWoVUS5lL6t
> > > > 8SEhVdt2XM%3D&reserved=0> | YouTube 
> > > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F
> > > > %2Fwww.youtube.com%2Fuser%2FMyGeotab&data=05%7C01%7Cpeter.vander
> > > > perk%40nxp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b
> > > > 4c6fa92cd99c5c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTW
> > > > FpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLC
> > > > JXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Yw5fWwK9J0sfLkwzWqdvqj7Bn%2BZ
> > > > xwfcsMvjsPxOf9fg%3D&reserved=0> | LinkedIn 
> > > > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F
> > > > %2Fwww.linkedin.com%2Fcompany%2Fgeotab%2F&data=05%7C01%7Cpeter.v
> > > > anderperk%40nxp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d
> > > > 3bc2b4c6fa92cd99c5c301635%7C0%7C1%7C638084196858222395%7CUnknown
> > > > %7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1ha
> > > > WwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=u5FaRsEC2aUMRWfAlt98v%2F
> > > > 1dmbCNn7%2FMgvNXoSmzyUU%3D&reserved=0>
> > > >
> > >
> >
> >
> > --
> >
> > Carlos Sanchez (he, him, his)
> > Geotab
> >
> > Embedded Systems Developer Team Lead | Europe
> >
> > Visit
> >
> > https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
> > .geotab.com%2F&data=05%7C01%7Cpeter.vanderperk%40nxp.com%7C99afe46e1
> > 5344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C1%7
> > C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQ
> > IjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=V7y
> > T6O%2FNT%2B4zbERMFnffrayp59Kgjlqo9o4U4bF92H4%3D&reserved=0
> >
> > Twitter 
> > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Ft
> > witter.com%2Fgeotab&data=05%7C01%7Cpeter.vanderperk%40nxp.com%7C99af
> > e46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%
> > 7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDA
> > iLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdat
> > a=1DlNx9Iy%2BvxErS2AsgU43uut0IMDF29QDeL5usCuLto%3D&reserved=0> | 
> > Facebook 
> > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fw
> > ww.facebook.com%2FGeotab&data=05%7C01%7Cpeter.vanderperk%40nxp.com%7
> > C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c5c301635
> > %7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLj
> > AwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
> > &sdata=BJKh9GlKnnMUZ1d31mW%2B%2FNrWoVUS5lL6t8SEhVdt2XM%3D&reserved=0
> > > | YouTube 
> > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fw
> > ww.youtube.com%2Fuser%2FMyGeotab&data=05%7C01%7Cpeter.vanderperk%40n
> > xp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92cd99c
> > 5c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8eyJWIj
> > oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%
> > 7C%7C%7C&sdata=Yw5fWwK9J0sfLkwzWqdvqj7Bn%2BZxwfcsMvjsPxOf9fg%3D&rese
> > rved=0> | LinkedIn 
> > <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fw
> > ww.linkedin.com%2Fcompany%2Fgeotab%2F&data=05%7C01%7Cpeter.vanderper
> > k%40nxp.com%7C99afe46e15344b2883eb08daee32418c%7C686ea1d3bc2b4c6fa92
> > cd99c5c301635%7C0%7C1%7C638084196858222395%7CUnknown%7CTWFpbGZsb3d8e
> > yJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C
> > 3000%7C%7C%7C&sdata=u5FaRsEC2aUMRWfAlt98v%2F1dmbCNn7%2FMgvNXoSmzyUU%
> > 3D&reserved=0>
> >
>

Re: d_len/d_buf arbitration for s32k1xx_flexcan

Posted by raiden00pl <ra...@gmail.com>.
Related issue: https://github.com/apache/nuttx/issues/5142.
Not using a work queue to handle CAN RX is basically wrong. Unfortunately
all NXP SocketCAN drivers are affected.

wt., 3 sty 2023 o 19:38 Xiang Xiao <xi...@gmail.com> napisał(a):

>   Sorry, "you must do..." may confuse you. What I mean is the CAN driver.
>
> On Wed, Jan 4, 2023 at 2:30 AM Carlos Sanchez
> <ca...@geotab.com.invalid> wrote:
>
> > Hi Xiang,
> >
> > Please note what I describe is not caused by my code using multiple
> > threads, but is happening on Nuttx upstream. My code is single threaded,
> > but s32k1xx_flexcan driver (and several other Socket CAN drivers as they
> > all seem to be derived from the same code base) does some things on the
> > thread I call write() from, and some other things on CANWORK work queue
> > thread.
> > My understanding is that net code is structured so work queue threads are
> > used, generally, but in Socket CAN drivers this was "waived" to avoid
> data
> > loss, causing the problem I describe.
> >
> > Thanks,
> >
> > Carlos
> >
> > On Tue, Jan 3, 2023 at 6:56 PM Xiang Xiao <xi...@gmail.com>
> > wrote:
> >
> > > Since tx/rx share the same d_len/d_buf, you must do send/recv in one
> and
> > > only thread(either by system work thread or driver dedicated thread) to
> > > avoid the race condition you describe below.
> > >
> > > On Wed, Jan 4, 2023 at 1:45 AM Carlos Sanchez
> > > <ca...@geotab.com.invalid> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I am observing an extrange behavior: under heavy-error CAN TX
> scenario
> > > (no
> > > > acks so TX fails always), usually after the second call to write() my
> > > > writes fail. This is expected as s32k1xx_flexcan has two TX mailboxes
> > and
> > > > from my understanding of the code there is no other buffering (on
> this
> > > > specific CAN driver at least).
> > > >
> > > > However, if I enable CAN errors, and depending on runtime sync
> > conditions
> > > > (basically, if I put a breakpoint on s32k1xx_txpoll) then all the
> > writes
> > > > after the first silently fail, without really trying to send anything
> > (I
> > > > see on ESR2 register than second TX mailbox does not really become
> > > active).
> > > >
> > > > After some debugging, I have seen that the CANWORK=LPWORK thread has
> > > > scheduled calls to s32k1xx_error_work, overwrites d_buf/d_len to send
> > the
> > > > error frames in. But TX polling does not always happen in CANWORK
> > thread:
> > > > it does when it comes from s32k1xx_txdone_work, but not when it comes
> > > from
> > > > s32k1xx_txavail_work, which, despite the name, is called directly on
> > > > s32k1xx_txavail context (which is the application context). What is
> > > > happening in my case, is txavail_work->...->devif_poll->can_poll is
> > > setting
> > > > d_buf/d_len to the packed to be sent, but before s32k1xx_txpoll
> checks
> > it
> > > > (due to my breakpoint), s32k1xx_error kicks in, "steals" d_buf/d_len
> to
> > > > setup the error frame and calls can_input. The frame which was set up
> > by
> > > > the polling sequence gets silently discarded.
> > > >
> > > > I have tried setting s32k1xx_txavail_work inside CANWORK, but this
> > fails
> > > > because can_sendmsg checks immediately for non-blocking writes;
> placing
> > > the
> > > > polling in the work queue would cause all non-blocking writes to
> fail.
> > > >
> > > > Related to this, I also fail to see the arbitration between TX/RX for
> > > > d_buf/d_len. From what I see, the same problem I am describing could
> > > happen
> > > > by s32k1xx_receive "stealing" d_buf/d_len, same as s32k1xx_error is
> > doing
> > > > in my case. But this is only a thought, I have not observed it.
> > > >
> > > > A possible clean solution is to use another buffer, but it is complex
> > and
> > > > would mean losing the direct connection between the write and the HW
> TX
> > > > (which might be useful in general and it is for my use case). A
> quicker
> > > > solution would be for s32k1xx_error to lock the network, forcing it
> to
> > > wait
> > > > until txavail_work is done. This would solve my case. My second
> concern
> > > is
> > > > more difficult to solve as comments in the code explicitly say RX
> > cannot
> > > be
> > > > delayed to the work queue or CAN frames would be lost.
> > > >
> > > > Any ideas or anything I might be missing here?
> > > >
> > > > Thanks,
> > > >
> > > > Carlos
> > > >
> > > > --
> > > >
> > > > Carlos Sanchez (he, him, his)
> > > > Geotab
> > > >
> > > > Embedded Systems Developer Team Lead | Europe
> > > >
> > > > Visit
> > > >
> > > > www.geotab.com
> > > >
> > > > Twitter <https://twitter.com/geotab> | Facebook
> > > > <https://www.facebook.com/Geotab> | YouTube
> > > > <https://www.youtube.com/user/MyGeotab> | LinkedIn
> > > > <https://www.linkedin.com/company/geotab/>
> > > >
> > >
> >
> >
> > --
> >
> > Carlos Sanchez (he, him, his)
> > Geotab
> >
> > Embedded Systems Developer Team Lead | Europe
> >
> > Visit
> >
> > www.geotab.com
> >
> > Twitter <https://twitter.com/geotab> | Facebook
> > <https://www.facebook.com/Geotab> | YouTube
> > <https://www.youtube.com/user/MyGeotab> | LinkedIn
> > <https://www.linkedin.com/company/geotab/>
> >
>

Re: d_len/d_buf arbitration for s32k1xx_flexcan

Posted by Xiang Xiao <xi...@gmail.com>.
  Sorry, "you must do..." may confuse you. What I mean is the CAN driver.

On Wed, Jan 4, 2023 at 2:30 AM Carlos Sanchez
<ca...@geotab.com.invalid> wrote:

> Hi Xiang,
>
> Please note what I describe is not caused by my code using multiple
> threads, but is happening on Nuttx upstream. My code is single threaded,
> but s32k1xx_flexcan driver (and several other Socket CAN drivers as they
> all seem to be derived from the same code base) does some things on the
> thread I call write() from, and some other things on CANWORK work queue
> thread.
> My understanding is that net code is structured so work queue threads are
> used, generally, but in Socket CAN drivers this was "waived" to avoid data
> loss, causing the problem I describe.
>
> Thanks,
>
> Carlos
>
> On Tue, Jan 3, 2023 at 6:56 PM Xiang Xiao <xi...@gmail.com>
> wrote:
>
> > Since tx/rx share the same d_len/d_buf, you must do send/recv in one and
> > only thread(either by system work thread or driver dedicated thread) to
> > avoid the race condition you describe below.
> >
> > On Wed, Jan 4, 2023 at 1:45 AM Carlos Sanchez
> > <ca...@geotab.com.invalid> wrote:
> >
> > > Hi all,
> > >
> > > I am observing an extrange behavior: under heavy-error CAN TX scenario
> > (no
> > > acks so TX fails always), usually after the second call to write() my
> > > writes fail. This is expected as s32k1xx_flexcan has two TX mailboxes
> and
> > > from my understanding of the code there is no other buffering (on this
> > > specific CAN driver at least).
> > >
> > > However, if I enable CAN errors, and depending on runtime sync
> conditions
> > > (basically, if I put a breakpoint on s32k1xx_txpoll) then all the
> writes
> > > after the first silently fail, without really trying to send anything
> (I
> > > see on ESR2 register than second TX mailbox does not really become
> > active).
> > >
> > > After some debugging, I have seen that the CANWORK=LPWORK thread has
> > > scheduled calls to s32k1xx_error_work, overwrites d_buf/d_len to send
> the
> > > error frames in. But TX polling does not always happen in CANWORK
> thread:
> > > it does when it comes from s32k1xx_txdone_work, but not when it comes
> > from
> > > s32k1xx_txavail_work, which, despite the name, is called directly on
> > > s32k1xx_txavail context (which is the application context). What is
> > > happening in my case, is txavail_work->...->devif_poll->can_poll is
> > setting
> > > d_buf/d_len to the packed to be sent, but before s32k1xx_txpoll checks
> it
> > > (due to my breakpoint), s32k1xx_error kicks in, "steals" d_buf/d_len to
> > > setup the error frame and calls can_input. The frame which was set up
> by
> > > the polling sequence gets silently discarded.
> > >
> > > I have tried setting s32k1xx_txavail_work inside CANWORK, but this
> fails
> > > because can_sendmsg checks immediately for non-blocking writes; placing
> > the
> > > polling in the work queue would cause all non-blocking writes to fail.
> > >
> > > Related to this, I also fail to see the arbitration between TX/RX for
> > > d_buf/d_len. From what I see, the same problem I am describing could
> > happen
> > > by s32k1xx_receive "stealing" d_buf/d_len, same as s32k1xx_error is
> doing
> > > in my case. But this is only a thought, I have not observed it.
> > >
> > > A possible clean solution is to use another buffer, but it is complex
> and
> > > would mean losing the direct connection between the write and the HW TX
> > > (which might be useful in general and it is for my use case). A quicker
> > > solution would be for s32k1xx_error to lock the network, forcing it to
> > wait
> > > until txavail_work is done. This would solve my case. My second concern
> > is
> > > more difficult to solve as comments in the code explicitly say RX
> cannot
> > be
> > > delayed to the work queue or CAN frames would be lost.
> > >
> > > Any ideas or anything I might be missing here?
> > >
> > > Thanks,
> > >
> > > Carlos
> > >
> > > --
> > >
> > > Carlos Sanchez (he, him, his)
> > > Geotab
> > >
> > > Embedded Systems Developer Team Lead | Europe
> > >
> > > Visit
> > >
> > > www.geotab.com
> > >
> > > Twitter <https://twitter.com/geotab> | Facebook
> > > <https://www.facebook.com/Geotab> | YouTube
> > > <https://www.youtube.com/user/MyGeotab> | LinkedIn
> > > <https://www.linkedin.com/company/geotab/>
> > >
> >
>
>
> --
>
> Carlos Sanchez (he, him, his)
> Geotab
>
> Embedded Systems Developer Team Lead | Europe
>
> Visit
>
> www.geotab.com
>
> Twitter <https://twitter.com/geotab> | Facebook
> <https://www.facebook.com/Geotab> | YouTube
> <https://www.youtube.com/user/MyGeotab> | LinkedIn
> <https://www.linkedin.com/company/geotab/>
>

Re: d_len/d_buf arbitration for s32k1xx_flexcan

Posted by Carlos Sanchez <ca...@geotab.com.INVALID>.
Hi Xiang,

Please note what I describe is not caused by my code using multiple
threads, but is happening on Nuttx upstream. My code is single threaded,
but s32k1xx_flexcan driver (and several other Socket CAN drivers as they
all seem to be derived from the same code base) does some things on the
thread I call write() from, and some other things on CANWORK work queue
thread.
My understanding is that net code is structured so work queue threads are
used, generally, but in Socket CAN drivers this was "waived" to avoid data
loss, causing the problem I describe.

Thanks,

Carlos

On Tue, Jan 3, 2023 at 6:56 PM Xiang Xiao <xi...@gmail.com> wrote:

> Since tx/rx share the same d_len/d_buf, you must do send/recv in one and
> only thread(either by system work thread or driver dedicated thread) to
> avoid the race condition you describe below.
>
> On Wed, Jan 4, 2023 at 1:45 AM Carlos Sanchez
> <ca...@geotab.com.invalid> wrote:
>
> > Hi all,
> >
> > I am observing an extrange behavior: under heavy-error CAN TX scenario
> (no
> > acks so TX fails always), usually after the second call to write() my
> > writes fail. This is expected as s32k1xx_flexcan has two TX mailboxes and
> > from my understanding of the code there is no other buffering (on this
> > specific CAN driver at least).
> >
> > However, if I enable CAN errors, and depending on runtime sync conditions
> > (basically, if I put a breakpoint on s32k1xx_txpoll) then all the writes
> > after the first silently fail, without really trying to send anything (I
> > see on ESR2 register than second TX mailbox does not really become
> active).
> >
> > After some debugging, I have seen that the CANWORK=LPWORK thread has
> > scheduled calls to s32k1xx_error_work, overwrites d_buf/d_len to send the
> > error frames in. But TX polling does not always happen in CANWORK thread:
> > it does when it comes from s32k1xx_txdone_work, but not when it comes
> from
> > s32k1xx_txavail_work, which, despite the name, is called directly on
> > s32k1xx_txavail context (which is the application context). What is
> > happening in my case, is txavail_work->...->devif_poll->can_poll is
> setting
> > d_buf/d_len to the packed to be sent, but before s32k1xx_txpoll checks it
> > (due to my breakpoint), s32k1xx_error kicks in, "steals" d_buf/d_len to
> > setup the error frame and calls can_input. The frame which was set up by
> > the polling sequence gets silently discarded.
> >
> > I have tried setting s32k1xx_txavail_work inside CANWORK, but this fails
> > because can_sendmsg checks immediately for non-blocking writes; placing
> the
> > polling in the work queue would cause all non-blocking writes to fail.
> >
> > Related to this, I also fail to see the arbitration between TX/RX for
> > d_buf/d_len. From what I see, the same problem I am describing could
> happen
> > by s32k1xx_receive "stealing" d_buf/d_len, same as s32k1xx_error is doing
> > in my case. But this is only a thought, I have not observed it.
> >
> > A possible clean solution is to use another buffer, but it is complex and
> > would mean losing the direct connection between the write and the HW TX
> > (which might be useful in general and it is for my use case). A quicker
> > solution would be for s32k1xx_error to lock the network, forcing it to
> wait
> > until txavail_work is done. This would solve my case. My second concern
> is
> > more difficult to solve as comments in the code explicitly say RX cannot
> be
> > delayed to the work queue or CAN frames would be lost.
> >
> > Any ideas or anything I might be missing here?
> >
> > Thanks,
> >
> > Carlos
> >
> > --
> >
> > Carlos Sanchez (he, him, his)
> > Geotab
> >
> > Embedded Systems Developer Team Lead | Europe
> >
> > Visit
> >
> > www.geotab.com
> >
> > Twitter <https://twitter.com/geotab> | Facebook
> > <https://www.facebook.com/Geotab> | YouTube
> > <https://www.youtube.com/user/MyGeotab> | LinkedIn
> > <https://www.linkedin.com/company/geotab/>
> >
>


-- 

Carlos Sanchez (he, him, his)
Geotab

Embedded Systems Developer Team Lead | Europe

Visit

www.geotab.com

Twitter <https://twitter.com/geotab> | Facebook
<https://www.facebook.com/Geotab> | YouTube
<https://www.youtube.com/user/MyGeotab> | LinkedIn
<https://www.linkedin.com/company/geotab/>

Re: d_len/d_buf arbitration for s32k1xx_flexcan

Posted by Xiang Xiao <xi...@gmail.com>.
Since tx/rx share the same d_len/d_buf, you must do send/recv in one and
only thread(either by system work thread or driver dedicated thread) to
avoid the race condition you describe below.

On Wed, Jan 4, 2023 at 1:45 AM Carlos Sanchez
<ca...@geotab.com.invalid> wrote:

> Hi all,
>
> I am observing an extrange behavior: under heavy-error CAN TX scenario (no
> acks so TX fails always), usually after the second call to write() my
> writes fail. This is expected as s32k1xx_flexcan has two TX mailboxes and
> from my understanding of the code there is no other buffering (on this
> specific CAN driver at least).
>
> However, if I enable CAN errors, and depending on runtime sync conditions
> (basically, if I put a breakpoint on s32k1xx_txpoll) then all the writes
> after the first silently fail, without really trying to send anything (I
> see on ESR2 register than second TX mailbox does not really become active).
>
> After some debugging, I have seen that the CANWORK=LPWORK thread has
> scheduled calls to s32k1xx_error_work, overwrites d_buf/d_len to send the
> error frames in. But TX polling does not always happen in CANWORK thread:
> it does when it comes from s32k1xx_txdone_work, but not when it comes from
> s32k1xx_txavail_work, which, despite the name, is called directly on
> s32k1xx_txavail context (which is the application context). What is
> happening in my case, is txavail_work->...->devif_poll->can_poll is setting
> d_buf/d_len to the packed to be sent, but before s32k1xx_txpoll checks it
> (due to my breakpoint), s32k1xx_error kicks in, "steals" d_buf/d_len to
> setup the error frame and calls can_input. The frame which was set up by
> the polling sequence gets silently discarded.
>
> I have tried setting s32k1xx_txavail_work inside CANWORK, but this fails
> because can_sendmsg checks immediately for non-blocking writes; placing the
> polling in the work queue would cause all non-blocking writes to fail.
>
> Related to this, I also fail to see the arbitration between TX/RX for
> d_buf/d_len. From what I see, the same problem I am describing could happen
> by s32k1xx_receive "stealing" d_buf/d_len, same as s32k1xx_error is doing
> in my case. But this is only a thought, I have not observed it.
>
> A possible clean solution is to use another buffer, but it is complex and
> would mean losing the direct connection between the write and the HW TX
> (which might be useful in general and it is for my use case). A quicker
> solution would be for s32k1xx_error to lock the network, forcing it to wait
> until txavail_work is done. This would solve my case. My second concern is
> more difficult to solve as comments in the code explicitly say RX cannot be
> delayed to the work queue or CAN frames would be lost.
>
> Any ideas or anything I might be missing here?
>
> Thanks,
>
> Carlos
>
> --
>
> Carlos Sanchez (he, him, his)
> Geotab
>
> Embedded Systems Developer Team Lead | Europe
>
> Visit
>
> www.geotab.com
>
> Twitter <https://twitter.com/geotab> | Facebook
> <https://www.facebook.com/Geotab> | YouTube
> <https://www.youtube.com/user/MyGeotab> | LinkedIn
> <https://www.linkedin.com/company/geotab/>
>