Posted to dev@nuttx.apache.org by Fotis Panagiotopoulos <f....@gmail.com> on 2022/07/19 11:47:06 UTC

STM32F4 Ethernet Issues

Hello!

I am using Ethernet on an STM32F427 target, but I am facing some issues.

Initially the device works correctly. After some hours of continuous
operation I completely lose all network communications.
Trying to troubleshoot the issue, I enabled assertions and various other
debug features.

Again the device works correctly for some hours, and then I get a failed
assertion at stm32_eth.c, line 1372:

DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);

No other errors are reported (e.g. stack overflows etc).


I have observed that this issue usually manifests itself when there is
insufficient stack on a task.
But in my case, all tasks have oversized stacks. Typically they do not
exceed 50% utilization.
I have plenty of room available in the heap too (> 100kB).

Regarding the rest of the firmware, I cannot see any other misbehaviour or
problem.
I have never seen any other unexplained problem, failed assertion,
hard fault, etc.
The application code passes all of our tests.
In fact, even when this issue happens, although I lose network
connectivity, the rest of the system works perfectly.

Please note that I have checked the contents of dev->d_len and dev->d_buf,
and they seem to contain valid data.
The address lies within the normal address space of the MCU, and the size
is sane.
So it doesn't look like any kind of memory corruption.
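
For illustration, the plausibility check described above can be sketched in C
as follows. This is only a host-side sketch: struct dev_state, dev_is_idle(),
dev_state_plausible() and the RAM_START, RAM_END and MAX_PKTSIZE values are
assumptions made for the example, not names or values taken from the NuttX
sources or from the actual board.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration only; the real ones come from the
 * linker script and the network configuration.
 */

#define RAM_START    0x20000000u
#define RAM_END      0x20030000u   /* exclusive */
#define MAX_PKTSIZE  1518u

/* Minimal stand-in for the two driver fields named in the assertion. */

struct dev_state
{
  uintptr_t d_buf;   /* buffer address, kept as an integer for the check */
  uint16_t  d_len;   /* number of bytes pending in the buffer */
};

/* The state the assertion expects: no buffer attached, nothing pending. */

bool dev_is_idle(const struct dev_state *dev)
{
  return dev->d_len == 0 && dev->d_buf == 0;
}

/* True when a non-idle state still looks like a legitimate packet buffer:
 * the address lies inside RAM and the length fits a packet.
 */

bool dev_state_plausible(const struct dev_state *dev)
{
  return dev->d_buf >= RAM_START &&
         dev->d_buf < RAM_END &&
         dev->d_len <= MAX_PKTSIZE &&
         dev->d_buf + dev->d_len <= RAM_END;
}
```

A state that fails dev_is_idle() but passes dev_state_plausible() matches what
was observed here: a leftover but well-formed buffer rather than random
corruption.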


At this point I believe that this is an actual bug, either in the STM32 MAC
driver or in the TCP/IP stack itself.
I had a look at the driver code, but I didn't see anything suspicious.


Has anyone observed the same issue before?
Could it be affected in any way by my configuration?
Or do you have any recommendations on what to test next?


Thank you!

Re: STM32F4 Ethernet Issues

Posted by Nathan Hartman <ha...@gmail.com>.

I am currently working on a networking device and trying to get it working
more robustly. My device is using a different ARM-based micro.

I have seen that when I enable the network debugging features, it seems to
hit an assertion failure before getting to nsh prompt at startup. This was
on a quite recent master. I haven't had a chance to diagnose this further.
Have you tried enabling these and if so, do they work?

Also, out of curiosity, have you tried running ostest on your board?

Thanks,
Nathan

Re: STM32F4 Ethernet Issues

Posted by Fotis Panagiotopoulos <f....@gmail.com>.
Hello,

I was on vacation last week, so I didn't make any progress on this.
I want to fix it, but I need guidance. No one has commented on this...

Did anyone manage to reproduce the issue using my fork?

On Tue, Aug 23, 2022 at 10:42 AM Sebastien Lorquet <se...@lorquet.fr>
wrote:

> Hi,
>
> is there any follow up on this point?
>
> Sebastien
>
>
> On 13/08/2022 at 16:44, Fotis Panagiotopoulos wrote:
> > Ok, I just managed to reproduce the issue on a NUCLEO-F429ZI, using the
> > NuttX apps.
> >
> > Please check my fork on
> > https://github.com/fjpanag/incubator-nuttx-apps/tree/tcp_issue
> > See the branch tcp_issue.
> >
> > I have "hacked" the NSH code to reproduce the issue.
> > A TCP connection is opened, and then closed. Then the network interface is
> > brought down. At this point the system crashes immediately.
> >
> > Note that I have locked the scheduler when the connection is closed.
> > This is to simulate an ifdown action BEFORE the FIN ACK is processed (as it
> > happens in my case).
> > My code does not have this locking, this is only for simulation purposes.
> >
> > Please use the provided defconfig. It is stored in the root of my apps fork.
> > I guess it is not related to the configuration, but my "working" sample is
> > provided.
> >
> >
> >
> > On Fri, Aug 12, 2022 at 7:15 PM Alan Carvalho de Assis <acassis@gmail.com>
> > wrote:
> >
> >> Hi Fotis,
> >>
> >> Yes, I understood the point. Because it needs the right timing, it
> >> could be tricky to duplicate.
> >>
> >> Did you try to create a simple host server to try to emulate this
> >> connection issue?
> >>
> >> BR,
> >>
> >> Alan
> >>
> >> On 8/12/22, Fotis Panagiotopoulos <f....@gmail.com> wrote:
> >>> I think I understand the nature of the bug.
> >>>
> >>> When closing a socket, tcp_close_eventhandler() is registered as a
> >>> callback in the dev->d_devcb list.
> >>>
> >>> Typically, the server's response (FIN ACK) causes tcp_callback() to be
> >>> executed, and thus the callback is called with proper arguments.
> >>> Then the cb is properly freed.
> >>>
> >>> If, however, devif_dev_event() has the chance to execute before
> >>> tcp_callback() (e.g. the server's response was lost), then the callbacks
> >>> receive NULL as the conn argument.
> >>> This crashes the whole system horribly.
> >>>
> >>> As you see, this requires specific timing in the server communication,
> >>> which is why it is so hard to reproduce.
> >>>
> >>>
> >>> On Fri, Aug 12, 2022 at 5:13 PM Fotis Panagiotopoulos <f.j.panag@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Alan,
> >>>>
> >>>> I am trying hard to reproduce the issue reliably, but I haven't been able
> >>>> to do so yet.
> >>>>
> >>>> I noticed that when I disable CONFIG_NET_TCP_WRITE_BUFFERS, the problem
> >>>> does not disappear, rather it changes form.
> >>>> Now I occasionally get a failed assertion in wdog/wd_cancel.c line 95.
> >>>>
> >>>> I have to mention that everything else in my system is commented out.
> >>>> Currently the only thing running is the network thread that opens the TCP
> >>>> connection, nothing else.
> >>>> I have disabled all of my usage of the workers, all signals, etc.
> >>>> I have verified that when the fault occurs, this thread is not interrupted
> >>>> by anything (using Segger SystemView).
> >>>> So a scheduling issue looks unlikely.
> >>>>
> >>>> I also increased the stacks further, and I added padding to the very few
> >>>> malloc's that I use.
> >>>> ---
> >>>>
> >>>> At this moment I observe something very interesting.
> >>>> I am calling netlib_ifdown(), which causes the attached stack trace.
> >>>>
> >>>> So:
> >>>> 1. netdev_ifdown() calls devif_dev_event() with the argument pvconn set
> >>>> explicitly to NULL.
> >>>> 2. devif_dev_event() eventually calls tcp_close_eventhandler().
> >>>> 3. tcp_close_eventhandler() assumes that conn is NOT NULL, which causes
> >>>> the crash.
> >>>>
> >>>> This is wrong, but I do not fully understand it yet.
> >>>> Should there be a check for a NULL conn?
> >>>> Or maybe tcp_close_eventhandler() is wrong to be in the cb's list in the
> >>>> first place?
> >>>> Or should tcp_close_eventhandler() be tolerant of a NULL conn argument?
> >>>>
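
A minimal sketch of the last option (a handler that tolerates a NULL conn) is
shown below. It is a host-side model, not NuttX code: struct conn,
struct callback, close_handler(), dispatch_dev_event() and the EV_* flags are
hypothetical stand-ins, and falling back to the priv pointer is an assumption
about what such a handler could do, not a confirmed fix.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the stack's types and event flags. */

#define EV_NETDEV_DOWN 0x01u   /* device-wide event: interface went down */
#define EV_TCP_CLOSE   0x02u   /* per-connection event: close completed */

struct conn
{
  int closed;
};

typedef uint16_t (*event_cb_t)(void *pvconn, void *priv, uint16_t flags);

struct callback
{
  struct callback *next;
  event_cb_t       event;
  void            *priv;   /* connection registered when close() was called */
};

/* Close handler that tolerates a NULL connection argument. When the event
 * is device-wide, pvconn is NULL and the handler falls back to priv.
 */

uint16_t close_handler(void *pvconn, void *priv, uint16_t flags)
{
  struct conn *conn = pvconn != NULL ? pvconn : priv;

  if (conn == NULL)
    {
      return flags;   /* nothing to act on, but no NULL dereference either */
    }

  if ((flags & (EV_NETDEV_DOWN | EV_TCP_CLOSE)) != 0)
    {
      conn->closed = 1;
    }

  return flags;
}

/* Deliver a device-wide event to every registered callback. No single
 * connection is involved, so every handler receives NULL as pvconn.
 */

uint16_t dispatch_dev_event(struct callback *list, uint16_t flags)
{
  struct callback *cb;

  for (cb = list; cb != NULL; cb = cb->next)
    {
      if (cb->event != NULL)
        {
          flags = cb->event(NULL, cb->priv, flags);
        }
    }

  return flags;
}
```

With this shape, an interface-down event that arrives before the peer's reply
still finishes the close instead of dereferencing NULL.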
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Aug 11, 2022 at 12:05 PM Alan Carvalho de Assis
> >>>> <ac...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Fotis,
> >>>>>
> >>>>> Are you in sync with mainline?
> >>>>>
> >>>>> If you can create a host application to induce the issue, it will be
> >>>>> easier for us to test.
> >>>>>
> >>>>> BR,
> >>>>>
> >>>>> Alan
> >>>>>
> >>>>> On 8/9/22, Fotis Panagiotopoulos <f....@gmail.com> wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> I am still trying to make the network work reliably.
> >>>>>> After fixing another issue in my application, I hit another problem.
> >>>>>>
> >>>>>> The following sequence causes NuttX to crash:
> >>>>>>
> >>>>>> 1. My application creates a TCP socket and communicates with a server.
> >>>>>> 2. At one point the server stops responding (unrelated to NuttX / network
> >>>>>> issue).
> >>>>>> 3. The application detects the timeout, and calls close() on the socket.
> >>>>>> 4. A new socket is created, and it is connected to the server.
> >>>>>> 5. At this point, the server decides to send a FIN message for the previous
> >>>>>> connection.
> >>>>>> 6. I get a failed assertion in devif_callback.c at line 85.
> >>>>>>
> >>>>>> Note that I haven't managed to manually reproduce this issue.
> >>>>>> No matter what I do manually, everything seems to be working correctly.
> >>>>>> I just have to wait for it to happen.
> >>>>>> It seems that it is only triggered if a FIN arrives **after** a SYN.
> >>>>>>
> >>>>>> I am sure that this is only happening with CONFIG_NET_TCP_WRITE_BUFFERS
> >>>>>> enabled.
> >>>>>> I have no problems without buffering.
> >>>>>>
> >>>>>> The assertion seems right to fire.
> >>>>>> When a FIN is received for a closed connection, the same callback is freed
> >>>>>> both by tcp_lost_connection() and later on by tcp_close_eventhandler().
> >>>>>> All of this happens within the same execution of tcp_input().
> >>>>>>
> >>>>>> Any ideas?
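
One common way to make such a double release harmless is to free through the
owner's pointer and clear it. The sketch below models that on the host;
cb_alloc(), cb_free(), g_pool and g_ignored_frees are hypothetical names, not
the devif callback API, and this is not presented as the fix adopted in NuttX.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback pool, for illustration only. */

#define NCALLBACKS 4

struct callback
{
  int in_use;
};

struct callback g_pool[NCALLBACKS];
int g_ignored_frees;   /* counts releases that found nothing to free */

/* Take the first unused entry from the pool, or NULL if it is exhausted. */

struct callback *cb_alloc(void)
{
  int i;

  for (i = 0; i < NCALLBACKS; i++)
    {
      if (!g_pool[i].in_use)
        {
          g_pool[i].in_use = 1;
          return &g_pool[i];
        }
    }

  return NULL;
}

/* Free through the owner's pointer and clear it, so a second code path
 * that reaches the same owner sees NULL instead of a stale callback.
 */

void cb_free(struct callback **owner)
{
  struct callback *cb;

  assert(owner != NULL);
  cb = *owner;

  if (cb == NULL)
    {
      g_ignored_frees++;   /* second release: ignored, but counted */
      return;
    }

  cb->in_use = 0;
  *owner = NULL;
}
```

Counting the ignored releases keeps the second call visible during debugging
instead of silently hiding it.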
> >>>>>>
> >>>>>> On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet <sebastien@lorquet.fr>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Good find, but:
> >>>>>>>
> >>>>>>> - I don't think any usual application tinkers with PHY regs during its
> >>>>>>> lifetime, except the ethernet monitor.
> >>>>>>>
> >>>>>>> - The fix is certainly a lock somewhere, but global or fine-grained I don't
> >>>>>>> know.
> >>>>>>>
> >>>>>>> Not all calls need to be locked, e.g. the one that returns the PHY
> >>>>>>> address. Probably not needed by default, but a PHY access lock would
> >>>>>>> prevent any issue you describe.
> >>>>>>>
> >>>>>>> I will wait for people with more expertise about this.
> >>>>>>>
> >>>>>>> Just a note: don't forget that not all PHYs have an interrupt; the one on
> >>>>>>> the nucleo stm32h743zi[2] board does not have one.
> >>>>>>>
> >>>>>>> Sebastien
> >>>>>>>
> >>>>>>> On 26/07/2022 at 11:05, Fotis Panagiotopoulos wrote:
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I have eventually found 2 issues regarding networking in my
> >>>>>>>> application.
> >>>>>>>> I would like to discuss the first one.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> My code contains something like this:
> >>>>>>>>
> >>>>>>>> int sd = socket(AF_INET, SOCK_DGRAM, 0);
> >>>>>>>>
> >>>>>>>> struct ifreq ifr;
> >>>>>>>> memset(&ifr, 0, sizeof(struct ifreq));
> >>>>>>>> /* ifr is zeroed above, so copying at most IFNAMSIZ - 1 bytes keeps
> >>>>>>>>    the interface name NUL-terminated. */
> >>>>>>>> strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ - 1);
> >>>>>>>> ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
> >>>>>>>> ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
> >>>>>>>> ifr.ifr_mii_val_out = 0;
> >>>>>>>> if (ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr) == 0)
> >>>>>>>>   {
> >>>>>>>>     // Do stuff with ifr.ifr_mii_val_out.
> >>>>>>>>   }
> >>>>>>>>
> >>>>>>>> close(sd);
> >>>>>>>>
> >>>>>>>> I realized that this type of ioctl accesses the hardware directly,
> >>>>>>>> without any locking.
> >>>>>>>> That is, if any other task uses the PHY at the same time, the two
> >>>>>>>> accesses can interleave and corrupt the register data.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Two questions on this:
> >>>>>>>> 1. Is there any good reason for this?
> >>>>>>>> 2. What is the best way to fix it? Shall I add a driver-level lock, or
> >>>>>>>> should net_lock() be used in a higher layer?
> >>>>>>>>
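
A driver-level lock, as asked about in question 2, could look roughly like the
following host-side sketch. The register array stands in for the real MDIO
access, a pthread mutex stands in for whatever lock primitive the driver would
use, and phy_read_locked(), phy_write_locked() and phy_update_locked() are
hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define PHY_NREGS 32

/* Simulated PHY register file. A real driver would program the MAC's MII
 * address and data registers instead of touching an array.
 */

static uint16_t g_phy_regs[PHY_NREGS];
static pthread_mutex_t g_phy_lock = PTHREAD_MUTEX_INITIALIZER;

/* Read one register; the whole access happens under the lock. */

int phy_read_locked(uint8_t regnum, uint16_t *value)
{
  if (regnum >= PHY_NREGS || value == NULL)
    {
      return -1;
    }

  pthread_mutex_lock(&g_phy_lock);
  *value = g_phy_regs[regnum];
  pthread_mutex_unlock(&g_phy_lock);
  return 0;
}

/* Write one register under the lock. */

int phy_write_locked(uint8_t regnum, uint16_t value)
{
  if (regnum >= PHY_NREGS)
    {
      return -1;
    }

  pthread_mutex_lock(&g_phy_lock);
  g_phy_regs[regnum] = value;
  pthread_mutex_unlock(&g_phy_lock);
  return 0;
}

/* Read-modify-write while holding the lock once, so another task cannot
 * slip its own access in between the read and the write.
 */

int phy_update_locked(uint8_t regnum, uint16_t clear, uint16_t set)
{
  if (regnum >= PHY_NREGS)
    {
      return -1;
    }

  pthread_mutex_lock(&g_phy_lock);
  g_phy_regs[regnum] = (uint16_t)((g_phy_regs[regnum] & ~clear) | set);
  pthread_mutex_unlock(&g_phy_lock);
  return 0;
}
```

Keeping the lock inside the access helpers means callers such as the ioctl
path and a link monitor do not need to know about each other.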
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <f.j.panag@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >>>>>>>>>> have all been working reliably for months without stopping, we know it
> >>>>>>>>>> because they critically depend on network functionality and we have
> >>>>>>>>>> reports if a card becomes unreachable. None has so far outside of
> >>>>>>>>>> dedicated tests.
> >>>>>>>>>> So I believe that there is no obvious hard bug in these drivers.
> >>>>>>>>> Good to hear that!
> >>>>>>>>> Although, I may be using a feature or protocol that you are not.
> >>>>>>>>> Of course, I don't believe that NuttX is broken per se, but a minor bug
> >>>>>>>>> may lurk somewhere...
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> I have seen that when I enable the network debugging features, it seems
> >>>>>>>>>> to hit an assertion failure before getting to nsh prompt at startup.
> >>>>>>>>>> This was on a quite recent master. I haven't had a chance to diagnose
> >>>>>>>>>> this further.
> >>>>>>>>>> Have you tried enabling these and if so, do they work?
> >>>>>>>>> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it works.
> >>>>>>>>> I have some devices under test, waiting to reproduce the issue to see if
> >>>>>>>>> this option provides any useful information.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Also, out of curiosity, have you tried running ostest on your board?
> >>>>>>>>> I just tried.
> >>>>>>>>> It passed all the tests.
> >>>>>>>>>
> >>>>>>>>> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet <sebastien@lorquet.fr>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> We have deployed hundreds of boards with stm32f427 and ethernet. They
> >>>>>>>>>> have all been working reliably for months without stopping. We know it
> >>>>>>>>>> because they critically depend on network functionality and we have
> >>>>>>>>>> reports if a card becomes unreachable. None has so far, outside of
> >>>>>>>>>> dedicated tests.
> >>>>>>>>>>
> >>>>>>>>>> So I believe that there is no obvious hard bug in these drivers.
> >>>>>>>>>>
> >>>>>>>>>> Most certainly it is a build option in your particular config. Debug is
> >>>>>>>>>> a possible issue; thread problems are another possibility.
> >>>>>>>>>>
> >>>>>>>>>> Sebastien
>

Re: STM32F4 Ethernet Issues

Posted by Sebastien Lorquet <se...@lorquet.fr>.
Hi,

is there any follow up on this point?

Sebastien



Re: STM32F4 Ethernet Issues

Posted by Fotis Panagiotopoulos <f....@gmail.com>.
Ok, I just managed to reproduce the issue on a NUCLEO-F429ZI, using the
NuttX apps.

Please check my fork on
https://github.com/fjpanag/incubator-nuttx-apps/tree/tcp_issue
See the branch tcp_issue.

I have "hacked" the NSH code to reproduce the issue.
A TCP connection is opened, and then closed. Then the network interface is
brought down. At this point the system crashes immediately.

Note that I have locked the scheduler when the connection is closed.
This is to simulate an ifdown action BEFORE the FIN ACK is processed (as it
happens in my case).
My code does not have this locking, this is only for simulation purposes.

Please use the provided defconfig. It is stored in the root of my apps fork.
I guess it is not related to the configuration, but my "working" sample is
provided.



On Fri, Aug 12, 2022 at 7:15 PM Alan Carvalho de Assis <ac...@gmail.com>
wrote:

> Hi Fotis,
>
> Yes, I understood the point. Because it needs the right timing, it
> could be tricky to duplicate.
>
> Did you try to create a simple host server to try to emulate this
> connection issue?
>
> BR,
>
> Alan
>
> On 8/12/22, Fotis Panagiotopoulos <f....@gmail.com> wrote:
> > I think I understand the nature of the bug.
> >
> > When closing a socket, tcp_close_eventhandler() is set as a callback in
> > the dev->d_devcb list.
> >
> > Typically, the server's response (FIN ACK) will have as a result
> > tcp_callback() to be executed, and thus the callback to be properly
> > called, with proper arguments.
> > Then the cb is properly free'd.
> >
> > If however devif_dev_event() has the chance to execute before
> > tcp_callback() (e.g. server's response was lost), then the callbacks take
> > NULL as a conn argument.
> > This crashes the whole system horribly.
> >
> > As you see, this requires specific timings with the server communication,
> > that's why this is so hard to reproduce.
> >
> >
> > On Fri, Aug 12, 2022 at 5:13 PM Fotis Panagiotopoulos <
> f.j.panag@gmail.com>
> > wrote:
> >
> >> Hi Alan,
> >>
> >> I am trying hard to reproduce the issue reliably, but I haven't been
> >> able to do so yet.
> >>
> >> I noticed that when I disable CONFIG_NET_TCP_WRITE_BUFFERS, the problem
> >> does not disappear, rather it changes form.
> >> Now I occasionally get a failed assertion in wdog/wd_cancel.c line 95.
> >>
> >> I have to mention that everything in my system is commented out.
> >> Currently the only thing working is the network thread that opens the
> >> TCP connection, nothing else.
> >> I have disabled all of my usage of the workers, all signals etc.
> >> I verify that when the fault occurs, this thread is not interrupted by
> >> anything (using Segger SystemView).
> >> It looks like a scheduling issue is unlikely.
> >>
> >> I also increased the stacks more, and I added padding to the very few
> >> malloc's that I use.
> >>
> >> ---
> >>
> >> At this moment I observe something very interesting.
> >> I am calling netlib_ifdown(), which causes the attached stack trace.
> >>
> >> So:
> >> 1. netdev_ifdown() calls devif_dev_event() with the argument pvconn set
> >> explicitly to NULL.
> >> 2. devif_dev_event() eventually calls tcp_close_eventhandler()
> >> 3. tcp_close_eventhandler() assumes that conn is NOT NULL. Which causes
> >> the crash.
> >>
> >> This is wrong, but I don't have the understanding of it yet.
> >> Shall there be a check for a NULL conn?
> >> Or maybe tcp_close_eventhandler() is wrong to be in the cb's list in the
> >> first place?
> >> Or tcp_close_eventhandler() should be tolerant to a NULL conn argument?
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Aug 11, 2022 at 12:05 PM Alan Carvalho de Assis
> >> <ac...@gmail.com>
> >> wrote:
> >>
> >>> Hi Fotis,
> >>>
> >>> Are you in sync with mainline?
> >>>
> >>> If you can create a host application to induce the issue will be
> >>> easier for us to test.
> >>>
> >>> BR,
> >>>
> >>> Alan
> >>>
> >>> On 8/9/22, Fotis Panagiotopoulos <f....@gmail.com> wrote:
> >>> > Hello,
> >>> >
> >>> > still trying to make the network work reliably.
> >>> > After fixing another issue of my application, I hit another problem.
> >>> >
> >>> > The following sequence causes NuttX to crash:
> >>> >
> >>> > 1. My application is creating a TCP socket and communicates with a
> >>> >    server.
> >>> > 2. At one point the server stops responding (unrelated to NuttX /
> >>> >    network issue).
> >>> > 3. The application detects the timeout, and calls close() on the
> >>> >    socket.
> >>> > 4. A new socket is created, and it is connected to the server.
> >>> > 5. At this point, the server decides to send a FIN message for the
> >>> >    previous connection.
> >>> > 6. I get a failed assertion in devif_callback.c at line 85.
> >>> >
> >>> > Note that I haven't managed to manually reproduce this issue.
> >>> > No matter what I do manually, everything seems to be working
> >>> > correctly.
> >>> > I just have to wait for it to happen.
> >>> > It seems that it is only triggered if a FIN arrives **after** a SYN.
> >>> >
> >>> > I am sure that this is only happening with
> >>> > CONFIG_NET_TCP_WRITE_BUFFERS enabled.
> >>> > I have no problems without buffering.
> >>> >
> >>> > The assertion seems right to fire.
> >>> > When a FIN is received for a closed connection, the same callback is
> >>> > free'd both by tcp_lost_connection() and later on by
> >>> > tcp_close_eventhandler().
> >>> > All these are happening within the same execution of tcp_input().
> >>> >
> >>> > Any ideas?
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet
> >>> > <sebastien@lorquet.fr
> >>> >
> >>> > wrote:
> >>> >
> >>> >> Hi,
> >>> >>
> >>> >> good find but
> >>> >>
> >>> >> -I don't think any usual application tinkers with PHY regs during its
> >>> >> lifetime except the ethernet monitor
> >>> >>
> >>> >> -the fix is certainly a lock somewhere but global or fine grained I
> >>> >> don't know.
> >>> >>
> >>> >> Not all calls need to be locked, e.g. the one that returns the PHY
> >>> >> address. Probably not needed by default, but a PHY access lock would
> >>> >> prevent any issue you describe.
> >>> >>
> >>> >> I will wait for people with more expertise about this.
> >>> >>
> >>> >> Just a note, don't forget that not all PHYs have an interrupt; the one
> >>> >> on the nucleo stm32h743zi[2] board does not have one.
> >>> >>
> >>> >> Sebastien
> >>> >>
> >>> >> Le 26/07/2022 à 11:05, Fotis Panagiotopoulos a écrit :
> >>> >> > Hello,
> >>> >> >
> >>> >> > I have eventually found 2 issues regarding networking in my
> >>> >> > application.
> >>> >> > I would like to discuss the first one.
> >>> >> >
> >>> >> >
> >>> >> > My code contains something like this:
> >>> >> >
> >>> >> > int sd = socket(AF_INET, SOCK_DGRAM, 0);
> >>> >> >
> >>> >> > struct ifreq ifr;
> >>> >> > memset(&ifr, 0, sizeof(struct ifreq));
> >>> >> > strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
> >>> >> > ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
> >>> >> > ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
> >>> >> > ifr.ifr_mii_val_out = 0;
> >>> >> > ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
> >>> >> >
> >>> >> > // Do stuff with ifr.ifr_mii_val_out.
> >>> >> >
> >>> >> > close(sd);
> >>> >> >
> >>> >> > I realized that this type of ioctl will directly access the
> >>> >> > hardware,
> >>> >> > without any locking.
> >>> >> > That is, if any other task needs to use the PHY in any other way,
> >>> >> > it
> >>> >> > will
> >>> >> > eventually corrupt its register data.
> >>> >> >
> >>> >> >
> >>> >> > Two questions on this:
> >>> >> > 1. Is there any good reason for this?
> >>> >> > 2. What is the best way to fix it? Shall I add a driver level
> lock,
> >>> or
> >>> >> > should net_lock() be used in any higher layer?
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <
> >>> >> f.j.panag@gmail.com>
> >>> >> > wrote:
> >>> >> >
> >>> >> >> Hello,
> >>> >> >>
> >>> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
> >>> they
> >>> >> >>> have all been working reliably for months without stopping, we
> >>> >> >>> know
> >>> >> >>> it
> >>> >> >>> because they critically depend on network functionality and we
> >>> >> >>> have
> >>> >> >>> reports if a card becomes unreachable. None has so far outside
> of
> >>> >> >>> dedicated tests.
> >>> >> >>> So I believe that there is no obvious hard bug in these drivers.
> >>> >> >> Good to hear that!
> >>> >> >> Although, I may be using a feature or protocol that you are not.
> >>> >> >> Of course, I don't believe that NuttX is broken per se, but a
> >>> >> >> minor
> >>> >> >> bug
> >>> >> >> may lurk somewhere...
> >>> >> >>
> >>> >> >>
> >>> >> >>> I have seen that when I enable the network debugging features,
> it
> >>> >> >>> seems
> >>> >> >> to
> >>> >> >>> hit an assertion failure before getting to nsh prompt at
> startup.
> >>> >> >>> This
> >>> >> >> was
> >>> >> >>> on a quite recent master. I haven't had a chance to diagnose
> this
> >>> >> >> further.
> >>> >> >>> Have you tried enabling these and if so, do they work?
> >>> >> >> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and
> >>> >> >> it
> >>> >> works.
> >>> >> >> I have some devices under test, waiting to reproduce the issue to
> >>> see
> >>> >> >> if
> >>> >> >> this option provides any useful information.
> >>> >> >>
> >>> >> >>
> >>> >> >>> Also, out of curiosity, have you tried running ostest on your
> >>> board?
> >>> >> >> I just tried.
> >>> >> >> It passed all the tests.
> >>> >> >>
> >>> >> >> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet
> >>> >> >> <sebastien@lorquet.fr
> >>> >> >
> >>> >> >> wrote:
> >>> >> >>
> >>> >> >>> Hi,
> >>> >> >>>
> >>> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
> >>> they
> >>> >> >>> have all been working reliably for months without stopping, we
> >>> >> >>> know
> >>> >> >>> it
> >>> >> >>> because they critically depend on network functionality and we
> >>> >> >>> have
> >>> >> >>> reports if a card becomes unreachable. None has so far outside
> of
> >>> >> >>> dedicated tests.
> >>> >> >>>
> >>> >> >>> So I believe that there is no obvious hard bug in these drivers.
> >>> >> >>>
> >>> >> >>> Most certainly a build option on your particular config. debug
> is
> >>> >> >>> a
> >>> >> >>> possible issue, thread problems is another possibility.
> >>> >> >>>
> >>> >> >>> Sebastien
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
> >>> >> >>>> Hello!
> >>> >> >>>>
> >>> >> >>>> I am using Ethernet on an STM32F427 target, but I am facing
> some
> >>> >> issues.
> >>> >> >>>>
> >>> >> >>>> Initially the device works correctly. After some hours of
> >>> continuous
> >>> >> >>>> operation I completely lose all network communications.
> >>> >> >>>> Trying to troubleshoot the issue, I enabled assertions and
> >>> >> >>>> various
> >>> >> other
> >>> >> >>>> debug features.
> >>> >> >>>>
> >>> >> >>>> Again the device works correctly for some hours, and then I get
> >>> >> >>>> a
> >>> >> failed
> >>> >> >>>> assertion at stm32_eth.c, line 1372:
> >>> >> >>>>
> >>> >> >>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
> >>> >> >>>>
> >>> >> >>>> No other errors are reported (e.g. stack overflows etc).
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> I have observed that this issue usually manifests itself when
> >>> there
> >>> >> >>>> is
> >>> >> >>>> insufficient stack on a task.
> >>> >> >>>> But in my case, all tasks have oversized stacks. Typically they
> >>> >> >>>> do
> >>> >> >>>> not
> >>> >> >>>> exceed 50% utilization.
> >>> >> >>>> I have plenty of room available in the heap too (> 100kB).
> >>> >> >>>>
> >>> >> >>>> Regarding the rest of the firmware, I cannot see any other
> >>> >> misbehaviour
> >>> >> >>> or
> >>> >> >>>> problem.
> >>> >> >>>> I haven't ever seen any other unexplained problem, assertion
> >>> >> >>>> fail,
> >>> >> >>>> hard-fault etc.
> >>> >> >>>> The application code passes all of our tests.
> >>> >> >>>> In fact, even when this issue happens, although I lose network
> >>> >> >>>> connectivity, the rest of the system works perfectly.
> >>> >> >>>>
> >>> >> >>>> Please note that I have checked the contents of dev->d_len and
> >>> >> >>> dev->d_buf,
> >>> >> >>>> and they seem to contain valid data.
> >>> >> >>>> The address lies within the normal address space of the MCU,
> and
> >>> the
> >>> >> >>> size
> >>> >> >>>> is sane.
> >>> >> >>>> So it doesn't look like any kind of memory corruption.
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> At this point I believe that this is an actual bug either on
> the
> >>> >> >>>> STM32
> >>> >> >>> MAC
> >>> >> >>>> driver, or at the TCP/IP stack itself.
> >>> >> >>>> I had a look at the driver code, but I didn't see anything
> >>> >> >>>> suspicious.
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> Has anyone observed the same issue before?
> >>> >> >>>> Can it be affected in any way with my configuration?
> >>> >> >>>> Or maybe, do you have any recommendations on what to test next?
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> Thank you!
> >>> >> >>>>
> >>> >>
> >>> >
> >>>
> >>
> >
>

Re: STM32F4 Ethernet Issues

Posted by Alan Carvalho de Assis <ac...@gmail.com>.
Hi Fotis,

Yes, I understood the point. Because it needs the right timing, it
could be tricky to duplicate.

Did you try to create a simple host server to try to emulate this
connection issue?
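
A host-side helper along these lines might serve as a starting point. It is
only a sketch using plain POSIX sockets: the function names make_listener()
and accept_and_close_late() are hypothetical, and the delay is an arbitrary
parameter. Note that the host kernel will still ACK segments on its own; the
helper only controls when the server's FIN is sent (at close() time), so it
can emit a "late" FIN for an old connection but cannot drop packets.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP listener on the given port (0 lets the OS pick one).
 * Returns the listening descriptor, or -1 on error.
 */
int make_listener(unsigned short port)
{
  struct sockaddr_in addr;
  int one = 1;
  int fd = socket(AF_INET, SOCK_STREAM, 0);

  if (fd < 0)
    {
      return -1;
    }

  /* Allow quick restarts of the test server on the same port. */
  setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

  memset(&addr, 0, sizeof(addr));
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port        = htons(port);

  if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
      listen(fd, 4) < 0)
    {
      close(fd);
      return -1;
    }

  return fd;
}

/* Accept one client, stay silent for delay_s seconds, then close the
 * connection so that the server's FIN goes out late.
 * Returns 0 on success, -1 on error.
 */
int accept_and_close_late(int listener, unsigned int delay_s)
{
  int client = accept(listener, NULL, NULL);

  if (client < 0)
    {
      return -1;
    }

  sleep(delay_s);
  return close(client);
}
```

Calling accept_and_close_late() from separate threads with different delays
would let an old connection be closed after a new one has been accepted.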

BR,

Alan


Re: STM32F4 Ethernet Issues

Posted by Fotis Panagiotopoulos <f....@gmail.com>.
I think I understand the nature of the bug.

When closing a socket, tcp_close_eventhandler() is set as a callback in the
dev->d_devcb list.

Typically, the server's response (FIN ACK) causes tcp_callback() to be
executed, so the callback is called with the proper arguments.
Then the cb is properly freed.

If, however, devif_dev_event() has the chance to execute before
tcp_callback() (e.g. because the server's response was lost), then the
callbacks receive NULL as the conn argument.
This crashes the whole system horribly.

As you can see, this requires specific timing in the communication with the
server, which is why it is so hard to reproduce.
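
The failure mode described above can be illustrated with a small host-side
model. This is not NuttX code: the types, flag values and function names
(model_cb, model_dev_event(), model_close_handler(), EV_CLOSE, EV_NETDOWN)
are all hypothetical stand-ins, and the NULL check shown is only one of the
possible options, not necessarily the right fix for the stack.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event flags; the real stack defines its own values. */
#define EV_CLOSE   ((uint16_t)0x0001)
#define EV_NETDOWN ((uint16_t)0x0002)

/* Stand-in for a TCP connection structure. */
struct model_conn
{
  int open;
};

typedef uint16_t (*model_event_t)(void *pvconn, void *pvpriv,
                                  uint16_t flags);

/* Stand-in for one entry of a device callback list. */
struct model_cb
{
  struct model_cb *next;
  model_event_t    event;
  void            *priv;
  uint16_t         mask;
};

/* Close handler that tolerates a NULL connection argument: it only
 * touches the connection when one was actually supplied.
 */
uint16_t model_close_handler(void *pvconn, void *pvpriv, uint16_t flags)
{
  struct model_conn *conn = pvconn;
  int *finished = pvpriv;

  if (conn != NULL && (flags & EV_CLOSE) != 0)
    {
      conn->open = 0;
    }

  if ((flags & (EV_CLOSE | EV_NETDOWN)) != 0)
    {
      *finished = 1;   /* Let the waiting closer proceed. */
    }

  return flags;
}

/* Device-level dispatch: like the ifdown path described above, it has no
 * connection to offer, so every callback receives NULL as pvconn.
 */
uint16_t model_dev_event(struct model_cb *list, uint16_t flags)
{
  struct model_cb *cb;
  struct model_cb *next;

  for (cb = list; cb != NULL; cb = next)
    {
      next = cb->next;   /* Read first: the callback may unlink cb. */
      if (cb->event != NULL && (cb->mask & flags) != 0)
        {
          flags = cb->event(NULL, cb->priv, flags);
        }
    }

  return flags;
}
```

A handler that dereferenced pvconn unconditionally would fault inside
model_dev_event(); with the guard, the device-level event simply marks the
close as finished and leaves the connection untouched.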


On Fri, Aug 12, 2022 at 5:13 PM Fotis Panagiotopoulos <f....@gmail.com>
wrote:

> Hi Alan,
>
> I am trying hard to reproduce the issue reliably, but I haven't been able
> to do so yet.
>
> I noticed that when I disable CONFIG_NET_TCP_WRITE_BUFFERS, the problem
> does not disappear, rather it changes form.
> Now I occasionally get a failed assertion in wdog/wd_cancel.c line 95.
>
> I have to mention that everything in my system is commented out.
> Currently the only thing working is the network thread that opens the TCP
> connection, nothing else.
> I have disabled all of my usage of the workers, all signals etc.
> I verify that when the fault occurs, this thread is not interrupted by
> anything (using Segger SystemView).
> It looks like a scheduling issue is unlikely.
>
> I also increased the stacks more, and I added padding to the very few
> malloc's that I use.
>
> ---
>
> At this moment I observe something very interesting.
> I am calling netlib_ifdown(), which causes the attached stack trace.
>
> So:
> 1. netdev_ifdown() calls devif_dev_event() with the argument pvconn set
> explicitly to NULL.
> 2. devif_dev_event() eventually calls tcp_close_eventhandler()
> 3. tcp_close_eventhandler() assumes that conn is NOT NULL. Which causes
> the crash.
>
> This is wrong, but I don't have the understanding of it yet.
> Shall there be a check for a NULL conn?
> Or maybe tcp_close_eventhandler() is wrong to be in the cb's list in the
> first place?
> Or tcp_close_eventhandler() should be tolerant to a NULL conn argument?
>
>
>
>
>
> On Thu, Aug 11, 2022 at 12:05 PM Alan Carvalho de Assis <ac...@gmail.com>
> wrote:
>
>> Hi Fotis,
>>
>> Are you in sync with mainline?
>>
>> If you can create a host application to induce the issue will be
>> easier for us to test.
>>
>> BR,
>>
>> Alan
>>
>> On 8/9/22, Fotis Panagiotopoulos <f....@gmail.com> wrote:
>> > Hello,
>> >
>> > still trying to make the network work reliably.
>> > After fixing another issue of my application, I hit another problem.
>> >
>> > The following sequence causes NuttX to crash:
>> >
>> > 1. My application is creating a TCP socket and communicates with a
>> server.
>> > 2. At one point the server stops responding (unrelated to NuttX /
>> network
>> > issue).
>> > 3. The application detects the timeout, and calls close() on the socket.
>> > 4. A new socket is created, and it is connected to the server.
>> > 5. At this point, the server decides to send a FIN message for the
>> previous
>> > connection.
>> > 6. I get a failed assertion in devif_callback.c at line 85.
>> >
>> > Note that I haven't managed to manually reproduce this issue.
>> > No matter what I do manually, everything seems to be working correctly.
>> > I just have to wait for it to happen.
>> > It seems that it is only triggered if a FIN arrives **after** a SYN.
>> >
>> > I am sure that this is only happening with CONFIG_NET_TCP_WRITE_BUFFERS
>> > enabled.
>> > I have no problems without buffering.
>> >
>> > The assertion seems right to fire.
>> > When a FIN is received for a closed connection, the same callback is
>> free'd
>> > both by tcp_lost_connection() and later on by tcp_close_eventhandler().
>> > All these are happening within the same execution of tcp_input().
>> >
>> > Any ideas?
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet <sebastien@lorquet.fr
>> >
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> good find but
>> >>
>> >> -I don't think any usual application tinkers with PHY regs during its
>> >> lifetime except the ethernet monitor
>> >>
>> >> -the fix is certainly a lock somewhere but global or fine grained I
>> >> don't know.
>> >>
>> >> Not all calls need to be locked, e.g. the one that returns the PHY
>> >> address. Probably not needed by default, but a PHY access lock would
>> >> prevent any issue you describe.
>> >>
>> >> I will wait for people with more expertise about this.
>> >>
>> >> Just a note, don't forget that not all PHYs have an interrupt; the one on
>> >> the nucleo stm32h743zi[2] board does not have one.
>> >>
>> >> Sebastien
>> >>
>> >> Le 26/07/2022 à 11:05, Fotis Panagiotopoulos a écrit :
>> >> > Hello,
>> >> >
>> >> > I have eventually found 2 issues regarding networking in my
>> >> > application.
>> >> > I would like to discuss the first one.
>> >> >
>> >> >
>> >> > My code contains something like this:
>> >> >
>> >> > int sd = socket(AF_INET, SOCK_DGRAM, 0);
>> >> >
>> >> > struct ifreq ifr;
>> >> > memset(&ifr, 0, sizeof(struct ifreq));
>> >> > strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
>> >> > ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
>> >> > ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
>> >> > ifr.ifr_mii_val_out = 0;
>> >> > ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
>> >> >
>> >> > // Do stuff with ifr.ifr_mii_val_out.
>> >> >
>> >> > close(sd);
>> >> >
>> >> > I realized that this type of ioctl will directly access the hardware,
>> >> > without any locking.
>> >> > That is, if any other task needs to use the PHY in any other way, it
>> >> > will
>> >> > eventually corrupt its register data.
>> >> >
>> >> >
>> >> > Two questions on this:
>> >> > 1. Is there any good reason for this?
>> >> > 2. What is the best way to fix it? Shall I add a driver level lock,
>> or
>> >> > should net_lock() be used in any higher layer?
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <
>> >> f.j.panag@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Hello,
>> >> >>
>> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
>> they
>> >> >>> have all been working reliably for months without stopping, we know
>> >> >>> it
>> >> >>> because they critically depend on network functionality and we have
>> >> >>> reports if a card becomes unreachable. None has so far outside of
>> >> >>> dedicated tests.
>> >> >>> So I believe that there is no obvious hard bug in these drivers.
>> >> >> Good to hear that!
>> >> >> Although, I may be using a feature or protocol that you are not.
>> >> >> Of course, I don't believe that NuttX is broken per se, but a minor
>> >> >> bug
>> >> >> may lurk somewhere...
>> >> >>
>> >> >>
>> >> >>> I have seen that when I enable the network debugging features, it
>> >> >>> seems
>> >> >> to
>> >> >>> hit an assertion failure before getting to nsh prompt at startup.
>> >> >>> This
>> >> >> was
>> >> >>> on a quite recent master. I haven't had a chance to diagnose this
>> >> >> further.
>> >> >>> Have you tried enabling these and if so, do they work?
>> >> >> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it
>> >> works.
>> >> >> I have some devices under test, waiting to reproduce the issue to
>> see
>> >> >> if
>> >> >> this option provides any useful information.
>> >> >>
>> >> >>
>> >> >>> Also, out of curiosity, have you tried running ostest on your
>> board?
>> >> >> I just tried.
>> >> >> It passed all the tests.
>> >> >>
>> >> >> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet
>> >> >> <sebastien@lorquet.fr
>> >> >
>> >> >> wrote:
>> >> >>
>> >> >>> Hi,
>> >> >>>
>> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
>> they
>> >> >>> have all been working reliably for months without stopping, we know
>> >> >>> it
>> >> >>> because they critically depend on network functionality and we have
>> >> >>> reports if a card becomes unreachable. None has so far outside of
>> >> >>> dedicated tests.
>> >> >>>
>> >> >>> So I believe that there is no obvious hard bug in these drivers.
>> >> >>>
>> >> >>> Most certainly a build option on your particular config. debug is a
>> >> >>> possible issue, thread problems is another possibility.
>> >> >>>
>> >> >>> Sebastien
>> >> >>>
>> >> >>>
>> >> >>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
>> >> >>>> Hello!
>> >> >>>>
>> >> >>>> I am using Ethernet on an STM32F427 target, but I am facing some
>> >> issues.
>> >> >>>>
>> >> >>>> Initially the device works correctly. After some hours of
>> continuous
>> >> >>>> operation I completely lose all network communications.
>> >> >>>> Trying to troubleshoot the issue, I enabled assertions and various
>> >> other
>> >> >>>> debug features.
>> >> >>>>
>> >> >>>> Again the device works correctly for some hours, and then I get a
>> >> failed
>> >> >>>> assertion at stm32_eth.c, line 1372:
>> >> >>>>
>> >> >>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
>> >> >>>>
>> >> >>>> No other errors are reported (e.g. stack overflows etc).
>> >> >>>>
>> >> >>>>
>> >> >>>> I have observed that this issue usually manifests itself when
>> there
>> >> >>>> is
>> >> >>>> insufficient stack on a task.
>> >> >>>> But in my case, all tasks have oversized stacks. Typically they do
>> >> >>>> not
>> >> >>>> exceed 50% utilization.
>> >> >>>> I have plenty of room available in the heap too (> 100kB).
>> >> >>>>
>> >> >>>> Regarding the rest of the firmware, I cannot see any other
>> >> misbehaviour
>> >> >>> or
>> >> >>>> problem.
>> >> >>>> I haven't ever seen any other unexplained problem, assertion fail,
>> >> >>>> hard-fault etc.
>> >> >>>> The application code passes all of our tests.
>> >> >>>> In fact, even when this issue happens, although I lose network
>> >> >>>> connectivity, the rest of the system works perfectly.
>> >> >>>>
>> >> >>>> Please note that I have checked the contents of dev->d_len and
>> >> >>> dev->d_buf,
>> >> >>>> and they seem to contain valid data.
>> >> >>>> The address lies within the normal address space of the MCU, and
>> the
>> >> >>> size
>> >> >>>> is sane.
>> >> >>>> So it doesn't look like any kind of memory corruption.
>> >> >>>>
>> >> >>>>
>> >> >>>> At this point I believe that this is an actual bug either on the
>> >> >>>> STM32
>> >> >>> MAC
>> >> >>>> driver, or at the TCP/IP stack itself.
>> >> >>>> I had a look at the driver code, but I didn't see anything
>> >> >>>> suspicious.
>> >> >>>>
>> >> >>>>
>> >> >>>> Has anyone observed the same issue before?
>> >> >>>> Can it be affected in any way with my configuration?
>> >> >>>> Or maybe, do you have any recommendations on what to test next?
>> >> >>>>
>> >> >>>>
>> >> >>>> Thank you!
>> >> >>>>
>> >>
>> >
>>
>

Re: STM32F4 Ethernet Issues

Posted by Alan Carvalho de Assis <ac...@gmail.com>.
Hi Fotis,

Nice, it seems you are narrowing down the number of variables, which
will make it easier for us to replicate the issue once you find it.

You can also use stack coloration and canaries to help with memory issues.

BR,

Alan

On 8/12/22, Fotis Panagiotopoulos <f....@gmail.com> wrote:
> Hi Alan,
>
> I am trying hard to reproduce the issue reliably, but I haven't been able
> to do so yet.
>
> I noticed that when I disable CONFIG_NET_TCP_WRITE_BUFFERS, the problem
> does not disappear, rather it changes form.
> Now I occasionally get a failed assertion in wdog/wd_cancel.c line 95.
>
> I have to mention that everything in my system is commented out.
> Currently the only thing working is the network thread that opens the TCP
> connection, nothing else.
> I have disabled all of my usage of the workers, all signals etc.
> I verify that when the fault occurs, this thread is not interrupted by
> anything (using Segger SystemView).
> It looks like a scheduling issue is unlikely.
>
> I also increased the stacks more, and I added padding to the very few
> malloc's that I use.
>
> ---
>
> At this moment I observe something very interesting.
> I am calling netlib_ifdown(), which causes the attached stack trace.
>
> So:
> 1. netdev_ifdown() calls devif_dev_event() with the argument pvconn set
> explicitly to NULL.
> 2. devif_dev_event() eventually calls tcp_close_eventhandler()
> 3. tcp_close_eventhandler() assumes that conn is NOT NULL. Which causes the
> crash.
>
> This is wrong, but I don't have the understanding of it yet.
> Shall there be a check for a NULL conn?
> Or maybe tcp_close_eventhandler() is wrong to be in the cb's list in the
> first place?
> Or tcp_close_eventhandler() should be tolerant to a NULL conn argument?
>
>
>
>
>
> On Thu, Aug 11, 2022 at 12:05 PM Alan Carvalho de Assis <ac...@gmail.com>
> wrote:
>
>> Hi Fotis,
>>
>> Are you in sync with mainline?
>>
>> If you can create a host application to induce the issue will be
>> easier for us to test.
>>
>> BR,
>>
>> Alan
>>
>> On 8/9/22, Fotis Panagiotopoulos <f....@gmail.com> wrote:
>> > Hello,
>> >
>> > still trying to make the network work reliably.
>> > After fixing another issue of my application, I hit another problem.
>> >
>> > The following sequence causes NuttX to crash:
>> >
>> > 1. My application is creating a TCP socket and communicates with a
>> server.
>> > 2. At one point the server stops responding (unrelated to NuttX /
>> > network
>> > issue).
>> > 3. The application detects the timeout, and calls close() on the
>> > socket.
>> > 4. A new socket is created, and it is connected to the server.
>> > 5. At this point, the server decides to send a FIN message for the
>> previous
>> > connection.
>> > 6. I get a failed assertion in devif_callback.c at line 85.
>> >
>> > Note that I haven't managed to manually reproduce this issue.
>> > No matter what I do manually, everything seems to be working correctly.
>> > I just have to wait for it to happen.
>> > It seems that it is only triggered if a FIN arrives **after** a SYN.
>> >
>> > I am sure that this is only happening with CONFIG_NET_TCP_WRITE_BUFFERS
>> > enabled.
>> > I have no problems without buffering.
>> >
>> > The assertion seems right to fire.
>> > When a FIN is received for a closed connection, the same callback is
>> free'd
>> > both by tcp_lost_connection() and later on by tcp_close_eventhandler().
>> > All these are happening within the same execution of tcp_input().
>> >
>> > Any ideas?
>> >
>> > On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet
>> > <se...@lorquet.fr>
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> good find but
>> >>
>> >> -I dont think any usual application tinkers with PHY regs during its
>> >> lifetime except the ethernet monitor
>> >>
>> >> -the fix is certainly a lock somewhere but global or fine grained I
>> >> dont
>> >> know.
>> >>
>> >> Not all calls need to be locked, eg the one that returns the PHY
>> >> address. Probaby not needed by default, but a PHY access lock would
>> >> prevent any issue you describe.
>> >>
>> >> I will wait for people with more expertise about this.
>> >>
>> >> Just a note, dont forget that not all PHY have an interrupt, the one
>> >> on
>> >> the nucleo stm32h743zi[2] board does not have one.
>> >>
>> >> Sebastien
>> >>
>> >> Le 26/07/2022 à 11:05, Fotis Panagiotopoulos a écrit :
>> >> > Hello,
>> >> >
>> >> > I have eventually found 2 issues regarding networking in my
>> >> > application.
>> >> > I would like to discuss the first one.
>> >> >
>> >> >
>> >> > My code contains something like this:
>> >> >
>> >> > int sd = socket(AF_INET, SOCK_DGRAM, 0);
>> >> >
>> >> > struct ifreq ifr;
>> >> > memset(&ifr, 0, sizeof(struct ifreq));
>> >> > strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
>> >> > ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
>> >> > ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
>> >> > ifr.ifr_mii_val_out = 0;
>> >> > ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
>> >> >
>> >> > // Do stuff with ifr.ifr_mii_val_out.
>> >> >
>> >> > close(sd);
>> >> >
>> >> > I realized that this type of ioctl will directly access the
>> >> > hardware,
>> >> > without any locking.
>> >> > That is, if any other task needs to use the PHY in any other way, it
>> >> > will
>> >> > eventually corrupt its register data.
>> >> >
>> >> >
>> >> > Two questions on this:
>> >> > 1. Is there any good reason for this?
>> >> > 2. What is the best way to fix it? Shall I add a driver level lock,
>> >> > or
>> >> > should net_lock() be used in any higher layer?
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <
>> >> f.j.panag@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Hello,
>> >> >>
>> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
>> they
>> >> >>> have all been working reliably for months without stopping, we
>> >> >>> know
>> >> >>> it
>> >> >>> because they critically depend on network functionality and we
>> >> >>> have
>> >> >>> reports if a card becomes unreachable. None has so far outside of
>> >> >>> dedicated tests.
>> >> >>> So I believe that there is no obvious hard bug in these drivers.
>> >> >> Good to hear that!
>> >> >> Although, I may be using a feature or protocol that you are not.
>> >> >> Of course, I don't believe that NuttX is broken per se, but a minor
>> >> >> bug
>> >> >> may lurk somewhere...
>> >> >>
>> >> >>
>> >> >>> I have seen that when I enable the network debugging features, it
>> >> >>> seems
>> >> >> to
>> >> >>> hit an assertion failure before getting to nsh prompt at startup.
>> >> >>> This
>> >> >> was
>> >> >>> on a quite recent master. I haven't had a chance to diagnose this
>> >> >> further.
>> >> >>> Have you tried enabling these and if so, do they work?
>> >> >> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it
>> >> works.
>> >> >> I have some devices under test, waiting to reproduce the issue to
>> >> >> see
>> >> >> if
>> >> >> this option provides any useful information.
>> >> >>
>> >> >>
>> >> >>> Also, out of curiosity, have you tried running ostest on your
>> >> >>> board?
>> >> >> I just tried.
>> >> >> It passed all the tests.
>> >> >>
>> >> >> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet
>> >> >> <sebastien@lorquet.fr
>> >> >
>> >> >> wrote:
>> >> >>
>> >> >>> Hi,
>> >> >>>
>> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
>> they
>> >> >>> have all been working reliably for months without stopping, we
>> >> >>> know
>> >> >>> it
>> >> >>> because they critically depend on network functionality and we
>> >> >>> have
>> >> >>> reports if a card becomes unreachable. None has so far outside of
>> >> >>> dedicated tests.
>> >> >>>
>> >> >>> So I believe that there is no obvious hard bug in these drivers.
>> >> >>>
>> >> >>> Most certainly a build option on your particular config. debug is
>> >> >>> a
>> >> >>> possible issue, thread problems is another possibility.
>> >> >>>
>> >> >>> Sebastien
>> >> >>>
>> >> >>>
>> >> >>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
>> >> >>>> Hello!
>> >> >>>>
>> >> >>>> I am using Ethernet on an STM32F427 target, but I am facing some
>> >> issues.
>> >> >>>>
>> >> >>>> Initially the device works correctly. After some hours of
>> continuous
>> >> >>>> operation I completely lose all network communications.
>> >> >>>> Trying to troubleshoot the issue, I enabled assertions and
>> >> >>>> various
>> >> other
>> >> >>>> debug features.
>> >> >>>>
>> >> >>>> Again the device works correctly for some hours, and then I get a
>> >> failed
>> >> >>>> assertion at stm32_eth.c, line 1372:
>> >> >>>>
>> >> >>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
>> >> >>>>
>> >> >>>> No other errors are reported (e.g. stack overflows etc).
>> >> >>>>
>> >> >>>>
>> >> >>>> I have observed that this issue usually manifests itself when
>> >> >>>> there
>> >> >>>> is
>> >> >>>> insufficient stack on a task.
>> >> >>>> But in my case, all tasks have oversized stacks. Typically they
>> >> >>>> do
>> >> >>>> not
>> >> >>>> exceed 50% utilization.
>> >> >>>> I have plenty of room available in the heap too (> 100kB).
>> >> >>>>
>> >> >>>> Regarding the rest of the firmware, I cannot see any other
>> >> misbehaviour
>> >> >>> or
>> >> >>>> problem.
>> >> >>>> I haven't ever seen any other unexplained problem, assertion
>> >> >>>> fail,
>> >> >>>> hard-fault etc.
>> >> >>>> The application code passes all of our tests.
>> >> >>>> In fact, even when this issue happens, although I lose network
>> >> >>>> connectivity, the rest of the system works perfectly.
>> >> >>>>
>> >> >>>> Please note that I have checked the contents of dev->d_len and
>> >> >>> dev->d_buf,
>> >> >>>> and they seem to contain valid data.
>> >> >>>> The address lies within the normal address space of the MCU, and
>> the
>> >> >>> size
>> >> >>>> is sane.
>> >> >>>> So it doesn't look like any kind of memory corruption.
>> >> >>>>
>> >> >>>>
>> >> >>>> At this point I believe that this is an actual bug either on the
>> >> >>>> STM32
>> >> >>> MAC
>> >> >>>> driver, or at the TCP/IP stack itself.
>> >> >>>> I had a look at the driver code, but I didn't see anything
>> >> >>>> suspicious.
>> >> >>>>
>> >> >>>>
>> >> >>>> Has anyone observed the same issue before?
>> >> >>>> Can it be affected in any way with my configuration?
>> >> >>>> Or maybe, do you have any recommendations on what to test next?
>> >> >>>>
>> >> >>>>
>> >> >>>> Thank you!
>> >> >>>>
>> >>
>> >
>>
>

Re: STM32F4 Ethernet Issues

Posted by Fotis Panagiotopoulos <f....@gmail.com>.
Hi Alan,

I am trying hard to reproduce the issue reliably, but I haven't been able
to do so yet.

I noticed that when I disable CONFIG_NET_TCP_WRITE_BUFFERS, the problem
does not disappear, rather it changes form.
Now I occasionally get a failed assertion in wdog/wd_cancel.c line 95.

I should mention that everything else in my system is commented out.
Currently the only thing running is the network thread that opens the TCP
connection, nothing else.
I have disabled all of my usage of the workers, all signals etc.
I have verified (using Segger SystemView) that when the fault occurs, this
thread is not interrupted by anything.
So a scheduling issue looks unlikely.

I also increased the stacks more, and I added padding to the very few
malloc's that I use.

---

At this moment I observe something very interesting.
I am calling netlib_ifdown(), which causes the attached stack trace.

So:
1. netdev_ifdown() calls devif_dev_event() with the argument pvconn set
explicitly to NULL.
2. devif_dev_event() eventually calls tcp_close_eventhandler()
3. tcp_close_eventhandler() assumes that conn is not NULL, which causes the
crash.

This is wrong, but I don't fully understand it yet.
Should there be a check for a NULL conn before the callback is invoked?
Or maybe tcp_close_eventhandler() should not be in the callback list in the
first place?
Or should tcp_close_eventhandler() tolerate a NULL conn argument?
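
To make the last option concrete, here is a minimal sketch of a handler that
ignores a NULL conn instead of dereferencing it. It rests on assumptions and
is not the NuttX code: demo_conn, DEMO_NETDEV_DOWN, demo_close_eventhandler()
and demo_usage() are hypothetical names made up for illustration, and the real
callback signature differs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event flag; this is not a NuttX definition. */
#define DEMO_NETDEV_DOWN 0x0001u

/* Hypothetical stand-in for a TCP connection structure. */
struct demo_conn
{
  int closed;
};

/* Event handler that tolerates conn == NULL.  A device-wide event (such as
 * "interface down") may be delivered without a connection, so the handler
 * returns early instead of dereferencing the pointer.
 */
uint16_t demo_close_eventhandler(void *pvconn, uint16_t flags)
{
  struct demo_conn *conn = pvconn;

  if (conn == NULL)
    {
      return flags;  /* Nothing connection-specific to do */
    }

  if ((flags & DEMO_NETDEV_DOWN) != 0)
    {
      conn->closed = 1;
    }

  return flags;
}

/* Usage: the device-wide case (NULL conn) and the per-connection case. */
int demo_usage(void)
{
  struct demo_conn conn = { 0 };

  demo_close_eventhandler(NULL, DEMO_NETDEV_DOWN);  /* Must not crash */
  assert(conn.closed == 0);

  demo_close_eventhandler(&conn, DEMO_NETDEV_DOWN);
  assert(conn.closed == 1);
  return 0;
}
```

Whether the check belongs in the handler or in the caller is exactly the open
question above; the sketch only shows what the tolerant variant would look like.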

On Thu, Aug 11, 2022 at 12:05 PM Alan Carvalho de Assis <ac...@gmail.com>
wrote:

> Hi Fotis,
>
> Are you in sync with mainline?
>
> If you can create a host application to induce the issue will be
> easier for us to test.
>
> BR,
>
> Alan
>
> On 8/9/22, Fotis Panagiotopoulos <f....@gmail.com> wrote:
> > Hello,
> >
> > still trying to make the network work reliably.
> > After fixing another issue of my application, I hit another problem.
> >
> > The following sequence causes NuttX to crash:
> >
> > 1. My application is creating a TCP socket and communicates with a
> server.
> > 2. At one point the server stops responding (unrelated to NuttX / network
> > issue).
> > 3. The application detects the timeout, and calls close() on the socket.
> > 4. A new socket is created, and it is connected to the server.
> > 5. At this point, the server decides to send a FIN message for the
> previous
> > connection.
> > 6. I get a failed assertion in devif_callback.c at line 85.
> >
> > Note that I haven't managed to manually reproduce this issue.
> > No matter what I do manually, everything seems to be working correctly.
> > I just have to wait for it to happen.
> > It seems that it is only triggered if a FIN arrives **after** a SYN.
> >
> > I am sure that this is only happening with CONFIG_NET_TCP_WRITE_BUFFERS
> > enabled.
> > I have no problems without buffering.
> >
> > The assertion seems right to fire.
> > When a FIN is received for a closed connection, the same callback is
> free'd
> > both by tcp_lost_connection() and later on by tcp_close_eventhandler().
> > All these are happening within the same execution of tcp_input().
> >
> > Any ideas?
> >
> > On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet <se...@lorquet.fr>
> > wrote:
> >
> >> Hi,
> >>
> >> good find but
> >>
> >> -I dont think any usual application tinkers with PHY regs during its
> >> lifetime except the ethernet monitor
> >>
> >> -the fix is certainly a lock somewhere but global or fine grained I dont
> >> know.
> >>
> >> Not all calls need to be locked, eg the one that returns the PHY
> >> address. Probaby not needed by default, but a PHY access lock would
> >> prevent any issue you describe.
> >>
> >> I will wait for people with more expertise about this.
> >>
> >> Just a note, dont forget that not all PHY have an interrupt, the one on
> >> the nucleo stm32h743zi[2] board does not have one.
> >>
> >> Sebastien
> >>
> >> Le 26/07/2022 à 11:05, Fotis Panagiotopoulos a écrit :
> >> > Hello,
> >> >
> >> > I have eventually found 2 issues regarding networking in my
> >> > application.
> >> > I would like to discuss the first one.
> >> >
> >> >
> >> > My code contains something like this:
> >> >
> >> > int sd = socket(AF_INET, SOCK_DGRAM, 0);
> >> >
> >> > struct ifreq ifr;
> >> > memset(&ifr, 0, sizeof(struct ifreq));
> >> > strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
> >> > ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
> >> > ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
> >> > ifr.ifr_mii_val_out = 0;
> >> > ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
> >> >
> >> > // Do stuff with ifr.ifr_mii_val_out.
> >> >
> >> > close(sd);
> >> >
> >> > I realized that this type of ioctl will directly access the hardware,
> >> > without any locking.
> >> > That is, if any other task needs to use the PHY in any other way, it
> >> > will
> >> > eventually corrupt its register data.
> >> >
> >> >
> >> > Two questions on this:
> >> > 1. Is there any good reason for this?
> >> > 2. What is the best way to fix it? Shall I add a driver level lock, or
> >> > should net_lock() be used in any higher layer?
> >> >
> >> >
> >> >
> >> > On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <
> >> f.j.panag@gmail.com>
> >> > wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
> they
> >> >>> have all been working reliably for months without stopping, we know
> >> >>> it
> >> >>> because they critically depend on network functionality and we have
> >> >>> reports if a card becomes unreachable. None has so far outside of
> >> >>> dedicated tests.
> >> >>> So I believe that there is no obvious hard bug in these drivers.
> >> >> Good to hear that!
> >> >> Although, I may be using a feature or protocol that you are not.
> >> >> Of course, I don't believe that NuttX is broken per se, but a minor
> >> >> bug
> >> >> may lurk somewhere...
> >> >>
> >> >>
> >> >>> I have seen that when I enable the network debugging features, it
> >> >>> seems
> >> >> to
> >> >>> hit an assertion failure before getting to nsh prompt at startup.
> >> >>> This
> >> >> was
> >> >>> on a quite recent master. I haven't had a chance to diagnose this
> >> >> further.
> >> >>> Have you tried enabling these and if so, do they work?
> >> >> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it
> >> works.
> >> >> I have some devices under test, waiting to reproduce the issue to see
> >> >> if
> >> >> this option provides any useful information.
> >> >>
> >> >>
> >> >>> Also, out of curiosity, have you tried running ostest on your board?
> >> >> I just tried.
> >> >> It passed all the tests.
> >> >>
> >> >> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet
> >> >> <sebastien@lorquet.fr
> >> >
> >> >> wrote:
> >> >>
> >> >>> Hi,
> >> >>>
> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
> they
> >> >>> have all been working reliably for months without stopping, we know
> >> >>> it
> >> >>> because they critically depend on network functionality and we have
> >> >>> reports if a card becomes unreachable. None has so far outside of
> >> >>> dedicated tests.
> >> >>>
> >> >>> So I believe that there is no obvious hard bug in these drivers.
> >> >>>
> >> >>> Most certainly a build option on your particular config. debug is a
> >> >>> possible issue, thread problems is another possibility.
> >> >>>
> >> >>> Sebastien
> >> >>>
> >> >>>
> >> >>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
> >> >>>> Hello!
> >> >>>>
> >> >>>> I am using Ethernet on an STM32F427 target, but I am facing some
> >> issues.
> >> >>>>
> >> >>>> Initially the device works correctly. After some hours of
> continuous
> >> >>>> operation I completely lose all network communications.
> >> >>>> Trying to troubleshoot the issue, I enabled assertions and various
> >> other
> >> >>>> debug features.
> >> >>>>
> >> >>>> Again the device works correctly for some hours, and then I get a
> >> failed
> >> >>>> assertion at stm32_eth.c, line 1372:
> >> >>>>
> >> >>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
> >> >>>>
> >> >>>> No other errors are reported (e.g. stack overflows etc).
> >> >>>>
> >> >>>>
> >> >>>> I have observed that this issue usually manifests itself when there
> >> >>>> is
> >> >>>> insufficient stack on a task.
> >> >>>> But in my case, all tasks have oversized stacks. Typically they do
> >> >>>> not
> >> >>>> exceed 50% utilization.
> >> >>>> I have plenty of room available in the heap too (> 100kB).
> >> >>>>
> >> >>>> Regarding the rest of the firmware, I cannot see any other
> >> misbehaviour
> >> >>> or
> >> >>>> problem.
> >> >>>> I haven't ever seen any other unexplained problem, assertion fail,
> >> >>>> hard-fault etc.
> >> >>>> The application code passes all of our tests.
> >> >>>> In fact, even when this issue happens, although I lose network
> >> >>>> connectivity, the rest of the system works perfectly.
> >> >>>>
> >> >>>> Please note that I have checked the contents of dev->d_len and
> >> >>> dev->d_buf,
> >> >>>> and they seem to contain valid data.
> >> >>>> The address lies within the normal address space of the MCU, and
> the
> >> >>> size
> >> >>>> is sane.
> >> >>>> So it doesn't look like any kind of memory corruption.
> >> >>>>
> >> >>>>
> >> >>>> At this point I believe that this is an actual bug either on the
> >> >>>> STM32
> >> >>> MAC
> >> >>>> driver, or at the TCP/IP stack itself.
> >> >>>> I had a look at the driver code, but I didn't see anything
> >> >>>> suspicious.
> >> >>>>
> >> >>>>
> >> >>>> Has anyone observed the same issue before?
> >> >>>> Can it be affected in any way with my configuration?
> >> >>>> Or maybe, do you have any recommendations on what to test next?
> >> >>>>
> >> >>>>
> >> >>>> Thank you!
> >> >>>>
> >>
> >
>

Re: STM32F4 Ethernet Issues

Posted by Alan Carvalho de Assis <ac...@gmail.com>.
Hi Fotis,

Are you in sync with mainline?

If you can create a host application that induces the issue, it will be
easier for us to test.

BR,

Alan

On 8/9/22, Fotis Panagiotopoulos <f....@gmail.com> wrote:
> Hello,
>
> still trying to make the network work reliably.
> After fixing another issue of my application, I hit another problem.
>
> The following sequence causes NuttX to crash:
>
> 1. My application is creating a TCP socket and communicates with a server.
> 2. At one point the server stops responding (unrelated to NuttX / network
> issue).
> 3. The application detects the timeout, and calls close() on the socket.
> 4. A new socket is created, and it is connected to the server.
> 5. At this point, the server decides to send a FIN message for the previous
> connection.
> 6. I get a failed assertion in devif_callback.c at line 85.
>
> Note that I haven't managed to manually reproduce this issue.
> No matter what I do manually, everything seems to be working correctly.
> I just have to wait for it to happen.
> It seems that it is only triggered if a FIN arrives **after** a SYN.
>
> I am sure that this is only happening with CONFIG_NET_TCP_WRITE_BUFFERS
> enabled.
> I have no problems without buffering.
>
> The assertion seems right to fire.
> When a FIN is received for a closed connection, the same callback is free'd
> both by tcp_lost_connection() and later on by tcp_close_eventhandler().
> All these are happening within the same execution of tcp_input().
>
> Any ideas?
>
> On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet <se...@lorquet.fr>
> wrote:
>
>> Hi,
>>
>> good find but
>>
>> -I dont think any usual application tinkers with PHY regs during its
>> lifetime except the ethernet monitor
>>
>> -the fix is certainly a lock somewhere but global or fine grained I dont
>> know.
>>
>> Not all calls need to be locked, eg the one that returns the PHY
>> address. Probaby not needed by default, but a PHY access lock would
>> prevent any issue you describe.
>>
>> I will wait for people with more expertise about this.
>>
>> Just a note, dont forget that not all PHY have an interrupt, the one on
>> the nucleo stm32h743zi[2] board does not have one.
>>
>> Sebastien
>>
>> Le 26/07/2022 à 11:05, Fotis Panagiotopoulos a écrit :
>> > Hello,
>> >
>> > I have eventually found 2 issues regarding networking in my
>> > application.
>> > I would like to discuss the first one.
>> >
>> >
>> > My code contains something like this:
>> >
>> > int sd = socket(AF_INET, SOCK_DGRAM, 0);
>> >
>> > struct ifreq ifr;
>> > memset(&ifr, 0, sizeof(struct ifreq));
>> > strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
>> > ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
>> > ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
>> > ifr.ifr_mii_val_out = 0;
>> > ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
>> >
>> > // Do stuff with ifr.ifr_mii_val_out.
>> >
>> > close(sd);
>> >
>> > I realized that this type of ioctl will directly access the hardware,
>> > without any locking.
>> > That is, if any other task needs to use the PHY in any other way, it
>> > will
>> > eventually corrupt its register data.
>> >
>> >
>> > Two questions on this:
>> > 1. Is there any good reason for this?
>> > 2. What is the best way to fix it? Shall I add a driver level lock, or
>> > should net_lock() be used in any higher layer?
>> >
>> >
>> >
>> > On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <
>> f.j.panag@gmail.com>
>> > wrote:
>> >
>> >> Hello,
>> >>
>> >>> We have deployed hundreds of boards with stm32f427 and ethernet, they
>> >>> have all been working reliably for months without stopping, we know
>> >>> it
>> >>> because they critically depend on network functionality and we have
>> >>> reports if a card becomes unreachable. None has so far outside of
>> >>> dedicated tests.
>> >>> So I believe that there is no obvious hard bug in these drivers.
>> >> Good to hear that!
>> >> Although, I may be using a feature or protocol that you are not.
>> >> Of course, I don't believe that NuttX is broken per se, but a minor
>> >> bug
>> >> may lurk somewhere...
>> >>
>> >>
>> >>> I have seen that when I enable the network debugging features, it
>> >>> seems
>> >> to
>> >>> hit an assertion failure before getting to nsh prompt at startup.
>> >>> This
>> >> was
>> >>> on a quite recent master. I haven't had a chance to diagnose this
>> >> further.
>> >>> Have you tried enabling these and if so, do they work?
>> >> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it
>> works.
>> >> I have some devices under test, waiting to reproduce the issue to see
>> >> if
>> >> this option provides any useful information.
>> >>
>> >>
>> >>> Also, out of curiosity, have you tried running ostest on your board?
>> >> I just tried.
>> >> It passed all the tests.
>> >>
>> >> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet
>> >> <sebastien@lorquet.fr
>> >
>> >> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> We have deployed hundreds of boards with stm32f427 and ethernet, they
>> >>> have all been working reliably for months without stopping, we know
>> >>> it
>> >>> because they critically depend on network functionality and we have
>> >>> reports if a card becomes unreachable. None has so far outside of
>> >>> dedicated tests.
>> >>>
>> >>> So I believe that there is no obvious hard bug in these drivers.
>> >>>
>> >>> Most certainly a build option on your particular config. debug is a
>> >>> possible issue, thread problems is another possibility.
>> >>>
>> >>> Sebastien
>> >>>
>> >>>
>> >>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
>> >>>> Hello!
>> >>>>
>> >>>> I am using Ethernet on an STM32F427 target, but I am facing some
>> issues.
>> >>>>
>> >>>> Initially the device works correctly. After some hours of continuous
>> >>>> operation I completely lose all network communications.
>> >>>> Trying to troubleshoot the issue, I enabled assertions and various
>> other
>> >>>> debug features.
>> >>>>
>> >>>> Again the device works correctly for some hours, and then I get a
>> failed
>> >>>> assertion at stm32_eth.c, line 1372:
>> >>>>
>> >>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
>> >>>>
>> >>>> No other errors are reported (e.g. stack overflows etc).
>> >>>>
>> >>>>
>> >>>> I have observed that this issue usually manifests itself when there
>> >>>> is
>> >>>> insufficient stack on a task.
>> >>>> But in my case, all tasks have oversized stacks. Typically they do
>> >>>> not
>> >>>> exceed 50% utilization.
>> >>>> I have plenty of room available in the heap too (> 100kB).
>> >>>>
>> >>>> Regarding the rest of the firmware, I cannot see any other
>> misbehaviour
>> >>> or
>> >>>> problem.
>> >>>> I haven't ever seen any other unexplained problem, assertion fail,
>> >>>> hard-fault etc.
>> >>>> The application code passes all of our tests.
>> >>>> In fact, even when this issue happens, although I lose network
>> >>>> connectivity, the rest of the system works perfectly.
>> >>>>
>> >>>> Please note that I have checked the contents of dev->d_len and
>> >>> dev->d_buf,
>> >>>> and they seem to contain valid data.
>> >>>> The address lies within the normal address space of the MCU, and the
>> >>> size
>> >>>> is sane.
>> >>>> So it doesn't look like any kind of memory corruption.
>> >>>>
>> >>>>
>> >>>> At this point I believe that this is an actual bug either on the
>> >>>> STM32
>> >>> MAC
>> >>>> driver, or at the TCP/IP stack itself.
>> >>>> I had a look at the driver code, but I didn't see anything
>> >>>> suspicious.
>> >>>>
>> >>>>
>> >>>> Has anyone observed the same issue before?
>> >>>> Can it be affected in any way with my configuration?
>> >>>> Or maybe, do you have any recommendations on what to test next?
>> >>>>
>> >>>>
>> >>>> Thank you!
>> >>>>
>>
>

Re: STM32F4 Ethernet Issues

Posted by Fotis Panagiotopoulos <f....@gmail.com>.
Hello,

I am still trying to make the network work reliably.
After fixing another issue in my application, I hit another problem.

The following sequence causes NuttX to crash:

1. My application creates a TCP socket and communicates with a server.
2. At one point the server stops responding (unrelated to NuttX / network
issue).
3. The application detects the timeout, and calls close() on the socket.
4. A new socket is created, and it is connected to the server.
5. At this point, the server decides to send a FIN message for the previous
connection.
6. I get a failed assertion in devif_callback.c at line 85.

Note that I haven't managed to manually reproduce this issue.
No matter what I do manually, everything seems to be working correctly.
I just have to wait for it to happen.
It seems that it is only triggered if a FIN arrives **after** a SYN.

I am sure that this is only happening with CONFIG_NET_TCP_WRITE_BUFFERS
enabled.
I have no problems without buffering.

The assertion seems right to fire.
When a FIN is received for a closed connection, the same callback is freed
twice: first by tcp_lost_connection(), and later by tcp_close_eventhandler().
Both calls happen within the same execution of tcp_input().

Any ideas?
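
A minimal sketch of the double-free hazard described above, and of one
possible way to make the release idempotent. All names here (cb_s, conn_s,
cb_alloc, conn_release_cb) are hypothetical stand-ins and are not the NuttX
devif_callback API; the sketch only assumes that two code paths may both try
to release the same callback for one connection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the NuttX devif_callback structures. */

struct cb_s
{
  struct cb_s *next;   /* Link in the free list */
  bool in_use;         /* True while owned by a connection */
};

struct conn_s
{
  struct cb_s *cb;     /* Callback owned by this connection, or NULL */
};

#define NCALLBACKS 4

static struct cb_s g_pool[NCALLBACKS];
static struct cb_s *g_free;

/* Put every pool entry on the free list. */

void cb_pool_init(void)
{
  g_free = NULL;
  for (int i = 0; i < NCALLBACKS; i++)
    {
      g_pool[i].in_use = false;
      g_pool[i].next = g_free;
      g_free = &g_pool[i];
    }
}

/* Take one callback from the free list, or return NULL if it is empty. */

struct cb_s *cb_alloc(void)
{
  struct cb_s *cb = g_free;
  if (cb != NULL)
    {
      g_free = cb->next;
      cb->next = NULL;
      cb->in_use = true;
    }

  return cb;
}

/* Release the connection's callback at most once. Clearing conn->cb
 * turns a second call (from another teardown path) into a no-op instead
 * of a second free of the same entry.
 */

bool conn_release_cb(struct conn_s *conn)
{
  struct cb_s *cb = conn->cb;
  if (cb == NULL)
    {
      return false;    /* Already released by an earlier path */
    }

  assert(cb->in_use);  /* Would fire on a genuine double free */
  cb->in_use = false;
  cb->next = g_free;
  g_free = cb;
  conn->cb = NULL;
  return true;
}
```

With this guard, calling conn_release_cb() from both teardown paths returns
true the first time and false the second time. Whether the actual fix should
guard the free like this, or avoid the second call altogether, is not settled
by this sketch.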

On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet <se...@lorquet.fr>
wrote:

> Hi,
>
> good find, but:
>
> - I don't think any usual application tinkers with PHY regs during its
> lifetime, except the ethernet monitor.
>
> - The fix is certainly a lock somewhere, but global or fine-grained I don't
> know.
>
> Not all calls need to be locked, e.g. the one that returns the PHY
> address. Probably not needed by default, but a PHY access lock would
> prevent any issue you describe.
>
> I will wait for people with more expertise about this.
>
> Just a note, don't forget that not all PHYs have an interrupt; the one on
> the nucleo stm32h743zi[2] board does not have one.
>
> Sebastien
>
> Le 26/07/2022 à 11:05, Fotis Panagiotopoulos a écrit :
> > Hello,
> >
> > I have eventually found 2 issues regarding networking in my application.
> > I would like to discuss the first one.
> >
> >
> > My code contains something like this:
> >
> > int sd = socket(AF_INET, SOCK_DGRAM, 0);
> >
> > struct ifreq ifr;
> > memset(&ifr, 0, sizeof(struct ifreq));
> > strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
> > ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
> > ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
> > ifr.ifr_mii_val_out = 0;
> > ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
> >
> > // Do stuff with ifr.ifr_mii_val_out.
> >
> > close(sd);
> >
> > I realized that this type of ioctl will directly access the hardware,
> > without any locking.
> > That is, if any other task needs to use the PHY in any other way, it will
> > eventually corrupt its register data.
> >
> >
> > Two questions on this:
> > 1. Is there any good reason for this?
> > 2. What is the best way to fix it? Shall I add a driver level lock, or
> > should net_lock() be used in any higher layer?
> >
> >
> >
> > On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <f....@gmail.com>
> > wrote:
> >
> >> Hello,
> >>
> >>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >>> have all been working reliably for months without stopping, we know it
> >>> because they critically depend on network functionality and we have
> >>> reports if a card becomes unreachable. None has so far outside of
> >>> dedicated tests.
> >>> So I believe that there is no obvious hard bug in these drivers.
> >> Good to hear that!
> >> Although, I may be using a feature or protocol that you are not.
> >> Of course, I don't believe that NuttX is broken per se, but a minor bug
> >> may lurk somewhere...
> >>
> >>
> > >>> I have seen that when I enable the network debugging features, it seems
> > >>> to hit an assertion failure before getting to nsh prompt at startup. This
> > >>> was on a quite recent master. I haven't had a chance to diagnose this
> > >>> further.
> > >>> Have you tried enabling these and if so, do they work?
> > >> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it works.
> >> I have some devices under test, waiting to reproduce the issue to see if
> >> this option provides any useful information.
> >>
> >>
> >>> Also, out of curiosity, have you tried running ostest on your board?
> >> I just tried.
> >> It passed all the tests.
> >>
> > >> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet <se...@lorquet.fr>
> > >> wrote:
> >>
> >>> Hi,
> >>>
> >>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >>> have all been working reliably for months without stopping, we know it
> >>> because they critically depend on network functionality and we have
> >>> reports if a card becomes unreachable. None has so far outside of
> >>> dedicated tests.
> >>>
> >>> So I believe that there is no obvious hard bug in these drivers.
> >>>
> >>> Most certainly a build option on your particular config. debug is a
> >>> possible issue, thread problems is another possibility.
> >>>
> >>> Sebastien
> >>>
> >>>
> >>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
> >>>> Hello!
> >>>>
> > >>>> I am using Ethernet on an STM32F427 target, but I am facing some issues.
> >>>>
> >>>> Initially the device works correctly. After some hours of continuous
> >>>> operation I completely lose all network communications.
> > >>>> Trying to troubleshoot the issue, I enabled assertions and various other
> > >>>> debug features.
> > >>>>
> > >>>> Again the device works correctly for some hours, and then I get a failed
> > >>>> assertion at stm32_eth.c, line 1372:
> >>>>
> >>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
> >>>>
> >>>> No other errors are reported (e.g. stack overflows etc).
> >>>>
> >>>>
> >>>> I have observed that this issue usually manifests itself when there is
> >>>> insufficient stack on a task.
> >>>> But in my case, all tasks have oversized stacks. Typically they do not
> >>>> exceed 50% utilization.
> >>>> I have plenty of room available in the heap too (> 100kB).
> >>>>
> > >>>> Regarding the rest of the firmware, I cannot see any other misbehaviour or
> > >>>> problem.
> >>>> I haven't ever seen any other unexplained problem, assertion fail,
> >>>> hard-fault etc.
> >>>> The application code passes all of our tests.
> >>>> In fact, even when this issue happens, although I lose network
> >>>> connectivity, the rest of the system works perfectly.
> >>>>
> > >>>> Please note that I have checked the contents of dev->d_len and dev->d_buf,
> > >>>> and they seem to contain valid data.
> > >>>> The address lies within the normal address space of the MCU, and the size
> > >>>> is sane.
> >>>> So it doesn't look like any kind of memory corruption.
> >>>>
> >>>>
> > >>>> At this point I believe that this is an actual bug either on the STM32 MAC
> > >>>> driver, or at the TCP/IP stack itself.
> >>>> I had a look at the driver code, but I didn't see anything suspicious.
> >>>>
> >>>>
> >>>> Has anyone observed the same issue before?
> >>>> Can it be affected in any way with my configuration?
> >>>> Or maybe, do you have any recommendations on what to test next?
> >>>>
> >>>>
> >>>> Thank you!
> >>>>
>

Re: STM32F4 Ethernet Issues

Posted by Sebastien Lorquet <se...@lorquet.fr>.
Hi,

good find, but:

- I don't think any usual application tinkers with PHY regs during its
lifetime, except the ethernet monitor.

- The fix is certainly a lock somewhere, but global or fine-grained I don't
know.

Not all calls need to be locked, e.g. the one that returns the PHY
address. Probably not needed by default, but a PHY access lock would
prevent any issue you describe.

I will wait for people with more expertise about this.

Just a note, don't forget that not all PHYs have an interrupt; the one on
the nucleo stm32h743zi[2] board does not have one.

Sebastien

Le 26/07/2022 à 11:05, Fotis Panagiotopoulos a écrit :
> Hello,
>
> I have eventually found 2 issues regarding networking in my application.
> I would like to discuss the first one.
>
>
> My code contains something like this:
>
> int sd = socket(AF_INET, SOCK_DGRAM, 0);
>
> struct ifreq ifr;
> memset(&ifr, 0, sizeof(struct ifreq));
> strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
> ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
> ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
> ifr.ifr_mii_val_out = 0;
> ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
>
> // Do stuff with ifr.ifr_mii_val_out.
>
> close(sd);
>
> I realized that this type of ioctl will directly access the hardware,
> without any locking.
> That is, if any other task needs to use the PHY in any other way, it will
> eventually corrupt its register data.
>
>
> Two questions on this:
> 1. Is there any good reason for this?
> 2. What is the best way to fix it? Shall I add a driver level lock, or
> should net_lock() be used in any higher layer?
>
>
>
> On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <f....@gmail.com>
> wrote:
>
>> Hello,
>>
>>> We have deployed hundreds of boards with stm32f427 and ethernet, they
>>> have all been working reliably for months without stopping, we know it
>>> because they critically depend on network functionality and we have
>>> reports if a card becomes unreachable. None has so far outside of
>>> dedicated tests.
>>> So I believe that there is no obvious hard bug in these drivers.
>> Good to hear that!
>> Although, I may be using a feature or protocol that you are not.
>> Of course, I don't believe that NuttX is broken per se, but a minor bug
>> may lurk somewhere...
>>
>>
>>> I have seen that when I enable the network debugging features, it seems to
>>> hit an assertion failure before getting to nsh prompt at startup. This was
>>> on a quite recent master. I haven't had a chance to diagnose this further.
>>> Have you tried enabling these and if so, do they work?
>> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it works.
>> I have some devices under test, waiting to reproduce the issue to see if
>> this option provides any useful information.
>>
>>
>>> Also, out of curiosity, have you tried running ostest on your board?
>> I just tried.
>> It passed all the tests.
>>
>> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet <se...@lorquet.fr>
>> wrote:
>>
>>> Hi,
>>>
>>> We have deployed hundreds of boards with stm32f427 and ethernet, they
>>> have all been working reliably for months without stopping, we know it
>>> because they critically depend on network functionality and we have
>>> reports if a card becomes unreachable. None has so far outside of
>>> dedicated tests.
>>>
>>> So I believe that there is no obvious hard bug in these drivers.
>>>
>>> Most certainly a build option on your particular config. debug is a
>>> possible issue, thread problems is another possibility.
>>>
>>> Sebastien
>>>
>>>
>>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
>>>> Hello!
>>>>
>>>> I am using Ethernet on an STM32F427 target, but I am facing some issues.
>>>>
>>>> Initially the device works correctly. After some hours of continuous
>>>> operation I completely lose all network communications.
>>>> Trying to troubleshoot the issue, I enabled assertions and various other
>>>> debug features.
>>>>
>>>> Again the device works correctly for some hours, and then I get a failed
>>>> assertion at stm32_eth.c, line 1372:
>>>>
>>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
>>>>
>>>> No other errors are reported (e.g. stack overflows etc).
>>>>
>>>>
>>>> I have observed that this issue usually manifests itself when there is
>>>> insufficient stack on a task.
>>>> But in my case, all tasks have oversized stacks. Typically they do not
>>>> exceed 50% utilization.
>>>> I have plenty of room available in the heap too (> 100kB).
>>>>
>>>> Regarding the rest of the firmware, I cannot see any other misbehaviour or
>>>> problem.
>>>> I haven't ever seen any other unexplained problem, assertion fail,
>>>> hard-fault etc.
>>>> The application code passes all of our tests.
>>>> In fact, even when this issue happens, although I lose network
>>>> connectivity, the rest of the system works perfectly.
>>>>
>>>> Please note that I have checked the contents of dev->d_len and dev->d_buf,
>>>> and they seem to contain valid data.
>>>> The address lies within the normal address space of the MCU, and the size
>>>> is sane.
>>>> So it doesn't look like any kind of memory corruption.
>>>>
>>>>
>>>> At this point I believe that this is an actual bug either on the STM32 MAC
>>>> driver, or at the TCP/IP stack itself.
>>>> I had a look at the driver code, but I didn't see anything suspicious.
>>>>
>>>>
>>>> Has anyone observed the same issue before?
>>>> Can it be affected in any way with my configuration?
>>>> Or maybe, do you have any recommendations on what to test next?
>>>>
>>>>
>>>> Thank you!
>>>>

Re: STM32F4 Ethernet Issues

Posted by Fotis Panagiotopoulos <f....@gmail.com>.
Hello,

I have eventually found 2 issues regarding networking in my application.
I would like to discuss the first one.


My code contains something like this:

int sd = socket(AF_INET, SOCK_DGRAM, 0);

struct ifreq ifr;
memset(&ifr, 0, sizeof(struct ifreq));
strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
ifr.ifr_mii_val_out = 0;
ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);

// Do stuff with ifr.ifr_mii_val_out.

close(sd);

I realized that this type of ioctl will directly access the hardware,
without any locking.
That is, if any other task needs to use the PHY in any other way, it will
eventually corrupt its register data.


Two questions on this:
1. Is there any good reason for this?
2. What is the best way to fix it? Shall I add a driver level lock, or
should net_lock() be used in any higher layer?
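
To illustrate the driver-level variant of question 2, here is a minimal
sketch that serializes a two-step PHY register access behind one lock. It is
an assumption-laden model, not the stm32_eth.c code: the PHY is simulated by
an array, the two-step access by a shared "selected register" variable, and a
POSIX pthread mutex stands in for whatever lock the driver would really use.
The names g_phy_regs, g_mii_addr, phy_read and phy_write are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define PHY_NREGS 32  /* Clause 22 MDIO addresses 32 registers per PHY */

/* Simulated hardware: one shared "selected register" plus the register
 * file. A real MAC has a similar shared address/data register pair, which
 * is why two unsynchronized callers can disturb each other.
 */

static uint16_t g_phy_regs[PHY_NREGS];
static uint8_t g_mii_addr;
static pthread_mutex_t g_phy_lock = PTHREAD_MUTEX_INITIALIZER;

/* Two-step read. The caller must hold g_phy_lock. */

static uint16_t mii_read_unlocked(uint8_t reg)
{
  assert(reg < PHY_NREGS);
  g_mii_addr = reg;                /* Step 1: select the register */
  return g_phy_regs[g_mii_addr];   /* Step 2: fetch the data */
}

/* Two-step write. The caller must hold g_phy_lock. */

static void mii_write_unlocked(uint8_t reg, uint16_t value)
{
  assert(reg < PHY_NREGS);
  g_mii_addr = reg;                /* Step 1: select the register */
  g_phy_regs[g_mii_addr] = value;  /* Step 2: store the data */
}

/* Locked read: returns 0 on success, -1 on bad arguments. */

int phy_read(uint8_t reg, uint16_t *value)
{
  if (reg >= PHY_NREGS || value == NULL)
    {
      return -1;
    }

  pthread_mutex_lock(&g_phy_lock);
  *value = mii_read_unlocked(reg);
  pthread_mutex_unlock(&g_phy_lock);
  return 0;
}

/* Locked write: returns 0 on success, -1 on bad arguments. */

int phy_write(uint8_t reg, uint16_t value)
{
  if (reg >= PHY_NREGS)
    {
      return -1;
    }

  pthread_mutex_lock(&g_phy_lock);
  mii_write_unlocked(reg, value);
  pthread_mutex_unlock(&g_phy_lock);
  return 0;
}
```

Both the ioctl path and any internal PHY user would have to go through the
locked wrappers for this to help. Whether the lock belongs in the driver or
whether net_lock() at a higher layer is the better place remains the open
question.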



On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <f....@gmail.com>
wrote:

> Hello,
>
> > We have deployed hundreds of boards with stm32f427 and ethernet, they
> > have all been working reliably for months without stopping, we know it
> > because they critically depend on network functionality and we have
> > reports if a card becomes unreachable. None has so far outside of
> > dedicated tests.
>
> > So I believe that there is no obvious hard bug in these drivers.
>
> Good to hear that!
> Although, I may be using a feature or protocol that you are not.
> Of course, I don't believe that NuttX is broken per se, but a minor bug
> may lurk somewhere...
>
>
> > I have seen that when I enable the network debugging features, it seems to
> > hit an assertion failure before getting to nsh prompt at startup. This was
> > on a quite recent master. I haven't had a chance to diagnose this further.
> > Have you tried enabling these and if so, do they work?
>
> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it works.
> I have some devices under test, waiting to reproduce the issue to see if
> this option provides any useful information.
>
>
> > Also, out of curiosity, have you tried running ostest on your board?
>
> I just tried.
> It passed all the tests.
>
> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet <se...@lorquet.fr>
> wrote:
>
>> Hi,
>>
>> We have deployed hundreds of boards with stm32f427 and ethernet, they
>> have all been working reliably for months without stopping, we know it
>> because they critically depend on network functionality and we have
>> reports if a card becomes unreachable. None has so far outside of
>> dedicated tests.
>>
>> So I believe that there is no obvious hard bug in these drivers.
>>
>> Most certainly a build option on your particular config. debug is a
>> possible issue, thread problems is another possibility.
>>
>> Sebastien
>>
>>
>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
>> > Hello!
>> >
>> > I am using Ethernet on an STM32F427 target, but I am facing some issues.
>> >
>> > Initially the device works correctly. After some hours of continuous
>> > operation I completely lose all network communications.
>> > Trying to troubleshoot the issue, I enabled assertions and various other
>> > debug features.
>> >
>> > Again the device works correctly for some hours, and then I get a failed
>> > assertion at stm32_eth.c, line 1372:
>> >
>> > DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
>> >
>> > No other errors are reported (e.g. stack overflows etc).
>> >
>> >
>> > I have observed that this issue usually manifests itself when there is
>> > insufficient stack on a task.
>> > But in my case, all tasks have oversized stacks. Typically they do not
>> > exceed 50% utilization.
>> > I have plenty of room available in the heap too (> 100kB).
>> >
>> > Regarding the rest of the firmware, I cannot see any other misbehaviour or
>> > problem.
>> > I haven't ever seen any other unexplained problem, assertion fail,
>> > hard-fault etc.
>> > The application code passes all of our tests.
>> > In fact, even when this issue happens, although I lose network
>> > connectivity, the rest of the system works perfectly.
>> >
>> > Please note that I have checked the contents of dev->d_len and dev->d_buf,
>> > and they seem to contain valid data.
>> > The address lies within the normal address space of the MCU, and the size
>> > is sane.
>> > So it doesn't look like any kind of memory corruption.
>> >
>> >
>> > At this point I believe that this is an actual bug either on the STM32 MAC
>> > driver, or at the TCP/IP stack itself.
>> > I had a look at the driver code, but I didn't see anything suspicious.
>> >
>> >
>> > Has anyone observed the same issue before?
>> > Can it be affected in any way with my configuration?
>> > Or maybe, do you have any recommendations on what to test next?
>> >
>> >
>> > Thank you!
>> >
>>
>

Re: STM32F4 Ethernet Issues

Posted by Fotis Panagiotopoulos <f....@gmail.com>.
Hello,

> We have deployed hundreds of boards with stm32f427 and ethernet, they
> have all been working reliably for months without stopping, we know it
> because they critically depend on network functionality and we have
> reports if a card becomes unreachable. None has so far outside of
> dedicated tests.

> So I believe that there is no obvious hard bug in these drivers.

Good to hear that!
Although, I may be using a feature or protocol that you are not.
Of course, I don't believe that NuttX is broken per se, but a minor bug may
lurk somewhere...


> I have seen that when I enable the network debugging features, it seems to
> hit an assertion failure before getting to nsh prompt at startup. This was
> on a quite recent master. I haven't had a chance to diagnose this further.
> Have you tried enabling these and if so, do they work?

If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it works.
I have some devices under test, waiting to reproduce the issue to see if
this option provides any useful information.


> Also, out of curiosity, have you tried running ostest on your board?

I just tried.
It passed all the tests.

On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet <se...@lorquet.fr>
wrote:

> Hi,
>
> We have deployed hundreds of boards with stm32f427 and ethernet, they
> have all been working reliably for months without stopping, we know it
> because they critically depend on network functionality and we have
> reports if a card becomes unreachable. None has so far outside of
> dedicated tests.
>
> So I believe that there is no obvious hard bug in these drivers.
>
> Most certainly a build option on your particular config. debug is a
> possible issue, thread problems is another possibility.
>
> Sebastien
>
>
> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
> > Hello!
> >
> > I am using Ethernet on an STM32F427 target, but I am facing some issues.
> >
> > Initially the device works correctly. After some hours of continuous
> > operation I completely lose all network communications.
> > Trying to troubleshoot the issue, I enabled assertions and various other
> > debug features.
> >
> > Again the device works correctly for some hours, and then I get a failed
> > assertion at stm32_eth.c, line 1372:
> >
> > DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
> >
> > No other errors are reported (e.g. stack overflows etc).
> >
> >
> > I have observed that this issue usually manifests itself when there is
> > insufficient stack on a task.
> > But in my case, all tasks have oversized stacks. Typically they do not
> > exceed 50% utilization.
> > I have plenty of room available in the heap too (> 100kB).
> >
> > Regarding the rest of the firmware, I cannot see any other misbehaviour or
> > problem.
> > I haven't ever seen any other unexplained problem, assertion fail,
> > hard-fault etc.
> > The application code passes all of our tests.
> > In fact, even when this issue happens, although I lose network
> > connectivity, the rest of the system works perfectly.
> >
> > Please note that I have checked the contents of dev->d_len and dev->d_buf,
> > and they seem to contain valid data.
> > The address lies within the normal address space of the MCU, and the size
> > is sane.
> > So it doesn't look like any kind of memory corruption.
> >
> >
> > At this point I believe that this is an actual bug either on the STM32 MAC
> > driver, or at the TCP/IP stack itself.
> > I had a look at the driver code, but I didn't see anything suspicious.
> >
> >
> > Has anyone observed the same issue before?
> > Can it be affected in any way with my configuration?
> > Or maybe, do you have any recommendations on what to test next?
> >
> >
> > Thank you!
> >
>

Re: STM32F4 Ethernet Issues

Posted by Sebastien Lorquet <se...@lorquet.fr>.
Hi,

We have deployed hundreds of boards with stm32f427 and ethernet, they 
have all been working reliably for months without stopping, we know it 
because they critically depend on network functionality and we have 
reports if a card becomes unreachable. None has so far outside of 
dedicated tests.

So I believe that there is no obvious hard bug in these drivers.

It is most certainly a build option in your particular config. Debug is a
possible issue; thread problems are another possibility.

Sebastien


On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
> Hello!
>
> I am using Ethernet on an STM32F427 target, but I am facing some issues.
>
> Initially the device works correctly. After some hours of continuous
> operation I completely lose all network communications.
> Trying to troubleshoot the issue, I enabled assertions and various other
> debug features.
>
> Again the device works correctly for some hours, and then I get a failed
> assertion at stm32_eth.c, line 1372:
>
> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
>
> No other errors are reported (e.g. stack overflows etc).
>
>
> I have observed that this issue usually manifests itself when there is
> insufficient stack on a task.
> But in my case, all tasks have oversized stacks. Typically they do not
> exceed 50% utilization.
> I have plenty of room available in the heap too (> 100kB).
>
> Regarding the rest of the firmware, I cannot see any other misbehaviour or
> problem.
> I haven't ever seen any other unexplained problem, assertion fail,
> hard-fault etc.
> The application code passes all of our tests.
> In fact, even when this issue happens, although I lose network
> connectivity, the rest of the system works perfectly.
>
> Please note that I have checked the contents of dev->d_len and dev->d_buf,
> and they seem to contain valid data.
> The address lies within the normal address space of the MCU, and the size
> is sane.
> So it doesn't look like any kind of memory corruption.
>
>
> At this point I believe that this is an actual bug either on the STM32 MAC
> driver, or at the TCP/IP stack itself.
> I had a look at the driver code, but I didn't see anything suspicious.
>
>
> Has anyone observed the same issue before?
> Can it be affected in any way with my configuration?
> Or maybe, do you have any recommendations on what to test next?
>
>
> Thank you!
>