You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@nuttx.apache.org by GitBox <gi...@apache.org> on 2022/08/30 03:38:08 UTC

[GitHub] [incubator-nuttx] anchao opened a new pull request, #6956: net/tcp: use independent work to free the conn instance

anchao opened a new pull request, #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956

   
   ## Summary
   net/tcp: use independent work to free the conn instance
   
   I noticed that the conn instance will leak during stress test,
   The close work queued from tcp_close_eventhandler() will be canceled
   by tcp_timer() immediately:
   ```
   
   Breakpoint 1, tcp_close_eventhandler (dev=0x565cd338 <up_irq_restore+108>, pvpriv=0x5655e6ff <getpid+12>, flags=0) at tcp/tcp_close.c:71
   (gdb) bt
   | #0  tcp_close_eventhandler (dev=0x565cd338 <up_irq_restore+108>, pvpriv=0x5655e6ff <getpid+12>, flags=0) at tcp/tcp_close.c:71
   | #1  0x5658bf1e in devif_conn_event (dev=0x5660bd80 <g_sim_dev>, flags=512, list=0x5660d558 <g_cbprealloc+312>) at devif/devif_callback.c:508
   | #2  0x5658a219 in tcp_callback (dev=0x5660bd80 <g_sim_dev>, conn=0x5660c4a0 <g_tcp_connections>, flags=512) at tcp/tcp_callback.c:167
   | #3  0x56589253 in tcp_timer (dev=0x5660bd80 <g_sim_dev>, conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_timer.c:378
   | #4  0x5658dd47 in tcp_poll (dev=0x5660bd80 <g_sim_dev>, conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_devpoll.c:95
   | #5  0x5658b95f in devif_poll_tcp_connections (dev=0x5660bd80 <g_sim_dev>, callback=0x565770f2 <netdriver_txpoll>) at devif/devif_poll.c:601
   | #6  0x5658b9ea in devif_poll (dev=0x5660bd80 <g_sim_dev>, callback=0x565770f2 <netdriver_txpoll>) at devif/devif_poll.c:722
   | #7  0x56577230 in netdriver_txavail_work (arg=0x5660bd80 <g_sim_dev>) at sim/up_netdriver.c:308
   | #8  0x5655999e in work_thread (argc=2, argv=0xf3db5dd0) at wqueue/kwork_thread.c:178
   | #9  0x5655983f in nxtask_start () at task/task_start.c:129
   
   (gdb) c
   Continuing.
   Breakpoint 2, tcp_update_timer (conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_timer.c:178
   (gdb) bt
   | #0  tcp_update_timer (conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_timer.c:178
   | #1  0x5658952a in tcp_timer (dev=0x5660bd80 <g_sim_dev>, conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_timer.c:708
   | #2  0x5658dd47 in tcp_poll (dev=0x5660bd80 <g_sim_dev>, conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_devpoll.c:95
   | #3  0x5658b95f in devif_poll_tcp_connections (dev=0x5660bd80 <g_sim_dev>, callback=0x565770f2 <netdriver_txpoll>) at devif/devif_poll.c:601
   | #4  0x5658b9ea in devif_poll (dev=0x5660bd80 <g_sim_dev>, callback=0x565770f2 <netdriver_txpoll>) at devif/devif_poll.c:722
   | #5  0x56577230 in netdriver_txavail_work (arg=0x5660bd80 <g_sim_dev>) at sim/up_netdriver.c:308
   | #6  0x5655999e in work_thread (argc=2, argv=0xf3db5dd0) at wqueue/kwork_thread.c:178
   | #7  0x5655983f in nxtask_start () at task/task_start.c:129
   ```
   
   Since a separate work will add 24 bytes to each conn instance,
   but in order to support the feature of asynchronous close(),
   I can not find a better way than adding a separate work,
   for resource constraints, I recommend the developers to enable
   CONFIG_NET_ALLOC_CONNS, which will reduce the ram usage.
   
   Signed-off-by: chao an <an...@xiaomi.com>
   
   
   ## Impact
   
   N/A
   Refer to the discussion in issue: @PetervdPerk-NXP 
   https://github.com/apache/incubator-nuttx/issues/6721
   https://github.com/apache/incubator-nuttx/pull/6740
   
   ## Testing
   
   tcp stress test
   
   @fjpanag 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] pkarashchenko commented on a diff in pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
pkarashchenko commented on code in PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#discussion_r960267262


##########
net/tcp/tcp_close.c:
##########
@@ -175,11 +178,15 @@ static uint16_t tcp_close_eventhandler(FAR struct net_driver_s *dev,
   return flags;
 
 end_wait:
-  tcp_callback_free(conn, conn->clscb);
+  if (conn->clscb)

Review Comment:
   optional
   ```suggestion
     if (conn->clscb != NULL)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1232084111

   > Does your netdev still work in this case? it look like psock_tcp_connect can't receive any important event inside psock_connect_eventhandler.
   
   Yes, during the incident another thread managed to communicate with a TCP server, without delays or errors.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1237868322

   @anchao This is exactly the tests I did, last week.
   
   According to these tests:
   * When the fall-back is used, the assertion does NOT fire.
   * When the assertion fires, the fall-back was NOT used.
   
   I starting re-testing it again, yesterday.  
   I can confirm again point 1.  
   I couldn't manage to cause the assertion to fail to confirm number 2, but I did so last week.
   
   I will continue testing this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1236984317

   @anchao Sorry I should have explained better.
   
   I have applied this patch, but I have deliberately commented out your last fix checking the reference count.
   I did this, so I can still have the assertion fail, and examine this issue better.
   
   I wanted to understand the path the code is taking, as I believe it is not working as we think.
   
   You identified [this](https://github.com/apache/incubator-nuttx/blob/master/net/tcp/tcp_conn.c#L638-L676) as the root cause of the failed assertion.  
   According to my analysis, this is wrong. The two do not seem correlated. Or if they are, this is not the only problem.
   
   So this fix may be wrong, as it tries to solve an issue that maybe it is not there.  
   I am afraid that this fix may just mask the actual problem, or cause a leak.
   
   I would like us to positively identify the root cause, before applying any changes.
   
   Also, the issue with threads getting stuck during `connect()` appears to be fairly common to us.  
   It happens in all of our devices, a few times per day.
   I have a tendency to believe that these two are correlated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] anchao commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
anchao commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1254823655

   @fjpanag , do you have any comments? We have the net stress test for this PR, from the feedback from test team, there are no other side effects, Since this PR resolved that the conn instance can not be released correctly, how about merge this PR first? If you have other findings, let us rise a new issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] xiaoxiang781216 commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
xiaoxiang781216 commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1232071598

   Does your netdev still work in this case? it look like psock_tcp_connect can't receive any important event inside psock_connect_eventhandler.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1232086596

   I also removed and re-plugged the Ethernet during this. This has as a result a pair of calls to ifdown/ifup. The rest of the system behaved correctly, while this thread remained stuck in this state.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] xiaoxiang781216 commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
xiaoxiang781216 commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1232091028

   In this case, do you receive NETDEV_DOWN?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1232236729

   Repeating the experiment multiple times I got a failed assertion @ `tcp/tcp_conn.c`, line 770.
   
   Worker A:
   ![image](https://user-images.githubusercontent.com/46975045/187554890-c417da71-b593-4c5a-a90f-d88f6b69406f.png)
   
   Worker B:
   ![image](https://user-images.githubusercontent.com/46975045/187554926-2d07bb28-09f1-495e-97ac-7a3e162241f9.png)
   
   Thread A:
   ![image](https://user-images.githubusercontent.com/46975045/187554981-c200c83a-241c-4a55-9f1a-1c3423edb2ae.png)
   
   Thread B:
   ![image](https://user-images.githubusercontent.com/46975045/187555037-0d744d93-64ef-4096-8294-e4a2443a835a.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] anchao commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
anchao commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1236124183

   > During the failed assertion, two threads operate on the same `conn` structure.
   > 
   > One is receiving data while the other tries to free the connection. Hence the failed assertion. See the following call traces:
   
   @fjpanag  Did you apply the latest patch set? In this case, if the reference count is greater than 0, tcp_free() should not be triggered.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1232200183

   I managed to cause the issue with the stuck thread again. Here is how I did it.
   
   My setup has as follows:
   
   ```mermaid
   graph LR
       A[Device] -->|Ethernet A| B
       B[Switch] -->|Ethernet B| C
       C[LAN]
   ```
   
   So, everything was working normally.  
   Two threads open a TCP socket towards two different servers, and happily exchange data.
   
   I cause a disruption on the network, that is I unplug the Ethernet cable **B**.  
   The communications are troubled, but the device sees the link to be still UP.
   
   The problem in server communications causes both connections to be closed.  
   This will lead to two calls to `close()`, one in every thread.
   
   I connect Ethernet cable **B** back. Communication is now possible.  
   The two threads call `socket()`, followed by `connect()` etc...
   
   Thread A manages to connect to the server. It exchanges data normally.
   **Thread B is stuck in `connect()` again!**
   
   The system does not seem to be able to recover. After a couple of minutes I repeat the experiment.  
   Once again I disconnect and reconnect Ethernet cable **B**.
   
   **Now Thread A also gets stuck in `connect()`!**
   Thread B remains stuck.
   
   The system cannot recover in any way, it needs a full reboot.
   
   ---
   
   While the issue manifested itself, I took a look around with my debugger and I saw:
   * There was no dead-lock, i.e. two threads waiting for each-other.
   * All other threads were running normally. Typically not much to do, so they mostly `sleep()`.
   * Both workers in the low-priority queue were mostly `sleep()`ing.
   * Another part of the system did use the workers successfully.
   * The workers never executed anything related to the network while in this state.
   * While in this state I unplugged Ethernet cable **A**. Nothing happened. (Expected, thread is stuck).
   * I re-connected Ethernet cable **A**. Again, nothing happened. (Expected)
   * `devif_dev_event(dev, NETDEV_DOWN);` (file netdev_ioctl.c, line 1946) cannot be called, thread is stuck.
   
   Both stuck threads have the same call trace:
   
   ![image](https://user-images.githubusercontent.com/46975045/187548884-20751ca9-2821-47d6-8a8e-700d0eb6f8bb.png)
   
   ![image](https://user-images.githubusercontent.com/46975045/187548929-ce29a4b0-7845-46f4-afc1-0c44ea7eed24.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] anchao commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
anchao commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1237178894

   Hi @fjpanag ,
   
   Could you please confirm whether tcp_alloc uses the fall-back connection instance when the assert() triggered? 
   https://github.com/apache/incubator-nuttx/blob/master/net/tcp/tcp_conn.c#L638-L704
   (perhaps you need to add some debug logs to track the pointer of "conn"), I guess the issue you encounter maybe as blew:
   
   ```
   1. tcp_close ready and queue the lpwork to free the resources:
   net/tcp/tcp_close.c:
   tcp_close_eventhandler
   |
    ->work_queue  (LPWORK priority 100)
    
   2.  High priority task( > 100) open socket and allocate the tcp_conn instance,
        no free connections and fall-back to active list to found out the close pending instance
   
   socket
   |
    ->tcp_alloc /* conn->crefs++ */(https://github.com/apache/incubator-nuttx/blob/master/net/tcp/tcp_conn.c#L638-L704)
    |
      ->connect
   
   
   3. High priority task blocked on connect(), task yield to lpwork, 
   tcp_close_work()
   |
    ->tcp_free()
        |
          ->assert()  (conn->crefs > 0)
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] anchao commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
anchao commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1231279901

   >  I see that we are going back and forth with removing and adding cls work @anchao do you recall why it was removed last time?
   
   The resource consumption of network has increased compared to the previous version:
   https://github.com/apache/incubator-nuttx/issues/6721
   https://github.com/apache/incubator-nuttx/pull/6740
   
   We will consider optimization of resource consumption from other aspects in the future.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1232045297

   I had one case were one thread stuck indefinitely within a `connect()` call.  
   During the incident, all other tasks could be seen idling. It was not a dead-lock.
   
   I am not sure that it is caused by this commit, I just haven't seen this before.  
   It seems as a timing issue.
   
   Here is the stack trace of the stuck thread:
   
   ![Screenshot from 2022-08-30 21-35-46](https://user-images.githubusercontent.com/46975045/187519885-fc4e4bbf-f604-4fc5-ad7f-e732f081ba69.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1234043050

   @xiaoxiang781216 Sorry I need more time. I haven't finished with the tests yet.
   
   I also have one device that has crashed again, but I haven't got it in my hands yet (its on a different location), so I don't know if it is related.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] anchao commented on a diff in pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
anchao commented on code in PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#discussion_r962995175


##########
net/tcp/tcp_close.c:
##########
@@ -175,11 +178,15 @@ static uint16_t tcp_close_eventhandler(FAR struct net_driver_s *dev,
   return flags;
 
 end_wait:
-  tcp_callback_free(conn, conn->clscb);
+  if (conn->clscb)

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1232097455

   In my case `CONFIG_NETDOWN_NOTIFIER` is disabled.
   
   I don't know if it is related, but `tcp_close_eventhandler()` is still not called every time a socket is closed. Only some times. Is this expected?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] xiaoxiang781216 commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
xiaoxiang781216 commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1232108316

   > In my case `CONFIG_NETDOWN_NOTIFIER` is disabled.
   > 
   > I don't know if it is related, but `tcp_close_eventhandler()` is still not called every time a socket is closed. Only some times. Is this expected?
   
   If you don't close the socket by self, and unplug the netdev without CONFIG_NETDOWN_NOTIFIER, nobody can notify you. Does the problem disappear without this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1231385176

   @anchao I will be glad to test this, but it will take me some time. Maybe till the evening or tomorrow morning.
   
   Please wait for my results, I will comment here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] xiaoxiang781216 merged pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
xiaoxiang781216 merged PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1235315678

   During the failed assertion, two threads operate on the same `conn` structure.
   
   One is receiving data while the other tries to free the connection. Hence the failed assertion.
   See the following call traces:
   
   ![image](https://user-images.githubusercontent.com/46975045/188116310-6330bef6-7cce-402a-8e17-c612052fd832.png)
   
   ![image](https://user-images.githubusercontent.com/46975045/188116493-3d1fb4b2-02b5-40ed-8ab8-9008aecf3ebb.png)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1231416441

   @anchao FYI see issue #6960.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1259447840

   @anchao Sorry I was on an expedition abroad with no internet access.
   
   Give me some time to re-sync with everything, and I will comment back.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] xiaoxiang781216 commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
xiaoxiang781216 commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1233839252

   @fjpanag do you have any update?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1232114674

   > If you don't close the socket by self, and unplug the netdev without CONFIG_NETDOWN_NOTIFIER, nobody can notify you. Does the problem disappear without this patch?
   
   The socket will soon try to send data and it will fail. This will call `close()`.
   
   I have added logs and I monitor the usage of `close()` and `tcp_close_eventhandler()`. I can assure you that when the first one is called the second one may **not** be called.
   
   The problem with the stuck thread only happened once, using this patch. I am trying hard, but I haven't found a way to reproduce it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] fjpanag commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
fjpanag commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1235308463

   > > Repeating the experiment multiple times I got a failed assertion @ `tcp/tcp_conn.c`, line 770.
   > 
   > @fjpanag ,
   > 
   > Thanks a lot for your detailed analysis! From the callstack you shared, it should be that the conn waiting to be released is reused as a fall-back conn:
   > 
   > https://github.com/apache/incubator-nuttx/blob/master/net/tcp/tcp_conn.c#L638-L676
   > 
   > I think here should add some sanity check before tcp_free():
   > 
   > ```
   > $ git diff .
   > diff --git a/net/tcp/tcp_close.c b/net/tcp/tcp_close.c
   > index fb1e9015d7..dca09c267e 100644
   > --- a/net/tcp/tcp_close.c
   > +++ b/net/tcp/tcp_close.c
   > @@ -53,10 +53,13 @@ static void tcp_close_work(FAR void *param)
   >  
   >    net_lock();
   >  
   > -  /* Stop the network monitor for all sockets */
   > +  if (conn && conn->crefs == 0)
   > +    {
   > +      /* Stop the network monitor for all sockets */
   >  
   > -  tcp_stop_monitor(conn, TCP_CLOSE);
   > -  tcp_free(conn);
   > +      tcp_stop_monitor(conn, TCP_CLOSE);
   > +      tcp_free(conn);
   > +    }
   >  
   >    net_unlock();
   >  }
   > ```
   > 
   > I will update this PR, could you please help with further testing ?
   
   No, this is not the root cause.
   
   I managed to have the code to enter [this](https://github.com/apache/incubator-nuttx/blob/master/net/tcp/tcp_conn.c#L638-L676) multiple times, but the assertion does not fire.
   
   I managed to have the assertion fail again, but before the failure the above code was not executed.  
   There is another problem, elsewhere.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] pkarashchenko commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
pkarashchenko commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1231271094

   I see that we are going back and forth with removing and adding cls work @anchao  do you recall why it was removed last time?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-nuttx] anchao commented on pull request #6956: net/tcp: use independent work to free the conn instance

Posted by GitBox <gi...@apache.org>.
anchao commented on PR #6956:
URL: https://github.com/apache/incubator-nuttx/pull/6956#issuecomment-1232422666

   > Repeating the experiment multiple times I got a failed assertion @ `tcp/tcp_conn.c`, line 770.
   
   @fjpanag ,
   
   Thanks a lot for your detailed analysis! From the callstack you shared, it should be that the conn waiting to be released is reused as a fall-back conn:
   
   https://github.com/apache/incubator-nuttx/blob/master/net/tcp/tcp_conn.c#L638-L676
   
   I think here should add some sanity check before tcp_free():
   
   ```
   $ git diff .
   diff --git a/net/tcp/tcp_close.c b/net/tcp/tcp_close.c
   index fb1e9015d7..dca09c267e 100644
   --- a/net/tcp/tcp_close.c
   +++ b/net/tcp/tcp_close.c
   @@ -53,10 +53,13 @@ static void tcp_close_work(FAR void *param)
    
      net_lock();
    
   -  /* Stop the network monitor for all sockets */
   +  if (conn && conn->crefs == 0)
   +    {
   +      /* Stop the network monitor for all sockets */
    
   -  tcp_stop_monitor(conn, TCP_CLOSE);
   -  tcp_free(conn);
   +      tcp_stop_monitor(conn, TCP_CLOSE);
   +      tcp_free(conn);
   +    }
    
      net_unlock();
    }
   
   ```
   I will update  this PR, could you please help with further testing ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org