Posted to dev@nuttx.apache.org by Reto Gähwiler <gr...@gmail.com> on 2020/06/30 08:58:14 UTC

close() socket called in second thread combined with reconnect kills eth (stm32h743zi)

Hello Everyone,

I am facing the following problem working with NuttX and Ethernet
connections. A TCP socket is set up as blocking and connected to the server.
The connection is handled in one thread, which blocks in the recv() call and
processes data when it arrives. In case of an error the connection is
closed.
Now, if close() is called on that particular TCP connection from a
different thread, it terminates the connection, and the recv() fails and
returns.
If we then connect to a new IP, it first seems to be fine, but shortly after
the whole network disappears. There are no more ICMP responses (therefore no
ping), and all other open connections in different threads are no longer
reachable. Besides, any of the still-open connections starts to consume all
CPU time. With the debugger attached, it can be seen that in
net/devif/devif_callback.c the for-loop looking for the callback in the
device event list cycles without end.

Looking at Wireshark while data is transmitted from my client to the server,
the traffic around the termination looks as follows, i.e. just before we
reconnect and fail.

No.    Time      Source          Destination     Protocol  Length  Src.MacAddress     Info
43178  0.000451  195.65.177.171  10.62.64.110    TCP       75      Fortinet_09:00:06  29500 → 1026 [PSH, ACK] Seq=30001 Ack=759475 Win=1758 Len=21
43179  0.000102  10.62.64.110    195.65.177.171  TCP       60      xxxx_0c:70:04      1026 → 29500 [ACK] Seq=759475 Ack=30022 Win=5954 Len=0
43182  0.001144  10.62.64.110    195.65.177.171  TCP       586     xxxx_0c:70:04      1026 → 29500 [PSH, ACK] Seq=759475 Ack=30022 Win=6150 Len=532
43183  0.000437  10.62.64.110    195.65.177.171  TCP       60      xxxx_0c:70:04      [TCP Out-Of-Order] 1026 → 29500 [FIN, ACK] Seq=759475 Ack=30022 Win=6150 Len=0
43184  0.000049  195.65.177.171  10.62.64.110    TCP       75      Fortinet_09:00:06  29500 → 1026 [PSH, ACK] Seq=30022 Ack=760007 Win=1758 Len=21
43185  0.000090  10.62.64.110    195.65.177.171  TCP       60      xxxx_0c:70:04      1026 → 29500 [RST, ACK] Seq=760007 Ack=30043 Win=1758 Len=0
43186  0.000012  195.65.177.171  10.62.64.110    TCP       60      Fortinet_09:00:06  [TCP Dup ACK 43184#1] 29500 → 1026 [ACK] Seq=30043 Ack=760007 Win=1758 Len=0
43187  0.000096  10.62.64.110    195.65.177.171  TCP       60      xxxx_0c:70:04      1026 → 29500 [RST, ACK] Seq=760007 Ack=30043 Win=1758 Len=0

As can be seen, the client's (the device I am working on) sequence number is
not synchronised after the last data transmit: the data segment (Seq=759475,
Len=532) gives a next sequence number of 759475 + 532 = 760007, yet the
FIN, ACK sent by the device also carries Seq=759475. Therefore, it looks
like closing a connection this way is not thread safe!
In case of an idle connection the sequence numbers look just fine, but
the next connection still triggers the same error.

I then also tried shutdown(), calling it from the thread I used to call
close(), but shutdown.c is just a dummy API, as already noticed by
seanshpark <https://nuttx.yahoogroups.narkive.com/YjaUuARV/socket-shutdown>
5 years ago.

The platform the code is executed on is based on a stm32h743zi. Since
things seem to happen in the libraries it could affect other platforms as
well.

I was wondering if anyone else has run into this issue, where calling close()
on a socket from a different thread than the one handling recv/send causes
the following connection to kill the entire Ethernet? Please let me know if
you know a fix for blocking sockets, or whether it would be better to go with
non-blocking sockets and work with select/poll instead.

Thanks for your input and help,
best regards, Reto

Re: close() socket called in second thread combined with reconnect kills eth (stm32h743zi)

Posted by Reto Gähwiler <gr...@gmail.com>.

Good Morning Everyone, 
First of all, thanks for your responses. We will eventually change the design to handle recv() and close() in the same pthread, following your advice, Xiang Xiao and Nathan Hartman.

@Gregory Nutt: Yes, I meant pthreads. Technically, the pthread calling close() and the pthread blocked in recv() were both started from the main application, which is started by NuttX. Are they then in the same task group?
 

Re: close() socket called in second thread combined with reconnect kills eth (stm32h743zi)

Posted by Gregory Nutt <sp...@gmail.com>.

I am a little confused by what you mean by a thread.  If you are talking 
about a pthread, then yes, it should be able to close any socket opened 
by the task group.

The main thread and its children pthreads are members of the same task 
group: 
https://cwiki.apache.org/confluence/display/NUTTX/Tasks+vs.+Threads+FAQ

Any member of the task group can close a socket.  The socket is a private 
resource of the task group.

If you are trying to close a socket in Task B that was opened in Task A, 
that will not work.  The socket is a private resource of Task A and 
cannot be closed by Task B.

Re: close() socket called in second thread combined with reconnect kills eth (stm32h743zi)

Posted by Nathan Hartman <ha...@gmail.com>.

Ping... this question seemed to be waiting on our mailing list for several days:

On Tue, Jun 30, 2020 at 4:58 AM Reto Gähwiler <gr...@gmail.com> wrote:
> I am facing the following problem working with nuttx and ethernet
> connections. A TCP socket is setup as blocking and connected to the server.
> The connection is handled in one thread which hangs in the recv call and
> processes the data if some arrives. In case of an error the connection is
> closed.
> Now, if a close() call on that particular TCP connection is called from a
> different thread, it terminates the connection and the recv() fails and
> breaks free.
> If we now connect to a new IP, it first seems to be fine but shortly after
> the whole network disappears. No more icmp responses (therefore no ping)
> and all other opened connections in different threads are not reachable
> anymore. Besides, any of the still opened connections starts to consume all
> cpu time. Looking into it with the debugger attached it can be seen,
> that in the net/devif/devif_callback.c the for-loop looking for the
> callback in the device event list is cycling without an end.

(snip)

> I was wondering if anyone else ran into the issue of calling close on a
> socket from a different thread as the recv/send is handled on and that the
> following connection kills the entire ethernet? Please let me know if you
> know a fix for blocking sockets or it would be better to go with
> non-blocking and work with select/poll instead.

Hi Reto,

Have you considered trying to put all the handling into a single
thread and using some sort of inter-thread messaging to request that
thread to close the connection when no longer needed?

I don't know the answer to calling close() from another thread because
I have always written my networking code to manage a socket by a
single thread, where all operations on the socket occur in that same
thread.

If your concern is blocking while waiting to recv() or recvfrom(), I
use select() to check whether any received data is waiting, usually
with a timeout so that the thread can do other things while it waits
for data.

Hopefully someone else can chime in about your specific question.

Nathan

RE: close() socket called in second thread combined with reconnect kills eth (stm32h743zi)

Posted by Xiang Xiao <xi...@gmail.com>.


> -----Original Message-----
> From: Reto Gähwiler <gr...@gmail.com>
> Sent: Tuesday, June 30, 2020 4:58 PM
> To: dev@nuttx.apache.org
> Subject: close() socket called in second thread combined with reconnect kills eth (stm32h743zi)
> 
> Hello Everyone,
> 
> I am facing the following problem working with nuttx and ethernet connections. A TCP socket is setup as blocking and connected to
> the server.
> The connection is handled in one thread which hangs in the recv call and processes the data if some arrives. In case of an error the
> connection is closed.
> Now, if a close() call on that particular TCP connection is called from a different thread, it terminates the connection and the recv() fails
> and breaks free.

It's unsafe to call close() while another thread is blocked in recv(). Yes, it's safe on most POSIX OSes, but that isn't true for NuttX, because NuttX always releases all resources associated with the socket directly in close(), regardless of whether another thread is blocked on it.
Note: not only sockets are unsafe in this case; normal file handles are too. On the other hand, it's safe to call other APIs (except close) concurrently from different threads.

> If we now connect to a new IP, it first seems to be fine but shortly after the whole network disappears. No more icmp responses
> (therefore no ping) and all other opened connections in different threads are not reachable anymore. Besides, any of the still opened
> connections starts to consume all cpu time. Looking into it with the debugger attached it can be seen, that in the
> net/devif/devif_callback.c the for-loop looking for the callback in the device event list is cycling without an end.
> 
> Looking at wireshark while data is transmitted from my client to the server it looks as follows around the termination. So basically
> before we reconnect and fail.
> 
> No. Time Source Destination Protocol Length Src.MacAddress Info
> > 43178 0.000451 195.65.177.171 10.62.64.110 TCP 75 Fortinet_09:00:06
> > 29500 → 1026 [PSH, ACK] Seq=30001 Ack=759475 Win=1758 Len=21
> > 43179 0.000102 10.62.64.110 195.65.177.171 TCP 60 xxxx_0c:70:04 1026 →
> > 29500 [ACK] Seq=759475 Ack=30022 Win=5954 Len=0
> > 43182 0.001144 10.62.64.110 195.65.177.171 TCP 586 xxxx_0c:70:04 1026
> > →
> > 29500 [PSH, ACK] Seq=759475 Ack=30022 Win=6150 Len=532
> > 43183 0.000437 10.62.64.110 195.65.177.171 TCP 60 xxxx_0c:70:04 [TCP
> > Out-Of-Order] 1026 → 29500 [FIN, ACK] Seq=759475 Ack=30022 Win=6150
> > Len=0
> > 43184 0.000049 195.65.177.171 10.62.64.110 TCP 75 Fortinet_09:00:06
> > 29500 → 1026 [PSH, ACK] Seq=30022 Ack=760007 Win=1758 Len=21
> > 43185 0.000090 10.62.64.110 195.65.177.171 TCP 60 xxxx_0c:70:04 1026 →
> > 29500 [RST, ACK] Seq=760007 Ack=30043 Win=1758 Len=0
> > 43186 0.000012 195.65.177.171 10.62.64.110 TCP 60 Fortinet_09:00:06
> > [TCP Dup ACK 43184#1] 29500 → 1026 [ACK] Seq=30043 Ack=760007 Win=1758
> > Len=0
> > 43187 0.000096 10.62.64.110 195.65.177.171 TCP 60 xxxx_0c:70:04 1026 →
> > 29500 [RST, ACK] Seq=760007 Ack=30043 Win=1758 Len=0
> >
> 
> As can be seen are the clients (the device I am working on) sequence number not synchronised after the last data transmit
> (seq=759475, len=532 -->
> nextseq=760007) and the FIN,ACK also sent by the device (seq=759475 as well!!!). Therefore, it looks like closing a connection this way
> is not thread safe!
> In case of an idle connection the sequence numbers would look just fine but the next connection will trigger the same error.
> 
> I then also tried to make use of the shutdown and call it from the thread I used to call close, but shutdown.c is just a dummy API as
> already noticed by seanshpark
> <https://nuttx.yahoogroups.narkive.com/YjaUuARV/socket-shutdown>5 years ago.
> 
> The platform the code is executed on is based on a stm32h743zi. Since things seem to happen in the libraries it could affect other
> platforms as well.
> 
> I was wondering if anyone else ran into the issue of calling close on a socket from a different thread as the recv/send is handled on
> and that the following connection kills the entire ethernet? Please let me know if you know a fix for blocking sockets or it would be
> better to go with non-blocking and work with select/poll instead.
> 

Two methods can fix this problem:
1. Implement a safe close() like other OSes:
   a. Increase the reference count at the entry point of each API
   b. Decrease the reference count at the exit point of each API (potentially releasing the socket resources here)
   c. close() has to wake up any other blocked thread instead of releasing the socket resources directly
2. Close the socket in the receiving thread only; you can either:
   a. Send a signal to break the blocking recv()
   b. Create a pipe and poll/select both the socket and the pipe, then write data to the pipe to break out of poll/select
Of course, item 1 is the better choice; item 2 is just a workaround.
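
A rough sketch of workaround 2.b, assuming pipe() and poll() are available in your configuration. The function names request_close() and recv_or_close() are made up for illustration and are not NuttX APIs; only standard POSIX calls are used. The point is that the socket is only ever closed by the thread that receives on it.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, called from any other thread: ask the receiving
 * thread to close its socket by writing one byte to the pipe's write end.
 */
int request_close(int wake_wr_fd)
{
  char c = 'x';
  return write(wake_wr_fd, &c, 1) == 1 ? 0 : -1;
}

/* Hypothetical helper, called only from the receiving thread: block until
 * the socket has data or a close was requested through the pipe.
 * Returns >0 for received bytes. Returns 0 (close requested or peer closed)
 * or -1 (error) after the socket has been closed in this thread.
 */
ssize_t recv_or_close(int sockfd, int wake_rd_fd, void *buf, size_t len)
{
  struct pollfd fds[2];
  ssize_t n;

  fds[0].fd      = sockfd;
  fds[0].events  = POLLIN;
  fds[0].revents = 0;
  fds[1].fd      = wake_rd_fd;
  fds[1].events  = POLLIN;
  fds[1].revents = 0;

  if (poll(fds, 2, -1) < 0)
    {
      close(sockfd);
      return -1;
    }

  if (fds[1].revents != 0)
    {
      /* Close requested (or the pipe broke): close in the owning thread. */

      close(sockfd);
      return 0;
    }

  /* The socket is readable (or in an error state), so recv() won't block. */

  n = recv(sockfd, buf, len, 0);
  if (n <= 0)
    {
      close(sockfd);
    }

  return n;
}
```

The receiving thread creates the pipe once with pipe(), loops on recv_or_close() with the read end, and stops as soon as it returns 0 or -1. Any other thread calls request_close() on the write end instead of calling close() on the socket.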

> Thanks for your input and help,
> best regards, Reto