Posted to dev@plc4x.apache.org by Julian Feinauer <j....@pragmaticminds.de> on 2019/08/02 14:49:47 UTC

Are we leaking sockets?

Hi all,

We are observing strange behavior in production.
We are still investigating the exact scenario, and it’s a bit complex, as we have many connections to many PLCs and fire many requests through many different channels…
But what we observe is that we get the well-known “too many open files” exception on a Linux server WHEN one of the PLCs becomes unreachable (the pool then tries many times to recreate the connection).

I just had a quick look at the codebase and I think we are handling the exceptions wrong (or not at all?).
If I understand it correctly from [1] (I didn’t bother to check Netty’s docs, as they are rather poor), we should close the socket somewhere, but we ALWAYS call super.exceptionCaught(), which just passes the exception on to the next handler in the channel pipeline but seems to NEVER close the socket.
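
To make the suspected gap concrete, here is a minimal sketch of a handler that closes the channel instead of only forwarding the exception. It assumes Netty 4.1 on the classpath; the class names CloseOnExceptionDemo and CloseOnExceptionHandler are hypothetical and not taken from the PLC4X codebase, and the demo uses Netty's EmbeddedChannel so that no real socket is needed. Whether closing on every exception is the right policy is a separate design question.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.embedded.EmbeddedChannel;
import java.io.IOException;

// Hypothetical demo class: pushes a simulated error through an in-memory channel.
final class CloseOnExceptionDemo {
    public static void main(String[] args) {
        EmbeddedChannel channel = new EmbeddedChannel(new CloseOnExceptionHandler());
        channel.pipeline().fireExceptionCaught(new IOException("simulated read failure"));
        System.out.println("open after error: " + channel.isOpen());
    }
}

// Hypothetical handler: closes the channel rather than only forwarding the exception.
final class CloseOnExceptionHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        // super.exceptionCaught(ctx, cause) would only call ctx.fireExceptionCaught(cause),
        // i.e. hand the exception to the next handler without touching the socket.
        System.err.println("Closing channel after error: " + cause);
        // Closing the channel is what releases the underlying socket (file descriptor).
        ctx.close();
    }
}
```

With a real NioSocketChannel the same ctx.close() call would close the underlying socket; the embedded channel only stands in for it here.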

Am I wrong about that?

We are trying to create an MWE that reproduces this behavior, to check whether we can fix it like that.

Best
Julian

[1] https://www.baeldung.com/netty-exception-handling

Re: Are we leaking sockets?

Posted by Julian Feinauer <j....@pragmaticminds.de>.

Hi Chris,

Which mention? I didn't find a comment from you.
As far as I understand, it should be correct, as the future is only for the connection.

But as this is crucial, I asked two people for reviews : )

J

On 05.08.19, 09:27, "Christofer Dutz" <ch...@c-ware.de> wrote:

    As I mentioned in the PR,
    
    Are we sure the callback is only called on a failure at the connection layer? I wouldn't want PLC4X to kill the worker when it is communicating with a non-standard PLC and one of our protocol layers therefore fires an error while processing the data.
    
    Chris
    

Re: Are we leaking sockets?

Posted by Christofer Dutz <ch...@c-ware.de>.

As I mentioned in the PR,

Are we sure the callback is only called on a failure at the connection layer? I wouldn't want PLC4X to kill the worker when it is communicating with a non-standard PLC and one of our protocol layers therefore fires an error while processing the data.
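
One way to make this distinction explicit inside a handler is sketched below. This is only an illustration under assumptions: Netty 4.1 on the classpath, hypothetical class names (SelectiveCloseDemo, SelectiveCloseHandler), and the simplifying assumption that transport failures show up as IOException while protocol-level problems show up as other exception types. It is not the code from the PR.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.embedded.EmbeddedChannel;
import java.io.IOException;

// Hypothetical demo: a protocol-level error keeps the channel, an I/O error closes it.
final class SelectiveCloseDemo {
    public static void main(String[] args) {
        EmbeddedChannel channel = new EmbeddedChannel(new SelectiveCloseHandler());
        channel.pipeline().fireExceptionCaught(new IllegalStateException("unexpected PLC response"));
        System.out.println("open after protocol error: " + channel.isOpen());
        channel.pipeline().fireExceptionCaught(new IOException("connection reset"));
        System.out.println("open after I/O error: " + channel.isOpen());
    }
}

// Hypothetical handler that tears the connection down only for transport-level failures.
final class SelectiveCloseHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        if (cause instanceof IOException) {
            // Assumption: an IOException means the transport itself is broken.
            ctx.close();
        } else {
            // Assumption: anything else is a protocol-level problem; report it and keep the connection.
            System.err.println("Protocol-level error, keeping connection: " + cause);
        }
    }
}
```

The mapping from exception type to "connection layer" versus "protocol layer" is the part that would need checking against the actual drivers.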

Chris

On 04.08.19, 20:26, "Julian Feinauer" <jf...@apache.org> wrote:

    Hi all,
    
    so I found the cause and fixed it.
    In fact, when the connection attempt was aborted, we ended up having created a thread pool (the worker pool for Netty) that was never shut down, because no channel had been created, and the channel is where that handling would have happened later on.
    I created the PR https://github.com/apache/plc4x/pull/76 against develop.
    
    If this PR gets accepted, I suggest creating a bugfix release 0.4.1, as this is really an issue for us in production.
    
    Any concerns with this approach?
    
    Thanks!
    Julian
    

Re: Are we leaking sockets?

Posted by Julian Feinauer <jf...@apache.org>.

Hi all,

so I found the cause and fixed it.
In fact, when the connection attempt was aborted, we ended up having created a thread pool (the worker pool for Netty) that was never shut down, because no channel had been created, and the channel is where that handling would have happened later on.
I created the PR https://github.com/apache/plc4x/pull/76 against develop.
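
The PR itself is not reproduced here. As a rough sketch of the kind of cleanup described above, assuming Netty 4.1 on the classpath: a listener on the future returned by connect() shuts the worker group down when the connection attempt fails. The class ConnectWithCleanup and its connect helper are hypothetical names, and port 1 on localhost is only assumed to be closed.

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;

final class ConnectWithCleanup {

    // Hypothetical helper: starts a connect and releases the worker pool if it fails.
    static ChannelFuture connect(String host, int port, EventLoopGroup workerGroup) {
        Bootstrap bootstrap = new Bootstrap()
            .group(workerGroup)
            .channel(NioSocketChannel.class)
            .handler(new ChannelInitializer<SocketChannel>() {
                @Override
                protected void initChannel(SocketChannel ch) {
                    // Protocol handlers would be added to ch.pipeline() here.
                }
            });
        ChannelFuture connectFuture = bootstrap.connect(host, port);
        connectFuture.addListener((ChannelFutureListener) future -> {
            if (!future.isSuccess()) {
                // No usable channel exists, so nothing else will ever stop these threads.
                workerGroup.shutdownGracefully();
            }
        });
        return connectFuture;
    }

    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup workerGroup = new NioEventLoopGroup();
        // Assumption: nothing listens on port 1, so the connect attempt fails.
        ChannelFuture connectFuture = connect("127.0.0.1", 1, workerGroup);
        connectFuture.await();
        if (connectFuture.isSuccess()) {
            // Unexpected success: clean up explicitly so the demo still terminates.
            connectFuture.channel().close().sync();
            workerGroup.shutdownGracefully();
        }
        workerGroup.terminationFuture().sync();
        System.out.println("worker group terminated: " + workerGroup.isTerminated());
    }
}
```

The listener is attached to the future returned by connect(), so it reacts to the outcome of the connection attempt. As far as I know, errors raised later by handlers while processing data are delivered through exceptionCaught instead and do not trigger this listener.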

If this PR gets accepted, I suggest creating a bugfix release 0.4.1, as this is really an issue for us in production.

Any concerns with this approach?

Thanks!
Julian

On 2019/08/02 15:05:47, Julian Feinauer <j....@pragmaticminds.de> wrote: 
> Hey,
> 
> Agreed, @cdutz... I am just running an example and it really does look like that.
> So I'll try to finish an MWE and perhaps ask on the Netty list : )
> 
> Julian
> 

Re: Are we leaking sockets?

Posted by Julian Feinauer <j....@pragmaticminds.de>.

Hey,

Agreed, @cdutz... I am just running an example and it really does look like that.
So I'll try to finish an MWE and perhaps ask on the Netty list : )

Julian

On 02.08.19, 16:58, "Christofer Dutz" <ch...@c-ware.de> wrote:

    Hi Julian,

    Well if I look into my sock drawer at home I think we might be leaking some socks ... I agree ... there are several single-socks in there ;-)

    But regarding netty ... yes it is absolutely possible we're not handling this correctly as the docs are quite extensive and I didn't bother reading all of them ;-)

    So perhaps we should read them or ask some Netty pro

    Chris


Re: Are we leaking sockets?

Posted by Christofer Dutz <ch...@c-ware.de>.

Hi Julian,

Well if I look into my sock drawer at home I think we might be leaking some socks ... I agree ... there are several single-socks in there ;-)

But regarding netty ... yes it is absolutely possible we're not handling this correctly as the docs are quite extensive and I didn't bother reading all of them ;-)

So perhaps we should read them or ask some Netty pro

Chris
