You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by wangzhijiang999 <wa...@aliyun.com> on 2016/05/23 03:40:08 UTC

problem of sharing TCP connection when transferring data

Hi,
     I am confused with sharing tcp connection for the same connectionID, if two tasks share the same connection, and there is no available buffer in the local buffer pool of the first task  , then it will set autoread as false for the channel, but it will effect the second task if it still has available buffer. So if one of the tasks no available buffer , all the other tasks can not read data from channel because of this. My understanding is right? If so, are there any improvements for it?  Thank you for any help!

Re: problem of sharing TCP connection when transferring data

Posted by Deepak Sharma <de...@gmail.com>.

Thanks Ufuk.
Sure I would pick some starter issues and get to these low level issues.

-Deepak
On 23 May 2016 11:21 pm, "Ufuk Celebi" <uc...@apache.org> wrote:

> On Mon, May 23, 2016 at 7:30 PM, Deepak Sharma <de...@gmail.com>
> wrote:
> > Would it be possible to get involved on this issues and start
> contributing
> > to Flink community?
>
> Hey Deepak!
>
> Nice to see that you are also interested in this. If you are new to
> Flink I would recommend to start contributing by looking into one of
> the starter issues:
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20starter
>
> The lower layers are quite tricky. If it's OK with you, I would rather
> post any follow ups like JIRA issues or design docs in this thread, at
> which point it will be possible to chime in here, too. In the mean
> time the issues I've linked are a better place to start I think.
>
> – Ufuk
>

Re: problem of sharing TCP connection when transferring data

Posted by Ufuk Celebi <uc...@apache.org>.

On Mon, May 23, 2016 at 7:30 PM, Deepak Sharma <de...@gmail.com> wrote:
> Would it be possible to get involved on this issues and start contributing
> to Flink community?

Hey Deepak!

Nice to see that you are also interested in this. If you are new to
Flink I would recommend to start contributing by looking into one of
the starter issues:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20starter

The lower layers are quite tricky. If it's OK with you, I would rather
post any follow ups like JIRA issues or design docs in this thread, at
which point it will be possible to chime in here, too. In the mean
time the issues I've linked are a better place to start I think.

– Ufuk

Re: problem of sharing TCP connection when transferring data

Posted by Deepak Sharma <de...@gmail.com>.

I am not Flink master or regular user of FLink , but would like to start
contributing to Flink.
Would it be possible to get involved on this issues and start contributing
to Flink community?

Thanks
Deepak

On Mon, May 23, 2016 at 10:49 PM, Ufuk Celebi <uc...@apache.org> wrote:

> On Mon, May 23, 2016 at 6:55 PM, wangzhijiang999
> <wa...@aliyun.com> wrote:
> >    In summary, if one task set autoread as false, and when it notify the
> > available buffer, there are some messages during this time to be
> processed
> > first, if one message belongs to another failed task, the autoread for
> this
> > channel would not be set true anymore. The only way is to cancel all the
> > tasks in this channel to release the channel. Is it right?
>
> Yes, very good observation. In this sense the failure model of Flink
> is baked in into the way the channels are multiplexed, which is a bad
> thing (as you already noticed with your improved failover strategy).
>
> If you want, let me look into this issue on a high level and let's fix
> this together as a first step. Let's have a chat about this by the end
> of the week. Does this work for you?
>
> After that, we can continue with the flow control issue, which is
> definitely a bigger task.
>
> – Ufuk
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

回复：problem of sharing TCP connection when transferring data

Posted by wangzhijiang999 <wa...@aliyun.com>.

 Hi Ufuk,        I am willing to do some work for this issue and has a basic solution for it. And wish to get professional suggestion from you. What is the next step for it ?  Looking forward to your reply!
Zhijiang Wang------------------------------------------------------------------发件人：Ufuk Celebi <uc...@apache.org>发送时间：2016年5月24日(星期二) 01:19收件人：user <us...@flink.apache.org>; wangzhijiang999 <wa...@aliyun.com>主　题：Re: problem of sharing TCP connection when transferring data On Mon, May 23, 2016 at 6:55 PM, wangzhijiang999
<wa...@aliyun.com> wrote:
>    In summary, if one task set autoread as false, and when it notify the
> available buffer, there are some messages during this time to be processed
> first, if one message belongs to another failed task, the autoread for this
> channel would not be set true anymore. The only way is to cancel all the
> tasks in this channel to release the channel. Is it right?

Yes, very good observation. In this sense the failure model of Flink
is baked in into the way the channels are multiplexed, which is a bad
thing (as you already noticed with your improved failover strategy).

If you want, let me look into this issue on a high level and let's fix
this together as a first step. Let's have a chat about this by the end
of the week. Does this work for you?

After that, we can continue with the flow control issue, which is
definitely a bigger task.

– Ufuk

Re: problem of sharing TCP connection when transferring data

Posted by Ufuk Celebi <uc...@apache.org>.

On Mon, May 23, 2016 at 6:55 PM, wangzhijiang999
<wa...@aliyun.com> wrote:
>    In summary, if one task set autoread as false, and when it notify the
> available buffer, there are some messages during this time to be processed
> first, if one message belongs to another failed task, the autoread for this
> channel would not be set true anymore. The only way is to cancel all the
> tasks in this channel to release the channel. Is it right?

Yes, very good observation. In this sense the failure model of Flink
is baked in into the way the channels are multiplexed, which is a bad
thing (as you already noticed with your improved failover strategy).

If you want, let me look into this issue on a high level and let's fix
this together as a first step. Let's have a chat about this by the end
of the week. Does this work for you?

After that, we can continue with the flow control issue, which is
definitely a bigger task.

– Ufuk

回复：problem of sharing TCP connection when transferring data

Posted by wangzhijiang999 <wa...@aliyun.com>.

 Hi Ufuk,
       Thank you for the detail explaination!  As we confirmed that the task will set the autoread as false for the sharing channel when no available segment buffer. In further, when this task has available buffer again, it will notify the event to set the autoread as true. But in some scenarios, there would be a propobility that the autoread for this sharing channel would not be set as true anymore. That is , when available buffer to notify event and currently there are some messages staged in the queue,it would process these messages first, the message shoule be put on input channel buffer in common way, but if the task failed and the buffer pool is released, it will return false when process the message,so the channel will not be set as autoread true any more, then all the other tasks sharing this channel will be effected.      In summary, if one task set autoread as false, and when it notify the available buffer, there are some messages during this time to be processed first, if one message belongs to another failed task, the autoread for this channel would not be set true anymore. The only way is to cancel all the tasks in this channel to release the channel. Is it right?    In the past, I improved the failover strategy based on flink for our application and noticed this issue. Also i am very interested and pleasure to do some related work for flink improvement as you mentioned. Actually i am working on improving flink in many ways for our application, and wish further contact with you for the professional advise. Thank you again!
 Zhijiang Wang------------------------------------------------------------------发件人：Ufuk Celebi <uc...@apache.org>发送时间：2016年5月23日(星期一) 19:49收件人：user <us...@flink.apache.org>; wangzhijiang999 <wa...@aliyun.com>主　题：Re: problem of sharing TCP connection when transferring data Yes, that is a correct description of the state of things.

A way to improve this is to introduce flow control in the application
layer, where consumers only receive buffers when they have buffers
available. They could announce on the channel how many buffers they
have before they receive anything. This way there will be no blocking
of the channel and we could actually multiplex more consumers over the
same channel.

The implementation is probably a little tricky, but if you want to
work on this and have time to actually do it, we can think about the
details. :-) Would you be interested? If yes, let's schedule a Hangout
where we brainstorm about the solution and how to implement it.
Ideally, we would come up with a design document, which we share on
the mailing list and then we continue implementing it. I currently
only have time to act as a guide/mentor and you would have to do most
of the implementation.

– Ufuk

On Mon, May 23, 2016 at 5:40 AM, wangzhijiang999
<wa...@aliyun.com> wrote:
> Hi,
>
>      I am confused with sharing tcp connection for the same connectionID, if
> two tasks share the same connection, and there is no available buffer in the
> local buffer pool of the first task  , then it will set autoread as false
> for the channel, but it will effect the second task if it still has
> available buffer. So if one of the tasks no available buffer , all the other
> tasks can not read data from channel because of this. My understanding is
> right? If so, are there any improvements for it?  Thank you for any help!
>
>
>
>
>

Re: problem of sharing TCP connection when transferring data

Posted by Ufuk Celebi <uc...@apache.org>.

Yes, that is a correct description of the state of things.

A way to improve this is to introduce flow control in the application
layer, where consumers only receive buffers when they have buffers
available. They could announce on the channel how many buffers they
have before they receive anything. This way there will be no blocking
of the channel and we could actually multiplex more consumers over the
same channel.

The implementation is probably a little tricky, but if you want to
work on this and have time to actually do it, we can think about the
details. :-) Would you be interested? If yes, let's schedule a Hangout
where we brainstorm about the solution and how to implement it.
Ideally, we would come up with a design document, which we share on
the mailing list and then we continue implementing it. I currently
only have time to act as a guide/mentor and you would have to do most
of the implementation.

– Ufuk

On Mon, May 23, 2016 at 5:40 AM, wangzhijiang999
<wa...@aliyun.com> wrote:
> Hi,
>
>      I am confused with sharing tcp connection for the same connectionID, if
> two tasks share the same connection, and there is no available buffer in the
> local buffer pool of the first task  , then it will set autoread as false
> for the channel, but it will effect the second task if it still has
> available buffer. So if one of the tasks no available buffer , all the other
> tasks can not read data from channel because of this. My understanding is
> right? If so, are there any improvements for it?  Thank you for any help!
>
>
>
>
>