Posted to user@flink.apache.org by Rahul Patwari <ra...@gmail.com> on 2020/07/31 21:02:46 UTC

sporadic "Insufficient no of network buffers" issue

Hi,

We are sporadically observing an "Insufficient number of network buffers"
issue after upgrading Flink from 1.4.2 to 1.8.2.
The state of the tasks hitting this issue transitioned from DEPLOYING to FAILED.
Whenever this issue occurs, the job manager restarts. Sometimes the issue
goes away after the restart.
As we are not hitting the issue consistently, we are unsure whether to
change the memory configuration or not.

Min. recommended number of network buffers (slots-per-TM^2 * #TMs * 4):
(8 * 8) * 8 * 4 = 2048
The exception says that 13112 network buffers are configured, which is
more than 6x the recommendation.

Is reducing the number of shuffles the only way to reduce the number of
network buffers required?

Thanks,
Rahul

configs:
env: Kubernetes
Flink: 1.8.2
using default configs for memory.fraction, memory.min, memory.max.
using 8 TM, 8 slots/TM
Each TM is running with 1 core, 4 GB Memory.
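
For clarity, the relevant flink-conf.yaml entries we are effectively running
with (a sketch of what we assume the 1.8 defaults to be, since we did not
override them):

    taskmanager.numberOfTaskSlots: 8
    taskmanager.network.memory.fraction: 0.1   # ~400 MB of a 4 GB TM
    taskmanager.network.memory.min: 64mb
    taskmanager.network.memory.max: 1gb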

Exception:
java.io.IOException: Insufficient number of network buffers: required 2,
but only 0 available. The total number of network buffers is currently set
to 13112 of 32768 bytes each. You can increase this number by setting the
configuration keys 'taskmanager.network.memory.fraction',
'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
at
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:138)
at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.assignExclusiveSegments(SingleInputGate.java:311)
at
org.apache.flink.runtime.io.network.NetworkEnvironment.setupInputGate(NetworkEnvironment.java:271)
at
org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:224)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:614)
at java.lang.Thread.run(Thread.java:748)

Re: sporadic "Insufficient no of network buffers" issue

Posted by Rahul Patwari <ra...@gmail.com>.
After debugging further, it seems this issue is caused by the scheduling
strategy.
Depending on which tasks get assigned to a task manager, the memory
configured for network buffers probably runs out.

Through these references: FLINK-12122
<https://issues.apache.org/jira/browse/FLINK-12122>, FLINK-15031
<https://issues.apache.org/jira/browse/FLINK-15031>, and the Flink 1.10 release
notes
<https://ci.apache.org/projects/flink/flink-docs-stable/release-notes/flink-1.10.html>
we came to know that the scheduling strategy changed in 1.5.0 (FLIP-6)
compared to 1.4.2, and that from 1.9.2 onwards the old behavior can be
restored through a scheduling-strategy configuration option -
cluster.evenly-spread-out-slots: true
<https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#cluster-evenly-spread-out-slots>

The "spread out" strategy (sketched below) could definitely help in this case.
Can you please confirm our findings and suggest possible ways to mitigate
this issue?
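
For reference, this is the flink-conf.yaml change we are planning to try (a
minimal sketch; per the links above the option is available from 1.9.2):

    # spread slots evenly across all registered TaskManagers
    cluster.evenly-spread-out-slots: true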

Rahul

On Sat, Aug 1, 2020 at 9:24 PM Rahul Patwari <ra...@gmail.com>
wrote:

> From the metrics in Prometheus, we observed that the minimum
> AvailableMemorySegments across all the task managers was 4.5k when the
> exception was thrown.
> So there were enough network buffers.
> Correction to the configs provided above: each TM has 8 CPU cores.
>
> Apart from having too few network buffers, can something else trigger this
> issue?
> Also, is it expected that the issue is sporadic?
>
> Rahul
>
> On Sat, Aug 1, 2020 at 12:24 PM Ivan Yang <iv...@gmail.com> wrote:
>
>> Yes, increase taskmanager.network.memory.fraction in your case. Also,
>> reducing the parallelism will reduce the number of network buffers required
>> for your job. I never used 1.4.x, so I don’t know about it.
>>
>> Ivan
>>
>> On Jul 31, 2020, at 11:37 PM, Rahul Patwari <ra...@gmail.com>
>> wrote:
>>
>> Thanks for your reply, Ivan.
>>
>> I think taskmanager.network.memory.max is 1GB by default.
>> In my case, the network buffer memory is 13112 * 32768 bytes = around 400MB,
>> which is 10% of the TM memory, since taskmanager.network.memory.fraction
>> defaults to 0.1.
>> Do you mean to increase taskmanager.network.memory.fraction?
>>
>>    1. If Flink is upgraded from 1.4.2 to 1.8.2, does the application
>>    need more network buffers?
>>    2. Can this issue happen sporadically? Sometimes the issue is not
>>    seen when the job manager is restarted.
>>
>> I am wondering whether having too few network buffers is the root cause, or
>> whether the root cause is something else that triggers this issue.
>>
>> On Sat, Aug 1, 2020 at 9:36 AM Ivan Yang <iv...@gmail.com> wrote:
>>
>>> Hi Rahul,
>>>
>>> Try increasing taskmanager.network.memory.max to 1GB, basically double
>>> what you have now. However, you only have 4GB RAM for the entire TM, and it
>>> seems out of proportion to have 1GB of network buffers with 4GB total RAM.
>>> Reducing the amount of shuffling will require fewer network buffers. But if
>>> your job needs the shuffling, then you may consider adding more memory to
>>> the TM.
>>>
>>> Thanks,
>>> Ivan
>>>
>>> On Jul 31, 2020, at 2:02 PM, Rahul Patwari <ra...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> We are sporadically observing an "Insufficient number of network buffers"
>>> issue after upgrading Flink from 1.4.2 to 1.8.2.
>>> The state of the tasks hitting this issue transitioned from DEPLOYING to
>>> FAILED.
>>> Whenever this issue occurs, the job manager restarts. Sometimes the
>>> issue goes away after the restart.
>>> As we are not hitting the issue consistently, we are unsure whether to
>>> change the memory configuration or not.
>>>
>>> Min. recommended number of network buffers (slots-per-TM^2 * #TMs * 4):
>>> (8 * 8) * 8 * 4 = 2048
>>> The exception says that 13112 network buffers are configured, which is
>>> more than 6x the recommendation.
>>>
>>> Is reducing the number of shuffles the only way to reduce the number of
>>> network buffers required?
>>>
>>> Thanks,
>>> Rahul
>>>
>>> configs:
>>> env: Kubernetes
>>> Flink: 1.8.2
>>> using default configs for memory.fraction, memory.min, memory.max.
>>> using 8 TM, 8 slots/TM
>>> Each TM is running with 1 core, 4 GB Memory.
>>>
>>> Exception:
>>> java.io.IOException: Insufficient number of network buffers: required 2,
>>> but only 0 available. The total number of network buffers is currently set
>>> to 13112 of 32768 bytes each. You can increase this number by setting the
>>> configuration keys 'taskmanager.network.memory.fraction',
>>> 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
>>> at
>>> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:138)
>>> at
>>> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.assignExclusiveSegments(SingleInputGate.java:311)
>>> at
>>> org.apache.flink.runtime.io.network.NetworkEnvironment.setupInputGate(NetworkEnvironment.java:271)
>>> at
>>> org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:224)
>>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:614)
>>> at java.lang.Thread.run(Thread.java:748)
>>>
>>>
>>>
>>

Re: sporadic "Insufficient no of network buffers" issue

Posted by Rahul Patwari <ra...@gmail.com>.
From the metrics in Prometheus, we observed that the minimum
AvailableMemorySegments across all the task managers was 4.5k when the
exception was thrown.
So there were enough network buffers.
Correction to the configs provided above: each TM has 8 CPU cores.

Apart from having too few network buffers, can something else trigger this
issue?
Also, is it expected that the issue is sporadic?

Rahul

On Sat, Aug 1, 2020 at 12:24 PM Ivan Yang <iv...@gmail.com> wrote:

> Yes, increase taskmanager.network.memory.fraction in your case. Also,
> reducing the parallelism will reduce the number of network buffers required
> for your job. I never used 1.4.x, so I don’t know about it.
>
> Ivan
>
> On Jul 31, 2020, at 11:37 PM, Rahul Patwari <ra...@gmail.com>
> wrote:
>
> Thanks for your reply, Ivan.
>
> I think taskmanager.network.memory.max is 1GB by default.
> In my case, the network buffer memory is 13112 * 32768 bytes = around 400MB,
> which is 10% of the TM memory, since taskmanager.network.memory.fraction
> defaults to 0.1.
> Do you mean to increase taskmanager.network.memory.fraction?
>
>    1. If Flink is upgraded from 1.4.2 to 1.8.2, does the application
>    need more network buffers?
>    2. Can this issue happen sporadically? Sometimes the issue is not
>    seen when the job manager is restarted.
>
> I am wondering whether having too few network buffers is the root cause, or
> whether the root cause is something else that triggers this issue.
>
> On Sat, Aug 1, 2020 at 9:36 AM Ivan Yang <iv...@gmail.com> wrote:
>
>> Hi Rahul,
>>
>> Try increasing taskmanager.network.memory.max to 1GB, basically double
>> what you have now. However, you only have 4GB RAM for the entire TM, and it
>> seems out of proportion to have 1GB of network buffers with 4GB total RAM.
>> Reducing the amount of shuffling will require fewer network buffers. But if
>> your job needs the shuffling, then you may consider adding more memory to
>> the TM.
>>
>> Thanks,
>> Ivan
>>
>> On Jul 31, 2020, at 2:02 PM, Rahul Patwari <ra...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> We are sporadically observing an "Insufficient number of network buffers"
>> issue after upgrading Flink from 1.4.2 to 1.8.2.
>> The state of the tasks hitting this issue transitioned from DEPLOYING to
>> FAILED.
>> Whenever this issue occurs, the job manager restarts. Sometimes the
>> issue goes away after the restart.
>> As we are not hitting the issue consistently, we are unsure whether to
>> change the memory configuration or not.
>>
>> Min. recommended number of network buffers (slots-per-TM^2 * #TMs * 4):
>> (8 * 8) * 8 * 4 = 2048
>> The exception says that 13112 network buffers are configured, which is
>> more than 6x the recommendation.
>>
>> Is reducing the number of shuffles the only way to reduce the number of
>> network buffers required?
>>
>> Thanks,
>> Rahul
>>
>> configs:
>> env: Kubernetes
>> Flink: 1.8.2
>> using default configs for memory.fraction, memory.min, memory.max.
>> using 8 TM, 8 slots/TM
>> Each TM is running with 1 core, 4 GB Memory.
>>
>> Exception:
>> java.io.IOException: Insufficient number of network buffers: required 2,
>> but only 0 available. The total number of network buffers is currently set
>> to 13112 of 32768 bytes each. You can increase this number by setting the
>> configuration keys 'taskmanager.network.memory.fraction',
>> 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
>> at
>> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:138)
>> at
>> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.assignExclusiveSegments(SingleInputGate.java:311)
>> at
>> org.apache.flink.runtime.io.network.NetworkEnvironment.setupInputGate(NetworkEnvironment.java:271)
>> at
>> org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:224)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:614)
>> at java.lang.Thread.run(Thread.java:748)
>>
>>
>>
>

Re: sporadic "Insufficient no of network buffers" issue

Posted by Ivan Yang <iv...@gmail.com>.
Yes, increase taskmanager.network.memory.fraction in your case. Also, reducing the parallelism will reduce the number of network buffers required for your job. I never used 1.4.x, so I don’t know about it.
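
For illustration only, a rough flink-conf.yaml sketch of that suggestion (the exact values are assumptions you would need to tune for your job):

    # give the network stack a larger share of TM memory (default fraction is 0.1)
    taskmanager.network.memory.fraction: 0.2
    # a lower default parallelism means fewer channels and hence fewer buffers
    parallelism.default: 4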

Ivan

> On Jul 31, 2020, at 11:37 PM, Rahul Patwari <ra...@gmail.com> wrote:
> 
> Thanks for your reply, Ivan.
> 
> I think taskmanager.network.memory.max is 1GB by default.
> In my case, the network buffer memory is 13112 * 32768 bytes = around 400MB, which is 10% of the TM memory, since taskmanager.network.memory.fraction defaults to 0.1.
> Do you mean to increase taskmanager.network.memory.fraction?
> 1. If Flink is upgraded from 1.4.2 to 1.8.2, does the application need more network buffers?
> 2. Can this issue happen sporadically? Sometimes the issue is not seen when the job manager is restarted.
> I am wondering whether having too few network buffers is the root cause, or whether the root cause is something else that triggers this issue.
> 
> On Sat, Aug 1, 2020 at 9:36 AM Ivan Yang <ivanygyang@gmail.com <ma...@gmail.com>> wrote:
> Hi Rahul,
> 
> Try increasing taskmanager.network.memory.max to 1GB, basically double what you have now. However, you only have 4GB RAM for the entire TM, and it seems out of proportion to have 1GB of network buffers with 4GB total RAM. Reducing the amount of shuffling will require fewer network buffers. But if your job needs the shuffling, then you may consider adding more memory to the TM.
> 
> Thanks,
> Ivan
> 
>> On Jul 31, 2020, at 2:02 PM, Rahul Patwari <rahulpatwari8383@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> We are sporadically observing an "Insufficient number of network buffers" issue after upgrading Flink from 1.4.2 to 1.8.2.
>> The state of the tasks hitting this issue transitioned from DEPLOYING to FAILED.
>> Whenever this issue occurs, the job manager restarts. Sometimes the issue goes away after the restart.
>> As we are not hitting the issue consistently, we are unsure whether to change the memory configuration or not.
>> 
>> Min. recommended number of network buffers (slots-per-TM^2 * #TMs * 4): (8 * 8) * 8 * 4 = 2048
>> The exception says that 13112 network buffers are configured, which is more than 6x the recommendation.
>> 
>> Is reducing the number of shuffles the only way to reduce the number of network buffers required?
>> 
>> Thanks,
>> Rahul 
>> 
>> configs:
>> env: Kubernetes 
>> Flink: 1.8.2
>> using default configs for memory.fraction, memory.min, memory.max.
>> using 8 TM, 8 slots/TM
>> Each TM is running with 1 core, 4 GB Memory.
>> 
>> Exception:
>> java.io.IOException: Insufficient number of network buffers: required 2, but only 0 available. The total number of network buffers is currently set to 13112 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.network.memory.fraction', 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
>> at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:138)
>> at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.assignExclusiveSegments(SingleInputGate.java:311)
>> at org.apache.flink.runtime.io.network.NetworkEnvironment.setupInputGate(NetworkEnvironment.java:271)
>> at org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:224)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:614)
>> at java.lang.Thread.run(Thread.java:748)
> 


Re: sporadic "Insufficient no of network buffers" issue

Posted by Rahul Patwari <ra...@gmail.com>.
Thanks for your reply, Ivan.

I think taskmanager.network.memory.max is 1GB by default.
In my case, the network buffer memory is 13112 * 32768 bytes = around 400MB,
which is 10% of the TM memory, since taskmanager.network.memory.fraction
defaults to 0.1.
Do you mean to increase taskmanager.network.memory.fraction?

   1. If Flink is upgraded from 1.4.2 to 1.8.2, does the application
   need more network buffers?
   2. Can this issue happen sporadically? Sometimes the issue is not seen
   when the job manager is restarted.

I am wondering whether having too few network buffers is the root cause, or
whether the root cause is something else that triggers this issue.

On Sat, Aug 1, 2020 at 9:36 AM Ivan Yang <iv...@gmail.com> wrote:

> Hi Rahul,
>
> Try increasing taskmanager.network.memory.max to 1GB, basically double
> what you have now. However, you only have 4GB RAM for the entire TM, and it
> seems out of proportion to have 1GB of network buffers with 4GB total RAM.
> Reducing the amount of shuffling will require fewer network buffers. But if
> your job needs the shuffling, then you may consider adding more memory to
> the TM.
>
> Thanks,
> Ivan
>
> On Jul 31, 2020, at 2:02 PM, Rahul Patwari <ra...@gmail.com>
> wrote:
>
> Hi,
>
> We are sporadically observing an "Insufficient number of network buffers"
> issue after upgrading Flink from 1.4.2 to 1.8.2.
> The state of the tasks hitting this issue transitioned from DEPLOYING to
> FAILED.
> Whenever this issue occurs, the job manager restarts. Sometimes the issue
> goes away after the restart.
> As we are not hitting the issue consistently, we are unsure whether to
> change the memory configuration or not.
>
> Min. recommended number of network buffers (slots-per-TM^2 * #TMs * 4):
> (8 * 8) * 8 * 4 = 2048
> The exception says that 13112 network buffers are configured, which is
> more than 6x the recommendation.
>
> Is reducing the number of shuffles the only way to reduce the number of
> network buffers required?
>
> Thanks,
> Rahul
>
> configs:
> env: Kubernetes
> Flink: 1.8.2
> using default configs for memory.fraction, memory.min, memory.max.
> using 8 TM, 8 slots/TM
> Each TM is running with 1 core, 4 GB Memory.
>
> Exception:
> java.io.IOException: Insufficient number of network buffers: required 2,
> but only 0 available. The total number of network buffers is currently set
> to 13112 of 32768 bytes each. You can increase this number by setting the
> configuration keys 'taskmanager.network.memory.fraction',
> 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
> at
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:138)
> at
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.assignExclusiveSegments(SingleInputGate.java:311)
> at
> org.apache.flink.runtime.io.network.NetworkEnvironment.setupInputGate(NetworkEnvironment.java:271)
> at
> org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:224)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:614)
> at java.lang.Thread.run(Thread.java:748)
>
>
>

Re: sporadic "Insufficient no of network buffers" issue

Posted by Ivan Yang <iv...@gmail.com>.
Hi Rahul,

Try increasing taskmanager.network.memory.max to 1GB, basically double what you have now. However, you only have 4GB RAM for the entire TM, and it seems out of proportion to have 1GB of network buffers with 4GB total RAM. Reducing the amount of shuffling will require fewer network buffers. But if your job needs the shuffling, then you may consider adding more memory to the TM.
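
Something along these lines in flink-conf.yaml, purely as a sketch (the values are assumptions and would need to fit your pod sizing):

    taskmanager.heap.size: 6g                  # assumed larger TM, adjust to your pods
    taskmanager.network.memory.max: 1gb
    taskmanager.network.memory.fraction: 0.2   # with the default 0.1, a ~4GB TM stays well below 1gb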

Thanks,
Ivan

> On Jul 31, 2020, at 2:02 PM, Rahul Patwari <ra...@gmail.com> wrote:
> 
> Hi,
> 
> We are sporadically observing an "Insufficient number of network buffers" issue after upgrading Flink from 1.4.2 to 1.8.2.
> The state of the tasks hitting this issue transitioned from DEPLOYING to FAILED.
> Whenever this issue occurs, the job manager restarts. Sometimes the issue goes away after the restart.
> As we are not hitting the issue consistently, we are unsure whether to change the memory configuration or not.
> 
> Min. recommended number of network buffers (slots-per-TM^2 * #TMs * 4): (8 * 8) * 8 * 4 = 2048
> The exception says that 13112 network buffers are configured, which is more than 6x the recommendation.
> 
> Is reducing the number of shuffles the only way to reduce the number of network buffers required?
> 
> Thanks,
> Rahul 
> 
> configs:
> env: Kubernetes 
> Flink: 1.8.2
> using default configs for memory.fraction, memory.min, memory.max.
> using 8 TM, 8 slots/TM
> Each TM is running with 1 core, 4 GB Memory.
> 
> Exception:
> java.io.IOException: Insufficient number of network buffers: required 2, but only 0 available. The total number of network buffers is currently set to 13112 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.network.memory.fraction', 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
> at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:138)
> at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.assignExclusiveSegments(SingleInputGate.java:311)
> at org.apache.flink.runtime.io.network.NetworkEnvironment.setupInputGate(NetworkEnvironment.java:271)
> at org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:224)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:614)
> at java.lang.Thread.run(Thread.java:748)