Posted to user@flink.apache.org by gerardg <ge...@talaia.io> on 2018/10/26 09:07:52 UTC

Unbalanced Kafka consumer consumption

Hi,

We are experiencing issues scaling our Flink application, and we have
observed that it may be because Kafka message consumption is not balanced
across partitions. The attached image (lag per partition) shows how only one
partition consumes messages (the blue one in the back), and it wasn't until
it finished that the other ones started to consume at a good rate (in fact,
the total throughput multiplied by 4 when they started). Also, when those
partitions started to consume, one partition simply stopped and accumulated
messages again until the others finished.

We don't see any resource (CPU, network, disk...) struggling in our cluster,
so we are not sure what could be causing this behavior. I can only assume
that somehow Flink or the Kafka consumer is artificially slowing down the
other partitions. Maybe due to how back pressure is handled?
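
For reference, the source is set up roughly like this (a simplified sketch;
the topic name, parallelism, and consumer version are placeholders, not our
actual values). Each source subtask reads all of the partitions assigned to
it through a single consumer, so if one subtask is held back, all of its
partitions lag together:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "kafka:9092");
    props.setProperty("group.id", "example-group");

    // With source parallelism 4 on a 16-partition topic, each source
    // subtask is deterministically assigned 4 partitions; back pressure
    // on one subtask stalls all 4 of them at once.
    DataStream<String> source = env
        .addSource(new FlinkKafkaConsumer011<>("example-topic",
            new SimpleStringSchema(), props))
        .setParallelism(4);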

<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1007/consumer_max_lag.png> 

Gerard






Re: Unbalanced Kafka consumer consumption

Posted by Gerard Garcia <ge...@talaia.io>.
The stream is partitioned by key after ingestion at the finest granularity
that we can (which is finer than how the stream is partitioned when produced
to Kafka). It is not perfectly balanced, but it is not so unbalanced as to
show this behavior (it is more balanced than what the lag images show).
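
Schematically, the repartitioning after ingestion looks like this (a
simplified sketch; the Event type and key fields are placeholders, our real
key is a composite of several fields):

    // source is the DataStream<Event> coming out of the Kafka consumer.
    // keyBy() hash-partitions records by the extracted key, so after the
    // source the distribution depends on the key, not on which Kafka
    // partition a record came from.
    KeyedStream<Event, String> keyed = source
        .keyBy(event -> event.getUserId() + "|" + event.getSessionId());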

Anyway, let's assume that the problem is that the stream is so unbalanced
that one operator subtask can't handle the ingestion rate. Is it expected
then that all the other operator subtasks reduce their ingestion rate even
if they have resources to spare? The job is configured with processing time
and there are no windows. If that is the case, is there a way to let
operator subtasks process freely even if one of them is causing back
pressure upstream?

The attached images show how Kafka lag increases while the throughput stays
stable until some operator subtasks finish.

Thanks,

Gerard

On Fri, Oct 26, 2018 at 8:09 PM Elias Levy <fe...@gmail.com>
wrote:

> You can always shuffle the stream generated by the Kafka source
> (dataStream.shuffle()) to evenly distribute records downstream.

Re: Unbalanced Kafka consumer consumption

Posted by Elias Levy <fe...@gmail.com>.
You can always shuffle the stream generated by the Kafka source
(dataStream.shuffle()) to evenly distribute records downstream.
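
For example (a minimal sketch; the consumer setup here is illustrative, not
taken from your job):

    // shuffle() redistributes records randomly across all downstream
    // subtasks, so downstream load no longer mirrors the source's
    // partition assignment.
    DataStream<String> balanced = env
        .addSource(new FlinkKafkaConsumer011<>("your-topic",
            new SimpleStringSchema(), props))
        .shuffle();

Note that shuffle() gives up per-partition ordering, and if you need keyed
state downstream you would still keyBy() afterwards.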
