Posted to users@kafka.apache.org by Robert Quinlivan <rq...@signal.co> on 2017/03/15 16:11:05 UTC

Offset commit request failing

Good morning,

I'm hoping for some help understanding the expected behavior for an offset
commit request and why this request might fail on the broker.

*Context:*

My configuration looks like this (a rough consumer-config sketch follows the list):

   - Three brokers
   - Consumer offsets topic replication factor set to 3
   - Auto commit enabled
   - The user application topic, which I will call "my_topic", has a
   replication factor of 3 as well and 800 partitions
   - 4000 consumers attached in consumer group "my_group"


*Issue:*

When I attach the consumers, the coordinator logs the following error
message repeatedly for each generation:

ERROR [Group Metadata Manager on Broker 0]: Appending metadata message for
group my_group generation 2066 failed due to
org.apache.kafka.common.errors.RecordTooLargeException, returning UNKNOWN
error code to the client (kafka.coordinator.GroupMetadataManager)

*Observed behavior:*

The consumer group does not stay connected long enough to consume messages.
It is effectively stuck in a rebalance loop and the "my_topic" data has
become unavailable.


*Investigation:*

Following the Group Metadata Manager code, it looks like the broker is
writing to a cache after it writes an Offset Commit Request to the log
file. If this cache write fails, the broker then logs this error and
returns an error code in the response. In this case, the error from the
cache is MESSAGE_TOO_LARGE, which is logged as a RecordTooLargeException.
However, the broker then sets the error code to UNKNOWN on the Offset
Commit Response.
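
If it helps, here is a rough Java paraphrase of that behavior. The real code is
Scala (kafka.coordinator.GroupMetadataManager); the names, types, and structure
below are illustrative assumptions only, meant to show the log-vs-response mismatch:

    // Illustrative paraphrase of the error translation described above.
    public class OffsetCommitErrorMapping {
        enum ErrorCode { NONE, MESSAGE_TOO_LARGE, RECORD_LIST_TOO_LARGE, UNKNOWN }

        static ErrorCode toResponseCode(ErrorCode appendResult) {
            switch (appendResult) {
                case MESSAGE_TOO_LARGE:
                case RECORD_LIST_TOO_LARGE:
                    // The broker logs the oversized-message failure (seen above as
                    // RecordTooLargeException) but reports only UNKNOWN to the client,
                    // so the consumer cannot tell why its commit failed.
                    return ErrorCode.UNKNOWN;
                default:
                    return appendResult;
            }
        }

        public static void main(String[] args) {
            System.out.println(toResponseCode(ErrorCode.MESSAGE_TOO_LARGE)); // prints UNKNOWN
        }
    }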

It seems that the issue is the size of the metadata in the Offset Commit
Request. I have the following questions:

   1. What is the size limit for this request? Are we exceeding that size,
   and is that what is causing the request to fail?
   2. If this is an issue with metadata size, what would cause abnormally
   large metadata?
   3. How is this cache used within the broker?


Thanks in advance for any insights you can provide.

Regards,
Robert Quinlivan
Software Engineer, Signal

Re: Offset commit request failing

Posted by Robert Quinlivan <rq...@signal.co>.
Thanks for the response. Reading through that thread, it appears that this
issue was addressed with KAFKA-3810
<https://issues.apache.org/jira/browse/KAFKA-3810>. This change eases the
restriction on fetch size between replicas. However, should the outcome be
a more comprehensive change to the serialization format of the request? The
size of the group metadata currently grows linearly with the number of
topic-partitions. This is difficult to tune for in a configuration using
topic auto creation.



On Fri, Mar 17, 2017 at 3:17 AM, James Cheng <wu...@gmail.com> wrote:

> I think it's due to the high number of partitions and the high number of
> consumers in the group. The group coordination info to keep track of the
> assignments actually happens via a message that travels through the
> __consumer_offsets topic. So with so many partitions and consumers, the
> message gets too big to go through the topic.
>
> There is a long thread here that discusses it. I don't remember what
> specific actions came out of that discussion.
> http://search-hadoop.com/m/Kafka/uyzND1yd26N1rFtRd1?subj=+DISCUSS+scalability+limits+in+the+coordinator
>
> -James
>
> Sent from my iPhone
>



-- 
Robert Quinlivan
Software Engineer, Signal

Re: Offset commit request failing

Posted by James Cheng <wu...@gmail.com>.
I think it's due to the high number of partitions and the high number of consumers in the group. The group coordination info to keep track of the assignments actually happens via a message that travels through the __consumer_offsets topic. So with so many partitions and consumers, the message gets too big to go through the topic.

There is a long thread here that discusses it. I don't remember what specific actions came out of that discussion. http://search-hadoop.com/m/Kafka/uyzND1yd26N1rFtRd1?subj=+DISCUSS+scalability+limits+in+the+coordinator
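
As a rough back-of-envelope estimate (the per-member byte count below is an
assumption, not a measured figure): the group metadata record written to
__consumer_offsets carries every member's id, client id, host, subscription,
and assignment, so it grows with both the member count and the number of
partitions being assigned. With 4000 members, even a few hundred bytes per
member is enough to exceed the broker's default message.max.bytes of about 1 MB:

    // Hypothetical sizing sketch; 300 bytes per member is an assumed figure.
    public class GroupMetadataSizeEstimate {
        public static void main(String[] args) {
            int members = 4000;                   // consumers in "my_group"
            int bytesPerMember = 300;             // assumed: ids, host, subscription, assignment
            int defaultMaxMessageBytes = 1000012; // broker default message.max.bytes
            int estimated = members * bytesPerMember;
            System.out.printf("estimated record size: %d bytes, limit: %d bytes%n",
                    estimated, defaultMaxMessageBytes); // ~1.2 MB vs ~1 MB
        }
    }

If that estimate is in the right ballpark, the usual workaround (as I understand
it) is to raise max.message.bytes on the __consumer_offsets topic, keeping in
mind that replication of such oversized records is what KAFKA-3810 addresses on
the replica fetch side.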

-James

Sent from my iPhone


Re: Offset commit request failing

Posted by Robert Quinlivan <rq...@signal.co>.
I should also mention that this error was seen on broker version 0.10.1.1.
I found that this condition sounds somewhat similar to KAFKA-4362
<https://issues.apache.org/jira/browse/KAFKA-4362>, but that issue was
fixed in 0.10.1.1 so they appear to be different issues.




-- 
Robert Quinlivan
Software Engineer, Signal