Posted to users@kafka.apache.org by Jason Rosenberg <jb...@squareup.com> on 2013/10/21 10:50:03 UTC

Consumer lag issues

(I've changed the subject of this thread (was "Preparing for the 0.8 final
release"))

So, I'm not sure that my issue is exactly the same as that mentioned in the
FAQ.

Anyway, in looking at the MaxLag values for several consumers (not all
consuming the same topics), it looks like there was a strange interaction.

First, a new consumer app was started, which initially had a lot of MaxLag
for a period of about a day.  This lag stayed constant at about 175M, then
once this new consumer caught up, its lag dropped quickly toward zero.
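
For reference, here is a minimal sketch of how a consumer's MaxLag gauge can
be read over JMX.  The JMX port and the exact MBean object name below are
assumptions about an 0.8-era consumer, not details taken from this thread.

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class MaxLagProbe {
        public static void main(String[] args) throws Exception {
            // Assumed JMX endpoint of the consumer JVM; adjust host/port as needed.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Assumed 0.8-era gauge name; the client id is part of the name,
                // so match by pattern rather than hard-coding it.
                Set<ObjectName> names = mbs.queryNames(new ObjectName(
                    "\"kafka.consumer\":type=\"ConsumerFetcherManager\",name=*MaxLag*"), null);
                for (ObjectName name : names) {
                    System.out.println(name + " = " + mbs.getAttribute(name, "Value"));
                }
            }
        }
    }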

What's interesting is that around the time the new consumer seemed to have
caught up, 4 other consumers which previously had undetectable lag suddenly
had their lag values shoot up, taking between 2 and 4 days to recover.  Also
interesting: these other consumers were mostly consuming different topics
than the new consumer that triggered this lag swap.

The one consumer that took the longest to recover (4 days) was consuming the
same topic as the new one, and its lag value stayed roughly constant at
about 175M messages until it recovered.

All these consumers are using their own groupids.
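
For context, here is a minimal sketch of the kind of setup being described,
assuming the 0.8 high-level consumer API: each app uses its own group.id and
consumes a set of topics selected by a whitelist filter.  The ZooKeeper
address, group id, and topic pattern are placeholders.

    import java.util.List;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.consumer.Whitelist;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class FilteredConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181");  // placeholder
            props.put("group.id", "my-app-consumer");          // each app uses its own group id

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            // One stream covering every topic that matches the whitelist pattern.
            List<KafkaStream<byte[], byte[]>> streams =
                connector.createMessageStreamsByFilter(new Whitelist("service\\..*"), 1);

            for (MessageAndMetadata<byte[], byte[]> record : streams.get(0)) {
                System.out.println(record.topic() + " @ " + record.offset());
            }
        }
    }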

Does any of this make sense?

One thing to note is that I expect the new consumer which started the issues
was a super-fast consumer (no downstream IO connections, etc.).

So, could one consumer affect other consumers of unrelated topics?  And does
it make any sense that one consumer would catch up, only to cause several
other consumers to suddenly fall behind?

Thanks :)

Jason






On Sun, Oct 20, 2013 at 12:18 PM, Jun Rao <ju...@gmail.com> wrote:

> The fetch thread uses multi-fetch. Have you looked at
>
> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whymessagesaredelayedinmyconsumer%3F
> ,
> which may be related to your issue.
>
> Thanks,
>
> Jun
>
>
> On Sun, Oct 20, 2013 at 2:48 AM, Jason Rosenberg <jb...@squareup.com> wrote:
>
> > Ok,
> >
> > So here's an outline of what I think seems to have happened.
> >
> > I have a consumer that uses a filter to consume a large number of topics
> > (e.g. several hundred).  Each topic has only a single partition.
> >
> > It normally has no trouble keeping up processing all messages on all
> > topics.   However, we had a case a couple days ago where it seemed to
> hang,
> > and not consume anything for several hours.  I restarted the consumer
> (and
> > now I've updated it from 0.8-beta1 to 0.8-latest-HEAD).  Data is flowing
> > again, but some topics seem to be taking much longer than others to
> catch
> > up.  The slow ones seem to be the topics that have more data than others
> (a
> > loose theory at present).
> >
> > Does that make sense?  If I understand things correctly, the consumer
> will
> > fetch chunks of data from each topic/partition, in order, in a big loop?
> >  So if it has caught up with most of the topics, will it waste time
> > re-polling all those (and getting nothing) before coming back to the
> topics
> > that are lagging?  Perhaps having a larger fetch size would help here?
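
A hedged sketch of the fetch-related settings in question, assuming the
0.8-era consumer config names; the values are placeholders rather than
recommendations.

    import java.util.Properties;

    public class ConsumerFetchTuning {
        // Builds consumer properties with a larger per-partition fetch size (placeholder values).
        public static Properties fetchTunedProps(String zkConnect, String groupId) {
            Properties props = new Properties();
            props.put("zookeeper.connect", zkConnect);
            props.put("group.id", groupId);
            // Bytes fetched per topic/partition per request; a larger value lets a
            // lagging consumer pull bigger chunks each time the fetch loop visits
            // that partition.
            props.put("fetch.message.max.bytes", Integer.toString(2 * 1024 * 1024));
            // The broker holds a fetch until this many bytes are available or the wait expires.
            props.put("fetch.min.bytes", "1");
            props.put("fetch.wait.max.ms", "100");
            return props;
        }
    }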
> >
> > Jason
> >
> >
> > On Sat, Oct 19, 2013 at 6:24 PM, Jason Rosenberg <jb...@squareup.com>
> wrote:
> >
> > > I'll try to, next time it hangs!
> > >
> > >
> > > On Sat, Oct 19, 2013 at 4:04 PM, Neha Narkhede <
> neha.narkhede@gmail.com
> > >wrote:
> > >
> > >> Can you send around a thread dump of the halted consumer process?
> > >>
> > >>
> > >>
> > >> On Sat, Oct 19, 2013 at 12:16 PM, Jason Rosenberg <jb...@squareup.com>
> > >> wrote:
> > >>
> > >> > The latest HEAD does seem to solve one issue, where a new topic
> being
> > >> > created after the consumer is started, would not be consumed.
> > >> >
> > >> > But the bigger issue is that we have a couple different consumers
> both
> > >> > consuming the same set of topics (under different groupids), and
> > hanging
> > >> > after a while (both hanging at about the same point).  The topics in
> > >> each
> > >> > case are selected with a filter (actually a relatively large number
> of
> > >> > topics, some of which are newly created over time).  I'm still not
> > sure
> > >> > whether the new version is solving this issue (since it was a rare
> > >> > transient thing anyway).
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jason
> > >> >
> > >> >
> > >> > On Sat, Oct 19, 2013 at 2:03 AM, Jun Rao <ju...@gmail.com> wrote:
> > >> >
> > >> > > Yes, 0.8 will be released from the HEAD of the 0.8 branch. Is the
> > >> problem
> > >> > > with consuming new topics or topics whose partitions are
> increased?
> > If
> > >> > so,
> > >> > > see KAFKA-1030 and KAFKA-1075.
> > >> > >
> > >> > > Thanks,
> > >> > >
> > >> > > Jun
> > >> > >
> > >> > >
> > >> > > On Fri, Oct 18, 2013 at 4:03 PM, Jason Rosenberg <
> jbr@squareup.com>
> > >> > wrote:
> > >> > >
> > >> > > > Will the 0.8 release come from the HEAD of the 0.8 branch?  I'd
> > >> like to
> > >> > > > experiment with it, to see if it solves some of the issues I'm
> > >> seeing,
> > >> > > with
> > >> > > > consumers refusing to consume new messages.  We've been using
> the
> > >> beta1
> > >> > > > version.
> > >> > > >
> > >> > > > I remember mention there was a Jira issue along these lines,
> > which
> > >> was
> > >> > > > fixed post 0.8-beta1.  Which issue was that (I'd like to see if
> it
> > >> > > matches
> > >> > > > what I'm seeing).
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Jason
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Oct 9, 2013 at 8:04 PM, Jay Kreps <ja...@gmail.com>
> > >> wrote:
> > >> > > >
> > >> > > > > I uploaded a patch against trunk which also fixes KAFKA-1036,
> > the
> > >> > other
> > >> > > > > known Windows issue. Review appreciated. Should be an easy
> one.
> > >> > > > >
> > >> > > > > https://issues.apache.org/jira/browse/KAFKA-1008
> > >> > > > >
> > >> > > > > -Jay
> > >> > > > >
> > >> > > > >
> > >> > > > > On Wed, Oct 9, 2013 at 8:56 AM, Jun Rao <ju...@gmail.com>
> > wrote:
> > >> > > > >
> > >> > > > > > KAFKA-1008 has been checked into the 0.8 branch and needs to
> > be
> > >> > > > manually
> > >> > > > > > double-committed to trunk. To avoid merging problems, I
> > suggest
> > >> > that
> > >> > > > for
> > >> > > > > > all future changes in the 0.8 branch, we double commit them
> to
> > >> > trunk.
> > >> > > > Any
> > >> > > > > > objections?
> > >> > > > > >
> > >> > > > > > Thanks,
> > >> > > > > >
> > >> > > > > > Jun
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Mon, Oct 7, 2013 at 5:33 PM, Jun Rao <ju...@gmail.com>
> > >> wrote:
> > >> > > > > >
> > >> > > > > > > Hi, Everyone,
> > >> > > > > > >
> > >> > > > > > > I made another pass of the remaining jiras that we plan to
> > >> fix in
> > >> > > the
> > >> > > > > 0.8
> > >> > > > > > > final release.
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://issues.apache.org/jira/browse/KAFKA-954?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%20%220.8%22%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)
> > >> > > > > > >
> > >> > > > > > > Do people agree with this list?
> > >> > > > > > >
> > >> > > > > > > Joe,
> > >> > > > > > >
> > >> > > > > > > I don't have a good understanding of KAFKA-1018. Do you
> think
> > >> this
> > >> > > > needs
> > >> > > > > to
> > >> > > > > > > be fixed in 0.8 final?
> > >> > > > > > >
> > >> > > > > > > Thanks,
> > >> > > > > > >
> > >> > > > > > > Jun
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Fri, Sep 13, 2013 at 9:18 AM, Jun Rao <
> junrao@gmail.com>
> > >> > wrote:
> > >> > > > > > >
> > >> > > > > > >> Hi, Everyone,
> > >> > > > > > >>
> > >> > > > > > >> We have been stabilizing the 0.8 branch since the beta1
> > >> > release. I
> > >> > > > > think
> > >> > > > > > >> we are getting close to an 0.8 final release. I made an
> > >> initial
> > >> > > list
> > >> > > > > of
> > >> > > > > > the
> > >> > > > > > >> remaining jiras that should be fixed in 0.8.
> > >> > > > > > >>
> > >> > > > > > >>
> > >> > > > > > >>
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%20%220.8%22%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)
> > >> > > > > > >>
> > >> > > > > > >> 1. Do people agree with the list?
> > >> > > > > > >>
> > >> > > > > > >> 2. If the list is good, could people help
> > >> contributing/reviewing
> > >> > > the
> > >> > > > > > >> remaining jiras?
> > >> > > > > > >>
> > >> > > > > > >> Thanks,
> > >> > > > > > >>
> > >> > > > > > >> Jun
> > >> > > > > > >>
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Consumer lag issues

Posted by Jun Rao <ju...@gmail.com>.
MaxLag corresponds to the partition that lags the most, so it can stay
high until all partitions have caught up.
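
In other words, the metric is just the largest per-partition lag.  As a rough
illustration of the arithmetic (not Kafka's actual code):

    import java.util.List;

    public class MaxLagIllustration {
        // Plain holder for one partition's offsets (illustrative, not a Kafka class).
        static class PartitionOffsets {
            final long logEndOffset;    // latest offset available on the broker
            final long consumedOffset;  // offset the consumer has fetched up to
            PartitionOffsets(long logEndOffset, long consumedOffset) {
                this.logEndOffset = logEndOffset;
                this.consumedOffset = consumedOffset;
            }
        }

        // MaxLag is the largest (logEndOffset - consumedOffset) across all partitions,
        // so it stays high until the single worst partition has caught up.
        static long maxLag(List<PartitionOffsets> partitions) {
            long max = 0;
            for (PartitionOffsets p : partitions) {
                max = Math.max(max, p.logEndOffset - p.consumedOffset);
            }
            return max;
        }
    }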

The second issue is weird. Lags across consumer groups should be more or
less independent. Could this be a producer side issue? Do you see a sudden
jump in the incoming byte rate?
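
One way to check for such a jump is the broker's aggregate bytes-in meter
over JMX.  A minimal sketch follows; the port and the MBean name are
assumptions about an 0.8-era broker, not details from this thread.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class BytesInRateProbe {
        public static void main(String[] args) throws Exception {
            // Assumed JMX endpoint of a broker; adjust host/port as needed.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Assumed 0.8-era meter for incoming bytes aggregated across all topics.
                ObjectName bytesIn = new ObjectName(
                    "\"kafka.server\":type=\"BrokerTopicMetrics\",name=\"AllTopicsBytesInPerSec\"");
                System.out.println("1-minute rate: " + mbs.getAttribute(bytesIn, "OneMinuteRate"));
                System.out.println("total bytes:   " + mbs.getAttribute(bytesIn, "Count"));
            }
        }
    }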

Thanks,

Jun

