Posted to users@kafka.apache.org by Gwen Shapira <gw...@confluent.io> on 2016/06/09 05:31:56 UTC

Re: MirrorMaker consumers getting stuck on certain partitions

Hi Tushar,

A few follow-up questions:
1. Can you enable logs for mirror maker (there should be a
conf/tools-log4j.properties file for this) at INFO level? This can
give us a clue as to why it stopped.

2. Did you check the FAQ entry for "why is my consumer hanging"? Most of the
reasons there (for example "message too large") can apply here too.

Gwen
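
For reference, raising the tools logging to INFO amounts to editing that file. A sketch, assuming the stock log4j layout Kafka ships with (appender names may differ in your build):

```properties
# conf/tools-log4j.properties -- raise the root level from WARN to INFO
log4j.rootLogger=INFO, stderr

log4j.appender.stderr=org.apache.log4j.ConsoleAppender
log4j.appender.stderr.Target=System.err
log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
log4j.appender.stderr.layout.ConversionPattern=[%d] %p %m (%c)%n
```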

On Wed, May 4, 2016 at 9:54 PM, Mhaskar, Tushar
<tm...@paypal.com.invalid> wrote:
> Has anyone encountered this error?
>
> Thanks
> Tushar
>
>
>
>
>
> On 5/2/16, 10:44 PM, "Mhaskar, Tushar" <tm...@paypal.com.INVALID> wrote:
>
>>Consumption on partitions 99, 0, and 98 is getting stuck.
>>
>>Thanks
>>Tushar
>>From: "Mhaskar, Tushar" <tm...@paypal.com>
>>Date: Monday, May 2, 2016 at 9:52 PM
>>To: "users@kafka.apache.org" <us...@kafka.apache.org>
>>Subject: MirrorMaker consumers getting stuck on certain partitions
>>
>>Hi,
>>
>>I am running MirrorMaker (version 0.9, new consumer). Sometimes I find that the consumer gets stuck on certain partitions, and the offset doesn't move in that case.
>>
>>I have 10 MM processes running, each with 10 streams. The topic has 100 partitions.
>>
>>Below is sample output from the consumer offset checker (I have truncated the output; the remaining partitions, apart from 99, 0, and 98, have similar lag in the thousands).
>>I can see the log end offset and lag increasing, but the current offset does not move, even though the MM process is still running.
>>
>>
>>GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 65, 118463859, 118471286, 7427, slca.lvs.fpti.mm-6_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 68, 118490771, 118492211, 1440, slca.lvs.fpti.mm-6_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 26, 118484045, 118493447, 9402, slca.lvs.fpti.mm-2_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 14, 118492467, 118502925, 10458, slca.lvs.fpti.mm-1_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118488448, 8257814, slca.lvs.fpti.mm-9_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118488594, 118497984, 9390, slca.lvs.fpti.mm-1_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 35, 118477757, 118478130, 373, slca.lvs.fpti.mm-3_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 80, 118472483, 118481154, 8671, slca.lvs.fpti.mm-8_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 21, 118478106, 118479540, 1434, slca.lvs.fpti.mm-2_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 59, 118487660, 118494558, 6898, slca.lvs.fpti.mm-5_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 48, 118483938, 118490119, 6181, slca.lvs.fpti.mm-4_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 95, 118490885, 118495214, 4329, slca.lvs.fpti.mm-9_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 58, 118499230, 118499608, 378, slca.lvs.fpti.mm-5_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 44, 118484130, 118491922, 7792, slca.lvs.fpti.mm-4_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 56, 118481454, 118485764, 4310, slca.lvs.fpti.mm-5_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 64, 118463461, 118471399, 7938, slca.lvs.fpti.mm-6_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 81, 118481575, 118482081, 506, slca.lvs.fpti.mm-8_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 0, 116010184, 118476683, 2466499, slca.lvs.fpti.mm-0_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 40, 118478651, 118486854, 8203, slca.lvs.fpti.mm-4_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 12, 118473021, 118481007, 7986, slca.lvs.fpti.mm-1_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 30, 118462885, 118470919, 8034, slca.lvs.fpti.mm-3_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 98, 117227057, 118484055, 1256998, slca.lvs.fpti.mm-9_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 36, 118491829, 118498669, 6840, slca.lvs.fpti.mm-3_/10.196.246.38
>>
>>
>>
>>GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 65, 118483038, 118489002, 5964, slca.lvs.fpti.mm-6_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 68, 118500463, 118510063, 9600, slca.lvs.fpti.mm-6_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 26, 118503344, 118511243, 7899, slca.lvs.fpti.mm-2_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 14, 118514764, 118520550, 5786, slca.lvs.fpti.mm-1_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118506337, 8275703, slca.lvs.fpti.mm-9_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118507871, 118515735, 7864, slca.lvs.fpti.mm-1_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 35, 118487244, 118495868, 8624, slca.lvs.fpti.mm-3_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 80, 118491768, 118499046, 7278, slca.lvs.fpti.mm-8_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 9, 118498940, 118499194, 254, slca.lvs.fpti.mm-0_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 6, 118516902, 118517425, 523, slca.lvs.fpti.mm-0_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 61, 118514062, 118521457, 7395, slca.lvs.fpti.mm-6_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 0, 116100206, 118494575, 2394369, slca.lvs.fpti.mm-0_/10.196.246.37
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 40, 118497705, 118504514, 6809, slca.lvs.fpti.mm-4_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 12, 118492007, 118498617, 6610, slca.lvs.fpti.mm-1_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 30, 118481808, 118488785, 6977, slca.lvs.fpti.mm-3_/10.196.246.38
>>
>>slca.lvs.fpti.mm, fpti.platform.enrch, 98, 117318343, 118501939, 1183596, slca.lvs.fpti.mm-9_/10.196.246.37
>>
>>
>>
>>Anyone else facing the same problem?
>>
>>
>>Thanks
>>
>>Tushar
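
In the output above, LAG is simply LOG END OFFSET minus CURRENT OFFSET (for partition 99: 118488448 - 110230634 = 8257814), and a stuck partition is one whose current offset stays fixed between runs while the log end keeps growing. A minimal sketch of that check, assuming the comma-separated layout shown above:

```python
# Sketch: flag "stuck" partitions by diffing two snapshots of the
# offset-checker output. A partition counts as stuck when its current
# offset has not advanced while its log end offset has grown.
# The parser assumes the comma-separated layout shown in this thread.

def parse_snapshot(text):
    """Map partition -> (current_offset, log_end_offset)."""
    rows = {}
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 6 or not parts[2].isdigit():
            continue  # skip headers and blank lines
        rows[int(parts[2])] = (int(parts[3]), int(parts[4]))
    return rows

def stuck_partitions(before, after):
    b, a = parse_snapshot(before), parse_snapshot(after)
    return sorted(
        p for p in b.keys() & a.keys()
        if a[p][0] == b[p][0] and a[p][1] > b[p][1]
    )

# Two rows for partitions 99 and 19, taken from the snapshots above.
snapshot_1 = """\
slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118488448, 8257814, slca.lvs.fpti.mm-9_/10.196.246.38
slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118488594, 118497984, 9390, slca.lvs.fpti.mm-1_/10.196.246.37
"""
snapshot_2 = """\
slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118506337, 8275703, slca.lvs.fpti.mm-9_/10.196.246.38
slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118507871, 118515735, 7864, slca.lvs.fpti.mm-1_/10.196.246.37
"""
print(stuck_partitions(snapshot_1, snapshot_2))  # [99]
```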

Re: MirrorMaker consumers getting stuck on certain partitions

Posted by "Mhaskar, Tushar" <tm...@paypal.com.INVALID>.
What is the other issue? We are also facing this in 0.8.2.1 MM.

-Tushar





On 6/9/16, 2:59 PM, "Jason Gustafson" <ja...@confluent.io> wrote:

>Rebalances can occur for several reasons, such as new topics being added,
>partitions being increased for a topic, or one of the consumers temporarily
>falling out of the group because of a session timeout. If you have logs
>left around from your old MM instance, you should be able to track back to
>the time that the lag began to accumulate and see if there was a rebalance
>at that time. There could be another problem for sure, but given the fact
>that you're on 0.9.0.0 and that no one has reported this issue for any
>other version, I'd want to definitively rule out this issue before spending
>time looking for other causes.
>
>-Jason
>
>On Thu, Jun 9, 2016 at 2:41 PM, Mhaskar, Tushar <tmhaskar@paypal.com.invalid
>> wrote:
>
>> Thanks for the JIRA ticket.
>>
> The MM process runs fine for days or months before getting stuck on a
> partition, so I am wondering how rebalancing could affect this.
> We have an equal number of consumer threads and partitions.
>>
>> -Tushar
>>
>>
>>
>> On 6/9/16, 1:25 PM, "Jason Gustafson" <ja...@confluent.io> wrote:
>>
>> >Hey Tushar,
>> >
>> >I think this is the one: https://issues.apache.org/jira/browse/KAFKA-2978
>> .
>> >
>> >-Jason
>> >
>> >On Thu, Jun 9, 2016 at 10:27 AM, Mhaskar, Tushar <
>> >tmhaskar@paypal.com.invalid> wrote:
>> >
>> >> Hi Jason,
>> >>
>> We used to face this issue often in 0.9.0.0 MM, so we switched back to
>> 0.8.2.1 MM because it was more stable than the 0.9.0.0 MM code.
>> >> Is there a JIRA for that issue?
>> >>
>> >> Thanks
>> >> Tushar
>> >>
>> >>
>> >>
>> >> On 6/9/16, 9:40 AM, "Jason Gustafson" <ja...@confluent.io> wrote:
>> >>
>> >> >Hi Tushar,
>> >> >
>> >> >Are you on 0.9.0.0 or 0.9.0.1? There was a bug in 0.9.0.0 which could
>> >> >result in some partitions going unconsumed for a while.
>> >> >
>> >> >-Jason
>> >> >
>> >> >On Thu, Jun 9, 2016 at 12:17 AM, Gwen Shapira <gw...@confluent.io>
>> wrote:
>> >> >
>> >> >> Also, maybe a thread dump (using jstack) of the mirrormaker when it
>> >> >> is stuck? It may give us a clue...
>> >> >>
>> >> >> On Thu, Jun 9, 2016 at 10:05 AM, Mhaskar, Tushar
>> >> >> <tm...@paypal.com.invalid> wrote:
>> >> >> > Yes, the maximum message size is within 1 MB.
>> >> >> >
>> >> >> > That is exactly the problem: we are not getting any exceptions in
>> >> >> > the log.
>> >> >> >
>> >> >> > Nothing in common for the stuck partitions either. There are 100
>> >> >> > partitions across 10 brokers. Every time this issue happens, the
>> >> >> > consumers get stuck on different partitions.
>> >> >> >
>> >> >> > I can double check on the offending partitions for the message size
>> >> >> using the logfile dump tool.
>> >> >> >
>> >> >> > Tushar
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On 6/8/16, 11:36 PM, "Gwen Shapira" <gw...@confluent.io> wrote:
>> >> >> >
>> >> >> >>Hi Tushar,
>> >> >> >>
>> >> >> >>I'm sure you know the joke about the statistician who drowned in a
>> >> >> >>pool with an average depth of 1 foot. It isn't the average size that
>> >> >> >>will get you; it is the maximum size :)
>> >> >> >>
>> >> >> >>However, messages that are too large should have resulted in an
>> >> >> >>exception getting logged. Perhaps you want to double check by running
>> >> >> >>the logfile dump tool on one of the offending partitions; it will
>> >> >> >>list the size of each message.
>> >> >> >>
>> >> >> >>I have to say that I don't have many ideas on why this happens. The
>> >> >> >>fact that it happens with both new and old consumers indicates it is
>> >> >> >>not an implementation bug but rather something with the partitions or
>> >> >> >>the data itself.
>> >> >> >>
>> >> >> >>Anything in common for those partitions? Are they on the same
>> >> >> >>broker, for instance?
>> >> >> >>
>> >> >> >>Gwen
>> >> >> >>
>> >> >> >>
>> >> >> >>On Thu, Jun 9, 2016 at 9:24 AM, Mhaskar, Tushar
>> >> >> >><tm...@paypal.com.invalid> wrote:
>> >> >> >>> Hi Gwen,
>> >> >> >>>
>> >> >> >>> 1) We already have our log level at INFO, but we don't see
>> >> >> >>> anything unless a rebalance happens due to restarting the mirror
>> >> >> >>> maker or some other activity, like shutting down the mirror maker
>> >> >> >>> process itself.
>> >> >> >>> 2) Our average message size is 4KB.
>> >> >> >>>
>> >> >> >>> Thanks,
>> >> >> >>> Tushar
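
Gwen's suggestion above to run the logfile dump tool can be scripted. A hedged sketch that pulls the largest payload size out of dump output; the "payloadsize:" field name follows 0.8/0.9-era kafka.tools.DumpLogSegments lines and is an assumption, so check your version's actual format:

```python
import re

# Sketch: find the largest message in output from the log dump tool.
# The "payloadsize:" field matches 0.8/0.9-era DumpLogSegments lines
# (an assumption -- verify against your version's actual output).
sample_dump = """\
offset: 0 position: 0 isvalid: true payloadsize: 4096 magic: 0 compresscodec: NoCompressionCodec
offset: 1 position: 4122 isvalid: true payloadsize: 1048576 magic: 0 compresscodec: NoCompressionCodec
"""

def max_payload_size(dump_output):
    sizes = [int(m) for m in re.findall(r"payloadsize: (\d+)", dump_output)]
    return max(sizes) if sizes else 0

print(max_payload_size(sample_dump))  # 1048576
```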

Re: MirrorMaker consumers getting stuck on certain partitions

Posted by Jason Gustafson <ja...@confluent.io>.
Rebalances can occur for several reasons, such as new topics being added,
partitions being increased for a topic, or one of the consumers temporarily
falling out of the group because of a session timeout. If you have logs
left around from your old MM instance, you should be able to track back to
the time that the lag began to accumulate and see if there was a rebalance
at that time. There could be another problem for sure, but given the fact
that you're on 0.9.0.0 and that no one has reported this issue for any
other version, I'd want to definitively rule out this issue before spending
time looking for other causes.

-Jason
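
One of the rebalance triggers mentioned above, a consumer temporarily falling out of the group on a session timeout, is controlled by the 0.9 consumer's session.timeout.ms. A sketch of the relevant consumer.properties entries (values are illustrative, not recommendations):

```properties
# 0.9 new-consumer group-membership settings (illustrative values)
group.id=slca.lvs.fpti.mm
session.timeout.ms=30000
# heartbeats must arrive well within the session timeout
heartbeat.interval.ms=3000
```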

On Thu, Jun 9, 2016 at 2:41 PM, Mhaskar, Tushar <tmhaskar@paypal.com.invalid
> wrote:

> Thanks for the JIRA ticket.
>
> The MM process runs fine for days or months before getting stuck on a
> partition, so I am wondering how rebalancing could affect this.
> We have an equal number of consumer threads and partitions.
>
> -Tushar

Re: MirrorMaker consumers getting stuck on certain partitions

Posted by "Mhaskar, Tushar" <tm...@paypal.com.INVALID>.
Thanks for the JIRA ticket.

The MM process runs fine for days or months before getting stuck on a partition, so I am wondering how rebalancing could affect this.
We have an equal number of consumer threads and partitions.

-Tushar
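A quick sanity sketch (a toy model only, not the actual Kafka assignor code) of why the thread/partition arithmetic matters here: with 10 MM processes of 10 streams each against 100 partitions, a round-robin-style assignment leaves every consumer thread owning exactly one partition, so a frozen partition points at exactly one stuck thread.

```python
def round_robin_assign(partitions, consumers):
    # Toy round-robin: partition i goes to consumer i mod len(consumers).
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 10 MM processes x 10 streams = 100 consumer threads, 100 partitions.
consumers = [f"mm-{proc}-stream-{s}" for proc in range(10) for s in range(10)]
assignment = round_robin_assign(range(100), consumers)
assert all(len(parts) == 1 for parts in assignment.values())
```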



On 6/9/16, 1:25 PM, "Jason Gustafson" <ja...@confluent.io> wrote:

>Hey Tushar,
>
>I think this is the one: https://issues.apache.org/jira/browse/KAFKA-2978.
>
>-Jason
>
>On Thu, Jun 9, 2016 at 10:27 AM, Mhaskar, Tushar <
>tmhaskar@paypal.com.invalid> wrote:
>
>> Hi Jason,
>>
> >> We used to face this issue often in 0.9.0.0 MM, so we switched back to
>> 0.8.2.1 MM because it was more stable than 0.9.0.0 MM code.
>> Is there a JIRA for that issue?
>>
>> Thanks
>> Tushar
>>
>>
>>
>> On 6/9/16, 9:40 AM, "Jason Gustafson" <ja...@confluent.io> wrote:
>>
>> >Hi Tushar,
>> >
>> >Are you on 0.9.0.0 or 0.9.0.1? There was a bug in 0.9.0.0 which could
>> >result in some partitions going unconsumed for a while.
>> >
>> >-Jason
>> >
>> >On Thu, Jun 9, 2016 at 12:17 AM, Gwen Shapira <gw...@confluent.io> wrote:
>> >
>> >> Also, maybe a thread dump (using jstack) of the mirrormaker when it is
>> >> stuck? It may give us a clue...
>> >>
>> >> On Thu, Jun 9, 2016 at 10:05 AM, Mhaskar, Tushar
>> >> <tm...@paypal.com.invalid> wrote:
>> >> > Yes the max size of the messages is within 1MB.
>> >> >
>> >> > That is the exact thing, we are not getting any exceptions in the log.
>> >> >
>> >> > Nothing in common for the stuck partitions as well. There are 100
> >> >> > partitions across 10 brokers. Every time this issue happens, the
>> consumers
>> >> > get stuck on different partitions.
>> >> >
>> >> > I can double check on the offending partitions for the message size
>> >> using the logfile dump tool.
>> >> >
>> >> > Tushar
>> >> >
>> >> >
>> >> >
>> >> > On 6/8/16, 11:36 PM, "Gwen Shapira" <gw...@confluent.io> wrote:
>> >> >
>> >> >>Hi Tushar,
>> >> >>
>> >> >>I'm sure you know the joke about the statistician who drowned in a
> >> >> >>pool with an average depth of 1 foot. It isn't the average size that will
>> >> >>get you, it is the maximum size :)
>> >> >>
>> >> >>However, too large messages should have resulted in an exception
>> >> >>getting logged. Perhaps you want to double check by using the logfile
>> >> >>dump tool on one of the offending partitions, it will list the size of
>> >> >>each message.
>> >> >>
>> >> >>I have to say that I don't have many ideas on why this happens. The
> >> >> >>fact that it happens with both new and old consumers indicates it is
>> >> >>not an implementation bug but rather something with the partitions /
>> >> >>data itself.
>> >> >>
>> >> >>Anything in common for those partitions? Are they on same broker, for
>> >> instance?
>> >> >>
>> >> >>Gwen
>> >> >>
>> >> >>
>> >> >>On Thu, Jun 9, 2016 at 9:24 AM, Mhaskar, Tushar
>> >> >><tm...@paypal.com.invalid> wrote:
>> >> >>> Hi Gwen,
>> >> >>>
>> >> >>> 1) We already have our log level as INFO. But we don’t see anything
>> >> until and unless there is rebalance happening due to restarting the
>> mirror
>> >> maker or any other activity like shutting down the mirror maker process
>> >> itself.
>> >> >>> 2) Our average message size is 4KB.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Tushar
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> On 6/8/16, 10:31 PM, "Gwen Shapira" <gw...@confluent.io> wrote:
>> >> >>>
>> >> >>>>Hi Tushar,
>> >> >>>>
>> >> >>>>Few follow up questions:
>> >> >>>>1. Can you enable logs for mirror maker (there should be
>> >> >>>>conf/tools-log4j.properties file for this) at INFO level? This can
>> >> >>>>give us a clue on why it stopped.
>> >> >>>>
>> >> >>>>2. Did you check the FAQ for "why is my consumer hanging"? Most of
>> the
>> >> >>>>reasons (for example "message too large") can apply here too.
>> >> >>>>
>> >> >>>>Gwen
>> >> >>>>
>> >> >>>>On Wed, May 4, 2016 at 9:54 PM, Mhaskar, Tushar
>> >> >>>><tm...@paypal.com.invalid> wrote:
>> >> >>>>> Anyone encountered this error?
>> >> >>>>>
>> >> >>>>> Thanks
>> >> >>>>> Tushar
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> On 5/2/16, 10:44 PM, "Mhaskar, Tushar"
>> <tm...@paypal.com.INVALID>
>> >> wrote:
>> >> >>>>>
>> >> >>>>>>Consumption on Partitions 99, 0 and 98 is getting stuck.
>> >> >>>>>>
>> >> >>>>>>Thanks
>> >> >>>>>>Tushar
>> >> >>>>>>From: "Mhaskar, Tushar" <tmhaskar@paypal.com<mailto:
>> >> tmhaskar@paypal.com>>
>> >> >>>>>>Date: Monday, May 2, 2016 at 9:52 PM
>> >> >>>>>>To: "users@kafka.apache.org<ma...@kafka.apache.org>" <
>> >> users@kafka.apache.org<ma...@kafka.apache.org>>
>> >> >>>>>>Subject: MirrorMaker consumers getting stuck on certain partitions
>> >> >>>>>>
>> >> >>>>>>Hi,
>> >> >>>>>>
>> >> >>>>>>I am running Mirror Maker (version 0.9 , new consumer). Sometimes
>> I
>> >> find that consumer gets stuck on certain partitions and offset doesn’t
>> move
>> >> in that case.
>> >> >>>>>>
>> >> >>>>>>I have 10 MM process running, each having 10 streams. The topic
>> has
>> >> 100 partitions.
>> >> >>>>>>
>> >> >>>>>>Below is the sample output of the consumer (I have cut short the
>> >> output. Remaining partitions except the highlighted ones have similar
>> lag
>> >> in thousands).
>> >> >>>>>>I can see the LOG size and Lag increasing but not the current
>> offset
>> >> and the MM is also running.
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG,
>> OWNER
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 65, 118463859, 118471286,
>> >> 7427, slca.lvs.fpti.mm-6_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 68, 118490771, 118492211,
>> >> 1440, slca.lvs.fpti.mm-6_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 26, 118484045, 118493447,
>> >> 9402, slca.lvs.fpti.mm-2_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 14, 118492467, 118502925,
>> >> 10458, slca.lvs.fpti.mm-1_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118488448,
>> >> 8257814, slca.lvs.fpti.mm-9_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118488594, 118497984,
>> >> 9390, slca.lvs.fpti.mm-1_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 35, 118477757, 118478130,
>> >> 373, slca.lvs.fpti.mm-3_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 80, 118472483, 118481154,
>> >> 8671, slca.lvs.fpti.mm-8_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 21, 118478106, 118479540,
>> >> 1434, slca.lvs.fpti.mm-2_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 59, 118487660, 118494558,
>> >> 6898, slca.lvs.fpti.mm-5_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 48, 118483938, 118490119,
>> >> 6181, slca.lvs.fpti.mm-4_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 95, 118490885, 118495214,
>> >> 4329, slca.lvs.fpti.mm-9_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 58, 118499230, 118499608,
>> >> 378, slca.lvs.fpti.mm-5_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 44, 118484130, 118491922,
>> >> 7792, slca.lvs.fpti.mm-4_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 56, 118481454, 118485764,
>> >> 4310, slca.lvs.fpti.mm-5_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 64, 118463461, 118471399,
>> >> 7938, slca.lvs.fpti.mm-6_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 81, 118481575, 118482081,
>> >> 506, slca.lvs.fpti.mm-8_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 0, 116010184, 118476683,
>> >> 2466499, slca.lvs.fpti.mm-0_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 40, 118478651, 118486854,
>> >> 8203, slca.lvs.fpti.mm-4_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 12, 118473021, 118481007,
>> >> 7986, slca.lvs.fpti.mm-1_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 30, 118462885, 118470919,
>> >> 8034, slca.lvs.fpti.mm-3_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 98, 117227057, 118484055,
>> >> 1256998, slca.lvs.fpti.mm-9_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 36, 118491829, 118498669,
>> >> 6840, slca.lvs.fpti.mm-3_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG,
>> OWNER
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 65, 118483038, 118489002,
>> >> 5964, slca.lvs.fpti.mm-6_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 68, 118500463, 118510063,
>> >> 9600, slca.lvs.fpti.mm-6_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 26, 118503344, 118511243,
>> >> 7899, slca.lvs.fpti.mm-2_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 14, 118514764, 118520550,
>> >> 5786, slca.lvs.fpti.mm-1_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118506337,
>> >> 8275703, slca.lvs.fpti.mm-9_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118507871, 118515735,
>> >> 7864, slca.lvs.fpti.mm-1_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 35, 118487244, 118495868,
>> >> 8624, slca.lvs.fpti.mm-3_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 80, 118491768, 118499046,
>> >> 7278, slca.lvs.fpti.mm-8_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 9, 118498940, 118499194,
>> >> 254, slca.lvs.fpti.mm-0_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 6, 118516902, 118517425,
>> >> 523, slca.lvs.fpti.mm-0_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 61, 118514062, 118521457,
>> >> 7395, slca.lvs.fpti.mm-6_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 0, 116100206, 118494575,
>> >> 2394369, slca.lvs.fpti.mm-0_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 40, 118497705, 118504514,
>> >> 6809, slca.lvs.fpti.mm-4_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 12, 118492007, 118498617,
>> >> 6610, slca.lvs.fpti.mm-1_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 30, 118481808, 118488785,
>> >> 6977, slca.lvs.fpti.mm-3_/10.196.246.38
>> >> >>>>>>
>> >> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 98, 117318343, 118501939,
>> >> 1183596, slca.lvs.fpti.mm-9_/10.196.246.37
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>Anyone else facing the same problem?
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>Thanks
>> >> >>>>>>
>> >> >>>>>>Tushar
>> >>
>>
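One way to pin down which partitions are actually frozen, rather than merely lagging, is to diff two snapshots of the consumer-group lag output quoted above: a stuck partition is one whose current offset is unchanged while its log end offset keeps growing. A rough standalone sketch (not an official Kafka tool), assuming the comma-separated column layout shown in the output above:

```python
# Column layout assumed from the output above:
# GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER

def parse_snapshot(text):
    """Map partition -> (current_offset, log_end_offset)."""
    offsets = {}
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 7 or not fields[2].isdigit():
            continue  # skip the header row and any non-data lines
        offsets[int(fields[2])] = (int(fields[3]), int(fields[4]))
    return offsets

def stuck_partitions(before, after):
    """Partitions whose consumer offset is frozen while the log keeps growing."""
    stuck = []
    for p, (cur_b, end_b) in before.items():
        if p in after:
            cur_a, end_a = after[p]
            if cur_a == cur_b and end_a > end_b:
                stuck.append(p)
    return sorted(stuck)

# Sample rows lifted from the two snapshots quoted in this thread:
snap_before = """\
slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118488448, 8257814, slca.lvs.fpti.mm-9_/10.196.246.38
slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118488594, 118497984, 9390, slca.lvs.fpti.mm-1_/10.196.246.37"""
snap_after = """\
slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118506337, 8275703, slca.lvs.fpti.mm-9_/10.196.246.38
slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118507871, 118515735, 7864, slca.lvs.fpti.mm-1_/10.196.246.37"""
assert stuck_partitions(parse_snapshot(snap_before), parse_snapshot(snap_after)) == [99]
```

Run against the two full snapshots quoted in this thread, only partition 99's current offset stays frozen while its log end offset advances; partitions 0 and 98 are far behind but still moving.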

Re: MirrorMaker consumers getting stuck on certain partitions

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Tushar,

I think this is the one: https://issues.apache.org/jira/browse/KAFKA-2978.

-Jason

On Thu, Jun 9, 2016 at 10:27 AM, Mhaskar, Tushar <
tmhaskar@paypal.com.invalid> wrote:

> Hi Jason,
>
> We used to face this issue often in 0.9.0.0 MM, so we switched back to
> 0.8.2.1 MM because it was more stable than 0.9.0.0 MM code.
> Is there a JIRA for that issue?
>
> Thanks
> Tushar
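The logfile dump tool Gwen mentions earlier in this thread (kafka.tools.DumpLogSegments) prints one line per message; a small filter over that text can surface the largest payload in a suspect partition. A hedged sketch — the `payloadsize:` field name is an assumption based on the dump output of that era and may differ in other versions:

```python
import re

def max_payload_size(dump_output):
    # Pull every "payloadsize: N" field out of the dump tool's text output
    # and return the largest; 0 if none were found.
    sizes = [int(n) for n in re.findall(r'payloadsize:\s*(\d+)', dump_output)]
    return max(sizes) if sizes else 0

sample = (
    "offset: 0 position: 0 isvalid: true payloadsize: 4096 magic: 0\n"
    "offset: 1 position: 4122 isvalid: true payloadsize: 1200000 magic: 0\n"
)
assert max_payload_size(sample) == 1200000
```

If the maximum comes back above the consumer's fetch.message.max.bytes (old consumer) or max.partition.fetch.bytes (new consumer), an oversized message would explain a stalled fetcher on that partition.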

Re: MirrorMaker consumers getting stuck on certain partitions

Posted by "Mhaskar, Tushar" <tm...@paypal.com.INVALID>.
Hi Jason,

We used to face this issue often in 0.9.0.0 MM, so we switched back to
Is there a JIRA for that issue?

Thanks
Tushar



On 6/9/16, 9:40 AM, "Jason Gustafson" <ja...@confluent.io> wrote:

>Hi Tushar,
>
>Are you on 0.9.0.0 or 0.9.0.1? There was a bug in 0.9.0.0 which could
>result in some partitions going unconsumed for a while.
>
>-Jason

Re: MirrorMaker consumers getting stuck on certain partitions

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Tushar,

Are you on 0.9.0.0 or 0.9.0.1? There was a bug in 0.9.0.0 which could
result in some partitions going unconsumed for a while.

-Jason

On Thu, Jun 9, 2016 at 12:17 AM, Gwen Shapira <gw...@confluent.io> wrote:

> Also, maybe a thread dump (using jstack) of the mirrormaker when it is
> stuck? It may give us a clue...
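Once the jstack dump Gwen asks for is in hand, a quick scan for mirror-maker threads parked outside RUNNABLE can narrow the search. A rough sketch; the "mirrormaker" thread-name filter is an assumption, since the actual thread names vary by version:

```python
import re

def non_runnable_threads(dump, name_filter="mirrormaker"):
    # jstack stanzas start with a quoted thread name, followed by a line
    # like:  java.lang.Thread.State: WAITING (parking)
    found = []
    for header, state in re.findall(
            r'^"([^"]+)".*?\n\s+java\.lang\.Thread\.State:\s+(\S+)',
            dump, flags=re.MULTILINE):
        if name_filter in header.lower() and state != "RUNNABLE":
            found.append((header, state))
    return found

dump = (
    '"mirrormaker-thread-0" #12 prio=5 tid=0x1 nid=0x2 waiting on condition\n'
    '   java.lang.Thread.State: WAITING (parking)\n'
    '\n'
    '"mirrormaker-thread-1" #13 prio=5 tid=0x3 nid=0x4 runnable\n'
    '   java.lang.Thread.State: RUNNABLE\n'
)
assert non_runnable_threads(dump) == [("mirrormaker-thread-0", "WAITING")]
```

A thread stuck in WAITING or BLOCKED across several dumps taken minutes apart is a strong candidate for the owner of the frozen partition.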
>
> On Thu, Jun 9, 2016 at 10:05 AM, Mhaskar, Tushar
> <tm...@paypal.com.invalid> wrote:
> > Yes, the max size of the messages is within 1 MB.
> >
> > That is exactly the thing: we are not getting any exceptions in the log.
> >
> > Nothing in common among the stuck partitions either. There are 100
> > partitions across 10 brokers. Every time this issue happens, the consumers
> > get stuck on different partitions.
> >
> > I can double-check the offending partitions for message size
> using the logfile dump tool.
> >
> > Tushar
> >
> >
> >
> > On 6/8/16, 11:36 PM, "Gwen Shapira" <gw...@confluent.io> wrote:
> >
> >>Hi Tushar,
> >>
> >>I'm sure you know the joke about the statistician who drowned in a
> >>pool with an average depth of one foot. It isn't the average size that
> >>will get you, it is the maximum size :)
> >>
> >>However, messages that are too large should have resulted in an
> >>exception being logged. Perhaps you want to double-check by using the
> >>logfile dump tool on one of the offending partitions; it will list the
> >>size of each message.
> >>
> >>I have to say that I don't have many ideas on why this happens. The
> >>fact that it happens with both the new and old consumers indicates it
> >>is not an implementation bug but rather something with the partitions /
> >>data itself.
> >>
> >>Anything in common for those partitions? Are they on the same broker,
> for instance?
> >>
> >>Gwen
> >>
> >>
> >>On Thu, Jun 9, 2016 at 9:24 AM, Mhaskar, Tushar
> >><tm...@paypal.com.invalid> wrote:
> >>> Hi Gwen,
> >>>
> >>> 1) We already have our log level at INFO, but we don’t see anything
> unless a rebalance happens, due to restarting the mirror maker or some
> other activity such as shutting down the mirror maker process itself.
> >>> 2) Our average message size is 4 KB.
> >>>
> >>> Thanks,
> >>> Tushar
> >>>
> >>>
> >>>
> >>>
> >>> On 6/8/16, 10:31 PM, "Gwen Shapira" <gw...@confluent.io> wrote:
> >>>
> >>>>Hi Tushar,
> >>>>
> >>>>A few follow-up questions:
> >>>>1. Can you enable logs for mirror maker (there should be a
> >>>>conf/tools-log4j.properties file for this) at INFO level? This can
> >>>>give us a clue on why it stopped.
> >>>>
> >>>>2. Did you check the FAQ for "why is my consumer hanging"? Most of the
> >>>>reasons (for example "message too large") can apply here too.
> >>>>
> >>>>Gwen
> >>>>
> >>>>On Wed, May 4, 2016 at 9:54 PM, Mhaskar, Tushar
> >>>><tm...@paypal.com.invalid> wrote:
> >>>>> Has anyone encountered this error?
> >>>>>
> >>>>> Thanks
> >>>>> Tushar
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 5/2/16, 10:44 PM, "Mhaskar, Tushar" <tm...@paypal.com.INVALID>
> wrote:
> >>>>>
> >>>>>>Consumption on partitions 99, 0, and 98 is getting stuck.
> >>>>>>
> >>>>>>Thanks
> >>>>>>Tushar
> >>>>>>From: "Mhaskar, Tushar" <tmhaskar@paypal.com<mailto:
> tmhaskar@paypal.com>>
> >>>>>>Date: Monday, May 2, 2016 at 9:52 PM
> >>>>>>To: "users@kafka.apache.org<ma...@kafka.apache.org>" <
> users@kafka.apache.org<ma...@kafka.apache.org>>
> >>>>>>Subject: MirrorMaker consumers getting stuck on certain partitions
> >>>>>>
> >>>>>>Hi,
> >>>>>>
> >>>>>>I am running Mirror Maker (version 0.9, new consumer). Sometimes I
> find that the consumer gets stuck on certain partitions and the offset
> doesn’t move in that case.
> >>>>>>
> >>>>>>I have 10 MM processes running, each with 10 streams. The topic has
> 100 partitions.
> >>>>>>
> >>>>>>Below is a sample of the consumer output (I have cut it short; the
> remaining partitions, except the highlighted ones, have similar lag in
> the thousands).
> >>>>>>I can see the log end offset and lag increasing but not the current
> offset, even though the MM is still running.
> >>>>>>
> >>>>>>
> >>>>>>GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 65, 118463859, 118471286,
> 7427, slca.lvs.fpti.mm-6_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 68, 118490771, 118492211,
> 1440, slca.lvs.fpti.mm-6_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 26, 118484045, 118493447,
> 9402, slca.lvs.fpti.mm-2_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 14, 118492467, 118502925,
> 10458, slca.lvs.fpti.mm-1_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118488448,
> 8257814, slca.lvs.fpti.mm-9_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118488594, 118497984,
> 9390, slca.lvs.fpti.mm-1_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 35, 118477757, 118478130,
> 373, slca.lvs.fpti.mm-3_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 80, 118472483, 118481154,
> 8671, slca.lvs.fpti.mm-8_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 21, 118478106, 118479540,
> 1434, slca.lvs.fpti.mm-2_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 59, 118487660, 118494558,
> 6898, slca.lvs.fpti.mm-5_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 48, 118483938, 118490119,
> 6181, slca.lvs.fpti.mm-4_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 95, 118490885, 118495214,
> 4329, slca.lvs.fpti.mm-9_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 58, 118499230, 118499608,
> 378, slca.lvs.fpti.mm-5_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 44, 118484130, 118491922,
> 7792, slca.lvs.fpti.mm-4_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 56, 118481454, 118485764,
> 4310, slca.lvs.fpti.mm-5_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 64, 118463461, 118471399,
> 7938, slca.lvs.fpti.mm-6_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 81, 118481575, 118482081,
> 506, slca.lvs.fpti.mm-8_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 0, 116010184, 118476683,
> 2466499, slca.lvs.fpti.mm-0_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 40, 118478651, 118486854,
> 8203, slca.lvs.fpti.mm-4_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 12, 118473021, 118481007,
> 7986, slca.lvs.fpti.mm-1_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 30, 118462885, 118470919,
> 8034, slca.lvs.fpti.mm-3_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 98, 117227057, 118484055,
> 1256998, slca.lvs.fpti.mm-9_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 36, 118491829, 118498669,
> 6840, slca.lvs.fpti.mm-3_/10.196.246.38
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 65, 118483038, 118489002,
> 5964, slca.lvs.fpti.mm-6_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 68, 118500463, 118510063,
> 9600, slca.lvs.fpti.mm-6_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 26, 118503344, 118511243,
> 7899, slca.lvs.fpti.mm-2_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 14, 118514764, 118520550,
> 5786, slca.lvs.fpti.mm-1_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118506337,
> 8275703, slca.lvs.fpti.mm-9_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118507871, 118515735,
> 7864, slca.lvs.fpti.mm-1_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 35, 118487244, 118495868,
> 8624, slca.lvs.fpti.mm-3_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 80, 118491768, 118499046,
> 7278, slca.lvs.fpti.mm-8_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 9, 118498940, 118499194,
> 254, slca.lvs.fpti.mm-0_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 6, 118516902, 118517425,
> 523, slca.lvs.fpti.mm-0_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 61, 118514062, 118521457,
> 7395, slca.lvs.fpti.mm-6_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 0, 116100206, 118494575,
> 2394369, slca.lvs.fpti.mm-0_/10.196.246.37
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 40, 118497705, 118504514,
> 6809, slca.lvs.fpti.mm-4_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 12, 118492007, 118498617,
> 6610, slca.lvs.fpti.mm-1_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 30, 118481808, 118488785,
> 6977, slca.lvs.fpti.mm-3_/10.196.246.38
> >>>>>>
> >>>>>>slca.lvs.fpti.mm, fpti.platform.enrch, 98, 117318343, 118501939,
> 1183596, slca.lvs.fpti.mm-9_/10.196.246.37
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>Anyone else facing the same problem?
> >>>>>>
> >>>>>>
> >>>>>>Thanks
> >>>>>>
> >>>>>>Tushar
>

Re: MirrorMaker consumers getting stuck on certain partitions

Posted by Gwen Shapira <gw...@confluent.io>.
Also, maybe take a thread dump (using jstack) of the MirrorMaker when it is
stuck? It may give us a clue...
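For anyone else hitting this, a minimal sketch of grabbing that dump; `jps` and `jstack` ship with the JDK, and the `MirrorMaker` class-name pattern is an assumption, so check `jps -l` first to see how the process actually shows up:

```shell
# Locate the MirrorMaker JVM and dump its threads to a timestamped file.
# The awk pattern is a guess at the main-class name; verify with `jps -l`.
PID=$(jps -l 2>/dev/null | awk '/MirrorMaker/ {print $1; exit}')
if [ -n "$PID" ]; then
  jstack "$PID" > "mm-threaddump-$(date +%s).txt"
  echo "thread dump written for pid $PID"
else
  echo "MirrorMaker process not found"
fi
```

Look for consumer threads parked in poll or blocked handing records to the producer; that usually points at where the pipeline stalled.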

On Thu, Jun 9, 2016 at 10:05 AM, Mhaskar, Tushar
<tm...@paypal.com.invalid> wrote:
> Yes, the max size of the messages is within 1 MB.
>
> That is exactly the thing: we are not getting any exceptions in the log.
>
> Nothing in common among the stuck partitions either. There are 100
> partitions across 10 brokers. Every time this issue happens, the consumers
> get stuck on different partitions.
>
> I can double-check the offending partitions for message size using the logfile dump tool.
>
> Tushar
>
>
>
> On 6/8/16, 11:36 PM, "Gwen Shapira" <gw...@confluent.io> wrote:
>
>>Hi Tushar,
>>
>>I'm sure you know the joke about the statistician who drowned in a
>>pool with an average depth of one foot. It isn't the average size that
>>will get you, it is the maximum size :)
>>
>>However, messages that are too large should have resulted in an
>>exception being logged. Perhaps you want to double-check by using the
>>logfile dump tool on one of the offending partitions; it will list the
>>size of each message.
>>
>>I have to say that I don't have many ideas on why this happens. The
>>fact that it happens with both the new and old consumers indicates it is
>>not an implementation bug but rather something with the partitions /
>>data itself.
>>
>>Anything in common for those partitions? Are they on the same broker, for instance?
>>
>>Gwen
>>
>>
>>On Thu, Jun 9, 2016 at 9:24 AM, Mhaskar, Tushar
>><tm...@paypal.com.invalid> wrote:
>>> Hi Gwen,
>>>
>>> 1) We already have our log level at INFO, but we don’t see anything unless a rebalance happens, due to restarting the mirror maker or some other activity such as shutting down the mirror maker process itself.
>>> 2) Our average message size is 4 KB.
>>>
>>> Thanks,
>>> Tushar
>>>
>>>
>>>
>>>
>>> On 6/8/16, 10:31 PM, "Gwen Shapira" <gw...@confluent.io> wrote:
>>>
>>>>Hi Tushar,
>>>>
>>>>A few follow-up questions:
>>>>1. Can you enable logs for mirror maker (there should be a
>>>>conf/tools-log4j.properties file for this) at INFO level? This can
>>>>give us a clue on why it stopped.
>>>>
>>>>2. Did you check the FAQ for "why is my consumer hanging"? Most of the
>>>>reasons (for example "message too large") can apply here too.
>>>>
>>>>Gwen
>>>>
>>>>On Wed, May 4, 2016 at 9:54 PM, Mhaskar, Tushar
>>>><tm...@paypal.com.invalid> wrote:
>>>>> Has anyone encountered this error?
>>>>>
>>>>> Thanks
>>>>> Tushar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 5/2/16, 10:44 PM, "Mhaskar, Tushar" <tm...@paypal.com.INVALID> wrote:
>>>>>
>>>>>>Consumption on partitions 99, 0, and 98 is getting stuck.
>>>>>>
>>>>>>Thanks
>>>>>>Tushar
>>>>>>From: "Mhaskar, Tushar" <tm...@paypal.com>>
>>>>>>Date: Monday, May 2, 2016 at 9:52 PM
>>>>>>To: "users@kafka.apache.org<ma...@kafka.apache.org>" <us...@kafka.apache.org>>
>>>>>>Subject: MirrorMaker consumers getting stuck on certain partitions
>>>>>>
>>>>>>Hi,
>>>>>>
>>>>>>I am running Mirror Maker (version 0.9, new consumer). Sometimes I find that the consumer gets stuck on certain partitions and the offset doesn’t move in that case.
>>>>>>
>>>>>>I have 10 MM processes running, each with 10 streams. The topic has 100 partitions.
>>>>>>
>>>>>>Below is a sample of the consumer output (I have cut it short; the remaining partitions, except the highlighted ones, have similar lag in the thousands).
>>>>>>I can see the log end offset and lag increasing but not the current offset, even though the MM is still running.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Anyone else facing the same problem?
>>>>>>
>>>>>>
>>>>>>Thanks
>>>>>>
>>>>>>Tushar

Re: MirrorMaker consumers getting stuck on certain partitions

Posted by "Mhaskar, Tushar" <tm...@paypal.com.INVALID>.
Yes, the max size of the messages is within 1 MB.

That is exactly the thing: we are not getting any exceptions in the log.

Nothing in common among the stuck partitions either. There are 100
partitions across 10 brokers. Every time this issue happens, the consumers
get stuck on different partitions.

I can double-check the offending partitions for message size using the logfile dump tool.

Tushar
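Spotting which partitions are stuck can be mechanized: take two runs of the offset-checker output some minutes apart and flag rows whose CURRENT OFFSET did not move. A sketch (the /tmp file paths are placeholders; the inline samples are rows lifted from the snapshots earlier in this thread):

```shell
# Two offset-checker snapshots taken some time apart (inline samples here;
# in practice redirect the checker's output to these files).
cat > /tmp/first.txt <<'EOF'
slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118488448, 8257814, slca.lvs.fpti.mm-9_/10.196.246.38
slca.lvs.fpti.mm, fpti.platform.enrch, 0, 116010184, 118476683, 2466499, slca.lvs.fpti.mm-0_/10.196.246.37
EOF
cat > /tmp/second.txt <<'EOF'
slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118506337, 8275703, slca.lvs.fpti.mm-9_/10.196.246.38
slca.lvs.fpti.mm, fpti.platform.enrch, 0, 116100206, 118494575, 2394369, slca.lvs.fpti.mm-0_/10.196.246.37
EOF
# Field 3 is the partition, field 4 the current offset; print partitions
# whose current offset is identical in both snapshots.
STUCK=$(awk -F', *' 'NR==FNR { a[$3] = $4; next }
                     ($3 in a) && a[$3] == $4 { print $3 }' /tmp/first.txt /tmp/second.txt)
echo "stuck partitions: $STUCK"
```

Partition 99 (current offset 110230634 in both runs) gets flagged; partition 0 advanced between runs, so it does not.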



On 6/8/16, 11:36 PM, "Gwen Shapira" <gw...@confluent.io> wrote:

>Hi Tushar,
>
>I'm sure you know the joke about the statistician who drowned in a
>pool with an average depth of one foot. It isn't the average size that
>will get you, it is the maximum size :)
>
>However, messages that are too large should have resulted in an
>exception being logged. Perhaps you want to double-check by using the
>logfile dump tool on one of the offending partitions; it will list the
>size of each message.
>
>I have to say that I don't have many ideas on why this happens. The
>fact that it happens with both the new and old consumers indicates it is
>not an implementation bug but rather something with the partitions /
>data itself.
>
>Anything in common for those partitions? Are they on the same broker, for instance?
>
>Gwen
>
>
>On Thu, Jun 9, 2016 at 9:24 AM, Mhaskar, Tushar
><tm...@paypal.com.invalid> wrote:
>> Hi Gwen,
>>
>> 1) We already have our log level at INFO, but we don’t see anything unless a rebalance happens, due to restarting the mirror maker or some other activity such as shutting down the mirror maker process itself.
>> 2) Our average message size is 4 KB.
>>
>> Thanks,
>> Tushar
>>
>>
>>
>>
>> On 6/8/16, 10:31 PM, "Gwen Shapira" <gw...@confluent.io> wrote:
>>
>>>Hi Tushar,
>>>
>>>A few follow-up questions:
>>>1. Can you enable logs for mirror maker (there should be a
>>>conf/tools-log4j.properties file for this) at INFO level? This can
>>>give us a clue on why it stopped.
>>>
>>>2. Did you check the FAQ for "why is my consumer hanging"? Most of the
>>>reasons (for example "message too large") can apply here too.
>>>
>>>Gwen
>>>
>>>On Wed, May 4, 2016 at 9:54 PM, Mhaskar, Tushar
>>><tm...@paypal.com.invalid> wrote:
>>>> Has anyone encountered this error?
>>>>
>>>> Thanks
>>>> Tushar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 5/2/16, 10:44 PM, "Mhaskar, Tushar" <tm...@paypal.com.INVALID> wrote:
>>>>
>>>>>Consumption on partitions 99, 0, and 98 is getting stuck.
>>>>>
>>>>>Thanks
>>>>>Tushar
>>>>>From: "Mhaskar, Tushar" <tm...@paypal.com>>
>>>>>Date: Monday, May 2, 2016 at 9:52 PM
>>>>>To: "users@kafka.apache.org<ma...@kafka.apache.org>" <us...@kafka.apache.org>>
>>>>>Subject: MirrorMaker consumers getting stuck on certain partitions
>>>>>
>>>>>Hi,
>>>>>
>>>>>I am running Mirror Maker (version 0.9, new consumer). Sometimes I find that the consumer gets stuck on certain partitions and the offset doesn’t move in that case.
>>>>>
>>>>>I have 10 MM processes running, each with 10 streams. The topic has 100 partitions.
>>>>>
>>>>>Below is a sample of the consumer output (I have cut it short; the remaining partitions, except the highlighted ones, have similar lag in the thousands).
>>>>>I can see the log end offset and lag increasing but not the current offset, even though the MM is still running.
>>>>>
>>>>>
>>>>>
>>>>>Anyone else facing the same problem?
>>>>>
>>>>>
>>>>>Thanks
>>>>>
>>>>>Tushar

Re: MirrorMaker consumers getting stuck on certain partitions

Posted by Gwen Shapira <gw...@confluent.io>.
Hi Tushar,

I'm sure you know the joke about the statistician who drowned in a
pool with an average depth of one foot. It isn't the average size that
will get you, it is the maximum size :)

However, messages that are too large should have resulted in an
exception being logged. Perhaps you want to double-check by using the
logfile dump tool on one of the offending partitions; it will list the
size of each message.

I have to say that I don't have many ideas on why this happens. The
fact that it happens with both the new and old consumers indicates it is
not an implementation bug but rather something with the partitions /
data itself.

Anything in common for those partitions? Are they on the same broker, for instance?

Gwen
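The log dump tool Gwen mentions can be driven roughly like this; the invocation below follows the 0.9-era `kafka-run-class.sh` convention, and the data-dir path, segment file name, and the exact shape of the dump output are all assumptions (two illustrative lines stand in for the real dump):

```shell
# Assumed invocation (adjust the data dir and segment file to one of the
# stuck partitions, e.g. fpti.platform.enrch-99):
#   bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
#       --files /var/kafka-logs/fpti.platform.enrch-99/00000000000110230634.log \
#       --print-data-log > dump.txt
# Per-message lines carry a payloadsize field; report the largest one.
MAX=$(printf '%s\n' \
  'offset: 110230634 position: 0 isvalid: true payloadsize: 4096 magic: 0' \
  'offset: 110230635 position: 4122 isvalid: true payloadsize: 1048578 magic: 0' |
awk '{ for (i = 1; i < NF; i++)
         if ($i == "payloadsize:" && $(i+1) + 0 > max) max = $(i+1) + 0 }
     END { print max }')
echo "max payload bytes: $MAX"
```

If the reported maximum exceeds the consumer's fetch size (fetch.message.max.bytes on the old consumer, max.partition.fetch.bytes on the new one), that would match the "message too large" FAQ entry discussed earlier in the thread.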


On Thu, Jun 9, 2016 at 9:24 AM, Mhaskar, Tushar
<tm...@paypal.com.invalid> wrote:
> Hi Gwen,
>
> 1) We already have our log level at INFO, but we don’t see anything unless a rebalance happens, due to restarting the mirror maker or some other activity such as shutting down the mirror maker process itself.
> 2) Our average message size is 4 KB.
>
> Thanks,
> Tushar
>
>
>
>
> On 6/8/16, 10:31 PM, "Gwen Shapira" <gw...@confluent.io> wrote:
>
>>Hi Tushar,
>>
>>A few follow-up questions:
>>1. Can you enable logs for mirror maker (there should be a
>>conf/tools-log4j.properties file for this) at INFO level? This can
>>give us a clue on why it stopped.
>>
>>2. Did you check the FAQ for "why is my consumer hanging"? Most of the
>>reasons (for example "message too large") can apply here too.
>>
>>Gwen
>>
>>On Wed, May 4, 2016 at 9:54 PM, Mhaskar, Tushar
>><tm...@paypal.com.invalid> wrote:
>>> Has anyone encountered this error?
>>>
>>> Thanks
>>> Tushar
>>>
>>>
>>>
>>>
>>>
>>> On 5/2/16, 10:44 PM, "Mhaskar, Tushar" <tm...@paypal.com.INVALID> wrote:
>>>
>>>>Consumption on partitions 99, 0, and 98 is getting stuck.
>>>>
>>>>Thanks
>>>>Tushar
>>>>From: "Mhaskar, Tushar" <tm...@paypal.com>>
>>>>Date: Monday, May 2, 2016 at 9:52 PM
>>>>To: "users@kafka.apache.org<ma...@kafka.apache.org>" <us...@kafka.apache.org>>
>>>>Subject: MirrorMaker consumers getting stuck on certain partitions
>>>>
>>>>Hi,
>>>>
>>>>I am running Mirror Maker (version 0.9, new consumer). Sometimes I find that the consumer gets stuck on certain partitions and the offset doesn’t move in that case.
>>>>
>>>>I have 10 MM processes running, each with 10 streams. The topic has 100 partitions.
>>>>
>>>>Below is a sample of the consumer output (I have cut it short; the remaining partitions, except the highlighted ones, have similar lag in the thousands).
>>>>I can see the log end offset and lag increasing but not the current offset, even though the MM is still running.
>>>>
>>>>
>>>>
>>>>Anyone else facing the same problem?
>>>>
>>>>
>>>>Thanks
>>>>
>>>>Tushar

Re: MirrorMaker consumers getting stuck on certain partitions

Posted by "Mhaskar, Tushar" <tm...@paypal.com.INVALID>.
Hi Gwen,

1) Our log level is already set to INFO, but we don't see anything in the logs unless a rebalance happens, e.g. when a mirror maker process is restarted or shut down.
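If INFO stays quiet, one option is to turn the consumer internals up to DEBUG. A sketch of conf/tools-log4j.properties (the appender name and logger package here are assumptions based on the stock 0.9 file; adjust to match yours):

```properties
# Sketch only: raise MirrorMaker tool logging for the new-consumer internals.
# Root stays at INFO so the output isn't flooded.
log4j.rootLogger=INFO, stderr
log4j.appender.stderr=org.apache.log4j.ConsoleAppender
log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
log4j.appender.stderr.layout.ConversionPattern=[%d] %p %m (%c)%n
# Assumed package name for the 0.9 new-consumer client classes:
log4j.logger.org.apache.kafka.clients.consumer=DEBUG
```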
2) Our average message size is 4KB.
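On the "message too large" point: with the 0.9 new consumer, a single message larger than max.partition.fetch.bytes cannot be fetched, so that one partition can stall while the others keep moving, which matches the symptom. A sketch of the consumer-side property to rule this out (the 2 MB value is an illustrative assumption; it needs to be at least as large as the brokers' max.message.bytes):

```properties
# Sketch: MirrorMaker consumer.properties fragment (value is illustrative).
# Must be >= the largest message the brokers accept (broker max.message.bytes);
# otherwise a fetch for a partition holding an oversized message never advances.
max.partition.fetch.bytes=2097152
```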

Thanks,
Tushar




On 6/8/16, 10:31 PM, "Gwen Shapira" <gw...@confluent.io> wrote:

>Hi Tushar,
>
>Few follow up questions:
>1. Can you enable logs for mirror maker (there should be
>conf/tools-log4j.properties file for this) at INFO level? This can
>give us a clue on why it stopped.
>
>2. Did you check the FAQ for "why is my consumer hanging"? Most of the
>reasons (for example "message too large") can apply here too.
>
>Gwen
>
>On Wed, May 4, 2016 at 9:54 PM, Mhaskar, Tushar
><tm...@paypal.com.invalid> wrote:
>> Anyone encountered this error?
>>
>> Thanks
>> Tushar
>>
>>
>>
>>
>>
>> On 5/2/16, 10:44 PM, "Mhaskar, Tushar" <tm...@paypal.com.INVALID> wrote:
>>
>>>Consumption on partitions 99, 0, and 98 is getting stuck.
>>>
>>>Thanks
>>>Tushar
>>>From: "Mhaskar, Tushar" <tm...@paypal.com>
>>>Date: Monday, May 2, 2016 at 9:52 PM
>>>To: "users@kafka.apache.org" <us...@kafka.apache.org>
>>>Subject: MirrorMaker consumers getting stuck on certain partitions
>>>
>>>Hi,
>>>
>>>I am running MirrorMaker (version 0.9, new consumer). Sometimes the consumer gets stuck on certain partitions and the offset doesn't move.
>>>
>>>I have 10 MM processes running, each with 10 streams. The topic has 100 partitions.
>>>
>>>Below is sample output from the consumer offset checker (truncated; apart from partitions 99, 0, and 98, the remaining partitions have similar lag in the thousands).
>>>I can see the log end offset and lag increasing, but not the current offset, even though MM is still running.
>>>
>>>
>>>GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 65, 118463859, 118471286, 7427, slca.lvs.fpti.mm-6_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 68, 118490771, 118492211, 1440, slca.lvs.fpti.mm-6_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 26, 118484045, 118493447, 9402, slca.lvs.fpti.mm-2_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 14, 118492467, 118502925, 10458, slca.lvs.fpti.mm-1_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118488448, 8257814, slca.lvs.fpti.mm-9_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118488594, 118497984, 9390, slca.lvs.fpti.mm-1_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 35, 118477757, 118478130, 373, slca.lvs.fpti.mm-3_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 80, 118472483, 118481154, 8671, slca.lvs.fpti.mm-8_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 21, 118478106, 118479540, 1434, slca.lvs.fpti.mm-2_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 59, 118487660, 118494558, 6898, slca.lvs.fpti.mm-5_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 48, 118483938, 118490119, 6181, slca.lvs.fpti.mm-4_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 95, 118490885, 118495214, 4329, slca.lvs.fpti.mm-9_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 58, 118499230, 118499608, 378, slca.lvs.fpti.mm-5_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 44, 118484130, 118491922, 7792, slca.lvs.fpti.mm-4_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 56, 118481454, 118485764, 4310, slca.lvs.fpti.mm-5_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 64, 118463461, 118471399, 7938, slca.lvs.fpti.mm-6_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 81, 118481575, 118482081, 506, slca.lvs.fpti.mm-8_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 0, 116010184, 118476683, 2466499, slca.lvs.fpti.mm-0_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 40, 118478651, 118486854, 8203, slca.lvs.fpti.mm-4_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 12, 118473021, 118481007, 7986, slca.lvs.fpti.mm-1_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 30, 118462885, 118470919, 8034, slca.lvs.fpti.mm-3_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 98, 117227057, 118484055, 1256998, slca.lvs.fpti.mm-9_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 36, 118491829, 118498669, 6840, slca.lvs.fpti.mm-3_/10.196.246.38
>>>
>>>
>>>
>>>GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 65, 118483038, 118489002, 5964, slca.lvs.fpti.mm-6_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 68, 118500463, 118510063, 9600, slca.lvs.fpti.mm-6_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 26, 118503344, 118511243, 7899, slca.lvs.fpti.mm-2_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 14, 118514764, 118520550, 5786, slca.lvs.fpti.mm-1_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 99, 110230634, 118506337, 8275703, slca.lvs.fpti.mm-9_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 19, 118507871, 118515735, 7864, slca.lvs.fpti.mm-1_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 35, 118487244, 118495868, 8624, slca.lvs.fpti.mm-3_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 80, 118491768, 118499046, 7278, slca.lvs.fpti.mm-8_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 9, 118498940, 118499194, 254, slca.lvs.fpti.mm-0_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 6, 118516902, 118517425, 523, slca.lvs.fpti.mm-0_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 61, 118514062, 118521457, 7395, slca.lvs.fpti.mm-6_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 0, 116100206, 118494575, 2394369, slca.lvs.fpti.mm-0_/10.196.246.37
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 40, 118497705, 118504514, 6809, slca.lvs.fpti.mm-4_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 12, 118492007, 118498617, 6610, slca.lvs.fpti.mm-1_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 30, 118481808, 118488785, 6977, slca.lvs.fpti.mm-3_/10.196.246.38
>>>
>>>slca.lvs.fpti.mm, fpti.platform.enrch, 98, 117318343, 118501939, 1183596, slca.lvs.fpti.mm-9_/10.196.246.37
>>>
>>>
>>>
>>>Anyone else facing the same problem?
>>>
>>>
>>>Thanks
>>>
>>>Tushar