Posted to users@kafka.apache.org by Shimi Kiviti <sh...@gmail.com> on 2017/04/30 16:58:18 UTC

Kafka-streams process stopped processing messages

Hi Everyone,

I have a problem and I hope one of you can help me figure it out.
One of our kafka-streams processes stopped processing messages.

When I turn on debug log I see lots of these messages:

2017-04-30 15:42:20,228 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher: Sending
fetch for partitions [devlast-changelog-2] to broker ip-x-x-x-x
.ec2.internal:9092 (id: 1 rack: null)
2017-04-30 15:42:20,696 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher:
Ignoring fetched records for devlast-changelog-2 at offset 2962649 since
the current position is 2963379

After a LONG time, the only messages in the log are these:

2017-04-30 16:46:33,324 [kafka-coordinator-heartbeat-thread | sa] DEBUG
o.a.k.c.c.i.AbstractCoordinator: Sending Heartbeat request for group sa to
coordinator ip-x-x-x-x.ec2.internal:9092 (id: 2147483646 rack: null)
2017-04-30 16:46:33,425 [kafka-coordinator-heartbeat-thread | sa] DEBUG
o.a.k.c.c.i.AbstractCoordinator: Received successful Heartbeat response for
group same

Any idea?

Thanks,
Shimi

Re: Kafka-streams process stopped processing messages

Posted by Guozhang Wang <wa...@gmail.com>.
Hi Shimi,

Just to clarify, your scenario is that 1) you shut down the single-instance
app cleanly, 2) you restart the app as a single instance, and that restart is
taking hours, right?

Did you use any in-memory stores (i.e. not RocksDB) in your topology?


Guozhang
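
The distinction matters because an in-memory store keeps nothing on local disk: on every restart it has to be rebuilt completely from its changelog topic before processing resumes, whereas a persistent (RocksDB) store can usually be reopened from the local state directory. A rough sketch of the two choices against the 0.10.x Stores API (store names and serdes here are illustrative, not taken from the thread):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.processor.StateStoreSupplier;
import org.apache.kafka.streams.state.Stores;

public class StoreChoices {
    public static void main(String[] args) {
        // Persistent store: backed by RocksDB on local disk; after a clean shutdown the
        // local files can typically be reused, so only the tail of the changelog is replayed.
        StateStoreSupplier persistent = Stores.create("last-store")
                .withKeys(Serdes.String())
                .withValues(Serdes.String())
                .persistent()
                .build();

        // In-memory store: nothing survives a restart, so the entire changelog topic
        // must be replayed into memory before the task can process new records.
        StateStoreSupplier inMemory = Stores.create("last-store-mem")
                .withKeys(Serdes.String())
                .withValues(Serdes.String())
                .inMemory()
                .build();

        // Either supplier is then wired into the topology, e.g. with
        // TopologyBuilder#addStateStore(supplier, processorNames...).
        System.out.println(persistent.name() + " vs " + inMemory.name());
    }
}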


On Tue, May 16, 2017 at 1:08 AM, Eno Thereska <en...@gmail.com>
wrote:

> Hi Shimi,
>
> Could we start a new email thread on the slow booting to separate it from
> the initial thread (call it "slow boot" or something)? Thank you. Also,
> could you provide the logs for the booting part if possible, together with
> your streams config.
>
> Thanks
> Eno
> > On 15 May 2017, at 20:49, Shimi Kiviti <sh...@gmail.com> wrote:
> >
> > I do run the clients with 0.10.2.1 and it takes hours.
> > What I don't understand is why it takes hours to boot on a server that has
> > all the data in RocksDB already. Is that related to the amount of data in
> > RocksDB (changelog topics) or the data in the source topic the processes
> > read from?
> > On Mon, 15 May 2017 at 20:32 Guozhang Wang <wa...@gmail.com> wrote:
> >
> >> Hello Shimi,
> >>
> >> Could you try upgrading your clients to 0.10.2.1 (note you do not need to
> >> upgrade your servers if they are already on 0.10.1, since newer Streams
> >> clients can talk directly to older-versioned brokers from 0.10.1 onwards) and try
> >> it out again? I have a few optimizations to reduce rebalance latencies in
> >> both the underlying consumer client and the streams library, and
> >> hopefully they will help with your rebalance issues.
> >>
> >> Also, we have a bunch more fixes for consumer rebalance that we have
> >> already pushed to trunk, and hence they will be included in the upcoming June
> >> release of 0.11.0.0.
> >>
> >>
> >> Guozhang
> >>
> >> On Sat, May 13, 2017 at 12:32 PM, Shimi Kiviti <sh...@gmail.com>
> wrote:
> >>
> >>> I tried all these configurations and now, like with version 0.10.1.1, I see a very
> >>> slow startup.
> >>> I decreased the cluster to a single server, which was running without any
> >>> problem for a few hours. Now, each time I restart this process it gets into
> >>> a rebalancing state for several hours.
> >>> That means that every time we need to deploy a new version of our app (which
> >>> can be several times a day) we have downtime of hours.
> >>>
> >>>
> >>> On Sat, May 6, 2017 at 5:13 PM, Eno Thereska <en...@gmail.com>
> >>> wrote:
> >>>
> >>>> Yeah, we’ve seen cases where the session timeout might also need increasing.
> >>>> Could you try upping it to something like 60000ms and let us know how it
> >>>> goes:
> >>>>
> >>>>>> streamsProps.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);
> >>>>
> >>>>
> >>>> Thanks
> >>>> Eno
> >>>>
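
For context, a minimal sketch of a streams configuration with the suggested session timeout plus a matching heartbeat interval (the application id, broker list and the 60s/20s values are only examples; the heartbeat interval should stay well below the session timeout):

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class TimeoutConfigSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sa");                // example application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // example broker list

        // Plain ConsumerConfig keys can be set on the streams properties,
        // as the snippet above does; Streams passes them on to its consumers.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);     // more headroom before eviction from the group
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 20000);  // keep well below the session timeout

        return props;
    }
}
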
> >>>>> On May 6, 2017, at 8:35 AM, Shimi Kiviti <sh...@gmail.com> wrote:
> >>>>>
> >>>>> Thanks Eno,
> >>>>> I already set the receive buffer size to 1MB.
> >>>>> I will also try the producer setting.
> >>>>>
> >>>>> What about the session timeout and heartbeat timeout? Do you think they
> >>>>> should be increased?
> >>>>>
> >>>>> Thanks,
> >>>>> Shimi
> >>>>>
> >>>>> On Sat, 6 May 2017 at 0:21 Eno Thereska <en...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>>> Hi Shimi,
> >>>>>>
> >>>>>> I’ve noticed with our benchmarks that on AWS environments with high
> >>>>>> network latency the network socket buffers often need adjusting. Any chance
> >>>>>> you could add the following to your streams configuration to change the
> >>>>>> default socket buffer size to a higher value (at least 1MB) and let us know?
> >>>>>>
> >>>>>> private static final int SOCKET_SIZE_BYTES = 1 * 1024 * 1024; // at least 1MB
> >>>>>> streamsProps.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
> >>>>>> streamsProps.put(ProducerConfig.SEND_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
> >>>>>>
> >>>>>> Thanks
> >>>>>> Eno
> >>>>>>
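
The same suggestion as a self-contained sketch (only an illustration; the application id and broker address are placeholders, and the ConsumerConfig/ProducerConfig keys are passed through the streams properties as in the snippet above):

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class SocketBufferConfigSketch {
    private static final int SOCKET_SIZE_BYTES = 1024 * 1024; // at least 1MB, as suggested above

    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sa");                // example application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // example broker list

        // Bigger TCP buffers keep the pipe full on high-latency links such as cross-AZ traffic in AWS.
        props.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, SOCKET_SIZE_BYTES); // consumer fetch side
        props.put(ProducerConfig.SEND_BUFFER_CONFIG, SOCKET_SIZE_BYTES);    // producer send side

        return props;
    }
}
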
> >>>>>>> On May 4, 2017, at 3:45 PM, Shimi Kiviti <sh...@gmail.com>
> >> wrote:
> >>>>>>>
> >>>>>>> Thanks Eno,
> >>>>>>>
> >>>>>>> We still see problems on our side.
> >>>>>>> When we run kafka-streams 0.10.1.1 the problem eventually goes away, but
> >>>>>>> with 0.10.2.1 it does not.
> >>>>>>> We see a lot of the rebalancing messages I wrote about before.
> >>>>>>>
> >>>>>>> On at least one of the kafka-streams nodes we see disconnection messages like the
> >>>>>>> following. These messages repeat all the time:
> >>>>>>>
> >>>>>>> 2017-05-04 14:25:56,063 [StreamThread-1] INFO
> >>>>>>> o.a.k.c.c.i.AbstractCoordinator: Discovered coordinator
> >>>>>>> ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) for
> >>> group
> >>>> sa.
> >>>>>>> 2017-05-04 14:25:56,063 [StreamThread-1] DEBUG
> >> o.a.k.c.NetworkClient:
> >>>>>>> Initiating connection to node 2147483646 at
> >>>>>> ip-10-0-91-10.ec2.internal:9092.
> >>>>>>> 2017-05-04 14:25:56,091 [StreamThread-1] INFO
> >>>>>>> o.a.k.c.c.i.AbstractCoordinator: (Re-)joining group sa
> >>>>>>> 2017-05-04 14:25:56,093 [StreamThread-1] DEBUG
> >>>>>>> o.a.k.s.p.i.StreamPartitionAssignor: stream-thread
> [StreamThread-1]
> >>>> found
> >>>>>>> [sa-events] topics possibly matching regex
> >>>>>>> 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG
> >>>> o.a.k.s.p.TopologyBuilder:
> >>>>>>> stream-thread [StreamThread-1] updating builder with
> >>>>>>> SubscriptionUpdates{updatedTopicSubscriptions=[sa-events]}
> topic(s)
> >>>> with
> >>>>>> po
> >>>>>>> ssible matching regex subscription(s)
> >>>>>>> 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG
> >>>>>>> o.a.k.c.c.i.AbstractCoordinator: Sending JoinGroup ((type:
> >>>>>>> JoinGroupRequest, groupId=sa, sessionTimeout=10000,
> >>>>>>> rebalanceTimeout=2147483647, memb
> >>>>>>> erId=, protocolType=consumer,
> >>>>>>>
> >>>>>> groupProtocols=org.apache.kafka.common.requests.JoinGroupRequest$
> >>>> ProtocolMetadata@2f894d9b
> >>>>>> ))
> >>>>>>> to coordinator ip-10-0-91-10.ec2.internal:9092 (id: 2147483646
> >> rack:
> >>>>>> null)
> >>>>>>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.n.Selector:
> >>>>>> Created
> >>>>>>> socket with SO_RCVBUF = 1048576, SO_SNDBUF = 131072, SO_TIMEOUT = 0
> >>> to
> >>>>>> node
> >>>>>>> 2147483646
> >>>>>>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG
> >> o.a.k.c.NetworkClient:
> >>>>>>> Completed connection to node 2147483646.  Fetching API versions.
> >>>>>>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG
> >> o.a.k.c.NetworkClient:
> >>>>>>> Initiating API versions fetch from node 2147483646.
> >>>>>>> 2017-05-04 14:25:56,104 [StreamThread-1] DEBUG
> >> o.a.k.c.NetworkClient:
> >>>>>>> Recorded API versions for node 2147483646: (Produce(0): 0 to 2
> >>> [usable:
> >>>>>> 2],
> >>>>>>> Fetch(1): 0 to 3 [usable: 3], Offsets(2): 0 to 1 [usable: 1],
> >>>>>>> Metadata(3): 0 to 2 [usable: 2], LeaderAndIsr(4): 0 [usable: 0],
> >>>>>>> StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 2 [usable:
> >> 2],
> >>>>>>> ControlledShutdown(7): 1 [usable: 1], OffsetCommit(8): 0 to 2
> >>> [usable:
> >>>>>>> 2], OffsetFetch(9): 0 to 1 [usable: 1], GroupCoordinator(10): 0
> >>>> [usable:
> >>>>>>> 0], JoinGroup(11): 0 to 1 [usable: 1], Heartbeat(12): 0 [usable:
> >> 0],
> >>>>>>> LeaveGroup(13): 0 [usable: 0], SyncGroup(14): 0 [usable: 0], Desc
> >>>>>>> ribeGroups(15): 0 [usable: 0], ListGroups(16): 0 [usable: 0],
> >>>>>>> SaslHandshake(17): 0 [usable: 0], ApiVersions(18): 0 [usable: 0],
> >>>>>>> CreateTopics(19): 0 [usable: 0], DeleteTopics(20): 0 [usable: 0])
> >>>>>>> 2017-05-04 14:29:44,800 [kafka-producer-network-thread |
> >>>>>>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer]
> >>> DEBUG
> >>>>>>> o.a.k.c.NetworkClient: Node -2 disconnected.
> >>>>>>> 2017-05-04 14:29:44,801 [kafka-producer-network-thread |
> >>>>>>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer]
> >>> DEBUG
> >>>>>>> o.a.k.c.NetworkClient: Sending metadata request (type=MetadataR
> >>>>>>> equest, topics=) to node 1
> >>>>>>> 2017-05-04 14:29:44,801 [kafka-producer-network-thread |
> >>>>>>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer]
> >>> DEBUG
> >>>>>>> o.a.k.c.NetworkClient: Node -1 disconnected.
> >>>>>>> 2017-05-04 14:29:44,802 [kafka-producer-network-thread |
> >>>>>>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer]
> >>> DEBUG
> >>>>>>> o.a.k.c.Metadata: Updated cluster metadata version 4 to Cluster
> >>>>>>> (id = JsVqjH3tS4CIcqpd2jkogA, nodes = [ip-10-0-91-10.ec2.internal:
> >>> 9092
> >>>>>> (id:
> >>>>>>> 1 rack: null), ip-10-0-95-250.ec2.internal:9092 (id: 2 rack:
> >> null)],
> >>>>>>> partitions = [])
> >>>>>>> 2017-05-04 14:30:56,062 [StreamThread-1] DEBUG
> >> o.a.k.c.NetworkClient:
> >>>>>>> Sending metadata request (type=MetadataRequest, topics=<ALL>) to
> >>> node 2
> >>>>>>> 2017-05-04 14:30:56,073 [StreamThread-1] DEBUG o.a.k.c.Metadata:
> >>>> Updated
> >>>>>>> cluster metadata version 7 to Cluster(id = JsVqjH3tS4CIcqpd2jkogA,
> >>>> nodes
> >>>>>> =
> >>>>>>> [ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null), ip-10
> >>>>>>> -0-91-10.ec2.internal:9092 (id: 1 rack: null)], partitions =
> >>>>>>> [Partition(topic = sa-events, partition = 0, leader = 1, replicas =
> >>>>>> [1,2],
> >>>>>>> isr = [2,1]), Partition(topic = sa-events, partition = 1, lea
> >>>>>>> der = 2, replicas = [1,2], isr = [2,1]), Partition(topic =
> >> sa-events,
> >>>>>>> partition = 2, leader = 1, replicas = [1,2], isr = [2,1])])
> >>>>>>> 2017-05-04 14:31:06,085 [StreamThread-1] DEBUG
> >> o.a.k.c.NetworkClient:
> >>>>>>> Disconnecting from node 2147483646 due to request timeout.
> >>>>>>> 2017-05-04 14:31:06,086 [StreamThread-1] DEBUG
> >>>>>>> o.a.k.c.c.i.ConsumerNetworkClient: Cancelled JOIN_GROUP request
> >>>>>>>
> >>>>>> {api_key=11,api_version=1,correlation_id=16,client_id=
> >>>> sa-5788b5a5-aadc-4276-916f
> >>>>>>> -1640008c17da-StreamThread-1-consumer} with correlation id 16 due
> >> to
> >>>> node
> >>>>>>> 2147483646 being disconnected
> >>>>>>> 2017-05-04 14:31:06,086 [StreamThread-1] INFO
> >>>>>>> o.a.k.c.c.i.AbstractCoordinator: Marking the coordinator
> >>>>>>> ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) dead
> >> for
> >>>>>> group
> >>>>>>> sa
> >>>>>>> 2017-05-04 14:31:06,195 [StreamThread-1] DEBUG
> >>>>>>> o.a.k.c.c.i.AbstractCoordinator: Sending GroupCoordinator request
> >>> for
> >>>>>> group
> >>>>>>> sa to broker ip-10-0-91-10.ec2.internal:9092 (id: 1 rack: null)
> >>>>>>> 2017-05-04 14:31:06,196 [StreamThread-1] DEBUG
> >> o.a.k.c.NetworkClient:
> >>>>>>> Sending metadata request (type=MetadataRequest, topics=<ALL>) to
> >>> node 2
> >>>>>>> 2017-05-04 14:31:06,200 [StreamThread-1] DEBUG
> >>>>>>> o.a.k.c.c.i.AbstractCoordinator: Received GroupCoordinator
> response
> >>>>>>> ClientResponse(receivedTimeMs=1493908266200, latencyMs=5,
> >>>>>>> disconnected=false, requestHeader=
> >>>>>>>
> >>>>>> {api_key=10,api_version=0,correlation_id=19,client_id=
> >>>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-consumer},
> >>>>>>> responseBody={error_code=0,coordinator={node_id=1,host=ip
> >>>> -10-0-91-10.ec
> >>>>>>> 2.internal,port=9092}}) for group sa
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, May 1, 2017 at 4:19 PM, Eno Thereska <
> >> eno.thereska@gmail.com
> >>>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Shimi,
> >>>>>>>>
> >>>>>>>> 0.10.2.1 contains a number of fixes that should make the out-of-the-box
> >>>>>>>> experience better, including resiliency under broker failures and better
> >>>>>>>> exception handling. If you ever get back to it, and if the problem happens
> >>>>>>>> again, please do send us the logs and we'll happily have a look.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Eno
> >>>>>>>>> On 1 May 2017, at 12:05, Shimi Kiviti <sh...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Eno,
> >>>>>>>>> I am afraid I played too much with the configuration to make this a
> >>>>>>>>> productive investigation :(
> >>>>>>>>>
> >>>>>>>>> This is a QA environment which includes 2 Kafka instances and 3 zookeeper
> >>>>>>>>> instances in AWS. There are only 3 partitions for this topic.
> >>>>>>>>> The Kafka brokers and kafka-streams are version 0.10.1.1.
> >>>>>>>>> Our kafka-streams app runs on Docker using Kubernetes.
> >>>>>>>>> I played around with 1 to 3 kafka-streams processes, but I got the
> >>>>>>>>> same results. It is too easy to scale with Kubernetes :)
> >>>>>>>>> Since there are only 3 partitions, I didn't start more than 3 instances.
> >>>>>>>>>
> >>>>>>>>> I was too quick to upgrade only the kafka-streams app to 0.10.2.1 in the hope
> >>>>>>>>> that it would solve the problem. It didn't.
> >>>>>>>>> The logs I sent before are from this version.
> >>>>>>>>>
> >>>>>>>>> I did notice an "unknown" offset for the main topic with kafka-streams
> >>>>>>>>> version 0.10.2.1:
> >>>>>>>>> $ ./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group sa
> >>>>>>>>> GROUP  TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG      OWNER
> >>>>>>>>> sa     sa-events  0          842199          842199          0        sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/10.0.10.9
> >>>>>>>>> sa     sa-events  1          1078428         1078428         0        sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/10.0.10.9
> >>>>>>>>> sa     sa-events  2          unknown         26093910        unknown  sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/10.0.10.9
> >>>>>>>>>
> >>>>>>>>> After that I downgraded the kafka-streams app back to version 0.10.1.1.
> >>>>>>>>> After a LONG startup time (more than an hour) during which the status of the
> >>>>>>>>> group was rebalancing, all 3 processes started processing messages again.
> >>>>>>>>>
> >>>>>>>>> This whole thing started after we hit a bug in our code (an NPE) that crashed
> >>>>>>>>> the stream processing thread.
> >>>>>>>>> So now, after 4 days, everything is back to normal.
> >>>>>>>>> This worries me since it can happen again.
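
One thing not shown in the thread: a StreamThread that dies from an uncaught exception (like the NPE mentioned here) fails silently unless a handler is registered on the KafkaStreams instance. A minimal sketch, assuming a 0.10.x-style topology with illustrative topic, application id and broker names:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class CrashVisibleStreams {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sa");                // example application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // example broker list
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        KStreamBuilder builder = new KStreamBuilder();
        builder.<String, String>stream("sa-events")
               .foreach((key, value) -> { /* application logic goes here */ });

        KafkaStreams streams = new KafkaStreams(builder, props);

        // Log (and alert on) stream-thread deaths instead of letting the app sit idle.
        streams.setUncaughtExceptionHandler((thread, throwable) ->
                System.err.println("Stream thread " + thread.getName() + " died: " + throwable));

        streams.start();
    }
}
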
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, May 1, 2017 at 11:45 AM, Eno Thereska <
> >>>> eno.thereska@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Shimi,
> >>>>>>>>>>
> >>>>>>>>>> Could you provide more info on your setup? How many Kafka Streams
> >>>>>>>>>> processes do you have, and how many partitions are they consuming from?
> >>>>>>>>>> If you have more processes than partitions, some of the processes will
> >>>>>>>>>> be idle and won’t do anything.
> >>>>>>>>>>
> >>>>>>>>>> Eno
> >>>>>>>>>>> On Apr 30, 2017, at 5:58 PM, Shimi Kiviti <sh...@gmail.com>
> >>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> I have a problem and I hope one of you can help me figuring it
> >>> out.
> >>>>>>>>>>> One of our kafka-streams processes stopped processing messages
> >>>>>>>>>>>
> >>>>>>>>>>> When I turn on debug log I see lots of these messages:
> >>>>>>>>>>>
> >>>>>>>>>>> 2017-04-30 15:42:20,228 [StreamThread-1] DEBUG
> >>> o.a.k.c.c.i.Fetcher:
> >>>>>>>>>> Sending
> >>>>>>>>>>> fetch for partitions [devlast-changelog-2] to broker ip-x-x-x-x
> >>>>>>>>>>> .ec2.internal:9092 (id: 1 rack: null)
> >>>>>>>>>>> 2017-04-30 15:42:20,696 [StreamThread-1] DEBUG
> >>> o.a.k.c.c.i.Fetcher:
> >>>>>>>>>>> Ignoring fetched records for devlast-changelog-2 at offset
> >>> 2962649
> >>>>>>>> since
> >>>>>>>>>>> the current position is 2963379
> >>>>>>>>>>>
> >>>>>>>>>>> After a LONG time, the only messages in the log are these:
> >>>>>>>>>>>
> >>>>>>>>>>> 2017-04-30 16:46:33,324 [kafka-coordinator-heartbeat-thread |
> >>> sa]
> >>>>>>>> DEBUG
> >>>>>>>>>>> o.a.k.c.c.i.AbstractCoordinator: Sending Heartbeat request for
> >>>> group
> >>>>>>>> sa
> >>>>>>>>>> to
> >>>>>>>>>>> coordinator ip-x-x-x-x.ec2.internal:9092 (id: 2147483646 rack:
> >>>> null)
> >>>>>>>>>>> 2017-04-30 16:46:33,425 [kafka-coordinator-heartbeat-thread |
> >>> sa]
> >>>>>>>> DEBUG
> >>>>>>>>>>> o.a.k.c.c.i.AbstractCoordinator: Received successful Heartbeat
> >>>>>>>> response
> >>>>>>>>>> for
> >>>>>>>>>>> group same
> >>>>>>>>>>>
> >>>>>>>>>>> Any idea?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Shimi
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> -- Guozhang
> >>
>
>


-- 
-- Guozhang

> > >>>> 10.0.10.9
> > >>>>> sa             sa-events                 1          1078428
> > >>>>> 1078428         0
> > >>>>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
> > >>>> 10.0.10.9
> > >>>>> sa             sa-events                 2          unknown
> > >>>>> 26093910        unknown
> > >>>>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
> > >>>> 10.0.10.9
> > >>>>>
> > >>>>> After that I downgraded the kafka-stream app back to version
> 0.10.1.1
> > >>>>> After a LONG startup time (more than an hour) where the status of
> the
> > >>>> group
> > >>>>> was rebalancing, all the 3 processes started processing messages
> > again.
> > >>>>>
> > >>>>> This all thing started after we hit a bug in our code (NPE) that
> > >> crashed
> > >>>>> the stream processing thread.
> > >>>>> So now after 4 days, everything is back to normal.
> > >>>>> This worries me since it can happen again
> > >>>>>
> > >>>>>
> > >>>>> On Mon, May 1, 2017 at 11:45 AM, Eno Thereska <
> > eno.thereska@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi Shimi,
> > >>>>>>
> > >>>>>> Could you provide more info on your setup? How many kafka streams
> > >>>>>> processes do you have and from how many partitions are they
> > consuming
> > >>>> from.
> > >>>>>> If you have more processes than partitions some of the processes
> > will
> > >> be
> > >>>>>> idle and won’t do anything.
> > >>>>>>
> > >>>>>> Eno
> > >>>>>>> On Apr 30, 2017, at 5:58 PM, Shimi Kiviti <sh...@gmail.com>
> > wrote:
> > >>>>>>>
> > >>>>>>> Hi Everyone,
> > >>>>>>>
> > >>>>>>> I have a problem and I hope one of you can help me figuring it
> out.
> > >>>>>>> One of our kafka-streams processes stopped processing messages
> > >>>>>>>
> > >>>>>>> When I turn on debug log I see lots of these messages:
> > >>>>>>>
> > >>>>>>> 2017-04-30 15:42:20,228 [StreamThread-1] DEBUG
> o.a.k.c.c.i.Fetcher:
> > >>>>>> Sending
> > >>>>>>> fetch for partitions [devlast-changelog-2] to broker ip-x-x-x-x
> > >>>>>>> .ec2.internal:9092 (id: 1 rack: null)
> > >>>>>>> 2017-04-30 15:42:20,696 [StreamThread-1] DEBUG
> o.a.k.c.c.i.Fetcher:
> > >>>>>>> Ignoring fetched records for devlast-changelog-2 at offset
> 2962649
> > >>>> since
> > >>>>>>> the current position is 2963379
> > >>>>>>>
> > >>>>>>> After a LONG time, the only messages in the log are these:
> > >>>>>>>
> > >>>>>>> 2017-04-30 16:46:33,324 [kafka-coordinator-heartbeat-thread |
> sa]
> > >>>> DEBUG
> > >>>>>>> o.a.k.c.c.i.AbstractCoordinator: Sending Heartbeat request for
> > group
> > >>>> sa
> > >>>>>> to
> > >>>>>>> coordinator ip-x-x-x-x.ec2.internal:9092 (id: 2147483646 rack:
> > null)
> > >>>>>>> 2017-04-30 16:46:33,425 [kafka-coordinator-heartbeat-thread |
> sa]
> > >>>> DEBUG
> > >>>>>>> o.a.k.c.c.i.AbstractCoordinator: Received successful Heartbeat
> > >>>> response
> > >>>>>> for
> > >>>>>>> group same
> > >>>>>>>
> > >>>>>>> Any idea?
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Shimi
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >
>



-- 
-- Guozhang

Re: Kafka-streams process stopped processing messages

Posted by Shimi Kiviti <sh...@gmail.com>.
I tried all these configurations and now, just as with version 0.10.1.1, I see a
very slow startup.
I decreased the cluster to a single server, which ran without any problem for a
few hours. Now, each time I restart this process it stays in the rebalancing
state for several hours.
That means that every time we need to deploy a new version of our app (which
can be several times a day) we have a downtime of hours.


On Sat, May 6, 2017 at 5:13 PM, Eno Thereska <en...@gmail.com> wrote:

> Yeah we’ve seen cases when the session timeout might also need increasing.
> Could you try upping it to something like 60000ms and let us know how it
> goes:
>
> >> streamsProps.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);
>
>
> Thanks
> Eno
>
> > On May 6, 2017, at 8:35 AM, Shimi Kiviti <sh...@gmail.com> wrote:
> >
> > Thanks Eno,
> > I already set the receive buffer size to 1MB
> > I will also try the producer send buffer
> >
> > What about session timeout and heart beat timeout? Do you think it should
> > be increased?
> >
> > Thanks,
> > Shimi
> >
> > On Sat, 6 May 2017 at 0:21 Eno Thereska <en...@gmail.com> wrote:
> >
> >> Hi Shimi,
> >>
> >> I’ve noticed with our benchmarks that on AWS environments with high
> >> network latency the network socket buffers often need adjusting. Any
> chance
> >> you could add the following to your streams configuration to change the
> >> default socket size bytes to a higher value (at least 1MB) and let us
> know?
> >>
> >> private static final int SOCKET_SIZE_BYTES = 1 * 1024 * 1024; // at
> least
> >> 1MB
> >> streamsProps.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG,
> SOCKET_SIZE_BYTES);
> >> streamsProps.put(ProducerConfig.SEND_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
> >>
> >> Thanks
> >> Eno
> >>
> >>> On May 4, 2017, at 3:45 PM, Shimi Kiviti <sh...@gmail.com> wrote:
> >>>
> >>> Thanks Eno,
> >>>
> >>> We still see problems on our side.
> >>> when we run kafka-streams 0.10.1.1 eventually the problem goes away but
> >>> with 0.10.2.1 it is not.
> >>> We see a lot of the rebalancing messages I wrote before
> >>>
> >>> on at least 1 kafka-stream nodes we see disconnection messages like the
> >>> following. These messages repeat all the time
> >>>
> >>> 2017-05-04 14:25:56,063 [StreamThread-1] INFO
> >>> o.a.k.c.c.i.AbstractCoordinator: Discovered coordinator
> >>> ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) for group
> sa.
> >>> 2017-05-04 14:25:56,063 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> >>> Initiating connection to node 2147483646 at
> >> ip-10-0-91-10.ec2.internal:9092.
> >>> 2017-05-04 14:25:56,091 [StreamThread-1] INFO
> >>> o.a.k.c.c.i.AbstractCoordinator: (Re-)joining group sa
> >>> 2017-05-04 14:25:56,093 [StreamThread-1] DEBUG
> >>> o.a.k.s.p.i.StreamPartitionAssignor: stream-thread [StreamThread-1]
> found
> >>> [sa-events] topics possibly matching regex
> >>> 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG
> o.a.k.s.p.TopologyBuilder:
> >>> stream-thread [StreamThread-1] updating builder with
> >>> SubscriptionUpdates{updatedTopicSubscriptions=[sa-events]} topic(s)
> with
> >> po
> >>> ssible matching regex subscription(s)
> >>> 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG
> >>> o.a.k.c.c.i.AbstractCoordinator: Sending JoinGroup ((type:
> >>> JoinGroupRequest, groupId=sa, sessionTimeout=10000,
> >>> rebalanceTimeout=2147483647, memb
> >>> erId=, protocolType=consumer,
> >>>
> >> groupProtocols=org.apache.kafka.common.requests.JoinGroupRequest$
> ProtocolMetadata@2f894d9b
> >> ))
> >>> to coordinator ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack:
> >> null)
> >>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.n.Selector:
> >> Created
> >>> socket with SO_RCVBUF = 1048576, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to
> >> node
> >>> 2147483646
> >>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> >>> Completed connection to node 2147483646.  Fetching API versions.
> >>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> >>> Initiating API versions fetch from node 2147483646.
> >>> 2017-05-04 14:25:56,104 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> >>> Recorded API versions for node 2147483646: (Produce(0): 0 to 2 [usable:
> >> 2],
> >>> Fetch(1): 0 to 3 [usable: 3], Offsets(2): 0 to 1 [usable: 1],
> >>> Metadata(3): 0 to 2 [usable: 2], LeaderAndIsr(4): 0 [usable: 0],
> >>> StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 2 [usable: 2],
> >>> ControlledShutdown(7): 1 [usable: 1], OffsetCommit(8): 0 to 2 [usable:
> >>> 2], OffsetFetch(9): 0 to 1 [usable: 1], GroupCoordinator(10): 0
> [usable:
> >>> 0], JoinGroup(11): 0 to 1 [usable: 1], Heartbeat(12): 0 [usable: 0],
> >>> LeaveGroup(13): 0 [usable: 0], SyncGroup(14): 0 [usable: 0], Desc
> >>> ribeGroups(15): 0 [usable: 0], ListGroups(16): 0 [usable: 0],
> >>> SaslHandshake(17): 0 [usable: 0], ApiVersions(18): 0 [usable: 0],
> >>> CreateTopics(19): 0 [usable: 0], DeleteTopics(20): 0 [usable: 0])
> >>> 2017-05-04 14:29:44,800 [kafka-producer-network-thread |
> >>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> >>> o.a.k.c.NetworkClient: Node -2 disconnected.
> >>> 2017-05-04 14:29:44,801 [kafka-producer-network-thread |
> >>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> >>> o.a.k.c.NetworkClient: Sending metadata request (type=MetadataR
> >>> equest, topics=) to node 1
> >>> 2017-05-04 14:29:44,801 [kafka-producer-network-thread |
> >>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> >>> o.a.k.c.NetworkClient: Node -1 disconnected.
> >>> 2017-05-04 14:29:44,802 [kafka-producer-network-thread |
> >>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> >>> o.a.k.c.Metadata: Updated cluster metadata version 4 to Cluster
> >>> (id = JsVqjH3tS4CIcqpd2jkogA, nodes = [ip-10-0-91-10.ec2.internal:9092
> >> (id:
> >>> 1 rack: null), ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null)],
> >>> partitions = [])
> >>> 2017-05-04 14:30:56,062 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> >>> Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
> >>> 2017-05-04 14:30:56,073 [StreamThread-1] DEBUG o.a.k.c.Metadata:
> Updated
> >>> cluster metadata version 7 to Cluster(id = JsVqjH3tS4CIcqpd2jkogA,
> nodes
> >> =
> >>> [ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null), ip-10
> >>> -0-91-10.ec2.internal:9092 (id: 1 rack: null)], partitions =
> >>> [Partition(topic = sa-events, partition = 0, leader = 1, replicas =
> >> [1,2],
> >>> isr = [2,1]), Partition(topic = sa-events, partition = 1, lea
> >>> der = 2, replicas = [1,2], isr = [2,1]), Partition(topic = sa-events,
> >>> partition = 2, leader = 1, replicas = [1,2], isr = [2,1])])
> >>> 2017-05-04 14:31:06,085 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> >>> Disconnecting from node 2147483646 due to request timeout.
> >>> 2017-05-04 14:31:06,086 [StreamThread-1] DEBUG
> >>> o.a.k.c.c.i.ConsumerNetworkClient: Cancelled JOIN_GROUP request
> >>>
> >> {api_key=11,api_version=1,correlation_id=16,client_id=
> sa-5788b5a5-aadc-4276-916f
> >>> -1640008c17da-StreamThread-1-consumer} with correlation id 16 due to
> node
> >>> 2147483646 being disconnected
> >>> 2017-05-04 14:31:06,086 [StreamThread-1] INFO
> >>> o.a.k.c.c.i.AbstractCoordinator: Marking the coordinator
> >>> ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) dead for
> >> group
> >>> sa
> >>> 2017-05-04 14:31:06,195 [StreamThread-1] DEBUG
> >>> o.a.k.c.c.i.AbstractCoordinator: Sending GroupCoordinator request for
> >> group
> >>> sa to broker ip-10-0-91-10.ec2.internal:9092 (id: 1 rack: null)
> >>> 2017-05-04 14:31:06,196 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> >>> Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
> >>> 2017-05-04 14:31:06,200 [StreamThread-1] DEBUG
> >>> o.a.k.c.c.i.AbstractCoordinator: Received GroupCoordinator response
> >>> ClientResponse(receivedTimeMs=1493908266200, latencyMs=5,
> >>> disconnected=false, requestHeader=
> >>>
> >> {api_key=10,api_version=0,correlation_id=19,client_id=
> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-consumer},
> >>> responseBody={error_code=0,coordinator={node_id=1,host=ip
> -10-0-91-10.ec
> >>> 2.internal,port=9092}}) for group sa
> >>>
> >>>
> >>> On Mon, May 1, 2017 at 4:19 PM, Eno Thereska <en...@gmail.com>
> >> wrote:
> >>>
> >>>> Hi Shimi,
> >>>>
> >>>> 0.10.2.1 contains a number of fixes that should make the out of box
> >>>> experience better, including resiliency under broker failures and
> better
> >>>> exception handling. If you ever get back to it, and if the problem
> >> happens
> >>>> again, please do send us the logs and we'll happily have a look.
> >>>>
> >>>> Thanks
> >>>> Eno
> >>>>> On 1 May 2017, at 12:05, Shimi Kiviti <sh...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Eno,
> >>>>> I am afraid I played too much with the configuration to make this
> >>>>> productive investigation :(
> >>>>>
> >>>>> This is a QA environment which includes 2 kafka instances and 3
> >> zookeeper
> >>>>> instances in AWS. There are only 3 partition for this topic.
> >>>>> Kafka broker and kafka-stream are version 0.10.1.1
> >>>>> Our kafka-stream app run on docker using kubernetes.
> >>>>> I played around with with 1 to 3  kafka-stream processes, but I got
> the
> >>>>> same results. It is too easy to scale with kubernetes :)
> >>>>> Since there are only 3 partitions, I didn't start more then 3
> >> instances.
> >>>>>
> >>>>> I was too quick to upgraded only the kafka-stream app to 0.10.2.1
> with
> >>>> hope
> >>>>> that it will solve the problem, It didn't.
> >>>>> The log I sent before are from this version.
> >>>>>
> >>>>> I did notice "unknown" offset for the main topic with kafka-stream
> >>>> version
> >>>>> 0.10.2.1
> >>>>> $ ./bin/kafka-consumer-groups.sh   --bootstrap-server localhost:9092
> >>>>> --describe --group sa
> >>>>> GROUP                          TOPIC
> PARTITION
> >>>>> CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
> >>>>> sa             sa-events                 0          842199
> >>>>> 842199          0
> >>>>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
> >>>> 10.0.10.9
> >>>>> sa             sa-events                 1          1078428
> >>>>> 1078428         0
> >>>>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
> >>>> 10.0.10.9
> >>>>> sa             sa-events                 2          unknown
> >>>>> 26093910        unknown
> >>>>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
> >>>> 10.0.10.9
> >>>>>
> >>>>> After that I downgraded the kafka-stream app back to version 0.10.1.1
> >>>>> After a LONG startup time (more than an hour) where the status of the
> >>>> group
> >>>>> was rebalancing, all the 3 processes started processing messages
> again.
> >>>>>
> >>>>> This all thing started after we hit a bug in our code (NPE) that
> >> crashed
> >>>>> the stream processing thread.
> >>>>> So now after 4 days, everything is back to normal.
> >>>>> This worries me since it can happen again
> >>>>>
> >>>>>
> >>>>> On Mon, May 1, 2017 at 11:45 AM, Eno Thereska <
> eno.thereska@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Shimi,
> >>>>>>
> >>>>>> Could you provide more info on your setup? How many kafka streams
> >>>>>> processes do you have and from how many partitions are they
> consuming
> >>>> from.
> >>>>>> If you have more processes than partitions some of the processes
> will
> >> be
> >>>>>> idle and won’t do anything.
> >>>>>>
> >>>>>> Eno
> >>>>>>> On Apr 30, 2017, at 5:58 PM, Shimi Kiviti <sh...@gmail.com>
> wrote:
> >>>>>>>
> >>>>>>> Hi Everyone,
> >>>>>>>
> >>>>>>> I have a problem and I hope one of you can help me figuring it out.
> >>>>>>> One of our kafka-streams processes stopped processing messages
> >>>>>>>
> >>>>>>> When I turn on debug log I see lots of these messages:
> >>>>>>>
> >>>>>>> 2017-04-30 15:42:20,228 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher:
> >>>>>> Sending
> >>>>>>> fetch for partitions [devlast-changelog-2] to broker ip-x-x-x-x
> >>>>>>> .ec2.internal:9092 (id: 1 rack: null)
> >>>>>>> 2017-04-30 15:42:20,696 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher:
> >>>>>>> Ignoring fetched records for devlast-changelog-2 at offset 2962649
> >>>> since
> >>>>>>> the current position is 2963379
> >>>>>>>
> >>>>>>> After a LONG time, the only messages in the log are these:
> >>>>>>>
> >>>>>>> 2017-04-30 16:46:33,324 [kafka-coordinator-heartbeat-thread | sa]
> >>>> DEBUG
> >>>>>>> o.a.k.c.c.i.AbstractCoordinator: Sending Heartbeat request for
> group
> >>>> sa
> >>>>>> to
> >>>>>>> coordinator ip-x-x-x-x.ec2.internal:9092 (id: 2147483646 rack:
> null)
> >>>>>>> 2017-04-30 16:46:33,425 [kafka-coordinator-heartbeat-thread | sa]
> >>>> DEBUG
> >>>>>>> o.a.k.c.c.i.AbstractCoordinator: Received successful Heartbeat
> >>>> response
> >>>>>> for
> >>>>>>> group same
> >>>>>>>
> >>>>>>> Any idea?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Shimi
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: Kafka-streams process stopped processing messages

Posted by Eno Thereska <en...@gmail.com>.
Yeah we’ve seen cases when the session timeout might also need increasing. Could you try upping it to something like 60000ms and let us know how it goes:

>> streamsProps.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);


Thanks
Eno
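
(For context, a minimal sketch of how that override fits into a Streams app of this vintage. The "sa" application id, the "sa-events" topic and the 60000ms value come from this thread; the class name, broker address and the elided topology are illustrative only.)

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class SessionTimeoutSketch {
    public static void main(String[] args) {
        Properties streamsProps = new Properties();
        streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "sa");               // becomes the consumer group id
        streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder address
        // Raise the consumer session timeout; the JoinGroup log lines in this thread show sessionTimeout=10000
        streamsProps.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);

        KStreamBuilder builder = new KStreamBuilder();
        builder.stream("sa-events"); // source topic from the thread; the rest of the topology is elided
        KafkaStreams streams = new KafkaStreams(builder, streamsProps);
        streams.start();
    }
}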

> On May 6, 2017, at 8:35 AM, Shimi Kiviti <sh...@gmail.com> wrote:
> 
> Thanks Eno,
> I already set the receive buffer size to 1MB
> I will also try the producer send buffer
> 
> What about session timeout and heart beat timeout? Do you think it should
> be increased?
> 
> Thanks,
> Shimi
> 
> On Sat, 6 May 2017 at 0:21 Eno Thereska <en...@gmail.com> wrote:
> 
>> Hi Shimi,
>> 
>> I’ve noticed with our benchmarks that on AWS environments with high
>> network latency the network socket buffers often need adjusting. Any chance
>> you could add the following to your streams configuration to change the
>> default socket size bytes to a higher value (at least 1MB) and let us know?
>> 
>> private static final int SOCKET_SIZE_BYTES = 1 * 1024 * 1024; // at least
>> 1MB
>> streamsProps.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
>> streamsProps.put(ProducerConfig.SEND_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
>> 
>> Thanks
>> Eno
>> 
>>> On May 4, 2017, at 3:45 PM, Shimi Kiviti <sh...@gmail.com> wrote:
>>> 
>>> Thanks Eno,
>>> 
>>> We still see problems on our side.
>>> when we run kafka-streams 0.10.1.1 eventually the problem goes away but
>>> with 0.10.2.1 it is not.
>>> We see a lot of the rebalancing messages I wrote before
>>> 
>>> on at least 1 kafka-stream nodes we see disconnection messages like the
>>> following. These messages repeat all the time
>>> 
>>> 2017-05-04 14:25:56,063 [StreamThread-1] INFO
>>> o.a.k.c.c.i.AbstractCoordinator: Discovered coordinator
>>> ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) for group sa.
>>> 2017-05-04 14:25:56,063 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
>>> Initiating connection to node 2147483646 at
>> ip-10-0-91-10.ec2.internal:9092.
>>> 2017-05-04 14:25:56,091 [StreamThread-1] INFO
>>> o.a.k.c.c.i.AbstractCoordinator: (Re-)joining group sa
>>> 2017-05-04 14:25:56,093 [StreamThread-1] DEBUG
>>> o.a.k.s.p.i.StreamPartitionAssignor: stream-thread [StreamThread-1] found
>>> [sa-events] topics possibly matching regex
>>> 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG o.a.k.s.p.TopologyBuilder:
>>> stream-thread [StreamThread-1] updating builder with
>>> SubscriptionUpdates{updatedTopicSubscriptions=[sa-events]} topic(s) with
>> po
>>> ssible matching regex subscription(s)
>>> 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG
>>> o.a.k.c.c.i.AbstractCoordinator: Sending JoinGroup ((type:
>>> JoinGroupRequest, groupId=sa, sessionTimeout=10000,
>>> rebalanceTimeout=2147483647, memb
>>> erId=, protocolType=consumer,
>>> 
>> groupProtocols=org.apache.kafka.common.requests.JoinGroupRequest$ProtocolMetadata@2f894d9b
>> ))
>>> to coordinator ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack:
>> null)
>>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.n.Selector:
>> Created
>>> socket with SO_RCVBUF = 1048576, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to
>> node
>>> 2147483646
>>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
>>> Completed connection to node 2147483646.  Fetching API versions.
>>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
>>> Initiating API versions fetch from node 2147483646.
>>> 2017-05-04 14:25:56,104 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
>>> Recorded API versions for node 2147483646: (Produce(0): 0 to 2 [usable:
>> 2],
>>> Fetch(1): 0 to 3 [usable: 3], Offsets(2): 0 to 1 [usable: 1],
>>> Metadata(3): 0 to 2 [usable: 2], LeaderAndIsr(4): 0 [usable: 0],
>>> StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 2 [usable: 2],
>>> ControlledShutdown(7): 1 [usable: 1], OffsetCommit(8): 0 to 2 [usable:
>>> 2], OffsetFetch(9): 0 to 1 [usable: 1], GroupCoordinator(10): 0 [usable:
>>> 0], JoinGroup(11): 0 to 1 [usable: 1], Heartbeat(12): 0 [usable: 0],
>>> LeaveGroup(13): 0 [usable: 0], SyncGroup(14): 0 [usable: 0], Desc
>>> ribeGroups(15): 0 [usable: 0], ListGroups(16): 0 [usable: 0],
>>> SaslHandshake(17): 0 [usable: 0], ApiVersions(18): 0 [usable: 0],
>>> CreateTopics(19): 0 [usable: 0], DeleteTopics(20): 0 [usable: 0])
>>> 2017-05-04 14:29:44,800 [kafka-producer-network-thread |
>>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
>>> o.a.k.c.NetworkClient: Node -2 disconnected.
>>> 2017-05-04 14:29:44,801 [kafka-producer-network-thread |
>>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
>>> o.a.k.c.NetworkClient: Sending metadata request (type=MetadataR
>>> equest, topics=) to node 1
>>> 2017-05-04 14:29:44,801 [kafka-producer-network-thread |
>>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
>>> o.a.k.c.NetworkClient: Node -1 disconnected.
>>> 2017-05-04 14:29:44,802 [kafka-producer-network-thread |
>>> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
>>> o.a.k.c.Metadata: Updated cluster metadata version 4 to Cluster
>>> (id = JsVqjH3tS4CIcqpd2jkogA, nodes = [ip-10-0-91-10.ec2.internal:9092
>> (id:
>>> 1 rack: null), ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null)],
>>> partitions = [])
>>> 2017-05-04 14:30:56,062 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
>>> Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
>>> 2017-05-04 14:30:56,073 [StreamThread-1] DEBUG o.a.k.c.Metadata: Updated
>>> cluster metadata version 7 to Cluster(id = JsVqjH3tS4CIcqpd2jkogA, nodes
>> =
>>> [ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null), ip-10
>>> -0-91-10.ec2.internal:9092 (id: 1 rack: null)], partitions =
>>> [Partition(topic = sa-events, partition = 0, leader = 1, replicas =
>> [1,2],
>>> isr = [2,1]), Partition(topic = sa-events, partition = 1, lea
>>> der = 2, replicas = [1,2], isr = [2,1]), Partition(topic = sa-events,
>>> partition = 2, leader = 1, replicas = [1,2], isr = [2,1])])
>>> 2017-05-04 14:31:06,085 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
>>> Disconnecting from node 2147483646 due to request timeout.
>>> 2017-05-04 14:31:06,086 [StreamThread-1] DEBUG
>>> o.a.k.c.c.i.ConsumerNetworkClient: Cancelled JOIN_GROUP request
>>> 
>> {api_key=11,api_version=1,correlation_id=16,client_id=sa-5788b5a5-aadc-4276-916f
>>> -1640008c17da-StreamThread-1-consumer} with correlation id 16 due to node
>>> 2147483646 being disconnected
>>> 2017-05-04 14:31:06,086 [StreamThread-1] INFO
>>> o.a.k.c.c.i.AbstractCoordinator: Marking the coordinator
>>> ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) dead for
>> group
>>> sa
>>> 2017-05-04 14:31:06,195 [StreamThread-1] DEBUG
>>> o.a.k.c.c.i.AbstractCoordinator: Sending GroupCoordinator request for
>> group
>>> sa to broker ip-10-0-91-10.ec2.internal:9092 (id: 1 rack: null)
>>> 2017-05-04 14:31:06,196 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
>>> Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
>>> 2017-05-04 14:31:06,200 [StreamThread-1] DEBUG
>>> o.a.k.c.c.i.AbstractCoordinator: Received GroupCoordinator response
>>> ClientResponse(receivedTimeMs=1493908266200, latencyMs=5,
>>> disconnected=false, requestHeader=
>>> 
>> {api_key=10,api_version=0,correlation_id=19,client_id=sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-consumer},
>>> responseBody={error_code=0,coordinator={node_id=1,host=ip-10-0-91-10.ec
>>> 2.internal,port=9092}}) for group sa
>>> 
>>> 
>>> On Mon, May 1, 2017 at 4:19 PM, Eno Thereska <en...@gmail.com>
>> wrote:
>>> 
>>>> Hi Shimi,
>>>> 
>>>> 0.10.2.1 contains a number of fixes that should make the out of box
>>>> experience better, including resiliency under broker failures and better
>>>> exception handling. If you ever get back to it, and if the problem
>> happens
>>>> again, please do send us the logs and we'll happily have a look.
>>>> 
>>>> Thanks
>>>> Eno
>>>>> On 1 May 2017, at 12:05, Shimi Kiviti <sh...@gmail.com> wrote:
>>>>> 
>>>>> Hi Eno,
>>>>> I am afraid I played too much with the configuration to make this
>>>>> productive investigation :(
>>>>> 
>>>>> This is a QA environment which includes 2 kafka instances and 3
>> zookeeper
>>>>> instances in AWS. There are only 3 partition for this topic.
>>>>> Kafka broker and kafka-stream are version 0.10.1.1
>>>>> Our kafka-stream app run on docker using kubernetes.
>>>>> I played around with with 1 to 3  kafka-stream processes, but I got the
>>>>> same results. It is too easy to scale with kubernetes :)
>>>>> Since there are only 3 partitions, I didn't start more then 3
>> instances.
>>>>> 
>>>>> I was too quick to upgraded only the kafka-stream app to 0.10.2.1 with
>>>> hope
>>>>> that it will solve the problem, It didn't.
>>>>> The log I sent before are from this version.
>>>>> 
>>>>> I did notice "unknown" offset for the main topic with kafka-stream
>>>> version
>>>>> 0.10.2.1
>>>>> $ ./bin/kafka-consumer-groups.sh   --bootstrap-server localhost:9092
>>>>> --describe --group sa
>>>>> GROUP                          TOPIC                          PARTITION
>>>>> CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
>>>>> sa             sa-events                 0          842199
>>>>> 842199          0
>>>>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
>>>> 10.0.10.9
>>>>> sa             sa-events                 1          1078428
>>>>> 1078428         0
>>>>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
>>>> 10.0.10.9
>>>>> sa             sa-events                 2          unknown
>>>>> 26093910        unknown
>>>>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
>>>> 10.0.10.9
>>>>> 
>>>>> After that I downgraded the kafka-stream app back to version 0.10.1.1
>>>>> After a LONG startup time (more than an hour) where the status of the
>>>> group
>>>>> was rebalancing, all the 3 processes started processing messages again.
>>>>> 
>>>>> This all thing started after we hit a bug in our code (NPE) that
>> crashed
>>>>> the stream processing thread.
>>>>> So now after 4 days, everything is back to normal.
>>>>> This worries me since it can happen again
>>>>> 
>>>>> 
>>>>> On Mon, May 1, 2017 at 11:45 AM, Eno Thereska <en...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi Shimi,
>>>>>> 
>>>>>> Could you provide more info on your setup? How many kafka streams
>>>>>> processes do you have and from how many partitions are they consuming
>>>> from.
>>>>>> If you have more processes than partitions some of the processes will
>> be
>>>>>> idle and won’t do anything.
>>>>>> 
>>>>>> Eno
>>>>>>> On Apr 30, 2017, at 5:58 PM, Shimi Kiviti <sh...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Everyone,
>>>>>>> 
>>>>>>> I have a problem and I hope one of you can help me figuring it out.
>>>>>>> One of our kafka-streams processes stopped processing messages
>>>>>>> 
>>>>>>> When I turn on debug log I see lots of these messages:
>>>>>>> 
>>>>>>> 2017-04-30 15:42:20,228 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher:
>>>>>> Sending
>>>>>>> fetch for partitions [devlast-changelog-2] to broker ip-x-x-x-x
>>>>>>> .ec2.internal:9092 (id: 1 rack: null)
>>>>>>> 2017-04-30 15:42:20,696 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher:
>>>>>>> Ignoring fetched records for devlast-changelog-2 at offset 2962649
>>>> since
>>>>>>> the current position is 2963379
>>>>>>> 
>>>>>>> After a LONG time, the only messages in the log are these:
>>>>>>> 
>>>>>>> 2017-04-30 16:46:33,324 [kafka-coordinator-heartbeat-thread | sa]
>>>> DEBUG
>>>>>>> o.a.k.c.c.i.AbstractCoordinator: Sending Heartbeat request for group
>>>> sa
>>>>>> to
>>>>>>> coordinator ip-x-x-x-x.ec2.internal:9092 (id: 2147483646 rack: null)
>>>>>>> 2017-04-30 16:46:33,425 [kafka-coordinator-heartbeat-thread | sa]
>>>> DEBUG
>>>>>>> o.a.k.c.c.i.AbstractCoordinator: Received successful Heartbeat
>>>> response
>>>>>> for
>>>>>>> group same
>>>>>>> 
>>>>>>> Any idea?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Shimi
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 


Re: Kafka-streams process stopped processing messages

Posted by Shimi Kiviti <sh...@gmail.com>.
Thanks Eno,
I already set the receive buffer size to 1MB.
I will also try the producer send buffer.

What about the session timeout and heartbeat timeout? Do you think they should
be increased?

Thanks,
Shimi
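
(For reference, the two consumer settings are usually tuned together: heartbeat.interval.ms must stay below session.timeout.ms, and the Kafka docs suggest keeping it at no more than about a third of it. A hedged sketch, reusing the streamsProps convention from this thread; the heartbeat value is illustrative only, and the 60000ms value matches Eno's suggestion.)

streamsProps.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);    // value suggested in this thread
streamsProps.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 20000); // no more than ~1/3 of the session timeout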

On Sat, 6 May 2017 at 0:21 Eno Thereska <en...@gmail.com> wrote:

> Hi Shimi,
>
> I’ve noticed with our benchmarks that on AWS environments with high
> network latency the network socket buffers often need adjusting. Any chance
> you could add the following to your streams configuration to change the
> default socket size bytes to a higher value (at least 1MB) and let us know?
>
> private static final int SOCKET_SIZE_BYTES = 1 * 1024 * 1024; // at least
> 1MB
> streamsProps.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
> streamsProps.put(ProducerConfig.SEND_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
>
> Thanks
> Eno
>
> > On May 4, 2017, at 3:45 PM, Shimi Kiviti <sh...@gmail.com> wrote:
> >
> > Thanks Eno,
> >
> > We still see problems on our side.
> > when we run kafka-streams 0.10.1.1 eventually the problem goes away but
> > with 0.10.2.1 it is not.
> > We see a lot of the rebalancing messages I wrote before
> >
> > on at least 1 kafka-stream nodes we see disconnection messages like the
> > following. These messages repeat all the time
> >
> > 2017-05-04 14:25:56,063 [StreamThread-1] INFO
> > o.a.k.c.c.i.AbstractCoordinator: Discovered coordinator
> > ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) for group sa.
> > 2017-05-04 14:25:56,063 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> > Initiating connection to node 2147483646 at
> ip-10-0-91-10.ec2.internal:9092.
> > 2017-05-04 14:25:56,091 [StreamThread-1] INFO
> > o.a.k.c.c.i.AbstractCoordinator: (Re-)joining group sa
> > 2017-05-04 14:25:56,093 [StreamThread-1] DEBUG
> > o.a.k.s.p.i.StreamPartitionAssignor: stream-thread [StreamThread-1] found
> > [sa-events] topics possibly matching regex
> > 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG o.a.k.s.p.TopologyBuilder:
> > stream-thread [StreamThread-1] updating builder with
> > SubscriptionUpdates{updatedTopicSubscriptions=[sa-events]} topic(s) with
> po
> > ssible matching regex subscription(s)
> > 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG
> > o.a.k.c.c.i.AbstractCoordinator: Sending JoinGroup ((type:
> > JoinGroupRequest, groupId=sa, sessionTimeout=10000,
> > rebalanceTimeout=2147483647, memb
> > erId=, protocolType=consumer,
> >
> groupProtocols=org.apache.kafka.common.requests.JoinGroupRequest$ProtocolMetadata@2f894d9b
> ))
> > to coordinator ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack:
> null)
> > 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.n.Selector:
> Created
> > socket with SO_RCVBUF = 1048576, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to
> node
> > 2147483646
> > 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> > Completed connection to node 2147483646.  Fetching API versions.
> > 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> > Initiating API versions fetch from node 2147483646.
> > 2017-05-04 14:25:56,104 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> > Recorded API versions for node 2147483646: (Produce(0): 0 to 2 [usable:
> 2],
> > Fetch(1): 0 to 3 [usable: 3], Offsets(2): 0 to 1 [usable: 1],
> > Metadata(3): 0 to 2 [usable: 2], LeaderAndIsr(4): 0 [usable: 0],
> > StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 2 [usable: 2],
> > ControlledShutdown(7): 1 [usable: 1], OffsetCommit(8): 0 to 2 [usable:
> > 2], OffsetFetch(9): 0 to 1 [usable: 1], GroupCoordinator(10): 0 [usable:
> > 0], JoinGroup(11): 0 to 1 [usable: 1], Heartbeat(12): 0 [usable: 0],
> > LeaveGroup(13): 0 [usable: 0], SyncGroup(14): 0 [usable: 0], Desc
> > ribeGroups(15): 0 [usable: 0], ListGroups(16): 0 [usable: 0],
> > SaslHandshake(17): 0 [usable: 0], ApiVersions(18): 0 [usable: 0],
> > CreateTopics(19): 0 [usable: 0], DeleteTopics(20): 0 [usable: 0])
> > 2017-05-04 14:29:44,800 [kafka-producer-network-thread |
> > sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> > o.a.k.c.NetworkClient: Node -2 disconnected.
> > 2017-05-04 14:29:44,801 [kafka-producer-network-thread |
> > sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> > o.a.k.c.NetworkClient: Sending metadata request (type=MetadataR
> > equest, topics=) to node 1
> > 2017-05-04 14:29:44,801 [kafka-producer-network-thread |
> > sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> > o.a.k.c.NetworkClient: Node -1 disconnected.
> > 2017-05-04 14:29:44,802 [kafka-producer-network-thread |
> > sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> > o.a.k.c.Metadata: Updated cluster metadata version 4 to Cluster
> > (id = JsVqjH3tS4CIcqpd2jkogA, nodes = [ip-10-0-91-10.ec2.internal:9092
> (id:
> > 1 rack: null), ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null)],
> > partitions = [])
> > 2017-05-04 14:30:56,062 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> > Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
> > 2017-05-04 14:30:56,073 [StreamThread-1] DEBUG o.a.k.c.Metadata: Updated
> > cluster metadata version 7 to Cluster(id = JsVqjH3tS4CIcqpd2jkogA, nodes
> =
> > [ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null), ip-10
> > -0-91-10.ec2.internal:9092 (id: 1 rack: null)], partitions =
> > [Partition(topic = sa-events, partition = 0, leader = 1, replicas =
> [1,2],
> > isr = [2,1]), Partition(topic = sa-events, partition = 1, lea
> > der = 2, replicas = [1,2], isr = [2,1]), Partition(topic = sa-events,
> > partition = 2, leader = 1, replicas = [1,2], isr = [2,1])])
> > 2017-05-04 14:31:06,085 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> > Disconnecting from node 2147483646 due to request timeout.
> > 2017-05-04 14:31:06,086 [StreamThread-1] DEBUG
> > o.a.k.c.c.i.ConsumerNetworkClient: Cancelled JOIN_GROUP request
> >
> {api_key=11,api_version=1,correlation_id=16,client_id=sa-5788b5a5-aadc-4276-916f
> > -1640008c17da-StreamThread-1-consumer} with correlation id 16 due to node
> > 2147483646 being disconnected
> > 2017-05-04 14:31:06,086 [StreamThread-1] INFO
> > o.a.k.c.c.i.AbstractCoordinator: Marking the coordinator
> > ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) dead for
> group
> > sa
> > 2017-05-04 14:31:06,195 [StreamThread-1] DEBUG
> > o.a.k.c.c.i.AbstractCoordinator: Sending GroupCoordinator request for
> group
> > sa to broker ip-10-0-91-10.ec2.internal:9092 (id: 1 rack: null)
> > 2017-05-04 14:31:06,196 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> > Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
> > 2017-05-04 14:31:06,200 [StreamThread-1] DEBUG
> > o.a.k.c.c.i.AbstractCoordinator: Received GroupCoordinator response
> > ClientResponse(receivedTimeMs=1493908266200, latencyMs=5,
> > disconnected=false, requestHeader=
> >
> {api_key=10,api_version=0,correlation_id=19,client_id=sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-consumer},
> > responseBody={error_code=0,coordinator={node_id=1,host=ip-10-0-91-10.ec
> > 2.internal,port=9092}}) for group sa
> >
> >
> > On Mon, May 1, 2017 at 4:19 PM, Eno Thereska <en...@gmail.com>
> wrote:
> >
> >> Hi Shimi,
> >>
> >> 0.10.2.1 contains a number of fixes that should make the out of box
> >> experience better, including resiliency under broker failures and better
> >> exception handling. If you ever get back to it, and if the problem
> happens
> >> again, please do send us the logs and we'll happily have a look.
> >>
> >> Thanks
> >> Eno
> >>> On 1 May 2017, at 12:05, Shimi Kiviti <sh...@gmail.com> wrote:
> >>>
> >>> Hi Eno,
> >>> I am afraid I played too much with the configuration to make this
> >>> productive investigation :(
> >>>
> >>> This is a QA environment which includes 2 kafka instances and 3
> zookeeper
> >>> instances in AWS. There are only 3 partition for this topic.
> >>> Kafka broker and kafka-stream are version 0.10.1.1
> >>> Our kafka-stream app run on docker using kubernetes.
> >>> I played around with with 1 to 3  kafka-stream processes, but I got the
> >>> same results. It is too easy to scale with kubernetes :)
> >>> Since there are only 3 partitions, I didn't start more then 3
> instances.
> >>>
> >>> I was too quick to upgraded only the kafka-stream app to 0.10.2.1 with
> >> hope
> >>> that it will solve the problem, It didn't.
> >>> The log I sent before are from this version.
> >>>
> >>> I did notice "unknown" offset for the main topic with kafka-stream
> >> version
> >>> 0.10.2.1
> >>> $ ./bin/kafka-consumer-groups.sh   --bootstrap-server localhost:9092
> >>> --describe --group sa
> >>> GROUP                          TOPIC                          PARTITION
> >>> CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
> >>> sa             sa-events                 0          842199
> >>> 842199          0
> >>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
> >> 10.0.10.9
> >>> sa             sa-events                 1          1078428
> >>> 1078428         0
> >>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
> >> 10.0.10.9
> >>> sa             sa-events                 2          unknown
> >>> 26093910        unknown
> >>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
> >> 10.0.10.9
> >>>
> >>> After that I downgraded the kafka-stream app back to version 0.10.1.1
> >>> After a LONG startup time (more than an hour) where the status of the
> >> group
> >>> was rebalancing, all the 3 processes started processing messages again.
> >>>
> >>> This all thing started after we hit a bug in our code (NPE) that
> crashed
> >>> the stream processing thread.
> >>> So now after 4 days, everything is back to normal.
> >>> This worries me since it can happen again
> >>>
> >>>
> >>> On Mon, May 1, 2017 at 11:45 AM, Eno Thereska <en...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Shimi,
> >>>>
> >>>> Could you provide more info on your setup? How many kafka streams
> >>>> processes do you have and from how many partitions are they consuming
> >> from.
> >>>> If you have more processes than partitions some of the processes will
> be
> >>>> idle and won’t do anything.
> >>>>
> >>>> Eno
> >>>>> On Apr 30, 2017, at 5:58 PM, Shimi Kiviti <sh...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Everyone,
> >>>>>
> >>>>> I have a problem and I hope one of you can help me figuring it out.
> >>>>> One of our kafka-streams processes stopped processing messages
> >>>>>
> >>>>> When I turn on debug log I see lots of these messages:
> >>>>>
> >>>>> 2017-04-30 15:42:20,228 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher:
> >>>> Sending
> >>>>> fetch for partitions [devlast-changelog-2] to broker ip-x-x-x-x
> >>>>> .ec2.internal:9092 (id: 1 rack: null)
> >>>>> 2017-04-30 15:42:20,696 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher:
> >>>>> Ignoring fetched records for devlast-changelog-2 at offset 2962649
> >> since
> >>>>> the current position is 2963379
> >>>>>
> >>>>> After a LONG time, the only messages in the log are these:
> >>>>>
> >>>>> 2017-04-30 16:46:33,324 [kafka-coordinator-heartbeat-thread | sa]
> >> DEBUG
> >>>>> o.a.k.c.c.i.AbstractCoordinator: Sending Heartbeat request for group
> >> sa
> >>>> to
> >>>>> coordinator ip-x-x-x-x.ec2.internal:9092 (id: 2147483646 rack: null)
> >>>>> 2017-04-30 16:46:33,425 [kafka-coordinator-heartbeat-thread | sa]
> >> DEBUG
> >>>>> o.a.k.c.c.i.AbstractCoordinator: Received successful Heartbeat
> >> response
> >>>> for
> >>>>> group same
> >>>>>
> >>>>> Any idea?
> >>>>>
> >>>>> Thanks,
> >>>>> Shimi
> >>>>
> >>>>
> >>
> >>
>
>

Re: Kafka-streams process stopped processing messages

Posted by Eno Thereska <en...@gmail.com>.
Hi Shimi,

I’ve noticed with our benchmarks that on AWS environments with high network latency the network socket buffers often need adjusting. Any chance you could add the following to your streams configuration to change the default socket size bytes to a higher value (at least 1MB) and let us know?

private static final int SOCKET_SIZE_BYTES = 1 * 1024 * 1024; // at least 1MB
streamsProps.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
streamsProps.put(ProducerConfig.SEND_BUFFER_CONFIG, SOCKET_SIZE_BYTES);

Thanks
Eno 
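
(A minimal sketch of those buffer settings in context, wrapped in a helper class of our own; the "sa" application id comes from this thread, while the class name and broker address are illustrative only.)

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class SocketBufferSketch {
    private static final int SOCKET_SIZE_BYTES = 1 * 1024 * 1024; // at least 1MB, as suggested above

    public static Properties streamsProps() {
        Properties streamsProps = new Properties();
        streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "sa");               // group id seen in the thread
        streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder address
        // Larger TCP buffers for the embedded consumer (receive path) and producer (send path)
        streamsProps.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
        streamsProps.put(ProducerConfig.SEND_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
        return streamsProps;
    }
}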

> On May 4, 2017, at 3:45 PM, Shimi Kiviti <sh...@gmail.com> wrote:
> 
> Thanks Eno,
> 
> We still see problems on our side.
> when we run kafka-streams 0.10.1.1 eventually the problem goes away but
> with 0.10.2.1 it is not.
> We see a lot of the rebalancing messages I wrote before
> 
> on at least 1 kafka-stream nodes we see disconnection messages like the
> following. These messages repeat all the time
> 
> 2017-05-04 14:25:56,063 [StreamThread-1] INFO
> o.a.k.c.c.i.AbstractCoordinator: Discovered coordinator
> ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) for group sa.
> 2017-05-04 14:25:56,063 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> Initiating connection to node 2147483646 at ip-10-0-91-10.ec2.internal:9092.
> 2017-05-04 14:25:56,091 [StreamThread-1] INFO
> o.a.k.c.c.i.AbstractCoordinator: (Re-)joining group sa
> 2017-05-04 14:25:56,093 [StreamThread-1] DEBUG
> o.a.k.s.p.i.StreamPartitionAssignor: stream-thread [StreamThread-1] found
> [sa-events] topics possibly matching regex
> 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG o.a.k.s.p.TopologyBuilder:
> stream-thread [StreamThread-1] updating builder with
> SubscriptionUpdates{updatedTopicSubscriptions=[sa-events]} topic(s) with po
> ssible matching regex subscription(s)
> 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG
> o.a.k.c.c.i.AbstractCoordinator: Sending JoinGroup ((type:
> JoinGroupRequest, groupId=sa, sessionTimeout=10000,
> rebalanceTimeout=2147483647, memb
> erId=, protocolType=consumer,
> groupProtocols=org.apache.kafka.common.requests.JoinGroupRequest$ProtocolMetadata@2f894d9b))
> to coordinator ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null)
> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.n.Selector: Created
> socket with SO_RCVBUF = 1048576, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node
> 2147483646
> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> Completed connection to node 2147483646.  Fetching API versions.
> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> Initiating API versions fetch from node 2147483646.
> 2017-05-04 14:25:56,104 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> Recorded API versions for node 2147483646: (Produce(0): 0 to 2 [usable: 2],
> Fetch(1): 0 to 3 [usable: 3], Offsets(2): 0 to 1 [usable: 1],
> Metadata(3): 0 to 2 [usable: 2], LeaderAndIsr(4): 0 [usable: 0],
> StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 2 [usable: 2],
> ControlledShutdown(7): 1 [usable: 1], OffsetCommit(8): 0 to 2 [usable:
> 2], OffsetFetch(9): 0 to 1 [usable: 1], GroupCoordinator(10): 0 [usable:
> 0], JoinGroup(11): 0 to 1 [usable: 1], Heartbeat(12): 0 [usable: 0],
> LeaveGroup(13): 0 [usable: 0], SyncGroup(14): 0 [usable: 0], Desc
> ribeGroups(15): 0 [usable: 0], ListGroups(16): 0 [usable: 0],
> SaslHandshake(17): 0 [usable: 0], ApiVersions(18): 0 [usable: 0],
> CreateTopics(19): 0 [usable: 0], DeleteTopics(20): 0 [usable: 0])
> 2017-05-04 14:29:44,800 [kafka-producer-network-thread |
> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> o.a.k.c.NetworkClient: Node -2 disconnected.
> 2017-05-04 14:29:44,801 [kafka-producer-network-thread |
> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> o.a.k.c.NetworkClient: Sending metadata request (type=MetadataR
> equest, topics=) to node 1
> 2017-05-04 14:29:44,801 [kafka-producer-network-thread |
> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> o.a.k.c.NetworkClient: Node -1 disconnected.
> 2017-05-04 14:29:44,802 [kafka-producer-network-thread |
> sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
> o.a.k.c.Metadata: Updated cluster metadata version 4 to Cluster
> (id = JsVqjH3tS4CIcqpd2jkogA, nodes = [ip-10-0-91-10.ec2.internal:9092 (id:
> 1 rack: null), ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null)],
> partitions = [])
> 2017-05-04 14:30:56,062 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
> 2017-05-04 14:30:56,073 [StreamThread-1] DEBUG o.a.k.c.Metadata: Updated
> cluster metadata version 7 to Cluster(id = JsVqjH3tS4CIcqpd2jkogA, nodes =
> [ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null), ip-10
> -0-91-10.ec2.internal:9092 (id: 1 rack: null)], partitions =
> [Partition(topic = sa-events, partition = 0, leader = 1, replicas = [1,2],
> isr = [2,1]), Partition(topic = sa-events, partition = 1, lea
> der = 2, replicas = [1,2], isr = [2,1]), Partition(topic = sa-events,
> partition = 2, leader = 1, replicas = [1,2], isr = [2,1])])
> 2017-05-04 14:31:06,085 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> Disconnecting from node 2147483646 due to request timeout.
> 2017-05-04 14:31:06,086 [StreamThread-1] DEBUG
> o.a.k.c.c.i.ConsumerNetworkClient: Cancelled JOIN_GROUP request
> {api_key=11,api_version=1,correlation_id=16,client_id=sa-5788b5a5-aadc-4276-916f
> -1640008c17da-StreamThread-1-consumer} with correlation id 16 due to node
> 2147483646 being disconnected
> 2017-05-04 14:31:06,086 [StreamThread-1] INFO
> o.a.k.c.c.i.AbstractCoordinator: Marking the coordinator
> ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) dead for group
> sa
> 2017-05-04 14:31:06,195 [StreamThread-1] DEBUG
> o.a.k.c.c.i.AbstractCoordinator: Sending GroupCoordinator request for group
> sa to broker ip-10-0-91-10.ec2.internal:9092 (id: 1 rack: null)
> 2017-05-04 14:31:06,196 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
> Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
> 2017-05-04 14:31:06,200 [StreamThread-1] DEBUG
> o.a.k.c.c.i.AbstractCoordinator: Received GroupCoordinator response
> ClientResponse(receivedTimeMs=1493908266200, latencyMs=5,
> disconnected=false, requestHeader=
> {api_key=10,api_version=0,correlation_id=19,client_id=sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-consumer},
> responseBody={error_code=0,coordinator={node_id=1,host=ip-10-0-91-10.ec
> 2.internal,port=9092}}) for group sa
> 
> 
> On Mon, May 1, 2017 at 4:19 PM, Eno Thereska <en...@gmail.com> wrote:
> 
>> Hi Shimi,
>> 
>> 0.10.2.1 contains a number of fixes that should make the out of box
>> experience better, including resiliency under broker failures and better
>> exception handling. If you ever get back to it, and if the problem happens
>> again, please do send us the logs and we'll happily have a look.
>> 
>> Thanks
>> Eno
>>> On 1 May 2017, at 12:05, Shimi Kiviti <sh...@gmail.com> wrote:
>>> 
>>> Hi Eno,
>>> I am afraid I played too much with the configuration to make this
>>> productive investigation :(
>>> 
>>> This is a QA environment which includes 2 kafka instances and 3 zookeeper
>>> instances in AWS. There are only 3 partition for this topic.
>>> Kafka broker and kafka-stream are version 0.10.1.1
>>> Our kafka-stream app run on docker using kubernetes.
>>> I played around with with 1 to 3  kafka-stream processes, but I got the
>>> same results. It is too easy to scale with kubernetes :)
>>> Since there are only 3 partitions, I didn't start more then 3 instances.
>>> 
>>> I was too quick to upgraded only the kafka-stream app to 0.10.2.1 with
>> hope
>>> that it will solve the problem, It didn't.
>>> The log I sent before are from this version.
>>> 
>>> I did notice "unknown" offset for the main topic with kafka-stream
>> version
>>> 0.10.2.1
>>> $ ./bin/kafka-consumer-groups.sh   --bootstrap-server localhost:9092
>>> --describe --group sa
>>> GROUP                          TOPIC                          PARTITION
>>> CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
>>> sa             sa-events                 0          842199
>>> 842199          0
>>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
>> 10.0.10.9
>>> sa             sa-events                 1          1078428
>>> 1078428         0
>>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
>> 10.0.10.9
>>> sa             sa-events                 2          unknown
>>> 26093910        unknown
>>> sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
>> 10.0.10.9
>>> 
>>> After that I downgraded the kafka-stream app back to version 0.10.1.1
>>> After a LONG startup time (more than an hour) where the status of the
>> group
>>> was rebalancing, all the 3 processes started processing messages again.
>>> 
>>> This all thing started after we hit a bug in our code (NPE) that crashed
>>> the stream processing thread.
>>> So now after 4 days, everything is back to normal.
>>> This worries me since it can happen again
>>> 
>>> 
>>> On Mon, May 1, 2017 at 11:45 AM, Eno Thereska <en...@gmail.com>
>>> wrote:
>>> 
>>>> Hi Shimi,
>>>> 
>>>> Could you provide more info on your setup? How many kafka streams
>>>> processes do you have and from how many partitions are they consuming
>> from.
>>>> If you have more processes than partitions some of the processes will be
>>>> idle and won’t do anything.
>>>> 
>>>> Eno
>>>>> On Apr 30, 2017, at 5:58 PM, Shimi Kiviti <sh...@gmail.com> wrote:
>>>>> 
>>>>> Hi Everyone,
>>>>> 
>>>>> I have a problem and I hope one of you can help me figuring it out.
>>>>> One of our kafka-streams processes stopped processing messages
>>>>> 
>>>>> When I turn on debug log I see lots of these messages:
>>>>> 
>>>>> 2017-04-30 15:42:20,228 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher:
>>>> Sending
>>>>> fetch for partitions [devlast-changelog-2] to broker ip-x-x-x-x
>>>>> .ec2.internal:9092 (id: 1 rack: null)
>>>>> 2017-04-30 15:42:20,696 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher:
>>>>> Ignoring fetched records for devlast-changelog-2 at offset 2962649
>> since
>>>>> the current position is 2963379
>>>>> 
>>>>> After a LONG time, the only messages in the log are these:
>>>>> 
>>>>> 2017-04-30 16:46:33,324 [kafka-coordinator-heartbeat-thread | sa]
>> DEBUG
>>>>> o.a.k.c.c.i.AbstractCoordinator: Sending Heartbeat request for group
>> sa
>>>> to
>>>>> coordinator ip-x-x-x-x.ec2.internal:9092 (id: 2147483646 rack: null)
>>>>> 2017-04-30 16:46:33,425 [kafka-coordinator-heartbeat-thread | sa]
>> DEBUG
>>>>> o.a.k.c.c.i.AbstractCoordinator: Received successful Heartbeat
>> response
>>>> for
>>>>> group same
>>>>> 
>>>>> Any idea?
>>>>> 
>>>>> Thanks,
>>>>> Shimi
>>>> 
>>>> 
>> 
>> 


Re: Kafka-streams process stopped processing messages

Posted by Shimi Kiviti <sh...@gmail.com>.
Thanks Eno,

We still see problems on our side.
When we run kafka-streams 0.10.1.1 the problem eventually goes away, but
with 0.10.2.1 it does not.
We see a lot of the rebalancing messages I wrote about before.

On at least one kafka-stream node we see disconnection messages like the
following. These messages repeat all the time:

2017-05-04 14:25:56,063 [StreamThread-1] INFO
o.a.k.c.c.i.AbstractCoordinator: Discovered coordinator
ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) for group sa.
2017-05-04 14:25:56,063 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Initiating connection to node 2147483646 at ip-10-0-91-10.ec2.internal:9092.
2017-05-04 14:25:56,091 [StreamThread-1] INFO
o.a.k.c.c.i.AbstractCoordinator: (Re-)joining group sa
2017-05-04 14:25:56,093 [StreamThread-1] DEBUG
o.a.k.s.p.i.StreamPartitionAssignor: stream-thread [StreamThread-1] found
[sa-events] topics possibly matching regex
2017-05-04 14:25:56,096 [StreamThread-1] DEBUG o.a.k.s.p.TopologyBuilder:
stream-thread [StreamThread-1] updating builder with
SubscriptionUpdates{updatedTopicSubscriptions=[sa-events]} topic(s) with po
ssible matching regex subscription(s)
2017-05-04 14:25:56,096 [StreamThread-1] DEBUG
o.a.k.c.c.i.AbstractCoordinator: Sending JoinGroup ((type:
JoinGroupRequest, groupId=sa, sessionTimeout=10000,
rebalanceTimeout=2147483647, memb
erId=, protocolType=consumer,
groupProtocols=org.apache.kafka.common.requests.JoinGroupRequest$ProtocolMetadata@2f894d9b))
to coordinator ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null)
2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.n.Selector: Created
socket with SO_RCVBUF = 1048576, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node
2147483646
2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Completed connection to node 2147483646.  Fetching API versions.
2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Initiating API versions fetch from node 2147483646.
2017-05-04 14:25:56,104 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Recorded API versions for node 2147483646: (Produce(0): 0 to 2 [usable: 2],
Fetch(1): 0 to 3 [usable: 3], Offsets(2): 0 to 1 [usable: 1],
 Metadata(3): 0 to 2 [usable: 2], LeaderAndIsr(4): 0 [usable: 0],
StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 2 [usable: 2],
ControlledShutdown(7): 1 [usable: 1], OffsetCommit(8): 0 to 2 [usable:
2], OffsetFetch(9): 0 to 1 [usable: 1], GroupCoordinator(10): 0 [usable:
0], JoinGroup(11): 0 to 1 [usable: 1], Heartbeat(12): 0 [usable: 0],
LeaveGroup(13): 0 [usable: 0], SyncGroup(14): 0 [usable: 0], Desc
ribeGroups(15): 0 [usable: 0], ListGroups(16): 0 [usable: 0],
SaslHandshake(17): 0 [usable: 0], ApiVersions(18): 0 [usable: 0],
CreateTopics(19): 0 [usable: 0], DeleteTopics(20): 0 [usable: 0])
2017-05-04 14:29:44,800 [kafka-producer-network-thread |
sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
o.a.k.c.NetworkClient: Node -2 disconnected.
2017-05-04 14:29:44,801 [kafka-producer-network-thread |
sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
o.a.k.c.NetworkClient: Sending metadata request (type=MetadataR
equest, topics=) to node 1
2017-05-04 14:29:44,801 [kafka-producer-network-thread |
sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
o.a.k.c.NetworkClient: Node -1 disconnected.
2017-05-04 14:29:44,802 [kafka-producer-network-thread |
sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
o.a.k.c.Metadata: Updated cluster metadata version 4 to Cluster
(id = JsVqjH3tS4CIcqpd2jkogA, nodes = [ip-10-0-91-10.ec2.internal:9092 (id:
1 rack: null), ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null)],
partitions = [])
2017-05-04 14:30:56,062 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
2017-05-04 14:30:56,073 [StreamThread-1] DEBUG o.a.k.c.Metadata: Updated
cluster metadata version 7 to Cluster(id = JsVqjH3tS4CIcqpd2jkogA, nodes =
[ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null), ip-10
-0-91-10.ec2.internal:9092 (id: 1 rack: null)], partitions =
[Partition(topic = sa-events, partition = 0, leader = 1, replicas = [1,2],
isr = [2,1]), Partition(topic = sa-events, partition = 1, lea
der = 2, replicas = [1,2], isr = [2,1]), Partition(topic = sa-events,
partition = 2, leader = 1, replicas = [1,2], isr = [2,1])])
2017-05-04 14:31:06,085 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Disconnecting from node 2147483646 due to request timeout.
2017-05-04 14:31:06,086 [StreamThread-1] DEBUG
o.a.k.c.c.i.ConsumerNetworkClient: Cancelled JOIN_GROUP request
{api_key=11,api_version=1,correlation_id=16,client_id=sa-5788b5a5-aadc-4276-916f
-1640008c17da-StreamThread-1-consumer} with correlation id 16 due to node
2147483646 being disconnected
2017-05-04 14:31:06,086 [StreamThread-1] INFO
o.a.k.c.c.i.AbstractCoordinator: Marking the coordinator
ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) dead for group
sa
2017-05-04 14:31:06,195 [StreamThread-1] DEBUG
o.a.k.c.c.i.AbstractCoordinator: Sending GroupCoordinator request for group
sa to broker ip-10-0-91-10.ec2.internal:9092 (id: 1 rack: null)
2017-05-04 14:31:06,196 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
2017-05-04 14:31:06,200 [StreamThread-1] DEBUG
o.a.k.c.c.i.AbstractCoordinator: Received GroupCoordinator response
ClientResponse(receivedTimeMs=1493908266200, latencyMs=5,
disconnected=false, requestHeader=
{api_key=10,api_version=0,correlation_id=19,client_id=sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-consumer},
responseBody={error_code=0,coordinator={node_id=1,host=ip-10-0-91-10.ec
2.internal,port=9092}}) for group sa
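
For reference, the JoinGroup / heartbeat / request-timeout values in those
log lines map to standard consumer configs. Below is a minimal sketch of
where they would be set on the application side; the class name and values
are purely illustrative (not what we actually run), and whether plain
consumer config names are forwarded to the embedded consumer can differ
between streams versions:

import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class SaTimeoutConfig {

    // Minimal sketch of the client-side timeouts visible in the log lines above.
    // Values are illustrative only, not recommendations.
    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sa"); // the group id "sa" seen in the logs
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "ip-10-0-91-10.ec2.internal:9092");
        props.put("session.timeout.ms", "30000");     // default 10000 shows up as sessionTimeout=10000 above
        props.put("heartbeat.interval.ms", "10000");  // keep well below session.timeout.ms
        props.put("request.timeout.ms", "60000");     // the "Disconnecting ... due to request timeout" path
        return props;
    }
}

(In the log above, the JoinGroup sent at 14:25:56 is only cancelled at
14:31:06, i.e. roughly the consumer's default request timeout later.)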



Re: Kafka-streams process stopped processing messages

Posted by Eno Thereska <en...@gmail.com>.
Hi Shimi,

0.10.2.1 contains a number of fixes that should make the out-of-the-box experience better, including resiliency under broker failures and better exception handling. If you ever get back to it, and the problem happens again, please do send us the logs and we'll happily have a look.

Thanks
Eno


Re: Kafka-streams process stopped processing messages

Posted by Shimi Kiviti <sh...@gmail.com>.
Hi Eno,
I am afraid I played around with the configuration too much for this to be
a productive investigation :(

This is a QA environment which includes 2 kafka instances and 3 zookeeper
instances in AWS. There are only 3 partitions for this topic.
The Kafka brokers and kafka-streams are version 0.10.1.1.
Our kafka-streams app runs on Docker using Kubernetes.
I played around with 1 to 3 kafka-streams processes, but I got the same
results. It is too easy to scale with kubernetes :)
Since there are only 3 partitions, I didn't start more than 3 instances.

I was too quick to upgrade only the kafka-streams app to 0.10.2.1 in the
hope that it would solve the problem. It didn't.
The logs I sent before are from this version.

I did notice an "unknown" offset for the main topic with kafka-streams
version 0.10.2.1:
$ ./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group sa
GROUP  TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG      OWNER
sa     sa-events  0          842199          842199          0        sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/10.0.10.9
sa     sa-events  1          1078428         1078428         0        sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/10.0.10.9
sa     sa-events  2          unknown         26093910        unknown  sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/10.0.10.9

After that I downgraded the kafka-streams app back to version 0.10.1.1.
After a LONG startup time (more than an hour) during which the status of
the group was rebalancing, all 3 processes started processing messages again.

This whole thing started after we hit a bug in our code (an NPE) that
crashed the stream processing thread.
So now, after 4 days, everything is back to normal.
This worries me, since it can happen again.
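
For anyone who hits the same thing: a minimal sketch of how a dying
StreamThread can at least be made visible via the standard
uncaught-exception hook. The class name, topics and handler body below are
placeholders, not our real topology:

import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class SaAppSkeleton {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sa");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        KStreamBuilder builder = new KStreamBuilder();
        builder.stream("sa-events").to("sa-events-copy"); // stand-in for the real topology

        KafkaStreams streams = new KafkaStreams(builder, props);

        // An uncaught exception (e.g. an NPE in user code) kills the StreamThread
        // but leaves the JVM running, so surface it instead of finding out days later.
        streams.setUncaughtExceptionHandler((thread, throwable) ->
                System.err.println("Stream thread " + thread.getName() + " died: " + throwable));

        streams.start();
    }
}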



Re: Kafka-streams process stopped processing messages

Posted by Eno Thereska <en...@gmail.com>.
Hi Shimi,

Could you provide more info on your setup? How many kafka streams processes do you have, and how many partitions are they consuming from? If you have more processes than partitions, some of the processes will be idle and won’t do anything.
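
For example, the partition count that bounds the useful number of processes
can be checked with a throwaway consumer, something like this sketch (the
topic name and class name are placeholders):

import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class PartitionCountCheck {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // One stream task per input partition, so instances/threads beyond
            // this count have nothing to do for a simple topology.
            int partitions = consumer.partitionsFor("your-source-topic").size();
            System.out.println("your-source-topic has " + partitions
                    + " partitions -> at most " + partitions + " busy stream threads in total");
        }
    }
}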

Eno