Posted to users@kafka.apache.org by John Roesler <jo...@confluent.io> on 2019/01/03 21:36:40 UTC

Re: Kafka Streams 2.1.0, 3rd time data loss investigation

Hi Nitay,

I'm sorry to hear of these troubles; it sounds frustrating.

No worries about spamming the list, but it does sound like this might be
worth tracking as a bug report in Jira.
Obviously, we do not expect to lose data when instances come and go,
regardless of the frequency, and we do have tests in place to verify this.
Of course, you might be exercising something that our tests miss.

Thanks for collating the logs. It really helps to understand what's going
on.

Unfortunately, the red coloring didn't make it through the mailing list, so
I'm not sure which specific line you were referencing as demonstrating data
loss.

Just in case you're concerned about the "Updating StandbyTasks failed"
warnings, they should be fine. They indicate that a thread was unable to
re-use a state store that it had been assigned in the past, so instead it
deletes the local data and recreates the whole store from the changelog.

The Streams logs that would be really useful to capture are the lifecycle
ones, like

[2018-12-14 17:34:30,326] INFO stream-thread
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> State transition from RUNNING to PARTITIONS_REVOKED
> (org.apache.kafka.streams.processor.internals.StreamThread)



[2018-12-14 17:34:30,326] INFO stream-client
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980] State
> transition from RUNNING to REBALANCING
> (org.apache.kafka.streams.KafkaStreams)


Also, it would be helpful to see the assignment transitions in line with
the state transitions. Examples:

[2018-12-14 17:34:31,863] DEBUG stream-thread
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> Adding assigned tasks as active: {0_2=[standbyTaskSource1-2],
> 0_5=[standbyTaskSource1-5]}
> (org.apache.kafka.streams.processor.internals.TaskManager)



[2018-12-14 17:34:31,882] DEBUG stream-thread
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> Adding assigned standby tasks {0_4=[standbyTaskSource1-4],
> 0_1=[standbyTaskSource1-1]}
> (org.apache.kafka.streams.processor.internals.TaskManager)



[2018-12-14 17:34:31,885] INFO stream-thread
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> partition assignment took 22 ms.
>    current active tasks: [0_2, 0_5]
>    current standby tasks: [0_1, 0_4]
>    previous active tasks: []
>  (org.apache.kafka.streams.processor.internals.StreamThread)
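
The state transition lines above are logged at INFO, but note that the
TaskManager "Adding assigned ..." lines are DEBUG, so they only show up if
the Streams loggers are turned up, e.g. with a log4j setting along these
lines (just a sketch; adjust it to your own logging setup):

    log4j.logger.org.apache.kafka.streams.processor.internals.TaskManager=DEBUG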



I look forward to hearing back from you (either with more detailed logs or
just a clarification about how the given logs indicate data loss). A report
about potential data loss is very concerning.

Thanks,
-John

On Sun, Dec 30, 2018 at 9:23 AM Nitay Kufert <ni...@ironsrc.com> wrote:

> Hey everybody,
> We have been running Kafka Streams in production for the last year or so -
> we are currently using the latest version (2.1.0) and we have suffered from
> data loss several times before.
> The first time we noticed a data loss, we were able to trace it back to an
> Exception that we were getting in the code - which eventually mapped to an
> open bug that the Kafka team is still working on. So the temporary solution
> was to disable the feature that causes the Exception (in this case - it was
> the "exactly_once" semantics) and move to "at_least_once" semantics + a
> piece of code that handles duplications.
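>
> A minimal sketch of that configuration switch, for reference (these are the
> standard Streams config keys; the application id and broker address are
> illustrative):
>
>     import java.util.Properties;
>     import org.apache.kafka.streams.StreamsConfig;
>
>     public class StreamsProps {
>         public static Properties build() {
>             Properties p = new Properties();
>             p.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");   // illustrative
>             p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // illustrative
>             // Previously StreamsConfig.EXACTLY_ONCE; reverted while the bug
>             // is open, with duplicates handled by application code instead.
>             p.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
>                   StreamsConfig.AT_LEAST_ONCE);
>             return p;
>         }
>     }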
> The 2nd time we noticed a data loss, we traced it back to some kind of
> Exception caused by a lack of memory. To make a long story short - we hit
> the limit for open files on the machines (a lot of files are used by
> RocksDB) - so increasing the RAM of the machines & increasing the number of
> allowed open files on the OS solved this problem.
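>
> (For reference, RocksDB's open-file usage can also be capped from inside
> Streams via the RocksDBConfigSetter hook - a sketch with an illustrative
> limit, as an alternative to only raising the OS limit:)
>
>     import java.util.Map;
>     import org.apache.kafka.streams.state.RocksDBConfigSetter;
>     import org.rocksdb.Options;
>
>     public class BoundedOpenFilesConfig implements RocksDBConfigSetter {
>         @Override
>         public void setConfig(final String storeName, final Options options,
>                               final Map<String, Object> configs) {
>             // Cap file handles per store so many RocksDB stores cannot
>             // exhaust the OS-level open-file limit.
>             options.setMaxOpenFiles(5000); // illustrative value
>         }
>     }
>     // Registered via:
>     // props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG,
>     //           BoundedOpenFilesConfig.class);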
>
> Now, we are facing data loss for the 3rd time - this time it seems to
> happen when our Kafka Streams instances switch (reproducible - it happened
> on 2 separate occasions). Let me explain:
> We are using a 3rd party company called Spotinst - which basically helps
> you save costs by monitoring the Amazon spot market and switching between
> instances when it finds a cheaper one.
>
> The question is, why would it cause data loss?
> These are logs I collected and put together in a single timeline, including
> messages from Kafka Streams instances (from Kibana), Spotinst (the 3rd party
> company) & the data in the compacted topic where the data should have been
> kept (basically it's a compacted topic behind a reduce function - and it
> seems like the aggregated data was lost and the function was invoked as if
> it's the first time it's aggregating anything).
> What you are seeing is that Spotinst saw an increase in CPU and initiated an
> upscale (2 instances), and shortly after that, 2 instances went down
> (downscale) as the load was over. In *RED* you can see the actual data loss
> (as observed from the compacted topic).
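>
> For context, a rough sketch of the kind of topology behind that compacted
> topic (topic names, value type and serdes here are illustrative, not the
> real application):
>
>     import org.apache.kafka.common.serialization.Serdes;
>     import org.apache.kafka.streams.StreamsBuilder;
>     import org.apache.kafka.streams.kstream.Consumed;
>     import org.apache.kafka.streams.kstream.Produced;
>
>     public class ReduceTopology {
>         public static StreamsBuilder build() {
>             StreamsBuilder builder = new StreamsBuilder();
>             builder.stream("input-topic",
>                            Consumed.with(Serdes.String(), Serdes.Double()))
>                    .groupByKey()
>                    // Running SUM per key, backed by a changelog-restored
>                    // state store.
>                    .reduce(Double::sum)
>                    .toStream()
>                    // The compacted output topic where the resets show up.
>                    .to("output-compacted-topic",
>                        Produced.with(Serdes.String(), Serdes.Double()));
>             return builder;
>         }
>     }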
>
> DATE TIME FACILITY INFO
> 12/25/2018 5:17:03 Spotinst Instances Launched - Autoscaling: Policy Name:
> Scaling Policy-Up, Threshold: 70.0, Value Observed: 70.0
> 12/25/2018 5:22:34 Spotinst Got Signal INSTANCE_READY For Instance
> I-instace1
> 12/25/2018 5:22:54 instace2 The following subscribed topics are not
> assigned to any members: [bosThresholds]
> 12/25/2018 5:22:56 instace3 Updating StandbyTasks failed. Deleting
> StandbyTasks stores to recreate from scratch.:
> org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
> Offsets out of range with no configured reset policy for partitions:
> {bos-unique_input_message-changelog-1=7491727}
> 12/25/2018 5:23:21 Spotinst instace4 - INSTANCE_READY
> 12/25/2018 5:23:27 Compacted Topic cdr_44334 -> 1621.72
> 12/25/2018 5:23:33 Spotinst instace4 - Successfully Registered To Load
> Balancer
> 12/25/2018 5:23:33 Spotinst instace1 - Successfully Registered To Load
> Balancer
> 12/25/2018 5:23:34 Compacted Topic cdr_44334-> 0.27
> 12/25/2018 5:23:40 instace2 The following subscribed topics are not
> assigned to any members: [bosThresholds]
> 12/25/2018 5:23:46 instace3
> org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
> Offsets out of range with no configured reset policy for partitions:
> {bos-unique_input_message-changelog-1=7491727}
> 12/25/2018 5:27:05 Spotinst Instances Terminated - Autoscaling: Policy
> Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 6.0
> 12/25/2018 5:27:05 Spotinst instace2 - Was Successfully Deregistered From
> CLASSIC Load Balancer
> 12/25/2018 5:27:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN For
> Instance I-instace2
> 12/25/2018 5:27:23 Spotinst Shutdown Script Completed. InstanceId:
> I-instace2 (View Details)
> 12/25/2018 5:27:27 instace4 The following subscribed topics are not
> assigned to any members: [bosThresholds]
> 12/25/2018 5:33:03 Spotinst Instances I-instace3 Was Successfully
> Deregistered From CLASSIC Load Balancer
> 12/25/2018 5:33:03 Spotinst Instances Terminated - Autoscaling: Policy
> Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 25.0
> 12/25/2018 5:33:23 Spotinst Shutdown Script Completed. InstanceId:
> I-instace3 (View Details)
> 12/25/2018 5:33:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN For
> Instance I-instace3
> 12/25/2018 5:33:28 instace4 The following subscribed topics are not
> assigned to any members: [bosThresholds]
> I can provide much more information but I feel like I already spammed the
> group as it is :/
> Hope you can make sense of what I am writing and maybe shed some light on
> the possible reasons for this strange behavior.
>
> For now, as a temporary solution, we are moving to "on-demand" instances
> (which basically means that machines won't go up and down often), so I hope
> it will solve our problems.
>
> Thanks
> --
> Nitay Kufert
> Backend Team Leader
>
> email nitay.k@ironsrc.com
> mobile +972-54-5480021
> fax +972-77-5448273
> Derech Menachem Begin 121, Tel- Aviv
> ironsrc.com <http://www.ironsrc.com/>
>

Re: Kafka Streams 2.1.0, 3rd time data loss investigation

Posted by Nitay Kufert <ni...@ironsrc.com>.
Added the log file (in the previous mail I saw that the lines were cut).

On Tue, Jan 8, 2019 at 2:39 PM Nitay Kufert <ni...@ironsrc.com> wrote:

> Thanks. it seems promising. Sounds a lot like the problems we are having.
> Do you know when the fix will be released?
>
> BTW,  It just happened to us again, this time when I manually added 2 new
> instances (we had 4 and I increased to 6).
> This is the compacted topic showing the data loss:
>
> CreateTime:1546934915862   cdr_8_23   3010.5875
> CreateTime:1546935024173   cdr_8_23   3012.2675
> CreateTime:1546935182875   cdr_8_23   3013.9475
> CreateTime:1546935394044   cdr_8_23   3016.1575
> CreateTime:1546935401635   cdr_8_23   3017.2775
> CreateTime:1546935466391   cdr_8_23   3019.4875
> CreateTime:1546935510847   cdr_8_23   3021.1675  // Tuesday, January 8, 2019 10:18:30.847 AM GMT+02:00
> CreateTime:1546935559457   cdr_8_23   2.08       // Tuesday, January 8, 2019 10:19:19.457 AM GMT+02:00
> CreateTime:1546935875705   cdr_8_23   6.5
> CreateTime:1546935785418   cdr_8_23   10.66
> CreateTime:1546936046269   cdr_8_23   12.87
> CreateTime:1546936181875   cdr_8_23   15.08
> CreateTime:1546936967967   cdr_8_23   17.29
> CreateTime:1546937000117   cdr_8_23   18.6925
> CreateTime:1546937015276   cdr_8_23   19.8125
> CreateTime:1546937066957   cdr_8_23   21.4925
> CreateTime:1546937253908   cdr_8_23   23.7025
> CreateTime:1546937656179   cdr_8_23   25.9125
> CreateTime:1546937712409   cdr_8_23   28.1225
> CreateTime:1546937796223   cdr_8_23   30.3325
> CreateTime:1546938051609   cdr_8_23   32.5425
> CreateTime:1546938162873   cdr_8_23   34.3425
>
> Looking at the logs, the first thing that is weird is a HUGE burst *[700K~
> INFO messages in a single minute (10:19 GMT+2)]* of:
> [Consumer
> clientId=bos-40d0e3c2-b240-4def-a9e8-52c88b540417-StreamThread-1-restore-consumer,
> groupId=] Unsubscribed all topics or patterns and assigned partitions:
>
> This created a problem reading the logs, so I tried getting to a more
> informative log by hiding it and zooming in on a single instance. This is
> the result:
>
> Time                            level   facility    message
> January 8th 2019, 10:21:01.044 INFO   sonic-bos  [Consumer clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer, groupId=] Resetting offset for partition bos-clrev_cap_status_and_change_log_store-changelog-0 to offset 0.:
> January 8th 2019, 10:21:01.044 INFO   sonic-bos  [Consumer clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer, groupId=] Resetting offset for partition bos-clrev_cap_status_store-changelog-0 to offset 0.:
> January 8th 2019, 10:21:01.044 INFO   sonic-bos  [Consumer clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer, groupId=] Resetting offset for partition bos-cdrev_cap_status_store-changelog-6 to offset 0.:
> January 8th 2019, 10:21:00.950 INFO   sonic-bos  [Consumer clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer, groupId=] Resetting offset for partition bos-clrev_store-changelog-0 to offset 0.:
> January 8th 2019, 10:21:00.950 INFO   sonic-bos  [Consumer clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer, groupId=] Resetting offset for partition bos-cdrev_cap_status_and_change_log_store-changelog-6 to offset 0.:
> January 8th 2019, 10:21:00.533 INFO   sonic-bos  [Consumer clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer, groupId=] Resetting offset for partition bos-cdrev_store-changelog-6 to offset 0.:
> January 8th 2019, 10:21:00.518 INFO   sonic-bos  stream-thread [bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2] State transition from PARTITIONS_ASSIGNED to RUNNING:
> January 8th 2019, 10:21:00.515 INFO   sonic-bos  [Consumer clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer, groupId=] Unsubscribed all topics or patterns and assigned partitions:
> January 8th 2019, 10:20:56.151 INFO   sonic-bos  stream-thread [bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-3] State transition from PARTITIONS_ASSIGNED to RUNNING:
> January 8th 2019, 10:20:56.151 INFO   sonic-bos  stream-client [bos-6e2726b6-489e-4206-b909-2131cc8d5cbc] State transition from REBALANCING to RUNNING:
> January 8th 2019, 10:20:56.148 INFO   sonic-bos  [Consumer clientId=bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-3-restore-consumer, groupId=] Unsubscribed all topics or patterns and assigned partitions:
> January 8th 2019, 10:20:52.663 INFO   sonic-bos  stream-thread [bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-5] State transition from PARTITIONS_ASSIGNED to RUNNING:
> January 8th 2019, 10:20:52.660 INFO   sonic-bos  [Consumer clientId=bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-5-restore-consumer, groupId=] Unsubscribed all topics or patterns and assigned partitions:
> January 8th 2019, 10:20:52.544 INFO   sonic-bos  [Consumer clientId=bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-5-restore-consumer, groupId=] Unsubscribed all topics or patterns and assigned partitions:
> January 8th 2019, 10:20:52.306 INFO   sonic-bos  stream-thread [bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-6] State transition from PARTITIONS_ASSIGNED to RUNNING:
> January 8th 2019, 10:20:52.303 INFO   sonic-bos  [Consumer clientId=bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-6-restore-consumer, groupId=] Unsubscribed all topics or patterns and assigned partitions:
> January 8th 2019, 10:20:52.160 INFO   sonic-bos  stream-thread [bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-2] State transition from PARTITIONS_ASSIGNED to RUNNING:
> January 8th 2019, 10:20:52.143 INFO   sonic-bos  [Consumer clientId=bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-2-restore-consumer, groupId=] Unsubscribed all topics or patterns and assigned partitions:
> January 8th 2019, 10:20:51.903 INFO   sonic-bos  Cluster ID: gDAxt9QsQmO1qdazdSwkXw:
> January 8th 2019, 10:20:51.819 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring task 3_10's state store cdrev_store from beginning of the changelog bos-cdrev_store-changelog-10 :
> January 8th 2019, 10:20:51.818 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring task 0_8's state store bos_unique_input_message from beginning of the changelog bos-bos_unique_input_message-changelog-8 :
> January 8th 2019, 10:20:51.809 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring task 6_10's state store clcomp_store from beginning of the changelog bos-clcomp_store-changelog-10 :
> January 8th 2019, 10:20:51.796 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring task 3_10's state store cdrev_cap_status_and_change_log_store from beginning of the changelog bos-cdrev_cap_status_and_change_log_store-changelog-10
> January 8th 2019, 10:20:51.786 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring task 6_10's state store clcomp_cap_status_and_change_log_store from beginning of the changelog bos-clcomp_cap_status_and_change_log_store-changelog-10
> January 8th 2019, 10:20:51.784 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring task 0_9's state store bos_unique_input_message from beginning of the changelog bos-bos_unique_input_message-changelog-9 :
> January 8th 2019, 10:20:51.763 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring task 5_10's state store clrev_cap_status_and_change_log_store from beginning of the changelog bos-clrev_cap_status_and_change_log_store-changelog-10 :
> January 8th 2019, 10:20:51.750 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring task 6_10's state store clcomp_cap_status_store from beginning of the changelog bos-clcomp_cap_status_store-changelog-10 :
> January 8th 2019, 10:20:51.739 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring task 3_10's state store cdrev_cap_status_store from beginning of the changelog bos-cdrev_cap_status_store-changelog-10 :
> January 8th 2019, 10:20:51.720 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring task 4_4's state store cdcomp_cap_status_and_change_log_store from beginning of the changelog bos-cdcomp_cap_status_and_change_log_store-changelog-4 :
> January 8th 2019, 10:20:51.720 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-1] Restoring task 4_11's state store cdcomp_cap_status_store from beginning of the changelog bos-cdcomp_cap_status_store-changelog-11 :
> January 8th 2019, 10:20:51.716 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-6] Restoring task 5_3's state store clrev_cap_status_store from beginning of the changelog bos-clrev_cap_status_store-changelog-3 :
> January 8th 2019, 10:20:51.710 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring task 5_10's state store clrev_store from beginning of the changelog bos-clrev_store-changelog-10 :
> January 8th 2019, 10:20:51.702 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-2] Restoring task 4_9's state store cdcomp_store from beginning of the changelog bos-cdcomp_store-changelog-9 :
> January 8th 2019, 10:20:51.690 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-1] Restoring task 3_8's state store cdrev_store from beginning of the changelog bos-cdrev_store-changelog-8 :
> January 8th 2019, 10:20:51.688 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-6] Restoring task 3_9's state store cdrev_store from beginning of the changelog bos-cdrev_store-changelog-9 :
> January 8th 2019, 10:20:51.688 INFO   sonic-bos  [Consumer clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3-restore-consumer, groupId=] Resetting offset for partition bos-cdcomp_cap_status_store-changelog-4 to offset 0.:
> January 8th 2019, 10:20:51.688 INFO   sonic-bos  [Consumer clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3-restore-consumer, groupId=] Resetting offset for partition bos-clcomp_cap_status_and_change_log_store-changelog-10 to offset 0.:
> January 8th 2019, 10:20:51.688 INFO   sonic-bos  [Consumer clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3-restore-consumer, groupId=] Resetting offset for partition bos-bos_unique_input_message-changelog-9 to offset 3667118.:
> January 8th 2019, 10:20:51.687 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring task 4_4's state store cdcomp_cap_status_store from beginning of the changelog bos-cdcomp_cap_status_store-changelog-4 :
> January 8th 2019, 10:20:51.687 INFO   sonic-bos  [Consumer clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3-restore-consumer, groupId=] Resetting offset for partition bos-cdcomp_cap_status_and_change_log_store-changelog-4 to offset 0.:
> January 8th 2019, 10:20:51.547 INFO   sonic-bos  [Consumer clientId=bos-04221576-e08a-4e43-bdcb-cb9231034641-StreamThread-2-restore-consumer, groupId=] Resetting offset for partition bos-bos_unique_input_message-changelog-1 to offset 3670834.:
> January 8th 2019, 10:20:51.498 INFO   sonic-bos  [Consumer clientId=bos-04221576-e08a-4e43-bdcb-cb9231034641-StreamThread-4-restore-consumer, groupId=] Resetting offset for partition bos-cdcomp_cap_status_store-changelog-10 to offset 0.:
> January 8th 2019, 10:20:51.498 INFO   sonic-bos  [Consumer clientId=bos-04221576-e08a-4e43-bdcb-cb9231034641-StreamThread-4-restore-consumer, groupId=] Resetting offset for partition bos-bos_unique_input_message-changelog-10 to offset 3694368.:
> January 8th 2019, 10:20:51.489 INFO   sonic-bos  stream-thread [bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-2] Restoring task 4_9's state store cdcomp_cap_status_store from beginning of the changelog bos-cdcomp_cap_status_store-changelog-9 :
> January 8th 2019, 10:20:51.486 INFO   sonic-bos  [Consumer clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer, groupId=] Resetting offset for partition bos-clrev_cap_status_and_change_log_store-changelog-10 to offset 0.:
> January 8th 2019, 10:20:51.484 INFO   sonic-bos  [Consumer clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer, groupId=] Resetting offset for partition bos-cdrev_cap_status_store-changelog-10 to offset 0.:
> January 8th 2019, 10:20:51.480 INFO   sonic-bos  [Consumer clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer, groupId=] Resetting offset for partition bos-clrev_cap_status_store-changelog-10 to offset 0.:
> January 8th 2019, 10:20:51.478 INFO   sonic-bos  [Consumer clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer, groupId=] Resetting offset for partition bos-cdrev_store-changelog-10 to offset 0.:
> January 8th 2019, 10:20:51.476 INFO   sonic-bos  [Consumer clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer, groupId=] Resetting offset for partition bos-bos_unique_input_message-changelog-8 to offset 3748107.:
> January 8th 2019, 10:20:51.474 INFO   sonic-bos  [Consumer clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer, groupId=] Resetting offset for partition bos-cdrev_cap_status_and_change_log_store-changelog-10 to offset 0.:
>
>
> We have some restoration going on with a very quick "RUNNING" state
> change. (cdrev_store is the state store of the compacted topic I showed at
> the beginning of this email)
>
> Thanks again John :)
>
> Nitay
> On Tue, Jan 8, 2019 at 12:26 AM John Roesler <jo...@confluent.io> wrote:
>
>> Hi Nitay,
>>
>> > I will provide extra logs if it will happen again (I really really hope
>> it
>> > won't hehe :))
>>
>> Yeah, I hear you. Reproducing errors in production is a real double-edged
>> sword!
>>
>> Thanks for the explanation. It makes sense now.
>>
>> This may be grasping at straws, but it seems like your frequent rebalances
>> may be exposing you to this recently reported bug:
>> https://issues.apache.org/jira/browse/KAFKA-7672
>>
>> The reporter mentioned one thing that can help identify it, which is that
>> it prints a message saying that it's going to re-initialize the state
>> store, followed immediately by a transition to "running". Perhaps you can
>> check your Streams logs to see if you see anything similar.
>>
>> Thanks,
>> -John
>>
>> On Sat, Jan 5, 2019 at 10:48 AM Nitay Kufert <ni...@ironsrc.com> wrote:
>>
>> > Hey John,
>> > Thanks for the response!
>> >
>> > I will provide extra logs if it will happen again (I really really hope
>> it
>> > won't hehe :))
>> >
>> > Some clarification regarding the previous mail:
>> > The only thing that shows the data loss is the messages from the
>> > compacted topic, which I consumed a couple of hours after I noticed the
>> > data loss.
>> > This compacted topic is an output of my stream application (basically,
>> I am
>> > using reduce on the same key to SUM the values, and pushing it to the
>> > compacted topic using ".to")
>> >
>> > The only correlation I have for those messages are the timestamps of the
>> > messages in the compacted topic.
>> > So I took logs from Spotinst & Kafka Stream instances around the same
>> time
>> > and shared them here (I know correlation doesn't mean causation but
>> that's
>> > the only thing I have :) )
>> >
>> > The messages showed an ever-increasing value for the specific key I was
>> > investigating, which was expected.
>> > The unexpected thing was that suddenly the value started the aggregation
>> > back at 0 for some reason.
>> >
>> > In an effort to understand what's going on, I added logs to the function
>> > I use for reducing the stream to try and log cases where this thing
>> > happens - but it didn't log anything... which makes me think that the
>> > reduce function "initialized" itself, meaning it acted as if it was the
>> > first message (nothing to aggregate the value with - so we just put the
>> > value).
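>> >
>> > (A sketch of why such logging can stay silent: the Reducer is only invoked
>> > when a previous aggregate already exists for the key, so if the store comes
>> > back empty the first value is written as-is and the reduce function never
>> > runs. The class below is illustrative, not our actual code.)
>> >
>> >     import org.apache.kafka.streams.kstream.Reducer;
>> >
>> >     public class LoggingSumReducer implements Reducer<Double> {
>> >         @Override
>> >         public Double apply(final Double aggregate, final Double value) {
>> >             // Only reached when an existing aggregate is found in the
>> >             // state store. If the store is empty (e.g. after a bad
>> >             // restore), the incoming value is stored directly and
>> >             // nothing is logged here.
>> >             System.out.println("reduce: " + aggregate + " + " + value);
>> >             return aggregate + value;
>> >         }
>> >     }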
>> >
>> > In the example I have shared, I have keys in the format cdr_<ID> with
>> > values which are BigDecimal numbers.
>> > I could have shared the thousands of messages I consumed from the topic
>> > before reaching the value 1621.72, It would have looked something like :
>> > cdr_44334 -> 1619.32
>> > cdr_44334 -> 1619.72
>> > cdr_44334 -> 1620.12
>> > cdr_44334 -> 1620.52
>> > cdr_44334 -> 1620.92
>> > cdr_44334 -> 1621.32
>> > cdr_44334 -> 1621.72
>> > cdr_44334 -> 0.27
>> > cdr_44334 -> 0.67
>> > cdr_44334 -> 1.07
>> >
>> > So basically, the only thing that shows the loss is the sudden decrease
>> > in value for a specific key (I had thousands of keys that lost their
>> > value - but many, many more that didn't).
>> > (I am monitoring those changes using datadog, so I know which keys are
>> > affected and I can investigate them)
>> >
>> > Let me know if you need some more details or if you want me to escalate
>> > this situation to a jira
>> >
>> > Thanks again
>> >
>> >
>> >
>> > On Thu, Jan 3, 2019 at 11:36 PM John Roesler <jo...@confluent.io> wrote:
>> >
>> > > Hi Nitay,
>> > >
>> > > I'm sorry to hear of these troubles; it sounds frustrating.
>> > >
>> > > No worries about spamming the list, but it does sound like this might
>> be
>> > > worth tracking as a bug report in Jira.
>> > > Obviously, we do not expect to lose data when instances come and go,
>> > > regardless of the frequency, and we do have tests in place to verify
>> > this.
>> > > Of course, you might be exercising something that our tests miss.
>> > >
>> > > Thanks for collating the logs. It really helps to understand what's
>> going
>> > > on.
>> > >
>> > > Unfortunately, the red coloring didn't make it through the mailing
>> list,
>> > so
>> > > I'm not sure which specific line you were referencing as demonstrating
>> > data
>> > > loss.
>> > >
>> > > Just in case you're concerned about the "Updating StandbyTasks failed"
>> > > warnings, they should be fine. It indicates that a thread was unable
>> to
>> > > re-use a state store that it had previously been assigned in the
>> past, so
>> > > instead it deletes the local data and recreates the whole thing from
>> the
>> > > changelog.
>> > >
>> > > The Streams logs that would be really useful to capture are the
>> lifecycle
>> > > ones, like
>> > >
>> > > [2018-12-14 17:34:30,326] INFO stream-thread
>> > > >
>> > >
>> >
>> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
>> > > > State transition from RUNNING to PARTITIONS_REVOKED
>> > > > (org.apache.kafka.streams.processor.internals.StreamThread)
>> > >
>> > >
>> > >
>> > > [2018-12-14 17:34:30,326] INFO stream-client
>> > > > [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980]
>> > State
>> > > > transition from RUNNING to REBALANCING
>> > > > (org.apache.kafka.streams.KafkaStreams)
>> > >
>> > >
>> > > Also, it would be helpful to see the assignment transitions in line
>> with
>> > > the state transitions. Examples:
>> > >
>> > > [2018-12-14 17:34:31,863] DEBUG stream-thread
>> > > >
>> > >
>> >
>> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
>> > > > Adding assigned tasks as active: {0_2=[standbyTaskSource1-2],
>> > > > 0_5=[standbyTaskSource1-5]}
>> > > > (org.apache.kafka.streams.processor.internals.TaskManager)
>> > >
>> > >
>> > >
>> > > [2018-12-14 17:34:31,882] DEBUG stream-thread
>> > > >
>> > >
>> >
>> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
>> > > > Adding assigned standby tasks {0_4=[standbyTaskSource1-4],
>> > > > 0_1=[standbyTaskSource1-1]}
>> > > > (org.apache.kafka.streams.processor.internals.TaskManager)
>> > >
>> > >
>> > >
>> > > [2018-12-14 17:34:31,885] INFO stream-thread
>> > > >
>> > >
>> >
>> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
>> > > > partition assignment took 22 ms.
>> > > >    current active tasks: [0_2, 0_5]
>> > > >    current standby tasks: [0_1, 0_4]
>> > > >    previous active tasks: []
>> > > >  (org.apache.kafka.streams.processor.internals.StreamThread)
>> > >
>> > >
>> > >
>> > > I look forward to hearing back from you (either with more detailed
>> logs
>> > or
>> > > just a clarification about how the given logs indicate data loss). A
>> > report
>> > > about potential data loss is very concerning.
>> > >
>> > > Thanks,
>> > > -John
>> > >
>> > > On Sun, Dec 30, 2018 at 9:23 AM Nitay Kufert <ni...@ironsrc.com>
>> > wrote:
>> > >
>> > > > Hey everybody,
>> > > > We are running Kafka streams in production for the last year or so
>> - we
>> > > > currently using the latest version (2.1.0) and we suffered from data
>> > lose
>> > > > several times before.
>> > > > The first time we noticed a data loss, we were able to trace it
>> back to
>> > > > Exception that we were getting in the code - which eventually
>> mapped to
>> > > an
>> > > > open bug that the Kafka team is still working on. So the temporary
>> > > solution
>> > > > was to disable the feature that causes the Exception (in this case
>> - it
>> > > was
>> > > > the "exactly_once" semantics) and move to "at_lease_once" semantics
>> +
>> > > piece
>> > > > of code that handles duplications.
>> > > > The 2nd time we noticed a data loss, we traced it back to some kind
>> of
>> > > > Exception caused by lack of memory. To make a long story short - we
>> hit
>> > > the
>> > > > limit for open files on the machines (a lot of files are used by
>> > > rocksDB) -
>> > > > so increasing the RAM of the machines & increasing the number of
>> > allowed
>> > > > open files on the OS solved this problem.
>> > > >
>> > > > Now, we are facing data loss for the 3rd time - this time it seems
>> to
>> > > > happen when our Kafka stream instances switch (reproducible -
>> happened
>> > 2
>> > > > separate times). let me explain:
>> > > > We are using a 3rd party company called Spotinst - which basically
>> > helps
>> > > > you save costs by monitoring the Amazon spot market, and switching
>> > > between
>> > > > instances when they find a cheaper one.
>> > > >
>> > > > The question is, why would it cause data loss?
>> > > > Those are logs I collected and put together in a single timeline,
>> > > including
>> > > > messages from Kafka stream instances (from Kibana), Spotinst (3rd
>> party
>> > > > company) & the data in the compacted topic where the data should
>> have
>> > > been
>> > > > kept (basically its a compacted topic behind a reduce function -
>> and it
>> > > > seems like the aggregated data was lost and the function was
>> invocated
>> > as
>> > > > if its the first time its aggregating anything).
>> > > > What you are seeing is that Spotinst saw an increase in CPU - and
>> > > initiated
>> > > > an Upscale (2 instances), and shortly after it - 2 instances went
>> down
>> > > > (Downscale) as the load was over. In *RED* you can see the actual
>> data
>> > > loss
>> > > > (as observed from the compacted topic)
>> > > >
>> > > > DATE TIME FACILITY INFO
>> > > > 12/25/2018 5:17:03 Spotinst Instances Launched - Autoscaling: Policy
>> > > Name:
>> > > > Scaling Policy-Up, Threshold: 70.0, Value Observed: 70.0
>> > > > 12/25/2018 5:22:34 Spotinst Got Signal INSTANCE_READY For Instance
>> > > > I-instace1
>> > > > 12/25/2018 5:22:54 instace2 The following subscribed topics are not
>> > > > assigned to any members: [bosThresholds]
>> > > > 12/25/2018 5:22:56 instace3 Updating StandbyTasks failed. Deleting
>> > > > StandbyTasks stores to recreate from scratch.:
>> > > > org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
>> > > > Offsets out of range with no configured reset policy for partitions:
>> > > > {bos-unique_input_message-changelog-1=7491727}
>> > > > 12/25/2018 5:23:21 Spotinst instace4 - INSTANCE_READY
>> > > > 12/25/2018 5:23:27 Compacted Topic cdr_44334 -> 1621.72
>> > > > 12/25/2018 5:23:33 Spotinst instace4 - Successfully Registered To
>> Load
>> > > > Balancer
>> > > > 12/25/2018 5:23:33 Spotinst instace1 - Successfully Registered To
>> Load
>> > > > Balancer
>> > > > 12/25/2018 5:23:34 Compacted Topic cdr_44334-> 0.27
>> > > > 12/25/2018 5:23:40 instace2 The following subscribed topics are not
>> > > > assigned to any members: [bosThresholds]
>> > > > 12/25/2018 5:23:46 instace3
>> > > > org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
>> > > > Offsets out of range with no configured reset policy for partitions:
>> > > > {bos-unique_input_message-changelog-1=7491727}
>> > > > 12/25/2018 5:27:05 Spotinst Instances Terminated - Autoscaling:
>> Policy
>> > > > Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 6.0
>> > > > 12/25/2018 5:27:05 Spotinst instace2 - Was Successfully Deregistered
>> > From
>> > > > CLASSIC Load Balancer
>> > > > 12/25/2018 5:27:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN
>> For
>> > > > Instance I-instace2
>> > > > 12/25/2018 5:27:23 Spotinst Shutdown Script Completed. InstanceId:
>> > > > I-instace2 (View Details)
>> > > > 12/25/2018 5:27:27 instace4 The following subscribed topics are not
>> > > > assigned to any members: [bosThresholds]
>> > > > 12/25/2018 5:33:03 Spotinst Instances I-instace3 Was Successfully
>> > > > Deregistered From CLASSIC Load Balancer
>> > > > 12/25/2018 5:33:03 Spotinst Instances Terminated - Autoscaling:
>> Policy
>> > > > Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 25.0
>> > > > 12/25/2018 5:33:23 Spotinst Shutdown Script Completed. InstanceId:
>> > > > I-instace3 (View Details)
>> > > > 12/25/2018 5:33:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN
>> For
>> > > > Instance I-instace3
>> > > > 12/25/2018 5:33:28 instace4 The following subscribed topics are not
>> > > > assigned to any members: [bosThresholds]
>> > > > I can provide much more information but I feel like I already
>> spammed
>> > the
>> > > > group as it is :/
>> > > > Hope you can make sense of what I am writing and maybe shed some
>> light
>> > on
>> > > > the possible reasons for this strange behavior.
>> > > >
>> > > > For now, as a temporary solution, we are moving to "on-demand"
>> > instances
>> > > > (which basically means that machines won't go up and down often),
>> so I
>> > > hope
>> > > > it will solve our problems.
>> > > >
>> > > > Thanks
>> > > > --
>> > > > Nitay Kufert
>> > > > Backend Team Leader
>> > > >
>> > > > email nitay.k@ironsrc.com
>> > > > mobile +972-54-5480021 <+972%2054-548-0021> <+972%2054-548-0021>
>> > > > fax +972-77-5448273 <+972%2077-544-8273> <+972%2077-544-8273>
>> > > > Derech Menachem Begin 121, Tel- Aviv
>> > > > ironsrc.com <http://www.ironsrc.com/>
>> > > >
>> > >
>> > --
>> > Nitay Kufert
>> > Backend Team Leader
>> >
>> > email nitay.k@ironsrc.com
>> > mobile +972-54-5480021 <+972%2054-548-0021>
>> > fax +972-77-5448273 <+972%2077-544-8273>
>> > Derech Menachem Begin 121, Tel- Aviv
>> > ironsrc.com <http://www.ironsrc.com/>
>> >
>>
> --
> Nitay Kufert
> Backend Team Leader
>
> email nitay.k@ironsrc.com
> mobile +972-54-5480021 <+972%2054-548-0021>
> fax +972-77-5448273 <+972%2077-544-8273>
> Derech Menachem Begin 121, Tel- Aviv
> ironsrc.com <http://www.ironsrc.com/>
>
-- 
Nitay Kufert
Backend Team Leader

email nitay.k@ironsrc.com
mobile +972-54-5480021
fax +972-77-5448273
Derech Menachem Begin 121, Tel- Aviv
ironsrc.com <http://www.ironsrc.com/>

Re: Kafka Streams 2.1.0, 3rd time data loss investigation

Posted by Nitay Kufert <ni...@ironsrc.com>.
Thanks, it seems promising. It sounds a lot like the problems we are having.
Do you know when the fix will be released?

BTW, it just happened to us again, this time when I manually added 2 new
instances (we had 4 and I increased to 6).
This is the compacted topic showing the data loss:

CreateTime:1546934915862   cdr_8_23   3010.5875
CreateTime:1546935024173   cdr_8_23   3012.2675
CreateTime:1546935182875   cdr_8_23   3013.9475
CreateTime:1546935394044   cdr_8_23   3016.1575
CreateTime:1546935401635   cdr_8_23   3017.2775
CreateTime:1546935466391   cdr_8_23   3019.4875
CreateTime:1546935510847   cdr_8_23   3021.1675  // Tuesday, January 8, 2019 10:18:30.847 AM GMT+02:00
CreateTime:1546935559457   cdr_8_23   2.08       // Tuesday, January 8, 2019 10:19:19.457 AM GMT+02:00
CreateTime:1546935875705   cdr_8_23   6.5
CreateTime:1546935785418   cdr_8_23   10.66
CreateTime:1546936046269   cdr_8_23   12.87
CreateTime:1546936181875   cdr_8_23   15.08
CreateTime:1546936967967   cdr_8_23   17.29
CreateTime:1546937000117   cdr_8_23   18.6925
CreateTime:1546937015276   cdr_8_23   19.8125
CreateTime:1546937066957   cdr_8_23   21.4925
CreateTime:1546937253908   cdr_8_23   23.7025
CreateTime:1546937656179   cdr_8_23   25.9125
CreateTime:1546937712409   cdr_8_23   28.1225
CreateTime:1546937796223   cdr_8_23   30.3325
CreateTime:1546938051609   cdr_8_23   32.5425
CreateTime:1546938162873   cdr_8_23   34.3425

Looking at the logs, the first thing that is weird is a HUGE burst *[~700K
INFO messages in a single minute (10:19 GMT+2)]* of:
[Consumer
clientId=bos-40d0e3c2-b240-4def-a9e8-52c88b540417-StreamThread-1-restore-consumer,
groupId=] Unsubscribed all topics or patterns and assigned partitions:

This made the logs hard to read, so I tried to get a more informative view by
hiding that message and zooming in on a single instance. This is the result:

Time                            level   facility    message
January 8th 2019, 10:21:01.044 INFO   sonic-bos  [Consumer
clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer,
groupId=] Resetting offset for partition
bos-clrev_cap_status_and_change_log_store-changelog-0 to offset 0.:
January 8th 2019, 10:21:01.044 INFO   sonic-bos  [Consumer
clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer,
groupId=] Resetting offset for partition
bos-clrev_cap_status_store-changelog-0 to offset 0.:
January 8th 2019, 10:21:01.044 INFO   sonic-bos  [Consumer
clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer,
groupId=] Resetting offset for partition
bos-cdrev_cap_status_store-changelog-6 to offset 0.:
January 8th 2019, 10:21:00.950 INFO   sonic-bos  [Consumer
clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer,
groupId=] Resetting offset for partition bos-clrev_store-changelog-0
to offset 0.:
January 8th 2019, 10:21:00.950 INFO   sonic-bos  [Consumer
clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer,
groupId=] Resetting offset for partition
bos-cdrev_cap_status_and_change_log_store-changelog-6 to offset 0.:
January 8th 2019, 10:21:00.533 INFO   sonic-bos  [Consumer
clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer,
groupId=] Resetting offset for partition bos-cdrev_store-changelog-6
to offset 0.:
January 8th 2019, 10:21:00.518 INFO   sonic-bos  stream-thread
[bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2] State
transition from PARTITIONS_ASSIGNED to RUNNING:
January 8th 2019, 10:21:00.515 INFO   sonic-bos  [Consumer
clientId=bos-743af6ce-57f0-41c1-9005-b00116a244f2-StreamThread-2-restore-consumer,
groupId=] Unsubscribed all topics or patterns and assigned partitions:
January 8th 2019, 10:20:56.151 INFO   sonic-bos  stream-thread
[bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-3] State
transition from PARTITIONS_ASSIGNED to RUNNING:
January 8th 2019, 10:20:56.151 INFO   sonic-bos  stream-client
[bos-6e2726b6-489e-4206-b909-2131cc8d5cbc] State transition from
REBALANCING to RUNNING:
January 8th 2019, 10:20:56.148 INFO   sonic-bos  [Consumer
clientId=bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-3-restore-consumer,
groupId=] Unsubscribed all topics or patterns and assigned partitions:
January 8th 2019, 10:20:52.663 INFO   sonic-bos  stream-thread
[bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-5] State
transition from PARTITIONS_ASSIGNED to RUNNING:
January 8th 2019, 10:20:52.660 INFO   sonic-bos  [Consumer
clientId=bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-5-restore-consumer,
groupId=] Unsubscribed all topics or patterns and assigned partitions:
January 8th 2019, 10:20:52.544 INFO   sonic-bos  [Consumer
clientId=bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-5-restore-consumer,
groupId=] Unsubscribed all topics or patterns and assigned partitions:
January 8th 2019, 10:20:52.306 INFO   sonic-bos  stream-thread
[bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-6] State
transition from PARTITIONS_ASSIGNED to RUNNING:
January 8th 2019, 10:20:52.303 INFO   sonic-bos  [Consumer
clientId=bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-6-restore-consumer,
groupId=] Unsubscribed all topics or patterns and assigned partitions:
January 8th 2019, 10:20:52.160 INFO   sonic-bos  stream-thread
[bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-2] State
transition from PARTITIONS_ASSIGNED to RUNNING:
January 8th 2019, 10:20:52.143 INFO   sonic-bos  [Consumer
clientId=bos-6e2726b6-489e-4206-b909-2131cc8d5cbc-StreamThread-2-restore-consumer,
groupId=] Unsubscribed all topics or patterns and assigned partitions:
January 8th 2019, 10:20:51.903 INFO   sonic-bos  Cluster ID:
gDAxt9QsQmO1qdazdSwkXw:
January 8th 2019, 10:20:51.819 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring
task 3_10's state store cdrev_store from beginning of the changelog
bos-cdrev_store-changelog-10 :
January 8th 2019, 10:20:51.818 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring
task 0_8's state store bos_unique_input_message from beginning of the
changelog bos-bos_unique_input_message-changelog-8 :
January 8th 2019, 10:20:51.809 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring
task 6_10's state store clcomp_store from beginning of the changelog
bos-clcomp_store-changelog-10 :
January 8th 2019, 10:20:51.796 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring
task 3_10's state store cdrev_cap_status_and_change_log_store from
beginning of the changelog
bos-cdrev_cap_status_and_change_log_store-changelog-10
January 8th 2019, 10:20:51.786 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring
task 6_10's state store clcomp_cap_status_and_change_log_store from
beginning of the changelog
bos-clcomp_cap_status_and_change_log_store-changelog-10
January 8th 2019, 10:20:51.784 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring
task 0_9's state store bos_unique_input_message from beginning of the
changelog bos-bos_unique_input_message-changelog-9 :
January 8th 2019, 10:20:51.763 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring
task 5_10's state store clrev_cap_status_and_change_log_store from
beginning of the changelog
bos-clrev_cap_status_and_change_log_store-changelog-10 :
January 8th 2019, 10:20:51.750 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring
task 6_10's state store clcomp_cap_status_store from beginning of the
changelog bos-clcomp_cap_status_store-changelog-10 :
January 8th 2019, 10:20:51.739 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring
task 3_10's state store cdrev_cap_status_store from beginning of the
changelog bos-cdrev_cap_status_store-changelog-10 :
January 8th 2019, 10:20:51.720 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring
task 4_4's state store cdcomp_cap_status_and_change_log_store from
beginning of the changelog
bos-cdcomp_cap_status_and_change_log_store-changelog-4 :
January 8th 2019, 10:20:51.720 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-1] Restoring
task 4_11's state store cdcomp_cap_status_store from beginning of the
changelog bos-cdcomp_cap_status_store-changelog-11 :
January 8th 2019, 10:20:51.716 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-6] Restoring
task 5_3's state store clrev_cap_status_store from beginning of the
changelog bos-clrev_cap_status_store-changelog-3 :
January 8th 2019, 10:20:51.710 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5] Restoring
task 5_10's state store clrev_store from beginning of the changelog
bos-clrev_store-changelog-10 :
January 8th 2019, 10:20:51.702 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-2] Restoring
task 4_9's state store cdcomp_store from beginning of the changelog
bos-cdcomp_store-changelog-9 :
January 8th 2019, 10:20:51.690 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-1] Restoring
task 3_8's state store cdrev_store from beginning of the changelog
bos-cdrev_store-changelog-8 :
January 8th 2019, 10:20:51.688 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-6] Restoring
task 3_9's state store cdrev_store from beginning of the changelog
bos-cdrev_store-changelog-9 :
January 8th 2019, 10:20:51.688 INFO   sonic-bos  [Consumer
clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3-restore-consumer,
groupId=] Resetting offset for partition
bos-cdcomp_cap_status_store-changelog-4 to offset 0.:
January 8th 2019, 10:20:51.688 INFO   sonic-bos  [Consumer
clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3-restore-consumer,
groupId=] Resetting offset for partition
bos-clcomp_cap_status_and_change_log_store-changelog-10 to offset 0.:
January 8th 2019, 10:20:51.688 INFO   sonic-bos  [Consumer
clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3-restore-consumer,
groupId=] Resetting offset for partition
bos-bos_unique_input_message-changelog-9 to offset 3667118.:
January 8th 2019, 10:20:51.687 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3] Restoring
task 4_4's state store cdcomp_cap_status_store from beginning of the
changelog bos-cdcomp_cap_status_store-changelog-4 :
January 8th 2019, 10:20:51.687 INFO   sonic-bos  [Consumer
clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-3-restore-consumer,
groupId=] Resetting offset for partition
bos-cdcomp_cap_status_and_change_log_store-changelog-4 to offset 0.:
January 8th 2019, 10:20:51.547 INFO   sonic-bos  [Consumer
clientId=bos-04221576-e08a-4e43-bdcb-cb9231034641-StreamThread-2-restore-consumer,
groupId=] Resetting offset for partition
bos-bos_unique_input_message-changelog-1 to offset 3670834.:
January 8th 2019, 10:20:51.498 INFO   sonic-bos  [Consumer
clientId=bos-04221576-e08a-4e43-bdcb-cb9231034641-StreamThread-4-restore-consumer,
groupId=] Resetting offset for partition
bos-cdcomp_cap_status_store-changelog-10 to offset 0.:
January 8th 2019, 10:20:51.498 INFO   sonic-bos  [Consumer
clientId=bos-04221576-e08a-4e43-bdcb-cb9231034641-StreamThread-4-restore-consumer,
groupId=] Resetting offset for partition
bos-bos_unique_input_message-changelog-10 to offset 3694368.:
January 8th 2019, 10:20:51.489 INFO   sonic-bos  stream-thread
[bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-2] Restoring
task 4_9's state store cdcomp_cap_status_store from beginning of the
changelog bos-cdcomp_cap_status_store-changelog-9 :
January 8th 2019, 10:20:51.486 INFO   sonic-bos  [Consumer
clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer,
groupId=] Resetting offset for partition
bos-clrev_cap_status_and_change_log_store-changelog-10 to offset 0.:
January 8th 2019, 10:20:51.484 INFO   sonic-bos  [Consumer
clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer,
groupId=] Resetting offset for partition
bos-cdrev_cap_status_store-changelog-10 to offset 0.:
January 8th 2019, 10:20:51.480 INFO   sonic-bos  [Consumer
clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer,
groupId=] Resetting offset for partition
bos-clrev_cap_status_store-changelog-10 to offset 0.:
January 8th 2019, 10:20:51.478 INFO   sonic-bos  [Consumer
clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer,
groupId=] Resetting offset for partition bos-cdrev_store-changelog-10
to offset 0.:
January 8th 2019, 10:20:51.476 INFO   sonic-bos  [Consumer
clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer,
groupId=] Resetting offset for partition
bos-bos_unique_input_message-changelog-8 to offset 3748107.:
January 8th 2019, 10:20:51.474 INFO   sonic-bos  [Consumer
clientId=bos-312025d8-ea91-4483-8652-d8128cd99c7e-StreamThread-5-restore-consumer,
groupId=] Resetting offset for partition
bos-cdrev_cap_status_and_change_log_store-changelog-10 to offset 0.:


We have some restoration going on with a very quick "RUNNING" state change.
(cdrev_store is the state store of the compacted topic I showed at the
beginning of this email)

Thanks again John :)

Nitay
On Tue, Jan 8, 2019 at 12:26 AM John Roesler <jo...@confluent.io> wrote:

> Hi Nitay,
>
> > I will provide extra logs if it will happen again (I really really hope
> it
> > won't hehe :))
>
> Yeah, I hear you. Reproducing errors in production is a real double-edged
> sword!
>
> Thanks for the explanation. It makes sense now.
>
> This may be grasping at straws, but it seems like your frequent rebalances
> may be exposing you to this recently reported bug:
> https://issues.apache.org/jira/browse/KAFKA-7672
>
> The reporter mentioned one thing that can help identify it, which is that
> it prints a message saying that it's going to re-initialize the state
> store, followed immediately by a transition to "running". Perhaps you can
> check your Streams logs to see if you see anything similar.
>
> Thanks,
> -John
>
> On Sat, Jan 5, 2019 at 10:48 AM Nitay Kufert <ni...@ironsrc.com> wrote:
>
> > Hey John,
> > Thanks for the response!
> >
> > I will provide extra logs if it will happen again (I really really hope
> it
> > won't hehe :))
> >
> > Some clarification regarding the previous mail:
> > The only thing that shows the data loss is the messages from the
> compacted
> > topic which I consumed a couple of hours after the I noticed the data
> loss.
> > This compacted topic is an output of my stream application (basically, I
> am
> > using reduce on the same key to SUM the values, and pushing it to the
> > compacted topic using ".to")
> >
> > The only correlation I have for those messages are the timestamps of the
> > messages in the compacted topic.
> > So I took logs from Spotinst & Kafka Stream instances around the same
> time
> > and shared them here (I know correlation doesn't mean causation but
> that's
> > the only thing I have :) )
> >
> > The messages showed an ever-increasing value for the specific key I was
> > investigating, which was expected.
> > The unexpected thing was that suddenly the value started the aggregation
> > back at 0 for some reason.
> >
> > In an effort to understand what's going on, I added logs to the function
> I
> > use for reducing the stream to try and log cases where this thing
> happens -
> > but it didn't log anything.. which makes me think as if the reduce
> function
> > "initialized" itself, meaning it acted as if it was the first message
> > (nothing to aggregate the value with - so we just put the value)
> >
> > In the example I have shared, I have keys in the format cdr_<ID> with
> > values which are BigDecimal numbers.
> > I could have shared the thousands of messages I consumed from the topic
> > before reaching the value 1621.72, It would have looked something like :
> > cdr_44334 -> 1619.32
> > cdr_44334 -> 1619.72
> > cdr_44334 -> 1620.12
> > cdr_44334 -> 1620.52
> > cdr_44334 -> 1620.92
> > cdr_44334 -> 1621.32
> > cdr_44334 -> 1621.72
> > cdr_44334 -> 0.27
> > cdr_44334 -> 0.67
> > cdr_44334 -> 1.07
> >
> > So basically, the only thing that shows the loss is the sudden decrease
> in
> > value in a specific key (I had thousands of keys who lost their value -
> but
> > many many more that didn't lose their value).
> > (I am monitoring those changes using datadog, so I know which keys are
> > affected and I can investigate them)
> >
> > Let me know if you need some more details or if you want me to escalate
> > this situation to a jira
> >
> > Thanks again
> >
> >
> >
> > On Thu, Jan 3, 2019 at 11:36 PM John Roesler <jo...@confluent.io> wrote:
> >
> > > Hi Nitay,
> > >
> > > I'm sorry to hear of these troubles; it sounds frustrating.
> > >
> > > No worries about spamming the list, but it does sound like this might
> be
> > > worth tracking as a bug report in Jira.
> > > Obviously, we do not expect to lose data when instances come and go,
> > > regardless of the frequency, and we do have tests in place to verify
> > this.
> > > Of course, you might be exercising something that our tests miss.
> > >
> > > Thanks for collating the logs. It really helps to understand what's
> going
> > > on.
> > >
> > > Unfortunately, the red coloring didn't make it through the mailing
> list,
> > so
> > > I'm not sure which specific line you were referencing as demonstrating
> > data
> > > loss.
> > >
> > > Just in case you're concerned about the "Updating StandbyTasks failed"
> > > warnings, they should be fine. It indicates that a thread was unable to
> > > re-use a state store that it had previously been assigned in the past,
> so
> > > instead it deletes the local data and recreates the whole thing from
> the
> > > changelog.
> > >
> > > The Streams logs that would be really useful to capture are the
> lifecycle
> > > ones, like
> > >
> > > [2018-12-14 17:34:30,326] INFO stream-thread
> > > >
> > >
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > > > State transition from RUNNING to PARTITIONS_REVOKED
> > > > (org.apache.kafka.streams.processor.internals.StreamThread)
> > >
> > >
> > >
> > > [2018-12-14 17:34:30,326] INFO stream-client
> > > > [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980]
> > State
> > > > transition from RUNNING to REBALANCING
> > > > (org.apache.kafka.streams.KafkaStreams)
> > >
> > >
> > > Also, it would be helpful to see the assignment transitions in line
> with
> > > the state transitions. Examples:
> > >
> > > [2018-12-14 17:34:31,863] DEBUG stream-thread
> > > >
> > >
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > > > Adding assigned tasks as active: {0_2=[standbyTaskSource1-2],
> > > > 0_5=[standbyTaskSource1-5]}
> > > > (org.apache.kafka.streams.processor.internals.TaskManager)
> > >
> > >
> > >
> > > [2018-12-14 17:34:31,882] DEBUG stream-thread
> > > >
> > >
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > > > Adding assigned standby tasks {0_4=[standbyTaskSource1-4],
> > > > 0_1=[standbyTaskSource1-1]}
> > > > (org.apache.kafka.streams.processor.internals.TaskManager)
> > >
> > >
> > >
> > > [2018-12-14 17:34:31,885] INFO stream-thread
> > > >
> > >
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > > > partition assignment took 22 ms.
> > > >    current active tasks: [0_2, 0_5]
> > > >    current standby tasks: [0_1, 0_4]
> > > >    previous active tasks: []
> > > >  (org.apache.kafka.streams.processor.internals.StreamThread)
> > >
> > >
> > >
> > > I look forward to hearing back from you (either with more detailed logs
> > or
> > > just a clarification about how the given logs indicate data loss). A
> > report
> > > about potential data loss is very concerning.
> > >
> > > Thanks,
> > > -John
> > >
> > > On Sun, Dec 30, 2018 at 9:23 AM Nitay Kufert <ni...@ironsrc.com>
> > wrote:
> > >
> > > > Hey everybody,
> > > > We are running Kafka streams in production for the last year or so -
> we
> > > > currently using the latest version (2.1.0) and we suffered from data
> > lose
> > > > several times before.
> > > > The first time we noticed a data loss, we were able to trace it back
> to
> > > > Exception that we were getting in the code - which eventually mapped
> to
> > > an
> > > > open bug that the Kafka team is still working on. So the temporary
> > > solution
> > > > was to disable the feature that causes the Exception (in this case -
> it
> > > was
> > > > the "exactly_once" semantics) and move to "at_lease_once" semantics +
> > > piece
> > > > of code that handles duplications.
> > > > The 2nd time we noticed a data loss, we traced it back to some kind
> of
> > > > Exception caused by lack of memory. To make a long story short - we
> hit
> > > the
> > > > limit for open files on the machines (a lot of files are used by
> > > rocksDB) -
> > > > so increasing the RAM of the machines & increasing the number of
> > allowed
> > > > open files on the OS solved this problem.
> > > >
> > > > Now, we are facing data loss for the 3rd time - this time it seems to
> > > > happen when our Kafka stream instances switch (reproducible -
> happened
> > 2
> > > > separate times). let me explain:
> > > > We are using a 3rd party company called Spotinst - which basically
> > helps
> > > > you save costs by monitoring the Amazon spot market, and switching
> > > between
> > > > instances when they find a cheaper one.
> > > >
> > > > The question is, why would it cause data loss?
> > > > Those are logs I collected and put together in a single timeline,
> > > including
> > > > messages from Kafka stream instances (from Kibana), Spotinst (3rd
> party
> > > > company) & the data in the compacted topic where the data should have
> > > been
> > > > kept (basically its a compacted topic behind a reduce function - and
> it
> > > > seems like the aggregated data was lost and the function was
> invocated
> > as
> > > > if its the first time its aggregating anything).
> > > > What you are seeing is that Spotinst saw an increase in CPU - and
> > > initiated
> > > > an Upscale (2 instances), and shortly after it - 2 instances went
> down
> > > > (Downscale) as the load was over. In *RED* you can see the actual
> data
> > > loss
> > > > (as observed from the compacted topic)
> > > >
> > > > DATE TIME FACILITY INFO
> > > > 12/25/2018 5:17:03 Spotinst Instances Launched - Autoscaling: Policy
> > > Name:
> > > > Scaling Policy-Up, Threshold: 70.0, Value Observed: 70.0
> > > > 12/25/2018 5:22:34 Spotinst Got Signal INSTANCE_READY For Instance
> > > > I-instace1
> > > > 12/25/2018 5:22:54 instace2 The following subscribed topics are not
> > > > assigned to any members: [bosThresholds]
> > > > 12/25/2018 5:22:56 instace3 Updating StandbyTasks failed. Deleting
> > > > StandbyTasks stores to recreate from scratch.:
> > > > org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
> > > > Offsets out of range with no configured reset policy for partitions:
> > > > {bos-unique_input_message-changelog-1=7491727}
> > > > 12/25/2018 5:23:21 Spotinst instace4 - INSTANCE_READY
> > > > 12/25/2018 5:23:27 Compacted Topic cdr_44334 -> 1621.72
> > > > 12/25/2018 5:23:33 Spotinst instace4 - Successfully Registered To
> Load
> > > > Balancer
> > > > 12/25/2018 5:23:33 Spotinst instace1 - Successfully Registered To
> Load
> > > > Balancer
> > > > 12/25/2018 5:23:34 Compacted Topic cdr_44334-> 0.27
> > > > 12/25/2018 5:23:40 instace2 The following subscribed topics are not
> > > > assigned to any members: [bosThresholds]
> > > > 12/25/2018 5:23:46 instace3
> > > > org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
> > > > Offsets out of range with no configured reset policy for partitions:
> > > > {bos-unique_input_message-changelog-1=7491727}
> > > > 12/25/2018 5:27:05 Spotinst Instances Terminated - Autoscaling:
> Policy
> > > > Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 6.0
> > > > 12/25/2018 5:27:05 Spotinst instace2 - Was Successfully Deregistered
> > From
> > > > CLASSIC Load Balancer
> > > > 12/25/2018 5:27:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN For
> > > > Instance I-instace2
> > > > 12/25/2018 5:27:23 Spotinst Shutdown Script Completed. InstanceId:
> > > > I-instace2 (View Details)
> > > > 12/25/2018 5:27:27 instace4 The following subscribed topics are not
> > > > assigned to any members: [bosThresholds]
> > > > 12/25/2018 5:33:03 Spotinst Instances I-instace3 Was Successfully
> > > > Deregistered From CLASSIC Load Balancer
> > > > 12/25/2018 5:33:03 Spotinst Instances Terminated - Autoscaling:
> Policy
> > > > Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 25.0
> > > > 12/25/2018 5:33:23 Spotinst Shutdown Script Completed. InstanceId:
> > > > I-instace3 (View Details)
> > > > 12/25/2018 5:33:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN For
> > > > Instance I-instace3
> > > > 12/25/2018 5:33:28 instace4 The following subscribed topics are not
> > > > assigned to any members: [bosThresholds]
> > > > I can provide much more information but I feel like I already spammed
> > the
> > > > group as it is :/
> > > > Hope you can make sense of what I am writing and maybe shed some
> light
> > on
> > > > the possible reasons for this strange behavior.
> > > >
> > > > For now, as a temporary solution, we are moving to "on-demand"
> > instances
> > > > (which basically means that machines won't go up and down often), so
> I
> > > hope
> > > > it will solve our problems.
> > > >
> > > > Thanks
> > > > --
> > > > Nitay Kufert
> > > > Backend Team Leader
> > > > [image: ironSource] <http://www.ironsrc.com/>
> > > >
> > > > email nitay.k@ironsrc.com
> > > > mobile +972-54-5480021 <+972%2054-548-0021> <+972%2054-548-0021>
> > > > fax +972-77-5448273 <+972%2077-544-8273> <+972%2077-544-8273>
> > > > Derech Menachem Begin 121, Tel- Aviv
> > > > ironsrc.com <http://www.ironsrc.com/>
> > > > [image: linkedin] <https://www.linkedin.com/company/ironsource
> >[image:
> > > > twitter] <https://twitter.com/ironsource>[image: facebook]
> > > > <https://www.facebook.com/ironSource>[image: googleplus]
> > > > <https://plus.google.com/+ironsrc>
> > > > This email (including any attachments) is for the sole use of the
> > > intended
> > > > recipient and may contain confidential information which may be
> > protected
> > > > by legal privilege. If you are not the intended recipient, or the
> > > employee
> > > > or agent responsible for delivering it to the intended recipient, you
> > are
> > > > hereby notified that any use, dissemination, distribution or copying
> of
> > > > this communication and/or its content is strictly prohibited. If you
> > are
> > > > not the intended recipient, please immediately notify us by reply
> email
> > > or
> > > > by telephone, delete this email and destroy any copies. Thank you.
> > > >
> > >
> > --
> > Nitay Kufert
> > Backend Team Leader
> > [image: ironSource] <http://www.ironsrc.com/>
> >
> > email nitay.k@ironsrc.com
> > mobile +972-54-5480021 <+972%2054-548-0021>
> > fax +972-77-5448273 <+972%2077-544-8273>
> > Derech Menachem Begin 121, Tel- Aviv
> > ironsrc.com <http://www.ironsrc.com/>
> > [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
> > twitter] <https://twitter.com/ironsource>[image: facebook]
> > <https://www.facebook.com/ironSource>[image: googleplus]
> > <https://plus.google.com/+ironsrc>
> > This email (including any attachments) is for the sole use of the
> intended
> > recipient and may contain confidential information which may be protected
> > by legal privilege. If you are not the intended recipient, or the
> employee
> > or agent responsible for delivering it to the intended recipient, you are
> > hereby notified that any use, dissemination, distribution or copying of
> > this communication and/or its content is strictly prohibited. If you are
> > not the intended recipient, please immediately notify us by reply email
> or
> > by telephone, delete this email and destroy any copies. Thank you.
> >
>
-- 
Nitay Kufert
Backend Team Leader
[image: ironSource] <http://www.ironsrc.com/>

email nitay.k@ironsrc.com
mobile +972-54-5480021
fax +972-77-5448273
Derech Menachem Begin 121, Tel- Aviv
ironsrc.com <http://www.ironsrc.com/>
[image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
twitter] <https://twitter.com/ironsource>[image: facebook]
<https://www.facebook.com/ironSource>[image: googleplus]
<https://plus.google.com/+ironsrc>
This email (including any attachments) is for the sole use of the intended
recipient and may contain confidential information which may be protected
by legal privilege. If you are not the intended recipient, or the employee
or agent responsible for delivering it to the intended recipient, you are
hereby notified that any use, dissemination, distribution or copying of
this communication and/or its content is strictly prohibited. If you are
not the intended recipient, please immediately notify us by reply email or
by telephone, delete this email and destroy any copies. Thank you.

Re: Kafka Streams 2.1.0, 3rd time data lose investigation

Posted by John Roesler <jo...@confluent.io>.
Hi Nitay,

> I will provide extra logs if it will happen again (I really really hope it
> won't hehe :))

Yeah, I hear you. Reproducing errors in production is a real double-edged
sword!

Thanks for the explanation. It makes sense now.

This may be grasping at straws, but it seems like your frequent rebalances
may be exposing you to this recently reported bug:
https://issues.apache.org/jira/browse/KAFKA-7672

The reporter mentioned one thing that can help identify it, which is that
it prints a message saying that it's going to re-initialize the state
store, followed immediately by a transition to "running". Perhaps you can
check your Streams logs to see if you see anything similar.
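
If the INFO-level transition lines are hard to dig out of your logs, a state listener
registered before start() gives you the same transitions with your own timestamps. A
minimal sketch (nothing here is specific to your app):

import org.apache.kafka.streams.KafkaStreams;

KafkaStreams streams = new KafkaStreams(topology, props); // your topology and config

// Prints every client-level transition, e.g. RUNNING -> REBALANCING -> RUNNING, so a
// suspiciously quick flip back to RUNNING right after a store re-initialization stands out.
streams.setStateListener((newState, oldState) ->
        System.out.printf("%d streams state %s -> %s%n",
                          System.currentTimeMillis(), oldState, newState));

streams.start();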

Thanks,
-John

On Sat, Jan 5, 2019 at 10:48 AM Nitay Kufert <ni...@ironsrc.com> wrote:

> Hey John,
> Thanks for the response!
>
> I will provide extra logs if it will happen again (I really really hope it
> won't hehe :))
>
> Some clarification regarding the previous mail:
> The only thing that shows the data loss is the messages from the compacted
> topic which I consumed a couple of hours after the I noticed the data loss.
> This compacted topic is an output of my stream application (basically, I am
> using reduce on the same key to SUM the values, and pushing it to the
> compacted topic using ".to")
>
> The only correlation I have for those messages are the timestamps of the
> messages in the compacted topic.
> So I took logs from Spotinst & Kafka Stream instances around the same time
> and shared them here (I know correlation doesn't mean causation but that's
> the only thing I have :) )
>
> The messages showed an ever-increasing value for the specific key I was
> investigating, which was expected.
> The unexpected thing was that suddenly the value started the aggregation
> back at 0 for some reason.
>
> In an effort to understand what's going on, I added logs to the function I
> use for reducing the stream to try and log cases where this thing happens -
> but it didn't log anything.. which makes me think as if the reduce function
> "initialized" itself, meaning it acted as if it was the first message
> (nothing to aggregate the value with - so we just put the value)
>
> In the example I have shared, I have keys in the format cdr_<ID> with
> values which are BigDecimal numbers.
> I could have shared the thousands of messages I consumed from the topic
> before reaching the value 1621.72, It would have looked something like :
> cdr_44334 -> 1619.32
> cdr_44334 -> 1619.72
> cdr_44334 -> 1620.12
> cdr_44334 -> 1620.52
> cdr_44334 -> 1620.92
> cdr_44334 -> 1621.32
> cdr_44334 -> 1621.72
> cdr_44334 -> 0.27
> cdr_44334 -> 0.67
> cdr_44334 -> 1.07
>
> So basically, the only thing that shows the loss is the sudden decrease in
> value in a specific key (I had thousands of keys who lost their value - but
> many many more that didn't lose their value).
> (I am monitoring those changes using datadog, so I know which keys are
> affected and I can investigate them)
>
> Let me know if you need some more details or if you want me to escalate
> this situation to a jira
>
> Thanks again
>
>
>
> On Thu, Jan 3, 2019 at 11:36 PM John Roesler <jo...@confluent.io> wrote:
>
> > Hi Nitay,
> >
> > I'm sorry to hear of these troubles; it sounds frustrating.
> >
> > No worries about spamming the list, but it does sound like this might be
> > worth tracking as a bug report in Jira.
> > Obviously, we do not expect to lose data when instances come and go,
> > regardless of the frequency, and we do have tests in place to verify
> this.
> > Of course, you might be exercising something that our tests miss.
> >
> > Thanks for collating the logs. It really helps to understand what's going
> > on.
> >
> > Unfortunately, the red coloring didn't make it through the mailing list,
> so
> > I'm not sure which specific line you were referencing as demonstrating
> data
> > loss.
> >
> > Just in case you're concerned about the "Updating StandbyTasks failed"
> > warnings, they should be fine. It indicates that a thread was unable to
> > re-use a state store that it had previously been assigned in the past, so
> > instead it deletes the local data and recreates the whole thing from the
> > changelog.
> >
> > The Streams logs that would be really useful to capture are the lifecycle
> > ones, like
> >
> > [2018-12-14 17:34:30,326] INFO stream-thread
> > >
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > > State transition from RUNNING to PARTITIONS_REVOKED
> > > (org.apache.kafka.streams.processor.internals.StreamThread)
> >
> >
> >
> > [2018-12-14 17:34:30,326] INFO stream-client
> > > [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980]
> State
> > > transition from RUNNING to REBALANCING
> > > (org.apache.kafka.streams.KafkaStreams)
> >
> >
> > Also, it would be helpful to see the assignment transitions in line with
> > the state transitions. Examples:
> >
> > [2018-12-14 17:34:31,863] DEBUG stream-thread
> > >
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > > Adding assigned tasks as active: {0_2=[standbyTaskSource1-2],
> > > 0_5=[standbyTaskSource1-5]}
> > > (org.apache.kafka.streams.processor.internals.TaskManager)
> >
> >
> >
> > [2018-12-14 17:34:31,882] DEBUG stream-thread
> > >
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > > Adding assigned standby tasks {0_4=[standbyTaskSource1-4],
> > > 0_1=[standbyTaskSource1-1]}
> > > (org.apache.kafka.streams.processor.internals.TaskManager)
> >
> >
> >
> > [2018-12-14 17:34:31,885] INFO stream-thread
> > >
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > > partition assignment took 22 ms.
> > >    current active tasks: [0_2, 0_5]
> > >    current standby tasks: [0_1, 0_4]
> > >    previous active tasks: []
> > >  (org.apache.kafka.streams.processor.internals.StreamThread)
> >
> >
> >
> > I look forward to hearing back from you (either with more detailed logs
> or
> > just a clarification about how the given logs indicate data loss). A
> report
> > about potential data loss is very concerning.
> >
> > Thanks,
> > -John
> >
> > On Sun, Dec 30, 2018 at 9:23 AM Nitay Kufert <ni...@ironsrc.com>
> wrote:
> >
> > > Hey everybody,
> > > We are running Kafka streams in production for the last year or so - we
> > > currently using the latest version (2.1.0) and we suffered from data
> lose
> > > several times before.
> > > The first time we noticed a data loss, we were able to trace it back to
> > > Exception that we were getting in the code - which eventually mapped to
> > an
> > > open bug that the Kafka team is still working on. So the temporary
> > solution
> > > was to disable the feature that causes the Exception (in this case - it
> > was
> > > the "exactly_once" semantics) and move to "at_lease_once" semantics +
> > piece
> > > of code that handles duplications.
> > > The 2nd time we noticed a data loss, we traced it back to some kind of
> > > Exception caused by lack of memory. To make a long story short - we hit
> > the
> > > limit for open files on the machines (a lot of files are used by
> > rocksDB) -
> > > so increasing the RAM of the machines & increasing the number of
> allowed
> > > open files on the OS solved this problem.
> > >
> > > Now, we are facing data loss for the 3rd time - this time it seems to
> > > happen when our Kafka stream instances switch (reproducible - happened
> 2
> > > separate times). let me explain:
> > > We are using a 3rd party company called Spotinst - which basically
> helps
> > > you save costs by monitoring the Amazon spot market, and switching
> > between
> > > instances when they find a cheaper one.
> > >
> > > The question is, why would it cause data loss?
> > > Those are logs I collected and put together in a single timeline,
> > including
> > > messages from Kafka stream instances (from Kibana), Spotinst (3rd party
> > > company) & the data in the compacted topic where the data should have
> > been
> > > kept (basically its a compacted topic behind a reduce function - and it
> > > seems like the aggregated data was lost and the function was invocated
> as
> > > if its the first time its aggregating anything).
> > > What you are seeing is that Spotinst saw an increase in CPU - and
> > initiated
> > > an Upscale (2 instances), and shortly after it - 2 instances went down
> > > (Downscale) as the load was over. In *RED* you can see the actual data
> > loss
> > > (as observed from the compacted topic)
> > >
> > > DATE TIME FACILITY INFO
> > > 12/25/2018 5:17:03 Spotinst Instances Launched - Autoscaling: Policy
> > Name:
> > > Scaling Policy-Up, Threshold: 70.0, Value Observed: 70.0
> > > 12/25/2018 5:22:34 Spotinst Got Signal INSTANCE_READY For Instance
> > > I-instace1
> > > 12/25/2018 5:22:54 instace2 The following subscribed topics are not
> > > assigned to any members: [bosThresholds]
> > > 12/25/2018 5:22:56 instace3 Updating StandbyTasks failed. Deleting
> > > StandbyTasks stores to recreate from scratch.:
> > > org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
> > > Offsets out of range with no configured reset policy for partitions:
> > > {bos-unique_input_message-changelog-1=7491727}
> > > 12/25/2018 5:23:21 Spotinst instace4 - INSTANCE_READY
> > > 12/25/2018 5:23:27 Compacted Topic cdr_44334 -> 1621.72
> > > 12/25/2018 5:23:33 Spotinst instace4 - Successfully Registered To Load
> > > Balancer
> > > 12/25/2018 5:23:33 Spotinst instace1 - Successfully Registered To Load
> > > Balancer
> > > 12/25/2018 5:23:34 Compacted Topic cdr_44334-> 0.27
> > > 12/25/2018 5:23:40 instace2 The following subscribed topics are not
> > > assigned to any members: [bosThresholds]
> > > 12/25/2018 5:23:46 instace3
> > > org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
> > > Offsets out of range with no configured reset policy for partitions:
> > > {bos-unique_input_message-changelog-1=7491727}
> > > 12/25/2018 5:27:05 Spotinst Instances Terminated - Autoscaling: Policy
> > > Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 6.0
> > > 12/25/2018 5:27:05 Spotinst instace2 - Was Successfully Deregistered
> From
> > > CLASSIC Load Balancer
> > > 12/25/2018 5:27:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN For
> > > Instance I-instace2
> > > 12/25/2018 5:27:23 Spotinst Shutdown Script Completed. InstanceId:
> > > I-instace2 (View Details)
> > > 12/25/2018 5:27:27 instace4 The following subscribed topics are not
> > > assigned to any members: [bosThresholds]
> > > 12/25/2018 5:33:03 Spotinst Instances I-instace3 Was Successfully
> > > Deregistered From CLASSIC Load Balancer
> > > 12/25/2018 5:33:03 Spotinst Instances Terminated - Autoscaling: Policy
> > > Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 25.0
> > > 12/25/2018 5:33:23 Spotinst Shutdown Script Completed. InstanceId:
> > > I-instace3 (View Details)
> > > 12/25/2018 5:33:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN For
> > > Instance I-instace3
> > > 12/25/2018 5:33:28 instace4 The following subscribed topics are not
> > > assigned to any members: [bosThresholds]
> > > I can provide much more information but I feel like I already spammed
> the
> > > group as it is :/
> > > Hope you can make sense of what I am writing and maybe shed some light
> on
> > > the possible reasons for this strange behavior.
> > >
> > > For now, as a temporary solution, we are moving to "on-demand"
> instances
> > > (which basically means that machines won't go up and down often), so I
> > hope
> > > it will solve our problems.
> > >
> > > Thanks
> > > --
> > > Nitay Kufert
> > > Backend Team Leader
> > > [image: ironSource] <http://www.ironsrc.com/>
> > >
> > > email nitay.k@ironsrc.com
> > > mobile +972-54-5480021 <+972%2054-548-0021>
> > > fax +972-77-5448273 <+972%2077-544-8273>
> > > Derech Menachem Begin 121, Tel- Aviv
> > > ironsrc.com <http://www.ironsrc.com/>
> > > [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
> > > twitter] <https://twitter.com/ironsource>[image: facebook]
> > > <https://www.facebook.com/ironSource>[image: googleplus]
> > > <https://plus.google.com/+ironsrc>
> > > This email (including any attachments) is for the sole use of the
> > intended
> > > recipient and may contain confidential information which may be
> protected
> > > by legal privilege. If you are not the intended recipient, or the
> > employee
> > > or agent responsible for delivering it to the intended recipient, you
> are
> > > hereby notified that any use, dissemination, distribution or copying of
> > > this communication and/or its content is strictly prohibited. If you
> are
> > > not the intended recipient, please immediately notify us by reply email
> > or
> > > by telephone, delete this email and destroy any copies. Thank you.
> > >
> >
> --
> Nitay Kufert
> Backend Team Leader
> [image: ironSource] <http://www.ironsrc.com/>
>
> email nitay.k@ironsrc.com
> mobile +972-54-5480021
> fax +972-77-5448273
> Derech Menachem Begin 121, Tel- Aviv
> ironsrc.com <http://www.ironsrc.com/>
> [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
> twitter] <https://twitter.com/ironsource>[image: facebook]
> <https://www.facebook.com/ironSource>[image: googleplus]
> <https://plus.google.com/+ironsrc>
> This email (including any attachments) is for the sole use of the intended
> recipient and may contain confidential information which may be protected
> by legal privilege. If you are not the intended recipient, or the employee
> or agent responsible for delivering it to the intended recipient, you are
> hereby notified that any use, dissemination, distribution or copying of
> this communication and/or its content is strictly prohibited. If you are
> not the intended recipient, please immediately notify us by reply email or
> by telephone, delete this email and destroy any copies. Thank you.
>

Re: Kafka Streams 2.1.0, 3rd time data lose investigation

Posted by Nitay Kufert <ni...@ironsrc.com>.
Hey John,
Thanks for the response!

I will provide extra logs if it happens again (I really, really hope it
won't, hehe :))

Some clarification regarding the previous mail:
The only thing that shows the data loss is the messages from the compacted
topic, which I consumed a couple of hours after I noticed the data loss.
This compacted topic is an output of my stream application (basically, I am
using reduce on the same key to SUM the values, and pushing it to the
compacted topic using ".to")
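
To be concrete, the relevant part of the topology is essentially the following sketch
(topic names are placeholders rather than our real ones, and the BigDecimal serde is our
own, since Kafka Streams doesn't ship one):

import java.math.BigDecimal;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

Serde<BigDecimal> bigDecimalSerde = null; // stand-in for our custom BigDecimal serde

StreamsBuilder builder = new StreamsBuilder();
builder.stream("cdr-input-topic", Consumed.with(Serdes.String(), bigDecimalSerde))
       .groupByKey(Grouped.with(Serdes.String(), bigDecimalSerde))
       // Running sum per key; the aggregate lives in a changelog-backed state store.
       .reduce(BigDecimal::add,
               Materialized.<String, BigDecimal, KeyValueStore<Bytes, byte[]>>as("cdrev_store")
                           .withValueSerde(bigDecimalSerde))
       .toStream()
       // The compacted output topic where I later saw the values reset.
       .to("cdrev-compacted-topic", Produced.with(Serdes.String(), bigDecimalSerde));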

The only correlation I have for those messages is their timestamps in the
compacted topic.
So I took logs from Spotinst & the Kafka Streams instances from around the
same time and shared them here (I know correlation doesn't mean causation,
but that's the only thing I have :) )
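
For reference, this kind of dump can be done with a plain consumer along these lines
(broker address, group id and topic name are placeholders; values are read as raw bytes
because only the keys and record timestamps matter here):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "cdrev-inspection");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

try (KafkaConsumer<String, byte[]> consumer =
         new KafkaConsumer<>(props, new StringDeserializer(), new ByteArrayDeserializer())) {
    consumer.subscribe(Collections.singletonList("cdrev-compacted-topic"));
    while (true) {
        ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, byte[]> record : records) {
            // The record timestamp is what I line up against the Spotinst / instance events.
            System.out.printf("%d %s%n", record.timestamp(), record.key());
        }
    }
}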

The messages showed an ever-increasing value for the specific key I was
investigating, which was expected.
The unexpected thing was that the value suddenly restarted its aggregation
from 0 for some reason.

In an effort to understand what's going on, I added logs to the function I
use for reducing the stream, to try and log cases where this happens - but
it didn't log anything, which makes me think the reduce function
"initialized" itself, meaning it acted as if it were seeing the first message
for the key (nothing to aggregate the value with, so it just put the value).
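
The instrumentation I added was roughly along these lines (a simplified sketch, not the
real code, which goes through our own logger and value types):

import java.math.BigDecimal;
import org.apache.kafka.streams.kstream.Reducer;

// The reducer only runs when a previous aggregate already exists for the key. If the
// store has been wiped, the first record for a key is stored as-is and the reducer is
// never invoked - which would explain why this logging never fired when the values reset.
Reducer<BigDecimal> loggingSum = (currentAggregate, newValue) -> {
    BigDecimal result = currentAggregate.add(newValue);
    if (result.compareTo(currentAggregate) < 0) {
        System.err.printf("aggregate decreased: %s + %s -> %s%n",
                          currentAggregate, newValue, result);
    }
    return result;
};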

In the example I have shared, I have keys in the format cdr_<ID> with
values which are BigDecimal numbers.
I could have shared the thousands of messages I consumed from the topic
before reaching the value 1621.72; it would have looked something like:
cdr_44334 -> 1619.32
cdr_44334 -> 1619.72
cdr_44334 -> 1620.12
cdr_44334 -> 1620.52
cdr_44334 -> 1620.92
cdr_44334 -> 1621.32
cdr_44334 -> 1621.72
cdr_44334 -> 0.27
cdr_44334 -> 0.67
cdr_44334 -> 1.07

So basically, the only thing that shows the loss is the sudden decrease in
value for a specific key (I had thousands of keys that lost their value -
but many, many more that didn't).
(I am monitoring those changes using Datadog, so I know which keys are
affected and can investigate them.)

Let me know if you need more details or if you want me to escalate this to
a Jira ticket.

Thanks again



On Thu, Jan 3, 2019 at 11:36 PM John Roesler <jo...@confluent.io> wrote:

> Hi Nitay,
>
> I'm sorry to hear of these troubles; it sounds frustrating.
>
> No worries about spamming the list, but it does sound like this might be
> worth tracking as a bug report in Jira.
> Obviously, we do not expect to lose data when instances come and go,
> regardless of the frequency, and we do have tests in place to verify this.
> Of course, you might be exercising something that our tests miss.
>
> Thanks for collating the logs. It really helps to understand what's going
> on.
>
> Unfortunately, the red coloring didn't make it through the mailing list, so
> I'm not sure which specific line you were referencing as demonstrating data
> loss.
>
> Just in case you're concerned about the "Updating StandbyTasks failed"
> warnings, they should be fine. It indicates that a thread was unable to
> re-use a state store that it had previously been assigned in the past, so
> instead it deletes the local data and recreates the whole thing from the
> changelog.
>
> The Streams logs that would be really useful to capture are the lifecycle
> ones, like
>
> [2018-12-14 17:34:30,326] INFO stream-thread
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > State transition from RUNNING to PARTITIONS_REVOKED
> > (org.apache.kafka.streams.processor.internals.StreamThread)
>
>
>
> [2018-12-14 17:34:30,326] INFO stream-client
> > [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980] State
> > transition from RUNNING to REBALANCING
> > (org.apache.kafka.streams.KafkaStreams)
>
>
> Also, it would be helpful to see the assignment transitions in line with
> the state transitions. Examples:
>
> [2018-12-14 17:34:31,863] DEBUG stream-thread
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > Adding assigned tasks as active: {0_2=[standbyTaskSource1-2],
> > 0_5=[standbyTaskSource1-5]}
> > (org.apache.kafka.streams.processor.internals.TaskManager)
>
>
>
> [2018-12-14 17:34:31,882] DEBUG stream-thread
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > Adding assigned standby tasks {0_4=[standbyTaskSource1-4],
> > 0_1=[standbyTaskSource1-1]}
> > (org.apache.kafka.streams.processor.internals.TaskManager)
>
>
>
> [2018-12-14 17:34:31,885] INFO stream-thread
> >
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> > partition assignment took 22 ms.
> >    current active tasks: [0_2, 0_5]
> >    current standby tasks: [0_1, 0_4]
> >    previous active tasks: []
> >  (org.apache.kafka.streams.processor.internals.StreamThread)
>
>
>
> I look forward to hearing back from you (either with more detailed logs or
> just a clarification about how the given logs indicate data loss). A report
> about potential data loss is very concerning.
>
> Thanks,
> -John
>
> On Sun, Dec 30, 2018 at 9:23 AM Nitay Kufert <ni...@ironsrc.com> wrote:
>
> > Hey everybody,
> > We are running Kafka streams in production for the last year or so - we
> > currently using the latest version (2.1.0) and we suffered from data lose
> > several times before.
> > The first time we noticed a data loss, we were able to trace it back to
> > Exception that we were getting in the code - which eventually mapped to
> an
> > open bug that the Kafka team is still working on. So the temporary
> solution
> > was to disable the feature that causes the Exception (in this case - it
> was
> > the "exactly_once" semantics) and move to "at_lease_once" semantics +
> piece
> > of code that handles duplications.
> > The 2nd time we noticed a data loss, we traced it back to some kind of
> > Exception caused by lack of memory. To make a long story short - we hit
> the
> > limit for open files on the machines (a lot of files are used by
> rocksDB) -
> > so increasing the RAM of the machines & increasing the number of allowed
> > open files on the OS solved this problem.
> >
> > Now, we are facing data loss for the 3rd time - this time it seems to
> > happen when our Kafka stream instances switch (reproducible - happened 2
> > separate times). let me explain:
> > We are using a 3rd party company called Spotinst - which basically helps
> > you save costs by monitoring the Amazon spot market, and switching
> between
> > instances when they find a cheaper one.
> >
> > The question is, why would it cause data loss?
> > Those are logs I collected and put together in a single timeline,
> including
> > messages from Kafka stream instances (from Kibana), Spotinst (3rd party
> > company) & the data in the compacted topic where the data should have
> been
> > kept (basically its a compacted topic behind a reduce function - and it
> > seems like the aggregated data was lost and the function was invocated as
> > if its the first time its aggregating anything).
> > What you are seeing is that Spotinst saw an increase in CPU - and
> initiated
> > an Upscale (2 instances), and shortly after it - 2 instances went down
> > (Downscale) as the load was over. In *RED* you can see the actual data
> loss
> > (as observed from the compacted topic)
> >
> > DATE TIME FACILITY INFO
> > 12/25/2018 5:17:03 Spotinst Instances Launched - Autoscaling: Policy
> Name:
> > Scaling Policy-Up, Threshold: 70.0, Value Observed: 70.0
> > 12/25/2018 5:22:34 Spotinst Got Signal INSTANCE_READY For Instance
> > I-instace1
> > 12/25/2018 5:22:54 instace2 The following subscribed topics are not
> > assigned to any members: [bosThresholds]
> > 12/25/2018 5:22:56 instace3 Updating StandbyTasks failed. Deleting
> > StandbyTasks stores to recreate from scratch.:
> > org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
> > Offsets out of range with no configured reset policy for partitions:
> > {bos-unique_input_message-changelog-1=7491727}
> > 12/25/2018 5:23:21 Spotinst instace4 - INSTANCE_READY
> > 12/25/2018 5:23:27 Compacted Topic cdr_44334 -> 1621.72
> > 12/25/2018 5:23:33 Spotinst instace4 - Successfully Registered To Load
> > Balancer
> > 12/25/2018 5:23:33 Spotinst instace1 - Successfully Registered To Load
> > Balancer
> > 12/25/2018 5:23:34 Compacted Topic cdr_44334-> 0.27
> > 12/25/2018 5:23:40 instace2 The following subscribed topics are not
> > assigned to any members: [bosThresholds]
> > 12/25/2018 5:23:46 instace3
> > org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
> > Offsets out of range with no configured reset policy for partitions:
> > {bos-unique_input_message-changelog-1=7491727}
> > 12/25/2018 5:27:05 Spotinst Instances Terminated - Autoscaling: Policy
> > Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 6.0
> > 12/25/2018 5:27:05 Spotinst instace2 - Was Successfully Deregistered From
> > CLASSIC Load Balancer
> > 12/25/2018 5:27:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN For
> > Instance I-instace2
> > 12/25/2018 5:27:23 Spotinst Shutdown Script Completed. InstanceId:
> > I-instace2 (View Details)
> > 12/25/2018 5:27:27 instace4 The following subscribed topics are not
> > assigned to any members: [bosThresholds]
> > 12/25/2018 5:33:03 Spotinst Instances I-instace3 Was Successfully
> > Deregistered From CLASSIC Load Balancer
> > 12/25/2018 5:33:03 Spotinst Instances Terminated - Autoscaling: Policy
> > Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 25.0
> > 12/25/2018 5:33:23 Spotinst Shutdown Script Completed. InstanceId:
> > I-instace3 (View Details)
> > 12/25/2018 5:33:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN For
> > Instance I-instace3
> > 12/25/2018 5:33:28 instace4 The following subscribed topics are not
> > assigned to any members: [bosThresholds]
> > I can provide much more information but I feel like I already spammed the
> > group as it is :/
> > Hope you can make sense of what I am writing and maybe shed some light on
> > the possible reasons for this strange behavior.
> >
> > For now, as a temporary solution, we are moving to "on-demand" instances
> > (which basically means that machines won't go up and down often), so I
> hope
> > it will solve our problems.
> >
> > Thanks
> > --
> > Nitay Kufert
> > Backend Team Leader
> > [image: ironSource] <http://www.ironsrc.com/>
> >
> > email nitay.k@ironsrc.com
> > mobile +972-54-5480021 <+972%2054-548-0021>
> > fax +972-77-5448273 <+972%2077-544-8273>
> > Derech Menachem Begin 121, Tel- Aviv
> > ironsrc.com <http://www.ironsrc.com/>
> > [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
> > twitter] <https://twitter.com/ironsource>[image: facebook]
> > <https://www.facebook.com/ironSource>[image: googleplus]
> > <https://plus.google.com/+ironsrc>
> > This email (including any attachments) is for the sole use of the
> intended
> > recipient and may contain confidential information which may be protected
> > by legal privilege. If you are not the intended recipient, or the
> employee
> > or agent responsible for delivering it to the intended recipient, you are
> > hereby notified that any use, dissemination, distribution or copying of
> > this communication and/or its content is strictly prohibited. If you are
> > not the intended recipient, please immediately notify us by reply email
> or
> > by telephone, delete this email and destroy any copies. Thank you.
> >
>
-- 
Nitay Kufert
Backend Team Leader
[image: ironSource] <http://www.ironsrc.com/>

email nitay.k@ironsrc.com
mobile +972-54-5480021
fax +972-77-5448273
Derech Menachem Begin 121, Tel- Aviv
ironsrc.com <http://www.ironsrc.com/>
[image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
twitter] <https://twitter.com/ironsource>[image: facebook]
<https://www.facebook.com/ironSource>[image: googleplus]
<https://plus.google.com/+ironsrc>
This email (including any attachments) is for the sole use of the intended
recipient and may contain confidential information which may be protected
by legal privilege. If you are not the intended recipient, or the employee
or agent responsible for delivering it to the intended recipient, you are
hereby notified that any use, dissemination, distribution or copying of
this communication and/or its content is strictly prohibited. If you are
not the intended recipient, please immediately notify us by reply email or
by telephone, delete this email and destroy any copies. Thank you.