Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/05/28 06:56:07 UTC

[GitHub] [incubator-druid] c0de0ff opened a new issue #7776: kafka indexing service should always use earliest offset for newly discovered topic-partitions instead of useEarliestOffset config

URL: https://github.com/apache/incubator-druid/issues/7776
 
 
   ### Description
   
   The Kafka indexing service currently falls back to the `useEarliestOffset` config whenever it can't find any offset data for a topic-partition. This happens when the supervisor runs for the first time or when a new partition is added to the topic in Kafka. The config is also used when offsets are reset (if `resetOffsetAutomatically` is set to true).
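
The fallback described above can be modeled roughly as follows. This is a simplified sketch, not Druid's actual code: the class `CurrentOffsetSelection`, the method `startingOffset`, and the map of stored offsets are hypothetical names introduced only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the behavior described in this issue; not Druid's real classes.
public class CurrentOffsetSelection {

    // Returns the stored offset for the partition if one exists. Otherwise the
    // useEarliestOffset flag decides between the earliest and latest offsets,
    // regardless of why the stored offset is missing.
    static long startingOffset(Map<Integer, Long> storedOffsets, int partition,
                               long earliest, long latest, boolean useEarliestOffset) {
        Long stored = storedOffsets.get(partition);
        if (stored != null) {
            return stored;
        }
        return useEarliestOffset ? earliest : latest;
    }

    public static void main(String[] args) {
        Map<Integer, Long> stored = new HashMap<>();
        stored.put(0, 120L);
        // Partition 0 has a stored offset, so the flag is ignored.
        System.out.println(startingOffset(stored, 0, 0L, 500L, false));
        // Partition 1 has no stored offset, so the flag picks latest here.
        System.out.println(startingOffset(stored, 1, 0L, 500L, false));
    }
}
```

In this model a newly added partition is indistinguishable from a first run, so with the flag set to false it starts at the latest offset.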
   
   In two of the three scenarios above (first run and offset reset), it makes sense to honor the `useEarliestOffset` config. However, the indexing service should probably not apply this config to newly discovered partitions: if `useEarliestOffset` is set to false, records written to a new partition before the supervisor discovers it may be skipped, resulting in data loss. In production environments with large Kafka clusters and many long-running supervisors, adding partitions to topics is a common occurrence, so this config must always remain true to avoid data loss.
   
   A typical use case in large Kafka clusters is to start a new supervisor from the latest offset and then keep consuming without any data loss (exactly once). To achieve this today, we have to start the supervisor with `useEarliestOffset` set to false, wait for it to start running, and then set the config back to true to avoid data loss in new partitions. A user may also want to reset to the latest offsets manually using the reset API; in that case too, they need to remember to set the config back to true, which is error-prone.
   
   One solution might be to ignore the config when getting offsets for new partitions (always use earliest). However, I am not sure how we can differentiate between the two events, "new partitions added" and "supervisor first run".
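
One possible way to tell the two events apart is sketched below. It rests on an assumption that is not confirmed anywhere in this issue: that a first run (or a reset) leaves the supervisor with no stored offsets for any partition, while a newly added partition shows up next to partitions that already have stored offsets. The class `ProposedOffsetSelection` and the method `startingOffset` are hypothetical names, not Druid APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed rule; not Druid's real classes.
public class ProposedOffsetSelection {

    static long startingOffset(Map<Integer, Long> storedOffsets, int partition,
                               long earliest, long latest, boolean useEarliestOffset) {
        Long stored = storedOffsets.get(partition);
        if (stored != null) {
            return stored;
        }
        // Assumption: no stored offsets at all means first run or reset,
        // so the user's config is honored.
        if (storedOffsets.isEmpty()) {
            return useEarliestOffset ? earliest : latest;
        }
        // Other partitions already have offsets, so this partition is treated
        // as newly discovered and always starts from the earliest offset.
        return earliest;
    }

    public static void main(String[] args) {
        Map<Integer, Long> stored = new HashMap<>();
        // First run with the flag set to false: the config is honored.
        System.out.println(startingOffset(stored, 0, 0L, 500L, false));
        stored.put(0, 500L);
        // Partition 1 appears later: earliest is used even though the flag is false.
        System.out.println(startingOffset(stored, 1, 0L, 40L, false));
    }
}
```

The heuristic would misclassify a partition if offsets are persisted for only some partitions during the very first run, so it is a starting point for discussion rather than a complete answer.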
   
   ### Motivation
   
   - Currently, to avoid data loss from new partitions, we must always keep `useEarliestOffset` set to true, which forces us to change the config back and forth manually whenever we want a different option for first start or reset.
