Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 19:19:26 UTC

[GitHub] [beam] kennknowles opened a new issue, #18566: Caveats with KafkaIO exactly once support.

kennknowles opened a new issue, #18566:
URL: https://github.com/apache/beam/issues/18566

   BEAM-2720 adds support for exactly-once semantics in the KafkaIO sink. It is tested with Dataflow and seems to work well. It does come with a few caveats we should address over time:
   
     * The implementation requires specific durability/checkpoint semantics across a GroupByKey transform.
          ** It requires a checkpoint across the GBK. Not all runners support this; in particular, Flink's distributed checkpointing does not provide it.
          ** The implementation includes a runtime check for compatibility.
     * It requires stateful DoFn support. Not all runners support this, and even those that do often come with their own caveats. Stateful DoFn is part of the core Beam model, and over time it should be widely supported across runners.
     * User records go through extra shuffles: the implementation adds two extra shuffles in Dataflow. Enhancements to the Beam API might reduce the number of shuffles.
     * It requires the user to specify a 'number of shards', which determines sink parallelism. The shard ids are also used to store some state in topic metadata on the Kafka servers. If the number of shards is larger than the number of partitions of the output topic, the behavior is not documented, though tests seem to work fine: I am able to store metadata for 100 shards even though the topic has just 10 partitions. We should probably file a jira for Kafka. Alternatively, we could limit the number of shards to be no more than the number of partitions (not required yet).
     * The metadata mentioned above is kept for only 24 hours by default. That is, if a pipeline does not write anything for a day, or is down for a day, it could lose crucial state stored with Kafka. An admin can configure this to be larger on the Kafka servers, but there is no way for a client to increase it for a specific topic. Note that Kafka Streams faces the same issue.
          ** I asked about both of these Kafka issues on the user list: https://www.mail-archive.com/users@kafka.apache.org/msg27624.html
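   The shards-vs-partitions caveat above can be illustrated with a small sketch. This is plain Java, not Beam's actual implementation; the round-robin assignment is an assumption used only to show how over-provisioned shard ids fold onto a smaller topic (e.g. 100 shards onto 10 partitions, so each partition carries state for 10 shards):

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;
   import java.util.TreeMap;

   public class ShardPartitionSketch {

       // Hypothetical round-robin assignment of sink shard ids to topic partitions,
       // for illustration only (KafkaIO's real mapping may differ).
       static Map<Integer, List<Integer>> assignShards(int numShards, int numPartitions) {
           Map<Integer, List<Integer>> byPartition = new TreeMap<>();
           for (int shard = 0; shard < numShards; shard++) {
               byPartition
                   .computeIfAbsent(shard % numPartitions, p -> new ArrayList<>())
                   .add(shard);
           }
           return byPartition;
       }

       public static void main(String[] args) {
           // 100 sink shards, but the output topic has only 10 partitions:
           Map<Integer, List<Integer>> assignment = assignShards(100, 10);
           System.out.println(assignment.size());        // 10 partitions in use
           System.out.println(assignment.get(0).size()); // 10 shards share partition 0
       }
   }
   ```

   Under this sketch, metadata for all 100 shards still has somewhere to live, which matches the observation above that the over-provisioned case "seems to work fine" even though it is undocumented.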
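   For the retention caveat, the broker-side knob is (assuming the sink state is stored in committed-offset metadata, which matches the 24-hour default) `offsets.retention.minutes`. A hedged broker-config sketch:

   ```
   # server.properties (Kafka broker) -- sketch, assuming the sink's shard state
   # lives in committed-offset metadata governed by offsets.retention.minutes
   # (default 1440 minutes = 24 hours on older brokers).
   # Retain committed offsets and their metadata for 7 days instead of 1:
   offsets.retention.minutes=10080
   ```

   As noted above, this must be set by the Kafka admin on the brokers; there is no per-topic, client-side override.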
   
   
   
   Imported from Jira [BEAM-3025](https://issues.apache.org/jira/browse/BEAM-3025). Original Jira may contain additional context.
   Reported by: rangadi.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org