Posted to issues@beam.apache.org by "Alexey Romanenko (Jira)" <ji...@apache.org> on 2019/11/12 12:57:00 UTC
[jira] [Resolved] (BEAM-8207) KafkaIOITs generate different hashes each run, sometimes dropping records
[ https://issues.apache.org/jira/browse/BEAM-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Romanenko resolved BEAM-8207.
------------------------------------
Fix Version/s: Not applicable
Resolution: Not A Problem
> KafkaIOITs generate different hashes each run, sometimes dropping records
> -------------------------------------------------------------------------
>
> Key: BEAM-8207
> URL: https://issues.apache.org/jira/browse/BEAM-8207
> Project: Beam
> Issue Type: Bug
> Components: io-java-kafka, testing
> Reporter: Michal Walenia
> Priority: Major
> Fix For: Not applicable
>
>
> While working to adapt Java's KafkaIOIT to work with a large dataset generated by a SyntheticSource I encountered a problem. I want to push 100M records through a Kafka topic, verify data correctness and at the same time check the performance of KafkaIO.Write and KafkaIO.Read.
>
> To perform the tests I'm using a Kafka cluster on Kubernetes from the Beam repo ([here|https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster]).
>
> The expected flow is as follows: first, the records are generated deterministically (using hashes of list positions as Random seeds); next, they are written to Kafka, which concludes the write pipeline.
> As for reading and correctness checking: first, the data is read from the topic and decoded into String representations, then a hashcode of the whole PCollection is calculated (for details, see KafkaIOIT.java).
>
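The deterministic generation described above can be sketched in plain Java (a minimal illustration only; the class and method names are hypothetical, not the actual SyntheticSource code):

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of the deterministic generation described above:
// each record's payload is derived from a Random seeded with the hash of
// its position, so repeated runs produce byte-identical data.
public class DeterministicRecords {
    static byte[] recordAt(long position, int size) {
        // Seed from the hash of the list position, as the issue describes.
        Random random = new Random(Long.hashCode(position));
        byte[] payload = new byte[size];
        random.nextBytes(payload);
        return payload;
    }

    public static void main(String[] args) {
        // The same position always yields the same payload.
        byte[] a = recordAt(42L, 8);
        byte[] b = recordAt(42L, 8);
        System.out.println(Arrays.equals(a, b)); // prints "true"
    }
}
```

Since `java.util.Random` is fully determined by its seed, any run of the write pipeline should produce the same records for the same positions.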
> During the testing I ran into several problems:
> 1. When all the records are read from the Kafka topic, the hash is different each time.
> 2. Sometimes not all the records are read and the Dataflow task waits for the input indefinitely, occasionally throwing exceptions.
>
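Note that problem 1 can arise purely from how the hashcode is computed: a PCollection has no defined element order, so any order-sensitive hash (e.g. hashing a concatenation of elements) can differ between runs even when the data is identical. A minimal order-insensitive alternative (plain Java, names hypothetical) combines per-element hashes commutatively so the result is independent of read order:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: an order-insensitive "collection hash".
// Summing per-element hashes is commutative, so shuffled reads of the
// same records produce the same value; concatenation-based hashing would not.
public class UnorderedHash {
    static long hashAll(List<String> records) {
        long sum = 0;
        for (String r : records) {
            sum += r.hashCode(); // commutative combine: element order does not matter
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("r1", "r2", "r3");
        List<String> b = Arrays.asList("r3", "r1", "r2"); // same records, shuffled
        System.out.println(hashAll(a) == hashAll(b)); // prints "true"
    }
}
```

This would make the per-run hash stable regardless of Kafka consumption order, though it does not explain problem 2 (records not being read at all).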
> I believe there are two possible causes of this behavior:
>
> either there is something wrong with the Kafka cluster configuration,
> or KafkaIO behaves erratically at high data volumes, duplicating and/or dropping records.
> The second option seems troubling, and I would be grateful for help with the first.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)