Posted to issues@beam.apache.org by "Alexey Romanenko (Jira)" <ji...@apache.org> on 2019/11/12 12:57:00 UTC

[jira] [Resolved] (BEAM-8207) KafkaIOITs generate different hashes each run, sometimes dropping records

     [ https://issues.apache.org/jira/browse/BEAM-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Romanenko resolved BEAM-8207.
------------------------------------
    Fix Version/s: Not applicable
       Resolution: Not A Problem

> KafkaIOITs generate different hashes each run, sometimes dropping records
> -------------------------------------------------------------------------
>
>                 Key: BEAM-8207
>                 URL: https://issues.apache.org/jira/browse/BEAM-8207
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-kafka, testing
>            Reporter: Michal Walenia
>            Priority: Major
>             Fix For: Not applicable
>
>
> While adapting Java's KafkaIOIT to work with a large dataset generated by a SyntheticSource, I encountered a problem. I want to push 100M records through a Kafka topic, verify data correctness, and at the same time check the performance of KafkaIO.Write and KafkaIO.Read.
>  
> To perform the tests I'm using a Kafka cluster on Kubernetes from the Beam repo ([here|https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster]).
>  
> The expected flow is that first the records are generated deterministically (using hashes of list positions as Random seeds) and then written to Kafka - this concludes the write pipeline.
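> To make this concrete, here is a minimal sketch of the write side. It is a simplification, not the actual KafkaIOIT code: the bootstrap address, topic name, and per-element seeding scheme are placeholders.
> {code:java}
> import java.util.Random;
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.GenerateSequence;
> import org.apache.beam.sdk.io.kafka.KafkaIO;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.MapElements;
> import org.apache.beam.sdk.values.KV;
> import org.apache.beam.sdk.values.TypeDescriptors;
> import org.apache.kafka.common.serialization.LongSerializer;
> import org.apache.kafka.common.serialization.StringSerializer;
>
> public class WriteToKafkaSketch {
>   public static void main(String[] args) {
>     Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>
>     p.apply("Generate 100M indices", GenerateSequence.from(0).to(100_000_000L))
>         // Seed a Random from each element's position so every run
>         // produces identical payloads (deterministic generation).
>         .apply("Create records", MapElements
>             .into(TypeDescriptors.kvs(TypeDescriptors.longs(), TypeDescriptors.strings()))
>             .via(i -> KV.of(i, Long.toString(new Random(Long.hashCode(i)).nextLong()))))
>         .apply("Write to Kafka", KafkaIO.<Long, String>write()
>             .withBootstrapServers("kafka:9092")  // placeholder address
>             .withTopic("beam-test")              // placeholder topic
>             .withKeySerializer(LongSerializer.class)
>             .withValueSerializer(StringSerializer.class));
>
>     p.run().waitUntilFinish();
>   }
> }
> {code}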
> As for reading and correctness checking: first the data is read from the topic, then, after being decoded into String representations, a hashcode of the whole PCollection is calculated (for details, see KafkaIOIT.java).
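> A corresponding sketch of the read-and-verify side, assuming the HashingFn combiner from beam-sdks-java-io-common; the connection settings and expected hash are placeholders:
> {code:java}
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.common.HashingFn;
> import org.apache.beam.sdk.io.kafka.KafkaIO;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.testing.PAssert;
> import org.apache.beam.sdk.transforms.Combine;
> import org.apache.beam.sdk.transforms.Values;
> import org.apache.beam.sdk.values.PCollection;
> import org.apache.kafka.common.serialization.LongDeserializer;
> import org.apache.kafka.common.serialization.StringDeserializer;
>
> public class ReadFromKafkaSketch {
>   public static void main(String[] args) {
>     Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>
>     PCollection<String> values = p
>         .apply("Read from Kafka", KafkaIO.<Long, String>read()
>             .withBootstrapServers("kafka:9092")  // placeholder address
>             .withTopic("beam-test")              // placeholder topic
>             .withKeyDeserializer(LongDeserializer.class)
>             .withValueDeserializer(StringDeserializer.class)
>             .withMaxNumRecords(100_000_000L)     // bound the otherwise unbounded read
>             .withoutMetadata())
>         .apply("Extract values", Values.create());
>
>     // Combine the whole PCollection into a single hash and compare it with
>     // the hash recorded for the deterministically generated input.
>     PCollection<String> hash = values
>         .apply("Calculate hashcode", Combine.globally(new HashingFn()).withoutDefaults());
>     PAssert.thatSingleton(hash).isEqualTo("expected-hash-placeholder");
>
>     p.run().waitUntilFinish();
>   }
> }
> {code}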
>  
> During the testing I ran into several problems:
> 1. Even when all the records are read from the Kafka topic, the hash is different on each run.
> 2. Sometimes not all the records are read, and the Dataflow job waits for the input indefinitely, occasionally throwing exceptions.
>  
> I believe there are two possible causes of this behavior:
>  
> either there is something wrong with the Kafka cluster configuration,
> or KafkaIO behaves erratically on high data volumes, duplicating and/or dropping records.
> The second option seems troubling, and I would be grateful for help with the first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)