Posted to issues@flink.apache.org by StephanEwen <gi...@git.apache.org> on 2017/07/11 20:29:38 UTC

[GitHub] flink issue #4301: (release-1.3) [FLINK-7143] [kafka] Fix indeterminate part...

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4301
  
    I think that would fix the bug. There are two things I would like to improve, though:
    
      1. Relying on `hashCode()` makes very implicit assumptions about the behavior of the hash code implementation, and it does not document how critical the resulting `int` value is for us. For example, the contract of `Object.hashCode()` only requires the value to be stable within a single execution of an application; it may vary between processes. Our hash code implementation happens to be stable currently, as long as the JDK does not change the implementation of the `String` hash code method (which they could in theory do in any minor release, although they have not done that in a while).
    
      2. It is crucial that the distribution of partitions is uniform. That is a bit harder to guarantee when all sources pick up their own set of topics. At the least, the partitions within a topic should be distributed uniformly. For example, the topic defines "where to start" in the parallel subtasks, and the partitions then go "round robin".
    Well, as it happens, this is actually how the hash code function behaves, but again, it looks like it behaves that way "coincidentally", rather than because we have a strict contract for that behavior. For example, changing the hashCode from `31 * topic + partition` to `31 * partition + topic` can result in a non-uniform distribution, but is an equally valid hashCode.
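
    To make the difference between the two hash orders concrete, here is a small sketch. The class and method names are hypothetical, the topic hash is a fixed small number rather than a real `String` hash, and the subtask is assumed to be picked as `floorMod(hash, parallelism)`. A parallelism of 31 is chosen on purpose, because that is where the second variant degenerates most visibly.

```java
import java.util.Arrays;

// Hypothetical demo class for illustration only; not Flink code.
final class HashOrderDemo {

    // Same shape as `31 * topic + partition`.
    static int topicFirst(int topicHash, int partition) {
        return 31 * topicHash + partition;
    }

    // The "equally valid" variant `31 * partition + topic`.
    static int partitionFirst(int topicHash, int partition) {
        return 31 * partition + topicHash;
    }

    // Counts how many partitions of one topic land on each subtask, assuming
    // the subtask index is floorMod(hash, parallelism).
    static int[] countPerSubtask(boolean useTopicFirst, int topicHash,
                                 int numPartitions, int parallelism) {
        int[] counts = new int[parallelism];
        for (int p = 0; p < numPartitions; p++) {
            int hash = useTopicFirst
                    ? topicFirst(topicHash, p)
                    : partitionFirst(topicHash, p);
            counts[Math.floorMod(hash, parallelism)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int topicHash = 7;     // assumed stand-in for the topic's hash
        int parallelism = 31;  // assumed; shares a factor with the multiplier 31
        System.out.println(Arrays.toString(
                countPerSubtask(true, topicHash, 31, parallelism)));
        System.out.println(Arrays.toString(
                countPerSubtask(false, topicHash, 31, parallelism)));
    }
}
```

    With these assumed values, the first order places consecutive partitions on consecutive subtasks, while the second puts every partition of the topic on the same subtask, because `31 * partition` is always a multiple of the parallelism.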
    
    I would suggest having a function `int assignmentIndex()` or so, for which we define the above contract. We should also have tests that verify this distributes the partitions within a single topic uniformly.
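
    A rough sketch of what such a function and its test helper could look like. Everything here is an assumption for illustration: the class name, the `parallelism` parameter, and the helper methods are hypothetical and not taken from the Flink code base. The topic hash is spelled out explicitly so that the contract does not depend on `String.hashCode()`.

```java
// Hypothetical sketch; not the actual Flink class.
final class TopicPartitionSketch {
    private final String topic;
    private final int partition; // assumed to be non-negative

    TopicPartitionSketch(String topic, int partition) {
        this.topic = topic;
        this.partition = partition;
    }

    // Contract, part 1: the topic alone decides "where to start".
    // The hash is computed explicitly here, so its definition is part of
    // this contract rather than an implementation detail of the JDK.
    static int startIndex(String topic, int parallelism) {
        int h = 0;
        for (int i = 0; i < topic.length(); i++) {
            h = 31 * h + topic.charAt(i);
        }
        return Math.floorMod(h, parallelism);
    }

    // Contract, part 2: partitions of a topic go round robin over the
    // subtasks, beginning at the topic's start index.
    int assignmentIndex(int parallelism) {
        return (startIndex(topic, parallelism) + partition % parallelism) % parallelism;
    }

    // Test helper: number of partitions of one topic per subtask.
    static int[] countPerSubtask(String topic, int numPartitions, int parallelism) {
        int[] counts = new int[parallelism];
        for (int p = 0; p < numPartitions; p++) {
            counts[new TopicPartitionSketch(topic, p).assignmentIndex(parallelism)]++;
        }
        return counts;
    }
}
```

    A test for the contract could then check that, for any single topic, the largest and smallest per-subtask counts differ by at most one.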

