Posted to commits@samza.apache.org by "Bharath Kumarasubramanian (Jira)" <ji...@apache.org> on 2020/05/27 03:40:00 UTC

[jira] [Resolved] (SAMZA-2502) Byte array keys be partitioned based on array contents in InMemorySystemProducer

     [ https://issues.apache.org/jira/browse/SAMZA-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharath Kumarasubramanian resolved SAMZA-2502.
----------------------------------------------
    Fix Version/s: 1.5
       Resolution: Fixed

> Byte array keys be partitioned based on array contents in InMemorySystemProducer
> --------------------------------------------------------------------------------
>
>                 Key: SAMZA-2502
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2502
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Yixing Zhang
>            Assignee: Yixing Zhang
>            Priority: Major
>             Fix For: 1.5
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> InMemorySystemProducer uses the hashCode of the partition key to decide which partition a message goes to. This works well when the key is an object whose hashCode method can be overridden. But when the partition key is serialized as a byte[], the message can go to any partition. A Java array inherits Object.hashCode(), which is based on object identity rather than on the array's contents. Therefore, even though two messages may have the same key, they can be sent to different partitions once their keys are serialized into separate byte[] instances, whose hash codes are effectively arbitrary.
>  
> We want to be able to partition messages based on the contents of the partition keys. An easy fix would be: when the key is a byte array, calculate the hash code with Arrays.hashCode(byte[] input), which derives the hash code from the array's contents rather than its identity.
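
A minimal sketch of the proposed fix, for illustration only. The class name PartitionKeySketch and the helper partitionFor are hypothetical and are not taken from the Samza code base; the sketch also assumes a non-null key and uses Math.floorMod to map the hash onto a partition index, which may differ from what InMemorySystemProducer actually does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartitionKeySketch {

    // Hypothetical helper (not the actual Samza implementation): picks a
    // partition index for a non-null key, hashing byte[] keys by content.
    static int partitionFor(Object key, int partitionCount) {
        int hash = (key instanceof byte[])
                ? Arrays.hashCode((byte[]) key) // content-based hash
                : key.hashCode();               // the object's own hashCode
        // floorMod keeps the result in [0, partitionCount) even when the
        // hash is negative.
        return Math.floorMod(hash, partitionCount);
    }

    public static void main(String[] args) {
        // Two distinct array instances holding the same serialized key.
        byte[] first = "member-42".getBytes(StandardCharsets.UTF_8);
        byte[] second = "member-42".getBytes(StandardCharsets.UTF_8);

        // Identity-based hashes usually differ between the two instances.
        System.out.println(first.hashCode() == second.hashCode());

        // Content-based hashes always agree, so both keys map to the
        // same partition.
        System.out.println(Arrays.hashCode(first) == Arrays.hashCode(second)); // true
        System.out.println(partitionFor(first, 8) == partitionFor(second, 8)); // true
    }
}
```

With this approach, equal keys land on the same partition regardless of how many times they are serialized, while non-array keys keep their existing behavior.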



--
This message was sent by Atlassian Jira
(v8.3.4#803005)