Posted to commits@beam.apache.org by "Gleb Kanterov (JIRA)" <ji...@apache.org> on 2018/10/04 09:18:00 UTC

[jira] [Created] (BEAM-5646) Equality is broken for Rows with BYTES field

Gleb Kanterov created BEAM-5646:
-----------------------------------

             Summary: Equality is broken for Rows with BYTES field
                 Key: BEAM-5646
                 URL: https://issues.apache.org/jira/browse/BEAM-5646
             Project: Beam
          Issue Type: Bug
          Components: dsl-sql
    Affects Versions: 2.7.0
            Reporter: Gleb Kanterov
            Assignee: Xu Mingmin


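The problem is in `org.apache.beam.sdk.values.Row#equals` and `Row#hashCode`. `Row` stores fields of type BYTES as `byte[]`, and Java arrays inherit `Object#equals` and `Object#hashCode`, which compare by reference rather than by contents. As a result, two rows whose BYTES fields hold identical contents are not considered equal.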
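
The underlying JDK behavior can be reproduced without Beam. The following is a minimal standalone sketch; the class name `ByteArrayEqualityDemo` and its helper methods are hypothetical and exist only for illustration:

```java
import java.util.Arrays;

final class ByteArrayEqualityDemo {
  /** Arrays do not override Object#equals, so this is a reference comparison. */
  static boolean referenceEquals(byte[] a, byte[] b) {
    return a.equals(b);
  }

  /** Compares lengths and then elements pairwise. */
  static boolean contentEquals(byte[] a, byte[] b) {
    return Arrays.equals(a, b);
  }

  public static void main(String[] args) {
    byte[] a = new byte[16];
    byte[] b = new byte[16];

    System.out.println(referenceEquals(a, b)); // false: distinct array objects
    System.out.println(contentEquals(a, b));   // true: same length and contents

    // Arrays.hashCode is derived from contents, so it agrees with Arrays.equals.
    System.out.println(Arrays.hashCode(a) == Arrays.hashCode(b)); // true
  }
}
```

Any class that holds a `byte[]` and delegates to the field's own `equals` or `hashCode` inherits the reference semantics shown in the first call.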

These failing tests illustrate the problem:
{code:java}
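import java.nio.ByteBuffer;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;
import org.junit.Assert;
import org.junit.Test;

@Test
public void testByteArrayEquality() {
  // Two distinct arrays with identical (all-zero) contents.
  byte[] a0 = new byte[16];
  byte[] b0 = new byte[16];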

  Schema schema = Schema.of(Schema.Field.of("bytes", Schema.FieldType.BYTES));

  Row a = Row.withSchema(schema).addValue(a0).build();
  Row b = Row.withSchema(schema).addValue(b0).build();

  Assert.assertEquals(a, b);
}

@Test
public void testByteBufferEquality() {
  byte[] a0 = new byte[16];
  byte[] b0 = new byte[16];

  Schema schema = Schema.of(Schema.Field.of("bytes", Schema.FieldType.BYTES));

  Row a = Row.withSchema(schema).addValue(ByteBuffer.wrap(a0)).build();
  Row b = Row.withSchema(schema).addValue(ByteBuffer.wrap(b0)).build();

  Assert.assertEquals(a, b);
}
{code}
 

Option 1. Store `byte[]` as `ByteBuffer`, or as something simpler that doesn't carry offsets. `Row#getValue` would then return this type. For consistency, `Row#getBytes` should also change, in an incompatible way, to return the same type as `Row#getValue`, because that is how the rest of the typed getters behave.
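
To illustrate why offsets matter for Option 1, here is a hedged sketch using only the JDK. It shows that equality of a plain `java.nio.ByteBuffer` (compared outside of `Row`) depends on the buffer's position and limit, and it includes one possible shape for a simpler wrapper. `ImmutableBytes` and `BytesStorageDemo` are hypothetical names, not existing Beam classes:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical immutable wrapper with content-based equality and no offsets. */
final class ImmutableBytes {
  private final byte[] bytes;

  private ImmutableBytes(byte[] bytes) {
    this.bytes = bytes;
  }

  /** Copies defensively so later changes to the source array are not visible. */
  static ImmutableBytes copyOf(byte[] source) {
    return new ImmutableBytes(Arrays.copyOf(source, source.length));
  }

  byte[] toByteArray() {
    return Arrays.copyOf(bytes, bytes.length);
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof ImmutableBytes && Arrays.equals(bytes, ((ImmutableBytes) o).bytes);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(bytes);
  }
}

final class BytesStorageDemo {
  public static void main(String[] args) {
    ByteBuffer x = ByteBuffer.wrap(new byte[] {1, 2, 3});
    ByteBuffer y = ByteBuffer.wrap(new byte[] {1, 2, 3});
    System.out.println(x.equals(y)); // true: same remaining bytes

    x.get(); // advances x's position by one
    System.out.println(x.equals(y)); // false: 2 remaining bytes vs 3

    ImmutableBytes p = ImmutableBytes.copyOf(new byte[] {1, 2, 3});
    ImmutableBytes q = ImmutableBytes.copyOf(new byte[] {1, 2, 3});
    System.out.println(p.equals(q)); // true, regardless of any read state
  }
}
```

The wrapper trades an extra copy for equality that cannot be affected by reading from the value.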

 

Option 2. Do what Spark does and add an `if (x instanceof byte[])` check to `equals`. The problem in Spark is that its `hashCode` implementation isn't consistent with `equals`; see SPARK-25122.
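
A rough sketch of what Option 2 could look like while keeping `hashCode` consistent with `equals`. `SimpleRow` is a hypothetical stand-in written for this example and is not Beam's `Row`; the real class has a schema and more field types to consider:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in for a row of values; not Beam's Row. */
final class SimpleRow {
  private final List<Object> values;

  SimpleRow(Object... values) {
    this.values = Arrays.asList(values);
  }

  /** Compares byte[] values by contents and everything else with Objects.equals. */
  private static boolean valueEquals(Object a, Object b) {
    if (a instanceof byte[] && b instanceof byte[]) {
      return Arrays.equals((byte[]) a, (byte[]) b);
    }
    return Objects.equals(a, b);
  }

  /** Mirrors valueEquals so equal values always produce equal hash codes. */
  private static int valueHashCode(Object v) {
    return (v instanceof byte[]) ? Arrays.hashCode((byte[]) v) : Objects.hashCode(v);
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof SimpleRow)) {
      return false;
    }
    List<Object> other = ((SimpleRow) o).values;
    if (values.size() != other.size()) {
      return false;
    }
    for (int i = 0; i < values.size(); i++) {
      if (!valueEquals(values.get(i), other.get(i))) {
        return false;
      }
    }
    return true;
  }

  @Override
  public int hashCode() {
    int result = 1;
    for (Object v : values) {
      result = 31 * result + valueHashCode(v);
    }
    return result;
  }
}
```

The key point is that the `byte[]` special case appears in both methods; applying it only in `equals` would reproduce the inconsistency described for Spark.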

 

Option 3. Treat this as intended behavior and fix the `RowCoder#consistentWithEquals` implementation.
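
One way to read Option 3, sketched under assumptions: a coder should only report itself as consistent with equals when equal encodings imply equal values, which does not hold while BYTES fields are compared by reference. The `FieldKind` enum and the `consistentWithEquals(List<FieldKind>)` helper below are hypothetical simplifications and do not reflect the actual `RowCoder` code:

```java
import java.util.Arrays;
import java.util.List;

final class ConsistentWithEqualsSketch {
  /** Hypothetical stand-in for schema field types. */
  enum FieldKind { INT64, STRING, BYTES }

  /**
   * Returns false when any field is BYTES: two rows with such a field can
   * encode to identical bytes and still be unequal under reference-based equals.
   */
  static boolean consistentWithEquals(List<FieldKind> fields) {
    return !fields.contains(FieldKind.BYTES);
  }

  public static void main(String[] args) {
    System.out.println(
        consistentWithEquals(Arrays.asList(FieldKind.INT64, FieldKind.STRING))); // true
    System.out.println(
        consistentWithEquals(Arrays.asList(FieldKind.INT64, FieldKind.BYTES))); // false
  }
}
```

A real implementation would also need to look inside nested and collection field types, which this sketch ignores.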


