Posted to dev@parquet.apache.org by "Konstantin Shaposhnikov (JIRA)" <ji...@apache.org> on 2015/04/22 17:14:58 UTC

[jira] [Created] (PARQUET-258) Binary statistics is not updated correctly if an underlying Binary array is modified in place

Konstantin Shaposhnikov created PARQUET-258:
-----------------------------------------------

             Summary: Binary statistics is not updated correctly if an underlying Binary array is modified in place
                 Key: PARQUET-258
                 URL: https://issues.apache.org/jira/browse/PARQUET-258
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.6.0
            Reporter: Konstantin Shaposhnikov


The following test case shows the problem:

{code}
    byte[] bytes = new byte[] { 49 };
    BinaryStatistics reusableStats = new BinaryStatistics();
    reusableStats.updateStats(Binary.fromByteArray(bytes));
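    // mutate the backing array in place; the Binary stored as min/max still references it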
    bytes[0] = 50;
    reusableStats.updateStats(Binary.fromByteArray(bytes, 0, 1));
 
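    // the first assertion fails: getMinBytes() reflects the in-place change and returns { 50 }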
    assertArrayEquals(new byte[] { 49 }, reusableStats.getMinBytes());
    assertArrayEquals(new byte[] { 50 }, reusableStats.getMaxBytes());
{code}

I discovered the bug when converting an Avro file to a Parquet file by reading GenericRecords with the [DataFileStream.next(D reuse)|http://javadox.com/org.apache.avro/avro/1.7.6/org/apache/avro/file/DataFileStream.html#next(D)] method. The problem is that the underlying byte array of the Avro Utf8 object is passed to Parquet, which stores it as part of BinaryStatistics, and the same array is then modified in place by the next read.
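
For reference, the conversion loop looked roughly like the sketch below. The names ({{in}}, {{writer}}) and the wrapping method are illustrative rather than the exact code; the point is only the record reuse combined with a normal Parquet write.

{code}
import java.io.InputStream;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import parquet.avro.AvroParquetWriter;

public class AvroToParquet {
    static void copy(InputStream in, AvroParquetWriter<GenericRecord> writer) throws Exception {
        DataFileStream<GenericRecord> stream =
                new DataFileStream<GenericRecord>(in, new GenericDatumReader<GenericRecord>());
        GenericRecord reuse = null;
        while (stream.hasNext()) {
            // Avro refills the same record, reusing the Utf8 objects and their byte arrays
            reuse = stream.next(reuse);
            // Parquet keeps references to those arrays, including in BinaryStatistics
            writer.write(reuse);
        }
        stream.close();
    }
}
{code}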

I am not sure what the right way to fix the problem is (in BinaryStatistics or in AvroWriteSupport).

If the BinaryStatistics implementation is considered correct (for performance reasons), then this behavior should be documented and AvroWriteSupport.fromAvroString should be fixed to duplicate the underlying Utf8 array (see the sketch below).
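
For example, assuming fromAvroString currently wraps the Utf8 buffer directly, the change could copy the valid bytes before creating the Binary. This is only a sketch of the idea, not the exact method body:

{code}
import java.util.Arrays;

import org.apache.avro.util.Utf8;

import parquet.io.api.Binary;

// inside AvroWriteSupport
private static Binary fromAvroString(Object value) {
    if (value instanceof Utf8) {
        Utf8 utf8 = (Utf8) value;
        // Utf8.getBytes() exposes the internal, reusable buffer; copying the valid prefix
        // means BinaryStatistics never sees later in-place modifications
        return Binary.fromByteArray(Arrays.copyOf(utf8.getBytes(), utf8.getByteLength()));
    }
    return Binary.fromString(value.toString());
}
{code}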

I am happy to create a pull request once the desired way to fix the issue is discussed.


