Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2015/04/28 02:51:06 UTC

[jira] [Resolved] (PARQUET-258) Binary statistics is not updated correctly if an underlying Binary array is modified in place

     [ https://issues.apache.org/jira/browse/PARQUET-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue resolved PARQUET-258.
-------------------------------
    Resolution: Duplicate

No problem, I just wanted to make sure.

> Binary statistics is not updated correctly if an underlying Binary array is modified in place
> ---------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-258
>                 URL: https://issues.apache.org/jira/browse/PARQUET-258
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Konstantin Shaposhnikov
>
> The following test case shows the problem:
> {code}
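>     // Initial value: a single byte, 49.
>     byte[] bytes = new byte[] { 49 };
>     BinaryStatistics reusableStats = new BinaryStatistics();
>     reusableStats.updateStats(Binary.fromByteArray(bytes));
>     // Mutate the same backing array in place, as a reused Avro Utf8 would.
>     bytes[0] = 50;
>     reusableStats.updateStats(Binary.fromByteArray(bytes, 0, 1));
>  
>     // Expected: min 49, max 50. If the stats still reference the mutated
>     // array, the min check is the one that fails.
>     assertArrayEquals(new byte[] { 49 }, reusableStats.getMinBytes());
>     assertArrayEquals(new byte[] { 50 }, reusableStats.getMaxBytes());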
> {code}
> I discovered the bug when converting an Avro file to a Parquet file, reading GenericRecords from the file with the [DataFileStream.next(D reuse)|http://javadox.com/org.apache.avro/avro/1.7.6/org/apache/avro/file/DataFileStream.html#next(D)] method. The underlying byte array of the Avro Utf8 object is passed to Parquet, which saves it as part of BinaryStatistics; the same array is then modified in place on the next read.
> I am not sure what the right way to fix the problem is (in BinaryStatistics or in AvroWriteSupport).
> If the BinaryStatistics implementation is correct (for performance reasons), then this behavior should be documented and AvroWriteSupport.fromAvroString should be fixed to duplicate the underlying Utf8 array.
> I am happy to create a pull request once the desired way to fix the issue is discussed.
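
The aliasing described in the quoted report can be reproduced without Parquet or Avro on the classpath. The sketch below is only an illustration: MinMaxTracker, its copyOnUpdate flag, and the unsigned byte comparison are hypothetical stand-ins chosen for this example, not the BinaryStatistics or AvroWriteSupport code. It contrasts keeping a reference to the caller's array with taking a defensive copy, which is roughly what either proposed fix would amount to.

```java
import java.util.Arrays;

// Hypothetical demo class; none of these names come from parquet-mr or Avro.
class MinMaxAliasingDemo {

    /** Stand-in for a min/max statistics holder over byte arrays. */
    static final class MinMaxTracker {
        private final boolean copyOnUpdate;
        private byte[] min;
        private byte[] max;

        MinMaxTracker(boolean copyOnUpdate) {
            this.copyOnUpdate = copyOnUpdate;
        }

        void update(byte[] value) {
            // Without the copy, min/max alias the caller's (possibly reused) buffer.
            byte[] candidate = copyOnUpdate ? Arrays.copyOf(value, value.length) : value;
            if (min == null || compare(candidate, min) < 0) {
                min = candidate;
            }
            if (max == null || compare(candidate, max) > 0) {
                max = candidate;
            }
        }

        byte[] getMin() { return min; }
        byte[] getMax() { return max; }

        // Unsigned lexicographic comparison; an illustrative choice only.
        private static int compare(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int c = (a[i] & 0xFF) - (b[i] & 0xFF);
                if (c != 0) {
                    return c;
                }
            }
            return a.length - b.length;
        }
    }

    /** Replays the scenario from the report; returns { min, max }. */
    static byte[][] run(boolean copyOnUpdate) {
        byte[] bytes = new byte[] { 49 };
        MinMaxTracker tracker = new MinMaxTracker(copyOnUpdate);
        tracker.update(bytes);
        bytes[0] = 50;              // in-place reuse of the same buffer
        tracker.update(bytes);
        return new byte[][] { tracker.getMin(), tracker.getMax() };
    }

    public static void main(String[] args) {
        byte[][] aliased = run(false);
        byte[][] copied = run(true);
        System.out.println("no copy:   min=" + Arrays.toString(aliased[0])
                + " max=" + Arrays.toString(aliased[1]));   // min=[50] max=[50]
        System.out.println("with copy: min=" + Arrays.toString(copied[0])
                + " max=" + Arrays.toString(copied[1]));    // min=[49] max=[50]
    }
}
```

Copying on every update costs an allocation per value, which is presumably the performance concern mentioned in the report. Copying once at the boundary where a reused buffer is handed over would place the same cost on the caller that actually reuses its buffers.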


