Posted to dev@parquet.apache.org by "Ashish K Singh (JIRA)" <ji...@apache.org> on 2015/05/05 00:02:08 UTC

[jira] [Commented] (PARQUET-251) Binary column statistics error when reuse byte[] among rows

    [ https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527414#comment-14527414 ] 

Ashish K Singh commented on PARQUET-251:
----------------------------------------

Hey guys, I am planning to take a stab at this. After following the detailed discussion so far, I think (feel free to correct me) the following needs to be done.

1. We need a clone() or copyToByteArray() method in Binary. Having the method makes it explicit that the backing byte[] must not be mutated, and that a caller who plans to mutate it should call clone() or copyToByteArray() first. This should also be spelled out in the docs. (See the sketch after this list.)

2. Update BinaryStatistics to use clone() or copyToByteArray() on the Binary value passed to it.

3. Ignore min/max byte arrays for data written with 1.6.0 and earlier.
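
To make the intent concrete, here is a rough sketch of items 1 and 2. The method name copy() and the internals shown are my assumptions for illustration, not settled API:

{code:java}
// Sketch only -- the copy() name and these internals are assumptions.

// 1. A defensive-copy method on parquet.io.api.Binary:
public Binary copy() {
  // getBytes() may hand back the caller's reusable array, so clone it
  // to give the returned Binary a private backing array.
  return Binary.fromByteArray(getBytes().clone());
}

// 2. BinaryStatistics copies before retaining a reference, so mutating
//    the caller's byte[] afterwards cannot corrupt the stored min/max
//    (state bookkeeping elided):
@Override
public void updateStats(Binary value) {
  if (!hasNonNullValue()) {
    min = value.copy();
    max = value.copy();
  } else {
    if (min.compareTo(value) > 0) { min = value.copy(); }
    if (max.compareTo(value) < 0) { max = value.copy(); }
  }
}
{code}

Item 3 would then apply on the read side: min/max statistics written by 1.6.0 and earlier simply would not be trusted for binary columns.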

If you guys agree with it, I should be able to submit a PR soon.

> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
>                 Key: PARQUET-251
>                 URL: https://issues.apache.org/jira/browse/PARQUET-251
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Yijie Shen
>            Assignee: Ashish K Singh
>            Priority: Blocker
>
> I think it is common practice, when inserting table data as a Parquet file, to reuse the same object across rows; and if a column is a byte[] of fixed length, the byte[] is reused as well.
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row groups created by a single task end up with the same max & min binary value, namely the last row's binary content.
> The reason is that BinaryStatistics keeps max & min as parquet.io.api.Binary references; since I use ByteArrayBackedBinary for byte[], the real content of max & min always points to the reused byte[], and therefore to the latest row's content.
> Does Parquet declare anywhere that the user shouldn't reuse byte[] for the Binary type? If it doesn't, I think this is a bug, and it can be reproduced by [Spark SQL's RowWriteSupport|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket: [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
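
For reference, a minimal standalone sketch of the aliasing described above (class and package names as in parquet-mr 1.6.x; the buffer-reuse pattern is illustrative):

{code:java}
import parquet.column.statistics.BinaryStatistics;
import parquet.io.api.Binary;

public class ReusedBufferDemo {
  public static void main(String[] args) {
    BinaryStatistics stats = new BinaryStatistics();
    byte[] buffer = new byte[] { 'a' };               // buffer reused across rows

    stats.updateStats(Binary.fromByteArray(buffer));  // stats retain a reference to buffer

    buffer[0] = 'z';                                  // "next row" overwrites the buffer
    stats.updateStats(Binary.fromByteArray(buffer));

    // Expected min "a" / max "z"; observed min == max == "z", because the
    // stored min still points at the mutated buffer.
    System.out.println("min = " + stats.getMin().toStringUsingUTF8());
    System.out.println("max = " + stats.getMax().toStringUsingUTF8());
  }
}
{code}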


