Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2015/07/01 19:37:04 UTC
[jira] [Updated] (PARQUET-251) Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan Blue updated PARQUET-251:
------------------------------
Fix Version/s: (was: 2.0.0)
1.8.0
> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
> Key: PARQUET-251
> URL: https://issues.apache.org/jira/browse/PARQUET-251
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.6.0
> Reporter: Yijie Shen
> Assignee: Ashish K Singh
> Priority: Blocker
> Fix For: 1.8.0
>
>
> I think it is a common practice when inserting table data as a Parquet file to reuse the same row object across rows; and if a column is a byte[] of fixed length, the byte[] is reused as well.
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row groups created by a single task end up with the same max & min binary value, namely the last row's binary content.
> The reason is that BinaryStatistics keeps max & min as parquet.io.api.Binary references; since I use ByteArrayBackedBinary for the byte[], the real content of max & min always points at the reused byte[], and therefore at the latest row's content.
> Does Parquet declare anywhere that the user shouldn't reuse a byte[] for the Binary type? If it doesn't, I think this is a bug. It can be reproduced with [Spark SQL's RowWriteSupport |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket: [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
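The aliasing the reporter describes can be sketched in a few lines. The classes below are hypothetical stand-ins, not parquet-mr's actual BinaryStatistics; they only illustrate why statistics that hold a reference to a caller-owned buffer report the last row's bytes as both min and max once the buffer is reused:

```java
// Minimal sketch of the aliasing bug (hypothetical classes, not parquet-mr API):
// a stats tracker that stores references to a byte[] the caller keeps mutating.
class ReferenceStats {
    byte[] min, max;

    void update(byte[] value) {
        // BUG: stores the reference; later mutation of `value` rewrites min/max
        if (min == null || compare(value, min) < 0) min = value;
        if (max == null || compare(value, max) > 0) max = value;
    }

    // Unsigned lexicographic comparison, shorter array wins ties
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }
}

public class StatsAliasingDemo {
    public static void main(String[] args) {
        ReferenceStats stats = new ReferenceStats();
        byte[] reused = new byte[1];
        for (byte v : new byte[] {5, 1, 9}) {
            reused[0] = v;        // same array reused for every row
            stats.update(reused);
        }
        // min and max both alias `reused`, so both show the last row's value
        System.out.println("min=" + stats.min[0] + " max=" + stats.max[0]);
        // prints "min=9 max=9"; copying on update (value.clone()) would give min=1 max=9
    }
}
```

The fix shipped in 1.8.0 follows the defensive-copy approach hinted at in the last comment: the statistics take their own copy of the bytes instead of retaining the caller's buffer.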
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)