
Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2015/10/16 22:26:05 UTC

[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

    [ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961318#comment-14961318 ] 

Cheng Lian commented on SPARK-6859:
-----------------------------------

This issue was left unresolved because Parquet filter push-down wasn't enabled by default. As of 1.5 it is turned on by default, so I opened SPARK-11153 to disable filter push-down for strings and binaries.

> Parquet File Binary column statistics error when reuse byte[] among rows
> ------------------------------------------------------------------------
>
>                 Key: SPARK-6859
>                 URL: https://issues.apache.org/jira/browse/SPARK-6859
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>            Reporter: Yijie Shen
>            Priority: Minor
>
> Suppose I create a dataRDD which extends RDD[Row], and each row is GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is reused among rows but has different content each time. When I convert it to a DataFrame and save it as a Parquet file, the file's row group statistics (max & min) for the Binary column are wrong.
>
> Here is the reason: in Parquet, BinaryStatistic just keeps max & min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array[Byte] passed in from the row.
>
>               reference                           backed by
> max: Binary ------------> ByteArrayBackedBinary ------------> Array[Byte]
>
> Therefore, each time Parquet updates the row group's statistics, max & min always refer to the same Array[Byte], which has new content each time. When Parquet decides to write them to the file, the last row's content is saved as both max & min.
>
> This looks like a Parquet bug, since it is Parquet's responsibility to update statistics correctly,
> but I'm not quite sure. Should I report it as a bug in the Parquet JIRA?
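
The aliasing described in the report can be reproduced without Spark or Parquet. The sketch below is only an illustration: MinMax, compareBytes and run are hypothetical names made up for this example, not Parquet's or Spark's actual classes. With copyOnUpdate = false the tracker keeps references to the caller's array, which is the behavior the report attributes to the statistics object. With copyOnUpdate = true it takes a defensive copy, one possible way to avoid the problem.

```scala
object StatsAliasingSketch {
  // Unsigned lexicographic comparison of two byte arrays.
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val c = (a(i) & 0xff) - (b(i) & 0xff)
      if (c != 0) return c
      i += 1
    }
    a.length - b.length
  }

  // Hypothetical min/max tracker (not Parquet's class). When copyOnUpdate
  // is false it stores references to the array it is given.
  final class MinMax(copyOnUpdate: Boolean) {
    private var minV: Array[Byte] = null
    private var maxV: Array[Byte] = null

    private def keep(v: Array[Byte]): Array[Byte] =
      if (copyOnUpdate) v.clone() else v

    def update(value: Array[Byte]): Unit = {
      if (minV == null || compareBytes(value, minV) < 0) minV = keep(value)
      if (maxV == null || compareBytes(value, maxV) > 0) maxV = keep(value)
    }

    def min: String = new String(minV, "UTF-8")
    def max: String = new String(maxV, "UTF-8")
  }

  // Feeds single-character values through ONE reused buffer, the way the
  // reporter's rows share a single Array[Byte].
  def run(copyOnUpdate: Boolean, values: Seq[String]): (String, String) = {
    val stats = new MinMax(copyOnUpdate)
    val buffer = new Array[Byte](1)
    for (v <- values) {
      val bytes = v.getBytes("UTF-8")
      System.arraycopy(bytes, 0, buffer, 0, 1)
      stats.update(buffer)
    }
    (stats.min, stats.max)
  }
}
```

Calling StatsAliasingSketch.run(false, Seq("b", "a", "c")) yields ("c", "c"): both min and max alias the shared buffer, so they show the last row's content. The same call with copyOnUpdate = true yields ("a", "c").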


