Posted to dev@spark.apache.org by Yijie Shen <he...@gmail.com> on 2015/04/12 07:50:21 UTC

Parquet File Binary column statistics error when reusing byte[] among rows

Hi,

Suppose I create a dataRDD which extends RDD[Row], where each row is a
GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is
reused across rows, but its content changes each time. When I convert it to
a DataFrame and save it as a Parquet file, the file's row group statistics
(max & min) for the Binary column are wrong.
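
For concreteness, here is a minimal sketch of the pattern that triggers it
(the names and output path are hypothetical; it assumes a Spark 1.3-era
shell where sc and sqlContext are predefined, and that release's
saveAsParquetFile API):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("payload", BinaryType, nullable = false)))

    // One byte[] shared by every row; only its content changes per row.
    val buffer = new Array[Byte](4)
    val rows = sc.parallelize(1 to 3, 1).map { i => // 1 partition: all rows
      buffer(0) = i.toByte // mutate the shared buffer in place
      Row(i, buffer)       // every Row aliases the same array
    }

    sqlContext.createDataFrame(rows, schema)
      .saveAsParquetFile("/tmp/binary-stats-repro.parquet")
    // Expected: min & max for "payload" come from different rows.
    // Observed: both equal the last row's bytes.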



Here is the reason: in Parquet, BinaryStatistics keeps max & min simply as
parquet.io.api.Binary references, and Spark SQL generates each new Binary
backed by the very Array[Byte] passed in from the row:

  reference chain for max: Binary ----> ByteArrayBackedBinary ----> Array[Byte]

Therefore, each time Parquet updates the row group's statistics, max & min
still refer to the same Array[Byte], whose content has changed in the
meantime. When Parquet finally writes the statistics to the file, the last
row's content is recorded as both max & min.
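
The aliasing can be seen directly at the Parquet API level. This is a small
illustration, assuming the 2015-era parquet-mr API in which
Binary.fromByteArray wraps the given array without copying it:

    import parquet.io.api.Binary

    val shared = Array[Byte](1)
    // Statistics hold a reference like this one; it is only a view of `shared`.
    val recordedMax = Binary.fromByteArray(shared)

    shared(0) = 127           // the next row overwrites the buffer...
    recordedMax.getBytes()(0) // ...and the recorded "max" now reads 127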



It seems to be a Parquet bug, since it's Parquet's responsibility to update
statistics correctly, but I'm not quite sure. Should I report it as a bug in
the Parquet JIRA?


The Spark JIRA is https://issues.apache.org/jira/browse/SPARK-6859
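
Continuing the sketch above, one defensive workaround in user code (an
illustrative suggestion, not an official recommendation; rowsCopied is a
hypothetical name) is to hand Spark a fresh copy of the bytes for each row,
so the Binary kept by the statistics references an array no later row can
mutate:

    val rowsCopied = sc.parallelize(1 to 3, 1).map { i =>
      buffer(0) = i.toByte
      Row(i, buffer.clone()) // per-row copy: the statistics now point at
                             // a private array that later rows cannot touch
    }

This trades one extra allocation per row for correct min & max values.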

Re: Parquet File Binary column statistics error when reusing byte[] among rows

Posted by Cheng Lian <li...@gmail.com>.
Thanks Yijie! Also cc the user list.

Cheng

On 4/13/15 9:19 AM, Yijie Shen wrote:
> I opened a new Parquet JIRA ticket here: 
> https://issues.apache.org/jira/browse/PARQUET-251
>
> Yijie
>
> On April 12, 2015 at 11:48:57 PM, Cheng Lian (lian.cs.zju@gmail.com) wrote:
>
>> Thanks for reporting this! Would you mind opening JIRA tickets for both
>> Spark and Parquet?
>>
>> I'm not sure whether Parquet declares anywhere that users mustn't reuse
>> byte arrays with the binary type. If it does, then it's a Spark bug.
>> Either way, this should be fixed.
>>
>> Cheng
>>
>> On 4/12/15 1:50 PM, Yijie Shen wrote:
>> > Hi,
>> >
>> > Suppose I create a dataRDD which extends RDD[Row], where each row is a
>> > GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object
>> > is reused across rows, but its content changes each time. When I
>> > convert it to a DataFrame and save it as a Parquet file, the file's row
>> > group statistics (max & min) for the Binary column are wrong.
>> >
>> > Here is the reason: in Parquet, BinaryStatistics keeps max & min simply
>> > as parquet.io.api.Binary references, and Spark SQL generates each new
>> > Binary backed by the very Array[Byte] passed in from the row:
>> >
>> >   reference chain for max: Binary ----> ByteArrayBackedBinary ---->
>> >   Array[Byte]
>> >
>> > Therefore, each time Parquet updates the row group's statistics, max &
>> > min still refer to the same Array[Byte], whose content has changed in
>> > the meantime. When Parquet finally writes the statistics to the file,
>> > the last row's content is recorded as both max & min.
>> >
>> > It seems to be a Parquet bug, since it's Parquet's responsibility to
>> > update statistics correctly, but I'm not quite sure. Should I report it
>> > as a bug in the Parquet JIRA?
>> >
>> > The Spark JIRA is https://issues.apache.org/jira/browse/SPARK-6859
>> >
>>


Re: Parquet File Binary column statistics error when reusing byte[] among rows

Posted by Cheng Lian <li...@gmail.com>.
Thanks for reporting this! Would you mind opening JIRA tickets for both
Spark and Parquet?

I'm not sure whether Parquet declares anywhere that users mustn't reuse
byte arrays with the binary type. If it does, then it's a Spark bug.
Either way, this should be fixed.

Cheng

On 4/12/15 1:50 PM, Yijie Shen wrote:
> Hi,
>
> Suppose I create a dataRDD which extends RDD[Row], where each row is a
> GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is
> reused across rows, but its content changes each time. When I convert it
> to a DataFrame and save it as a Parquet file, the file's row group
> statistics (max & min) for the Binary column are wrong.
>
> Here is the reason: in Parquet, BinaryStatistics keeps max & min simply as
> parquet.io.api.Binary references, and Spark SQL generates each new Binary
> backed by the very Array[Byte] passed in from the row:
>
>   reference chain for max: Binary ----> ByteArrayBackedBinary ----> Array[Byte]
>
> Therefore, each time Parquet updates the row group's statistics, max & min
> still refer to the same Array[Byte], whose content has changed in the
> meantime. When Parquet finally writes the statistics to the file, the last
> row's content is recorded as both max & min.
>
> It seems to be a Parquet bug, since it's Parquet's responsibility to
> update statistics correctly, but I'm not quite sure. Should I report it as
> a bug in the Parquet JIRA?
>
> The Spark JIRA is https://issues.apache.org/jira/browse/SPARK-6859
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org