Posted to dev@parquet.apache.org by Andrew Duffy <ad...@palantir.com> on 2016/08/19 10:37:33 UTC

Spark + Parquet UTF8 treatment of binary

Hello Parquet-Dev,

 

I wanted to get some feedback on an issue we’ve been running into that comes down to the difference between how Parquet and Spark order UTF8 strings.

 

Spark has a special type for strings, so when it performs a sort, e.g. before writing out to Parquet, it uses string-wise ordering, in which all characters outside the ASCII range compare “greater than” anything in the ASCII range.

 

However, when Spark pushes filters down to Parquet, strings are treated as Binary, and Binary compares two strings as byte[] using Java’s signed byte type, so anything starting with a non-ASCII UTF-8 byte is seen as being “less than” anything in the ASCII range. The way this manifests is that Spark sorts the records using its own comparison, Parquet then calculates the min and max for its Statistics using signed-byte comparison, and when you push a filter down in Spark you end up reading data you shouldn’t have to, because the statistics are broken for the ordering you’re filtering on.
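To make the mismatch concrete, here is a minimal, self-contained Java illustration (the class and method names are made up for this example; this is not Spark or Parquet code):

    import java.nio.charset.StandardCharsets;

    // Standalone illustration of the ordering mismatch; not Spark or Parquet code.
    public class OrderingMismatch {

      // Lexicographic comparison over Java's signed bytes, which is the ordering
      // the min/max statistics end up reflecting.
      static int compareSignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
          if (a[i] != b[i]) {
            return a[i] - b[i]; // 0xC3 is read as -61 here
          }
        }
        return a.length - b.length;
      }

      public static void main(String[] args) {
        String accented = "\u00e9"; // "é", encoded as the UTF-8 bytes 0xC3 0xA9
        String ascii = "z";         // encoded as the single byte 0x7A

        // String order, which is what Spark's sort produces: "é" > "z"
        System.out.println(accented.compareTo(ascii));           // positive

        // Signed-byte order over the UTF-8 encoding: "é" < "z"
        System.out.println(compareSignedBytes(
            accented.getBytes(StandardCharsets.UTF_8),
            ascii.getBytes(StandardCharsets.UTF_8)));            // negative
      }
    }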

 

I was wondering if anyone has strong opinions on the best way to fix this. Perhaps adding a true “String” type to Parquet with a well-defined ordering would be the way to go, or does anyone have recommendations for Spark-side fixes? Another option is to force binary comparisons to treat bytes as unsigned, which would be a breaking change but might be what we actually want to be doing when comparing bytes.

 

-Andrew

 


Re: Spark + Parquet UTF8 treatment of binary

Posted by Andrew Duffy <ad...@palantir.com>.
Hey Julien,

Thanks for the pointer! I actually went and took a look and I think I found a way to fix this that doesn’t require a format change.
It only required adding one new static, UTF-8-aware comparison method to Binary, and as I mention in the PR, it performs the comparison the same way that Avro and Spark do:

Avro: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/io/BinaryData.java#L184
Spark: https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L835
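For reference, here is a minimal sketch of that style of comparison, i.e. unsigned lexicographic order over the encoded bytes (the method name is illustrative and not the exact code in the PR):

    // Compare UTF-8 encoded strings by treating each byte as unsigned (0..255).
    // For valid UTF-8 this matches Unicode code point order, which is why Avro
    // and Spark can compare the raw bytes directly.
    static int compareUtf8(byte[] a, byte[] b) {
      int n = Math.min(a.length, b.length);
      for (int i = 0; i < n; i++) {
        // Masking with 0xFF makes non-ASCII bytes (>= 0x80) sort after ASCII
        // bytes instead of coming out negative.
        int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
        if (cmp != 0) {
          return cmp;
        }
      }
      return a.length - b.length;
    }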

This doesn’t break any compatibility: it only affects the write path, and only the generation of BinaryStatistics when the Binary is a FromStringBinary, so it won’t affect existing Parquet datasets.
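To make that scope concrete, here is a hypothetical sketch of write-side min/max tracking using a comparator like the one above (illustrative names, not the actual parquet-mr classes); nothing on the read path changes:

    // Hypothetical write-side min/max tracking for string values. Only how new
    // statistics are computed changes; readers and existing files are untouched.
    class StringStatsSketch {
      private byte[] min;
      private byte[] max;

      void update(byte[] utf8Value) {
        if (min == null || compareUtf8(utf8Value, min) < 0) {
          min = utf8Value;
        }
        if (max == null || compareUtf8(utf8Value, max) > 0) {
          max = utf8Value;
        }
      }
    }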

I pushed a PR and tagged you in it; it’s visible here: https://github.com/apache/parquet-mr/pull/362. Please let me know what you think.
-Andrew

On 8/20/16, 9:02 PM, "Julien Le Dem" <ju...@dremio.com> wrote:

>This sounds like Parquet could be improved in that regard.
>One way to evolve this in a backward-compatible manner is to add optional
>fields to the Statistics struct that would have the min_utf8, max_utf8
>semantics you describe. These would be added for binary fields labelled with
>the logical type UTF8 [1] (which is the true String type in Parquet).
>
>[1]
>https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L50
>
>-- 
>Julien

Re: Spark + Parquet UTF8 treatment of binary

Posted by Julien Le Dem <ju...@dremio.com>.
This sounds like Parquet could be improved in that regard.
One way to evolve this in a backward-compatible manner is to add optional
fields to the Statistics struct that would have the min_utf8, max_utf8
semantics you describe. These would be added for binary fields labelled with
the logical type UTF8 [1] (which is the true String type in Parquet).

[1]
https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L50
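A hypothetical sketch of how a reader could consume such fields in a backward-compatible way (field and class names are illustrative; this is not parquet-format or parquet-mr code):

    // Prefer the proposed UTF-8 min/max when a newer writer populated them; when
    // they are absent (e.g. older files), skip string-range pruning rather than
    // trust the legacy signed-byte min/max for a string-ordered predicate.
    class ColumnStatsSketch {
      byte[] min;     // existing fields: signed-byte ordering
      byte[] max;
      byte[] minUtf8; // proposed optional fields: UTF-8 ordering
      byte[] maxUtf8;

      boolean canPruneByStringRange() {
        return minUtf8 != null && maxUtf8 != null;
      }
    }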




-- 
Julien