Posted to dev@parquet.apache.org by Cheng Lian <li...@gmail.com> on 2015/08/13 13:04:15 UTC

Re: Thrift binary type in Parquet

Filed a JIRA ticket for this issue:
https://issues.apache.org/jira/browse/PARQUET-357

Cheng

On 7/15/15 1:10 AM, Ryan Blue wrote:
> This sounds like something we should fix. It may work for Thrift and 
> Scrooge because they have an external schema, but you're right that 
> this will cause buggy behavior for other object models.
>
> Alex and Tianshuo, any ideas about how to address this? It looks like 
> we need to update the ThriftSchemaConverter when converting the Thrift 
> object to Parquet's representation of its schema. That should detect 
> that the field is a binary (through reflection?) even though the 
> underlying Thrift metadata doesn't encode it.
>
> rb
>
> On 07/07/2015 04:00 PM, Cheng Lian wrote:
>> You can see that parquet-mr 1.7.0 only handles Thrift STRING, and
>> always adds the UTF8 annotation:
>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConvertVisitor.java#L249-L252 
>>
>> Because there’s just no |ThriftType.BinaryType|.
>>
>> On 7/7/15 3:56 PM, Cheng Lian wrote:
>>
>>> On 7/7/15 3:48 PM, Ryan Blue wrote:
>>>
>>>> On 07/07/2015 03:23 PM, Cheng Lian wrote:
>>>>> On 7/7/15 1:28 PM, Ashish Singh wrote:
>>>>>> I think you mean that we can’t treat Thrift BINARY type as UTF-8
>>>>>> string,
>>>>>> right?
>>>>> Yeah, it's possible that a Thrift BINARY contains illegal UTF-8 byte
>>>>> sequence(s), and I suppose this may cause problems. Trying to verify
>>>>> this.
>>>>
>>>> Isn't this the right behavior? As long as it isn't annotated as a
>>>> UTF8, then storing it as binary should be fine.
>>>
>>> Ah, it’s actually annotated as UTF8…
>>>
>>> Internally Thrift just maps BINARY to STRING and doesn’t have any
>>> annotation indicating that this field is a BINARY, so Parquet just
>>> assumes it’s a normal UTF8 string and writes “BINARY (UTF8)”.
>>>
>>> Here are my test Thrift schema and the Parquet schema that
>>> |parquet-schema| extracted from the written Parquet file:
>>>
>>> struct ParquetThriftCompat {
>>>   1: binary binaryColumn;
>>>   2: string stringColumn;
>>> }
>>>
>>> message ParquetSchema {
>>>   optional binary binaryColumn (UTF8);
>>>   optional binary stringColumn (UTF8);
>>> }
>>>>
>>>> rb
>>>>
>>>>
>>>
>>
>
>
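
To illustrate the concern raised in this thread, here is a minimal Python sketch of why labelling an arbitrary Thrift BINARY value as UTF8 can be a problem. It does not use parquet-mr or Thrift; the helper names and the sample bytes are hypothetical, and the "replace invalid sequences" behavior is only an assumption about how a string-oriented reader might treat a column annotated as UTF8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def lossy_round_trip(data: bytes) -> bytes:
    """Decode as UTF-8, replacing invalid sequences, then re-encode.

    This mimics (as an assumption) a reader that trusts the UTF8
    annotation and materializes the column as a string.
    """
    return data.decode("utf-8", errors="replace").encode("utf-8")


# A value that really is text, and an arbitrary binary payload.
text_bytes = "parquet".encode("utf-8")
raw_bytes = bytes([0xFF, 0xFE, 0x00, 0x80])  # hypothetical sample, not valid UTF-8

print(is_valid_utf8(text_bytes))                    # text survives strict decoding
print(is_valid_utf8(raw_bytes))                     # binary payload does not
print(lossy_round_trip(raw_bytes) == raw_bytes)     # bytes change after the round trip
```

The text value passes through unchanged, while the binary value either fails strict decoding or comes back with different bytes. That is the kind of behavior a reader relying on the UTF8 annotation could run into.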