Posted to dev@parquet.apache.org by Jacques Nadeau <ja...@apache.org> on 2017/11/20 16:34:23 UTC

Codec value missing from Turbodbc files? Format issue?

One of our community members hit an issue where we couldn't parse a Parquet
footer. It looks like the file is missing the Codec field for a column but
the Parquet Thrift spec expects one.

https://community.dremio.com/t/unable-to-read-parquet-footer-with-file-generated-with-turbodbc/474/9

Was there a recent change in format? Any thoughts would be appreciated.

thanks,
Jacques

Re: Codec value missing from Turbodbc files? Format issue?

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
The files are produced by Parquet C++ through pyarrow. Turbodbc cannot write Parquet itself; it only talks ODBC to a database and then returns Arrow tables/Pandas DataFrames. The Arrow -> Parquet conversion is done in pyarrow.

Additionally, I would add zstandard +... that were recently added to the Parquet standard to parquet-cpp quite soon. This is nice for users who only use tools on the newest version of Parquet; with older tools we will probably see the above error more often, as people will use the new codecs despite the warnings in the documentation.

Uwe

(note that besides being involved in Arrow and Parquet, I'm one of the two turbodbc developers)



Re: Codec value missing from Turbodbc files? Format issue?

Posted by Jacques Nadeau <ja...@apache.org>.
Got it, nice catch. Thanks for the help!


Re: Codec value missing from Turbodbc files? Format issue?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
The file that the user posted is stored with Brotli compression. You should
be able to read it with the latest Parquet master. I can cat the contents
with our tools that use brotli.

I'm surprised to see files like this already. We added the new compression
codecs just recently. Also, whatever wrote this file should not default to
brotli and should warn users that using brotli compression breaks forward
compatibility: older readers can't read the files or metadata because of
how Thrift handles enums.
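
To make the enum point concrete, here is a minimal Python sketch of the failure mode. It does not use the real Thrift library. It assumes, as I understand the generated Java code to behave, that an enum value the reader does not know decodes to "nothing", so a required field then looks unset. The names `OldCodec`, `NewCodec`, `find_by_value` and `read_codec` are hypothetical, the error text is illustrative, and the numeric values are the ones I recall from parquet.thrift.

```python
from enum import IntEnum
from typing import Optional, Type


class OldCodec(IntEnum):
    """Codecs known to a reader built before the new codecs were added."""
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2
    LZO = 3


class NewCodec(IntEnum):
    """Codecs known to a reader built after the additions."""
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2
    LZO = 3
    BROTLI = 4
    LZ4 = 5
    ZSTD = 6


def find_by_value(enum_cls: Type[IntEnum], raw: int) -> Optional[IntEnum]:
    """Map a wire integer to an enum member, or None if it is unknown.

    Assumption: this mimics a decoder that silently drops unknown values.
    """
    try:
        return enum_cls(raw)
    except ValueError:
        return None


def read_codec(enum_cls: Type[IntEnum], raw: int) -> IntEnum:
    """Decode a required codec field; an unknown value looks like a missing one."""
    codec = find_by_value(enum_cls, raw)
    if codec is None:
        # Illustrative message only, not copied from any library.
        raise ValueError("required field 'codec' was not found")
    return codec


if __name__ == "__main__":
    wire_value = int(NewCodec.BROTLI)  # what a Brotli-writing tool puts in the footer
    print(read_codec(NewCodec, wire_value).name)  # a new reader resolves the codec
    try:
        read_codec(OldCodec, wire_value)
    except ValueError as exc:
        print("old reader:", exc)  # an old reader reports a missing field
```

Under these assumptions the old reader never sees an "unknown codec" error. It sees a footer that appears to lack the codec field, which matches the symptom reported at the start of this thread.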

rb




-- 
Ryan Blue
Software Engineer
Netflix