Posted to dev@iceberg.apache.org by Arina Yelchiyeva <ar...@gmail.com> on 2019/07/19 14:49:47 UTC

Metadata compression

Hi all,

Recent changes to metadata compression put ".gz" in the middle of the metadata file name rather than at the end, as before.
Before: v1.metadata.json.gz
Now: v1.gz.metadata.json

It looks like this was done intentionally, but I find it rather confusing, since "gz" indicates a compressed file and is usually placed at the end. It also causes problems when reading such files with external tools.
For example, Apache Drill cannot read "v1.gz.metadata.json" because it assumes the file is plain JSON, but it can successfully read "v1.metadata.json.gz" because it recognizes it as a compressed JSON file.
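
To illustrate the naming issue described above, here is a minimal Python sketch of a reader that picks a codec from the last file extension only. It is not Drill's actual implementation; the function names (codec_from_name, read_json_by_suffix, demo) and the sample payload are hypothetical and exist only for this example.

```python
import gzip
import json
import os
import tempfile


def codec_from_name(filename):
    """Guess the codec from the final extension only (hypothetical helper)."""
    return "gzip" if filename.endswith(".gz") else "none"


def read_json_by_suffix(path):
    """Open the file with the codec implied by its name, then parse JSON."""
    opener = gzip.open if codec_from_name(path) == "gzip" else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def demo():
    """Write the same gzip-compressed JSON under both names and try to read each."""
    payload = {"format-version": 1}  # stand-in content, not a full metadata file
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("v1.metadata.json.gz", "v1.gz.metadata.json"):
            path = os.path.join(tmp, name)
            with gzip.open(path, "wt", encoding="utf-8") as f:
                json.dump(payload, f)
            try:
                read_json_by_suffix(path)
                results[name] = "ok"
            except ValueError:
                # Compressed bytes are neither valid UTF-8 text nor valid JSON.
                results[name] = "failed"
    return results


if __name__ == "__main__":
    for name, outcome in demo().items():
        print(name, "->", outcome)
```

A reader of this kind treats "v1.gz.metadata.json" as uncompressed JSON and fails on the compressed bytes, while "v1.metadata.json.gz" is decompressed first and parses normally.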


Any thoughts?

Kind regards,
Arina

Re: Metadata compression

Posted by Arina Yelchiyeva <ar...@gmail.com>.
I agree that this metadata is only for Iceberg and should not be read by other systems; Drill was just an example.
The main point is that having "gz" in the middle is confusing. I think the expectation is that if a file name ends with the ".json" suffix, the file is JSON.
Another option might be to remove ".json" from the metadata file names altogether, which might be less confusing.

Kind regards,
Arina

> On Jul 19, 2019, at 7:38 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> The intent here was to make it easier to identify the format of a file, but if this makes the files incompatible with other systems maybe we should change it back.
> 
> I think the argument against changing it back is that I wouldn't expect people to read these files with systems like Drill. Instead, we want to move to using metadata tables to inspect table state, like the recently added history, snapshots, manifests, and files tables.
> 


Re: Metadata compression

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
The intent here was to make it easier to identify the format of a file, but
if this makes the files incompatible with other systems maybe we should
change it back.

I think the argument against changing it back is that I wouldn't expect
people to read these files with systems like Drill. Instead, we want to
move to using metadata tables to inspect table state, like the recently
added history, snapshots, manifests, and files tables.

-- 
Ryan Blue
Software Engineer
Netflix