Posted to user@drill.apache.org by "Leyne, Sean" <Se...@BroadViewSoftware.com> on 2021/06/15 18:26:07 UTC

Supported compression methods

All,

The documentation describes gzip/gz compression as supported for text files, and snappy and gzip as supported for parquet files.

I have also read that zip compression was added (though not documented) for text files.


But is zip also supported for parquet files?

What about support for other compression algorithms/methods?  LZ4?  Bzip2? zstd??


Sean



RE: Supported compression methods

Posted by "Leyne, Sean" <Se...@BroadViewSoftware.com>.
Charles,

> Thanks for your interest in Drill.  Maybe we could take a step back here.
> Could you explain your use case in a little more detail?  It sounds to me like
> you'd like the ability to write compressed parquet files and to choose the
> compression codec.

My question came more out of a desire to compare the performance of various options based on my data files.

Further, my desire is to minimize the storage footprint of the Parquet files (while keeping acceptable query performance).
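
A rough way to get a first feel for the size/CPU trade-off, before involving Drill at all, is to time a few general-purpose codecs on a sample of the data. The sketch below uses only codecs from the Python standard library (DEFLATE via zlib, which is the algorithm gzip uses, plus bzip2 and xz); Snappy, LZ4 and Zstandard need third-party packages and are left out. The sample data and the `compare` helper are made up for illustration, and whole-file results will not match Parquet exactly, since Parquet compresses encoded column pages rather than raw files.

```python
import bz2
import lzma
import time
import zlib

# Standard-library codecs only; each entry is (compress, decompress).
CODECS = {
    "deflate (zlib)": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz (lzma)": (lzma.compress, lzma.decompress),
}


def compare(data: bytes) -> dict:
    """Return compression ratio and compression time per codec for `data`."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Guard against a broken round trip before trusting the numbers.
        if decompress(packed) != data:
            raise ValueError(f"{name} did not round-trip")
        results[name] = {"ratio": len(data) / len(packed), "seconds": elapsed}
    return results


# Illustrative sample; in practice read a slice of a real data file instead.
sample = b"id,name,value\n" + b"42,drill,3.14\n" * 50_000

for name, stats in compare(sample).items():
    print(f"{name}: ratio {stats['ratio']:.1f}, {stats['seconds']:.4f}s")
```

Timing decompression the same way would be the natural next step, since query performance depends on the read side.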


>  This might be a good feature to add as a config option.
> IE:  When you execute a CTAS query, you could select compression... or not.

I would be happy with just the option of defining the compression codec at the Storage Plugin level.

Personally, I prefer to keep SQL/CTAS clean.


Sean



Re: Supported compression methods

Posted by Charles Givre <cg...@gmail.com>.
Hey Sean, 
Thanks for your interest in Drill.  Maybe we could take a step back here.  Could you explain your use case in a little more detail?  It sounds to me like you'd like the ability to write compressed parquet files and to choose the compression codec.  This might be a good feature to add as a config option.  IE:  When you execute a CTAS query, you could select compression... or not.

Thanks!
-- C
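
For reference, a sketch of how the codec is picked per session at present, as opposed to per CTAS statement or per storage plugin. It assumes the `store.format` and `store.parquet.compression` session options and the value list `snappy`, `gzip`, `none` from the Drill documentation of the time; that list should be checked against the Drill version in use. The `ctas_statements` helper and the table and file names are hypothetical. Python is used only to assemble the SQL text.

```python
from typing import List

# Assumption: values accepted by `store.parquet.compression` at the time of this thread.
ALLOWED_CODECS = ("snappy", "gzip", "none")


def ctas_statements(codec: str, target: str, source: str) -> List[str]:
    """Build the SQL statements for a CTAS that writes Parquet with `codec`."""
    if codec not in ALLOWED_CODECS:
        raise ValueError(f"unsupported codec {codec!r}; expected one of {ALLOWED_CODECS}")
    return [
        "ALTER SESSION SET `store.format` = 'parquet'",
        f"ALTER SESSION SET `store.parquet.compression` = '{codec}'",
        f"CREATE TABLE {target} AS SELECT * FROM {source}",
    ]


# Hypothetical workspace, table and file names.
for statement in ctas_statements("gzip", "dfs.tmp.`events_gzip`", "dfs.`/data/events.csv`"):
    print(statement + ";")
```

With this approach the CTAS statement itself stays free of compression syntax, but the setting is scoped to the session rather than to a storage plugin.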


> On Jun 18, 2021, at 10:15 AM, Leyne, Sean <Se...@BroadViewSoftware.com> wrote:
> 
> James,
> 
>> -----Original Message-----
>> From: James Turton <ja...@somecomputer.xyz.INVALID>
> 
>> Zip is a file format, not a codec.  Various codecs are employed in Zip archives,
>> most commonly DEFLATE.  The different set of codecs that are supported in
>> the Parquet file format are described in
>> https://github.com/apache/parquet-format/blob/master/Compression.md.
> 
> Thanks for the link, the problem is that often the codec and the file format are synonymous, so people like myself don't make the distinction.
> 
> Not helping is the Drill use of the ambiguous "Compression Type" terminology rather than "codec" in the Drill options.
> 
> 
>> Since, then, Zip is not sensible or possible inside a Parquet file, the only way to
>> effect what you describe would be to embed a Parquet file inside a Zip
>> archive.  This would be perverse and misguided but possibly still queryable
>> since Drill might transparently do the right things to decode it anyway.  Using a
>> supported codec within the Parquet file format and forgetting about Zip is
>> certainly a better approach.
> 
> Might seem perverse to you; however, given that "zip compression" support for text files was added in v1.17.0 (DRILL-5674)*, I think it is a reasonable question to ask about support for Parquet files.
> 
> *there were no details on which of the codecs are supported.
> 
> 
>>   If you want compression ratios comparable to
>> those found in Zip files then you would choose GZip and pay with CPU
>> cycles.  When Drill gains support for Zstandard there will be little reason to
>> choose anything else.
> 
> This is another area of confusion, if Parquet provides support for ZSTD (as well as other codecs) why doesn't Drill?  
> 
> Isn't there a standard "Parquet Library" that is available which enables Parquet file support with all "features", which any project implementing Parquet file support would use?


Re: Supported compression methods

Posted by James Turton <ja...@somecomputer.xyz.INVALID>.
You're right, the distinctions are murky, including in my own comments 
here.  Anyway, zipping Parquet files would be like zipping JPEGs or 
PDFs.  Zip acts like tar in these cases but I guess a tarball of JPEGs 
is not unheard of *shrug*.
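
The JPEG comparison can be checked with any general-purpose codec: once data has been through DEFLATE, a second pass finds almost nothing left to squeeze. A small sketch using Python's zlib on made-up CSV-like rows (the data and variable names are illustrative only):

```python
import random
import zlib

# Deterministic, made-up rows standing in for uncompressed tabular data.
rng = random.Random(0)
words = ("drill", "parquet", "gzip", "snappy")
rows = [
    f"{rng.randint(0, 10**6)},{rng.choice(words)},{rng.random():.6f}\n"
    for _ in range(20_000)
]
raw = "".join(rows).encode("ascii")

once = zlib.compress(raw, 9)    # first pass: like a codec inside the file format
twice = zlib.compress(once, 9)  # second pass: like zipping the already compressed file

first_ratio = len(raw) / len(once)
second_ratio = len(once) / len(twice)

print(f"first pass ratio:  {first_ratio:.2f}")
print(f"second pass ratio: {second_ratio:.2f}")
```

The second ratio should come out close to 1, which is why wrapping an already compressed Parquet file in a Zip archive buys packaging but little or no space.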

Re. your last question, there is work that must be done in Drill to 
support new codecs, even though they are already standardised, and 
possibly even implemented in an upstream version of parquet-mr etc.
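
A sketch of why a codec being in the format specification does not make it readable by a given engine: the reader still needs a decompressor wired in for each one. The codec names are those listed in Compression.md to the best of my knowledge; the reader table and `decode_page` are hypothetical and are not Drill or parquet-mr code.

```python
import gzip

# Codec names from the Parquet format specification (listed from memory; verify against Compression.md).
SPEC_CODECS = ("UNCOMPRESSED", "SNAPPY", "GZIP", "LZO", "BROTLI", "LZ4", "ZSTD", "LZ4_RAW")

# Hypothetical reader: only the codecs it has actually integrated.
READER_DECOMPRESSORS = {
    "UNCOMPRESSED": lambda page: page,
    "GZIP": gzip.decompress,
}


def decode_page(codec: str, page: bytes) -> bytes:
    """Decompress one page, distinguishing 'not in the spec' from 'not implemented here'."""
    if codec not in SPEC_CODECS:
        raise ValueError(f"{codec} is not a Parquet codec")
    decompress = READER_DECOMPRESSORS.get(codec)
    if decompress is None:
        raise NotImplementedError(f"{codec} is in the spec but this reader lacks support")
    return decompress(page)


for codec in SPEC_CODECS:
    status = "supported" if codec in READER_DECOMPRESSORS else "spec only"
    print(f"{codec}: {status}")
```

In this picture "ZIP" fails the first check (it is not a codec in the format at all), while "ZSTD" fails the second (standardised, but not yet wired into this particular reader).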


On 2021/06/18 16:15, Leyne, Sean wrote:
> James,
>
>> -----Original Message-----
>> From: James Turton <ja...@somecomputer.xyz.INVALID>
>> Zip is a file format, not a codec.  Various codecs are employed in Zip archives,
>> most commonly DEFLATE.  The different set of codecs that are supported in
>> the Parquet file format are described in
>> https://github.com/apache/parquet-format/blob/master/Compression.md.
> Thanks for the link, the problem is that often the codec and the file format are synonymous, so people like myself don't make the distinction.
>
> Not helping is the Drill use of the ambiguous "Compression Type" terminology rather than "codec" in the Drill options.
>
>
>> Since, then, Zip is not sensible or possible inside a Parquet file, the only way to
>> effect what you describe would be to embed a Parquet file inside a Zip
>> archive.  This would be perverse and misguided but possibly still queryable
>> since Drill might transparently do the right things to decode it anyway.  Using a
>> supported codec within the Parquet file format and forgetting about Zip is
>> certainly a better approach.
> Might seem perverse to you; however, given that "zip compression" support for text files was added in v1.17.0 (DRILL-5674)*, I think it is a reasonable question to ask about support for Parquet files.
>
> *there were no details on which of the codecs are supported.
>
>
>>    If you want compression ratios comparable to
>> those found in Zip files then you would choose GZip and pay with CPU
>> cycles.  When Drill gains support for Zstandard there will be little reason to
>> choose anything else.
> This is another area of confusion, if Parquet provides support for ZSTD (as well as other codecs) why doesn't Drill?
>
> Isn't there a standard "Parquet Library" that is available which enables Parquet file support with all "features", which any project implementing Parquet file support would use?


RE: Supported compression methods

Posted by "Leyne, Sean" <Se...@BroadViewSoftware.com>.
James,

> -----Original Message-----
> From: James Turton <ja...@somecomputer.xyz.INVALID>

> Zip is a file format, not a codec.  Various codecs are employed in Zip archives,
> most commonly DEFLATE.  The different set of codecs that are supported in
> the Parquet file format are described in
> https://github.com/apache/parquet-format/blob/master/Compression.md.

Thanks for the link, the problem is that often the codec and the file format are synonymous, so people like myself don't make the distinction.

Not helping is the Drill use of the ambiguous "Compression Type" terminology rather than "codec" in the Drill options.


> Since, then, Zip is not sensible or possible inside a Parquet file, the only way to
> effect what you describe would be to embed a Parquet file inside a Zip
> archive.  This would be perverse and misguided but possibly still queryable
> since Drill might transparently do the right things to decode it anyway.  Using a
> supported codec within the Parquet file format and forgetting about Zip is
> certainly a better approach.

Might seem perverse to you; however, given that "zip compression" support for text files was added in v1.17.0 (DRILL-5674)*, I think it is a reasonable question to ask about support for Parquet files.

*there were no details on which of the codecs are supported.


>  If you want compression ratios comparable to
> those found in Zip files then you would choose GZip and pay with CPU
> cycles.  When Drill gains support for Zstandard there will be little reason to
> choose anything else.

This is another area of confusion, if Parquet provides support for ZSTD (as well as other codecs) why doesn't Drill?  

Isn't there a standard "Parquet Library" that is available which enables Parquet file support with all "features", which any project implementing Parquet file support would use?





Re: Supported compression methods

Posted by James Turton <ja...@somecomputer.xyz.INVALID>.
Zip is a file format, not a codec.  Various codecs are employed in Zip 
archives, most commonly DEFLATE.  The different set of codecs that are 
supported in the Parquet file format are described in 
https://github.com/apache/parquet-format/blob/master/Compression.md. 
Since, then, Zip is not sensible or possible inside a Parquet file, the 
only way to effect what you describe would be to embed a Parquet file 
inside a Zip archive.  This would be perverse and misguided but possibly 
still queryable since Drill might transparently do the right things to 
decode it anyway.  Using a supported codec within the Parquet file 
format and forgetting about Zip is certainly a better approach.  If you 
want compression ratios comparable to those found in Zip files then you 
would choose GZip and pay with CPU cycles.  When Drill gains support for 
Zstandard there will be little reason to choose anything else.
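
To make the container/codec distinction concrete, here is a small sketch using Python's standard zipfile and zlib modules. The archive records, per entry, which compression method was used; DEFLATE itself is just a function from bytes to bytes. File names and payload are made up.

```python
import io
import zipfile
import zlib

payload = b"drill " * 1000

# A codec: bytes in, fewer bytes out, no notion of files or names.
deflated = zlib.compress(payload)

# A container: named entries, each tagged with the codec used for it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("stored.txt", payload, compress_type=zipfile.ZIP_STORED)
    archive.writestr("deflated.txt", payload, compress_type=zipfile.ZIP_DEFLATED)

with zipfile.ZipFile(buffer) as archive:
    methods = {info.filename: info.compress_type for info in archive.infolist()}
    sizes = {info.filename: info.compress_size for info in archive.infolist()}

for name in methods:
    print(f"{name}: method={methods[name]}, compressed size={sizes[name]}")
print(f"raw DEFLATE via zlib: {len(deflated)} bytes from {len(payload)}")
```

One archive here holds an uncompressed entry and a DEFLATE-compressed one side by side, which is the sense in which "zip" names a file format rather than a single codec.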

On 2021/06/17 18:59, Leyne, Sean wrote:
> Luoc,
>
>>    Could you please tell me first which case you are talking about? Only
>> write(CTAS syntax) or read(SELECT)?
> Really both, since you need a mechanism to create the zip'd parquet file to begin with.  Having to create a special/side process to zip the file outside of drill would be ... awkward.
>
>
> Sean


RE: Supported compression methods

Posted by "Leyne, Sean" <Se...@BroadViewSoftware.com>.
Luoc,

>   Could you please tell me first which case you are talking about? Only
> write(CTAS syntax) or read(SELECT)?

Really both, since you need a mechanism to create the zip'd parquet file to begin with.  Having to create a special/side process to zip the file outside of drill would be ... awkward.


Sean



Re: Supported compression methods

Posted by luoc <lu...@apache.org>.
Hi Sean,
  Could you please tell me first which case you are talking about? Only write(CTAS syntax) or read(SELECT)?

> On June 16, 2021, at 02:26, Leyne, Sean <Se...@broadviewsoftware.com> wrote:
> 
> All,
> 
> The documentation describes gzip/gz compression as supported for text files, and snappy and gzip as supported for parquet files.
> 
> I have also read that zip compression was added (though not documented) for text files.
> 
> 
> But is zip also supported for parquet files?
> 
> What about support for other compression algorithms/methods?  LZ4?  Bzip2? zstd??
> 
> 
> Sean
> 
> 
>