Posted to dev@parquet.apache.org by "Dong, Xin" <xi...@intel.com> on 2020/06/04 04:46:27 UTC

Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Hi, All,

The existing Parquet compression codec framework supports only codec-name-based lookup of compression implementations. It is a one-to-one mapping, which means only one implementation is supported for a given codec name.
However, there are various implementations for the same codec name, and different implementations may not be compatible with one another because they serve different purposes. Taking Gzip as an example, some accelerators are limited in memory capacity, so their history buffer is relatively smaller than that of a CPU-based implementation. Currently, the codec framework doesn't provide a mechanism that allows users to customize a standard compression codec for their own purposes (e.g. performance acceleration, workload offloading).
To address the problem, we propose a provider-aware compression codec lookup for parquet-mr. We've put the proposal here:
https://docs.google.com/document/d/1sbCjDxEjM5UkbMPNmGqEfF-LYPDWhM-B474dZZeOFD4/edit?ts=5ecb2462#heading=h.5b2qz2ba32wm
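
For readers who cannot open the document, the core idea (keying the lookup on a codec name plus a provider ID rather than on the codec name alone) might be sketched roughly as follows. This is an illustrative Python sketch only, not the parquet-mr API (parquet-mr itself is Java). `CodecRegistry`, the "built-in" and "accelerator" provider IDs, and the use of zlib framing with a reduced window as a stand-in for a memory-limited accelerator are all assumptions made for the example.

```python
import zlib
from typing import Callable, Dict, Optional, Tuple

# Hypothetical types and names; nothing here is taken from parquet-mr.
Compressor = Callable[[bytes], bytes]


class CodecRegistry:
    """Looks up a compressor by (codec name, provider ID), not by name alone."""

    DEFAULT_PROVIDER = "built-in"  # assumed ID for the standard implementation

    def __init__(self) -> None:
        self._codecs: Dict[Tuple[str, str], Compressor] = {}

    def register(self, codec: str, provider: str, compressor: Compressor) -> None:
        self._codecs[(codec.upper(), provider)] = compressor

    def lookup(self, codec: str, provider: Optional[str] = None) -> Compressor:
        # Without an explicit provider, behave like today's name-only lookup.
        key = (codec.upper(), provider or self.DEFAULT_PROVIDER)
        if key not in self._codecs:
            raise KeyError(f"no {key[0]} implementation from provider {key[1]!r}")
        return self._codecs[key]


def standard_compress(data: bytes) -> bytes:
    # Default zlib settings (32 KiB history window). zlib framing is used
    # here purely for illustration; it is not Parquet's GZIP framing.
    return zlib.compress(data)


def small_window_compress(data: bytes) -> bytes:
    # wbits=9 limits the history window to 512 bytes, standing in for an
    # accelerator with little memory.
    compressor = zlib.compressobj(wbits=9)
    return compressor.compress(data) + compressor.flush()


registry = CodecRegistry()
registry.register("GZIP", "built-in", standard_compress)
registry.register("GZIP", "accelerator", small_window_compress)

payload = b"parquet column chunk " * 200
default_bytes = registry.lookup("GZIP")(payload)
accelerated_bytes = registry.lookup("GZIP", "accelerator")(payload)
```

In this particular sketch both outputs remain readable by the standard zlib decoder, because a decoder with a larger window can always read a stream produced with a smaller one.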

Any comments are welcome; please let us know your feedback.

Thanks,
Xin Dong

RE: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Posted by "Xu, Cheng A" <ch...@intel.com>.
Hi Micah, thank you for your suggestions.

We discussed this in the Parquet sync-up meeting. Besides the metadata items you suggested (compression level and block size), we will also provide information such as vendor name and version. We will refine the proposal further (https://docs.google.com/document/d/1ueSYq2FIzaom23cpHXppig93ylOxe8CU6EwS82dov2E/edit#heading=h.5b2qz2ba32wm for better access).

Incompatibility with the standard compression codec is not the expectation here. The expected behavior is: 1) use the user-defined or optimized implementation provided by users first; 2) fall back to the standard codec implementation. If a compatibility issue occurs, the framework will throw an error to end users directly.
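
The lookup order and error behavior described above could look roughly like this. The sketch is in Python for brevity (parquet-mr is Java), and every name in it (`resolve_codec`, `decode_or_raise`, `CodecCompatibilityError`, the provider IDs) is hypothetical rather than part of the proposal. zlib with a reduced window again stands in for a memory-limited implementation.

```python
import zlib
from typing import Callable, Dict, Tuple

Decoder = Callable[[bytes], bytes]


class CodecCompatibilityError(Exception):
    """Raised when the chosen implementation cannot handle the data."""


def resolve_codec(
    codec: str,
    user_providers: Dict[str, Dict[str, Decoder]],
    standard_codecs: Dict[str, Decoder],
) -> Tuple[str, Decoder]:
    # 1) Prefer a user-defined or optimized implementation, if one is registered.
    for provider_id, codecs in user_providers.items():
        if codec in codecs:
            return provider_id, codecs[codec]
    # 2) Otherwise fall back to the standard implementation.
    if codec in standard_codecs:
        return "built-in", standard_codecs[codec]
    raise KeyError(f"unknown codec: {codec}")


def decode_or_raise(
    codec: str,
    data: bytes,
    user_providers: Dict[str, Dict[str, Decoder]],
    standard_codecs: Dict[str, Decoder],
) -> bytes:
    provider_id, decode = resolve_codec(codec, user_providers, standard_codecs)
    try:
        return decode(data)
    except zlib.error as exc:
        # Surface the problem to the end user instead of hiding it.
        raise CodecCompatibilityError(
            f"{provider_id} implementation of {codec} could not decode the data"
        ) from exc


# Assumed setup: the standard decoder, plus a user-provided decoder that only
# accepts streams declaring a history window of at most 512 bytes (wbits=9).
standard_codecs = {"GZIP": zlib.decompress}
user_providers = {"accelerator": {"GZIP": lambda data: zlib.decompress(data, 9)}}

provider_id, _ = resolve_codec("GZIP", user_providers, standard_codecs)
fallback_id, _ = resolve_codec("GZIP", {}, standard_codecs)
```

Here `provider_id` is "accelerator" because a user-provided implementation exists, while `fallback_id` is "built-in". Data written with the default 32 KiB window would make `decode_or_raise` raise `CodecCompatibilityError` when the accelerator decoder is selected.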

Thanks
Cheng Xu

-----Original Message-----
From: Micah Kornfield <em...@gmail.com> 
Sent: Tuesday, June 23, 2020 11:57 AM
To: Parquet Dev <de...@parquet.apache.org>
Subject: Re: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Instead of a custom compressor name, would it be sufficient to expose more metadata about the parameters a particular codec used for compression (e.g. compression level used or block size)?  I'm not sure how standardized these are across the implementations/versions of the codecs currently supported, so it might not be.

Just to clarify the proposal: is the suggestion that there will be some compression algorithms that won't be decodable by the standard library implementation of each codec?  I agree with Gabor that this shouldn't be supported.

If the only requirement is that, in some cases, a customized decoder needs to be supported, and the customized decoder might not support decoding all encoded data, then this seems less bad.  However, it is still a slippery slope.  If this is the case, here are two other options to consider:
1.  Always use the plugin by default and, if it fails, fall back to the normal codec.
2.  Don't specify a name explicitly, but add a customization hook that accepts the entire set of metadata from the footer.

-Micah

Re: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Posted by Micah Kornfield <em...@gmail.com>.
Instead of a custom compressor name, would it be sufficient to expose more
metadata about the parameters a particular codec used for compression (e.g.
compression level used or block size)?  I'm not sure how standardized these
are across the implementations/versions of the codecs currently supported,
so it might not be.

Just to clarify the proposal: is the suggestion that there will be some
compression algorithms that won't be decodable by the standard library
implementation of each codec?  I agree with Gabor that this shouldn't be
supported.

If the only requirement is that, in some cases, a customized decoder needs
to be supported, and the customized decoder might not support decoding all
encoded data, then this seems less bad.  However, it is still a slippery
slope.  If this is the case, here are two other options to consider:
1.  Always use the plugin by default and, if it fails, fall back to the
normal codec.
2.  Don't specify a name explicitly, but add a customization hook that
accepts the entire set of metadata from the footer.
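
The two options above can be sketched as follows. This is a hedged Python illustration, not parquet-mr code (parquet-mr is Java). `limited_plugin_decode`, `decode_with_fallback`, `choose_decoder`, `accelerator_hook` and the "window_bits" metadata key are all hypothetical; in particular, Parquet footers do not define a "window_bits" entry.

```python
import zlib
from typing import Callable, Dict, Optional, Tuple

Decoder = Callable[[bytes], bytes]
Hook = Callable[[Dict[str, str]], Optional[Decoder]]


def limited_plugin_decode(data: bytes) -> bytes:
    # Hypothetical plugin: wbits=9 only accepts zlib streams that declare a
    # history window of at most 512 bytes.
    return zlib.decompress(data, 9)


def decode_with_fallback(
    data: bytes,
    plugin: Decoder = limited_plugin_decode,
    standard: Decoder = zlib.decompress,
) -> Tuple[bytes, str]:
    # Option 1: always try the plugin first; if it fails, use the normal codec.
    try:
        return plugin(data), "plugin"
    except zlib.error:
        return standard(data), "standard"


def choose_decoder(
    footer_metadata: Dict[str, str],
    hook: Hook,
    standard: Decoder = zlib.decompress,
) -> Decoder:
    # Option 2: the hook sees the whole footer metadata and may decline.
    custom = hook(footer_metadata)
    return custom if custom is not None else standard


def accelerator_hook(footer_metadata: Dict[str, str]) -> Optional[Decoder]:
    # "window_bits" is an invented key used only for this example.
    if footer_metadata.get("window_bits") == "9":
        return limited_plugin_decode
    return None


payload = b"row group bytes " * 500
large_window = zlib.compress(payload)      # default 32 KiB window
small = zlib.compressobj(wbits=9)          # 512-byte window
small_window = small.compress(payload) + small.flush()

from_plugin = decode_with_fallback(small_window)
from_standard = decode_with_fallback(large_window)
```

With option 1 the caller never needs to name a provider, at the cost of a failed attempt whenever the plugin cannot handle the data. With option 2 the decision is made up front from whatever the footer records.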

-Micah

On Mon, Jun 22, 2020 at 6:56 AM Xu, Cheng A <ch...@intel.com> wrote:

> Thanks, Gabor, for the comments. The document has been updated with
> comment access:
> https://docs.google.com/document/d/1ueSYq2FIzaom23cpHXppig93ylOxe8CU6EwS82dov2E/edit
>
> Yes, ideally, we should keep all codecs backward compatible with
> customized ones. However, in some cases that is hard to support. Some
> users may rely on accelerators to do the compression work. Those
> accelerators are limited in memory, which doesn't allow a large history
> buffer for decompression.
>
> My understanding of this proposal is that we are trying to introduce a
> framework that allows customers to customize their compression codec. It
> is the customer's own responsibility if they use an incompatible format
> in return for good performance.
> This is similar to what airlift did. Airlift is actually a codec provider:
> it provides a few of the codecs supported by Parquet. We can have some
> officially supported codec provider IDs, such as built-in and airlift, and
> users can make their own decisions to extend the providers with their own
> codec providers.
>
> Your thoughts on this?
>
> Thanks
> Cheng Xu
>

RE: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Posted by "Xu, Cheng A" <ch...@intel.com>.
Thanks, Gabor, for the comments. The document has been updated with comment access: https://docs.google.com/document/d/1ueSYq2FIzaom23cpHXppig93ylOxe8CU6EwS82dov2E/edit

Yes, ideally, we should keep all codecs backward compatible with customized ones. However, in some cases that is hard to support. Some users may rely on accelerators to do the compression work. Those accelerators are limited in memory, which doesn't allow a large history buffer for decompression.

My understanding of this proposal is that we are trying to introduce a framework that allows customers to customize their compression codec. It is the customer's own responsibility if they use an incompatible format in return for good performance.
This is similar to what airlift did. Airlift is actually a codec provider: it provides a few of the codecs supported by Parquet. We can have some officially supported codec provider IDs, such as built-in and airlift, and users can make their own decisions to extend the providers with their own codec providers.

Your thoughts on this?

Thanks
Cheng Xu

-----Original Message-----
From: Gabor Szadovszky <ga...@apache.org> 
Sent: Monday, June 22, 2020 5:09 PM
To: Parquet Dev <de...@parquet.apache.org>
Subject: Re: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Hi Cheng Xu,

It would be easier if we had comment access to the document.
After a first look, I have the following comments:
- "different [codec] implementations may not be compatible with others due to different purposes." - This is a huge problem. Parquet specifies the compression codecs that the format supports. We've already had issues caused by not specifying the codecs properly (see PARQUET-1241 <https://issues.apache.org/jira/browse/PARQUET-1241> for details). We shall not allow situations like this one. A parquet file written with a compression codec from the spec shall be readable by any other parquet implementation that supports that codec, independently of the provider.
- Providers of the compression codecs are usually implementation dependent. How would different parquet implementations handle the different providers? (e.g. a Java-based compression provider is to be used by parquet-cpp)
- How do we specify the provider names?

Regards,
Gabor

Re: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi Cheng Xu,

It would be easier if we had comment access to the document.
After a first look, I have the following comments:
- "different [codec] implementations may not be compatible with others due
to different purposes." - This is a huge problem. Parquet specifies the
compression codecs that the format supports. We've already had issues
caused by not specifying the codecs properly (see PARQUET-1241
<https://issues.apache.org/jira/browse/PARQUET-1241> for details). We shall
not allow situations like this one. A parquet file written with a
compression codec from the spec shall be readable by any other parquet
implementation that supports that codec, independently of the provider.
- Providers of the compression codecs are usually implementation dependent.
How would different parquet implementations handle the different providers?
(e.g. a Java-based compression provider is to be used by parquet-cpp)
- How do we specify the provider names?

Regards,
Gabor

On Fri, Jun 19, 2020 at 4:30 PM Xu, Cheng A <ch...@intel.com> wrote:

> Hi folks, any suggestions on this?
>
> Thanks
> Cheng Xu
>

RE: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Posted by "Xu, Cheng A" <ch...@intel.com>.
Hi folks, any suggestions on this?

Thanks
Cheng Xu

-----Original Message-----
From: Dong, Xin <xi...@intel.com> 
Sent: Friday, June 5, 2020 2:19 PM
To: dev@parquet.apache.org
Subject: RE: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Hi, Walid,

We've moved the doc here for public access:
https://docs.google.com/document/d/1ueSYq2FIzaom23cpHXppig93ylOxe8CU6EwS82dov2E/

Thanks,
Xin Dong

RE: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Posted by "Dong, Xin" <xi...@intel.com>.
Hi, Walid,

We've moved the doc here for public access:
https://docs.google.com/document/d/1ueSYq2FIzaom23cpHXppig93ylOxe8CU6EwS82dov2E/

Thanks,
Xin Dong

-----Original Message-----
From: Gara Walid <gw...@gmail.com> 
Sent: Thursday, June 4, 2020 2:14 PM
To: dev@parquet.apache.org
Subject: Re: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Hi Xin,

Thanks for the proposal. Could you please make the Google doc public?

Cheers,
Walid

Re: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

Posted by Gara Walid <gw...@gmail.com>.
Hi Xin,

Thanks for the proposal. Could you please make the Google doc public?

Cheers,
Walid

On Thu, Jun 4, 2020, 6:46 AM Dong, Xin <xi...@intel.com> wrote:

> Hi, All,
>
> The existing Parquet compression codec framework supports only
> codec-name-based lookup of compression implementations. It is a
> one-to-one mapping, which means only one implementation is supported
> for a given codec name.
> However, there are various implementations for the same codec name,
> and different implementations may not be compatible with one another
> because they serve different purposes. Taking Gzip as an example, some
> accelerators are limited in memory capacity, so their history buffer
> is relatively smaller than that of a CPU-based implementation.
> Currently, the codec framework doesn't provide a mechanism that allows
> users to customize a standard compression codec for their own purposes
> (e.g. performance acceleration, workload offloading).
> To address the problem, we propose a provider-aware compression codec
> lookup for parquet-mr. We've put the proposal here:
>
> https://docs.google.com/document/d/1sbCjDxEjM5UkbMPNmGqEfF-LYPDWhM-B474dZZeOFD4/edit?ts=5ecb2462#heading=h.5b2qz2ba32wm
>
> Any comments are welcome; please let us know your feedback.
>
> Thanks,
> Xin Dong
>