Posted to dev@arrow.apache.org by "Xie, Qi" <qi...@intel.com> on 2020/06/19 14:50:19 UTC

Proposal for the plugin API to support user customized compression codec

Hi,


To obtain better performance, many end users want to leverage accelerators (e.g. FPGA, Intel QAT) to offload compression. However, the current Arrow compression framework only supports codec-name-based compression implementations and can't be customized to leverage accelerators. For example, for the gzip format, we can't call a customized codec to accelerate compression. We would like to propose a plugin API to support customized compression codecs. We've put the proposal here:



https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
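
To give a flavor of the idea in code, here is a minimal sketch. All
names below (MyCodec, CodecFactory, RegisterCodecFactory) are
illustrative only, not Arrow's current API and not necessarily the API
in the document:

// Minimal sketch of the plugin idea; all names are illustrative.
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Codec interface, loosely modeled on arrow::util::Codec.
class MyCodec {
 public:
  virtual ~MyCodec() = default;
  virtual std::string name() const = 0;
  virtual int64_t Compress(int64_t in_len, const uint8_t* in,
                           int64_t out_capacity, uint8_t* out) = 0;
  virtual int64_t Decompress(int64_t in_len, const uint8_t* in,
                             int64_t out_capacity, uint8_t* out) = 0;
};

using CodecFactory = std::function<std::unique_ptr<MyCodec>()>;

// Process-wide factory table: an accelerator-backed implementation can
// be registered under an existing format name such as "gzip", provided
// it still produces standard gzip framing.
std::map<std::string, CodecFactory>& CodecRegistry() {
  static std::map<std::string, CodecFactory> registry;
  return registry;
}

void RegisterCodecFactory(const std::string& name, CodecFactory factory) {
  CodecRegistry()[name] = std::move(factory);
}

std::unique_ptr<MyCodec> MakeCodec(const std::string& name) {
  auto it = CodecRegistry().find(name);
  return it == CodecRegistry().end() ? nullptr : it->second();
}

A QAT-backed gzip would then be installed once at startup via the
registration call, with no change to call sites that ask for "gzip".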



Any comments are welcome; please let us know your feedback.



Thanks,

XieQi




Re: Proposal for the plugin API to support user customized compression codec

Posted by Antoine Pitrou <so...@pitrou.net>.
What is the performance of, say, HW GZip against SW ZSTD?

Regards

Antoine.


On Thu, 25 Jun 2020 07:06:58 +0000
"Xu, Cheng A" <ch...@intel.com> wrote:
> Thanks Micah and Wes for the reply. W.r.t. the scope, we're working together with the Parquet community to refine our proposal. https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html
> 
> This proposal is more general to Arrow (indeed it can be used by native Parquet as well). Since Arrow is mostly an in-memory format for intermediate data, I would expect fewer backward-compatibility considerations than for the on-disk Parquet format. Considering this, we can discuss those two things separately. For the Parquet part, the behavior should be consistent with Java Parquet. For the Arrow part, it should also be compatible with the new extendable Parquet compression codec framework. We can start with the Parquet part first.
> 
> Thanks
> Cheng Xu
> 
> From: Micah Kornfield <em...@gmail.com>
> Sent: Tuesday, June 23, 2020 12:11 PM
> To: dev <de...@arrow.apache.org>
> Cc: Xu, Cheng A <ch...@intel.com>; Xie, Qi <qi...@intel.com>
> Subject: Re: Proposal for the plugin API to support user customized compression codec
> 
> It would be good to clarify the exact scope of this.  If it is particular to Parquet then we should wait for the discussion on dev@parquet to conclude before moving forward.  If it is more general to Arrow, then working through scenarios of how this would be used for decompression when the Codec can't support generic input would be useful (the codec library is a singleton across the Arrow codebase).
> 
> On Mon, Jun 22, 2020 at 4:23 PM Wes McKinney <we...@gmail.com> wrote:
> Hi XieQi,
> 
> Is the idea that your custom Gzip implementation would automatically
> override any places in the codebase where the built-in one would be
> used (like the Parquet codebase)? I see some things in the design doc
> about serializing the plugin information in the Parquet file metadata
> (assuming you want to speed up decompression of Parquet data pages) -- is
> there a reason to believe that the plugin would be _required_ in order
> to read the file? I recall some messages to the Parquet mailing list
> about user-defined codecs.
> 
> In general, having a plugin API to provide a means to substitute one
> functionally identical implementation for another seems reasonable to
> me (I could envision people customizing kernel execution in the future).
> We should try to create a general enough API so that it can be used for
> customizations beyond compression codecs, so we don't have to go
> through a design exercise to support plugin/algorithm overrides for
> something else. This is something we could hash out during code review
> -- I should have some opinions and I'm sure others will as well.
> 
> - Wes
> 
> On Fri, Jun 19, 2020 at 10:21 AM Xie, Qi <qi...@intel.com> wrote:
> >
> > Hi,
> >
> >
> > To obtain better performance, many end users want to leverage accelerators (e.g. FPGA, Intel QAT) to offload compression. However, the current Arrow compression framework only supports codec-name-based compression implementations and can't be customized to leverage accelerators. For example, for the gzip format, we can't call a customized codec to accelerate compression. We would like to propose a plugin API to support customized compression codecs. We've put the proposal here:
> >
> >
> >
> > https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
> >
> >
> >
> > Any comments are welcome; please let us know your feedback.
> >
> >
> >
> > Thanks,
> >
> > XieQi
> >
> >
> >  




Re: Proposal for the plugin API to support user customized compression codec

Posted by Micah Kornfield <em...@gmail.com>.
Hi Cheng Xu,

> Since Arrow is mostly an in-memory format for intermediate data, I would
> expect fewer backward-compatibility considerations than for the on-disk
> Parquet format.

1.  The Arrow file format is not ephemeral and now supports compressed
buffers.
2.  Even with other parts of Arrow being ephemeral, the compression
libraries are used as components in a generic IO subsystem
(see arrow/io/compressed.h in the codebase; a sketch of that path
follows below).  It would be good to work through the implications of
this.
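
For illustration, the IO path in question looks roughly like this
(a sketch based on arrow/io/compressed.h; exact signatures may differ
between Arrow versions):

// A codec plugged in for GZIP would be handed arbitrary streaming
// input here, not just Parquet data pages.
#include "arrow/io/compressed.h"
#include "arrow/io/file.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/compression.h"

arrow::Status ReadThroughCodec() {
  ARROW_ASSIGN_OR_RAISE(auto codec,
                        arrow::util::Codec::Create(arrow::Compression::GZIP));
  ARROW_ASSIGN_OR_RAISE(auto raw, arrow::io::ReadableFile::Open("data.gz"));
  ARROW_ASSIGN_OR_RAISE(
      auto input, arrow::io::CompressedInputStream::Make(codec.get(), raw));
  // Every Read() on `input` now goes through the codec's streaming
  // decompressor, whatever the stream happens to contain.
  return arrow::Status::OK();
}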

Thanks,
Micah

On Thu, Jun 25, 2020 at 12:07 AM Xu, Cheng A <ch...@intel.com> wrote:

> Thanks Micah and Wes for the reply. W.r.t. the scope, we're working
> together with the Parquet community to refine our proposal.
> https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html
>
>
>
> This proposal is more general to Arrow (indeed it can be used by
> native Parquet as well). Since Arrow is mostly an in-memory format for
> intermediate data, I would expect fewer backward-compatibility
> considerations than for the on-disk Parquet format. Considering this,
> we can discuss those two things separately. For the Parquet part, the
> behavior should be consistent with Java Parquet. For the Arrow part,
> it should also be compatible with the new extendable Parquet
> compression codec framework. We can start with the Parquet part first.
>
>
>
> Thanks
>
> Cheng Xu
>
>
>
> From: Micah Kornfield <em...@gmail.com>
> Sent: Tuesday, June 23, 2020 12:11 PM
> To: dev <de...@arrow.apache.org>
> Cc: Xu, Cheng A <ch...@intel.com>; Xie, Qi <qi...@intel.com>
> Subject: Re: Proposal for the plugin API to support user customized
> compression codec
>
>
>
> It would be good to clarify the exact scope of this.  If it is
> particular to Parquet then we should wait for the discussion on dev@parquet
> to conclude before moving forward.  If it is more general to Arrow, then
> working through scenarios of how this would be used for decompression when
> the Codec can't support generic input would be useful (the codec library is
> a singleton across the Arrow codebase).
>
>
>
> On Mon, Jun 22, 2020 at 4:23 PM Wes McKinney <we...@gmail.com> wrote:
>
> Hi XieQi,
>
> Is the idea that your custom Gzip implementation would automatically
> override any places in the codebase where the built-in one would be
> used (like the Parquet codebase)? I see some things in the design doc
> about serializing the plugin information in the Parquet file metadata
> (assuming you want to speed up decompression of Parquet data pages) -- is
> there a reason to believe that the plugin would be _required_ in order
> to read the file? I recall some messages to the Parquet mailing list
> about user-defined codecs.
>
> In general, having a plugin API to provide a means to substitute one
> functionally identical implementation for another seems reasonable to
> me (I could envision people customizing kernel execution in the future).
> We should try to create a general enough API so that it can be used for
> customizations beyond compression codecs, so we don't have to go
> through a design exercise to support plugin/algorithm overrides for
> something else. This is something we could hash out during code review
> -- I should have some opinions and I'm sure others will as well.
>
> - Wes
>
> On Fri, Jun 19, 2020 at 10:21 AM Xie, Qi <qi...@intel.com> wrote:
> >
> > Hi,
> >
> >
> > To obtain better performance, many end users want to leverage
> accelerators (e.g. FPGA, Intel QAT) to offload compression. However,
> the current Arrow compression framework only supports codec-name-based
> compression implementations and can't be customized to leverage
> accelerators. For example, for the gzip format, we can't call a
> customized codec to accelerate compression. We would like to propose a
> plugin API to support customized compression codecs. We've put the
> proposal here:
> >
> >
> >
> >
> https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
> >
> >
> >
> > Any comments are welcome; please let us know your feedback.
> >
> >
> >
> > Thanks,
> >
> > XieQi
> >
> >
> >
>
>

RE: Proposal for the plugin API to support user customized compression codec

Posted by "Xu, Cheng A" <ch...@intel.com>.
Thanks Micah and Wes for the reply. W.r.t. the scope, we're working together with the Parquet community to refine our proposal. https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html

This proposal is more general to Arrow (indeed it can be used by native Parquet as well). Since Arrow is mostly an in-memory format for intermediate data, I would expect fewer backward-compatibility considerations than for the on-disk Parquet format. Considering this, we can discuss those two things separately. For the Parquet part, the behavior should be consistent with Java Parquet. For the Arrow part, it should also be compatible with the new extendable Parquet compression codec framework. We can start with the Parquet part first.

Thanks
Cheng Xu

From: Micah Kornfield <em...@gmail.com>
Sent: Tuesday, June 23, 2020 12:11 PM
To: dev <de...@arrow.apache.org>
Cc: Xu, Cheng A <ch...@intel.com>; Xie, Qi <qi...@intel.com>
Subject: Re: Proposal for the plugin API to support user customized compression codec

It would be good to clarify the exact scope of this.  If it is particular to Parquet then we should wait for the discussion on dev@parquet to conclude before moving forward.  If it is more general to Arrow, then working through scenarios of how this would be used for decompression when the Codec can't support generic input would be useful (the codec library is a singleton across the Arrow codebase).

On Mon, Jun 22, 2020 at 4:23 PM Wes McKinney <we...@gmail.com> wrote:
Hi XieQi,

Is the idea that your custom Gzip implementation would automatically
override any places in the codebase where the built-in one would be
used (like the Parquet codebase)? I see some things in the design doc
about serializing the plugin information in the Parquet file metadata
(assuming you want to speed up decompression of Parquet data pages) -- is
there a reason to believe that the plugin would be _required_ in order
to read the file? I recall some messages to the Parquet mailing list
about user-defined codecs.

In general, having a plugin API to provide a means to substitute one
functionally identical implementation for another seems reasonable to
me (I could envision people customizing kernel execution in the future).
We should try to create a general enough API so that it can be used for
customizations beyond compression codecs, so we don't have to go
through a design exercise to support plugin/algorithm overrides for
something else. This is something we could hash out during code review
-- I should have some opinions and I'm sure others will as well.

- Wes

On Fri, Jun 19, 2020 at 10:21 AM Xie, Qi <qi...@intel.com> wrote:
>
> Hi,
>
>
> To obtain better performance, many end users want to leverage accelerators (e.g. FPGA, Intel QAT) to offload compression. However, the current Arrow compression framework only supports codec-name-based compression implementations and can't be customized to leverage accelerators. For example, for the gzip format, we can't call a customized codec to accelerate compression. We would like to propose a plugin API to support customized compression codecs. We've put the proposal here:
>
>
>
> https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
>
>
>
> Any comments are welcome; please let us know your feedback.
>
>
>
> Thanks,
>
> XieQi
>
>
>

Re: Proposal for the plugin API to support user customized compression codec

Posted by Micah Kornfield <em...@gmail.com>.
It would be good to clarify the exact scope of this.  If it is
particular to Parquet then we should wait for the discussion on dev@parquet
to conclude before moving forward.  If it is more general to Arrow, then
working through scenarios of how this would be used for decompression when
the Codec can't support generic input would be useful (the codec library is
a singleton across the Arrow codebase).
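
To make the generic-input scenario concrete, here is a sketch of why a
fallback path matters. HwGzipDecompress and SoftwareGzipDecompress are
hypothetical names, not real APIs:

#include <cstdint>

// Illustrative stub: a real accelerator path would return false when
// the device rejects the input (unsupported window size, device busy).
bool HwGzipDecompress(const uint8_t*, int64_t, uint8_t*, int64_t) {
  return false;
}

// Illustrative stub: a real implementation would call zlib here.
bool SoftwareGzipDecompress(const uint8_t*, int64_t, uint8_t*, int64_t) {
  return true;
}

// Because codec lookup is process-wide, every caller (Parquet pages,
// CompressedInputStream, the Arrow IPC file format) reaches this same
// entry point, so it must accept any valid gzip input, not only input
// the accelerator happens to support.
bool PluginGzipDecompress(const uint8_t* in, int64_t in_len,
                          uint8_t* out, int64_t out_capacity) {
  if (HwGzipDecompress(in, in_len, out, out_capacity)) {
    return true;
  }
  return SoftwareGzipDecompress(in, in_len, out, out_capacity);
}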

On Mon, Jun 22, 2020 at 4:23 PM Wes McKinney <we...@gmail.com> wrote:

> Hi XieQi,
>
> Is the idea that your custom Gzip implementation would automatically
> override any places in the codebase where the built-in one would be
> used (like the Parquet codebase)? I see some things in the design doc
> about serializing the plugin information in the Parquet file metadata
> (assuming you want to speed up decompression of Parquet data pages) -- is
> there a reason to believe that the plugin would be _required_ in order
> to read the file? I recall some messages to the Parquet mailing list
> about user-defined codecs.
>
> In general, having a plugin API to provide a means to substitute one
> functionally identical implementation for another seems reasonable to
> me (I could envision people customizing kernel execution in the future).
> We should try to create a general enough API so that it can be used for
> customizations beyond compression codecs, so we don't have to go
> through a design exercise to support plugin/algorithm overrides for
> something else. This is something we could hash out during code review
> -- I should have some opinions and I'm sure others will as well.
>
> - Wes
>
> On Fri, Jun 19, 2020 at 10:21 AM Xie, Qi <qi...@intel.com> wrote:
> >
> > Hi,
> >
> >
> > To obtain better performance, many end users want to leverage
> accelerators (e.g. FPGA, Intel QAT) to offload compression. However,
> the current Arrow compression framework only supports codec-name-based
> compression implementations and can't be customized to leverage
> accelerators. For example, for the gzip format, we can't call a
> customized codec to accelerate compression. We would like to propose a
> plugin API to support customized compression codecs. We've put the
> proposal here:
> >
> >
> >
> >
> https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
> >
> >
> >
> > Any comments are welcome; please let us know your feedback.
> >
> >
> >
> > Thanks,
> >
> > XieQi
> >
> >
> >
>

Re: Proposal for the plugin API to support user customized compression codec

Posted by Wes McKinney <we...@gmail.com>.
Hi XieQi,

Is the idea that your custom Gzip implementation would automatically
override any places in the codebase where the built-in one would be
used (like the Parquet codebase)? I see some things in the design doc
about serializing the plugin information in the Parquet file metadata
(assuming you want to speed up decompression of Parquet data pages) -- is
there a reason to believe that the plugin would be _required_ in order
to read the file? I recall some messages to the Parquet mailing list
about user-defined codecs.

In general, having a plugin API to provide a means to substitute one
functionally identical implementation for another seems reasonable to
me (I could envision people customizing kernel execution in the future).
We should try to create a general enough API so that it can be used for
customizations beyond compression codecs, so we don't have to go
through a design exercise to support plugin/algorithm overrides for
something else. This is something we could hash out during code review
-- I should have some opinions and I'm sure others will as well.
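
To make the generality point concrete, here is a sketch of a registry
that is not specific to compression (all names are hypothetical; none
of this exists in Arrow today):

#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

class Plugin {
 public:
  virtual ~Plugin() = default;
};

class PluginRegistry {
 public:
  // Keyed by (category, name), e.g. ("compression", "gzip") today or
  // ("kernel", "sum") for customized kernel execution later.
  void Register(const std::string& category, const std::string& name,
                std::shared_ptr<Plugin> plugin) {
    std::lock_guard<std::mutex> lock(mutex_);
    plugins_[{category, name}] = std::move(plugin);
  }

  std::shared_ptr<Plugin> Lookup(const std::string& category,
                                 const std::string& name) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = plugins_.find({category, name});
    return it == plugins_.end() ? nullptr : it->second;
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::pair<std::string, std::string>, std::shared_ptr<Plugin>>
      plugins_;
};

A codec plugin would then be just one category among others, and new
override points would not need their own registration machinery.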

- Wes

On Fri, Jun 19, 2020 at 10:21 AM Xie, Qi <qi...@intel.com> wrote:
>
> Hi,
>
>
> To obtain better performance, many end users want to leverage accelerators (e.g. FPGA, Intel QAT) to offload compression. However, the current Arrow compression framework only supports codec-name-based compression implementations and can't be customized to leverage accelerators. For example, for the gzip format, we can't call a customized codec to accelerate compression. We would like to propose a plugin API to support customized compression codecs. We've put the proposal here:
>
>
>
> https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
>
>
>
> Any comments are welcome; please let us know your feedback.
>
>
>
> Thanks,
>
> XieQi
>
>
>