Posted to dev@arrow.apache.org by "Xie, Qi" <qi...@intel.com> on 2020/10/19 01:30:26 UTC

[Discuss] Provide pluggable APIs to support user customized compression codec

Hi, all

As we discussed in the previous email, we are proposing a pluggable API to support user-customized compression codecs in Arrow.
See the proposal: https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
We want to redefine the scope of the pluggable API and discuss it with the community.

1. Goal
Through the plugin API, the end user can use a customized compression codec to override a built-in one, e.g. use an HW-GZIP codec in place of Arrow's built-in GZIP codec to speed up compression and decompression.
We do not plan to add new compression codecs to Arrow.
We are currently focused on the Parquet format and will support the Arrow format in the future, but some components should be common to Arrow, such as the plugin manager and the dynamic library loading module.

2. Compatibility with the Java implementation
Both implementations will write the plugin information to the Parquet key-value metadata, either at the FileMetaData level or at the ColumnMetaData level.
The plugin information includes the plugin library name (used by native Parquet) and the plugin class name (used by Java Parquet).
E.g. plugin_library_name:libgzipplugin.so, plugin_class_name:com.intel.icl.customizedGzipCodec
We are working with the Parquet community to refine our proposal: https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html
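As a rough sketch of what this amounts to (the helper name below is hypothetical; the key strings follow the example above but are not a finalized protocol), the plugin information is just two ordinary string key-value pairs:

```cpp
#include <map>
#include <string>

// Hypothetical helper: build the key-value metadata entries described above.
// The key names mirror the proposal's example; the real keys would be fixed
// during the discussion with the Parquet community.
std::map<std::string, std::string> MakePluginMetadata(
    const std::string& library_name, const std::string& class_name) {
  return {
      {"plugin_library_name", library_name},  // used by native (C++) Parquet
      {"plugin_class_name", class_name},      // used by Java Parquet
  };
}
```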

3. The end-user API
For write, the end user must indicate that they want to use a plugin codec, so we add a compression_plugin API to the Parquet WriterProperties builder. When this function is called, the internal Parquet writer will write the plugin_library_name and plugin_class_name to the Parquet key-value metadata. The end-user code snippet looks like this:
parquet::WriterProperties::Builder builder;
builder.compression(parquet::Compression::GZIP);
builder.compression_plugin("libGzipPlugin.so");
std::shared_ptr<parquet::WriterProperties> props = builder.build();



For read, the internal Parquet reader will first check whether there is plugin information in the metadata. For native Parquet, it will read plugin_library_name from the key-value metadata; if the key exists, it will load the plugin library automatically and return the plugin codec from GetReadCodec.

So no code change is needed for read; it is transparent to the end user on the Parquet read side.
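The read-path decision described above can be sketched as follows (all names are illustrative stand-ins, not the actual Arrow/Parquet internals): the reader inspects the file's key-value metadata and falls back to the built-in codec only when no plugin key is present.

```cpp
#include <map>
#include <optional>
#include <string>

// Illustrative stand-in for the Parquet file-level key-value metadata.
using KeyValueMetadata = std::map<std::string, std::string>;

// Hypothetical core of the GetReadCodec logic: return the plugin library
// to load if the writer recorded one, otherwise signal that the built-in
// codec should be used.
std::optional<std::string> SelectPluginLibrary(const KeyValueMetadata& metadata) {
  auto it = metadata.find("plugin_library_name");
  if (it == metadata.end()) {
    return std::nullopt;  // no plugin info recorded: use the built-in codec
  }
  // In the real design the reader would dlopen() this library and ask it
  // for a codec instance; here we only return the recorded name.
  return it->second;
}
```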



Looking forward to any other suggestions or feedback.

Thanks,
XieQi


RE: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by "Xie, Qi" <qi...@intel.com>.
Hi, Antoine

About the API you mentioned, I'd like to know what scope this API will cover. Is it the configuration API for overriding the built-in gzip?

Thanks,
XieQi
-----Original Message-----
From: Antoine Pitrou <an...@python.org> 
Sent: Tuesday, October 27, 2020 11:39 PM
To: Xie, Qi <qi...@intel.com>; dev@arrow.apache.org
Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
Subject: Re: [Discuss] Provide pluggable APIs to support user customized compression codec


Hi,

Le 27/10/2020 à 09:55, Xie, Qi a écrit :
> 
> The HW decompressor can't fall back automatically on SW decompression, but we can fallback to SW in the HW library.

Yes, that's what I meant :-)

> How about  HW-Gzip as an enhanced Gzip and still use the Compression::GZIP as Compression::type, the end user can through some configurations to enable HW-Gzip instead of the built-in Gzip in MakeGZipCodec?

That sounds reasonable.  API details will have to be discussed in a PR, but that sounds (IMHO) reasonable on the principle.

Also, note that this can be beneficial for other things than Parquet, for example reading a GZip-compressed CSV file (which right now would be bottlenecked by zlib performance).

I'll let others chime in.

Best regards

Antoine.


> 
> Thanks,
> XieQi
> -----Original Message-----
> From: Antoine Pitrou <an...@python.org>
> Sent: Thursday, October 22, 2020 7:20 PM
> To: dev@arrow.apache.org; Xie, Qi <qi...@intel.com>
> Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin 
> <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
> Subject: Re: [Discuss] Provide pluggable APIs to support user 
> customized compression codec
> 
> 
> Ok, thank you.  Another question: why doesn't the HW decompressor fall back automatically on SW decompression when the window size is too large?
> 
> That would avoid having to define metadata strings for this.
> 
> Regards
> 
> Antoine.
> 
> 
> Le 22/10/2020 à 10:38, Xie, Qi a écrit :
>> Yes, the HW-GZIP is able to work on multiple threads too, but the test program lzbench https://github.com/inikep/lzbench  seems work on single thread, so I can't run it with multiple threads.
>>
>> Thanks,
>> XieQi
>>
>> -----Original Message-----
>> From: Antoine Pitrou <an...@python.org>
>> Sent: Thursday, October 22, 2020 4:30 PM
>> To: Xie, Qi <qi...@intel.com>; dev <de...@arrow.apache.org>
>> Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin 
>> <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
>> Subject: Re: [Discuss] Provide pluggable APIs to support user 
>> customized compression codec
>>
>>
>> Le 22/10/2020 à 05:38, Xie, Qi a écrit :
>>> Hi,
>>>
>>> I just tested with the Intel QuickAssist Technology, which provide 
>>> hardware accelerate to GZIP, you can see detail here 
>>> https://www.intel.com/content/www/us/en/architecture-and-technology/
>>> i ntel-quick-assist-technology-overview.html
>>>
>>> Here is the benchmark result run on Intel(R) Xeon(R) Gold 6252 CPU @ 
>>> 2.10GHz with single thread
>>>
>>> lzbench 1.7.2 (64-bit Linux)   Assembled by P.Skibinski
>>> | Compressor name         | Compression| Decompress.| Compr. size | Ratio | Filename |
>>> | memcpy                  |  4942 MB/s |  5688 MB/s |     3263523 |  1.00 | calgary/calgary.tar |
>>> | qat 1.0.0                 |  2312 MB/s |  3538 MB/s |     1274379 |  2.56 | calgary/calgary.tar |
>>> | snappy 1.1.4          |   283 MB/s  |  1144 MB/s |     1686240 |  1.94 | calgary/calgary.tar |
>>> | lz4 1.7.5                  |   453 MB/s  |  2514 MB/s |     1685795 |  1.94 | calgary/calgary.tar |
>>> | zstd 1.3.1 -1           |   279 MB/s  |   723 MB/s  |     1187211 |  2.75 | calgary/calgary.tar |
>>> | zlib 1.2.11 -1          |    79 MB/s   |   261 MB/s  |     1240838 |  2.63 | calgary/calgary.tar |
>>
>> Very nice, thank you.  Is it able to work on multiple threads too?
>>
>> Regards
>>
>> Antoine.
>>

Re: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by Antoine Pitrou <an...@python.org>.
Hi,

Le 27/10/2020 à 09:55, Xie, Qi a écrit :
> 
> The HW decompressor can't fall back automatically on SW decompression, but we can fallback to SW in the HW library.

Yes, that's what I meant :-)

> How about  HW-Gzip as an enhanced Gzip and still use the Compression::GZIP as Compression::type, the end user can through some configurations to enable HW-Gzip instead of the built-in Gzip in MakeGZipCodec?

That sounds reasonable.  API details will have to be discussed in a PR,
but that sounds (IMHO) reasonable on the principle.

Also, note that this can be beneficial for other things than Parquet,
for example reading a GZip-compressed CSV file (which right now would be
bottlenecked by zlib performance).

I'll let others chime in.

Best regards

Antoine.


> 
> Thanks,
> XieQi
> -----Original Message-----
> From: Antoine Pitrou <an...@python.org> 
> Sent: Thursday, October 22, 2020 7:20 PM
> To: dev@arrow.apache.org; Xie, Qi <qi...@intel.com>
> Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
> Subject: Re: [Discuss] Provide pluggable APIs to support user customized compression codec
> 
> 
> Ok, thank you.  Another question: why doesn't the HW decompressor fall back automatically on SW decompression when the window size is too large?
> 
> That would avoid having to define metadata strings for this.
> 
> Regards
> 
> Antoine.
> 
> 
> Le 22/10/2020 à 10:38, Xie, Qi a écrit :
>> Yes, the HW-GZIP is able to work on multiple threads too, but the test program lzbench https://github.com/inikep/lzbench  seems work on single thread, so I can't run it with multiple threads.
>>
>> Thanks,
>> XieQi
>>
>> -----Original Message-----
>> From: Antoine Pitrou <an...@python.org>
>> Sent: Thursday, October 22, 2020 4:30 PM
>> To: Xie, Qi <qi...@intel.com>; dev <de...@arrow.apache.org>
>> Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin 
>> <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
>> Subject: Re: [Discuss] Provide pluggable APIs to support user 
>> customized compression codec
>>
>>
>> Le 22/10/2020 à 05:38, Xie, Qi a écrit :
>>> Hi,
>>>
>>> I just tested with the Intel QuickAssist Technology, which provide 
>>> hardware accelerate to GZIP, you can see detail here 
>>> https://www.intel.com/content/www/us/en/architecture-and-technology/i
>>> ntel-quick-assist-technology-overview.html
>>>
>>> Here is the benchmark result run on Intel(R) Xeon(R) Gold 6252 CPU @ 
>>> 2.10GHz with single thread
>>>
>>> lzbench 1.7.2 (64-bit Linux)   Assembled by P.Skibinski
>>> | Compressor name         | Compression| Decompress.| Compr. size | Ratio | Filename |
>>> | memcpy                  |  4942 MB/s |  5688 MB/s |     3263523 |  1.00 | calgary/calgary.tar |
>>> | qat 1.0.0                 |  2312 MB/s |  3538 MB/s |     1274379 |  2.56 | calgary/calgary.tar |
>>> | snappy 1.1.4          |   283 MB/s  |  1144 MB/s |     1686240 |  1.94 | calgary/calgary.tar |
>>> | lz4 1.7.5                  |   453 MB/s  |  2514 MB/s |     1685795 |  1.94 | calgary/calgary.tar |
>>> | zstd 1.3.1 -1           |   279 MB/s  |   723 MB/s  |     1187211 |  2.75 | calgary/calgary.tar |
>>> | zlib 1.2.11 -1          |    79 MB/s   |   261 MB/s  |     1240838 |  2.63 | calgary/calgary.tar |
>>
>> Very nice, thank you.  Is it able to work on multiple threads too?
>>
>> Regards
>>
>> Antoine.
>>

RE: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by "Xie, Qi" <qi...@intel.com>.
Hi, 

The HW decompressor can't fall back automatically on SW decompression, but we can fall back to SW inside the HW library.
How about treating HW-GZIP as an enhanced GZIP that still uses Compression::GZIP as the Compression::type? The end user could then enable HW-GZIP instead of the built-in GZIP in MakeGZipCodec through some configuration.

Thanks,
XieQi
-----Original Message-----
From: Antoine Pitrou <an...@python.org> 
Sent: Thursday, October 22, 2020 7:20 PM
To: dev@arrow.apache.org; Xie, Qi <qi...@intel.com>
Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
Subject: Re: [Discuss] Provide pluggable APIs to support user customized compression codec


Ok, thank you.  Another question: why doesn't the HW decompressor fall back automatically on SW decompression when the window size is too large?

That would avoid having to define metadata strings for this.

Regards

Antoine.


Le 22/10/2020 à 10:38, Xie, Qi a écrit :
> Yes, the HW-GZIP is able to work on multiple threads too, but the test program lzbench https://github.com/inikep/lzbench  seems work on single thread, so I can't run it with multiple threads.
> 
> Thanks,
> XieQi
> 
> -----Original Message-----
> From: Antoine Pitrou <an...@python.org>
> Sent: Thursday, October 22, 2020 4:30 PM
> To: Xie, Qi <qi...@intel.com>; dev <de...@arrow.apache.org>
> Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin 
> <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
> Subject: Re: [Discuss] Provide pluggable APIs to support user 
> customized compression codec
> 
> 
> Le 22/10/2020 à 05:38, Xie, Qi a écrit :
>> Hi,
>>
>> I just tested with the Intel QuickAssist Technology, which provide 
>> hardware accelerate to GZIP, you can see detail here 
>> https://www.intel.com/content/www/us/en/architecture-and-technology/i
>> ntel-quick-assist-technology-overview.html
>>
>> Here is the benchmark result run on Intel(R) Xeon(R) Gold 6252 CPU @ 
>> 2.10GHz with single thread
>>
>> lzbench 1.7.2 (64-bit Linux)   Assembled by P.Skibinski
>> | Compressor name         | Compression| Decompress.| Compr. size | Ratio | Filename |
>> | memcpy                  |  4942 MB/s |  5688 MB/s |     3263523 |  1.00 | calgary/calgary.tar |
>> | qat 1.0.0                 |  2312 MB/s |  3538 MB/s |     1274379 |  2.56 | calgary/calgary.tar |
>> | snappy 1.1.4          |   283 MB/s  |  1144 MB/s |     1686240 |  1.94 | calgary/calgary.tar |
>> | lz4 1.7.5                  |   453 MB/s  |  2514 MB/s |     1685795 |  1.94 | calgary/calgary.tar |
>> | zstd 1.3.1 -1           |   279 MB/s  |   723 MB/s  |     1187211 |  2.75 | calgary/calgary.tar |
>> | zlib 1.2.11 -1          |    79 MB/s   |   261 MB/s  |     1240838 |  2.63 | calgary/calgary.tar |
> 
> Very nice, thank you.  Is it able to work on multiple threads too?
> 
> Regards
> 
> Antoine.
> 

Re: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by Antoine Pitrou <an...@python.org>.
Ok, thank you.  Another question: why doesn't the HW decompressor fall
back automatically on SW decompression when the window size is too large?

That would avoid having to define metadata strings for this.

Regards

Antoine.


Le 22/10/2020 à 10:38, Xie, Qi a écrit :
> Yes, the HW-GZIP is able to work on multiple threads too, but the test program lzbench https://github.com/inikep/lzbench  seems work on single thread, so I can't run it with multiple threads.
> 
> Thanks,
> XieQi
> 
> -----Original Message-----
> From: Antoine Pitrou <an...@python.org> 
> Sent: Thursday, October 22, 2020 4:30 PM
> To: Xie, Qi <qi...@intel.com>; dev <de...@arrow.apache.org>
> Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
> Subject: Re: [Discuss] Provide pluggable APIs to support user customized compression codec
> 
> 
> Le 22/10/2020 à 05:38, Xie, Qi a écrit :
>> Hi, 
>>
>> I just tested with the Intel QuickAssist Technology, which provide hardware accelerate to GZIP, you can see detail here https://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html 
>>
>> Here is the benchmark result run on Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with single thread 
>>
>> lzbench 1.7.2 (64-bit Linux)   Assembled by P.Skibinski
>> | Compressor name         | Compression| Decompress.| Compr. size | Ratio | Filename |
>> | memcpy                  |  4942 MB/s |  5688 MB/s |     3263523 |  1.00 | calgary/calgary.tar |
>> | qat 1.0.0                 |  2312 MB/s |  3538 MB/s |     1274379 |  2.56 | calgary/calgary.tar |
>> | snappy 1.1.4          |   283 MB/s  |  1144 MB/s |     1686240 |  1.94 | calgary/calgary.tar |
>> | lz4 1.7.5                  |   453 MB/s  |  2514 MB/s |     1685795 |  1.94 | calgary/calgary.tar |
>> | zstd 1.3.1 -1           |   279 MB/s  |   723 MB/s  |     1187211 |  2.75 | calgary/calgary.tar |
>> | zlib 1.2.11 -1          |    79 MB/s   |   261 MB/s  |     1240838 |  2.63 | calgary/calgary.tar |
> 
> Very nice, thank you.  Is it able to work on multiple threads too?
> 
> Regards
> 
> Antoine.
> 

RE: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by "Xie, Qi" <qi...@intel.com>.
Yes, HW-GZIP is able to work on multiple threads too, but the test program lzbench (https://github.com/inikep/lzbench) seems to run on a single thread, so I can't run it with multiple threads.

Thanks,
XieQi

-----Original Message-----
From: Antoine Pitrou <an...@python.org> 
Sent: Thursday, October 22, 2020 4:30 PM
To: Xie, Qi <qi...@intel.com>; dev <de...@arrow.apache.org>
Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
Subject: Re: [Discuss] Provide pluggable APIs to support user customized compression codec


Le 22/10/2020 à 05:38, Xie, Qi a écrit :
> Hi, 
> 
> I just tested with the Intel QuickAssist Technology, which provide hardware accelerate to GZIP, you can see detail here https://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html 
> 
> Here is the benchmark result run on Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with single thread 
> 
> lzbench 1.7.2 (64-bit Linux)   Assembled by P.Skibinski
> | Compressor name         | Compression| Decompress.| Compr. size | Ratio | Filename |
> | memcpy                  |  4942 MB/s |  5688 MB/s |     3263523 |  1.00 | calgary/calgary.tar |
> | qat 1.0.0                 |  2312 MB/s |  3538 MB/s |     1274379 |  2.56 | calgary/calgary.tar |
> | snappy 1.1.4          |   283 MB/s  |  1144 MB/s |     1686240 |  1.94 | calgary/calgary.tar |
> | lz4 1.7.5                  |   453 MB/s  |  2514 MB/s |     1685795 |  1.94 | calgary/calgary.tar |
> | zstd 1.3.1 -1           |   279 MB/s  |   723 MB/s  |     1187211 |  2.75 | calgary/calgary.tar |
> | zlib 1.2.11 -1          |    79 MB/s   |   261 MB/s  |     1240838 |  2.63 | calgary/calgary.tar |

Very nice, thank you.  Is it able to work on multiple threads too?

Regards

Antoine.

Re: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by Antoine Pitrou <an...@python.org>.
Le 22/10/2020 à 05:38, Xie, Qi a écrit :
> Hi, 
> 
> I just tested with the Intel QuickAssist Technology, which provide hardware accelerate to GZIP, you can see detail here https://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html 
> 
> Here is the benchmark result run on Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with single thread 
> 
> lzbench 1.7.2 (64-bit Linux)   Assembled by P.Skibinski
> | Compressor name         | Compression| Decompress.| Compr. size | Ratio | Filename |
> | memcpy                  |  4942 MB/s |  5688 MB/s |     3263523 |  1.00 | calgary/calgary.tar |
> | qat 1.0.0                 |  2312 MB/s |  3538 MB/s |     1274379 |  2.56 | calgary/calgary.tar |
> | snappy 1.1.4          |   283 MB/s  |  1144 MB/s |     1686240 |  1.94 | calgary/calgary.tar |
> | lz4 1.7.5                  |   453 MB/s  |  2514 MB/s |     1685795 |  1.94 | calgary/calgary.tar |
> | zstd 1.3.1 -1           |   279 MB/s  |   723 MB/s  |     1187211 |  2.75 | calgary/calgary.tar |
> | zlib 1.2.11 -1          |    79 MB/s   |   261 MB/s  |     1240838 |  2.63 | calgary/calgary.tar |

Very nice, thank you.  Is it able to work on multiple threads too?

Regards

Antoine.

RE: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by "Xie, Qi" <qi...@intel.com>.
Hi, 

I just tested with Intel QuickAssist Technology, which provides hardware acceleration for GZIP; you can see the details here: https://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html

Here is the benchmark result, run on an Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with a single thread:

lzbench 1.7.2 (64-bit Linux)   Assembled by P.Skibinski
| Compressor name | Compression | Decompress. | Compr. size | Ratio | Filename            |
| memcpy          |  4942 MB/s  |  5688 MB/s  |     3263523 |  1.00 | calgary/calgary.tar |
| qat 1.0.0       |  2312 MB/s  |  3538 MB/s  |     1274379 |  2.56 | calgary/calgary.tar |
| snappy 1.1.4    |   283 MB/s  |  1144 MB/s  |     1686240 |  1.94 | calgary/calgary.tar |
| lz4 1.7.5       |   453 MB/s  |  2514 MB/s  |     1685795 |  1.94 | calgary/calgary.tar |
| zstd 1.3.1 -1   |   279 MB/s  |   723 MB/s  |     1187211 |  2.75 | calgary/calgary.tar |
| zlib 1.2.11 -1  |    79 MB/s  |   261 MB/s  |     1240838 |  2.63 | calgary/calgary.tar |

Thanks,
XieQi
-----Original Message-----
From: Wes McKinney <we...@gmail.com> 
Sent: Thursday, October 22, 2020 9:58 AM
To: dev <de...@arrow.apache.org>
Cc: antoine@python.org; Xu, Cheng A <ch...@intel.com>; Dong, Xin <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>; Xie, Qi <qi...@intel.com>
Subject: Re: [Discuss] Provide pluggable APIs to support user customized compression codec

Yes, I think he's asking about the motivation for the project. My understanding is that Snappy is used more often than Gzip with Parquet

On Wed, Oct 21, 2020 at 8:53 PM Xie, Qi <qi...@intel.com> wrote:
>
> Hi, Antoine
>
> Do you mean the performance data HW-GZIP compared with LZ4/ZSTD?
>
> Thanks,
> XieQi
>
> -----Original Message-----
> From: Antoine Pitrou <an...@python.org>
> Sent: Tuesday, October 20, 2020 10:38 PM
> To: dev@arrow.apache.org; Xie, Qi <qi...@intel.com>
> Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin 
> <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
> Subject: Re: [Discuss] Provide pluggable APIs to support user 
> customized compression codec
>
>
>
> Le 20/10/2020 à 12:09, Xie, Qi a écrit :
> > Hi, Wes
> >
> > Yes currently the purpose of the key-value metadata is just a hint to indicate that the parquet file is compressed by plugin so that the parquet reader can load the plugin library and use plugin to decompress the file.
> > There are many optimized GZIP implementations and may not compatible with the standard gzip, for example due to hardware limit, the HW-GZIP history window size maybe smaller than the standard gzip, so that HW-GZIP can't decompress the file compressed by standard gzip and because we are still use the Compression::GZIP as Compression::type, we need that metadata to distinguish it from the standard gzip.
>
> What does it bring over ZSTD or LZ4 exactly?
>
> Regards
>
> Antoine.

Re: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by Wes McKinney <we...@gmail.com>.
Yes, I think he's asking about the motivation for the project. My
understanding is that Snappy is used more often than Gzip with Parquet

On Wed, Oct 21, 2020 at 8:53 PM Xie, Qi <qi...@intel.com> wrote:
>
> Hi, Antoine
>
> Do you mean the performance data HW-GZIP compared with LZ4/ZSTD?
>
> Thanks,
> XieQi
>
> -----Original Message-----
> From: Antoine Pitrou <an...@python.org>
> Sent: Tuesday, October 20, 2020 10:38 PM
> To: dev@arrow.apache.org; Xie, Qi <qi...@intel.com>
> Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
> Subject: Re: [Discuss] Provide pluggable APIs to support user customized compression codec
>
>
>
> Le 20/10/2020 à 12:09, Xie, Qi a écrit :
> > Hi, Wes
> >
> > Yes currently the purpose of the key-value metadata is just a hint to indicate that the parquet file is compressed by plugin so that the parquet reader can load the plugin library and use plugin to decompress the file.
> > There are many optimized GZIP implementations and may not compatible with the standard gzip, for example due to hardware limit, the HW-GZIP history window size maybe smaller than the standard gzip, so that HW-GZIP can't decompress the file compressed by standard gzip and because we are still use the Compression::GZIP as Compression::type, we need that metadata to distinguish it from the standard gzip.
>
> What does it bring over ZSTD or LZ4 exactly?
>
> Regards
>
> Antoine.

RE: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by "Xie, Qi" <qi...@intel.com>.
Hi, Antoine

Do you mean the performance data for HW-GZIP compared with LZ4/ZSTD?

Thanks,
XieQi

-----Original Message-----
From: Antoine Pitrou <an...@python.org> 
Sent: Tuesday, October 20, 2020 10:38 PM
To: dev@arrow.apache.org; Xie, Qi <qi...@intel.com>
Cc: Xu, Cheng A <ch...@intel.com>; Dong, Xin <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
Subject: Re: [Discuss] Provide pluggable APIs to support user customized compression codec



Le 20/10/2020 à 12:09, Xie, Qi a écrit :
> Hi, Wes
> 
> Yes currently the purpose of the key-value metadata is just a hint to indicate that the parquet file is compressed by plugin so that the parquet reader can load the plugin library and use plugin to decompress the file.
> There are many optimized GZIP implementations and may not compatible with the standard gzip, for example due to hardware limit, the HW-GZIP history window size maybe smaller than the standard gzip, so that HW-GZIP can't decompress the file compressed by standard gzip and because we are still use the Compression::GZIP as Compression::type, we need that metadata to distinguish it from the standard gzip.

What does it bring over ZSTD or LZ4 exactly?

Regards

Antoine.

Re: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by Antoine Pitrou <an...@python.org>.

Le 20/10/2020 à 12:09, Xie, Qi a écrit :
> Hi, Wes
> 
> Yes currently the purpose of the key-value metadata is just a hint to indicate that the parquet file is compressed by plugin so that the parquet reader can load the plugin library and use plugin to decompress the file.
> There are many optimized GZIP implementations and may not compatible with the standard gzip, for example due to hardware limit, the HW-GZIP history window size maybe smaller than the standard gzip, so that HW-GZIP can't decompress the file compressed by standard gzip and because we are still use the Compression::GZIP as Compression::type, we need that metadata to distinguish it from the standard gzip.

What does it bring over ZSTD or LZ4 exactly?

Regards

Antoine.

RE: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by "Xie, Qi" <qi...@intel.com>.
Hi, Wes

Yes, currently the purpose of the key-value metadata is just a hint indicating that the Parquet file was compressed by a plugin, so that the Parquet reader can load the plugin library and use the plugin to decompress the file.
There are many optimized GZIP implementations that may not be compatible with standard gzip. For example, due to hardware limits, the HW-GZIP history window size may be smaller than standard gzip's, so HW-GZIP can't decompress a file compressed by standard gzip. And because we still use Compression::GZIP as the Compression::type, we need that metadata to distinguish it from standard gzip.

Thanks,
XieQi

-----Original Message-----
From: Wes McKinney <we...@gmail.com> 
Sent: Tuesday, October 20, 2020 11:06 AM
To: dev <de...@arrow.apache.org>
Cc: Xie, Qi <qi...@intel.com>; Xu, Cheng A <ch...@intel.com>; Dong, Xin <xi...@intel.com>; Zhang, Jie1 <ji...@intel.com>
Subject: Re: [Discuss] Provide pluggable APIs to support user customized compression codec

What is the purpose of the key-value metadata aside from automatically loading the plugin library if it's available (which seems like a security risk if reading a data file can cause a shared library to be loaded dynamically)? Is it necessary to have that metadata for it to be safe to use the optimized GZIP plugin (or could you just always have the plugin enabled on a system that supports it, even for files that were not compressed using the plugin but rather the system / standard gzip)?

On Mon, Oct 19, 2020 at 8:42 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hi,
>
> Again, I think the whole plugin concept falls outside of Arrow.
>
> It should be much simpler to simply allow people to override the 
> compression codec factory.  Then applications can define "plugins" if 
> they want to.
>
> Regards
>
> Antoine.
>
>
> Le 19/10/2020 à 03:30, Xie, Qi a écrit :
> > Hi, all
> >
> > Again as we discussed in the previous email, We are proposing an pluggable APIs to support user customized compression codec in ARROW.
> > See proposal 
> > https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMl
> > JWy6aqC6WG8/edit We want to redefine the scope of the pluggable API 
> > and have a discuss with the community.
> >
> > 1. Goal
> > Through the plugin API, the end user can use the customized compression codec to override the built-in compression codec. E.g. use the HW-GZIP codec to replace the ARROW built-in GZIP codec to speed up the compress/decompress.
> > It is not plan to add new compression codecs for Arrow.
> > Currently we are focused on parquet format. In the future will support Arrow format. But some components should be common to the Arrow, such as plugin manager module, dynamic library loading module etc.
> >
> > 2. Compatibility with the Java implementation Both implementations 
> > will write the plugin information to the parquet key value metadata, either in parquet FileMetaData level or in the ColumnMetaData level.
> > The plugin information include the plugin library name used for native parquet and plugin class name used for java parquet.
> > E.g. plugin_library_name:libgzipplugin.so, 
> > plugin_class_name:com.intel.icl.customizedGzipCodec
> > we're working in progress together with Parquet community to refine 
> > our proposal. 
> > https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html
> >
> > 3. The end user API.
> > For write, the end-user should callout they want to use plugin codec, so we add a compression_plugin API in parquet WriteProperties builder, when call this function, the internal parquet writer will write the plugin_library_name and plugin_class_name to the parquet key value metadata. The end user code snippet like this:
> > parquet::WriterProperties::Builder builder; 
> > builder.compression(parquet::Compression::GZIP);
> > builder.compression_plugin("libGzipPlugin.so");
> > std::shared_ptr<parquet::WriterProperties> props = builder.build();
> >
> >
> >
> > For read, the internal parquet reader will first check if there are plugin information in the metadata. For native parquet, it will read plugin_library_name from the key value metadata, if the key exist, it will load the plugin library automatically and  return the plugin codec from GetReadCodec.
> >
> > So no code change for read, it is transparent for end-user in parquet read side.
> >
> >
> >
> > Looking forward to any other suggestions or feedback.
> >
> > Thanks,
> > XieQi
> >
> >

Re: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by Wes McKinney <we...@gmail.com>.
What is the purpose of the key-value metadata aside from automatically
loading the plugin library if it's available (which seems like a
security risk if reading a data file can cause a shared library to be
loaded dynamically)? Is it necessary to have that metadata for it to
be safe to use the optimized GZIP plugin (or could you just always
have the plugin enabled on a system that supports it, even for files
that were not compressed using the plugin but rather the system /
standard gzip)?

On Mon, Oct 19, 2020 at 8:42 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hi,
>
> Again, I think the whole plugin concept falls outside of Arrow.
>
> It should be much simpler to simply allow people to override the
> compression codec factory.  Then applications can define "plugins" if
> they want to.
>
> Regards
>
> Antoine.
>
>
> Le 19/10/2020 à 03:30, Xie, Qi a écrit :
> > Hi, all
> >
> > Again as we discussed in the previous email, We are proposing an pluggable APIs to support user customized compression codec in ARROW.
> > See proposal https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
> > We want to redefine the scope of the pluggable API and have a discuss with the community.
> >
> > 1. Goal
> > Through the plugin API, the end user can use the customized compression codec to override the built-in compression codec. E.g. use the HW-GZIP codec to replace the ARROW built-in GZIP codec to speed up the compress/decompress.
> > It is not plan to add new compression codecs for Arrow.
> > Currently we are focused on parquet format. In the future will support Arrow format. But some components should be common to the Arrow, such as plugin manager module, dynamic library loading module etc.
> >
> > 2. Compatibility with the Java implementation
> > Both implementations will write the plugin information to the Parquet key-value metadata, either at the Parquet FileMetaData level or at the ColumnMetaData level.
> > The plugin information includes the plugin library name, used by native Parquet, and the plugin class name, used by Java Parquet.
> > E.g. plugin_library_name:libgzipplugin.so, plugin_class_name:com.intel.icl.customizedGzipCodec
> > We are working together with the Parquet community to refine our proposal: https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html
> >
> > 3. The end user API.
> > For write, the end user should call out that they want to use a plugin codec, so we add a compression_plugin API to the Parquet WriterProperties builder. When this function is called, the internal Parquet writer will write the plugin_library_name and plugin_class_name to the Parquet key-value metadata. The end user code snippet looks like this:
> > parquet::WriterProperties::Builder builder;
> > builder.compression(parquet::Compression::GZIP);
> > builder.compression_plugin("libGzipPlugin.so");
> > std::shared_ptr<parquet::WriterProperties> props = builder.build();
> >
> >
> >
> > For read, the internal Parquet reader will first check whether there is plugin information in the metadata. For native Parquet, it will read plugin_library_name from the key-value metadata; if the key exists, it will load the plugin library automatically and return the plugin codec from GetReadCodec.
> >
> > So there is no code change for read; it is transparent to the end user on the Parquet read side.
> >
> >
> >
> > Looking forward to any other suggestions or feedback.
> >
> > Thanks,
> > XieQi
> >
> >

Re: [Discuss] Provide pluggable APIs to support user customized compression codec

Posted by Antoine Pitrou <an...@python.org>.
Hi,

Again, I think the whole plugin concept falls outside of Arrow.

It would be much simpler to allow people to override the
compression codec factory.  Then applications can define "plugins" if
they want to.

Regards

Antoine.


Le 19/10/2020 à 03:30, Xie, Qi a écrit :
> [...]