You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by "Yang, Yang10" <ya...@intel.com> on 2023/04/23 09:38:02 UTC

[DISCUSS][C++][Parquet] Expose the API to customize the compression parameter

Hi,

As discussed in this issue: https://github.com/apache/arrow/issues/35287, currently Arrow only supports one parameter: compression_level to be customized. We would like to make more compression parameters (such as window_bits) customizable when creating the Codec, given the variety of usage scenarios. As suggested by @kou, we may introduce a new options class such as Codec::Options to make the structure clear and easy to extend. But it may take some effort as this is more like a code structure refactor. Passing a parameter directly is another approach, easy to implement but may be hard to extend. So we would like some further discussion here. If you have any suggestion or comments, please share them on above issue or here. Thanks!

Best,
Yang

RE: [DISCUSS][C++][Parquet] Expose the API to customize the compression parameter

Posted by "Yang, Yang10" <ya...@intel.com>.
Hi, thanks for all your great suggestions. 

Seems like we are having an alignment on adding a codec::options class instead of passing extra parameters in existing function calls. A virtual base class with derived classes sounds like a good idea to add specific codec options for different codecs and avoid misuse of unsupported properties. But just need to implement CodecOptions for each Codec.

May I draft a PR or design doc first based on our discussion for your review? Or do we need more discussions/votes on the details before jumping to the code? Thanks!

Best,
Yang

-----Original Message-----
From: Antoine Pitrou <an...@python.org> 
Sent: Sunday, April 23, 2023 11:03 PM
To: dev@arrow.apache.org
Subject: Re: [DISCUSS][C++][Parquet] Expose the API to customize the compression parameter


The most idiomatic option seems to be a virtual base class with derived classes as required:

```
class CodecOptions {
  public:
   virtual ~CodecOptions() = default;

   int compression_level;
};

class ZlibCodecOptions : public CodecOptions {
  public:
   ~CodecOptions() = default;

   int window_bits;
};

class Codec {
  public:
   static Result<std::unique_ptr<Codec>> Create(
       Compression::type codec, const CodecOptions* = nullptr); }; ```


Le 23/04/2023 à 16:50, Gang Wu a écrit :
> It is a good idea to extend the Codec factory to offer more options.
> However, I don't think adding
> a `window_bits` parameter as `compression_level` is a good approach as 
> it does not apply to some codecs.
> 
> IMO, the proposed new `Codec::Options` can be as simple as a 
> std::map<std::string, std::string>.
> To avoid misuse, we need to formalize all acceptable keys somewhere 
> and specific codecs only need to deal with config keys that they know.
> 
> WDYT?
> 
> Best,
> Gang
> 
> On Sun, Apr 23, 2023 at 5:38 PM Yang, Yang10 <ya...@intel.com> wrote:
> 
>> Hi,
>>
>> As discussed in this issue: 
>> https://github.com/apache/arrow/issues/35287,
>> currently Arrow only supports one parameter: compression_level to be 
>> customized. We would like to make more compression parameters (such 
>> as
>> window_bits) customizable when creating the Codec, given the variety 
>> of usage scenarios. As suggested by @kou, we may introduce a new 
>> options class such as Codec::Options to make the structure clear and 
>> easy to extend. But it may take some effort as this is more like a code structure refactor.
>> Passing a parameter directly is another approach, easy to implement 
>> but may be hard to extend. So we would like some further discussion 
>> here. If you have any suggestion or comments, please share them on above issue or here.
>> Thanks!
>>
>> Best,
>> Yang
>>
> 

Re: [DISCUSS][C++][Parquet] Expose the API to customize the compression parameter

Posted by Antoine Pitrou <an...@python.org>.
The most idiomatic option seems to be a virtual base class with derived 
classes as required:

```
class CodecOptions {
  public:
   virtual ~CodecOptions() = default;

   int compression_level;
};

class ZlibCodecOptions : public CodecOptions {
  public:
   ~CodecOptions() = default;

   int window_bits;
};

class Codec {
  public:
   static Result<std::unique_ptr<Codec>> Create(
       Compression::type codec, const CodecOptions* = nullptr);
};
```


Le 23/04/2023 à 16:50, Gang Wu a écrit :
> It is a good idea to extend the Codec factory to offer more options.
> However, I don't think adding
> a `window_bits` parameter as `compression_level` is a good approach as it
> does not apply to
> some codecs.
> 
> IMO, the proposed new `Codec::Options` can be as simple as a
> std::map<std::string, std::string>.
> To avoid misuse, we need to formalize all acceptable keys somewhere and
> specific codecs only
> need to deal with config keys that they know.
> 
> WDYT?
> 
> Best,
> Gang
> 
> On Sun, Apr 23, 2023 at 5:38 PM Yang, Yang10 <ya...@intel.com> wrote:
> 
>> Hi,
>>
>> As discussed in this issue: https://github.com/apache/arrow/issues/35287,
>> currently Arrow only supports one parameter: compression_level to be
>> customized. We would like to make more compression parameters (such as
>> window_bits) customizable when creating the Codec, given the variety of
>> usage scenarios. As suggested by @kou, we may introduce a new options class
>> such as Codec::Options to make the structure clear and easy to extend. But
>> it may take some effort as this is more like a code structure refactor.
>> Passing a parameter directly is another approach, easy to implement but may
>> be hard to extend. So we would like some further discussion here. If you
>> have any suggestion or comments, please share them on above issue or here.
>> Thanks!
>>
>> Best,
>> Yang
>>
> 

Re: [DISCUSS][C++][Parquet] Expose the API to customize the compression parameter

Posted by Gang Wu <us...@gmail.com>.
It is a good idea to extend the Codec factory to offer more options.
However, I don't think adding
a `window_bits` parameter as `compression_level` is a good approach as it
does not apply to
some codecs.

IMO, the proposed new `Codec::Options` can be as simple as a
std::map<std::string, std::string>.
To avoid misuse, we need to formalize all acceptable keys somewhere and
specific codecs only
need to deal with config keys that they know.

WDYT?

Best,
Gang

On Sun, Apr 23, 2023 at 5:38 PM Yang, Yang10 <ya...@intel.com> wrote:

> Hi,
>
> As discussed in this issue: https://github.com/apache/arrow/issues/35287,
> currently Arrow only supports one parameter: compression_level to be
> customized. We would like to make more compression parameters (such as
> window_bits) customizable when creating the Codec, given the variety of
> usage scenarios. As suggested by @kou, we may introduce a new options class
> such as Codec::Options to make the structure clear and easy to extend. But
> it may take some effort as this is more like a code structure refactor.
> Passing a parameter directly is another approach, easy to implement but may
> be hard to extend. So we would like some further discussion here. If you
> have any suggestion or comments, please share them on above issue or here.
> Thanks!
>
> Best,
> Yang
>