Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/21 01:01:51 UTC

[GitHub] [arrow] xieqi opened a new pull request #8229: ARROW-9579: [C++] Provide the plugin API to support customized compression codec for parquet

xieqi opened a new pull request #8229:
URL: https://github.com/apache/arrow/pull/8229


   This PR provides a plugin framework for Parquet to support customized compression codecs. Please see the proposal:
   https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8229: ARROW-9579: [C++] Provide the plugin API to support customized compression codec for parquet

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8229:
URL: https://github.com/apache/arrow/pull/8229#issuecomment-695862390


   https://issues.apache.org/jira/browse/ARROW-9579



[GitHub] [arrow] winningsix commented on pull request #8229: ARROW-9579: [C++] Provide the plugin API to support customized compression codec for parquet

Posted by GitBox <gi...@apache.org>.
winningsix commented on pull request #8229:
URL: https://github.com/apache/arrow/pull/8229#issuecomment-696193668


   @pitrou  @emkornfield  FYI, this is the Java-side PR: https://github.com/apache/parquet-mr/pull/803/files 



[GitHub] [arrow] pitrou commented on pull request #8229: ARROW-9579: [C++] Provide the plugin API to support customized compression codec for parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8229:
URL: https://github.com/apache/arrow/pull/8229#issuecomment-696176912


   Hmm, reading the mailing-list discussion again, I don't think we had agreed on a design. The first question for me is what the end-user API should be.
   * should the user calling the Parquet reader ask explicitly for a compression override (e.g. "instead of using standard GZip, use HW-accelerated GZip")?
   * should instead the compression override be configured using a global setting?
   * or should the override even be transparent? (how?)
   
   From a first look, I see this PR proposes a plugin API, which does not seem necessary to solve the problem at hand. Arrow is a library and I'm not sure it's our duty to implement a specific extension-loading mechanism (though I could be convinced otherwise :-)).
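
To make the first two options in the list above concrete, here is a minimal sketch of an explicit per-reader override next to a global setting. Every name in it (`ReaderOptions`, `OverrideCodec`, `GlobalOverrides`, `ResolveCodec`) is hypothetical and not an Arrow or Parquet API, and the lookup order (explicit, then global, then built-in) is an assumption made for illustration, not something agreed in this thread.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration only; none of these are Arrow or
// Parquet APIs.
enum class Compression { GZIP, SNAPPY };

// A "codec" is reduced to a function returning a label so that the lookup
// order is easy to observe.
using CodecFactory = std::function<std::string()>;
using OverrideMap = std::map<Compression, CodecFactory>;

// Global-setting option: a process-wide override table.
OverrideMap& GlobalOverrides() {
  static OverrideMap overrides;
  return overrides;
}

// Explicit option: overrides passed by the caller for one reader.
struct ReaderOptions {
  OverrideMap codec_overrides;

  void OverrideCodec(Compression codec, CodecFactory factory) {
    assert(factory && "override must be callable");
    codec_overrides[codec] = std::move(factory);
  }
};

// Built-in fallback used when nobody asked for an override.
std::string BuiltinCodecName(Compression codec) {
  return codec == Compression::GZIP ? "builtin-gzip" : "builtin-snappy";
}

// Assumed resolution order: explicit reader option, then global setting,
// then the built-in codec.
std::string ResolveCodec(Compression codec, const ReaderOptions& options) {
  auto local = options.codec_overrides.find(codec);
  if (local != options.codec_overrides.end()) return local->second();
  const OverrideMap& global = GlobalOverrides();
  auto shared = global.find(codec);
  if (shared != global.end()) return shared->second();
  return BuiltinCodecName(codec);
}
```

In this sketch the explicit form keeps the choice visible at the call site, while the global form changes behavior for every reader in the process. Neither needs a shared-library loading mechanism inside the library itself.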



[GitHub] [arrow] emkornfield commented on pull request #8229: ARROW-9579: [C++] Provide the plugin API to support customized compression codec for parquet

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8229:
URL: https://github.com/apache/arrow/pull/8229#issuecomment-696146448


   Thank you for the PR. This will likely need a great deal of review from both a code and a design perspective.  Before it is reviewed it should have thorough unit tests.  And since it deals with Parquet interop, it should also likely have some method of verifying compatibility with the Java implementation if possible (a link to the corresponding Java PR would also be useful).



[GitHub] [arrow] xieqi commented on pull request #8229: ARROW-9579: [C++] Provide the plugin API to support customized compression codec for parquet

Posted by GitBox <gi...@apache.org>.
xieqi commented on pull request #8229:
URL: https://github.com/apache/arrow/pull/8229#issuecomment-697120706


   @pitrou 
   For Parquet writes, the end user still selects standard GZip as the compression codec. We add a compression_plugin API to the Parquet WriterProperties Builder, and the end user can enable the plugin with the following code snippet:
   `parquet::WriterProperties::Builder builder;`
   `builder.compression(parquet::Compression::GZIP);`
   `builder.compression_plugin("libGzipPlugin.so");`
   The writer then uses the plugin to compress and records a plugin hint in the ColumnMetaData's key_value_metadata.
   
   For Parquet reads, the reader first checks whether the ColumnMetaData's key_value_metadata contains plugin information. If it does, the reader calls the plugin to decompress the data; otherwise it calls standard GZip. The plugin is therefore transparent to the end user on the read side.
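
The read-side check described above could look roughly like the following sketch. It uses only the standard library. `KeyValueMetadata`, `DecompressorRegistry`, `SelectDecompressor` and the `compression_plugin` metadata key are hypothetical names chosen for illustration, since the thread does not say which key the PR writes. Falling back to standard GZip when the named plugin is not loaded is also an assumption.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-ins; these are not Arrow or Parquet types.
using KeyValueMetadata = std::map<std::string, std::string>;
using Decompressor = std::function<std::string(const std::string&)>;

// Assumed metadata key holding the plugin name.
const char kPluginKey[] = "compression_plugin";

struct DecompressorRegistry {
  Decompressor standard_gzip;                   // built-in codec
  std::map<std::string, Decompressor> plugins;  // loaded plugins, by name
};

// Picks the plugin named in the column chunk metadata when that plugin is
// loaded; otherwise returns the standard GZip decompressor.
const Decompressor& SelectDecompressor(const KeyValueMetadata& metadata,
                                       const DecompressorRegistry& registry) {
  assert(registry.standard_gzip && "standard codec must be set");
  auto hint = metadata.find(kPluginKey);
  if (hint != metadata.end()) {
    auto plugin = registry.plugins.find(hint->second);
    if (plugin != registry.plugins.end()) return plugin->second;
  }
  // No hint, or the hinted plugin is not available (assumed fallback).
  return registry.standard_gzip;
}
```

Because the codec recorded in the file is still GZIP, a reader that knows nothing about the hint can ignore it and decompress with the standard codec, provided the plugin's output is GZip-compatible.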
   



[GitHub] [arrow] pitrou closed pull request #8229: ARROW-9579: [C++] Provide the plugin API to support customized compression codec for parquet

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #8229:
URL: https://github.com/apache/arrow/pull/8229


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] winningsix commented on pull request #8229: ARROW-9579: [C++] Provide the plugin API to support customized compression codec for parquet

Posted by GitBox <gi...@apache.org>.
winningsix commented on pull request #8229:
URL: https://github.com/apache/arrow/pull/8229#issuecomment-697141258


   @xieqi How about the on-disk path? How does the user determine whether to use a customized codec for a given compression codec?



[GitHub] [arrow] pitrou commented on pull request #8229: ARROW-9579: [C++] Provide the plugin API to support customized compression codec for parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8229:
URL: https://github.com/apache/arrow/pull/8229#issuecomment-887311279


   I'm going to close this PR as stale, and because the approach here is too heavy-weight as already discussed.



[GitHub] [arrow] pitrou commented on pull request #8229: ARROW-9579: [C++] Provide the plugin API to support customized compression codec for parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8229:
URL: https://github.com/apache/arrow/pull/8229#issuecomment-887311848


   Link to ML discussion: https://lists.apache.org/thread.html/r88c56e47cdd69eda23477c67c8e4ad5e66a3ccb144087082d427ffbf%40%3Cdev.arrow.apache.org%3E

