Posted to jira@arrow.apache.org by "Kyle Barron (Jira)" <ji...@apache.org> on 2022/04/15 19:40:00 UTC
[jira] [Commented] (ARROW-8674) [JS] Implement IPC RecordBatch body buffer compression from ARROW-300

    [ https://issues.apache.org/jira/browse/ARROW-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522941#comment-17522941 ] 

Kyle Barron commented on ARROW-8674:
------------------------------------

Hello! I'd like to revisit this issue and potentially submit a PR for it.

I think there are several reasons why we might not want to pull in LZ4 and ZSTD implementations by default:
 * Bundle-size-conscious users may not want any codecs, or may not use the Arrow IPC features at all. The WASM codecs in [numcodecs.js|https://github.com/manzt/numcodecs.js] appear to be 17.1KB for LZ4 and 206KB for ZSTD (uncompressed).
 * Some users may prefer to import codecs dynamically as needed, but that requires a slightly more complex setup (at the very least, choosing a CDN from which to import the bundle, right?).
 * I came across at least 4 LZ4 implementations and at least 6 ZSTD implementations, so it may be better to let the user choose which one to use. If their app already uses one implementation, letting them reuse it in Arrow JS would avoid bundling a second one.
 * At least one LZ4 implementation is [pure JS|https://github.com/Benzinga/lz4js], with no WASM components. Some users may prefer a pure JS library for simplicity.

How would others feel about a codec registry system? Something like what [Zarr.js allows|http://guido.io/zarr.js/#/installation?id=zarrjs-core-export], where you can [dynamically register codecs|https://github.com/gzuidhof/zarr.js/blob/29280463ff2f275c31c1fa0f002daa947b8f09b2/src/compression/registry.ts] on demand.
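
To make the registry idea concrete, here is a minimal TypeScript sketch. Everything in it (the `Codec` interface, `CodecRegistry`, and `compressionRegistry`) is a hypothetical name for illustration, not an existing Arrow JS API; only the two compression type names, LZ4_FRAME and ZSTD, come from the Arrow IPC format.

```typescript
// Hypothetical sketch: none of these names exist in Arrow JS today.
// The two members mirror the CompressionType enum in the Arrow IPC format.
type CompressionType = "LZ4_FRAME" | "ZSTD";

// Minimal shape a user-supplied codec would need to satisfy.
interface Codec {
  decode(data: Uint8Array): Uint8Array;
  encode?(data: Uint8Array): Uint8Array; // only needed for writing
}

class CodecRegistry {
  private readonly codecs = new Map<CompressionType, Codec>();

  // Register (or replace) the codec used for a compression type.
  set(type: CompressionType, codec: Codec): void {
    this.codecs.set(type, codec);
  }

  has(type: CompressionType): boolean {
    return this.codecs.has(type);
  }

  // Look up a codec, failing loudly if the user never registered one.
  get(type: CompressionType): Codec {
    const codec = this.codecs.get(type);
    if (codec === undefined) {
      throw new Error(`No codec registered for ${type}`);
    }
    return codec;
  }
}

// A single shared instance that the IPC reader could consult.
const compressionRegistry = new CodecRegistry();
```

A user would then call something like `compressionRegistry.set("ZSTD", { decode: myZstdDecode })` once at startup, where `myZstdDecode` is a placeholder for whichever implementation they already ship.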

The `arrow.tableFromIPC` function is currently synchronous, so unless we made it async, we couldn't import a codec _after_ seeing that a data file uses a given compression, because a dynamic import is necessarily async.
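
One way to live with a synchronous reader is to load codecs up front. The sketch below separates an async preload step from a synchronous lookup; `preloadCodec`, `decompressSync`, and the `load` callback are hypothetical names, and the callback stands in for whatever dynamic `import()` the user chooses.

```typescript
// Hypothetical sketch of "load asynchronously first, decode synchronously later".
type Decompress = (data: Uint8Array) => Uint8Array;

const loadedCodecs = new Map<string, Decompress>();

// Async step: the caller awaits this before reading any IPC data.
// `load` would typically wrap a dynamic import of the chosen codec library.
async function preloadCodec(name: string, load: () => Promise<Decompress>): Promise<void> {
  loadedCodecs.set(name, await load());
}

// Sync step: a synchronous reader can only use codecs that are already loaded.
function decompressSync(name: string, data: Uint8Array): Uint8Array {
  const decompress = loadedCodecs.get(name);
  if (decompress === undefined) {
    throw new Error(`Codec ${name} has not been loaded; call preloadCodec first`);
  }
  return decompress(data);
}
```

With this split, a batch whose codec was never preloaded fails with a clear error rather than needing an import in the middle of a synchronous read.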

In terms of implementation, I'd expect this to be relatively straightforward, presumably starting with an update to `decodeBuffers` here: https://github.com/apache/arrow/blob/b67e3c8ef1e173e1840c4fa897b7c6c493932e10/js/src/ipc/metadata/message.ts#L303.
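
For reference, the IPC format prefixes each compressed buffer with its uncompressed length as a 64-bit little-endian signed integer, with -1 meaning the bytes that follow were left uncompressed. Below is a rough sketch of the per-buffer step the reader would need somewhere; `decodeCompressedBuffer` and the `decompress` parameter are hypothetical, and treating an empty buffer as having no prefix is an assumption.

```typescript
// Hypothetical helper; not part of Arrow JS.
type Decompress = (data: Uint8Array) => Uint8Array;

const LENGTH_PREFIX_BYTES = 8;

function decodeCompressedBuffer(buffer: Uint8Array, decompress: Decompress): Uint8Array {
  // Assumption: empty buffers are written without a length prefix.
  if (buffer.byteLength === 0) {
    return buffer;
  }
  if (buffer.byteLength < LENGTH_PREFIX_BYTES) {
    throw new Error("Compressed buffer is too short to hold a length prefix");
  }

  // Read the int64 prefix as two 32-bit halves to avoid requiring BigInt.
  const view = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength);
  const low = view.getUint32(0, true);
  const high = view.getInt32(4, true);
  const body = buffer.subarray(LENGTH_PREFIX_BYTES);

  // A prefix of -1 means the body was stored uncompressed.
  if (high === -1 && low === 0xffffffff) {
    return body;
  }

  // Otherwise decompress and check the result against the declared length.
  const expectedLength = high * 0x100000000 + low;
  const decoded = decompress(body);
  if (decoded.byteLength !== expectedLength) {
    throw new Error(`Expected ${expectedLength} bytes after decompression, got ${decoded.byteLength}`);
  }
  return decoded;
}
```

The `decompress` argument is where a user-chosen LZ4 or ZSTD implementation would plug in, which keeps the codec itself out of the Arrow JS bundle.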

 

References:

LZ4 implementations:
 * [https://github.com/gorhill/lz4-wasm] 
 * [https://github.com/manzt/numcodecs.js/tree/main/codecs/lz4] 
 * [https://www.npmjs.com/package/lz4-wasm]
 * [https://github.com/Benzinga/lz4js] 

ZSTD implementations:
 * [https://github.com/manzt/numcodecs.js/tree/main/codecs/zstd] 
 * [https://github.com/bokuweb/zstd-wasm]
 * [https://github.com/yoshihitoh/zstd-codec]
 * [https://github.com/donmccurdy/zstddec] 
 * [https://github.com/fabiospampinato/zstandard-wasm]
 * [https://github.com/OneIdentity/zstd-js] 

> [JS] Implement IPC RecordBatch body buffer compression from ARROW-300
> ---------------------------------------------------------------------
>
>                 Key: ARROW-8674
>                 URL: https://issues.apache.org/jira/browse/ARROW-8674
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: JavaScript
>            Reporter: Wes McKinney
>            Priority: Major
>
> This may not be a hard requirement for JS because this would require pulling in implementations of LZ4 and ZSTD which not all users may want



--
This message was sent by Atlassian Jira
(v8.20.1#820001)