You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/21 15:48:15 UTC

[GitHub] [arrow] kitsunde commented on a diff in pull request #13076: ARROW-8674: [JS] Implement IPC RecordBatch body buffer compression

kitsunde commented on code in PR #13076:
URL: https://github.com/apache/arrow/pull/13076#discussion_r1028207049


##########
js/src/ipc/reader.ts:
##########
@@ -352,12 +353,24 @@ abstract class RecordBatchReaderImpl<T extends TypeMap = any> implements RecordB
         return this;
     }
 
-    protected _loadRecordBatch(header: metadata.RecordBatch, body: any) {
+    protected _loadRecordBatch(header: metadata.RecordBatch, body: Uint8Array) {
+        if (header.compression) {
+            const codec = compressionRegistry.get(header.compression);
+            if (codec?.decode) {
+                // TODO: does this need to be offset by 8 bytes? Since the uncompressed length is
+                // written in the first 8 bytes:
+                // https://github.com/apache/arrow/blob/1fc251f18d5b48f0c9fe8af8168237e7e6d05a45/format/Message.fbs#L59-L65

Review Comment:
   I started digging into this although being unfamiliar to arrow and zstd both it has been a bit of a struggle.
   
   The first 8 bytes are not part of the compressed section. You can see it corrected for in other places like here: https://github.com/apache/arrow/pull/9137/files#diff-38c3da9d6c7d521df83f59ee7d8cfced00951e8dcd38f39c7c36d26b6feee3d6R75
   
   If you encode something with ZSTD you'll see it starts with `[40, 181, 47, 253, 32]` which comes after those 8 bytes.  I'e just encoding:
   
   ```
   const testVal = compress(Buffer.from(new Uint8Array([1, 0, 0, 0, 0, 0, 0, 0])));
   
   Uint8Array(17) [
     40, 181, 47, 253, 32, 8, 65,
      0,  0,  1,   0,  0, 0,  0,
      0,  0,  0
   ]
   ```
   
   This within the body here is:
   
   ```
   const compressedMessage = new Uint8Array([
     8, 0, 0, 0,
     0, 0, 0, 0,
     40, 181, 47,
     253, 32, 8, 65,
     0, 0, 1, 0,
     0, 0, 0, 0,
     0, 0, 0, 0,
     0, 0, 0, 0,
     0
   ]);
   ```
   
   Now there's another issue where which is there appears to be padding at the end and between the fields which I'm very uncertain on how to detect besides brute forcing in reverse between the LZ4 header and the end.
   
   ```
   console.log(decompress(compressedMessage.slice(8, 17 + 8)));
   
   Uint8Array([
     40, 181, 47, 253, 
     32, 8, 65, 0,
     0, 1, 0, 0,
     0, 0, 0, 0,
     0
   ]);
   ```
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org